
One Diagram Explaining the Core Concepts of Alert On-call

Alert integration, label enrichment, grouping and noise reduction, suppression, schedules, acknowledgement, escalation, reassignment, collaboration, notification, and analytics.

Flashcat

On-call process

  • Alert integration: the goal is to handle all alerts in one On-call platform. Most common monitoring tools can push alerts through Webhooks, so the platform can offer a ready-made Webhook endpoint for each monitoring system and keep configuration cost low for users. Some less open monitoring tools can only send email notifications; if the On-call platform can receive and parse those emails, email integration serves as a fallback.

  • Label enrichment: the richer the labels in an alert, the more efficiently engineers can handle it. Many monitoring tools send alerts with only a few bare fields, such as host name, metric, and threshold. If the platform can connect to external metadata such as CMDB and enrich alert fields, those fields can drive more automated dispatch, and engineers can judge impact and severity faster.

  • Grouping and noise reduction: grouping similar alerts and converging frequent alerts can greatly reduce alert volume and unnecessary interruptions. Rule-based grouping and semantic-similarity grouping are both useful. Grouping can span monitoring sources: alerts from Zabbix and Prometheus can be grouped if they are similar.

  • Alert suppression: high-severity alerts can suppress low-severity alerts, or lower-level infrastructure alerts can suppress upper-layer service alerts. Either way, suppression depends on a dependency relationship between alerts. These dependencies are costly to maintain and hard to explain, so heavy use at large scale is not recommended.

  • Schedules: the goal is to avoid interrupting the whole team repeatedly. Daily duty, holiday duty, temporary swaps, and fair rotation all need consideration. Shift handoff should have clear notifications. Responders should also have roles, such as primary and backup.

  • Acknowledgement: in theory, every alert should be acknowledged. If an alert is sent, nobody acknowledges it, and nothing bad happens, the alert is meaningless and should not have been sent. MTTA (mean time to acknowledge) is often used to measure acknowledgement efficiency.

  • Escalation and reassignment: clear escalation paths for different severities reduce responder stress and help solve problems quickly and accurately. Escalation can be manual or automatic. For example, if an alert remains unresolved and unrecovered for more than 30 minutes, it can automatically escalate to a manager or backup responder.

  • Collaboration: during alert handling, relevant people can be involved at any time. When collaborators are added, they need accurate and timely notification, and the handling process and timeline should be preserved so collaborators can quickly understand the whole picture.

  • Notification: outside China, teams often treat Slack as the operating system for collaboration because of its huge ecosystem. In China, WeChat/WeCom, Feishu, and DingTalk dominate. All of these IM tools support app development, and letting responders receive, acknowledge, close, reassign, and handle alerts inside the IM app is a key way to improve the On-call experience.

  • Analytics: compression rate, MTTA, MTTR, acknowledgement rate, and alert count are key metrics for On-call efficiency. Analyzing them by business, team, and individual helps drive alert optimization and governance.
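To make the label enrichment and grouping steps above more concrete, here is a minimal sketch of rule-based grouping across monitoring sources. Everything in it is an assumption for illustration: the `CMDB` table, the alert field names (`host`, `metric`, `ts`), the five-minute window, and the definition of compression rate as `1 - groups / alerts`. It does not reflect how any particular On-call platform implements these steps.

```python
# Hypothetical CMDB lookup table: host name -> metadata (illustrative only).
CMDB = {
    "web-01": {"service": "checkout", "team": "payments"},
    "web-02": {"service": "checkout", "team": "payments"},
    "db-01": {"service": "orders-db", "team": "storage"},
}


def enrich(alert, cmdb=CMDB):
    """Return a copy of the alert with CMDB fields merged in.

    Fields already present on the alert take precedence over CMDB fields.
    """
    return {**cmdb.get(alert["host"], {}), **alert}


def group_alerts(alerts, keys=("service", "metric"), window=300):
    """Group alerts that share the same values for `keys`.

    A new group is opened when an alert arrives more than `window`
    seconds after the first alert of the currently open group.
    The monitoring source is deliberately not part of the key, so
    similar alerts from different tools land in the same group.
    """
    open_groups = {}  # group key -> currently open group
    groups = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = tuple(alert.get(k) for k in keys)
        group = open_groups.get(key)
        if group is None or alert["ts"] - group["first_ts"] > window:
            group = {"key": key, "first_ts": alert["ts"], "alerts": []}
            open_groups[key] = group
            groups.append(group)
        group["alerts"].append(alert)
    return groups


def compression_rate(alert_count, group_count):
    """Assumed definition: share of alerts absorbed into existing groups."""
    return 1 - group_count / alert_count if alert_count else 0.0


if __name__ == "__main__":
    raw = [
        {"source": "zabbix", "host": "web-01", "metric": "cpu", "ts": 0},
        {"source": "prometheus", "host": "web-02", "metric": "cpu", "ts": 60},
        {"source": "prometheus", "host": "db-01", "metric": "disk", "ts": 90},
        {"source": "zabbix", "host": "web-01", "metric": "cpu", "ts": 1000},
    ]
    groups = group_alerts([enrich(a) for a in raw])
    print(len(groups), compression_rate(len(raw), len(groups)))
```

Grouping on the enriched `service` label rather than the raw host name is what lets the Zabbix and Prometheus alerts merge here; without enrichment the two hosts would produce separate groups.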
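The automatic escalation example from the list (escalate after 30 unresolved minutes) and the MTTA and MTTR metrics can be sketched in a few lines. This is only an illustration under stated assumptions: the `Incident` fields, the `ESCALATION_STEPS` policy, and the role names `primary` and `backup` are hypothetical, and MTTR is taken here to mean mean time to resolve.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    title: str
    triggered_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


# Hypothetical policy: (delay after trigger, role to notify).
ESCALATION_STEPS = [
    (timedelta(0), "primary"),
    (timedelta(minutes=30), "backup"),
]


def escalation_targets(incident, now, steps=ESCALATION_STEPS):
    """Return the roles that should have been notified by `now`.

    An incident already resolved at `now` escalates to nobody.
    """
    if incident.resolved_at is not None and incident.resolved_at <= now:
        return []
    elapsed = now - incident.triggered_at
    return [role for delay, role in steps if elapsed >= delay]


def mean_minutes(incidents, end_field):
    """Mean minutes from trigger to `end_field`, skipping incidents
    where that timestamp is not set. Returns None if none qualify."""
    durations = [
        (getattr(i, end_field) - i.triggered_at).total_seconds() / 60
        for i in incidents
        if getattr(i, end_field) is not None
    ]
    return sum(durations) / len(durations) if durations else None


def ack_rate(incidents):
    """Fraction of incidents that were acknowledged at all."""
    if not incidents:
        return 0.0
    return sum(i.acknowledged_at is not None for i in incidents) / len(incidents)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    incidents = [
        Incident("disk full", t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=20)),
        Incident("cpu high", t0, t0 + timedelta(minutes=15)),
    ]
    print("MTTA:", mean_minutes(incidents, "acknowledged_at"))
    print("MTTR:", mean_minutes(incidents, "resolved_at"))
    print(escalation_targets(incidents[1], t0 + timedelta(minutes=31)))
```

Keeping the policy as plain data makes it easy to define a different path per severity, and the same timestamps that drive escalation also feed the per-team or per-person analytics.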
