
How to Choose an Alert Event On-call Platform

How should teams choose an alert event On-call platform? This article compares two open-source options and one commercial option, and explains where each fits.

Flashcat Operations Team

Developers and operations engineers may run into the following problems when handling alerts:

  • Alert storms: the phone rings so often that you cannot even place a call.
  • Alerts are scattered across platforms. A company usually has more than one monitoring system, such as Zabbix, Prometheus, Nightingale, cloud monitoring, Grafana, ElastAlert, and monitoring bundled inside commercial products.
  • Collaboration is difficult. An important incident may need many people, so teams create a group chat, but the messages are scattered and messy.
  • Mobile work is inconvenient. Alerts may arrive through SMS, email, or DingTalk bot messages, but to comment on, acknowledge, or handle them you may need to open a laptop, connect to the VPN, and log in to the platform.
  • Alerts are easy to miss, such as when someone is asleep, has poor signal, or has their phone muted.

This is where a good alert event On-call platform matters. Below are three options: two open-source options and one commercial platform.

Option 1: Zabbix

Zabbix's built-in PROBLEM management is well designed and fairly complete. It supports conditional dispatch, notification media management, PROBLEM Update, and multi-step escalation. The following image shows Zabbix dispatch rules:

Alert dispatch rules

In Zabbix, these are called Trigger Actions. You can filter by alert event attributes, send different events to different people through different notification media, and even configure multi-step escalation and acknowledgement. If your company only uses Zabbix as the monitoring system, Zabbix's built-in PROBLEM management can solve many problems.
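The idea of conditional dispatch with multi-step escalation can be sketched in a few lines of Python. This is only an illustration of the concept, not Zabbix's implementation or configuration format: the `Event`, `Step`, and `Action` types, the severity scale, and the rule that acknowledgement stops further escalation are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Event:
    host_group: str
    severity: int               # hypothetical scale: 0 (info) to 5 (disaster)
    acknowledged: bool = False


@dataclass
class Step:
    after_minutes: int          # delay since the problem started
    recipient: str
    medium: str                 # e.g. "sms", "email", "phone"


@dataclass
class Action:
    name: str
    condition: Callable[[Event], bool]   # filter on event attributes
    steps: List[Step] = field(default_factory=list)


def due_notifications(
    event: Event, actions: List[Action], elapsed_minutes: int
) -> List[Tuple[str, str]]:
    """Return the (recipient, medium) pairs that should have fired by now."""
    if event.acknowledged:
        # Assumption for this sketch: an acknowledged event stops escalating.
        return []
    due = []
    for action in actions:
        if action.condition(event):
            due.extend(
                (step.recipient, step.medium)
                for step in action.steps
                if step.after_minutes <= elapsed_minutes
            )
    return due


# Hypothetical rule: critical database problems page the on-call DBA
# immediately and escalate to the team lead after 30 minutes.
actions = [
    Action(
        name="db-critical",
        condition=lambda e: e.host_group == "db" and e.severity >= 4,
        steps=[Step(0, "dba-oncall", "sms"), Step(30, "dba-lead", "phone")],
    )
]
```

Calling `due_notifications(Event("db", 5), actions, 45)` returns both steps, while the same call with an event from another host group returns an empty list.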

However, the Zabbix ecosystem is relatively closed and, by design, cannot ingest events from other monitoring systems. If your company runs multiple monitoring systems, you need another solution.

Option 2: Alertmanager

Alertmanager was not designed for Prometheus alone. It provides a unified API for receiving alert events, so different monitoring systems can push events to it, and it then handles silencing, inhibition, grouping, and dispatch. It processes the events themselves reasonably well, but its support for the people who respond is weak: Alertmanager has no model of people, so it has no natural way to handle schedules, acknowledgement, escalation, collaborative comments, or other people-centric operations.
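As a rough sketch of what "pushing an event to Alertmanager" looks like, the Python below builds an alert and posts it to the `/api/v2/alerts` endpoint, which accepts a JSON array of alerts with `labels`, optional `annotations`, and RFC 3339 timestamps. The helper names `build_alert` and `push_alerts`, the label values, and the base URL in the usage note are assumptions for illustration; check the API specification of your Alertmanager version before relying on it.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert(name: str, severity: str, instance: str, summary: str) -> dict:
    """Build one alert object in the shape expected by /api/v2/alerts."""
    return {
        "labels": {
            "alertname": name,
            "severity": severity,
            "instance": instance,
        },
        "annotations": {"summary": summary},
        # Timezone-aware ISO 8601 timestamp, which is valid RFC 3339.
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }


def push_alerts(base_url: str, alerts: list, timeout: float = 5.0) -> int:
    """POST a list of alerts to Alertmanager and return the HTTP status."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A monitoring system that is not Prometheus could then call something like `push_alerts("http://localhost:9093", [build_alert("DiskFull", "critical", "db-01", "Disk usage above 95%")])`, assuming Alertmanager listens on its default port.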

Alertmanager is better understood as an intermediate component. Monitoring systems send alert events to it; it does preliminary processing and then forwards them to a service such as PagerDuty. On its own it falls short as a company-wide unified On-call platform, but the open-source community offers few better options, which is why it is included here.
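Grouping is a good example of this kind of preliminary processing: alerts that share the same values for a chosen set of labels are batched into one notification instead of many. The following Python sketch shows the idea only; it is not Alertmanager's code, and the function name `group_alerts` and the sample labels are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_alerts(alerts: List[dict], group_by: List[str]) -> Dict[Tuple[str, ...], List[dict]]:
    """Batch alerts whose labels agree on every key in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # A missing label is treated as an empty value in this sketch.
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "prod", "instance": "web-1"}},
    {"labels": {"alertname": "HighLatency", "cluster": "prod", "instance": "web-2"}},
    {"labels": {"alertname": "DiskFull", "cluster": "prod", "instance": "db-1"}},
]
grouped = group_alerts(alerts, ["alertname", "cluster"])
```

Here the two `HighLatency` alerts end up in one group, so a responder would receive one notification covering both instances rather than two separate pages.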

Option 3: Flashduty

Flashduty is a SaaS product with a free plan, built specifically as a unified alert event On-call platform. It is comparable to PagerDuty in the international market. Because Flashduty is built for users in China, it integrates well with Chinese cloud monitoring products and IM tools such as DingTalk, WeCom, and Feishu. As a unified On-call platform, Flashduty mainly provides:

  • Integrations with monitoring systems to collect alert events in one place.
  • Event processing such as filtering, enrichment, and relabeling.
  • Event dispatch by condition.
  • Schedule management.
  • Permission management.
  • Collaborative comments.
  • Mobile handling through IM integrations and App support.
  • Alert event analytics.
  • Acknowledgement, escalation, reassignment, and closure.
  • Alert noise reduction.
  • Integration with external event management systems such as Jira.

Flashduty is dedicated to On-call, so it is more complete than Zabbix and Alertmanager for this job. The three are positioned differently: Zabbix is mainly a monitoring system, Alertmanager is mainly an alert event aggregation component, and Flashduty is a purpose-built On-call platform.

Summary

An alert event On-call platform is important. Google SRE has a strong On-call culture, and putting that culture into practice requires an On-call platform. This article listed two community options and one commercial option for reference. At minimum, your company should have a schedule mechanism. Having everyone woken up by every alert is not sustainable.
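To make "a schedule mechanism" concrete, here is a minimal sketch of a fixed rotation in Python. It assumes a simple roster that rotates every `shift_days` days from a known start date; the function name, the roster, and the dates are hypothetical, and real platforms also handle overrides, time zones, and handoff times.

```python
from datetime import date
from typing import List


def on_call_for(day: date, roster: List[str], rotation_start: date, shift_days: int = 7) -> str:
    """Return who is on call on the given day under a fixed rotation."""
    if not roster:
        raise ValueError("roster must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Count completed shifts since the start, then wrap around the roster.
    shift_index = (day - rotation_start).days // shift_days
    return roster[shift_index % len(roster)]


roster = ["alice", "bob", "carol"]      # hypothetical team
start = date(2024, 1, 1)                # first day of alice's shift
```

With this roster, `on_call_for(date(2024, 1, 8), roster, start)` returns `"bob"`, and only that person needs to be paged first when an alert fires.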
