Prometheus users often start alert notification with Alertmanager.
That is reasonable. Alertmanager is the standard component in the Prometheus alerting stack. It receives alerts from Prometheus and handles deduplication, grouping, routing, silencing, inhibition, and notification delivery. For a single team with clear rules and a simple notification path, Alertmanager covers most of what is needed.
But Alertmanager is not a complete On-call platform.
It is closer to a dispatcher in the alert-notification pipeline: it decides which alert is sent to which receiver under which conditions. A professional On-call platform solves a different set of problems: who is on call, who acknowledged the incident, what happens when nobody responds, whether the response has a timeline, and whether leaders can see MTTA, MTTR, response rate, and responder load.
So the real question is not whether Alertmanager is good. It is whether your team's problem is still alert routing, or whether it has become incident response management.
What Alertmanager Is Good At
Alertmanager has five core strengths.
First, deduplication. When multiple Prometheus instances fire the same alert, or the same alert is sent repeatedly, the team is not notified over and over.
Second, grouping. Similar alerts can be merged into one notification. For example, if many instances in the same cluster cannot reach a database, grouping by labels such as cluster and alertname reduces notification volume.
Third, routing. Alertmanager uses a routing tree to send alerts with different labels, severities, or business ownership to different receivers.
Fourth, silencing. Maintenance windows, known issues, and temporary cases that do not need notification can be silenced for a defined period.
Fifth, inhibition. When a higher-level failure already exists, derived alerts can be suppressed. For example, if a whole cluster is unreachable, connectivity alerts from services inside that cluster can be inhibited.
These capabilities are extremely useful.
If you have one SRE team, a moderate number of Prometheus rules, consistent alert labels, and mostly fixed notification targets, Alertmanager may be enough.
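As a rough illustration, the grouping, routing, and inhibition described above might look like this in an Alertmanager configuration. This is a minimal sketch, not a recommended production setup. The receiver names, label values, alert names, and example.invalid URLs are placeholders, and the `matchers`, `source_matchers`, and `target_matchers` fields assume Alertmanager 0.22 or later (older releases use `match` and `match_re`).

```yaml
route:
  receiver: default-im                  # fallback when no child route matches
  group_by: ['cluster', 'alertname']    # merge similar alerts into one notification
  group_wait: 30s                       # wait briefly so related alerts arrive together
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - team="database"               # placeholder label value
      receiver: dba-im
    - matchers:
        - severity="critical"
      receiver: sre-pager

inhibit_rules:
  # While ClusterUnreachable fires for a cluster, suppress connectivity
  # alerts that carry the same cluster label. Both alert names are placeholders.
  - source_matchers:
      - alertname="ClusterUnreachable"
    target_matchers:
      - alertname="ServiceConnectivityFailed"
    equal: ['cluster']

receivers:
  - name: default-im
    webhook_configs:
      - url: 'https://example.invalid/hooks/default'   # placeholder URL
  - name: dba-im
    webhook_configs:
      - url: 'https://example.invalid/hooks/dba'       # placeholder URL
  - name: sre-pager
    webhook_configs:
      - url: 'https://example.invalid/hooks/sre'       # placeholder URL
```

Deduplication needs no configuration, and silences are created at runtime through the Alertmanager UI, API, or amtool rather than in this file.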
When Alertmanager Starts to Struggle
The problem usually does not appear on day one.
At first, the team may have dozens of alert rules, one default receiver, one IM group, and a few silences. Alertmanager configuration is clear, and everyone knows who should react.
As the business grows, the situation changes:
- One Prometheus becomes many Prometheus instances.
- One team becomes multiple business teams.
- One IM group becomes many groups, owners, and notification policies.
- Alerts expand from infrastructure to applications, databases, middleware, Kubernetes, and cloud services.
- Alert handling shifts from "someone checks it" to "we need schedules, escalation, postmortems, and metrics."
Alertmanager can still route alerts, but it does not own the complete response workflow.
It does not naturally answer:
Who is on call tonight?
When the primary responder does not react, when does it escalate to the backup?
Has this incident been acknowledged?
If the responder snoozes the incident, should escalation continue?
Who was notified during the incident, and did notification succeed?
Which service created the most alert noise this month?
Which team's MTTA became longer?
Which alerts interrupt responders during sleep time?
When these questions affect team efficiency, alerting is no longer just a notification configuration problem. It is an On-call process problem.
Boundary 1: From Receivers to Schedules
An Alertmanager route ultimately sends an alert to a receiver. A receiver can be email, Webhook, PagerDuty, Opsgenie, Slack, or another notification target.
In On-call, the important question is who should be notified right now.
That is not a static list.
The same service may notify the application team during business hours, the primary and backup SRE at night, and a special rotation on weekends and holidays. Critical services may require both primary and backup responders. Leave requests require temporary swaps. Rotation should also be fair, so that the same person does not always cover weekends or nights.
Flashduty schedules connect incidents with people. Schedule rules support hourly, daily, weekly, and monthly rotations; day/night shifts; primary and backup roles; date masks; temporary swaps; and fair rotation. Dispatch policies can notify the current person in a schedule or only a role such as primary or backup.
This is not complexity for its own sake. It keeps "who is responsible today" out of static Alertmanager receiver configuration.
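To see why receiver configuration stays static, consider how far Alertmanager itself can go. The sketch below splits one service between a daytime and an out-of-hours receiver using time intervals. It assumes a recent Alertmanager release (the top-level `time_intervals` field and `active_time_intervals` are not available in older versions), and the interval name, receiver names, label values, and URLs are placeholders.

```yaml
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'   # evaluated in UTC unless a location is configured
            end_time: '18:00'

route:
  receiver: default-im
  routes:
    # Daytime: notify the application team. continue lets the next
    # sibling route be evaluated as well.
    - matchers:
        - service="payment"
      receiver: payment-app-team
      active_time_intervals: [business-hours]
      continue: true
    # Outside business hours: notify the SRE receiver instead.
    - matchers:
        - service="payment"
      receiver: sre-on-duty
      mute_time_intervals: [business-hours]

receivers:
  - name: default-im
    webhook_configs:
      - url: 'https://example.invalid/hooks/default'    # placeholder URL
  - name: payment-app-team
    webhook_configs:
      - url: 'https://example.invalid/hooks/payment'    # placeholder URL
  - name: sre-on-duty
    webhook_configs:
      - url: 'https://example.invalid/hooks/sre'        # placeholder URL
```

This decides when a receiver is active, but not which person stands behind it. Rotations, holiday calendars, and temporary swaps would each require editing and reloading this file.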
Boundary 2: From Notification Delivery to Automatic Escalation
Alertmanager can send an alert.
Sending an alert does not mean someone responded.
In real incidents, the dangerous case is often not that there is no alert. It is that the alert was sent, everyone assumed someone else would handle it, and nobody took ownership.
A professional On-call platform makes response actions explicit.
Flashduty dispatch policies can configure trigger conditions, notification targets, notification methods, delay windows, templates, and escalation rules. Escalation can be based on "not closed" or "not closed and not acknowledged," and can move to the next step after a defined time. Targets can be schedules, teams, individuals, or combinations. Notification methods include phone, SMS, email, App push, IM direct message, and group chat.
This solves a core problem: critical incidents cannot rely on "people should see the message."
For example:
For a severity=critical payment alert:
- Notify the primary payment SRE first.
- Escalate to the backup if the incident is not acknowledged within 10 minutes.
- Escalate to the payment development owner if it is not closed within 30 minutes.
Lower-level Warning alerts go only to the IM group, without phone calls.
If this entire policy lives in Alertmanager YAML, it quickly becomes a hard-to-audit convention. In an On-call platform, dispatch, notification, acknowledgement, and escalation become one traceable workflow.
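Written out as data, the payment policy described above might look like the sketch below. The notation is hypothetical and tool-neutral: it is neither Alertmanager syntax nor Flashduty's configuration format, and every key name is invented for illustration. The notification methods for the critical steps are also an assumption, since the example only states that Warning alerts avoid phone calls.

```yaml
# Hypothetical notation. NOT Alertmanager syntax, NOT a Flashduty schema.
policies:
  - name: payment-critical
    match:
      service: payment
      severity: critical
    steps:
      - notify: payment-sre-primary
        via: [phone, sms]              # assumed methods
      - escalate_if: not_acknowledged  # requires an acknowledged state
        after: 10m                     # assumed to be measured from the trigger time
        notify: payment-sre-backup
        via: [phone, sms]
      - escalate_if: not_closed        # requires a closed state
        after: 30m
        notify: payment-dev-owner
        via: [phone]
  - name: payment-warning
    match:
      service: payment
      severity: warning
    steps:
      - notify: payment-im-group
        via: [im_group]                # no phone calls for Warning
```

The conditions `not_acknowledged` and `not_closed` depend on per-incident state. Alertmanager knows whether an alert is firing or resolved, but it does not record who acknowledged it, which is why this logic has no direct equivalent in its routing tree.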
Boundary 3: From Alert to Incident
Alertmanager handles alerts.
An On-call platform handles incidents.
That difference is important.
A real incident is rarely one alert. The API error rate may rise first, database connection pools may be exhausted next, availability checks may fail after that, and recovery notifications may arrive later. Responders need to handle "payment service unavailable," not every single alert event.
Flashduty treats raw notifications from monitoring systems as events. Events trigger alerts, and similar alerts can be grouped into one incident. The incident is the main object: it can be dispatched, notified, acknowledged, snoozed, closed, merged, and reviewed.
Noise reduction happens between alert and incident.
Flashduty supports rule-based grouping and intelligent grouping. Rule-based grouping exactly matches specified labels or attributes. Intelligent grouping calculates similarity from title, description, labels.service, labels.resource, and other fields. Later alerts that enter an existing incident do not retrigger notifications.
Flashduty also supports storm warnings, flapping detection, silences, and suppression policies. Flapping detection identifies incidents that repeatedly trigger and recover, and reduces interruption according to configuration. A delay window can wait before the first notification; if the incident recovers automatically during that window, no notification is sent.
Alertmanager also has grouping, silencing, and inhibition. The difference is that a professional On-call platform stores the noise-reduction result as an incident object and continues with dispatch, acknowledgement, escalation, timeline, and metrics around that incident.
Boundary 4: From Group Messages to Actionable Collaboration
Many teams send Alertmanager alerts to Feishu, DingTalk, WeCom, or Slack groups.
That is easy to set up, but it can quickly get out of control.
Group messages notify people, but they are poor at managing state. Once messages scroll away, nobody knows whether someone is handling the issue. Someone may reply "I'll check," but the system may not record an acknowledged state. After recovery, the response process may not leave a complete record.
Flashduty incidents can be acknowledged in the console, IM app messages, and voice calls. After a voice alert finishes reading the message, the responder can acknowledge by pressing a key. IM app cards can provide actions such as acknowledge and close. The incident timeline records trigger, dispatch, notification, acknowledgement, snooze, closure, comments, and other actions.
This matters because incident response is not just "who received a message." It also needs to answer "who owns it, what state is it in, who is responsible next, was it escalated, and can we reconstruct the story afterward?"
Boundary 5: From Configuration Files to Data-Driven Management
Alertmanager configuration tells you how alerts are routed.
It does not easily tell you whether team response is improving.
SRE leaders usually need metrics such as:
- How many incidents happened this month?
- Which services, teams, or workspaces generated the most alerts?
- What are MTTA and MTTR?
- Which incidents were never acknowledged?
- Which responders were interrupted most by SMS, phone, or App push?
- Did alerts concentrate during work time, rest time, or sleep time?
- Which alert checks and alert objects are most frequent?
Flashduty analytics show incident data by team, workspace, individual, and other dimensions. Metrics include MTTA, MTTR, response rate, response effort, and interruption count. Time can be split into work, rest, and sleep periods. Global views show top alert checks and objects, with data download and CSV export.
This data is critical for management.
Without data, alert governance is based on feeling.
With data, the team can identify which rule, service, schedule, or alert type consumes the most response time.
When Alertmanager Is Enough
Continue prioritizing Alertmanager if your team has these characteristics:
- Only a few Prometheus instances and alert rules.
- Mostly fixed alert recipients, with no complex schedule needs.
- Alerts can be handled quickly in one shared group.
- No need for primary/backup schedules, automatic escalation, or phone/SMS fallback.
- No need for cross-team collaboration, ticket synchronization, status pages, or postmortems.
- No need to measure MTTA, MTTR, and response effort by team, person, or service.
- Alertmanager is maintained by a small group and configuration complexity is still manageable.
In that situation, a well-configured Alertmanager with clean labels and high-quality rules is more practical than introducing another platform too early.
When to Introduce a Professional On-call Platform
Evaluate a dedicated On-call platform when you see these signals:
- Alerts come not only from Prometheus, but also from Zabbix, Nightingale, Grafana, cloud monitoring, logs, APM, or internal systems.
- Multiple teams share one Alertmanager configuration, and routes and receivers are becoming hard to maintain.
- Alerts often enter a group but nobody clearly acknowledges them.
- Critical alerts need phone, SMS, App, and IM notification paths.
- Schedules are maintained in Excel, group announcements, or memory.
- If the primary responder does not react, alerts must escalate to a backup or manager.
- Alert storms generate a large number of repeated notifications.
- The team needs MTTA, MTTR, response rate, interruption count, and responder-load metrics.
- After incidents, the team needs a timeline, postmortem, ticket, or audit record.
At that point, pushing all logic into Alertmanager is not always simpler. It turns the response process into hidden YAML, increasing maintenance cost and incident risk.
A safer model is to keep Prometheus and Alertmanager for monitoring and alert routing, and move On-call response into a dedicated platform.
How to Connect Alertmanager to Flashduty
The integration path is direct: Alertmanager sends alerts to Flashduty through a Webhook receiver.
Flashduty's Prometheus alert integration supports Alertmanager 0.16.0 or later. Create a Prometheus integration in Flashduty, copy the integration push URL, and add a Webhook receiver to Alertmanager:
receivers:
  - name: 'flashcat'
    webhook_configs:
      - url: '<your integration push URL>'
        send_resolved: true
If you want Alertmanager to send alerts to Flashduty by default, reference the receiver in route:
route:
  receiver: 'flashcat'
If you do not want to affect existing notification channels, place the Flashduty receiver in a child route and trial only selected services or severities.
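A child-route trial along those lines could be sketched as follows. The `existing-default` receiver, its URL, and the `service` and `severity` label values are placeholders for whatever your configuration already uses, and the `matchers` syntax assumes Alertmanager 0.22 or later (older releases use `match` and `match_re`).

```yaml
route:
  receiver: 'existing-default'          # placeholder for your current default receiver
  routes:
    # Trial: copy selected payment alerts to Flashduty.
    - matchers:
        - service="payment"
        - severity=~"critical|warning"
      receiver: 'flashcat'
      continue: true                    # keep evaluating the sibling routes below
    # ... your existing child routes stay here, unchanged ...
    # Catch-all so trial alerts still reach the current default receiver.
    - receiver: 'existing-default'

receivers:
  - name: 'existing-default'
    webhook_configs:
      - url: 'https://example.invalid/hooks/default'   # placeholder URL
  - name: 'flashcat'
    webhook_configs:
      - url: '<your integration push URL>'
        send_resolved: true
```

The final catch-all route matters because of how the routing tree resolves: once any child route matches, the alert is no longer delivered to the parent's receiver. Without the catch-all, alerts that previously fell through to the default receiver would go only to Flashduty during the trial.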
Flashduty supports dedicated integrations and shared integrations. A dedicated integration sends alerts directly into one workspace. A shared integration uses payload data and routing rules to send different alerts to different workspaces.
Prometheus alert severity maps to Flashduty severity. Flashduty checks labels such as severity, priority, and level in order. For example, critical maps to Critical, warning or warn maps to Warning, info maps to Info, and ok maps to Ok.
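On the Prometheus side, the severity comes from labels on the alerting rule. The rule file below is an illustrative sketch: the metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), their labels, the thresholds, and the alert names are hypothetical, and only the `severity` values are taken from the mapping above.

```yaml
groups:
  - name: payment-alerts
    rules:
      - alert: PaymentApiHighErrorRate
        # Ratio of 5xx responses to all responses over 5 minutes.
        expr: |
          sum(rate(http_requests_total{service="payment", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="payment"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical      # expected to map to Critical
          service: payment
        annotations:
          summary: 'Payment API 5xx ratio above 5% for 5 minutes'
      - alert: PaymentApiElevatedLatency
        # 99th percentile latency above 1 second.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="payment"}[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning       # expected to map to Warning
          service: payment
        annotations:
          summary: 'Payment API p99 latency above 1s for 10 minutes'
```

Keeping one consistent `severity` label across all rules keeps the mapping predictable and lets the same label drive both Alertmanager routes and downstream dispatch policies.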
With this integration, you do not need to replace Prometheus or abandon Alertmanager. Use Alertmanager as the upstream alert source, and let Flashduty complete the response loop.
What to Validate in a 14-Day Trial
Do not evaluate an On-call platform only from a feature table.
Run real Prometheus alerts through it.
Use this sequence:
- Choose one business service and connect the Alertmanager Webhook.
- Confirm that Critical, Warning, and Info severities map correctly.
- Create a workspace and schedule for the service.
- Configure a dispatch policy: Critical alerts use phone or SMS; Warning alerts only go to IM.
- Configure unacknowledged escalation, such as escalating to the backup if nobody acknowledges within 10 minutes.
- Enable rule-based or intelligent grouping and check whether repeated alerts enter the same incident.
- Trigger a test alert and validate IM, phone, SMS, and App paths.
- Acknowledge, snooze, and close the incident in IM or the console.
- Check the incident timeline to confirm notification, acknowledgement, and closure are recorded.
- Review analytics for incident count, MTTA, MTTR, response rate, and interruption count.
If these ten steps work, the team can judge whether Alertmanager is still enough or whether incident response belongs in a professional On-call platform.
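For the test-alert step, one low-effort option is a temporary Prometheus rule that always fires. This is a sketch under assumptions: the group name, alert name, and label values are placeholders, and the labels should match whatever your trial route selects.

```yaml
groups:
  - name: oncall-trial
    rules:
      - alert: OnCallTrialTestAlert
        expr: vector(1)            # always returns a sample, so the alert fires immediately
        labels:
          severity: critical       # change to warning or info to exercise other paths
          service: payment         # placeholder; must match your trial route
        annotations:
          summary: 'Test alert for the On-call trial. No action required.'
```

Removing the rule and reloading Prometheus should resolve the alert, which also exercises the recovery path when `send_resolved` is enabled on the Webhook receiver. Because a Critical test alert may trigger real phone calls, tell the on-call responder before loading it.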
Conclusion: Do Not Replace Alertmanager. Complete the Response Loop.
Prometheus Alertmanager is an excellent alert routing component and should not be casually replaced.
It is good at alert grouping, routing, silencing, inhibition, and notification delivery. The missing part is the response workflow after Alertmanager: schedules, dispatch, escalation, acknowledgement, collaboration, timeline, analytics, and postmortems.
For small teams, Alertmanager may be exactly the right tool.
For multi-team environments, 24x7 schedules, critical services, complex notification paths, and data-driven management, a professional On-call platform is usually a better fit.
The safest way to decide is not to argue about tool boundaries. Take a real set of alerts and run a 14-day trial.
Keep Prometheus and Alertmanager.
Send alerts to Flashduty.
Let the team run the complete flow from alert trigger to notification, acknowledgement, escalation, resolution, and analytics.
Sources:
- Prometheus Alertmanager documentation: deduplication, grouping, routing, silencing, inhibition, and HA. https://prometheus.io/docs/alerting/latest/alertmanager/
- Prometheus Alerting overview: the boundary between Prometheus alerting rules and Alertmanager. https://prometheus.io/docs/alerting/latest/overview/
- Flashduty Prometheus alert integration: connect Alertmanager Webhook to Flashduty. https://docs.flashcat.cloud/zh/on-call/integration/alert-integration/alert-sources/prometheus
- Flashduty noise reduction: event, alert, and incident model; grouping, storm warnings, flapping detection, silences, and suppression. https://docs.flashcat.cloud/zh/on-call/channel/noise-reduction
- Flashduty dispatch policies and schedule management: schedules, notification methods, delay windows, escalation rules, and primary/backup on-call. https://docs.flashcat.cloud/zh/on-call/channel/escalation-rule
- Flashduty analytics: MTTA, MTTR, response rate, response effort, interruption count, and data export. https://docs.flashcat.cloud/zh/on-call/analytics/insights