Is Alertmanager enough?

Prometheus users often start alert notification with Alertmanager.

That is reasonable. Alertmanager is the standard component in the Prometheus alerting stack. It receives alerts from Prometheus and handles deduplication, grouping, routing, silencing, inhibition, and notification delivery. For a single team with clear rules and a simple notification path, Alertmanager covers most of what is needed.

But Alertmanager is not a complete On-call platform.

It is closer to a router in the alert-notification pipeline: it decides which alert goes to which receiver under which conditions. A professional On-call platform solves a different set of problems: who is on call, who acknowledged the incident, what happens when nobody responds, whether the response leaves a timeline, and whether leaders can see MTTA, MTTR, response rate, and responder load.

The useful question is narrower: is your team's problem still alert routing, or has it become incident response management?

First, Look at the Open-Source Options Together

Before discussing Alertmanager's boundaries, it helps to put the two open-source options that teams usually consider side by side.

One is Zabbix. Its built-in PROBLEM management is surprisingly complete: through Trigger Actions you can filter alert events by attribute, send different events to different people, and even set up multi-step escalation and acknowledgement. If your company runs only Zabbix, its built-in alert management already solves most of the problem. The catch is that the system is relatively closed: by design it does not ingest events from other monitoring systems. Once you have multiple monitoring sources, Zabbix is no longer the right candidate for a single point of entry.

The other is Alertmanager. From the start it was designed to serve more than just Prometheus: it offers a unified alert-receiving interface and handles the event itself (silencing, inhibition, grouping, routing) quite well. But it is coarse about "who should this go to next." Alertmanager has no model of people, so it cannot do schedules, acknowledgement, escalation, or collaborative comments — anything that involves humans. It is more of a middle component in the alert pipeline than a company-wide On-call platform.

In other words, in the open-source world Zabbix leans toward monitoring and Alertmanager toward alert aggregation. Neither is purpose-built for incident response. That is exactly the boundary this article is about.

What Alertmanager Is Good At

Alertmanager is good at the alert-notification layer.

It deduplicates repeated copies of the same alert so the team is not notified endlessly. It groups similar alerts into one notification; for example, if many instances in the same cluster cannot reach a database, grouping by cluster and alertname reduces notification volume. It routes alerts through a tree based on labels, severity, or business ownership.
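As a rough sketch of the grouping and routing described above, an Alertmanager route might look like the following. The receiver names and the team label are hypothetical, the timing values are illustrative rather than recommended, and the matchers syntax assumes Alertmanager 0.22 or later (older versions use match and match_re instead).

```yaml
route:
  receiver: 'default'                   # hypothetical fallback receiver
  group_by: ['cluster', 'alertname']    # one notification per cluster and alert name
  group_wait: 30s                       # collect alerts this long before the first notification for a new group
  group_interval: 5m                    # minimum wait before notifying about new alerts in an existing group
  repeat_interval: 4h                   # minimum wait before re-sending a group that is still firing
  routes:
  - matchers:
    - team="database"                   # hypothetical ownership label
    receiver: 'dba-team'
  - matchers:
    - severity="critical"
    receiver: 'sre-oncall'

receivers:                              # left empty to keep the sketch short
- name: 'default'
- name: 'dba-team'
- name: 'sre-oncall'
```

Child routes are evaluated in order and the first match wins unless continue is set, so in this sketch a critical alert that also carries team="database" goes to dba-team only.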

It also handles the two common "do not wake people up" cases: silencing known maintenance windows or temporary issues, and inhibiting derived alerts when a higher-level failure already explains them. If a whole cluster is unreachable, connectivity alerts from services inside that cluster can be inhibited.
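The cluster example could be expressed roughly as the inhibit rule below. The alert names ClusterUnreachable and ServiceUnreachable are hypothetical, and source_matchers and target_matchers assume Alertmanager 0.22 or later.

```yaml
inhibit_rules:
- source_matchers:
  - alertname="ClusterUnreachable"     # hypothetical: the higher-level failure
  target_matchers:
  - alertname="ServiceUnreachable"     # hypothetical: the derived alerts to mute
  equal: ['cluster']                   # inhibit only when both alerts share the same cluster label value
```

The equal list matters: without it, one unreachable cluster would mute the service alerts of every cluster. Silences, by contrast, are created at runtime through the Alertmanager UI or amtool rather than in the configuration file.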

If you have one SRE team, a moderate number of Prometheus rules, consistent alert labels, and mostly fixed notification targets, Alertmanager may be enough.

When Alertmanager Starts to Struggle

The problem usually does not appear on day one.

At first, the team may have dozens of alert rules, one default receiver, one IM group, and a few silences. Alertmanager configuration is clear, and everyone knows who should react.

As the business grows, the situation changes:

One Prometheus becomes many Prometheus instances.
One team becomes multiple business teams.
One IM group becomes many groups, owners, and notification policies.
Alerts expand from infrastructure to applications, databases, middleware, Kubernetes, and cloud services.
Alert handling shifts from "someone checks it" to "we need schedules, escalation, postmortems, and metrics."

Alertmanager can still route alerts, but it does not own the complete response workflow.

It does not naturally answer:

Who is on call tonight?
If the primary responder does not react, when does the alert escalate to the backup?
Has this incident been acknowledged?
If the responder snoozes the incident, should escalation continue?
Who was notified during the incident, and did notification succeed?
Which service created the most alert noise this month?
Which team's MTTA became longer?
Which alerts interrupt responders during sleep time?

When these questions affect team efficiency, alerting is no longer just a notification configuration problem. It is an On-call process problem.

Boundary 1: From Receivers to Schedules

An Alertmanager route ultimately sends an alert to a receiver. A receiver can be email, Webhook, PagerDuty, Opsgenie, Slack, or another notification target.

In On-call, the important question is who should be notified right now.

That is not a static list.

The same service may notify the application team during business hours, the primary and backup SRE at night, and a dedicated rotation on weekends and holidays. Business-critical services may require both primary and backup responders. When someone takes leave, the shift needs a temporary swap. Rotation should also be fair, so that the same person is not always on call on weekends or at night.

Flashduty schedules connect incidents with people. Schedule rules support hourly, daily, weekly, and monthly rotations; day/night shifts; primary and backup roles; date masks; temporary swaps; and fair rotation. Dispatch policies can notify the current person in a schedule or only a role such as primary or backup.

This is not complexity for its own sake. It keeps "who is responsible today" out of static Alertmanager receiver configuration.
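For comparison, here is roughly how far static configuration can go. This sketch assumes Alertmanager 0.24 or later (for time_intervals and active_time_intervals); the receiver names, the service label, and the 09:00 to 18:00 window are hypothetical. It switches receivers by clock time, but it cannot express a rotation, a swap, or a named person.

```yaml
time_intervals:
- name: 'business-hours'
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '09:00'                       # UTC by default
      end_time: '18:00'

route:
  receiver: 'default'
  routes:
  - matchers:
    - service="payment"
    receiver: 'payment-app-team'                # hypothetical daytime target
    active_time_intervals: ['business-hours']   # notifies only inside the window
    continue: true                              # always evaluate the next sibling too
  - matchers:
    - service="payment"
    receiver: 'sre-night-shift'                 # hypothetical off-hours target
    mute_time_intervals: ['business-hours']     # notifies only outside the window

receivers:
- name: 'default'
- name: 'payment-app-team'
- name: 'sre-night-shift'
```

The continue flag is needed because a route outside its active interval still ends route matching. Who actually sits behind sre-night-shift on a given night remains outside the file.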

Boundary 2: From Notification Delivery to Automatic Escalation

Alertmanager can send an alert.

Sending an alert does not mean someone responded.

In real incidents, the dangerous case is often not that there is no alert. It is that the alert was sent, everyone assumed someone else would handle it, and nobody took ownership.

A professional On-call platform makes response actions explicit.

Flashduty dispatch policies can configure trigger conditions, notification targets, notification methods, delay windows, templates, and escalation rules. Escalation can be based on "not closed" or "not closed and not acknowledged," and can move to the next step after a defined time. Targets can be schedules, teams, individuals, or combinations. Notification methods include phone, SMS, email, App push, IM direct message, and group chat.

This solves a core problem: critical incidents cannot rely on "people should see the message."

For example:

For a severity=critical payment alert, notify the primary payment SRE first; escalate to the backup if it is not acknowledged in 10 minutes; escalate to the payment development owner if it is not closed in 30 minutes; send lower-level Warning alerts only to the IM group without phone calls.

If this entire policy lives in Alertmanager YAML, it quickly becomes a hard-to-audit convention. In an On-call platform, dispatch, notification, acknowledgement, and escalation become one traceable workflow.

Boundary 3: From Alert to Incident

Alertmanager handles alerts.

An On-call platform handles incidents.

That difference is important.

A real incident is rarely one alert. API error rate may rise first, database connection pools may be exhausted next, availability checks may fail after that, and recovery notifications may arrive later. Responders need to handle "payment service unavailable," not every single alert event.

Flashduty treats raw notifications from monitoring systems as events. Events trigger alerts, and similar alerts can be grouped into one incident. The incident is the main object: it can be dispatched, notified, acknowledged, snoozed, closed, merged, and reviewed.

Noise reduction happens between alert and incident.

Flashduty supports rule-based grouping and intelligent grouping. Rule-based grouping exactly matches specified labels or attributes. Intelligent grouping calculates similarity from title, description, labels.service, labels.resource, and other fields. Later alerts that enter an existing incident do not retrigger notifications.

Flashduty also supports storm warnings, flapping detection, silences, and suppression policies. Flapping detection identifies incidents that repeatedly trigger and recover, and reduces interruption according to configuration. A delay window can wait before the first notification; if the incident recovers automatically during that window, no notification is sent.

Alertmanager also has grouping, silencing, and inhibition. The difference is that a professional On-call platform stores the noise-reduction result as an incident object and continues with dispatch, acknowledgement, escalation, timeline, and metrics around that incident.

Boundary 4: From Group Messages to Actionable Collaboration

Many teams send Alertmanager alerts to Feishu, DingTalk, WeCom, or Slack groups.

That is easy to set up, but it gets out of control quickly.

Group messages notify people, but they are poor at managing state. Once messages scroll away, nobody knows whether someone is handling the issue. Someone may reply "I'll check," but the system may not record an acknowledged state. After recovery, the response process may not leave a complete record.

Flashduty incidents can be acknowledged in the console, IM app messages, and voice calls. After a voice alert finishes reading the message, the responder can acknowledge by pressing a key. IM app cards can provide actions such as acknowledge and close. The incident timeline records trigger, dispatch, notification, acknowledgement, snooze, closure, comments, and other actions.

This matters because incident response is not just "who received a message." It also needs to answer "who owns it, what state is it in, who is responsible next, was it escalated, and can we reconstruct the story afterward?"

Boundary 5: From Configuration Files to Data-Driven Management

Alertmanager configuration tells you how alerts are routed.

It does not easily tell you whether team response is improving.

SRE leaders usually need metrics such as:

How many incidents happened this month?
Which services, teams, or workspaces generated the most alerts?
What are MTTA and MTTR?
Which incidents were never acknowledged?
Which responders were interrupted most by SMS, phone, or App push?
Did alerts concentrate during work time, rest time, or sleep time?
Which alert checks and alert objects are most frequent?

Flashduty analytics show incident data by team, workspace, individual, and other dimensions. Metrics include MTTA, MTTR, response rate, response effort, and interruption count. Time can be split into work, rest, and sleep periods. Global views show top alert checks and objects, with data download and CSV export.

This data gives managers a better starting point than gut feeling. With data, the team can identify which rule, service, schedule, or alert type consumes the most response time.

The Boundary in One Table

Put those five boundaries into a single selection table and the difference becomes concrete:

Dimension	Alertmanager	Professional On-call Platform
Multi-source ingestion	Prometheus-first; everything else via your own Webhook plumbing	Native connectors for many monitoring tools, cloud monitoring, logs, APM
Scheduling	No model of people; targets hard-coded in receivers	Rotating schedules with primary/backup, day/night shifts, swaps, fairness
Automatic escalation	None; an unanswered alert needs a human watching	Auto-escalate on unacknowledged/unclosed to backup or owner after N minutes
Noise reduction	Grouping, silencing, inhibition (stays at the alert layer)	Result lands on an incident object, plus intelligent grouping, storm warning, flapping detection
Collaboration & ack	Group-message notification; no managed state	Acknowledge from console / IM / voice call with a full incident timeline
Mobile	Depends on IM bot messages; actions require a laptop	App push plus acknowledge, comment, and close directly on mobile
Data-driven management	Config as documentation; no response metrics	MTTA/MTTR, response rate, interruption count, and responder-load dashboards
Cost & positioning	Open source, no license fee; positioned as alert aggregation	SaaS with a free tier; positioned as a closed-loop response platform

The table is not meant to prove which tool is "stronger." The point is this: when your needs land in the right-hand column, pushing more logic into Alertmanager's config files rarely makes life simpler.

When Alertmanager Is Enough

Continue prioritizing Alertmanager if your team has these characteristics:

Only a few Prometheus instances and alert rules.
Mostly fixed alert recipients, with no complex schedule needs.
Alerts can be handled quickly in one shared group.
No need for primary/backup schedules, automatic escalation, or phone/SMS fallback.
No need for cross-team collaboration, ticket synchronization, status pages, or postmortems.
No need to measure MTTA, MTTR, and response effort by team, person, or service.
Alertmanager is maintained by a small group and configuration complexity is still manageable.

In that situation, a well-configured Alertmanager with clean labels and high-quality rules is more practical than introducing another platform too early.

When to Introduce a Professional On-call Platform

Evaluate a dedicated On-call platform when you see these signals:

Alerts come not only from Prometheus, but also from Zabbix, Nightingale, Grafana, cloud monitoring, logs, APM, or internal systems.
Multiple teams share one Alertmanager configuration, and routes and receivers are becoming hard to maintain.
Alerts often enter a group but nobody clearly acknowledges them.
Critical alerts need phone, SMS, App, and IM notification paths.
Schedules are maintained in Excel, group announcements, or memory.
If the primary responder does not react, alerts must escalate to a backup or manager.
Alert storms generate a large number of repeated notifications.
The team needs MTTA, MTTR, response rate, interruption count, and responder-load metrics.
After incidents, the team needs a timeline, postmortem, ticket, or audit record.

At that point, pushing all the logic into Alertmanager rarely makes things simpler. It tends to turn the response process into hidden YAML, which raises maintenance cost and incident risk.

A safer model is to keep Prometheus and Alertmanager for monitoring and alert routing, and move On-call response into a dedicated platform.

How to Connect Alertmanager to Flashduty

The integration path is direct: Alertmanager sends alerts to Flashduty through a Webhook receiver.

Flashduty provides a Prometheus alert integration that works with Alertmanager 0.16.0 or later. Create a Prometheus integration in Flashduty, copy the integration push URL, and add a Webhook receiver to Alertmanager:

receivers:
- name: 'flashcat'
  webhook_configs:
  - url: '<your integration push URL>'
    send_resolved: true

If you want Alertmanager to send alerts to Flashduty by default, reference the receiver in route:

route:
  receiver: 'flashcat'

If you do not want to affect existing notification channels, place the Flashduty receiver in a child route and trial only selected services or severities.
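A minimal sketch of such a trial route follows, assuming Alertmanager 0.22 or later for the matchers syntax. The existing-default receiver and the service="payment" label are hypothetical stand-ins for your own configuration.

```yaml
route:
  receiver: 'existing-default'       # hypothetical: your current default receiver
  routes:
  - matchers:
    - service="payment"              # hypothetical label: trial one service only
    - severity="critical"
    receiver: 'flashcat'
    continue: true                   # keep evaluating the sibling routes below
  # ... existing child routes stay here, unchanged ...
  - receiver: 'existing-default'     # catch-all so trial alerts still reach the old default

receivers:
- name: 'existing-default'
- name: 'flashcat'
  webhook_configs:
  - url: '<your integration push URL>'
    send_resolved: true
```

With continue: true the alert is also evaluated against later sibling routes, so existing channels keep working. The final catch-all is there because once any child route matches, the root receiver is no longer used on its own; it can be dropped if your existing child routes already cover these alerts.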

Flashduty supports dedicated integrations and shared integrations. A dedicated integration sends alerts directly into one workspace. A shared integration uses payload data and routing rules to send different alerts to different workspaces.

Prometheus alert severity maps to Flashduty severity. Flashduty checks labels such as severity, priority, and level in order. For example, critical maps to Critical, warning or warn maps to Warning, info maps to Info, and ok maps to Ok.
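To illustrate where that label comes from, here is a sketch of a Prometheus alerting rule that sets severity explicitly. The rule name, the http_requests_total metric with its job and code labels, and the 5% threshold are all hypothetical.

```yaml
groups:
- name: payment-alerts
  rules:
  - alert: PaymentHighErrorRate
    # Ratio of 5xx responses to all responses over the last 5 minutes
    expr: |
      sum(rate(http_requests_total{job="payment", code=~"5.."}[5m]))
        /
      sum(rate(http_requests_total{job="payment"}[5m])) > 0.05
    for: 5m                    # must hold for 5 minutes before the alert fires
    labels:
      severity: critical       # per the mapping above, becomes Critical in Flashduty
      service: payment
    annotations:
      summary: 'Payment API 5xx ratio has been above 5% for 5 minutes'
```

Setting severity on every rule keeps the mapping predictable, and labels such as service are what label-based routing and grouping can key on.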

This means you do not need to replace Prometheus or abandon Alertmanager. Use Alertmanager as the upstream alert source, and let Flashduty complete the response loop.

What to Validate in a 14-Day Trial

Do not evaluate an On-call platform only from a feature table.

Run real Prometheus alerts through it.

Use this sequence:

Choose one business service and connect the Alertmanager Webhook.
Confirm that Critical, Warning, and Info severities map correctly.
Create a workspace and schedule for the service.
Configure a dispatch policy: Critical alerts use phone or SMS; Warning alerts only go to IM.
Configure unacknowledged escalation, such as escalating to the backup if nobody acknowledges within 10 minutes.
Enable rule-based or intelligent grouping and check whether repeated alerts enter the same incident.
Trigger a test alert and validate IM, phone, SMS, and App paths.
Acknowledge, snooze, and close the incident in IM or the console.
Check the incident timeline to confirm notification, acknowledgement, and closure are recorded.
Review analytics for incident count, MTTA, MTTR, response rate, and interruption count.

If these ten steps work, the team can judge whether Alertmanager is still enough or whether incident response belongs in a professional On-call platform.

Keep Alertmanager, Complete the Response Loop

Prometheus Alertmanager is an excellent alert routing component. Do not replace it casually.

It is good at alert grouping, routing, silencing, inhibition, and notification delivery. The missing part is the response workflow after Alertmanager: schedules, dispatch, escalation, acknowledgement, collaboration, timeline, analytics, and postmortems.

For small teams, Alertmanager may be exactly the right tool.
For multi-team environments, 24x7 schedules, critical services, complex notification paths, and data-driven management, a professional On-call platform is usually a better fit.

The safest way to decide is to run a 14-day trial with real alerts: keep Prometheus and Alertmanager, send the alerts to Flashduty, and let the team run the full flow from trigger to notification, acknowledgement, escalation, resolution, and analytics.

Sources:

Prometheus Alertmanager documentation: deduplication, grouping, routing, silencing, inhibition, and HA. https://prometheus.io/docs/alerting/latest/alertmanager/
Prometheus Alerting overview: the boundary between Prometheus alerting rules and Alertmanager. https://prometheus.io/docs/alerting/latest/overview/
Flashduty Prometheus alert integration: connect Alertmanager Webhook to Flashduty. https://docs.flashduty.com/en/on-call/integration/alert-integration/alert-sources/prometheus
Flashduty noise reduction: event, alert, and incident model; grouping, storm warnings, flapping detection, silences, and suppression. https://docs.flashduty.com/en/on-call/channel/noise-reduction
Flashduty dispatch policies and schedule management: schedules, notification methods, delay windows, escalation rules, and primary/backup on-call. https://docs.flashduty.com/en/on-call/channel/escalation-rule
Flashduty analytics: MTTA, MTTR, response rate, response effort, interruption count, and data export. https://docs.flashduty.com/en/on-call/analytics/insights