Google's SRE book explains that OnCall is a critical part of SRE work. SRE teams are responsible for production stability, and OnCall is an important mechanism for protecting that stability. Overseas teams take alert OnCall seriously. Dedicated products such as PagerDuty are widely used and have become close to an industry standard. Even Netflix uses PagerDuty.
For teams in China, PagerDuty can be inconvenient. Network access can be an issue, and the English-only interface and support create friction for many operations engineers. PagerDuty also does not integrate well with domestic IM tools such as DingTalk, Feishu, and WeCom. But alert OnCall response is still a high-frequency, high-value need, so domestic alternatives have emerged. Flashduty is one of them: a lighter-weight alert collaboration platform that understands Chinese teams better than PagerDuty.
Why Is Alert OnCall So Important?
PagerDuty is a company with a market capitalization of about $2 billion. It focuses on one thing: alert OnCall. Why is this direction worth that level of investment? Because the pain is real and the ceiling is high. SREs and developers who participate in OnCall often face these problems:
- Alert events are scattered everywhere. Most companies run several monitoring systems: some for metrics, some for logs, some for cloud resources, and they may be self-built, open source, or commercial. Alert events and alert rules end up spread across all of them, and responder information has to be maintained separately in each one. Unified analysis and handling become very difficult.
- Alert storms. A failure in an underlying infrastructure component or service can trigger a broad alert storm. Phones may ring nonstop, and if voice alerts are involved, the phone can become unusable. This makes response extremely painful.
- Missed alerts. OnCall responders may miss alerts for many reasons: the phone runs out of battery, loses signal, is muted, is lost, is damaged, gets soaked, or the responder simply sleeps too deeply at night.
Products such as PagerDuty were created to solve these problems, and Flashduty is built for the same purpose. It focuses on alert OnCall, helping teams avoid alert bombardment, prevent missed alerts, and make alert response easier and calmer. Below, we use Flashduty as an example.
Alert Integration
Flashduty integrates with many types of monitoring systems. After those systems generate alert events, they send them to Flashduty for unified downstream handling:

Most monitoring systems provide webhooks for third-party integration, and Flashduty receives alert events through those webhooks. However, each monitoring system produces alert events in its own format, so the events must be normalized into one unified format before they can be handled together. This is tedious work, and Flashduty takes it on by adapting to each event source.
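As a rough illustration of what adapting to each event source can look like, here is a minimal Python sketch of per-source adapters that map webhook payloads into one unified event. The `AlertEvent` fields, the `simple_monitor` source, and its payload keys are assumptions made up for this example, not Flashduty's actual schema. The Alertmanager adapter reads only the `alerts`, `status`, and `labels` fields of the Prometheus Alertmanager webhook payload.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class AlertEvent:
    """Hypothetical unified event shape; not Flashduty's actual schema."""
    source: str
    title: str
    severity: str
    status: str
    labels: dict[str, str] = field(default_factory=dict)


def from_alertmanager(payload: dict[str, Any]) -> list[AlertEvent]:
    """Map a Prometheus Alertmanager webhook payload to unified events."""
    events = []
    for alert in payload.get("alerts", []):
        labels = dict(alert.get("labels", {}))
        events.append(AlertEvent(
            source="alertmanager",
            title=labels.get("alertname", "unknown"),
            severity=labels.get("severity", "warning"),
            status="resolved" if alert.get("status") == "resolved" else "firing",
            labels=labels,
        ))
    return events


def from_simple_monitor(payload: dict[str, Any]) -> list[AlertEvent]:
    """Map a made-up flat payload (one alert per request) to a unified event."""
    level_map = {1: "critical", 2: "warning", 3: "info"}
    return [AlertEvent(
        source="simple_monitor",
        title=payload["name"],
        severity=level_map.get(payload.get("level"), "warning"),
        status="resolved" if payload.get("recovered") else "firing",
        labels={"host": payload.get("host", "")},
    )]


# One adapter per event source; adding a source means adding one entry here.
ADAPTERS: dict[str, Callable[[dict[str, Any]], list[AlertEvent]]] = {
    "alertmanager": from_alertmanager,
    "simple_monitor": from_simple_monitor,
}


def normalize(source: str, payload: dict[str, Any]) -> list[AlertEvent]:
    """Dispatch to the adapter registered for this event source."""
    return ADAPTERS[source](payload)
```

Everything downstream of `normalize` only ever sees `AlertEvent`, which is what makes unified filtering, grouping, and notification possible regardless of where an alert came from.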
Some monitoring systems are less capable and do not provide webhooks, but at least provide email alerts. In that case, Flashduty generates a dedicated email address. The monitoring system sends alert emails to that address, and Flashduty parses the email content into alert events. This integration method is basic, but Flashduty can still handle it.
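To make the email path concrete, the following Python sketch parses a raw alert email into a flat event using only the standard library `email` package. The parsing convention is an assumption for illustration (subject as the alert title, optional `key: value` lines in the plain-text body). How Flashduty actually parses alert emails is not described here.

```python
from email import message_from_string
from email.policy import default


def parse_alert_email(raw: str) -> dict[str, str]:
    """Turn a raw RFC 5322 message into a flat alert dict.

    Assumed convention (hypothetical): the subject is the alert title and
    the plain-text body carries optional "key: value" lines.
    """
    msg = message_from_string(raw, policy=default)
    event = {
        "title": str(msg["Subject"] or "").strip(),
        "sender": str(msg["From"] or "").strip(),
    }
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Lines without a colon are free text and are skipped.
        if sep and key.strip():
            # setdefault keeps "title" and "sender" from being overwritten.
            event.setdefault(key.strip().lower(), value.strip())
    return event
```

Free-text emails carry far less structure than webhook payloads, so an email integration like this one usually recovers only a title and a few fields.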
Unified Alert Handling
After alert events enter Flashduty, they pass through a processing pipeline, including:
- Alert filtering. Drop alerts that are not needed.
- Alert muting. Mute certain alerts based on rules. They are still persisted, but no notification is sent.
- Enrichment. Alert events carry various labels, and teams usually need to rewrite or add labels (relabeling). Flashduty can also integrate with internal CMDB data. For example, when a machine alert occurs, Flashduty can use the IP address to look up additional information such as the owning team and owner, then attach that information to the alert event.
- Persistence. Store alert events in a database for later querying.
- Convergence and grouping. Group similar alert events into an incident to avoid alert storms. Users can acknowledge at the incident level, effectively acknowledging a batch of alerts at once.
- Acknowledgment and escalation. Users can acknowledge incidents. If an incident is not acknowledged, Flashduty continues through the escalation process and notifies backup responders until someone acknowledges it, preventing missed alerts.
- Alert collaboration. Team members can comment on incidents and close incidents. When closing an incident, they can also write the handling method and lessons learned for future reference.
- Mobile support. Alert response is highly time-sensitive, so Flashduty provides a mobile app that lets users handle alerts anywhere.
- And more.
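The first few stages of such a pipeline can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Flashduty's implementation: the `Alert` shape, the in-memory `CMDB` dictionary, the filter and mute rules, and the grouping key (title plus team) are all made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Alert:
    title: str
    severity: str
    labels: dict[str, str] = field(default_factory=dict)
    muted: bool = False


# Hypothetical stand-in for a CMDB lookup keyed by IP address.
CMDB = {"10.0.0.1": {"team": "payments", "owner": "alice"}}


def process(alerts: list[Alert]) -> dict[tuple[str, str], list[Alert]]:
    """Filter, mute, enrich, then group alerts into incidents."""
    incidents: dict[tuple[str, str], list[Alert]] = defaultdict(list)
    for alert in alerts:
        # Filtering: drop alerts nobody needs (here: anything tagged env=test).
        if alert.labels.get("env") == "test":
            continue
        # Muting: keep the alert but flag it so no notification is sent.
        if alert.severity == "info":
            alert.muted = True
        # Enrichment: attach owner data found by IP; existing labels win.
        for key, value in CMDB.get(alert.labels.get("ip", ""), {}).items():
            alert.labels.setdefault(key, value)
        # Grouping: alerts sharing title and team land in one incident.
        incidents[(alert.title, alert.labels.get("team", "unknown"))].append(alert)
    return dict(incidents)


def should_notify(incident: list[Alert]) -> bool:
    """Notify once per incident, and only if some alert in it is not muted."""
    return any(not a.muted for a in incident)
```

Notifying per incident rather than per alert is what dampens an alert storm: a hundred identical alerts from one failing component produce one notification and one acknowledgment.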
The following diagram shows Flashduty's alert handling flow:

That covers the logic. Next, let's look at the Flashduty interface.
Flashduty Interface
First, here is the most commonly used page: the incident summary page, where incidents are collections of alerts.

An alert event response platform embodies OnCall culture. Anyone who has read Google's SRE book will be familiar with this concept. Without OnCall scheduling, SRE practice cannot really take hold, and scheduling requires a flexible tool. Below is Flashduty's schedule page:

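For readers who have not worked with OnCall schedules, the core idea is small enough to sketch in Python: a round-robin rotation decides who is on call at a given time, and an unacknowledged incident moves to the next layer of responders after a fixed delay. The function names, the layer structure, and the use of naive datetimes are assumptions for illustration only. Real scheduling tools also handle time zones, overrides, and shift swaps.

```python
from datetime import datetime, timedelta


def on_call(responders: list[str], start: datetime, shift: timedelta,
            now: datetime) -> str:
    """Return who is on call at `now` in a simple round-robin rotation."""
    if now < start:
        raise ValueError("rotation has not started yet")
    # Number of completed shifts since the rotation started, wrapped around.
    index = ((now - start) // shift) % len(responders)
    return responders[index]


def escalation_target(layers: list[list[str]], start: datetime,
                      shift: timedelta, now: datetime,
                      unacked: timedelta, step: timedelta) -> str:
    """Pick the responder to notify for an incident unacknowledged for `unacked`.

    Each `step` without acknowledgment moves one layer up; the last layer
    keeps being notified once the chain is exhausted.
    """
    level = min(unacked // step, len(layers) - 1)
    return on_call(layers[level], start, shift, now)


# Usage: weekly primary and secondary rotations, escalating every 10 minutes.
primary = ["alice", "bob", "carol"]
secondary = ["dave", "erin"]
rotation_start = datetime(2024, 1, 1)
week = timedelta(days=7)
target = escalation_target([primary, secondary], rotation_start, week,
                           datetime(2024, 1, 9), timedelta(minutes=15),
                           timedelta(minutes=10))
```

In the usage example, the incident has gone 15 minutes without acknowledgment, so the second layer is notified rather than the primary responder for that week.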
Managers want to quantify alert and incident handling efficiency. Frontline engineers also want their workload to be measurable for year-end reporting. Analytics are therefore essential:

Being woken up by an alert call at night, getting out of bed, opening a laptop, connecting to the VPN, and opening the monitoring system is a painful experience. A mobile OnCall app greatly improves that experience, and Flashduty provides one:

Flashduty can do more than receive alert events. It also provides alert engine capabilities: it can connect directly to time series databases and other storage systems and evaluate alert rules against their data. In short, Flashduty covers alert-related work end to end. Below is the alert rule management page:

Trial
Flashduty product introduction and trial links:
- Introduction: https://flashcat.cloud/product/flashduty/
- Trial: https://console.flashcat.cloud/
Conclusion
A real operations revolution does not mean keeping engineers on standby 24 hours a day. It means building systems that stay awake for you.