For frontline OnCall responders (usually operations engineers, and sometimes developers from business teams when operations staffing is limited), being woken by alerts at night is one of the most painful parts of the job. This article offers several improvements that can help OnCall responders sleep better.
Treat It Seriously in Culture and Mindset
Culture and mindset always come first. If you are a frontline OnCall responder and you are suffering, but your manager and your manager's manager do not care, then communication breaks down and nothing improves.
Solving this problem can greatly improve employee well-being, sometimes more effectively than a salary increase. Many domestic companies pay too little attention to employee experience. This is not always intentional; leaders are often focused elsewhere and have simply not noticed the problem. Try raising it with your manager. In many cases, that is enough to get their attention.
It is said that some operations engineers overseas ask during interviews whether the company uses a product like PagerDuty. If the company does not, they simply decline the opportunity, because failing to adopt such a product essentially means neglecting OnCall engineers.
Rotate Duty
OnCall can help engineers grow, but nobody should do it continuously for a long time. This work is relatively fragmented, and repeatedly waking up at night can destroy physical health over time.
OnCall usually requires rotation, such as one group this week and another group next week. OnCall-focused products such as PagerDuty and Flashduty can configure schedules, support shift swaps, integrate with national holiday calendars, and support temporary changes.
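A weekly rotation with temporary overrides can be sketched in a few lines. The Python below is only an illustration: the group names, the start date, and the `on_call_group` function are hypothetical and say nothing about how PagerDuty or Flashduty implement scheduling internally.

```python
from datetime import date

# Hypothetical rotation: three groups, switching every week.
GROUPS = ["group-a", "group-b", "group-c"]
ROTATION_START = date(2024, 1, 1)  # a Monday, chosen only for the example


def on_call_group(day, groups, rotation_start, overrides=None):
    """Return the group on duty for `day` under a weekly rotation.

    `overrides` maps specific dates to a group and models shift swaps
    or other temporary changes; an override always wins.
    """
    overrides = overrides or {}
    if day in overrides:
        return overrides[day]
    weeks_elapsed = (day - rotation_start).days // 7
    return groups[weeks_elapsed % len(groups)]


# Second week of the rotation, with one swapped day.
swaps = {date(2024, 1, 9): "group-c"}
print(on_call_group(date(2024, 1, 8), GROUPS, ROTATION_START, swaps))
print(on_call_group(date(2024, 1, 9), GROUPS, ROTATION_START, swaps))
```

A real product layers holiday calendars and per-person schedules on top of this, but the core idea is the same: a deterministic rotation plus explicit exceptions.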
Some engineers are willing to do OnCall because it can bring double pay or compensatory leave. But if everyone is OnCall at once, the cost to the company becomes too high. Rotating duty therefore also saves money: fewer people are on duty at any given time, so the company pays less OnCall compensation.
What if your company provides no extra compensation for OnCall? Find a tactful way to suggest reform to your manager. If the manager pretends not to understand, there may be little you can do.
Optimize Alert Rules
There are established best practices for alert rule configuration. Typical examples include:
- Every alert should be actionable, meaning it corresponds to a loss-reduction action and an SOP.
- Alert on symptoms rather than causes. See this article for reference.
In many companies, after any incident, the first postmortem question is whether the alert rules were complete. To avoid criticism, employees then add more alert rules, regardless of whether the rules are reasonable. Over time, useful and useless alert rules pile up together. The work becomes painful, people eventually resign or transfer, and new hires cannot tolerate it either. In the long run, this is bad for both the team and the company.
Every alert should be actionable and correspond to a loss-reduction action or SOP. If an alert is only a reminder and requires no action, it is usually unnecessary. If you are worried about missing something important, let this type of alert automatically create a ticket and review the tickets periodically. Do not trigger phone calls, SMS, email, or DingTalk notifications for it.
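The actionable-or-ticket rule can be expressed as a small routing function. This is a minimal sketch under stated assumptions: the `Alert` class, its `sop_url` field, and the `"page"` and `"ticket"` destinations are hypothetical names introduced for illustration, not part of any product's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    # Hypothetical field: link to the SOP describing the loss-reduction action.
    sop_url: Optional[str] = None


def route(alert: Alert) -> str:
    """Page a human only when the alert has an SOP; otherwise file a ticket."""
    return "page" if alert.sop_url else "ticket"


print(route(Alert("api_success_rate_low", "https://wiki.example.com/sop/api")))
print(route(Alert("certificate_expires_in_60_days")))
```

Requiring an SOP link before an alert is allowed to page is one simple way to enforce the rule at configuration time.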
The article referenced above explains alerting on symptoms rather than causes in detail. In short, symptom metrics measure whether your service is healthy from the user's perspective. For example, if you provide an HTTP service, API success rate and latency are typical symptom metrics and should be the primary alert targets. CPU utilization for that service is usually a cause metric. Cause metrics are useful on dashboards but usually should not be alert rules: if a cause metric is truly affecting the business, the symptom metrics will usually show it, and if the symptom metrics are fine, the cause metric has probably not affected the business yet. At most, create a ticket and review it periodically.
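As a rough illustration of a symptom-based check for an HTTP service, the sketch below evaluates success rate and 99th-percentile latency over one evaluation window. The function name, the alert names, and both thresholds (99% success, 500 ms) are assumptions chosen for the example, not recommendations.

```python
def symptom_alerts(total, failed, latencies_ms,
                   min_success_rate=0.99, max_p99_ms=500.0):
    """Return the names of symptom alerts that should fire for one window.

    total / failed: request counts in the window.
    latencies_ms:   per-request latencies in the same window.
    """
    firing = []
    if total > 0:
        success_rate = (total - failed) / total
        if success_rate < min_success_rate:
            firing.append("success_rate")
    if latencies_ms:
        ordered = sorted(latencies_ms)
        # Nearest-rank p99, computed with integers: ceil(0.99 * n).
        rank = (99 * len(ordered) + 99) // 100
        if ordered[rank - 1] > max_p99_ms:
            firing.append("latency_p99")
    return firing


# 2% of requests failed and 5% were slow: both symptoms fire.
print(symptom_alerts(1000, 20, [100] * 95 + [900] * 5))
```

Note that CPU utilization does not appear anywhere in the check. It stays on the dashboard for diagnosis once a symptom alert has fired.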
Converge and Reduce Noise
This is one of the core capabilities of OnCall products such as PagerDuty and Flashduty. To reduce alert interruptions, many simultaneously triggered alerts can be grouped into a limited number of incidents. Notifications are sent at the incident level, preventing your phone from being overwhelmed and improving incident-handling efficiency.
Bursts of alerts from many instances at once are not the only case that can be converged. For example, flapping alerts that repeatedly trigger and recover can be grouped by OnCall products. Teams can also define custom labels so that alerts with the same labels are grouped into one incident, greatly reducing how often people are notified.
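Label-based grouping can be sketched as follows. This is a simplified illustration, not the algorithm any particular product uses: the alert dictionaries, the `group_by` label list, and the 300-second window are all assumptions made for the example.

```python
def group_into_incidents(alerts, group_by, window_seconds=300):
    """Group alerts that share the `group_by` labels into incidents.

    An alert joins the open incident for its label key if it arrives
    within `window_seconds` of that incident's latest alert; otherwise
    it opens a new incident. Flapping alerts therefore stay together.
    """
    incidents = []
    open_by_key = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = tuple(alert["labels"].get(name) for name in group_by)
        incident = open_by_key.get(key)
        if incident is None or alert["ts"] - incident["last_ts"] > window_seconds:
            incident = {"key": key, "alerts": [], "last_ts": alert["ts"]}
            incidents.append(incident)
            open_by_key[key] = incident
        incident["alerts"].append(alert)
        incident["last_ts"] = alert["ts"]
    return incidents


alerts = [
    {"ts": 0, "labels": {"service": "api", "instance": "a1"}},
    {"ts": 10, "labels": {"service": "api", "instance": "a2"}},
    {"ts": 20, "labels": {"service": "db", "instance": "d1"}},
    {"ts": 1000, "labels": {"service": "api", "instance": "a1"}},
]
for incident in group_into_incidents(alerts, group_by=["service"]):
    print(incident["key"], len(incident["alerts"]))
```

Notifying once per incident rather than once per alert is what keeps the phone quiet when many instances fail together.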
Flexible Dispatch
Monitoring systems usually focus on collecting, storing, and visualizing monitoring data and on generating alerts. What happens after an alert is generated, its handling and dispatch, is usually basic. OnCall products are much more flexible. They can define different delivery methods for different alerts, specify which alerts mention people and which do not, set different rules by time period, integrate with work calendars, define different templates, and configure repeated notification and escalation strategies.
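To make the idea of dispatch rules concrete, here is a small sketch of severity-based and time-based routing plus a step-wise escalation policy. Everything in it is hypothetical: the rule format, the channel names, the time windows, and the escalation steps are assumptions for illustration and do not mirror any product's configuration schema.

```python
from datetime import time


def in_window(now, start, end):
    """True if `now` falls in [start, end); handles windows past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


# First matching rule wins. A window of None means "at any time".
RULES = [
    {"severity": "critical", "window": None, "channels": ["phone", "sms"]},
    {"severity": "warning", "window": (time(9, 0), time(18, 0)), "channels": ["chat"]},
    {"severity": "warning", "window": (time(18, 0), time(9, 0)), "channels": ["ticket"]},
]

# (minutes without acknowledgement, who gets notified)
ESCALATION = [(0, "primary"), (15, "secondary"), (30, "manager")]


def pick_channels(severity, now, rules):
    for rule in rules:
        if rule["severity"] != severity:
            continue
        window = rule["window"]
        if window is None or in_window(now, *window):
            return rule["channels"]
    return ["ticket"]  # default: never interrupt for unmatched alerts


def escalation_target(minutes_unacknowledged, steps):
    target = steps[0][1]
    for after_minutes, who in steps:
        if minutes_unacknowledged >= after_minutes:
            target = who
    return target


print(pick_channels("warning", time(23, 30), RULES))
print(escalation_target(20, ESCALATION))
```

With rules like these, a warning at 23:30 becomes a ticket instead of a phone call, while a critical alert still escalates if nobody acknowledges it.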
If you have used Flashduty, you will have noticed that OnCall products integrate well with DingTalk, WeCom, Feishu, Slack, and Microsoft Teams, which makes it possible to work from a phone. Sometimes an alert arrives when you are half asleep. If the alert does not need to be fixed right away, a few taps on the phone are enough to deal with it, and you do not need to get up and open a laptop. That can greatly improve employee well-being.
Operations Governance
From the company's overall perspective, bringing all alerts into one OnCall platform makes it easier to calculate MTTR and other metrics across different dimensions, then optimize the weak points. It also reveals which alert rules trigger most frequently. These are often unreasonable rules that waste money on phone calls and SMS, frustrate employees, and do little to improve service stability.
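Once incidents live in one place, both numbers are straightforward to compute. The sketch below assumes a hypothetical incident record with `rule`, `triggered`, and `resolved` fields; real platforms expose richer data, and the sample values are invented purely to show the arithmetic.

```python
from collections import Counter
from datetime import datetime


def mttr_minutes(incidents):
    """Mean time to resolve, in minutes, over the given incidents."""
    durations = [
        (i["resolved"] - i["triggered"]).total_seconds() / 60
        for i in incidents
    ]
    return sum(durations) / len(durations) if durations else 0.0


def noisiest_rules(incidents, top=3):
    """Alert rules ranked by how many incidents they produced."""
    return Counter(i["rule"] for i in incidents).most_common(top)


# Invented sample data for the example.
INCIDENTS = [
    {"rule": "disk_usage", "triggered": datetime(2024, 1, 1, 2, 0),
     "resolved": datetime(2024, 1, 1, 2, 10)},
    {"rule": "disk_usage", "triggered": datetime(2024, 1, 1, 3, 0),
     "resolved": datetime(2024, 1, 1, 3, 20)},
    {"rule": "api_success_rate", "triggered": datetime(2024, 1, 1, 4, 0),
     "resolved": datetime(2024, 1, 1, 5, 0)},
]

print(mttr_minutes(INCIDENTS))       # (10 + 20 + 60) / 3 minutes
print(noisiest_rules(INCIDENTS, 1))  # the rule that fired most often
```

Filtering the incident list by team, service, or time range before calling these functions gives the per-dimension view described above.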
No data, no decisions. For companies that are further ahead, operations governance deserves serious investment. Sending a weekly report to show managers the team's effort and results is also important.
References
- PagerDuty official website: https://www.pagerduty.com/
- Flashduty introduction: https://flashcat.cloud/product/flashduty/
