
How to Run On-call Without Burning People Out

Some teams claim to run DevOps and put everyone on On-call, but the work often falls on the most responsible person. Here is how leaders can make On-call fair, sustainable, and useful.

Buffett Ba

I recently had dinner with a friend, let's call him Zhang San. He told me he wanted to resign. He had been On-call every day for a year, carrying a laptop everywhere, and he was exhausted. I asked, "Is your team only you? Why are you the only person on call?" He laughed bitterly: "There are four people including my boss. The boss says the other two are unreliable, so they don't do On-call. He keeps telling me On-call will improve my skills. At year end, my performance rating was average, while the other two developers got better ratings. I'm done." The image below supposedly shows his lonely back while answering alert calls.

[Image: SRE exhausted by On-call]

Clearly, Zhang San was in a bad management situation. Another common case is just as unpleasant: the boss says the team is adopting DevOps culture, "You build it, you run it, you monitor it," and then everybody is nominally On-call. In reality, the responsible person carries the laptop at all times, while less responsible people ignore incidents and later say they were asleep, had no signal, or were in the shower. The final result is that the honest person does most of the work, burns out, resigns, and the boss hires the next person to repeat the cycle.

This is a classic lose-lose:

  • For the responsible person, the result is frustration and resignation.
  • For the team, it exposes poor management, high turnover, and poor stability.

It proves the old line:

Every problem is a management problem.

Blaming management, culture, or the founder does not immediately solve the problem. So let's look at what first-line and second-line leaders can do. If you are an individual contributor, you can at least discuss these ideas with your manager or share this article where they might see it.

I cannot keep doing alert duty every day

How Should On-call Be Done So People Do Not Quit?

1. Rotate the duty.

Nobody wants to be On-call continuously. The duty needs scheduling, rotation, and shared responsibility, and fairness matters. Build a schedule, put everyone on it, and account for temporary swaps and holiday rotations. Excel can work at first, but it becomes cumbersome as swaps and holidays pile up. Tools such as PagerDuty, OpsGenie, and Flashduty all provide schedule management, shift-change notifications, temporary swaps, and holiday scheduling, which greatly reduce management overhead.
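As a rough illustration of what a rotation with temporary swaps involves, here is a minimal Python sketch. The function names, the weekly shift length, and the team members are all assumptions made for the example; it does not reflect how PagerDuty, OpsGenie, or Flashduty implement scheduling.

```python
from datetime import date, timedelta


def build_rotation(members, start, weeks, overrides=None):
    """Return a list of (week_start, responder) pairs, one per weekly shift.

    members   -- ordered list of names; everyone takes a turn
    start     -- date of the first shift
    weeks     -- number of weekly shifts to generate
    overrides -- optional {week_start: name} mapping for temporary swaps
    """
    overrides = overrides or {}
    schedule = []
    for i in range(weeks):
        week_start = start + timedelta(weeks=i)
        # Round-robin by default; a swap for that week takes precedence.
        responder = overrides.get(week_start, members[i % len(members)])
        schedule.append((week_start, responder))
    return schedule


def shift_counts(schedule):
    """Count shifts per person so fairness can be checked at a glance."""
    counts = {}
    for _, responder in schedule:
        counts[responder] = counts.get(responder, 0) + 1
    return counts


team = ["alice", "bob", "carol", "dave"]  # hypothetical team
start = date(2024, 1, 1)
# Temporary swap: bob covers the third week instead of carol.
swap = {start + timedelta(weeks=2): "bob"}
schedule = build_rotation(team, start, weeks=8, overrides=swap)
```

With the swap applied, `shift_counts(schedule)` reports three shifts for bob and one for carol, which makes the imbalance visible so it can be repaid in a later cycle.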

2. Reward On-call work.

If I am on duty this week and my daily pay doubles, motivation changes immediately. Many overseas companies do this with dedicated On-call bonuses or extra time off. People are willing to rotate because the work is recognized, instead of being like Zhang San, who worked for a year without rest. The company pays some cost, but employee efficiency, stability, and happiness improve. That is a win-win.

3. Hold On-call responders accountable.

Accountability must be built on top of rewards. Clear rewards and penalties are the sustainable path; punishment without reward does not work. If the On-call responder deliberately ignores calls or alerts and causes an incident, there should be consequences, such as reduced bonus, performance impact, or even termination. The policy must be fair and widely accepted, or it will not land.

If an alert goes unhandled for too long, it should escalate to the backup responder or directly to a leader, so that every alert is eventually handled. This requires the alert dispatch platform to support escalation. PagerDuty, OpsGenie, and Flashduty all provide escalation mechanisms so alerts are not missed.
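The escalation idea can be sketched as a small policy table. The class name, the targets, and the 15 and 30 minute thresholds below are assumptions chosen for illustration, not defaults from any of the products mentioned.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    target: str         # who is notified at this level
    after_minutes: int  # minutes the alert stays unacknowledged before this level is notified


def who_to_notify(steps, minutes_unacked):
    """Return every target that should have been notified by now."""
    return [s.target for s in steps if minutes_unacked >= s.after_minutes]


# Hypothetical policy: primary first, backup after 15 minutes, lead after 30.
policy = [
    EscalationStep("primary on-call", 0),
    EscalationStep("backup on-call", 15),
    EscalationStep("team lead", 30),
]
```

For example, `who_to_notify(policy, 20)` returns the primary and the backup, and after 30 minutes without acknowledgement the team lead is included as well.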

4. Treat stability risks found during On-call as high priority.

When the On-call engineer discovers a stability risk, the team should discuss it promptly and assign high priority to the fix. As the system becomes more stable, alert volume falls and On-call becomes easier.

5. Use a unified alert convergence and noise-reduction platform.

If a network failure triggers thousands of alert calls, the On-call phone becomes unusable and the responder is overwhelmed. An alert platform should converge alerts and reduce noise: merge similar alerts, suppress low-value alerts, and keep the response manageable. Monitoring systems usually focus on data collection, storage, visualization, and alert generation. They are often weak at downstream convergence, noise reduction, dispatch, and escalation. Consider building a dedicated On-call platform or using a mature product so people can spend time on more valuable work.
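To make "merge similar alerts, suppress low-value alerts" concrete, here is a minimal sketch. It assumes each alert is a dict with `ts` (epoch seconds), `service`, `check`, and a numeric `severity`; the grouping key, the 5 minute window, and the severity cutoff are illustrative choices, and real platforms use richer rules.

```python
def converge(alerts, window_seconds=300, min_severity=2):
    """Merge alerts sharing (service, check) that arrive within a time window.

    Returns a list of incidents, each carrying a count of the alerts merged into it.
    """
    incidents = []
    latest = {}  # (service, check) -> most recent incident for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if alert["severity"] < min_severity:
            continue  # suppress low-value alerts
        key = (alert["service"], alert["check"])
        current = latest.get(key)
        if current is not None and alert["ts"] - current["last_ts"] <= window_seconds:
            # Same problem, still ongoing: fold into the existing incident.
            current["count"] += 1
            current["last_ts"] = alert["ts"]
        else:
            current = {"key": key, "first_ts": alert["ts"],
                       "last_ts": alert["ts"], "count": 1}
            latest[key] = current
            incidents.append(current)
    return incidents


# Hypothetical storm: 1000 ping failures in quick succession plus one low-severity alert.
storm = [{"ts": i, "service": "net", "check": "ping", "severity": 3} for i in range(1000)]
storm.append({"ts": 5, "service": "db", "check": "disk", "severity": 1})
incidents = converge(storm)
```

Here the thousand ping alerts collapse into a single incident and the low-severity disk alert is dropped, so the responder receives one notification instead of a flood.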

6. Provide complete SOP documentation.

Alert handling is stressful, and mistakes are easy to make. Responders need complete SOPs that explain what to do in each situation, which speeds up response and reduces the chance of error. If SOP logic can be turned into scripts, even better: the script can run automatically when the alert fires, without human intervention. This is why Nightingale has an alert self-healing feature.
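One way to picture turning SOPs into scripts is a registry that maps alert names to handler functions, with a fallback to paging a human. This is a generic sketch and not Nightingale's self-healing API; the `runbook` decorator, the `disk_usage_high` alert name, and the alert fields are all hypothetical.

```python
RUNBOOKS = {}  # alert name -> automated SOP function


def runbook(alert_name):
    """Register a function as the automated SOP for one alert name."""
    def register(func):
        RUNBOOKS[alert_name] = func
        return func
    return register


@runbook("disk_usage_high")
def clean_tmp(alert):
    # A real handler would act on the host; this one only describes the step.
    return f"would clean temp files on {alert['host']}"


def handle(alert):
    """Run the automated SOP if one exists, otherwise page a human."""
    action = RUNBOOKS.get(alert["name"])
    if action is None:
        return f"page on-call: no automated SOP for {alert['name']}"
    return action(alert)
```

Alerts without a registered handler still reach the responder, so automation can be added one SOP at a time as each procedure proves safe to run unattended.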

Every new module should provide the basics before launch:

  • Define SLI metrics and configure alert rules.
  • Write SOP documentation and train the team.
  • Define emergency contacts, usually the module owners. If they do not want to be interrupted frequently, they need to do the SLI and SOP work thoroughly.
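As a small example of the first item in the checklist above, an availability SLI and a matching alert condition might look like the sketch below. The choice of availability as the SLI and the 99.9% target are assumptions; each module would pick its own indicators and thresholds.

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests served successfully; 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def should_alert(good_requests, total_requests, slo_target=0.999):
    """Fire when the measured SLI falls below the target."""
    return availability_sli(good_requests, total_requests) < slo_target
```

With these assumed numbers, 9,980 good requests out of 10,000 gives an SLI of 0.998, which is below the target and would fire the alert.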

Leaders should take On-call seriously. Do not burn out the people who are responsible enough to care. If you have better ideas, share them with your team.
