Don't let on-call burn out your best people

I recently had dinner with a friend; let's call him Zhang San. He wanted to resign and had asked me out for a drink. He told me he had been on call every single day for a year, carrying his laptop everywhere, and he was at his breaking point.

I asked: "Is your team just you? Why are you the only one on call?"

He gave me a bitter smile: "There are four of us counting the boss. He says the other two aren't reliable, so they don't do on-call. He keeps telling me it'll sharpen my skills. Come year end, my review was average, while the two who just wrote code in peace scored higher than me. It doesn't sit right. I'm done."

Clearly, Zhang San landed under a bad manager. There's another version that's just as galling. The boss announces the team is adopting DevOps culture, "You build it, you run it, you monitor it," and so on paper everyone is on-call. In practice, the responsible person hauls the laptop around at all hours, while the less responsible ones ignore incidents entirely. Press them on it and the excuses never run out: I was asleep, I had no signal, I was in the shower. The ending is always the same. The team isn't small, but one honest person does the work until they burn out and leave, the boss hires a replacement, and the cycle repeats.

This is a textbook lose-lose:

For the responsible person: chronic resentment, then resignation.
For the team: anyone watching can see weak management, high turnover, and shaky stability.

Which proves the old line:

In the end, every problem is a management problem.

Still, blaming "management," "culture," or the founder personally rarely fixes the problem in front of you. A first- or second-line leader still has levers to pull. If you're an individual contributor, you can at least raise these ideas with your manager. And if that fails, share this article in the group chat and hope the boss reads it.

How to run on-call without driving people away

1. Rotate the duty. Don't let one person carry it.

Nobody wants to do this forever, so you schedule it, rotate it, and share the load. What matters most is fairness. Build a rotation, put everyone on it, and account for the messy parts: ad-hoc swaps and holiday coverage.

A spreadsheet works at first, but it gets painful fast. Shift-change notices depend on someone remembering to shout, and one last-minute swap throws the whole grid off. This is exactly the kind of thing a dedicated tool handles better. Products like PagerDuty, OpsGenie, and Flashduty come with built-in scheduling: automatic rotation, shift-change notifications, temporary swaps, and holiday coverage, which takes the management overhead off your plate.
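To make the rotation idea concrete, here is a minimal sketch of a weekly round-robin with ad-hoc swaps. It is not how any of the products above work internally. The member names, the start date, and the helper names (build_rotation, apply_swaps, shifts_per_person) are all made up for illustration.

```python
from datetime import date, timedelta

def build_rotation(members, start, weeks):
    """Assign one member per week, round-robin, keyed by the week's start date."""
    schedule = {}
    for i in range(weeks):
        week_start = start + timedelta(weeks=i)
        schedule[week_start] = members[i % len(members)]
    return schedule

def apply_swaps(schedule, swaps):
    """Return a new schedule with ad-hoc overrides applied.

    `swaps` maps a week's start date to the person covering that week.
    """
    updated = dict(schedule)
    for week_start, cover in swaps.items():
        if week_start not in updated:
            raise KeyError(f"no shift starts on {week_start}")
        updated[week_start] = cover
    return updated

def shifts_per_person(schedule):
    """Count shifts per person so an imbalance is visible at a glance."""
    counts = {}
    for person in schedule.values():
        counts[person] = counts.get(person, 0) + 1
    return counts

# Hypothetical four-person team, eight weeks starting on a Monday.
members = ["zhang", "li", "wang", "zhao"]
schedule = build_rotation(members, date(2024, 1, 1), weeks=8)

# "wang" covers the week of Jan 8 for "li".
schedule = apply_swaps(schedule, {date(2024, 1, 8): "wang"})
print(shifts_per_person(schedule))
```

The per-person count is the part that matters for fairness: after a swap it shows who now owes a shift, which is exactly what gets lost in a spreadsheet.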

2. Reward the people on call.

If my daily pay doubles the week I'm on duty, my enthusiasm changes instantly. Many companies outside China do exactly this, with dedicated on-call bonuses or extra time off. That's why people are willing to take turns, instead of ending up like Zhang San, who didn't take a single day off in a year.

Yes, the company pays for it. But what it buys is faster response, a more stable system, and most importantly, happier engineers. Any clear-eyed leader can do this math.

3. Pair rewards with accountability, in that order.

Clear rewards and consequences, applied fairly, are the only sustainable path. Punishment without reward isn't management; it's exploitation.

That said, if someone on duty deliberately ignores calls and alerts until an incident results, there have to be consequences too, whether that's a hit to their review, their bonus, or something more serious in egregious cases. The policy has to be credible and broadly accepted, or people won't buy in and it won't stick no matter how hard you push it.

And when an alert really does sit unhandled for too long, the system should escalate it automatically, to the backup or straight to a leader, so every alert eventually has someone backstopping it. That requires an alert distribution platform with escalation built in. PagerDuty, OpsGenie, and Flashduty all support this, so an alert never slips through just because one person dropped the ball.
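As a rough illustration of what an escalation policy boils down to, the sketch below models it as a list of time thresholds. The 10- and 30-minute thresholds, the target names, and the function targets_to_notify are assumptions of mine, not the configuration format of any product mentioned here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    target: str         # who gets notified at this step

# Hypothetical policy: primary first, then backup, then the lead.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(10, "backup on-call"),
    EscalationStep(30, "team lead"),
]

def targets_to_notify(policy, minutes_unacked, acknowledged=False):
    """Return everyone who should have been notified by now.

    Once the alert is acknowledged, escalation stops and the list is empty.
    """
    if acknowledged:
        return []
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [s.target for s in ordered if s.after_minutes <= minutes_unacked]

print(targets_to_notify(POLICY, 5))
print(targets_to_notify(POLICY, 12))
print(targets_to_notify(POLICY, 45, acknowledged=True))
```

The design point is that acknowledgment, not resolution, stops the chain: someone has to say "I have it" or the next person gets pulled in.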

4. Treat risks found on-call as high priority.

When an on-call engineer spots a stability risk, bring it to the team promptly, set a priority, and fix it soon. The more stable the system gets, the fewer alerts fire, and the easier on-call becomes. It's a virtuous cycle. The worst outcome is a known risk left hanging while the same alert fires over and over, grinding the responder's patience down to nothing.

5. Use a platform that converges and de-noises alerts.

When one network failure fires a thousand alert calls in a row, the on-call phone gets so flooded that even a normal call won't go through. People get frustrated, fast.

What you need is a platform that converges alerts and reduces noise: collapse alerts sharing a root cause into one, and suppress the ones that don't matter. Most monitoring systems focus on collection, storage, visualization, and alert generation. They tend to be weak at what comes after the alert: convergence, noise reduction, dispatch, and escalation. Rather than building all of that from scratch, use a mature product and free your people up for more valuable work.
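For a feel of what convergence means mechanically, here is a simplified sketch that groups alerts by a key within a time window and drops severities you have chosen to suppress. Real platforms use much richer rules. The field names (ts, group_key, severity), the 300-second window, and the function converge are assumptions for illustration only.

```python
def converge(alerts, window_seconds=300, drop_severities=frozenset()):
    """Collapse alerts that share a group key into one incident.

    An alert joins an existing incident if it has the same key and arrives
    within `window_seconds` of that incident's first alert. Alerts whose
    severity is in `drop_severities` are suppressed entirely.
    """
    incidents = []
    latest_by_key = {}  # group key -> most recent incident for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if alert["severity"] in drop_severities:
            continue
        key = alert["group_key"]
        incident = latest_by_key.get(key)
        if incident is not None and alert["ts"] - incident["first_ts"] <= window_seconds:
            incident["count"] += 1
        else:
            incident = {"group_key": key, "first_ts": alert["ts"], "count": 1}
            incidents.append(incident)
            latest_by_key[key] = incident
    return incidents

# A hypothetical storm: 1000 alerts from one switch failure, plus one disk alert
# and one low-priority alert that we choose to suppress.
storm = [{"ts": i // 10, "group_key": "network:switch-7", "severity": "P1"}
         for i in range(1000)]
other = [
    {"ts": 50, "group_key": "disk:db-1", "severity": "P2"},
    {"ts": 60, "group_key": "cert:expiring", "severity": "P3"},
]
incidents = converge(storm + other, drop_severities=frozenset({"P3"}))
for incident in incidents:
    print(incident["group_key"], incident["count"])
```

With grouping like this, the thousand calls become one page for the switch and one for the disk, and the responder's phone stays usable.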

6. Equip on-call with solid SOPs.

Handling alerts is stressful, and stress is when mistakes happen. A solid SOP (standard operating procedure) matters here: spell out, situation by situation, exactly what to do step by step. That both speeds up response and lowers the odds of a fat-fingered mistake.

Better still, encode the SOP logic into scripts, so the moment an alert fires the fix runs automatically and nobody has to crawl out of bed to type commands. That's precisely why Nightingale built alert self-healing.
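A minimal sketch of "SOP as code" might look like the following. This is not Nightingale's self-healing API. The rule names, the handlers, and the registry are hypothetical, and the handlers only return a description of the action instead of running real commands.

```python
RUNBOOK = {}  # alert rule name -> function implementing that rule's SOP

def remediation(rule_name):
    """Register a function as the automated SOP for one alert rule."""
    def register(func):
        RUNBOOK[rule_name] = func
        return func
    return register

@remediation("disk_usage_high")
def clean_old_logs(alert):
    # Placeholder: a real handler would run the SOP's actual commands.
    return f"rotate and compress logs on {alert['host']}"

@remediation("service_down")
def restart_service(alert):
    # Placeholder: a real handler would restart the service and verify health.
    return f"restart {alert['service']} on {alert['host']}"

def handle(alert):
    """Run the registered SOP for an alert, or fall back to paging a human."""
    handler = RUNBOOK.get(alert["rule"])
    if handler is None:
        return ("page", f"no automated SOP for {alert['rule']}")
    return ("auto", handler(alert))

print(handle({"rule": "service_down", "service": "api", "host": "web-1"}))
print(handle({"rule": "replication_lag", "host": "db-2"}))
```

The fallback branch is deliberate: anything without a scripted fix still reaches a person, so automation narrows the pager load without hiding alerts.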

Before any new module ships, line up a few basics:

Define the SLIs (service level indicators) and configure the alert rules.
Write the SOPs and train the team on them.
Name an emergency contact, usually the module owner. And if that owner doesn't want to be paged constantly, that's all the more reason to get the SLIs and SOPs right, or they're simply asking for pain.

Anti-patterns worth checking

Here are the on-call anti-patterns I've watched play out over the years. Count how many describe your team.

Severity levels are a mess, so nothing can be ignored

Nobody thought carefully about severity when writing the rules, so everything blurs into one pile. No alert feels safe to skip, which means real high-priority issues get buried while low-priority ones eat your attention.

Keep it to three levels: P1, P2, and P3. More than that adds cognitive load when you're configuring rules, so people pick one carelessly and the whole scheme gets muddier, not clearer. The three levels:

P1: the most severe, usually something hitting a core user-facing flow. Handle it immediately, via high-interrupt channels like a phone call or SMS.
P2: affects non-core functionality, or will escalate into an incident if not handled soon. Notify via medium-interrupt channels like SMS or chat.
P3: usually no user impact, but carries latent risk (let it sit a week and it might bite). No need to act immediately. A once-a-day review before you log off is enough, over a low-interrupt channel like email, or no notification at all.
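The three levels above map naturally onto a small routing table. Here is a sketch, with the channel names taken from the descriptions above. The notify_plan function and the "daily digest" flag are my own illustrative additions.

```python
from enum import Enum

class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"

# Channels per level, following the three descriptions above.
CHANNELS = {
    Severity.P1: ["phone", "sms"],  # high-interrupt
    Severity.P2: ["sms", "chat"],   # medium-interrupt
    Severity.P3: ["email"],         # low-interrupt
}

def notify_plan(severity):
    """Return how an alert of this severity should reach the responder."""
    return {
        "channels": CHANNELS[severity],
        # P3 waits for the once-a-day review; P1 and P2 go out immediately.
        "daily_digest": severity is Severity.P3,
    }

for level in Severity:
    print(level.value, notify_plan(level))
```

Keeping the table this small is the point: when every rule must pick one of three rows, there is little room to pick carelessly.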

Over-reliance on individual experience, with no standard process

Incident handling runs entirely on a few senior people's intuition, with the knowledge passed down by word of mouth. The moment the team changes, problems surface. And knowledge that lives only in someone's head is the first thing to go when nerves take over in an emergency.

The fix is to capture that experience as detailed SOPs that a newcomer can follow. Writing them isn't enough, though. Run drills to keep the SOPs fresh, because an SOP left untouched for a long time is probably already stale. Catch the gaps during the drill and fix them so it stays useful.

Postmortems as box-checking

The issue gets resolved and everyone moves on without a postmortem, so the same pitfall gets stepped on again. Or the postmortem is so casual that it only treats the symptom in front of it.

A few rounds of asking "why" can add depth, but don't take it to the extreme. Push far enough and the answer always lands on organizational or cultural problems, which a front-line team can't solve anyway. What matters more is generalizing: one of the biggest gaps between a strong engineer and an average one is the ability to go from a single incident to a whole class of problems.

On-call load distributed unevenly

The whole team takes alerts in name only. In reality one honest person takes most of them, which is the heart of Zhang San's story. Concentrate the pressure on a few people and burnout and resignation are only a matter of time.

Beyond the fair rotation and compensation discussed above, set up a primary/backup on-call structure so no single person becomes a load-bearing wall.
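One way to check whether the load is really shared is to look at who actually acknowledges alerts. The sketch below assumes you can export an acknowledgment log with an "acked_by" field. That field name, the 2x threshold, and both function names are assumptions, not something any particular tool provides.

```python
from collections import Counter

def load_share(ack_log):
    """Fraction of acknowledged alerts handled by each person."""
    counts = Counter(entry["acked_by"] for entry in ack_log)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {person: n / total for person, n in counts.items()}

def overloaded(ack_log, team_size, factor=2.0):
    """People whose share exceeds `factor` times an even split of the team."""
    fair_share = 1 / team_size
    shares = load_share(ack_log)
    return sorted(p for p, share in shares.items() if share > factor * fair_share)

# Hypothetical month for a four-person team: one person takes almost everything.
ack_log = (
    [{"acked_by": "zhang"}] * 18
    + [{"acked_by": "li"}]
    + [{"acked_by": "wang"}]
)
print(load_share(ack_log))
print(overloaded(ack_log, team_size=4))
```

A number like this turns "it feels unfair" into something a manager can act on in a review.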

A word to newcomers, too: don't be too quick to dodge on-call. In my experience it's a great way to grow. But if you need to make a change while on duty, get guidance from a senior engineer first rather than improvising, or a small problem turns into a big one. A lightweight approval step helps. Follow the process, and if something goes wrong, the responsibility is clear.

Neglecting tooling and automation

Incident handling is all manual, which is slow and error-prone. Senior engineers in particular often get comfortable with the manual routine and can't be bothered to learn new tools or build automation.

Invest where it counts: stand up the monitoring, alerting, and automation tooling to cut down manual steps, like scripts that roll a service back or fail over to another cluster fast. A one-click mitigation tool is essentially crystallized experience, and far more reliable than an SOP doc that can only be read.

No real communication or escalation mechanism

When an incident hits, information doesn't flow. The people who should be pulled in aren't, the business side and the leaders aren't told, and progress updates go out to no one. The result is delayed response and badly allocated effort.

Build a clear communication flow (a war-room mechanism, for instance) and an escalation path so information reaches the right people quickly and gets escalated by severity. The Google SRE book makes this point: external communication during an incident is a critical role, often owned by a dedicated person. Someone coordinates, someone fixes, someone communicates. With that division of labor, the response doesn't descend into chaos.

Start before the best person leaves

Leaders, take on-call seriously before the most reliable engineer burns out. Start with one fair rotation, one clear reward rule, one escalation path, and one place where alerts converge. That is enough to stop the worst pattern: one person carrying the team until they quit. Flashduty can tie scheduling, rewards and accountability, escalation, convergence, and SOPs into one on-call flow.