
Bad Alert On-call Practices: How Many Are You Doing?

Good alert On-call practice comes down to fast response, efficient collaboration, and continuous improvement. Avoiding the bad practices below improves incident handling and reduces responder stress.

Flashcat Operations Team

On-call

Several bad practices are common in alert On-call work. They reduce incident-handling efficiency, hurt collaboration, and weaken system stability. Here are the typical anti-patterns and how to fix them.

No Clear Alert Severity or Response Mechanism

If alert severity is not designed carefully, responders cannot tell which alerts matter. Teams become afraid to ignore alerts at any level, so high-priority issues get delayed while low-priority issues consume too many resources.

Recommendations:

  • Use three levels such as P1, P2, and P3. Too many levels add cognitive load when configuring rules, and people end up picking levels arbitrarily.
  • P1 is the highest and most serious level. It usually affects the user's core path and requires immediate handling through disruptive channels such as phone and SMS.
  • P2 affects non-core features or may become an incident if not handled soon. Use medium-disruption channels such as SMS or IM.
  • P3 usually does not affect users directly but carries potential risk. It does not need immediate handling; reviewing it before the end of each day is often enough. Use no notification or low-disruption channels such as email.
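The three levels above can be sketched as a small routing table. This is a minimal illustration, not a prescribed implementation: the `classify` criteria, the function names, and the channel strings are assumptions chosen to mirror the descriptions above, and a real setup would map them to whatever the alerting tool supports.

```python
from enum import Enum


class Severity(Enum):
    P1 = 1  # core user path affected, handle immediately
    P2 = 2  # non-core feature affected, or likely to become an incident
    P3 = 3  # no direct user impact, potential risk only


# Hypothetical channel names, ordered from most to least disruptive.
CHANNELS = {
    Severity.P1: ["phone", "sms"],
    Severity.P2: ["sms", "im"],
    Severity.P3: ["email"],
}


def classify(affects_core_path: bool, affects_users: bool, may_escalate: bool) -> Severity:
    """Pick a severity from three yes/no questions (an assumed rubric)."""
    if affects_core_path:
        return Severity.P1
    if affects_users or may_escalate:
        return Severity.P2
    return Severity.P3


def route(severity: Severity) -> list:
    """Return the notification channels for a severity level."""
    return CHANNELS[severity]
```

Keeping the rubric to a few yes/no questions makes it harder to pick a level arbitrarily when writing a new alert rule.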

Over-Reliance on Individual Experience and No Standard Process

Incident handling sometimes depends entirely on the intuition and experience of a few senior members, with no standardized SOP. Knowledge is passed on by word of mouth, which becomes risky when the team changes. In emergencies, relying on memory also increases stress and mistakes.

Recommendations:

  • Build detailed incident-handling playbooks and SOPs so new team members can get started quickly. Update them regularly for new incident patterns.
  • Run drills to keep SOPs fresh. SOPs that go unmaintained for a long time are often outdated; drills reveal the gaps and help keep procedures accurate.
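One way to keep drills from slipping is to track when each SOP was last exercised and flag the overdue ones. The sketch below assumes a simple mapping from SOP name to last drill date; the function name, the SOP names in the usage example, and the 90-day threshold are all illustrative assumptions.

```python
from datetime import date


def stale_sops(last_drilled: dict, today: date, max_age_days: int = 90) -> list:
    """Return the names of SOPs whose last drill is older than max_age_days."""
    return sorted(
        name
        for name, drilled_on in last_drilled.items()
        if (today - drilled_on).days > max_age_days
    )
```

For example, `stale_sops({"db-failover": date(2024, 1, 1), "cache-flush": date(2024, 5, 1)}, date(2024, 6, 1))` flags only `db-failover`, which was last drilled more than 90 days earlier.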

Ignoring Post-Incident Review

If no review happens after an incident, similar problems will repeat. If the review is too casual, the team only treats symptoms.

Recommendations:

  • Ask "why" several times to add depth. Do not go so deep that every answer becomes "organization" or "culture," because a frontline team cannot directly fix that.
  • Generalize from each incident. After fixing one problem, look for the same kind of risk elsewhere in the system. This ability to extend one incident to similar risks is one of the biggest differences between excellent engineers and average engineers.

Uneven On-call Pressure

On-call work often concentrates on a few people, causing fatigue and, eventually, resignations. Officially, the whole team receives alerts; in reality, one responsible person handles most of them.

Recommendations:

  • Create a fair rotation so every member participates in On-call, and provide proper compensation such as time off or bonuses.
  • Set primary and backup On-call roles to avoid single-person pressure. Dedicated On-call tools can help; Flashduty includes built-in schedule management.
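A fair rotation with primary and backup roles can be sketched as a simple round-robin. This is an assumed, minimal scheme with weekly shifts where the next person in line acts as backup; it is not how Flashduty or any specific tool builds schedules, and the member names in the usage example are placeholders.

```python
from datetime import date, timedelta


def build_rotation(members: list, start: date, weeks: int) -> list:
    """Assign a primary and a backup for each week, round-robin."""
    if len(members) < 2:
        raise ValueError("need at least two members for primary and backup")
    schedule = []
    for i in range(weeks):
        schedule.append({
            "week_start": start + timedelta(weeks=i),
            "primary": members[i % len(members)],
            # The backup is next week's primary, so they are already warmed up.
            "backup": members[(i + 1) % len(members)],
        })
    return schedule
```

With `build_rotation(["ana", "bo", "chen"], date(2024, 1, 1), 6)`, each member is primary for two weeks and backup for two weeks, so the load is spread evenly.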

New team members should not avoid On-call altogether. In my experience, On-call is a good way to grow. But if a new engineer needs to operate production systems, they must get guidance from senior engineers and avoid risky actions. An approval flow helps protect both the system and the newcomer.

Ignoring Tools and Automation

If incident handling depends on manual operations, it is slow and error-prone. Some experienced engineers are familiar with manual workflows and may resist new tools or automation.

Recommendations:

  • Invest in monitoring, alerting, and automation tools. For example, scripts can quickly roll back services or switch clusters.
  • One-click mitigation tools essentially encode accumulated experience as executable steps, and they are often more reliable than SOP documents alone.
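As a rough sketch of what a one-click mitigation tool might look like, the example below registers named actions and runs them with a dry-run default. Everything here is hypothetical: the action names, the `run` helper, and the placeholder bodies, which return a message instead of calling a real deployment or traffic system.

```python
from typing import Callable, Dict

# Registry of named mitigation actions (hypothetical).
MITIGATIONS: Dict[str, Callable[[str], str]] = {}


def mitigation(name: str):
    """Decorator that registers a function as a named mitigation."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        MITIGATIONS[name] = func
        return func
    return register


@mitigation("rollback")
def rollback(service: str) -> str:
    # Placeholder: a real tool would call the deployment system here.
    return f"rolled back {service} to previous release"


@mitigation("switch-cluster")
def switch_cluster(service: str) -> str:
    # Placeholder: a real tool would update traffic routing here.
    return f"switched {service} traffic to standby cluster"


def run(name: str, service: str, dry_run: bool = True) -> str:
    """Run a registered mitigation; dry-run by default to avoid accidents."""
    if name not in MITIGATIONS:
        raise KeyError(f"unknown mitigation: {name}")
    if dry_run:
        return f"[dry-run] would run {name} on {service}"
    return MITIGATIONS[name](service)
```

Defaulting to dry-run is a design choice: under incident stress, the responder must opt in explicitly before anything touches production.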

No Effective Communication or Escalation Mechanism

During incidents, poor information flow delays response and resource coordination. Relevant people may not be involved quickly, business teams and leaders may not be notified, and progress updates may be missing.

Recommendations:

  • Establish a clear communication process, such as a War Room, and a clear escalation path. Make sure incident information reaches relevant people quickly and escalates by severity.
  • The Google SRE book emphasizes external communication as a key part of incident response. During an incident, one person coordinates, one or more people handle the technical work, and another person handles communication, so the work is split across roles instead of falling on one person.
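An escalation path that depends on severity can be written down as data, which makes it easy to review and test. The sketch below is an assumption for illustration: the time thresholds, the role names, and the `who_to_notify` helper are not taken from any specific tool or from the Google SRE book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    notify: str         # who gets pulled in at that point


# Hypothetical policy: higher severity escalates faster and further.
POLICY = {
    "P1": [
        EscalationStep(0, "primary on-call"),
        EscalationStep(5, "backup on-call"),
        EscalationStep(15, "engineering manager"),
    ],
    "P2": [
        EscalationStep(0, "primary on-call"),
        EscalationStep(30, "backup on-call"),
    ],
    "P3": [
        EscalationStep(0, "primary on-call"),
    ],
}


def who_to_notify(severity: str, minutes_unacknowledged: int) -> list:
    """Return everyone who should have been notified by now."""
    return [
        step.notify
        for step in POLICY[severity]
        if step.after_minutes <= minutes_unacknowledged
    ]
```

Under this assumed policy, a P1 alert left unacknowledged for six minutes has reached both the primary and the backup, while a P3 alert never goes beyond the primary.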

Summary

The core of On-call is fast response, efficient collaboration, and continuous improvement. By avoiding these bad practices, teams can improve incident-handling efficiency, reduce system risk, and lower On-call pressure.
