
Efficient On-call: From Concept to Practice

On-call is an important mechanism for operations teams to protect business continuity. This article covers why schedules matter, how to design them, how to use Flashduty schedules, and a practical example.

Flashcat

Imagine running a company without any duty or on-call system. For operations and reliability teams, service failures could create several problems:

  • No clear owner, or constant dependence on the same person while everyone else stays on the sidelines.
  • No backup plan when key people take leave.
  • Long response times that hurt business continuity.
  • Uneven workload across team members.

These are exactly the problems a healthy schedule system should prevent. A well-designed on-call system is essential for stable business operations.

1. Why You Need a Schedule System

  1. Protect business continuity
  • Respond quickly to outages, downtime, performance degradation, and other unexpected issues.
  • Reduce business loss caused by system failures.
  • Meet compliance requirements in industries such as internet, finance, and healthcare.
  2. Improve team collaboration
  • Clarify responsibility and avoid finger-pointing.
  • Establish a standardized issue-handling process.
  • Promote knowledge sharing and experience accumulation.
  3. Balance workload
  • Avoid overloading a small number of people.
  • Allocate working time and rest time reasonably.
  • Provide On-call allowance or compensatory time off.
  4. Improve customer satisfaction
  • Shorten response and resolution time.
  • Provide stable and reliable service.
  • Build customer trust.

A reasonable schedule system not only reduces individual burden, but also improves team cohesion and long-term operating efficiency.

2. Core Elements of an Efficient Schedule System

Designing an efficient schedule system requires several considerations:

  1. Reasonable rotation
  • Fairness: distribute shifts evenly so that no one is always on duty on weekends or holidays.
  • Flexibility: allow planned or temporary shift swaps for leave and personal needs.
  • Rotation cycle: choose hourly, daily, weekly, or monthly rotations based on business needs.
  2. Connect with the alert platform
  • Match alerts to the best responder based on alert type, severity, and required expertise, instead of notifying everyone.
  • Set different priorities for different alert types so urgent issues get immediate attention and non-critical issues can be handled at a suitable time.
  3. Define on-call roles

Schedules often include primary and backup roles:

  • The primary responder is the first responder and handles all routine tasks and issues during the shift.
  • If the primary responder is temporarily unavailable, the backup takes over so the schedule has no gaps.
  • If the primary responder is overloaded or does not respond, the issue escalates to the backup.
  4. Scheduling and notification
  • Scheduling system: use an on-call management tool to reduce errors from manual scheduling.
  • Instant messaging: integrate with IM tools such as Feishu and DingTalk so information is delivered quickly and accurately.
  • Multi-channel notification: use email, SMS, in-app notifications, and other channels to reach the right people.
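
The routing and role ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names, people, severity labels, and channel mapping are all hypothetical, and the escalation rule (add the backup only for unacknowledged critical alerts) is one possible policy, not how any particular platform behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str  # assumed labels: "critical", "warning", "info"


@dataclass(frozen=True)
class Shift:
    primary: str
    backup: str


# Hypothetical ownership table: which shift covers which service.
SHIFTS = {
    "orders": Shift(primary="alice", backup="bob"),
    "payments": Shift(primary="carol", backup="dave"),
}

# Hypothetical channel policy: urgent alerts use intrusive channels,
# lower severities use quieter ones.
CHANNELS = {
    "critical": ["voice", "sms", "im"],
    "warning": ["im"],
    "info": ["email"],
}


def notification_plan(alert: Alert, primary_acknowledged: bool) -> list[tuple[str, str]]:
    """Return (person, channel) pairs for one alert.

    The primary is always notified. The backup is added only when the
    alert is critical and the primary has not acknowledged it.
    """
    shift = SHIFTS[alert.service]
    channels = CHANNELS[alert.severity]
    plan = [(shift.primary, channel) for channel in channels]
    if alert.severity == "critical" and not primary_acknowledged:
        plan += [(shift.backup, channel) for channel in channels]
    return plan
```

Only the owning shift is notified, which matches the goal of reaching the right responder instead of everyone.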

3. Schedule Practice in Flashduty

This section explains how to create a schedule in Flashduty and introduces several core concepts. See the documentation for more.

  1. Create a schedule
  • Managing team: the team that owns the schedule and has read/write permission.
  • Shift-change notification
    • Advance notification: notify the next responder N minutes before shift handoff.
    • Scheduled notification: notify at fixed times, such as every day at 8:00.
  • Notification channels
    • Direct message: one-to-one push, such as email, SMS, voice, or some IM apps.
    • Group chat: group messages, including Feishu, DingTalk app cards, and Webhook bots.

[Figure: Create schedule]
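
To make the two shift-change notification modes concrete, here is a minimal Python sketch of how the send times could be computed. The function names are hypothetical and this is not Flashduty's implementation; it only illustrates "N minutes before handoff" and "every day at a fixed time".

```python
from datetime import datetime, timedelta


def advance_notice_time(handoff: datetime, minutes_before: int) -> datetime:
    """Advance notification: fire N minutes before the shift handoff."""
    return handoff - timedelta(minutes=minutes_before)


def next_fixed_notice(now: datetime, hour: int, minute: int = 0) -> datetime:
    """Scheduled notification: next occurrence of a fixed daily time.

    If today's slot has already passed (or is exactly now), the next
    occurrence is tomorrow.
    """
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

For example, a 09:00 handoff with a 30-minute advance notice produces a notification at 08:30.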

  2. Configure schedule rules
  • Fair rotation: give every member a turn across different periods and rounds, so that no one is always on duty on holidays.
  • Date mask: rotate only on selected dates, such as Monday to Friday but not weekends.
  • On-call time: configure active periods by day, week, or month, such as all-day duty or 08:00-18:00.
  • Responders: the members participating in the rule. You can configure roles, multiple people per group, and multiple rotating groups.

[Figure: Configure schedule rules]
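
The interaction between a date mask, an on-call time window, and rotation can be sketched as follows. The responder names, the Monday-to-Friday mask, the 08:00-18:00 window, and the rotation start date are all assumptions for illustration; the rule of advancing the rotation only on active days is one reasonable reading of "fair rotation", not a description of Flashduty's internals.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional

RESPONDERS = ["alice", "bob", "carol"]  # hypothetical members
ACTIVE_WEEKDAYS = {0, 1, 2, 3, 4}       # date mask: Monday (0) to Friday (4)
START, END = time(8, 0), time(18, 0)    # on-call time: 08:00-18:00
ROTATION_START = date(2024, 1, 1)       # assumed first day of the rotation (a Monday)


def on_call(at: datetime) -> Optional[str]:
    """Return who is on call at `at`, or None outside the rule's coverage."""
    if at.date() < ROTATION_START:
        return None
    if at.weekday() not in ACTIVE_WEEKDAYS:
        return None  # masked out by the date mask
    if not (START <= at.time() < END):
        return None  # outside the daily on-call window
    # Count only active days since the rotation started, so masked days
    # (weekends here) do not consume anyone's turn.
    elapsed_days = (at.date() - ROTATION_START).days
    active_days = sum(
        1
        for offset in range(elapsed_days)
        if (ROTATION_START + timedelta(days=offset)).weekday() in ACTIVE_WEEKDAYS
    )
    return RESPONDERS[active_days % len(RESPONDERS)]
```

With three responders and five active days per week, each person's weekdays shift from week to week, so the same person does not always cover Fridays.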

  3. Temporary shift swaps

Temporary swaps are used when a responder takes leave or has a short-term absence and another member covers the shift.

[Figure: Temporary shift swap]
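
A temporary swap can be modeled as an override layered on top of the regular schedule. The sketch below assumes a hypothetical `Override` record and a half-open time range; it shows the idea only and does not reflect how Flashduty stores swaps.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Override:
    start: datetime   # inclusive
    end: datetime     # exclusive
    original: str     # responder who is absent
    substitute: str   # responder covering the shift


def apply_overrides(scheduled: str, at: datetime, overrides: list[Override]) -> str:
    """Return the effective responder at `at`.

    The regular schedule decides `scheduled`; the first override that
    matches both the person and the time replaces them.
    """
    for override in overrides:
        if override.original == scheduled and override.start <= at < override.end:
            return override.substitute
    return scheduled
```

Keeping overrides separate from the rule means the regular rotation resumes automatically once the swap window ends.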

  4. Preview schedules

You can preview the schedule in a one-week, two-week, or calendar view.

[Figure: Schedule preview]

4. Example

  1. Background

An internet company's operations team maintains all online services. Because of the order business, the team must provide 24/7 online support. If an important incident is not handled in time, it must be escalated to the business owners.

  2. Alert handling flow
  • Response mechanism

| Response team | Notification method | Order business responders | Notes |
|---|---|---|---|
| L1 | Voice, SMS, Feishu group | Daytime 09:00-23:00: Zhang San. Night 23:00-09:00: Li Si, Wang Wu, Xiao Liu, rotating weekly on Monday at 23:00. | Requires dispatch-policy escalation to implement. |
| L2 | Voice: development owner / reliability owner / operations owner | 24/7: A, B, C, notify all | |
| L3 | Voice: emergency response representative | 24/7: X | |

  • Notification / escalation flow

[Figure: Notification and escalation flow]

  3. Implementation
  • Create three schedules: L1, L2, and L3, corresponding to notification targets at different response stages.
  • Configure schedule rules according to the order team's actual operating model.
  • Configure three stages in the dispatch policy, with each stage notifying the corresponding schedule.

L1 schedule rule:

[Figure: L1 schedule rule]

L2 schedule rule:

[Figure: L2 schedule rule]

L3 schedule rule:

[Figure: L3 schedule rule]
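
As a rough illustration of the example, the L1 rule and the three dispatch stages can be expressed in Python. The responder names and time windows come from the table in this example; the rotation anchor date and the function names are assumptions, and escalation timeouts are omitted because the article does not state them.

```python
from datetime import datetime, time, timedelta

NIGHT_ROTATION = ["Li Si", "Wang Wu", "Xiao Liu"]
# Assumption: Li Si starts the first night week on this Monday at 23:00.
ROTATION_ANCHOR = datetime(2024, 1, 1, 23, 0)


def l1_on_call(at: datetime) -> str:
    """L1 schedule: fixed daytime responder, weekly rotating night responder."""
    if time(9, 0) <= at.time() < time(23, 0):
        return "Zhang San"
    # Night shift: whole weeks since the anchor decide whose turn it is.
    # Floor division keeps times before the anchor consistent as well.
    weeks = (at - ROTATION_ANCHOR) // timedelta(weeks=1)
    return NIGHT_ROTATION[weeks % len(NIGHT_ROTATION)]


def stage_targets(stage: int, at: datetime) -> list[str]:
    """Return who a dispatch-policy stage notifies at time `at`."""
    if stage == 1:
        return [l1_on_call(at)]   # L1: current on-call responder
    if stage == 2:
        return ["A", "B", "C"]    # L2: notify all owners
    if stage == 3:
        return ["X"]              # L3: emergency response representative
    raise ValueError(f"unknown stage: {stage}")
```

Because the night rotation switches on Monday at 23:00, the responder who starts Sunday night stays on call until Monday 09:00, and the next person takes over that evening.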

5. Summary

A well-designed schedule system is an important foundation for efficient operations, especially for teams responsible for service reliability. Connecting schedules with alert systems and designing a reasonable escalation flow helps ensure that alerts reach the right people and that critical alerts are not missed. With careful design and execution, teams can significantly improve business continuity and service quality. For companies considering or already practicing On-call, improving the schedule system is a continuous, long-term effort. We hope this article provides useful reference and inspiration.
