How a Leading SaaS Company Like Datadog Runs OnCall

Flashcat Operations Team

Introduction

Datadog is a leading player in monitoring and observability, with a market capitalization in the tens of billions of dollars and many SaaS customers. Its services have very high stability and availability requirements, and its OnCall practice is mature. This article introduces Datadog's approach to OnCall and what teams can learn from it.

What Goes Wrong Without an OnCall Rotation

  • The work often falls to the most responsible person on the team. That person is woken repeatedly by phone calls at night, becomes exhausted over time, has no time left for higher-value work, gets poor performance reviews as a result, and sees their health decline.
  • The responder is tense every day. Efficiency drops, mistakes while on duty become more likely, and larger incidents can follow. Service availability suffers, and the operations team, or even the whole engineering team, takes the blame.
  • The boundary between work and life disappears. Loyalty drops and attrition rises. The company then has to hire another reliable person, but in a small professional circle, word spreads quickly about how the previous responder was treated, and recruiting becomes harder.

In practice, ignoring OnCall is essentially ignoring stability.

Datadog's OnCall Rotation Model

  • Datadog's scheduling accounts for workload, holiday plans, and temporary shift swaps together. A duty group usually has six to eight people rotating through shifts. Even a team that is short on people should keep at least three to four people on the schedule. Each shift usually lasts eight to twelve hours.
  • Engineers should not be on duty too frequently, or burnout becomes likely. But they should not avoid duty entirely either, because then they lose the incentive to improve the OnCall process.
  • During a shift, engineers focus only on OnCall-related work, such as receiving alerts, performing inspections, maintaining alert rules, updating dashboards, and improving SOPs. They do not develop new features.
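
The rotation rules above can be sketched in a few lines of Python. This is an illustration only, not Datadog's or Flashduty's scheduling logic: the names `Shift`, `build_rotation`, and `swap` are hypothetical, the team is made up, and the checks simply encode the figures quoted above (at least three people, shifts of eight to twelve hours).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import cycle, islice

MIN_GROUP_SIZE = 3  # lower bound quoted in the article (three to four people)


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime
    end: datetime


def build_rotation(engineers, start, shift_hours, shift_count):
    """Assign consecutive shifts to engineers in round-robin order."""
    if len(engineers) < MIN_GROUP_SIZE:
        raise ValueError(
            f"need at least {MIN_GROUP_SIZE} engineers, got {len(engineers)}"
        )
    if not 8 <= shift_hours <= 12:
        raise ValueError("shift length should be between 8 and 12 hours")
    length = timedelta(hours=shift_hours)
    shifts = []
    for i, engineer in enumerate(islice(cycle(engineers), shift_count)):
        shift_start = start + i * length
        shifts.append(Shift(engineer, shift_start, shift_start + length))
    return shifts


def swap(shifts, index_a, index_b):
    """Return a new schedule with the engineers of two shifts exchanged."""
    swapped = list(shifts)
    a, b = swapped[index_a], swapped[index_b]
    swapped[index_a] = Shift(b.engineer, a.start, a.end)
    swapped[index_b] = Shift(a.engineer, b.start, b.end)
    return swapped


# Hypothetical six-person duty group on 12-hour shifts.
team = ["ana", "bo", "chen", "dev", "eli", "fay"]
schedule = build_rotation(team, datetime(2024, 1, 1, 8), shift_hours=12, shift_count=8)
schedule = swap(schedule, 0, 3)  # ana and dev trade shifts
for s in schedule:
    print(f"{s.start:%a %H:%M} - {s.end:%a %H:%M}  {s.engineer}")
```

A real scheduler would also need to handle holidays, time zones, and fairness across weekends, which this sketch leaves out.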

The image above shows Datadog's OnCall schedule, which is already fairly complex. In practice, teams in China often have even more varied and intricate scheduling requirements.

Supporting OnCall Responders

  • Training. Putting people on duty without training or reference material is not a workable approach.
  • Tools. Datadog naturally uses its own OnCall tool. Teams in China can use Flashduty. It supports schedule configuration and swaps, receives alerts from all monitoring systems in one platform, and supports acknowledgment, claiming, escalation, and dispatch. It also provides a mobile app and deep IM integrations.
  • Backup. Each shift should have a primary and a backup responder. If the primary responder misses the alert, the system automatically notifies the backup. If that still does not work, the alert escalates to the direct manager.
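
The primary, backup, and manager chain described above amounts to paging responders in order until one acknowledges. The sketch below is a simplified model under stated assumptions: `Responder` and `escalate` are hypothetical names, and the `acknowledges` flag stands in for the acknowledgment timeout that a real paging tool would wait on. It does not reflect the API of any actual product.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    name: str
    role: str
    acknowledges: bool  # simulated: whether this person answers the page


def escalate(chain):
    """Page each responder in order and stop at the first acknowledgment.

    Returns (responder_or_None, names_paged_so_far).
    """
    paged = []
    for responder in chain:
        paged.append(responder.name)
        if responder.acknowledges:
            return responder, paged
    return None, paged  # nobody answered; the alert is still unowned


# Hypothetical shift: the primary misses the page, the backup answers.
chain = [
    Responder("ana", "primary", acknowledges=False),
    Responder("bo", "backup", acknowledges=True),
    Responder("chen", "direct manager", acknowledges=True),
]
owner, paged = escalate(chain)
print(f"paged {paged}, acknowledged by {owner.name} ({owner.role})")
```

Returning the list of people paged, not just the final owner, makes it easy to review afterwards how often alerts go past the primary responder.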

Direct Manager Participation

Datadog has direct managers participate in OnCall. This sets an example for the team and, more importantly, lets managers feel the pressure and pain of frontline response. Only then can they better optimize OnCall processes and tools.

Many managers on teams in China stop doing frontline work once they are promoted. Over time, they lose touch with what frontline engineers actually experience, and the OnCall process deteriorates.

Finally, we recommend our own OnCall product. You are welcome to register and try it for free:

https://flashcat.cloud/product/flashduty/
