Google SRE's OnCall Methods and Tools

Kubernetes is the open source implementation of Google's internal Borg container orchestration system, and Prometheus is the open source counterpart of Borgmon. Does Google's internal OnCall tool, Outalator, have comparable products available today?

Flashcat

In Google's SRE book, Site Reliability Engineering: How Google Runs Production Systems, key members of Google's SRE organization spend almost three full chapters explaining how Google runs OnCall.

One of the best-known ideas in Google SRE practice is reducing toil: using software engineering to solve operations problems. In practical terms, Google SRE has a clear and public goal: keep toil below 50% of each SRE's working time. SREs should spend at least 50% of their time on engineering projects that reduce future toil or add new capabilities to services.

Google SRE teams believe that too much toil leads to several negative outcomes:

  1. Career stagnation. If too little time is spent on engineering projects, career growth slows or stalls. Google does reward people who do unpleasant but necessary work when that work is unavoidable and has a large positive impact, but nobody can build a career by doing only repetitive manual work.
  2. Low morale. Everyone has a different tolerance for toil, but everyone has a limit. Excessive toil leads to burnout, boredom, and dissatisfaction.
  3. Confusion. Google works hard to make sure every SRE, and everyone who works with SREs, understands that SRE is an engineering organization. Excessive toil can undermine that identity.
  4. Slower progress. Too much toil reduces team productivity. If an SRE team is too busy with manual work and firefighting, new feature delivery slows down.
  5. Bad precedent. If SREs are too willing to absorb toil, development teams are more likely to add more of it, and may even shift operations work that should stay with development teams onto SRE. Other teams then begin to expect SRE to take this work, which is unhealthy.
  6. Attrition. Even if one person does not object to toil, current or future teammates may feel very differently. Too much toil encourages the team's strongest engineers to look elsewhere for more valuable work.
  7. Broken expectations. New hires who joined for engineering projects, and experienced engineers who transferred into SRE, may feel misled. That is damaging to morale.

According to the SRE book, the largest source of toil is interrupt-driven work, and the next largest is OnCall. The former usually refers to non-urgent service-related interruptions, while the latter covers urgent incident response. At Google, an SRE team generally needs at least six to eight people to keep OnCall toil below 30%.

This gives us a glimpse into Google SRE's way of working. It is not easy to copy, and not every organization has the conditions to do so, because it requires culture, mechanisms, and tools to work together. Given how operations work in many Chinese companies today, there are practical obstacles at each of these levels.

Culture

At the cultural level, Google SRE advocates a people-centered approach, attention to people's development, and a focus on long-term outcomes. In many Chinese companies, where overtime culture is common, the reality of IT operations often looks different:

  1. Engineers struggle to balance work and life. With no clear boundary between work hours and personal time, OnCall duty ends up with no rotation at all: the same people respond 24/7.
  2. Work planning is driven by short-term goals. Without long-term thinking, engineers spend every day on tactical work and toil, with little time to solve recurring problems through software engineering. Over time, this becomes technical debt.
  3. Mid-career insecurity is common among IT workers. Fundamentally, personal development is not valued enough. Because new people keep entering the IT industry, many companies choose to replace people rather than develop them. Under that model, serious toil reduction is hard to sustain.

Mechanisms

At the mechanism level, Google SRE explicitly enforces the rule that toil should not exceed 50%. It also keeps an independent SRE team at a minimum size of six people to support OnCall rotation, and provides additional compensation for OnCall work outside normal hours.

This is difficult for many Chinese companies. In many organizations, the ratio of SREs to developers is close to 1:100, so maintaining a six-person SRE team is often unrealistic.
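
The arithmetic behind the minimum team size can be illustrated with a small sketch. It assumes a single primary OnCall engineer at a time and an even weekly round-robin; the engineer names and the helper functions `weekly_rotation` and `oncall_share` are hypothetical, not part of any Google tooling.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(engineers, start, weeks):
    """Assign one primary on-call engineer per week, round-robin."""
    names = cycle(engineers)
    return [(start + timedelta(weeks=i), next(names)) for i in range(weeks)]


def oncall_share(team_size, concurrent_oncall=1):
    """Fraction of time each engineer is on call under an even rotation."""
    if team_size <= 0 or team_size < concurrent_oncall:
        raise ValueError("team is too small to staff the rotation")
    return concurrent_oncall / team_size


# Hypothetical six-person team, rotation starting on a Monday.
team = ["ana", "bo", "chen", "dee", "eli", "fay"]
schedule = weekly_rotation(team, date(2024, 1, 1), weeks=8)
for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)

# The per-person share grows quickly as the team shrinks.
for size in (2, 4, 6, 8):
    print(f"team of {size}: on call {oncall_share(size):.1%} of the time")
```

Under these assumptions, a two-person team puts each engineer on call half of the time, which is why a small team cannot sustain a rotation without some people being permanently on duty.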

Tools

At the tooling level, Google's internal OnCall tool is Outalator. In Outalator, SREs manage the full lifecycle of alerts on a centralized platform. Its capabilities include:

  1. Alert grouping. Multiple alerts are grouped into a single incident. SREs follow and process work at the incident level, which greatly reduces notification volume, avoids repetitive work, lowers alert noise, improves handling efficiency, and reduces mistakes.
  2. Labels. Different incidents can be labeled with additional context, making it easier for SREs to filter, analyze, and report by label.
  3. Alert analytics. Outalator analyzes alert volume trends, response efficiency, and resolution efficiency across dimensions such as team, person, service, and data center, helping SREs understand where OnCall work needs improvement.
  4. One-click reports and announcements. A particularly useful feature for frontline SREs is selecting a set of incidents and sending their titles, labels, and important notes by email to the next OnCall engineer, with optional CC recipients or mailing lists. This makes handoff easy. Outalator also supports a report mode for periodic production service reviews, which many teams run weekly.
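
Alert grouping and labeling as described in the list above can be sketched in a few lines. This is a minimal illustration, not how Outalator is implemented: it assumes alerts are grouped by a `(service, check)` key and that an incident stays open while matching alerts keep arriving within a ten-minute window. The `Alert` and `Incident` types, the field names, and the window length are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str
    check: str
    fired_at: datetime


@dataclass
class Incident:
    key: tuple
    alerts: list = field(default_factory=list)
    labels: set = field(default_factory=set)

    @property
    def last_seen(self):
        return self.alerts[-1].fired_at


def group_alerts(alerts, window=timedelta(minutes=10)):
    """Fold alerts with the same (service, check) into one incident
    as long as each arrives within `window` of the previous one."""
    incidents = []
    open_by_key = {}
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        key = (alert.service, alert.check)
        incident = open_by_key.get(key)
        if incident is None or alert.fired_at - incident.last_seen > window:
            incident = Incident(key=key)
            incidents.append(incident)
            open_by_key[key] = incident
        incident.alerts.append(alert)
    return incidents


t0 = datetime(2024, 1, 1, 12, 0)
alerts = [
    Alert("checkout", "high_latency", t0),
    Alert("checkout", "high_latency", t0 + timedelta(minutes=2)),
    Alert("checkout", "high_latency", t0 + timedelta(minutes=5)),
    Alert("search", "error_rate", t0 + timedelta(minutes=3)),
    Alert("checkout", "high_latency", t0 + timedelta(minutes=40)),
]
incidents = group_alerts(alerts)

# Labels attach context to an incident and make filtering easy.
incidents[0].labels.add("capacity")
by_label = [i for i in incidents if "capacity" in i.labels]
print(len(alerts), "alerts ->", len(incidents), "incidents")
```

Responders are then notified once per incident rather than once per alert, and the labels give the analytics and handoff reports something to filter on.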

Outalator looked roughly like this:

Google Outalator 2015

In summary, a professional OnCall tool can address several common pain points faced by operations and development teams:

  • Technical teams receive large numbers of alerts every day.
  • Many alerts go unacknowledged for long periods.
  • Alerts lack correlation, making handling inefficient.
  • Alert handling lacks collaboration; the process is opaque, information is hard to share, and knowledge is hard to retain.
  • Many alerts do not accurately reflect reality and waste team energy.
  • Customers or users often discover failures before the technical team, and customer satisfaction continues to decline.
  • Teams cannot quantify the current state or efficiency of incident response, making it hard to define an improvement path.

Can We Run OnCall Like Google SRE?

The bad news is that culture and mechanisms are hard to copy. The good news is that, at the tooling level, there are now several options comparable to Google's OnCall tooling.

Kubernetes is the open source implementation of Google's internal Borg container orchestration system, and Prometheus is the open source counterpart of Borgmon. What about Google's internal OnCall tool, Outalator? Below is an introduction and comparison of two representative OnCall products on the market:

  • PagerDuty is a global leader in OnCall products and can be adopted for as little as $21 per user per month.
  • Flashduty is an OnCall product launched by the team behind the open source monitoring tool Nightingale. Compared with PagerDuty, Flashduty integrates better with many of the monitoring tools and IM tools used in China, and provides a simpler product experience.

Without measurement, there is no improvement. In practice, operations leaders may see too many alerts and an exhausted team, but they often cannot see the actual workload of alert handling. That makes it hard to plan staffing. Worse, without knowing where alert optimization should start, the situation continues to deteriorate until the team burns out and failures become frequent. A good OnCall tool should expose five key metrics:

  1. Noise reduction ratio. The alert compression ratio: how far grouping algorithms and rules reduce the number of alerts before responders are notified. Alert grouping reduces alert storms, lowers responder workload, and improves information processing efficiency. The higher this metric, the better.
  2. Response ratio. The share of all alerts that are acknowledged. In alert management, the useful alerts are the ones that need a response or acknowledgment. Tracking the response ratio helps teams evaluate whether their alerts are effective and improve that ratio over time. The higher this metric, the better.
  3. Total alert volume. The number of alerts generated in a time window. A high alert volume means higher duty pressure, more distraction for technical teams, and potentially too much alert noise. Excessive alerts can make a system unmanageable and should be reduced as much as possible. SLO-based alerting, for example, can significantly lower this metric. The lower this metric, the better.
  4. MTTA, or mean time to acknowledge. The average interval between alert occurrence and responder acknowledgment. A shorter MTTA indicates higher alert-handling efficiency and often higher service stability. MTTA also helps measure team workload so leaders can decide whether additional resources are needed to keep the team sustainable. This metric should be kept at a healthy level.
  5. MTTR, or mean time to resolve. The average interval between alert occurrence and problem resolution. A shorter MTTR usually means the team has better observability, stronger infrastructure, more mature operational skills, and deeper understanding of business systems. The lower this metric, the better.
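
As a rough sketch, the five metrics above can be computed from a list of alert records. The record layout is an assumption made for illustration, and the noise reduction ratio is taken here as `1 - incidents / alerts`; a real product may define these fields and formulas differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class AlertRecord:
    incident_id: str                       # incident the alert was grouped into
    fired_at: datetime
    acked_at: Optional[datetime] = None    # None if never acknowledged
    resolved_at: Optional[datetime] = None


def _mean_minutes(deltas):
    """Average a list of timedeltas in minutes; None when the list is empty."""
    return mean(d.total_seconds() for d in deltas) / 60 if deltas else None


def oncall_metrics(records):
    total = len(records)
    if total == 0:
        raise ValueError("no alerts in the window")
    incidents = len({r.incident_id for r in records})
    acked = [r for r in records if r.acked_at is not None]
    resolved = [r for r in records if r.resolved_at is not None]
    return {
        "total_alerts": total,
        "noise_reduction_ratio": 1 - incidents / total,
        "response_ratio": len(acked) / total,
        "mtta_minutes": _mean_minutes([r.acked_at - r.fired_at for r in acked]),
        "mttr_minutes": _mean_minutes([r.resolved_at - r.fired_at for r in resolved]),
    }


t0 = datetime(2024, 1, 1, 12, 0)
records = [
    AlertRecord("inc-1", t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
    AlertRecord("inc-1", t0 + timedelta(minutes=1)),
    AlertRecord("inc-1", t0 + timedelta(minutes=2)),
    AlertRecord("inc-2", t0 + timedelta(minutes=10),
                t0 + timedelta(minutes=16), t0 + timedelta(minutes=70)),
]
metrics = oncall_metrics(records)
for name, value in metrics.items():
    print(name, value)
```

Tracking these values per team or per service over time shows where alert optimization should start, which is the point of measuring them at all.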

Below, we compare Flashduty and PagerDuty across product capabilities, pricing, and service.

Product

Integration Capabilities

As the process hub for incident management, an incident management system stores all alert and incident data. It should provide strong data ingestion and outbound integration capabilities so it can connect with other systems and workflows, accelerate response, and improve collaboration.

Incident Handling

Incident handling is the core workflow. This dimension mainly evaluates the richness and flexibility of product features.

Platform Capabilities

Platform capabilities cover member management, duty response, and notification. The system should provide basic audit and single sign-on capabilities. The more notification channels and the richer the localization, the better, and schedule management should support special organizational scenarios.

Pricing

PagerDuty and Flashduty both provide multiple subscription options. When choosing a product, teams need to consider cost-effectiveness, budget control, and the simplicity of the pricing model while meeting their own requirements.

Service

The service dimension evaluates how vendors respond to service requests, as well as their professionalism and timeliness.

Registration

Click the dedicated link to register and start handling alerts directly in your IM tools.
