AI Accelerates Observability: Building an Intelligent OnCall Copilot for Every Engineer

AIOps can be traced back to 2011, when Netflix's Chaos Monkey practice pioneered fault injection and automated remediation. In 2016, Gartner introduced the concept of AIOps, or Algorithmic IT Operations, describing the importance and broad potential of algorithms, machine learning, and big data analytics in automated operations.

Over the following decade, many innovative companies, internet giants, and research organizations kept pursuing an "AGI moment" for intelligent operations, but none achieved a true breakthrough. Most approaches were confined to a single vertical domain or scenario, such as event correlation analysis, time series prediction, log clustering, or feature analysis. They transferred poorly across companies, scenarios, and business systems, and their results were hard to reproduce.

The ChatGPT era began in 2022. The capability of large models, and the pace at which they improve, changed everyone's expectations, and large models and agents are now reshaping every industry. In observability, intelligent OnCall, or AI Autonomous OnCall, has become feasible. Every engineer can have an AI copilot that helps with incident response and problem analysis and handles the tedious work faster and better. Making this work depends on several key factors.

First, the model must be intelligent enough, and API call costs must be affordable. DeepSeek has raised the baseline for everyone: it has strong reasoning capabilities and is well suited to analyzing loosely structured observability data such as labels, log text, and events. In Flashcat, teams can connect directly to large model APIs hosted in public clouds. If data security is a concern, they can instead use large models deployed privately inside the enterprise.

Second, sufficiently complete observability data is the foundation of intelligent OnCall. The core work is to expose metrics, logs, traces, events, and other dimensions of enterprise observability data to AI through unified interfaces. In Flashcat, we break this work into five steps.

Integrate data. Register existing Prometheus, Elasticsearch, ClickHouse, Doris, Zabbix, MySQL, and other data systems, as well as cloud observability tools such as Alibaba Cloud SLS, Alibaba Cloud ARMS, AWS CloudWatch, and Azure Monitor, as data sources in Flashcat. Flashcat then exposes them uniformly to AI agents. Flashcat currently supports more than 40 common data source integrations.
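As a rough illustration of what "exposing data sources uniformly" could look like, the sketch below defines a small adapter interface and a registry that fans one label query out to every registered backend. The names `Record`, `DataSource`, `InMemorySource`, and `Registry` are hypothetical and do not come from Flashcat; the in-memory source only stands in for a real Prometheus or Elasticsearch adapter.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Record:
    """One observability record in a source-agnostic shape (hypothetical)."""
    kind: str                 # "metric", "log", "trace", or "event"
    labels: dict[str, str]
    value: str


class DataSource(Protocol):
    """Hypothetical interface that every backend adapter implements."""
    name: str

    def query(self, labels: dict[str, str]) -> list[Record]: ...


@dataclass
class InMemorySource:
    """Stand-in for a real adapter; it only filters a local list."""
    name: str
    records: list[Record] = field(default_factory=list)

    def query(self, labels: dict[str, str]) -> list[Record]:
        # Keep records whose labels contain every requested key/value pair.
        return [r for r in self.records
                if all(r.labels.get(k) == v for k, v in labels.items())]


class Registry:
    """Fans one label query out to every registered source."""

    def __init__(self) -> None:
        self._sources: dict[str, DataSource] = {}

    def register(self, source: DataSource) -> None:
        self._sources[source.name] = source

    def query_all(self, labels: dict[str, str]) -> dict[str, list[Record]]:
        return {name: src.query(labels) for name, src in self._sources.items()}


registry = Registry()
registry.register(InMemorySource("metrics", [
    Record("metric", {"service": "checkout"}, "p99_latency_seconds=1.2"),
]))
registry.register(InMemorySource("logs", [
    Record("log", {"service": "checkout"}, "timeout connecting to payments-db"),
    Record("log", {"service": "search"}, "index refresh completed"),
]))
checkout_view = registry.query_all({"service": "checkout"})
```

With this shape, an agent asks one question by label and receives results grouped by source, without knowing which query language each backend speaks.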

Collect data. If existing data is incomplete, we recommend using the Categraf collector to fill the gaps. Newly collected data is stored in Flashcat and then exposed to AI agents in the same way.

Enrich data. Having observability data is one thing; having high-quality data is another. In observability, labels are a critical concept. By adding rich and consistent labels to metrics, logs, events, and other data, we can express relationships between observability data, describe entity attributes, and describe interactions between entities. We call this process data enrichment. Enrichment can happen during collection, where collectors interact with a CMDB, the Kubernetes API Server, or public cloud controllers to add relevant labels. It can also happen in the data transmission pipeline, where labels are extracted, transformed, and mapped in real time. This makes observability labels richer and more consistent, and easier for AI to understand.

Preprocess data. The first three steps give us complete and relatively high-quality data. The next challenge is practical: observability data is usually massive. Sending everything to a large model and hoping scale alone solves the problem does not work. In Flashcat, we use the Firefighting Map to build a complete environment map covering CI/CD, infrastructure, Kubernetes, metrics, logs, traces, events, and runbooks. It describes the relationships between APIs, services, Pods, components, and other entities in the environment. Based on these relationships, Flashcat can send only highly relevant data to the large model for analysis. This context is also a key input for the model to understand the problem. In addition, log clustering and similarity-based deduplication reduce the model's data-processing burden.
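The enrichment and preprocessing steps above can be sketched together as a small pipeline: merge CMDB labels into each log record, keep only records from services related to the alerting service, then drop near-duplicate messages. This is a minimal sketch, not Flashcat's implementation. The `CMDB` and `DEPENDS_ON` tables are hypothetical stand-ins for a real CMDB and for the relationships in an environment map, and the similarity check uses the standard library's `difflib.SequenceMatcher` with an arbitrary 0.8 threshold in place of a real log clustering algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical CMDB lookup: host -> labels describing the entity it belongs to.
CMDB = {
    "node-1": {"service": "checkout", "team": "payments"},
    "node-2": {"service": "search", "team": "discovery"},
}

# Hypothetical dependency edges: service -> services it depends on.
DEPENDS_ON = {"api-gateway": ["checkout"], "checkout": ["payments-db"], "search": []}


def enrich(record: dict) -> dict:
    """Return a copy of the record with CMDB labels merged in (existing keys win)."""
    extra = CMDB.get(record.get("host", ""), {})
    return {**extra, **record}


def related_services(root: str) -> set[str]:
    """Collect the root service and everything reachable through DEPENDS_ON."""
    seen: set[str] = set()
    stack = [root]
    while stack:
        svc = stack.pop()
        if svc not in seen:
            seen.add(svc)
            stack.extend(DEPENDS_ON.get(svc, []))
    return seen


def dedupe(messages: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a message only if it is not too similar to one already kept."""
    kept: list[str] = []
    for msg in messages:
        if all(SequenceMatcher(None, msg, k).ratio() < threshold for k in kept):
            kept.append(msg)
    return kept


def build_context(alert_service: str, logs: list[dict]) -> list[str]:
    """Enrich logs, keep those on related services, then drop near-duplicates."""
    scope = related_services(alert_service)
    enriched = [enrich(r) for r in logs]
    relevant = [r["message"] for r in enriched if r.get("service") in scope]
    return dedupe(relevant)


raw_logs = [
    {"host": "node-1", "message": "timeout connecting to payments-db after 3000 ms"},
    {"host": "node-1", "message": "timeout connecting to payments-db after 3001 ms"},
    {"host": "node-2", "message": "index refresh completed"},
]
context = build_context("api-gateway", raw_logs)
```

Here the two timeout lines collapse into one and the unrelated `search` log is filtered out, so the model receives a single relevant line instead of three.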

Export data. Flashcat uses an approach similar to an MCP Server to export the observability data and contextual environment information described above to large models. In the future, this valuable Flashcat data can also be shared with enterprise MCP clients.
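To make the export idea more concrete, the sketch below shows a generic "list tools, then call a tool" pattern in which observability queries are registered as named tools and invoked through JSON requests. It is only an illustration of the general pattern: it does not implement the MCP wire protocol, it does not reflect Flashcat's actual interface, and every name in it (`tool`, `query_logs`, `list_tools`, `handle`, the request shape) is an assumption made for the example.

```python
import json
from typing import Callable

# Registry of tools: name -> (description, handler). All names are hypothetical.
TOOLS: dict[str, tuple[str, Callable[..., object]]] = {}


def tool(name: str, description: str):
    """Decorator that registers a function as a callable tool."""
    def wrap(fn):
        TOOLS[name] = (description, fn)
        return fn
    return wrap


@tool("query_logs", "Return recent log lines for a service")
def query_logs(service: str, limit: int = 5) -> list[str]:
    # Placeholder data; a real handler would query the log backend.
    sample = {"checkout": ["timeout connecting to payments-db", "retry 1 of 3"]}
    return sample.get(service, [])[:limit]


def list_tools() -> list[dict]:
    """Describe the available tools so a model client can decide what to call."""
    return [{"name": n, "description": d} for n, (d, _) in TOOLS.items()]


def handle(request_json: str) -> str:
    """Dispatch a JSON request shaped like {"tool": ..., "arguments": {...}}."""
    req = json.loads(request_json)
    entry = TOOLS.get(req.get("tool"))
    if entry is None:
        return json.dumps({"error": f"unknown tool: {req.get('tool')}"})
    _, fn = entry
    return json.dumps({"result": fn(**req.get("arguments", {}))})


response = handle('{"tool": "query_logs", "arguments": {"service": "checkout", "limit": 1}}')
```

The point of the pattern is that the model never sees backend credentials or query syntax; it sees only tool names, descriptions, and structured results.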

Flashcat uses a multi-agent architecture. Each agent focuses on one scenario or one type of data, and the results are then aggregated. Preset scenarios include Firefighting Map status, feature analysis, log search, trace search, event analysis, dashboard analysis, and more. Because each agent works on a narrow slice of the problem, this structure improves accuracy, speeds up analysis, and reduces dependence on very large model context windows.
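The fan-out and aggregation pattern can be sketched with `asyncio`: each agent is a coroutine that looks at one kind of data, all agents run concurrently, and a failure in one agent is recorded instead of aborting the whole analysis. The agent functions below are hypothetical placeholders that return canned strings; a real agent would query its data source and call a model.

```python
import asyncio
from typing import Awaitable, Callable

# An agent takes the incident description and returns a short finding.
Agent = Callable[[dict], Awaitable[str]]


async def log_agent(incident: dict) -> str:
    # Placeholder: a real agent would search logs and summarize them.
    return f"logs: errors spiked on {incident['service']}"


async def trace_agent(incident: dict) -> str:
    # Placeholder: a real agent would inspect slow or failed spans.
    return f"traces: slow spans downstream of {incident['service']}"


async def event_agent(incident: dict) -> str:
    # Simulates an agent whose data source is unreachable.
    raise RuntimeError("data source unavailable")


async def analyze(incident: dict, agents: dict[str, Agent]) -> dict[str, str]:
    """Run all agents concurrently; one failing agent does not sink the rest."""
    results = await asyncio.gather(
        *(agent(incident) for agent in agents.values()),
        return_exceptions=True,
    )
    return {
        name: (f"failed: {res}" if isinstance(res, Exception) else res)
        for name, res in zip(agents, results)
    }


findings = asyncio.run(analyze(
    {"service": "checkout"},
    {"logs": log_agent, "traces": trace_agent, "events": event_agent},
))
```

The aggregated `findings` dictionary is small enough to pass to a final summarizing step, which is one way a multi-agent design keeps each model call within a modest context size.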

There is a common misconception about the value and implementation path of knowledge bases in intelligent OnCall. Many people assume that enterprise alert handling records, postmortem records, and runbooks can be combined with RAG to improve model analysis. In our actual experience, most existing enterprise knowledge base data is not good enough in either quality or quantity, and is often unusable.

In Flashcat, we use AI to build up the knowledge base instead. For every AI analysis process and conclusion, engineers judge whether it was useful. If it was, it is added to the knowledge base. As AI analysis is used more often, high-quality knowledge snowballs, growing larger and more valuable in a positive feedback loop.
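A minimal sketch of this feedback loop, assuming conclusions are keyed by some alert identifier: only analyses that an engineer marks as useful are stored, and stored conclusions can be looked up as context the next time the same alert fires. The `Analysis` and `KnowledgeBase` classes and the `alert_key` field are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    alert_key: str     # e.g. an alert rule name or fingerprint (assumed)
    conclusion: str


@dataclass
class KnowledgeBase:
    """Stores only analyses that an engineer has marked as useful."""
    entries: dict[str, list[str]] = field(default_factory=dict)

    def record_feedback(self, analysis: Analysis, useful: bool) -> bool:
        """Keep the conclusion only if it was judged useful.

        Returns True when the conclusion is in the knowledge base afterwards.
        """
        if not useful:
            return False
        bucket = self.entries.setdefault(analysis.alert_key, [])
        if analysis.conclusion not in bucket:   # skip exact duplicates
            bucket.append(analysis.conclusion)
        return True

    def lookup(self, alert_key: str) -> list[str]:
        """Return approved conclusions to include as context next time."""
        return list(self.entries.get(alert_key, []))


kb = KnowledgeBase()
kb.record_feedback(Analysis("HighLatency", "connection pool exhausted on payments-db"), useful=True)
kb.record_feedback(Analysis("HighLatency", "unrelated disk warning"), useful=False)
approved = kb.lookup("HighLatency")
```

Rejected conclusions never enter the store, so the knowledge base grows only with entries that a human has already validated.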

As AI analysis becomes faster and more accurate, and large model invocation costs become more controllable, intelligent OnCall can soon be enabled by default for most incidents and alerts. AI is already accelerating observability. If you are also exploring this direction, we would be glad to talk:

https://flashcat.cloud/contact
