
OnCall Practice in the Futures Industry

Operations assurance in the futures industry has its own characteristics and challenges. Implementing management mechanisms, technical requirements, and operating processes on a unified monitoring and incident response platform is key to faster alert response, lower operational pressure, and higher system reliability.

Flashcat

At the second CCF Nightingale Observability Innovation Forum, Qingyu Song, a technical expert from a futures company in Shanghai, shared practical experience implementing OnCall in the futures industry.

This futures company was among the first in China to obtain full financial futures clearing business qualification and has received an AA rating in the classification by the China Futures Association. It provides comprehensive financial services, including commodity futures brokerage, financial futures brokerage, and asset management. The company has more than 20 branches and over 300 securities IB offices nationwide, and its business network covers the country. Its operating revenue, net profit, customer count, total customer equity, market share, and other business indicators have ranked in the first tier of the futures industry for many years.

OnCall practice in the futures industry - Figure 1

In his talk, Qingyu Song noted that operations assurance in the futures industry has several special characteristics and challenges:

  1. Futures businesses follow the schedules of different exchanges and involve multiple trading sessions, including day and night sessions. Operations teams participate throughout the process, so engineers need to provide around-the-clock duty coverage.
  2. The continuity, uniqueness, real-time nature, high risk, and peak-load pressure of futures trading impose extremely high requirements on the security and stability of futures information systems.
  3. Operations engineers handle tens of thousands of alert notifications each week on average. At that volume, engineers become desensitized and important alerts get missed; missed critical alerts have caused production incidents in the past.

Implementing the corresponding management mechanisms, technical requirements, and operating mechanisms on a unified monitoring and incident response platform is therefore key to improving alert response speed, reducing operational pressure, and improving system reliability.

OnCall practice in the futures industry - Figure 2

Implementation Approach

  1. Build the platform
    • Choose a mature platform that follows industry OnCall best practices and supports unified alert ingestion, schedule management, alert escalation, and alert noise reduction, fully covering the company's operations scenarios.
  2. Build the process
    • At the company level, establish the relevant policies and roles, and staff a dedicated team (the EEC monitoring role) to build and improve the OnCall process, retain the capabilities built along the way, and follow up on outstanding OnCall issues.
  3. Unify metadata
    • Integrate with internal metadata systems such as CMDB and reuse existing metadata, lowering platform construction cost and improving automation and intelligence.
  4. Operate continuously
    • Quantify each team's operations OnCall work on a regular basis using metrics such as MTTA and MTTR, continuously govern alerts, and improve OnCall efficiency.

Build the Platform

When selecting a platform, the team focused on five factors. First, the platform needed flexible scheduling and duty capabilities as well as alert escalation. Second, it needed a trading calendar to fit the business operations characteristics of the financial industry. Third, it needed to integrate with the company's many alert data sources. Fourth, it needed unified and powerful alert noise reduction and alert suppression strategies. Finally, it needed rich metadata integration capabilities so it could connect with the company's CMDB and other metadata centers. After research and testing, the team selected Flashduty as its OnCall platform.
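The alert escalation requirement can be illustrated with a minimal sketch. This is not Flashduty's configuration format or API; the `EscalationLevel` class, the `POLICY` levels, the target names, and the time thresholds are all hypothetical, chosen only to show the idea that an unacknowledged alert reaches more people as time passes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EscalationLevel:
    """One step of a hypothetical escalation policy."""
    after: timedelta          # time since trigger at which this level is notified
    targets: tuple            # names of the people or rotations to notify


# Hypothetical policy: values are placeholders, not recommendations.
POLICY = (
    EscalationLevel(timedelta(minutes=0), ("primary-oncall",)),
    EscalationLevel(timedelta(minutes=10), ("secondary-oncall",)),
    EscalationLevel(timedelta(minutes=30), ("team-lead",)),
)


def notified_so_far(triggered_at: datetime, now: datetime, acked: bool,
                    policy=POLICY) -> list:
    """Return every target that should have been notified by `now`.

    An acknowledged alert stops escalating, so it returns an empty list.
    """
    if acked:
        return []
    elapsed = now - triggered_at
    reached = []
    for level in policy:
        if elapsed >= level.after:
            reached.extend(level.targets)
    return reached
```

Under this sketch, an alert that is still unacknowledged 12 minutes after it fired has reached the first two levels, while an acknowledged alert reaches no one further.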

OnCall practice in the futures industry - Figure 3

When the team integrated its existing monitoring events, Flashduty's combination of email integration and label enrichment was especially helpful. Network analysis platforms such as Tiandan and Colasoft, as well as OceanBase, TdSQL, Qfusion, and SmartX, only provided email-based alerting. Because they could not easily call third-party webhooks, delivering their alerts through IM was difficult. Flashduty can parse incoming email, extract key information such as alert details, thresholds, and system status updates from the subject and body, convert it into structured data, and then notify operations engineers through different channels.
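To make the idea of turning an alert email into structured data concrete, here is a minimal sketch using only the Python standard library. It does not reproduce Flashduty's parser. The subject layout (`[SEVERITY] host - summary`), the `key: value` body lines, the sample message, and the function name `parse_alert_email` are assumptions made for illustration; real products format their emails differently.

```python
import re
from email import message_from_string

# Assumed subject layout: "[SEVERITY] host - summary"
SUBJECT_RE = re.compile(
    r"\[(?P<severity>\w+)\]\s*(?P<host>\S+)\s*-\s*(?P<summary>.+)"
)
# Assumed body layout: one "key: value" pair per line
FIELD_RE = re.compile(r"^(?P<key>[A-Za-z_]+)\s*:\s*(?P<value>.+)$", re.MULTILINE)


def parse_alert_email(raw: str) -> dict:
    """Convert a raw alert email into a structured event dictionary."""
    msg = message_from_string(raw)
    event = {"source": msg.get("From", ""), "labels": {}}

    # Pull severity, host, and summary out of the subject line if it matches.
    match = SUBJECT_RE.match(msg.get("Subject", ""))
    if match:
        event.update(match.groupdict())

    # Pull "key: value" lines out of a plain-text body as labels.
    body = msg.get_payload()
    if isinstance(body, str):
        for field in FIELD_RE.finditer(body):
            event["labels"][field.group("key").lower()] = field.group("value").strip()
    return event


# Hypothetical sample message, for illustration only.
SAMPLE = (
    "From: dba-alerts@example.com\n"
    "Subject: [Critical] db-node-01 - replication lag high\n"
    "\n"
    "metric: replication_lag_seconds\n"
    "threshold: 30\n"
    "current: 95\n"
)

event = parse_alert_email(SAMPLE)
```

Once the email is a dictionary of fields and labels, it can be routed, deduplicated, and pushed to IM channels like any webhook-delivered alert.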

Email integration and label enrichment

Build the Mechanism

Establishing mechanisms that fit the business scenario can effectively remove uncertainty from the OnCall process, improve engineers' sense of safety, and increase efficiency.

Flashduty - alert OnCall mechanism

Unify Metadata

Take CMDB as an example. CMDB already stores relationships between resources, as well as relationships between resources and people. If an OnCall platform can directly use this information, alert dispatch becomes more automated, duplicate configuration work is reduced, and the risks caused by inconsistent metadata are lowered.

Flashduty supports CMDB integration. It can obtain asset dependency mapping information from CMDB and use that metadata for alert label enrichment, adding more context to alerts. This brings two benefits. First, enriched labels make automated alert dispatch easier. Second, when engineers receive an alert, they immediately see richer context, which helps them quickly judge impact scope and severity.
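The two benefits can be sketched as follows. This is an illustration of CMDB-based label enrichment in general, not Flashduty's integration; the `CMDB` dictionary, its field names (`system`, `owner_team`, `depends_on`), the host names, and the `enrich` and `route` helpers are all hypothetical.

```python
# Hypothetical CMDB snapshot keyed by host name.
CMDB = {
    "trade-gw-01": {
        "system": "trading-gateway",
        "owner_team": "trading-ops",
        "depends_on": ["db-node-01"],
    },
    "db-node-01": {
        "system": "core-database",
        "owner_team": "dba",
        "depends_on": [],
    },
}


def enrich(alert: dict, cmdb: dict) -> dict:
    """Return a copy of the alert with CMDB metadata added as labels."""
    labels = dict(alert.get("labels", {}))
    record = cmdb.get(labels.get("host", ""))
    if record is None:
        # Flag unknown hosts so gaps in the CMDB are visible.
        labels["cmdb_matched"] = "false"
    else:
        labels["cmdb_matched"] = "true"
        # Labels already present on the alert take precedence.
        labels.setdefault("system", record["system"])
        labels.setdefault("owner_team", record["owner_team"])
        labels.setdefault("depends_on", ",".join(record["depends_on"]))
    return {**alert, "labels": labels}


def route(alert: dict, default_team: str = "noc") -> str:
    """Pick the team to dispatch to from the enriched labels."""
    return alert["labels"].get("owner_team", default_team)


enriched = enrich({"title": "latency high", "labels": {"host": "trade-gw-01"}}, CMDB)
unknown = enrich({"title": "disk full", "labels": {"host": "mystery-host"}}, CMDB)
```

Here `route(enriched)` dispatches to the owning team recorded in the CMDB, while `route(unknown)` falls back to the default team because no record was found.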

Flashduty - CMDB-based label enrichment

Continuous Operations

"Without measurement, there is no improvement." All alerts are centralized in one platform, and every step in each alert's full handling lifecycle is recorded. Flashduty provides statistics across multiple dimensions, including workload statistics, TopK alert statistics, MTTA, and MTTR. Based on this data, managers can drive OnCall optimization scientifically and with clear targets.
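As a rough illustration of how such metrics can be derived from lifecycle records, the sketch below computes MTTA, MTTR, and a TopK alert ranking. It assumes a hypothetical record layout (`triggered_at`, `acked_at`, `resolved_at`, `title`) and measures both MTTA and MTTR from the trigger time; definitions vary between tools, and this is not how Flashduty necessarily computes its reports.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean


def mtta_mttr(incidents: list) -> tuple:
    """Return (MTTA, MTTR) in seconds, or None where no data exists.

    MTTA averages trigger-to-acknowledge time over acknowledged incidents;
    MTTR averages trigger-to-resolve time over resolved incidents.
    """
    ack = [(i["acked_at"] - i["triggered_at"]).total_seconds()
           for i in incidents if i.get("acked_at")]
    res = [(i["resolved_at"] - i["triggered_at"]).total_seconds()
           for i in incidents if i.get("resolved_at")]
    return (mean(ack) if ack else None, mean(res) if res else None)


def top_k_alerts(incidents: list, k: int = 3) -> list:
    """Return the k most frequent alert titles with their counts."""
    return Counter(i["title"] for i in incidents).most_common(k)


# Hypothetical sample records, for illustration only.
t0 = datetime(2024, 1, 8, 9, 0)
incidents = [
    {"title": "disk usage high", "triggered_at": t0,
     "acked_at": t0 + timedelta(minutes=2), "resolved_at": t0 + timedelta(minutes=20)},
    {"title": "disk usage high", "triggered_at": t0,
     "acked_at": t0 + timedelta(minutes=4), "resolved_at": t0 + timedelta(minutes=40)},
    {"title": "gateway latency", "triggered_at": t0,
     "acked_at": None, "resolved_at": None},
]

mtta, mttr = mtta_mttr(incidents)
```

With the sample data, the unacknowledged incident is excluded from both averages, which is one reason to track the acknowledgement rate alongside MTTA.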

Flashduty statistics reports

Read https://flashcat.cloud/product/flashduty to learn more about Flashduty.

Mr. Song's WeChat: giggs06. If you would like an in-depth discussion with him, you can add him on WeChat directly.
