
Introduction to Flashduty Monitoring and Alert Rules

Flashduty is an On-call platform for alert distribution, noise reduction, schedules, acknowledgement, escalation, dispatch, and collaboration. It also provides an alert engine that queries data sources and generates alert events, similar to vmalert but with broader data-source support.

Buffett Ba

As an On-call platform, Flashduty primarily solves alert distribution: noise reduction, schedules, acknowledgement, escalation, dispatch, and collaboration. It also provides an alert engine. You can manage alert rules in Flashduty; Flashduty queries different data sources according to those rules, evaluates anomalies, and generates alert events. This is similar to vmalert, except vmalert only evaluates VictoriaMetrics data, while Flashduty supports more data sources such as Prometheus-compatible sources, MySQL, PostgreSQL, Oracle, Elasticsearch, Loki, and ClickHouse.

Architecture

Flashduty is a SaaS service deployed in the cloud, so it cannot directly access data inside customer private networks. Threshold-based alerting, however, must query that data. Flashduty therefore splits out the alert engine module, monitedge, so that it can be deployed inside the customer environment. monitedge pulls alert rules from Flashduty over the public network, caches them in memory, queries private storage, and performs alert evaluation. The architecture looks like this:

Flashduty alert engine architecture

Customer environments often have multiple data centers, such as US East and South China in the diagram. Each data center may have more than one time-series database, such as Prometheus or VictoriaMetrics. Flashduty can also connect to Elasticsearch, Loki, ClickHouse, and other stores. The diagram uses VictoriaMetrics as an example.

Usually, each data center should deploy one Flashduty alert engine to evaluate monitoring data in that data center. For example, deploy one monitedge in US East for US East monitoring data, and one in South China for South China monitoring data. If network links between data centers are excellent, one monitedge can also handle all data centers.

If a single monitedge instance creates a single-point-of-failure concern, deploy multiple instances as a cluster. For example, deploy two instances in US East with the same cluster name by starting them with --alerter.clusterName meidong (pinyin for "US East"); deploy another two in South China with --alerter.clusterName huanan (pinyin for "South China").

Multiple instances in the same alert engine cluster automatically shard alert rules. If a two-instance cluster handles 100 rules, the system balances them so each monitedge instance handles 50. If one instance fails, the other takes over all 100 rules, preserving high availability while avoiding duplicate alert events.
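The sharding behavior can be illustrated with a small sketch. Flashduty does not document its balancing algorithm here, so the round-robin scheme, the assign_rules function, and the instance names below are all assumptions chosen only to reproduce the 50/50 split and the failover to 100.

```python
def assign_rules(rule_ids, live_instances):
    """Spread rules evenly over the instances that are currently alive.

    Hypothetical helper: round-robin over sorted names, not Flashduty's
    actual algorithm.
    """
    if not live_instances:
        raise ValueError("no live alert engine instances in the cluster")
    instances = sorted(live_instances)
    assignment = {name: [] for name in instances}
    # Sorting gives every instance the same view of the inputs, so each rule
    # gets exactly one owner and no alert event is generated twice.
    for index, rule_id in enumerate(sorted(rule_ids)):
        owner = instances[index % len(instances)]
        assignment[owner].append(rule_id)
    return assignment


rules = [f"rule-{n:03d}" for n in range(100)]
both_up = assign_rules(rules, ["monitedge-a", "monitedge-b"])
one_down = assign_rules(rules, ["monitedge-b"])  # monitedge-a has failed
print({name: len(owned) for name, owned in both_up.items()})
print({name: len(owned) for name, owned in one_down.items()})
```

With both instances alive each owns 50 rules; when one disappears from the live list, the survivor owns all 100.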

Feature Overview

The entry point for Flashduty monitoring and alerting is "Monitor Management." It contains six menus:

  • Overview: shows basic statistics. The statistics are not critical, but the system event list at the bottom matters because it shows errors during alert engine execution.
  • Alert Rules: manages alert rules. Because rule count can be large, the left side uses a tree group structure for easier management.
  • Rule Repository: common alert rules curated by Flashduty. You can import them into your own group and adjust them.
  • Node Permissions: manages the tree group structure. Different nodes can be associated with different teams, allowing team members to manage rules under their group and subgroups.
  • Data Sources: manages the data sources the alert engine connects to. Data source addresses must be reachable by the monitedge process and are usually private-network addresses.
  • Alert Engines: shows alert engine instances and installation or upgrade instructions. After installation, the alert engine heartbeats to the server. If it does not communicate with the server for more than 30 seconds, it is marked offline.
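The offline rule for alert engines amounts to a simple timeout check. The sketch below assumes the server records the time of the last heartbeat per instance; the engine_status function and the timestamps are illustrative, and only the 30-second threshold comes from the text.

```python
from datetime import datetime, timedelta

OFFLINE_AFTER = timedelta(seconds=30)  # threshold stated in the text


def engine_status(last_heartbeat, now):
    """Return "offline" once more than 30 seconds pass without a heartbeat."""
    return "offline" if now - last_heartbeat > OFFLINE_AFTER else "online"


now = datetime(2024, 1, 1, 12, 0, 0)
print(engine_status(now - timedelta(seconds=10), now))
print(engine_status(now - timedelta(seconds=31), now))
```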

Quickstart

Alert Engine

The first step is installing the alert engine. The Alert Engines menu shows installation and upgrade commands:

Install alert engine

Before installation, decide your cluster layout. For example, divide the company network into several partitions and place one alert engine cluster in each partition. Each cluster can run multiple engine instances for high availability. All engines in one partition share the same cluster name, shown in Flashduty as the "Engine cluster name" field. A tooltip beside the field explains the details. Many Flashduty fields have tooltips, and it is worth reading them. The cluster name is user-defined and usually follows the data-center name, such as meidong, huanan, or us01. When the cluster name changes, the installation command below updates automatically.

The alert engine process must authenticate when communicating with the SaaS service. This requires an API Key. The second field selects the API Key. If you do not have one yet, click "Manage API Key" to create it. Selecting a different API Key also updates the installation command.

Copy the provided command and install directly. After installation, the alert engine connects to the server automatically. The Alert Engine Status page will show the instance list:

Alert engine status

Data Sources

The second step is configuring data sources. The alert engine must connect to them, so each data source address should be reachable by the monitedge process and is usually an internal address. Open the Data Sources menu, click Add Data Source, and fill in the fields.

Create Prometheus data source

The image shows a Prometheus data source. Even if the type is Prometheus, you can enter a VictoriaMetrics or Thanos address because they support the Prometheus query API. Give the data source a name, such as "Prom-Business-A"; select the owning alert engine, such as US East for a US East Prometheus; enter the Prometheus URL, such as http://10.1.2.3:9090; and click Save.
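Because Prometheus-compatible stores expose the same HTTP query API, one way to sanity-check a data source address is to build an instant-query URL against it and request it from the host where monitedge runs. The sketch below only constructs the URL; the instant_query_url helper is hypothetical, the base URL is the example address from the text, and /api/v1/query is the standard Prometheus instant-query endpoint.

```python
from urllib.parse import urlencode, urljoin


def instant_query_url(base_url, promql):
    """Build a Prometheus HTTP API instant-query URL for the given PromQL."""
    # Keep any path prefix in base_url (some compatible stores need one).
    endpoint = urljoin(base_url.rstrip("/") + "/", "api/v1/query")
    return endpoint + "?" + urlencode({"query": promql})


url = instant_query_url("http://10.1.2.3:9090", "up")
print(url)
```

If the resulting URL returns a JSON response from the machine running monitedge, the address entered in the data source form should be reachable by the alert engine as well.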

Alert Rules

The third and most important step is creating alert rules. Open Alert Rules and create a group first:

Create rule group

Click the plus icon to create a top-level group. You can create multiple top-level groups and subgroups. To create a subgroup, right-click a group node and choose "Add subgroup." Grouping is flexible: you can use business groups, project groups, or another structure that fits your company. For initial testing, keep it simple and create one top-level group.

Select the group and click Create on the right to create an alert rule. You can also import from the rule repository, import an exported JSON file, or import Prometheus alert rule YAML.

Create alert rule: basic information

The image shows part of the alert rule creation page. Almost every field has a tooltip. If you are unsure, read the tooltip first. Some key fields:

  • Rule name: similar to Prometheus alertname. It names the alert rule and can later be used for filtering and grouping.
  • Additional labels: labels attached to all alert events generated by this rule. They can be used for multidimensional filtering and event grouping. Enter labels as key=value, one per line. If you add the special label __debug__=1, monitedge prints detailed processing logs for this rule, which helps troubleshooting.
  • Data source type: choose Prometheus, Elasticsearch, Loki, ClickHouse, or another supported type.
  • Data source: select the data sources where this alert rule should run. Wildcards are supported.
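The label format and the wildcard selection can be sketched as follows. The parsing rules, the glob-style wildcard syntax, and the label values (team, env) are assumptions for illustration; the text only states that labels are entered as key=value, one per line, and that data source selection supports wildcards.

```python
from fnmatch import fnmatchcase


def parse_labels(text):
    """Parse "key=value" lines into a dict, skipping blank lines."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {line!r}")
        labels[key.strip()] = value.strip()
    return labels


def select_data_sources(pattern, names):
    """Return names matching a glob-style pattern (assumed wildcard syntax)."""
    return [name for name in names if fnmatchcase(name, pattern)]


labels = parse_labels("team=payments\nenv=prod\n__debug__=1\n")
debug_enabled = labels.get("__debug__") == "1"  # special label from the text
matched = select_data_sources(
    "Prom-*", ["Prom-Business-A", "Prom-Business-B", "VM-Infra"]
)
print(labels, debug_enabled, matched)
```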

Alert rule query and evaluation

For Prometheus data sources, write PromQL in the query condition and choose one of three evaluation modes:

  • Threshold evaluation: the PromQL does not include thresholds. monitedge queries the PromQL and evaluates the result against thresholds configured in the Critical, Warning, and Info fields. In the screenshot, memory usage above 80% triggers Warning and above 90% triggers Critical. If memory jumps directly from below 80% to above 90%, only Critical triggers, not Warning.
  • Data missing: query data by PromQL and keep the result in memory. If the next cycle returns no data, generate an alert event.
  • Data exists: usually write the threshold directly in PromQL, such as mem_used_percent{service="mon"} > 85. If the query returns data, alert. This matches native Prometheus behavior.
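The three evaluation modes can be summarized in a short sketch. This is not monitedge's implementation: the function names are hypothetical, query results are simplified to plain values and sets of series identifiers, and tracking missing data per series is an assumption. The thresholds mirror the memory example above.

```python
SEVERITY_ORDER = ["Critical", "Warning", "Info"]  # most severe first


def threshold_severity(value, thresholds):
    """Threshold mode: return the most severe level the value exceeds, else None."""
    for level in SEVERITY_ORDER:
        limit = thresholds.get(level)
        if limit is not None and value > limit:
            return level  # stop at the first (most severe) match
    return None


def missing_series(previous, current):
    """Data missing mode: series seen last cycle that are absent this cycle."""
    return sorted(set(previous) - set(current))


def data_exists(result):
    """Data exists mode: any returned series means the rule fires."""
    return len(result) > 0


thresholds = {"Warning": 80, "Critical": 90}
print(threshold_severity(75, thresholds))
print(threshold_severity(85, thresholds))
print(threshold_severity(95, thresholds))
print(missing_series({"host-a", "host-b"}, {"host-a"}))
print(data_exists([{"service": "mon", "value": 87.5}]))
```

Checking levels from most to least severe is what makes a jump from below 80% to above 90% produce only a Critical event.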

In threshold evaluation mode, Recovery supports multiple strategies. For an initial trial, you can ignore the details.

Alert rule effective configuration

The last sections are detection frequency and timing. Important fields have tooltips. A few notes:

  • Detection frequency: a cron expression that, unlike Linux crontab, includes a leading seconds field. Format: second minute hour day month weekday. For example, 1 * * * * * runs once per minute, at second 1 of every minute. The @every 60s shorthand is also supported.
  • Custom fields: similar to custom labels, but labels are usually dimensions for filtering, while custom fields are attributes such as Runbook or Dashboard links.
  • Associated query: query related data when alerting. In Data exists mode, associated queries can also capture recovery values.
  • Note description: similar to the description field often configured in Prometheus annotations.
  • Workspace: where generated alert events should be delivered. In the workspace, dispatch and noise-reduction rules decide how to notify people.
  • Repeat notification: if an event has not recovered, monitedge can repeatedly generate alert events every configured number of seconds and send them to Flashduty SaaS. The maximum repeat count is configurable. Not every generated event necessarily triggers notification; notification still depends on Flashduty noise-reduction configuration.
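To make the six-field detection frequency format concrete, here is a deliberately small matcher. It is a sketch under stated assumptions, not the parser monitedge uses: it handles only "*", single numbers, and comma lists, requires every field to match, numbers weekdays with 0 as Sunday, and accepts the @every shorthand only in whole seconds.

```python
from datetime import datetime


def field_matches(field, value):
    """Match one cron field; supports only "*", numbers, and comma lists."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr, moment):
    """Check a 6-field expression: second minute hour day month weekday."""
    fields = expr.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields: second minute hour day month weekday")
    cron_weekday = (moment.weekday() + 1) % 7  # assumed convention: 0 = Sunday
    values = [moment.second, moment.minute, moment.hour,
              moment.day, moment.month, cron_weekday]
    return all(field_matches(f, v) for f, v in zip(fields, values))


def every_seconds(expr):
    """Parse the "@every 60s" shorthand into seconds (seconds unit only)."""
    prefix, _, duration = expr.partition(" ")
    if prefix != "@every" or not duration.endswith("s"):
        raise ValueError(f"unsupported shorthand: {expr!r}")
    return int(duration[:-1])


print(cron_matches("1 * * * * *", datetime(2024, 1, 1, 12, 30, 1)))
print(cron_matches("1 * * * * *", datetime(2024, 1, 1, 12, 30, 0)))
print(every_seconds("@every 60s"))
```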

Viewing Alerts

View alerts

After an alert event is generated, the alert rule state becomes Triggered. Click Triggered to see the generated alert. You can also see the related incident in the incident list. Delivery is then controlled by dispatch rules in the Flashduty workspace, so we will not cover it here.

Video Tutorial

If the workflow is still unclear, watch the video tutorial. It contains detailed operation steps and can help you get started quickly.

Flashduty alerting video tutorial
