Alertmanager

What is it?

Alertmanager is the service (and software) in charge of collecting, de-duplicating and sending notifications for alerts across WMF infrastructure. It is part of the Prometheus ecosystem, and therefore Prometheus itself has native support to act as an Alertmanager client. The alerts dashboard, implemented by Karma, can be reached at https://alerts.wikimedia.org/. As of Jan 2021 the dashboard is available to SSO users only; however, a read-only version is possible as well.

Alertmanager is being progressively rolled out as the central place where all alerts are sent; the implementation is done in phases according to the alerting infrastructure roadmap. As of Jan 2021 LibreNMS has been fully migrated, with more services to come.

Alertmanager production deployment in Jan 2021

User guide

Onboarding

This section guides you through onboarding on Alertmanager. The first step is understanding what you'd like to happen to alerts that come in (alerts are notifications in Alertmanager parlance). In other words, alerts are routed according to their team and severity labels. Consider the following routing examples for alerts with a fictional team=a-team label:

  • Alerts with label severity=critical will notify #a-team on IRC, and email a-team@
  • Alerts with label severity=warning will notify #a-team on IRC
  • Alerts with label severity=task will create tasks in the #a-team Phabricator project

The resulting routing configuration matches first on the team label and then routes according to severity:

   # A-team routing
   - match:
       team: a-team
     routes:
       - match:
           severity: critical
         receiver: a-ircmail
       - match:
           severity: warning
         receiver: a-irc
       - match:
           severity: task
         receiver: a-task

Each receiver instructs Alertmanager on what to do with the alert; for the example above we would have:

   - name: 'a-ircmail'
     webhook_configs:
       - url: 'http://.../a-team'
     email_configs:
       - to: 'a-team@...'
   - name: 'a-irc'
     webhook_configs:
       - url: 'http://.../a-team'
   - name: 'a-task'
     webhook_configs:
       - url: 'http://.../alerts?phid=<phabricator_project_id>'


The routing tree can be explored and tested using the online routing tree editor. The routing configuration is managed by Puppet and changes are relatively infrequent: onboarding/offboarding teams, changing email addresses, etc.
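
For reference, per-team subtrees like the one above typically nest under a top-level route in alertmanager.yml. The following is a minimal sketch only, with illustrative receiver names and grouping settings rather than the actual Puppet-managed production configuration:

   route:
     receiver: default            # illustrative catch-all for alerts no subtree claims
     group_by: ['alertname', 'team']
     routes:
       # A-team routing (as above)
       - match:
           team: a-team
         routes:
           - match:
               severity: critical
             receiver: a-ircmail
   receivers:
     - name: default
       webhook_configs:
         - url: 'http://.../default'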

Creating alerts

With alert routing set up, you can start creating alerts for Alertmanager to handle. Alerts are defined as Prometheus alerting rules: the alert's metric expression is evaluated periodically and all metrics matching the expression are turned into alerts. Consider the following example alert on etcd request latencies:

   groups:
   - name: etcd
     rules:
     - alert: HighRequestLatency
       expr: instance_operation:etcd_request_latencies_summary:avg5m > 50000
       for: 5m
       labels:
         severity: critical
         team: sre
       annotations:
         summary: "etcd request {{ $labels.operation }} high latency"
         description: "etcd is experiencing high average five-minute latency for {{ $labels.operation }}: {{ $value }}ms"
         dashboard: https://...
         runbook: https://...

The example defines an alert named HighRequestLatency based on the instance_operation:etcd_request_latencies_summary:avg5m metric. When the expression yields results for more than five minutes, an alert is fired for each metric returned by the expression. Each alert carries the labels listed in the rule, in addition to the result's metric labels, and these labels are used to route the alert. The alert's annotations provide guidance to the humans handling the alert; by convention the following annotations are used:

  • summary - Short description of the problem, used where brevity is needed (e.g. IRC)
  • description - A more extensive description of the problem, including more details
  • dashboard - A link to the dashboard for the service/problem/etc
  • runbook - A link to the service's runbook to follow, ideally linking to the specific alert

Annotations and labels can be templated as showcased above; this feature is quite useful for making full use of Prometheus' multi-dimensional data model. The Prometheus template examples and template reference are good places to get started.
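
As an illustration, a hypothetical annotation block combining label and value templating could look like the following (the wording and the seconds unit are made up for the example; {{ $labels.* }}, {{ $value }} and the humanize function are standard Prometheus template constructs):

   annotations:
     summary: "High latency for {{ $labels.operation }} on {{ $labels.instance }}"
     description: "{{ $labels.instance }} is averaging {{ $value | humanize }}s for {{ $labels.operation }} requests over the last five minutes"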

It is worth noting at this point one key difference between Icinga's alert model and Prometheus': Icinga knows about all possible alerts that might fire, whereas Prometheus evaluates an expression. The evaluation might result in one or more alerts, depending on the expression's results; there's no explicit list of all possible label combinations for all alerts.
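
For example, a single hypothetical disk-space rule like the one below can fire any number of alerts at once, one per instance and mountpoint returned by the expression (the metric names come from node_exporter; the rule itself is illustrative, not a production alert):

   - alert: DiskSpaceLow
     expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
     for: 15m
     labels:
       severity: warning
       team: sre
     annotations:
       summary: "Less than 10% disk space left on {{ $labels.instance }}:{{ $labels.mountpoint }}"

Each resulting alert carries its own instance and mountpoint labels, so there is no fixed, pre-declared set of checks to enumerate up front.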

Software stack

When talking about the Alertmanager stack as a whole, it is useful to list its components as deployed at the Wikimedia Foundation, namely the following software:

Notifications

As of Jan 2021, Alertmanager supports the following notification methods:

  • email - sent by Alertmanager itself
  • IRC - via the jinxer-wm bot on Freenode
  • phabricator - through the @phaultfinder user
  • pages - sent via Splunk Oncall (formerly known as VictorOps)

Notification preferences are set per team and are based on the alert's severity (respectively the team and severity labels attached to the alert).