Alertmanager
What is it?
Alertmanager is the service (and software) in charge of collecting, de-duplicating and sending notifications for alerts across WMF infrastructure. It is part of the Prometheus ecosystem, and Prometheus itself therefore has native support for acting as an Alertmanager client. The alerts dashboard, implemented by Karma, can be reached at https://alerts.wikimedia.org/. As of Jan 2021 the dashboard is available to SSO users only, although a read-only version is possible as well.
Alertmanager is being progressively rolled out as the central place where all alerts are sent; the implementation is done in phases according to the alerting infrastructure roadmap. As of Jan 2021 LibreNMS has been fully migrated, with more services to come.
User guide
Onboard
This section guides you through onboarding onto Alertmanager. The first step is understanding what you'd like to happen to alerts that come in (alerts are notifications in AM parlance). In other words, alerts are going to be routed according to their team and severity labels. Consider the following routing examples for alerts with a fictional team=a-team label:
- Alerts with label severity=critical will notify #a-team on IRC, and email a-team@
- Alerts with label severity=warning will notify #a-team on IRC
- Alerts with label severity=task will create tasks in the #a-team Phabricator project
You'll have a different receiver based on the notifications you'd like to send out. Each receiver instructs Alertmanager on what to do with the alert; for the example above we would have:
- name: 'a-ircmail'
  webhook_configs:
    - url: 'http://.../a-team'
  email_configs:
    - to: 'a-team@...'
- name: 'a-irc'
  webhook_configs:
    - url: 'http://.../a-team'
- name: 'a-task'
  webhook_configs:
    - url: 'http://.../alerts?phid=<phabricator_project_id>'
The resulting routing configuration first matches on team and then routes according to severity to select a receiver:
# A-team routing
- match:
    team: a-team
  routes:
    - match:
        severity: critical
      receiver: a-ircmail
    - match:
        severity: warning
      receiver: a-irc
    - match:
        severity: task
      receiver: a-task
The routing tree can be explored and tested using the online routing tree editor. The routing configuration is managed by Puppet and changes are relatively infrequent: on/off boarding teams, changing emails, etc. For a practical example see the patch to add Traffic team alerts.
Create alerts
With alert routing set up, you can start creating alerts for Alertmanager to handle. Alerts are defined as Prometheus alerting rules: the alert's metric expression is evaluated periodically and all metrics matching the expression are turned into alerts. Consider the following example alert on etcd request latencies:
groups:
  - name: etcd
    rules:
      - alert: HighRequestLatency
        expr: instance_operation:etcd_request_latencies_summary:avg5m > 50000
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "etcd request {{ $labels.operation }} high latency"
          description: "etcd is experiencing high average five-minute latency for {{ $labels.operation }}: {{ $value }}ms"
          dashboard: https://...
          runbook: https://...
The example defines an alert named HighRequestLatency based on the instance_operation:etcd_request_latencies_summary:avg5m metric. When the expression yields results for more than five minutes, an alert is fired for each metric returned by the expression. Each alert will have the rule's labels attached, in addition to the result's metric labels; these labels are used for routing the alert. The alert's annotations are used to provide guidance to humans handling the alert, by convention using the following:
- summary: Short description of the problem, used where brevity is needed (e.g. IRC)
- description: A more extensive description of the problem, including more details
- dashboard: A link to the dashboard for the service/problem/etc
- runbook: A link to the service's runbook to follow, ideally linking to the specific alert
Annotations and labels can be templated as showcased above; this feature is quite useful for making full use of Prometheus' multi-dimensional data model. The Prometheus template examples and template reference are good places to get started.
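As a small sketch of what templating enables (the label names below are hypothetical and not taken from a real alert), a single annotation can adapt to whichever series triggered the alert, and built-in template functions such as humanize can make values easier to read:

annotations:
  # hypothetical templated annotations; $labels and $value refer to the firing series
  summary: "High {{ $labels.operation }} latency on {{ $labels.instance }}"
  description: "Average {{ $labels.operation }} latency is {{ $value | humanize }}ms (threshold: 50000ms)"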
It is worth noting at this point one key difference between Icinga's alert model and Prometheus': Icinga knows about all possible alerts that might fire, whereas Prometheus evaluates an expression. The evaluation might result in one or more alerts, depending on the expression's results; there's no explicit list of all possible labels combinations for all alerts.
Alerting rules are committed to the operations/alerts repository and deployed automatically by Puppet. When writing alerting rules make sure to include unit tests for the rules as per the Prometheus documentation: unit tests are run automatically by CI, or locally via tox (promtool needs to be installed as well; it is part of Prometheus). To test an alert's expression you can also evaluate it at https://thanos.wikimedia.org.
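As a rough sketch of such a unit test for the HighRequestLatency rule above (file names and series values are illustrative; the exact layout in operations/alerts may differ), a promtool test file feeds the rule synthetic series and asserts on the resulting alert:

# etcd_test.yaml -- assumes the rule above is saved as etcd.yaml
# run with: promtool test rules etcd_test.yaml
rule_files:
  - etcd.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # latency stays above the 50000 threshold for the whole test window
      - series: 'instance_operation:etcd_request_latencies_summary:avg5m{operation="get"}'
        values: '60000x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighRequestLatency
        exp_alerts:
          - exp_labels:
              operation: get
              severity: critical
              team: sre
            exp_annotations:
              summary: "etcd request get high latency"
              description: "etcd is experiencing high average five-minute latency for get: 60000ms"
              dashboard: https://...
              runbook: https://...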
Grafana alerts
It is possible to send Grafana notifications to Alertmanager and get notified accordingly. While using Grafana for alerting is supported, the recommended way (assuming your metrics are in Prometheus and not Graphite) to manage your alerts is to commit Prometheus alerting rules to git as mentioned in the section above.
To configure a new alert follow the instructions below:
- Edit the panel and select the "Alert" tab (dashboards with template variables are not supported in alerts as per the upstream issue), then "create alert".
- Fill in the rule name; this is the alert's name showing up on the alerts dashboard: alerts with the same name and different labels will be grouped together. A useful convention for alert names is to be symptom-oriented and CamelCased without spaces, see also the examples above.
- The "evaluate every" field must be set to "1m" for Alertmanager's "alert liveness" logic to work, while the "for" field indicates how long a threshold must be breached before the alert fires.
- Select the conditions for the alert to fire, see also Grafana's create alert documentation.
- In the notifications section, add "AlertManager". The "message" text area corresponds to the alert's summary label and is used as a short but indicative text about the alert (e.g. it will be displayed on IRC alongside the alert's name). In this field you can use templated variables from the alert's expression as per the Grafana documentation.
- Add the alert's tags: these must contain at least team and severity for proper routing by Alertmanager (see also the section above for a detailed description). The dashboard's panel will be linked automatically as the alert's "source" and is available e.g. both in email notifications and on the alerts dashboard.
Silences & acknowledgements
In Alertmanager a silence is used to mute notifications for all alerts matching the silence's labels. Unlike Icinga, silences exist independently of the alerts they match: you can create a silence for alerts that have yet to fire (this is useful, for example, when turning up hosts and/or services not yet in production).
To create a new silence select the crossed bell at the top right of https://alerts.wikimedia.org to bring up the silence creation form. Then add the matching label names and their values, the silence's duration (hint: you can use the mouse wheel to change the duration's hour/day), and a comment, then hit preview. If there are firing alerts that match they will be displayed in the preview; finally hit submit. At the next interface refresh the alert will be gone from the list of active alerts.
The silence form is also available pre-filled via each alert group's three vertical dots, and via the alert's duration dropdown as illustrated below. When using the pre-filled silence form make sure to check the labels and add/remove labels as intended.
Within Alertmanager there is no concept of acknowledgement per se, however any silence with a comment starting with ACK! will be considered an acknowledgement. Such silences are periodically checked and their expiration extended until no matching alerts are firing anymore. The acknowledgement functionality is also available from the UI via the "tick mark" next to each alert group; clicking the button will acknowledge the whole alert group. For more information see https://github.com/prymitive/kthxbye#current-acknowledgment-workflow-with-alertmanager.
FAQ
I'm part of a new team that needs onboarding to Alertmanager, what do I need to do?
Broadly speaking, the steps to be onboarded to AM are the following:
- Pick a name for your team; this is the team label value to be used in your alerts. A short but identifiable name is recommended.
- Decide how different alert severities should reach your team (e.g. critical alerts should go to IRC channel #team and email team@). This is achieved by routing alerts as described in the onboard section; a sketch of such a routing change is shown below.
- Start sending alerts to Alertmanager! Depending on the preferred method you can create Prometheus-based alerts and/or send alerts from Grafana.
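As a rough illustration only (the team name, channel and receiver below are hypothetical, and the actual change is made through the Puppet-managed Alertmanager configuration), the routing part of onboarding boils down to adding a receiver and a team route:

receivers:
  - name: 'b-team-irc'
    webhook_configs:
      - url: 'http://.../b-team'    # hypothetical IRC relay endpoint

route:
  routes:
    - match:
        team: b-team
      routes:
        - match:
            severity: critical
          receiver: 'b-team-irc'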
Software stack
When talking about the Alertmanager stack as a whole it is useful to list its components as deployed at the Wikimedia Foundation, namely the following software:
- Alertmanager, the daemon actually in charge of handling alerts and sending out notifications
- Karma, the dashboard/UI for Alertmanager alerts; it powers https://alerts.wikimedia.org
- kthxbye, which implements the "acknowledgement" functionality for alerts
- alertmanager-irc-relay, which forwards alerts to IRC channels
- prometheus-icinga-exporter, a compatibility shim to forward active Icinga alerts to Alertmanager; it also provides Prometheus-style metrics for Icinga
Notifications
As of Jan 2021, Alertmanager supports the following notification methods:
- email - sent by Alertmanager itself
- IRC - via the jinxer-wm bot on Libera.chat
- phabricator - through the @phaultfinder user
- pages - sent via Splunk Oncall (formerly known as VictorOps)
Notification preferences are set per team and based on the alert's severity (the team and severity labels attached to the alert, respectively).