
Prometheus: Difference between revisions

From Wikitech-static
imported>Jcrespo
imported>Filippo Giunchedi
 
(4 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{Navigation Wikimedia infrastructure|expand=logging}}
'''Prometheus''' is an open source software ecosystem for monitoring and alerting, with a focus on reliability and simplicity. See also upstream's [https://prometheus.io/docs/introduction/overview/ Prometheus overview] and [https://prometheus.io/docs/introduction/faq/ Prometheus FAQ].


== What is it? ==
Distinguishing features of Prometheus compared to other metrics systems include:


; multi-dimensional data model
Line 18: Line 14:
: Prometheus is primarily based on a ''pull'' model, in which the prometheus server has a list of ''targets'' it should ''scrape'' metrics from. The pull protocol is HTTP based and simply put, the target returns a list of "<metric> <value>". Pushing metrics is supported too, see also http://prometheus.io/docs/instrumenting/pushing/.
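
As a rough illustration of the pull model, the sketch below renders a few samples in the plain-text form a target returns when scraped. The metric names and values are made up for the example, and only the simplest <tt>metric{labels} value</tt> lines are covered (no label escaping, no <tt>HELP</tt>/<tt>TYPE</tt> comments), so treat it as a sketch rather than a full implementation of the upstream exposition format.

```python
def render_sample(name, value, labels=None):
    """Render one sample as a '<metric>{labels} <value>' line."""
    if labels:
        # Sort labels so the output is deterministic.
        pairs = ",".join('%s="%s"' % (k, v) for k, v in sorted(labels.items()))
        return "%s{%s} %s" % (name, pairs, value)
    return "%s %s" % (name, value)


def render_metrics(samples):
    """Render (name, labels, value) tuples as a scrape response body."""
    return "\n".join(render_sample(n, v, l) for n, l, v in samples) + "\n"


if __name__ == "__main__":
    # Hypothetical samples; a real target would report live values.
    body = render_metrics([
        ("demo_requests_total", {"method": "get"}, 42),
        ("demo_up", None, 1),
    ])
    print(body, end="")
```

A real service would normally rely on one of the Prometheus client libraries instead of formatting lines by hand.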


After the Prometheus proof of concept (see [[User:Filippo_Giunchedi/Prometheus_POC]]) had been running in Labs for some time, the Prometheus deployment was extended to production during FQ1 2016-2017, as outlined in the [[mw:Wikimedia_Engineering/2016-17_Q1_Goals#Technical_Operations|WMF Engineering 2017 Goals]].


== Service ==

=== Server location ===
The various Prometheus servers are logically separated, though physically they can share one or multiple hosts. As of April 2022, we run Prometheus on baremetal hardware in [[Eqiad data center|Eqiad]] and [[Codfw data center|Codfw]], and on [[Ganeti]] VMs in all POPs / caching centers.

=== Instances ===
As of April 2022 the list of all Prometheus instances includes:

; analytics: All things analytics from the Hadoop cluster and similar
; ext: Collect external, potentially untrusted, data
; global: Global instance, see below
; k8s: Main/production k8s cluster
; k8s-mlserve: k8s for machine learning
; k8s-staging: Staging k8s cluster
; ops: The biggest instance where most metrics are collected
; services: Dedicated instance for ex services team / cassandra metrics

=== Federation and multiple DCs ===
The use case for a cross-DC view of metrics used to be covered by the "global" Prometheus instance. This instance is now deprecated, with this use case (and much more) covered by [[Thanos]].

== Architecture ==
Each Prometheus server is configured to ''scrape'' a list of ''targets'' (i.e. HTTP endpoints) at a certain frequency, in our case starting at 60s. All metrics are stored on the local disk with a per-instance multi-week retention period.

All targets to be scraped are grouped into ''jobs'', depending on the purpose that those targets serve. For example, the job that scrapes all host-level data for a given location using <tt>node-exporter</tt> is called <tt>node</tt>, and each target is listed as <tt>hostname:9100</tt>. Similarly there are jobs for varnish, mysql, etc.

Each Prometheus server is meant to be stand-alone and to poll targets in the same failure domain as the server itself as appropriate (e.g. the same site, the same vlan and so on). For example, this keeps monitoring local to the site and avoids spotty metrics during cross-site connectivity blips. (See also Federation)

[[File:Prometheus_single_server.png]]


=== Exporters ===
The endpoint being polled by the Prometheus server and answering the <code>GET</code> requests is typically called an ''exporter'', e.g. the host-level metrics exporter is ''node-exporter''.

Each exporter serves the current snapshot of metrics when polled by the Prometheus server; no metric history is kept by the exporter itself. Further, the exporter usually runs on the same host as the service or host it is monitoring.

=== Storage ===
Why just stand-alone Prometheus servers with local storage and not clustered storage? The idea behind a single Prometheus server is one of reliability: a monitoring system must be ''more'' reliable than the systems it is monitoring. It is certainly easier to get local storage right and reliable than clustered storage, which is especially important when collecting operational metrics.


See also [https://prometheus.io/docs/operating/storage/ prometheus storage documentation] for a more in-depth explanation and storage space requirements.


=== High availability ===
With local storage being the basic building block, we can still achieve high availability by running more than one server in parallel, each configured the same and polling the same set of targets. Queries for data can be routed via LVS in an active/standby fashion; under normal circumstances the load is shared (i.e. active/active).


[[File:Prometheus_HA_server.png]]


=== Backups ===
For efficiency reasons, prometheus spools chunks of datapoints in memory for
each metric before flushing them to disk. This makes it harder to perform
Line 85: Line 70:
storage files as-is by archiving its storage directory with tar before regular
(bacula) backups. Since the backup is being done online it will result in some
inconsistencies; upon restoring the backup, Prometheus will run crash recovery on its storage at startup.


To back up a consistent, clean state, Prometheus currently needs to be shut down gracefully. When running an active/standby configuration, the backup can therefore be taken on the standby Prometheus to minimize its impact. Note that the shutdown will result in gaps in the standby Prometheus server's data for the duration of the shutdown.
 
=== Failure recovery ===
In the event of a Prometheus server having unusable local storage (disk failed, filesystem failed, corruption, etc.), failure recovery can take the form of:


* start with empty storage: this means a complete loss of metric history for the local server; the history fully recovers once the metric retention period has passed.
Line 105: Line 86:
* copy data from a similar server: when deployed in pairs, it is possible to copy/rsync the storage directory onto the failed server; this will likely result in gaps in the recent history though (see also Backups).


== Federation and multiple DCs ==
Each Prometheus server is able to act as a target to another Prometheus server by means of [https://prometheus.io/docs/operating/federation/ Prometheus federation]. Our use case for this feature is primarily hierarchical federation, namely to have a 'global' Prometheus that aggregates datacenter-level metrics from the Prometheus in each datacenter.

[[File:Prometheus federation.png]]

The global instance is what we would normally use in grafana as the "datasource" for dashboards to get an overview of all sites and aggregated metrics. To drill down further and get more details it is possible to use the datacenter-local datasource and dashboard.

=== Server location ===
In the diagram above the various Prometheus servers are logically separated, though physically they can share one/multiple machines. As of Nov 2016 Prometheus dc-local runs in two VMs for each of eqiad/codfw (instance named "ops") and we're in the process of provisioning real hardware.

An open question at this time is where to host the dc-local Prometheus servers for caching centers, essentially two options:

# Local to the site
# Remote, e.g. codfw polling ulsfo and eqiad polling esams

The local option offers some advantages since all sites are logically the same and all polling for monitoring purposes is kept local to the site and reflects our current Ganglia deployment. Only the global instance would reach out to remote sites and thus could be affected by cross-DC network unavailability.

This is significant especially during outages: the global instance would show a drop in global aggregates while the dc-local instance can keep collecting high-resolution data from site-local machines.

Disadvantages of the local option include (as of Nov 2016) running Prometheus on the bastion for sites where we lack internal dedicated machines (e.g. ulsfo) alongside other services like tftp/installserver. In addition, running Prometheus on a single bastion would provide no redundancy when the bastion is down.

== Service Discovery ==
Prometheus supports different kinds of discovery through its [https://prometheus.io/docs/operating/configuration/ configuration]. For example, <tt>role::prometheus::labs_project</tt> implements auto-discovery of all instances for a given labs project. <code>file_sd_config</code> is used to continuously monitor a set of configuration files for changes and the script <code>prometheus-labs-targets</code> is run periodically to write the list of instances to the respective configuration file. The <code>file_sd</code> files are reloaded automatically by Prometheus, so new instances will be auto-discovered and have their instance-level metrics collected.

While file-based service discovery works, Prometheus also supports higher-level discovery, for example for Kubernetes (see also <tt>profile::prometheus::k8s</tt>).
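
To make the file-based discovery mechanism more concrete, here is a minimal sketch of a script that writes a target file in the JSON layout read by <code>file_sd</code> (a list of groups, each with <code>targets</code> and optional <code>labels</code>). It is not the actual <code>prometheus-labs-targets</code> script: the instance names, the label and the output file name are assumptions made for illustration.

```python
import json
import os
import tempfile


def build_target_groups(hostnames, port, labels):
    """Build the list-of-groups structure used by file_sd JSON files."""
    targets = sorted("%s:%d" % (h, port) for h in hostnames)
    return [{"targets": targets, "labels": dict(labels)}]


def write_targets(path, groups):
    """Write the target file atomically so a reload never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary name does not end in .json, so a '*.json' glob ignores it.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    os.replace(tmp, path)


if __name__ == "__main__":
    # Hypothetical instances of a hypothetical project, written to a scratch dir.
    out_dir = tempfile.mkdtemp()
    groups = build_target_groups(["instance-b", "instance-a"], 9100, {"project": "demo"})
    write_targets(os.path.join(out_dir, "node_demo.json"), groups)
    print(json.dumps(groups))
```

Writing to a temporary file and renaming it into place is a design choice of this sketch: since the files are reloaded automatically, it avoids Prometheus reading a half-written file.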
 
== Adding new metrics ==
In general Prometheus' model is pull-based. In practical terms that means that once metrics are available over HTTP somewhere on the network with the methods described below, Prometheus itself should be instructed to poll for metrics via its configuration (more specifically, a ''job'' as described in [https://prometheus.io/docs/concepts/jobs_instances/ upstream documentation]). Within WMF's Puppet the Prometheus configuration lives inside its respective instance profile, for example <code>modules/profile/manifests/prometheus/ops.pp</code> is often the right place to add new jobs.

=== Direct service instrumentation ===
The most benefit from service metrics is obtained when services are directly instrumented with one of the Prometheus client libraries, e.g. the [https://github.com/prometheus/client_python Python client].
Metrics are then exposed via HTTP/HTTPS, commonly at <tt>/metrics</tt>, on the service's HTTP(S) port (in the common case) or a separate port if the service doesn't talk HTTP to begin with.

=== Service exporters ===
For cases where services can't be directly instrumented (aka whitebox monitoring), a sidekick application (an <tt>exporter</tt>) can be run alongside the service; it queries the service using whatever mechanism is available and exposes Prometheus metrics via the client. This is the case for example for [https://github.com/jonnenauha/prometheus_varnish_exporter varnish_exporter] parsing <tt>varnishstat -j</tt> or [https://github.com/neezgee/apache_exporter apache_exporter] parsing apache's <tt>mod_status</tt> page.


=== Machine-level metrics ===
Another class of metrics is all those related to the machine itself rather than a particular service. Those involve calling a subprocess and parsing the result, often in a cronjob. In these cases the simplest thing to do is drop plaintext files on the machine's filesystem for <tt>node-exporter</tt> to pick up and expose the metrics on HTTP. This mechanism is named <tt>textfile</tt> and for example the python client has support for it, e.g. [https://github.com/prometheus/client_python#node-exporter-textfile-collector sample textfile collector usage]. This is most likely the mechanism we could use to replace most of the custom collectors we have for Diamond.
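
The <tt>textfile</tt> mechanism can be sketched without any client library, assuming <tt>node-exporter</tt> is configured to read <tt>*.prom</tt> files from a directory. The helper name, the metric names and the directory used below are hypothetical; a periodic job would compute real values and point at the directory configured for the textfile collector.

```python
import os
import tempfile


def write_textfile(directory, name, samples):
    """Atomically write samples to <directory>/<name>.prom and return the path."""
    lines = ["%s %s" % (metric, value) for metric, value in sorted(samples.items())]
    # Write to a temporary file first so a partial file is never exposed.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    # mkstemp creates the file as 0600; make it readable by the exporter's user.
    os.chmod(tmp, 0o644)
    final = os.path.join(directory, name + ".prom")
    os.replace(tmp, final)
    return final


if __name__ == "__main__":
    # Scratch directory for the demo; the real collector directory is site-specific.
    demo_dir = tempfile.mkdtemp()
    path = write_textfile(demo_dir, "demo", {"demo_last_run_timestamp_seconds": 1700000000})
    with open(path) as f:
        print(f.read(), end="")
```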


=== Ephemeral jobs (Pushgateway) ===
Yet another case involves service-level ephemeral jobs that are not quite long-lived enough to be queried via HTTP. For those jobs there's a push mechanism to be used: metrics are pushed to [https://github.com/prometheus/pushgateway/blob/master/README.md Prometheus' pushgateway] via HTTP and subsequently scraped by Prometheus once a minute from the gateway itself.
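
A push from an ephemeral job boils down to one HTTP request whose body is in the usual text format, sent to the gateway's <tt>/metrics/job/&lt;name&gt;</tt> path. The sketch below only builds that request using the standard library; the gateway address, job name and metric are placeholders, and the Python client library offers a ready-made helper for the same purpose.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway, job, body):
    """Build (but do not send) an HTTP PUT that replaces the metrics of `job`."""
    url = "%s/metrics/job/%s" % (gateway.rstrip("/"), urllib.parse.quote(job, safe=""))
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


if __name__ == "__main__":
    # Hypothetical gateway address, job name and metric.
    req = build_push_request(
        "http://pushgateway.example:9091",
        "demo_batch",
        "demo_last_success_timestamp_seconds 1700000000\n",
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would actually send the request.
```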


Line 181: Line 114:
{{Warning|1=When using TLS for metric scraping, make sure the host on the certificate and the one configured match, or you will get a TLS handshake error. By default, puppet sets just the hostname as the target of monitoring; you are likely to want to add the option <code>hosts_only => false</code> to use the fully qualified domain name as target.}}


=== Network probes (blackbox exporter) ===
As of Jul 2022 it is possible to run so-called network blackbox probes via Prometheus. These probes are run from the Prometheus hosts themselves, target network services, and are used to assert whether the service works from a user/client perspective (hence the "blackbox" terminology).

If your service is part of <tt>service::catalog</tt> in puppet then adding network probes is trivial in most cases. Add a <tt>probes</tt> stanza to your service; for example, probing <tt>/?spec</tt> and testing for a 2xx response is achieved by the following:

  probes:
    - type: http
      path: /?spec

Refer to the [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/wmflib/types/service/probe.pp Wmflib::Service::Probe type documentation] for more advanced use cases.

Custom checks/probes defined outside <tt>service::catalog</tt> can be implemented in Puppet via the <tt>prometheus::blackbox::check::{http,tcp,icmp}</tt> abstractions. They will deploy both network probes and related alerts (e.g. when the probe is unsuccessful, or the TLS certificates are about to expire), by default probing both ipv4 and ipv6 address families. The probe's usage largely depends on the use case, ranging from a simple example like the one below:

  # Probe the phabricator.wikimedia.org vhost, using TLS, and talk to the host(s) this check is deployed to
  prometheus::blackbox::check::http { 'phabricator.wikimedia.org':
      severity => 'page',
  }

to more complex use cases like VRTS, checking responses for specific text, on ipv4, etc:

  prometheus::blackbox::check::http { 'ticket.wikimedia.org':
      team               => 'serviceops-collab',
      severity           => 'warning',
      path               => '/otrs/index.pl',
      port               => 1443,
      ip_families        => ['ip4'],
      force_tls          => true,
      body_regex_matches => ['wikimedia'],
  }

Check [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/blackbox/check/http.pp the http check documentation] for more information.

It is recommended to pick the highest-level check possible for your service (in other words, prefer HTTP over TCP for example) to improve the signal-to-noise ratio.

== Example use ==
MySQL monitoring is performed by running <code>prometheus-mysqld-exporter</code> on the database machine to be monitored. Metrics are exported via HTTP on port <code>9104</code> and fetched by the Prometheus server(s). To preview what metrics are being collected, a fetch can be simulated with:
<pre>curl -s localhost:9104/metrics | grep -v '^#'</pre>
Grafana dashboards:

* Per group / shard / role overview: https://grafana.wikimedia.org/d/000000278/mysql-aggregated
* Per server drilldown: https://grafana.wikimedia.org/d/000000273/mysql

= HOWTO =

== Global view (Thanos) web interface ==
As of Jul 2020 the [[Thanos]] web interface is available at https://thanos.wikimedia.org. This interface offers a global view over Prometheus data and should be preferred for new use cases. Please consult the [[Thanos]] page to find out more.

== Access Prometheus web interface ==
Use https://thanos.wikimedia.org to run Prometheus queries across all Prometheus instances in all sites. The old method of SSH port forwarding still works but has been deprecated and replaced by the Thanos web interface. In short, for example for the 'ops' instance (port 9900) in prometheus codfw: <tt>ssh -L9900:localhost:9900 prometheus2003.codfw.wmnet</tt> then browse http://localhost:9900

To access the Prometheus web interface in beta (deployment-prep), use https://beta-prometheus.wmflabs.org/beta/graph

To access the Prometheus web interface for Cloud Services hardware that uses the cloudmetrics monitoring setup, please follow the instructions at [[Portal:Cloud_VPS/Admin/Monitoring#Accessing_"labs"_prometheus]]

== List metrics with curl ==
One easy way to check what metrics are being collected by Prometheus on a given machine is to request the metrics via HTTP like the Prometheus server does at scrape time, e.g. for node-exporter on port 9100:

  curl -s localhost:9100/metrics
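
When the output of such a request needs further processing, a few lines of Python are enough to skip the <tt>#</tt> comment lines and split each sample into series and value. This is a sketch that assumes samples carry no trailing timestamp and that label values contain no line breaks; the sample text in the demo is made up.

```python
def parse_metrics(text):
    """Return {series: value} for simple exposition lines, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no samples
        # Assumes no timestamp: the value is the last space-separated field.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    # Made-up scrape output, in the shape printed by the curl command.
    demo = (
        "# HELP node_load1 1m load average.\n"
        "# TYPE node_load1 gauge\n"
        "node_load1 0.25\n"
    )
    print(parse_metrics(demo))
```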


== Query cheatsheet ==
Line 241: Line 196:
This way the time will be accurate to the second. If you use the timestamp of the metric or <code>time()</code>, you will get varying times within a minute.


== Aggregate metrics from multiple sites ==

Sometimes it is useful to have an overall view of all sites from where metrics are collected. That's the use case for our 'global' instance of Prometheus, namely to pull metrics from site-local Prometheus instances.

Prometheus' name for this feature is ''federation'', as described in https://prometheus.io/docs/prometheus/latest/federation/ and https://www.robustperception.io/federation-what-is-it-good-for/.

Adding new aggregated metrics to the global instance is composed of two parts:
# Instruct the site-local Prometheus to calculate new aggregated metrics, for example the <tt>ops</tt> instance uses <tt>modules/role/files/prometheus/rules_ops.conf</tt> in Puppet. The format of the file and its best practices are described at https://prometheus.io/docs/practices/rules/
# Instruct the global instance to pick up the newly-created aggregated metrics, via the global instance configuration at <tt>modules/role/manifests/prometheus/global.pp</tt>

== Sync data from an existing Prometheus host ==

When replacing existing Prometheus hosts it is possible to keep existing data by rsync'ing the <tt>metrics</tt> directory from the old host into the new. It is important to first make sure that Puppet has run successfully on the new host (thus Prometheus is configured) and that Prometheus can reach its targets successfully (i.e. the new host is part of <tt>prometheus_nodes</tt> for its site). Once all of that is done the rsync can happen, on the new host:

  puppet agent --disable "copying prometheus data"
  export old_host=<hostname>
  export instance_name=ops
  systemctl stop prometheus@${instance_name}
  su -s /bin/bash prometheus
  rsync -vd ${old_host}::prometheus-${instance_name}/ /srv/prometheus/${instance_name}/metrics/
  # do a first rsync pass in parallel for each subdirectory
  /usr/bin/time parallel -j10 -i rsync -a ${old_host}::prometheus-${instance_name}/{}/ {}/ -- /srv/prometheus/${instance_name}/metrics/*
  # once this is completed stop puppet and prometheus on $old_host as well, and repeat the rsync for a final pass.
  rsync -vd ${old_host}::prometheus-${instance_name}/ /srv/prometheus/${instance_name}/metrics/
  /usr/bin/time parallel -j10 -i rsync -a ${old_host}::prometheus-${instance_name}/{}/ {}/ -- /srv/prometheus/${instance_name}/metrics/*
  # once this is completed you can restart prometheus and puppet on both hosts

== Prometheus host running out of space ==

It might happen that Prometheus hosts get close to running out of space on one of their per-instance filesystems. Assuming the underlying volume group has space available (<tt>lvs</tt> to check what LVs are present and on which VGs, then <tt>vgs</tt> to check the VGs themselves), it is possible to extend the filesystem online (e.g. +25G to the prometheus-foo LV on the vg-hdd VG; remove <tt>--test</tt> once happy):

  lvextend --test --resizefs --size +25G vg-hdd/prometheus-foo

Make sure to:
* Leave some space available on the VG, to handle cases like this in the future if possible
* Extend the filesystem on all prometheus hosts in the same site
* <tt>!log</tt> your actions for easier traceability

=== No space available on the volume group ===

At some point the space on the volume group might be fully allocated (e.g. like on bastions). In this case the emergency remedy is to decrease the Prometheus retention time via <tt>prometheus::server::storage_retention</tt> in Puppet, and restart Prometheus with the new settings.

In the unfortunate case that the filesystem is 100% utilized, it is also possible to manually remove storage "blocks" (i.e. directories) from the <tt>metrics</tt> directory under <tt>/srv/prometheus/INSTANCE</tt>. The directory names are sortable, with each directory representing a maximum of 24h of data.

== Add filesystems for a new instance ==

Until {{Bug|T163692}} is fully resolved, new Prometheus instances require adding LVs to the Prometheus hosts in eqiad/codfw. There are two volume groups (<tt>vg-ssd</tt> or <tt>vg-hdd</tt>) depending on the type of storage.

Set the <tt>instance</tt> and <tt>vg</tt> variables, then the following commands can be used as-is:

  instance=prometheus-NAME
  vg=vg-hdd
 
  mp=/srv/${instance/-//}
  lvcreate -L 50G -n $instance $vg
  mkfs.ext4 /dev/${vg}/${instance}
  install -d -o prometheus -m 750 $mp
  echo "/dev/${vg}/${instance} $mp  ext4  defaults  0 0" >> /etc/fstab
  mount $mp

== Add metrics from a new service ==

Most services which export metrics to Prometheus do so via an HTTP endpoint, running on its own port. This HTTP endpoint can be served by the daemon itself, or by a separate "exporter" process.

Prometheus needs to be told to scrape the HTTP endpoint, which it calls a "target." (A logical grouping of targets is called a "job.") In addition to adding the new job to the Prometheus server, you will need to add a firewall rule exposing the HTTP endpoint.

For example Puppet changes that add new jobs, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/504360 or https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/572141.

== FAQ ==

=== How long are metrics stored in Prometheus? ===
As of June 2020, we have deployed [[Thanos]] for long term storage of metrics. The target retention period for all one-minute metrics is three years, although as of Jul 2022 the one-minute retention has been shortened for capacity reasons (cf. {{bug|T311690}}). The retention target for five-minute and one-hour aggregated datapoints is still set at three years.

=== What are the semantics of rate/irate/increase? ===
These functions generally take a counter metric (i.e. non-decreasing) and return a "value over time". Rate and irate return per-second counts, while increase returns the change over the given interval. See also the [https://promlabs.com/blog/2021/01/29/how-exactly-does-promql-calculate-rates in-depth explanation at promlabs.com].
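
The difference between the three functions can be illustrated with a simplified model over a list of <tt>(timestamp, value)</tt> samples. The sketch below handles counter resets but deliberately skips the extrapolation to the window boundaries that PromQL performs, so real query results will differ slightly; the sample values are made up.

```python
def increase(samples):
    """Total counter growth over (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a restart): count from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second growth between the first and the last sample."""
    return increase(samples) / (samples[-1][0] - samples[0][0])


def irate(samples):
    """Per-second growth computed from the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2:]
    return ((v1 - v0) if v1 >= v0 else v1) / (t1 - t0)


if __name__ == "__main__":
    # One sample per minute, with a counter reset between the 2nd and 3rd sample.
    window = [(0, 100), (60, 160), (120, 20), (180, 80)]
    print(increase(window), rate(window), irate(window))
```

In this model <tt>rate</tt> averages over the whole window while <tt>irate</tt> only looks at the most recent pair of samples, which is why <tt>irate</tt> reacts faster but is noisier.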


== What best practices should we use for label and metric naming ==
=== What best practices should we use for label and metric naming ===
We generally tend to follow the same general guidelines as Prometheus: https://prometheus.io/docs/practices/naming/. Don't hesitate to reach out to Observability with further questions around metric/label naming.


= Use cases =
== Replacing Graphite ==
 
== MySQL ==
MySQL monitoring is performed by running <tt>prometheus-mysqld-exporter</tt> on the database machine to be monitored. Metrics are exported via HTTP on port <tt>9104</tt> and fetched by the Prometheus server(s). To preview what metrics are being collected, a fetch can be simulated with:
<pre>curl -s localhost:9104/metrics | grep -v '^#'</pre>
 
=== Dashboards ===
; Per group / shard / role overview : https://grafana.wikimedia.org/d/000000278/mysql-aggregated
; Per server drilldown : https://grafana.wikimedia.org/d/000000273/mysql
 
= Replacing Graphite =
 
Another use case imaginable for Prometheus is to replace the current Graphite deployment. This task is less "standalone" than replacing Ganglia and therefore more difficult: Graphite is more powerful and used by more people/services/dashboards.
Nevertheless it should be possible to keep Prometheus and Graphite alongside each other and progressively put more data into Prometheus without affecting Graphite users.
The top contributors to data that flows into Graphite as of Aug 2016 are Diamond, Statsd and Cassandra.
== Statsd ==


=== Statsd ===
Statsd traffic for the most part flows from machines to <tt>statsd.eqiad.wmnet</tt> over UDP on port 8125 for aggregation. There are some exceptions (e.g. swift) where statsd aggregation is performed on localhost and then pushed via graphite line-oriented protocol.


Line 351: Line 227:
If you are migrating your service that uses statsd to k8s, see also [[Prometheus/statsd_k8s]]


== Cassandra ==
=== Cassandra ===
Cassandra is hosted on separate Graphite machines due to the number and size of metrics it pushes, particularly in conjunction with Restbase. It should be evaluated separately too, e.g. whether a separate Prometheus instance makes sense. Regarding implementation there are two viable options:
* Scrape JMX cassandra metrics with https://github.com/prometheus/jmx_exporter either externally with a JMX connection or as a "java agent" on the side to cassandra
Line 357: Line 233:
* Add Prometheus java client to creole (https://github.com/eevans/creole)


== JMX ==
=== JMX ===
Prometheus [https://github.com/prometheus/jmx_exporter jmx_exporter] can be used to collect metrics through JMX.


Line 370: Line 246:
This implies that an overly broad blacklist query can still have a non trivial cost.


=== List/inspect existing mbeans ===
==== List/inspect existing mbeans ====
 
Scenario: you want to check JMX MBeans available or generic JVM data in Production from your laptop:
<pre>
<pre>
Line 379: Line 254:
Then Jconsole will be opened and you'll need to select '''Remote Process''', adding the following: '''$hostname$:port''' (don't use localhost, it will not work!)


== Dashboards ==
=== Grafana dashboards ===
Grafana dashboards will need porting from Graphite to Prometheus metrics; this is likely to be the most labor-intensive part since most (all?) dashboards are hand-curated. While it should be possible to programmatically change statsd metric names into prometheus metric names, the query language is different enough to make this impractical except for very basic cases.
 
=== Replacing Watchmouse (CA DX APP) ===
Prometheus is replacing the 3rd party monitoring system we often refer to as "watchmouse" (since rebranded to CA DX APP monitoring)
 
This replacement has been dubbed "pingthing" as a reference to the functionality it has been deployed to replace, essentially static checks of public facing resources.  Pingthing checks are driven by prometheus blackbox exporter.
 
== {{Anchor|HOWTO}}Runbooks ==
 
=== Global view (Thanos) web interface ===
As of Jul 2020 the [[Thanos]] web interface is available at https://thanos.wikimedia.org. This interface offers a global view over Prometheus data and should be preferred for new use cases. Please consult the [[Thanos]] page to find out more.
 
=== Access Prometheus web interface ===
Use https://thanos.wikimedia.org to run Prometheus queries across all Prometheus instances in all sites. The old method of SSH port forwarding still works but has been deprecated and replaced by the Thanos web interface. In short, for example for the 'ops' instance (port 9900) in prometheus codfw: <tt>ssh -L9900:localhost:9900 prometheus2003.codfw.wmnet</tt> then browse http://localhost:9900
 
To access the prometheus web interface in beta (deployment-prep) you use https://beta-prometheus.wmflabs.org/beta/graph
 
To access the prometheus web interface for Cloud Services hardware that are using the cloudmetrics monitoring setup, please follow the instructions at [[Portal:Cloud_VPS/Admin/Monitoring#Accessing_"labs"_prometheus]]
 
=== List metrics with curl ===
One easy way to check what metrics are being collected by prometheus on a given machine is to request the metrics via HTTP like prometheus server does at scrape time, e.g. for node-exporter on port 9100:
 
  curl -s localhost:9100/metrics
 
=== Aggregate metrics from multiple sites ===
The use case for a "global" view of metrics used to be covered by the global Prometheus instance. Said instance is deprecated and this use case (and more) are covered by [[Thanos]].


Grafana dashboards will need porting from Graphite to Prometheus metrics; this is likely to be the most labor-intensive part since most (all?) dashboards are hand-curated.
=== Sync data from an existing Prometheus host ===
While it should be possible to programmatically change statsd metric names into prometheus metric names, the query language is different enough to make this impractical except for very basic cases.
When replacing existing Prometheus hosts it is possible to keep existing data by rsync'ing the <tt>metrics</tt> directory from the old host into the new. It is important to make sure first that Puppet has run successfully on the new host (thus Prometheus is configured) and that Prometheus can reach its targets successfully (i.e. the new host is part of <tt>prometheus_nodes</tt> for its site). Once all of that is done the rsync can happen, on the new host:


= Replacing Watchmouse (CA DX APP) =
  puppet agent --disable "copying prometheus data"
Prometheus is replacing the 3rd party monitoring system we often refer to as "watchmouse" (since rebranded to CA DX APP monitoring)
  export old_host=<hostname>
  export instance_name=ops
  systemctl stop prometheus@${instance_name}
  su -s /bin/bash prometheus
  rsync -vd ${old_host}::prometheus-${instance_name}/ /srv/prometheus/${instance_name}/metrics/
  # do a first rsync pass in parallel for each subdirectory
  /usr/bin/time parallel -j10 -i rsync -a ${old_host}::prometheus-${instance_name}/{}/ {}/ -- /srv/prometheus/${instance_name}/metrics/*
  # once this is completed stop puppet and prometheus on $old_host as well, and repeat the rsync for a final pass.
  rsync -vd ${old_host}::prometheus-${instance_name}/ /srv/prometheus/${instance_name}/metrics/
  /usr/bin/time parallel -j10 -i rsync -a ${old_host}::prometheus-${instance_name}/{}/ {}/ -- /srv/prometheus/${instance_name}/metrics/*
  # once this is completed you can restart prometheus and puppet on both hosts
 
=== Prometheus host running out of space ===
It might happen that Prometheus hosts get close to running out of space on one of their per-instance filesystems. Assuming the underlying volume group has space available (<tt>lvs</tt> to check what LVs are present and on which VGs, then <tt>vgs</tt> to check VGs themselves) then it is possible to extend the filesystem online with (e.g. +25G to the prometheus-foo LV on vg0 VG, remove <tt>--test</tt> once happy).
 
  lvextend --test --resizefs --size +25G vg0/prometheus-foo
 
Make sure to:
* Leave some space available on the VG, to handle cases like this in the future if possible
* Extend the filesystem on all prometheus hosts in the same site
* <tt>!log</tt> your actions for easier traceability
 
=== No space available on the volume group ===
 
At some point the space on volume group might be fully allocated. In this case the emergency remedy is to decrease Prometheus retention time via <tt>prometheus::server::storage_retention</tt> in Puppet, and restart Prometheus with the new settings.
 
In the unfortunate case that the filesystem is 100% utilized it is also possible to manually remove storage "blocks" (i.e. directories) from the <tt>metrics</tt> directory under <tt>/srv/prometheus/INSTANCE</tt>. Sorting the filenames alphabetically will ensure they are sorted chronologically as well.
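To pick which blocks to remove first, the alphabetical ordering can be put to use with a small Python sketch. It only lists candidates and deletes nothing. The helper name <code>oldest_blocks</code> is hypothetical, and the filter assumes Prometheus 2.x storage, where block directories carry 26-character ULID names while other entries such as <tt>wal</tt> do not.

```python
import os


def oldest_blocks(metrics_dir, count):
    """Return the `count` oldest block directory names under metrics_dir."""
    blocks = sorted(
        entry
        for entry in os.listdir(metrics_dir)
        # Assumption: blocks are directories with 26-character ULID names.
        if len(entry) == 26 and os.path.isdir(os.path.join(metrics_dir, entry))
    )
    # ULIDs begin with a timestamp, so name order is also creation order.
    return blocks[:count]
```

For example <code>oldest_blocks("/srv/prometheus/ops/metrics", 3)</code> would name the three oldest blocks of the ops instance; review the list before removing anything by hand.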
 
=== Add filesystems for a new instance ===
Until [[phab:T163692|T163692]] is fully resolved, new Prometheus instances require adding LVs to the Prometheus hosts in eqiad/codfw. When provisioning a new instance refer to <tt>modules/prometheus/files/provision-fs.sh</tt>: add the new instance there and run the script on eqiad/codfw Prometheus hosts.


This replacement has been dubbed "watchrat" as a reference to the functionality it has been deployed to replace, essentially static checks of public facing resources.  Watchrat checks are driven by prometheus blackbox exporter.
=== Add metrics from a new service ===
Most services which export metrics to Prometheus do so via an HTTP endpoint, running on its own port.  This HTTP endpoint can be served by the daemon itself, or by a separate "exporter" process.


= Runbooks =
Prometheus needs to be told to scrape the HTTP endpoint, which it calls a "target."  (A logical grouping of targets is called a "job.")  In addition to adding the new job to the Prometheus server, you will need to add a firewall rule exposing the HTTP endpoint.


== Stop queries on problematic instances ==
For examples of Puppet changes that add new jobs, see [[gerrit:c/operations/puppet/+/504360|change 504360]] and [[gerrit:#/c/operations/puppet/+/572141|change 572141]].


=== Stop queries on problematic instances ===
If a single Prometheus instance is misbehaving (e.g. overloaded) it is possible to temporarily stop queries from reaching that instance by stopping Puppet, commenting out the relevant <tt>ProxyPass</tt> entry in <tt>/etc/apache2/prometheus.d/</tt> and issuing <tt>apache2ctl graceful</tt>. See also {{Bug|T217715}}.


Line 398: Line 330:
[[Category:Runbooks]]


== Prometheus was restarted ==
=== Prometheus was restarted ===
 
The alert on Prometheus uptime exists to notify opsen of the possibility of strange monitoring artifacts occurring, as [[Incident documentation/20190425-prometheus|has happened in the past]].  If it was just a single restart, and not a crashloop, no action is strictly necessary (but investigating what happened isn't a bad idea; Prometheus isn't supposed to crash or restart).


If this alert is firing for a 'global' Prometheus, it can mean that either the global instance restarted, or that one of the Prometheis scraped by the global instance restarted.


== Configuration reload failure ==
=== Configuration reload failure ===
 
Check for recent changes in Puppet, particularly modifications to monitoring::check_prometheus invocations or to the underlying module/prometheus templates themselves. Hopefully the error message from Prometheus gives you some idea.


== k8s cache not updating ==
=== k8s cache not updating ===
 
As discovered in {{Bug|T227478}} the Prometheus kubernetes cache can stop updating (reasons TBD). In this case <tt>systemctl restart prometheus@k8s</tt> "fixes" the issue.


== Prometheus job unavailable ==
=== Prometheus job unavailable ===
As part of {{Bug|T187708}} there's alerting in place for unavailable Prometheus jobs. This alert means that Prometheus was unable to fetch metrics from most of the job's targets, usually for the following reasons:


As part of {{Bug|T187708}} there's alerting in place for unavailable Prometheus jobs. This means that Prometheus was unable to fetch metrics from most of the job's targets, e.g. because the target is down, unreachable or fetching metrics timed out. See also https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets for dashboard and logs.
* the targets themselves are effectively down, unreachable or fetching metrics timed out. Could be caused by missing firewall rules on the host, the service is down, etc
* the target files for Prometheus are incorrect or stale. For example Prometheus is trying to pull metrics from a port /service that's not provisioned on the host anymore. Check <tt>/srv/prometheus/INSTANCE/targets</tt> on Prometheus hosts and the related Puppet configuration at <tt>modules/profile/manifests/prometheus/INSTANCE.pp</tt>.


== Prometheus exporters "up" metrics unavailable ==
See also https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets for dashboard


=== Prometheus exporters "up" metrics unavailable ===
Some services don't have native Prometheus metrics support, thus an "exporter" is used that runs alongside the service and converts metrics from the service into Prometheus metrics. It might happen that the exporter itself is up (thus the job is available, see above) but the exporter is unable to contact the service for some reason. Such conditions are reported in metrics such as <code>mysql_up</code> for example by the mysql exporter. See also https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets for dashboard and logs.


== Failover Prometheus Pushgateway ==
=== Failover Prometheus Pushgateway ===
 
The Prometheus Pushgateway needs to run as a singleton to properly track pushed metrics. For this reason the <tt>prometheus-pushgateway</tt> is active on one host at a time.


Line 430: Line 361:
* Run puppet on the old and new host, then on all prometheus hosts in codfw/eqiad to make sure metrics are polled from the new host


== Stale file for node-exporter textfile ==
=== Stale file for node-exporter textfile ===
 
Certain metrics are periodically generated by dumping Prometheus-formatted plaintext files (extension <tt>.prom</tt>) into <tt>/var/lib/prometheus/node.d/</tt>. The processes that generate the files run asynchronously to node-exporter, normally via systemd timers, and such processes can fail to update the files.  The alert fires whenever such metric files have failed to be updated; the Icinga alert description will be something like the following:


Line 438: Line 368:
Meaning that <tt>an-worker1101</tt> has failed to update <tt>debian_version.prom</tt>. Debugging such failures usually involves finding out which systemd timer is responsible for generating the file, usually by looking at Puppet, and debugging further from there.
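When debugging on the affected host, a quick way to see which files stopped updating is to compare modification times. The Python sketch below is a local debugging aid only, not how the alert itself is computed; the helper name <code>stale_textfiles</code> and the age threshold are assumptions.

```python
import os
import time


def stale_textfiles(directory, max_age_seconds, now=None):
    """Return the names of .prom files not modified within max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".prom"):
            continue  # only .prom files are metric files
        age = now - os.path.getmtime(os.path.join(directory, name))
        if age > max_age_seconds:
            stale.append(name)
    return stale
```

For example <code>stale_textfiles("/var/lib/prometheus/node.d", 3600)</code> lists the files older than one hour.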


== Watchrat Non-23xx HTTP response ==
=== Pingthing Non-23xx HTTP response ===
A URL checked by the blackbox/watchrat prometheus job is returning a non 200/300 HTTP code, the URL contained in the instance label should be checked for problems.
A URL checked by the blackbox/pingthing prometheus job is returning a non 200/300 HTTP code, the URL contained in the instance label should be checked for problems.
[[Category:Metrics]]
[[Category:SRE Observability]]
[[Category:Services]]

Latest revision as of 12:30, 22 July 2022

Prometheus is an open source software ecosystem for monitoring and alerting, with focus on reliability and simplicity. See also upstream's Prometheus overview and Prometheus FAQ.

What is it?

Distinguishing features of Prometheus compared to other metrics systems include:

multi-dimensional data model
Metrics have a name and several key=value pairs to better model what the metric is about. e.g. to measure varnish requests in the upload cache in eqiad we'd have a metric like http_requests_total{cache="upload",site="eqiad"}.
a powerful query language
Makes it possible to ask complex questions, e.g. when debugging problems or drilling down for root cause during outages. From the example above, the query topk(3, sum(http_requests_total{status=~"5.."}) by (cache)) would return the top 3 caches (text/upload/misc) with the most errors (status matches the regexp "5..", i.e. any 5xx code; Prometheus regexps are fully anchored).
pull metrics from targets
Prometheus is primarily based on a pull model, in which the prometheus server has a list of targets it should scrape metrics from. The pull protocol is HTTP based and simply put, the target returns a list of "<metric> <value>". Pushing metrics is supported too, see also http://prometheus.io/docs/instrumenting/pushing/.
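To make the "<metric> <value>" response concrete, here is a minimal Python sketch that parses such a payload. The function name <code>parse_metrics</code> and the sample payload are illustrative assumptions, and the sketch assumes sample lines carry no trailing timestamp.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample line's metric (name plus labels) to its value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last whitespace-separated field on the line.
        metric, value = line.rsplit(None, 1)
        samples[metric] = float(value)
    return samples


# Made-up payload in the shape a target would return when scraped.
payload = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{cache="upload",site="eqiad"} 1027
http_requests_total{cache="text",site="eqiad"} 3
"""
parsed = parse_metrics(payload)
```

Each scrape returns only the current values; the history is built up by the Prometheus server storing one such snapshot per scrape.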

After the Prometheus proof of concept (as per User:Filippo_Giunchedi/Prometheus_POC) has been running in Labs for some time, during FQ1 2016-2017 the Prometheus deployment has been extended to production, as outlined in the WMF Engineering 2017 Goals.

Service

Server location

The various Prometheus servers are logically separated, though physically they can share one or multiple hosts. As of April 2022, we run Prometheus on baremetal hardware in Eqiad and Codfw, and on Ganeti VMs in all POPs / caching centers.

Instances

As of April 2022 the list of all Prometheus instances includes:

analytics
All things analytics from the Hadoop cluster and similar
ext
Collect external, potentially untrusted, data
global
Global instance, see below
k8s
Main/production k8s cluster
k8s-mlserve
k8s for machine learning
k8s-staging
Staging k8s cluster
ops
The biggest instance where most metrics are collected
services
Dedicated instance for ex services team / cassandra metrics

Federation and multiple DCs

The use case for a cross-DC view of metrics used to be covered by the "global" Prometheus instance. This instance is now deprecated with this use case (and much more) now covered by Thanos.

Architecture

Each Prometheus server is configured to scrape a list of targets (i.e. HTTP endpoints) at a certain frequency, in our case starting at 60s. All metrics are stored on the local disk with a per-instance multi-week retention period.

All targets to be scraped are grouped into jobs, depending on the purpose that those targets serve. For example the job to scrape all host-level data for a given location using node-exporter will be called node and each target will be listed as hostname:9100. Similarly there are jobs for varnish, mysql, etc.

Each Prometheus server is meant to be stand-alone and polling targets in the same failure domain as the server itself as appropriate (e.g. the same site, the same vlan and so on). For example this allows to keep the monitoring local to the site and not have spotty metrics upon cross-site connectivity blips. (See also Federation)

Prometheus single server.png

Exporters

The endpoint being polled by the prometheus server and answering the GET requests is typically called exporter, e.g. the host-level metrics exporter is node-exporter.

Each exporter serves the current snapshot of metrics when polled by the prometheus server, there is no metric history kept by the exporter itself. Further, the exporter usually runs on the same host as the service or host it is monitoring.

Storage

Why just stand-alone prometheus servers with local storage and not clustered storage? The idea behind a single prometheus server is one of reliability: a monitoring system must be more reliable than the systems it is monitoring. It is certainly easier to get local storage right and reliable than clustered storage, which is especially important when collecting operational metrics.

See also prometheus storage documentation for a more in-depth explanation and storage space requirements.

High availability

With local storage being the basic building block we can still achieve high-availability by running more than one server in parallel, each configured the same and polling the same set of targets. Queries for data can be routed via LVS in an active/standby fashion, under normal circumstances the load is shared (i.e. active/active).

Prometheus HA server.png

Backups

For efficiency reasons, prometheus spools chunks of datapoints in memory for each metric before flushing them to disk. This makes it harder to perform backups online by simply copying the files on disk. The issue of having consistent backups is also discussed in prometheus #651.

Notwithstanding the above, it should be possible to back up the prometheus local storage files as-is by archiving the storage directory with tar before regular (bacula) backups. Since the backup is done online it will contain some inconsistencies; upon restoring the backup, Prometheus will run crash recovery on its storage at startup.

To perform backups of consistent/clean state, at the moment prometheus needs to be shutdown gracefully, therefore when running an active/standby configuration backup can be taken on the standby prometheus to minimize its impact. Note that the shutdown will result in gaps in the standby prometheus server for the duration of the shutdown.

Failure recovery

In the event of a prometheus server having an unusable local storage (disk failed, filesystem failed, corruption, etc) failure recovery can take the form of:

  • start with empty storage: of course it is a complete loss of metric history for the local server and will obviously fully recover once the metric retention period has passed.
  • recover from backups: restore the storage directory to the last good backup
  • copy data from a similar server: when deployed in pairs it is possible to copy/rsync the storage directory onto the failed server, this will likely result in gaps in the recent history though (see also Backups)

Service Discovery

Prometheus supports different kinds of discovery through its configuration. For example, role::prometheus::labs_project implements auto-discovery of all instances for a given labs project. file_sd_config is used to continuously monitor a set of configuration files for changes, and the script prometheus-labs-targets is run periodically to write the list of instances to the relevant configuration file. The file_sd files are reloaded automatically by prometheus, so new instances will be auto-discovered and have their instance-level metrics collected.
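As an illustration of what a file_sd file contains, the Python sketch below renders one target group in the JSON form accepted by file_sd_config. It is not the actual prometheus-labs-targets script; the function name <code>render_targets</code>, the hostnames and the label are hypothetical.

```python
import json


def render_targets(hostnames, port, labels):
    """Return file_sd JSON: one group of host:port targets sharing the given labels."""
    group = {
        "targets": ["%s:%d" % (host, port) for host in sorted(hostnames)],
        "labels": labels,
    }
    return json.dumps([group], indent=2)


# Hypothetical instances scraped by node-exporter on port 9100.
document = render_targets(["web02", "web01"], 9100, {"project": "example"})
```

Writing the result to a path matched by file_sd_config is enough for Prometheus to pick up the new targets without a restart.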

While file-based service discovery works, Prometheus also supports higher-level discovery for example for Kubernetes (see also profile::prometheus::k8s).

Adding new metrics

In general Prometheus' model is pull-based. In practical terms that means that once metrics are available over HTTP somewhere on the network with the methods described below, Prometheus itself should be instructed to poll for metrics via its configuration (more specifically, a job as described in upstream documentation). Within WMF's Puppet the Prometheus configuration lives inside its respective instance profile, for example modules/profile/manifests/prometheus/ops.pp is often the right place to add new jobs.

Direct service instrumentation

The most benefits from service metrics are obtained when services are directly instrumented with one of Prometheus clients, e.g. Python client. Metrics are then exposed via HTTP/HTTPS, commonly at /metrics, on the service's HTTP(S) port (in the common case) or a separate port if the service doesn't talk HTTP to begin with.

Service exporters

For cases where services can't be directly instrumented (aka whitebox monitoring), a sidekick application exporter can be run alongside the service that will query the service using whatever mechanism and expose prometheus metrics via the client. This is the case for example for varnish_exporter parsing varnishstat -j or apache_exporter parsing apache's mod_status page.

Machine-level metrics

Another class of metrics is all those related to the machine itself rather than a particular service. Those involve calling a subprocess and parsing the result, often in a cronjob. In these cases the simplest thing to do is drop plaintext files on the machine's filesystem for node-exporter to pick up and expose the metrics on HTTP. This mechanism is named textfile and for example the python client has support for it, e.g. sample textfile collector usage. This is most likely the mechanism we could use to replace most of the custom collectors we have for Diamond.
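A minimal Python sketch of the textfile mechanism follows, using only the standard library rather than the Prometheus client. The helper name <code>write_textfile</code> is an assumption. The file is written under a temporary name and renamed into place so that node-exporter, which only reads <tt>*.prom</tt> files, never sees a partial write.

```python
import os
import tempfile


def write_textfile(directory, name, metrics):
    """Write metrics to <directory>/<name>.prom via an atomic rename."""
    lines = ["%s %s" % (metric, value) for metric, value in sorted(metrics.items())]
    # The temporary file lacks the .prom suffix, so the collector ignores it.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.chmod(tmp, 0o644)  # mkstemp creates the file readable by its owner only
    final = os.path.join(directory, name + ".prom")
    os.replace(tmp, final)
    return final
```

A cronjob or systemd timer could call this with the directory node-exporter is configured to read and a dictionary of metric names to values.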

Ephemeral jobs (Pushgateway)

Yet another case involves service-level ephemeral jobs that are not quite long-lived enough to be queried via HTTP. For those jobs there's a push mechanism to be used: metrics are pushed to Prometheus' pushgateway via HTTP and subsequently scraped by Prometheus once a minute from the gateway itself.

This method appears similar to statsd in its simplicity, but it should be used with care; see also best practices on when to use the pushgateway. Good use cases are for example mediawiki's maintenance jobs: tracking how long the job took and when it last succeeded; if the job isn't tied to a machine in particular it is usually a good candidate.

In WMF's deployment the pushgateway address to use is http://prometheus-pushgateway.discovery.wmnet
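A push can be sketched in Python with the standard library alone. The sketch builds the HTTP request against the Pushgateway's <code>/metrics/job/&lt;job&gt;</code> path without sending it; the helper name <code>build_push</code>, the job name and the metric names are hypothetical.

```python
import urllib.parse
import urllib.request

# Address taken from the text above.
PUSHGATEWAY = "http://prometheus-pushgateway.discovery.wmnet"


def build_push(job, metrics):
    """Build a PUT request that replaces all metrics pushed for `job`."""
    body = "".join("%s %s\n" % (name, value) for name, value in sorted(metrics.items()))
    url = "%s/metrics/job/%s" % (PUSHGATEWAY, urllib.parse.quote(job, safe=""))
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")


# Hypothetical maintenance job reporting its duration and last success time.
request = build_push(
    "example_maintenance",
    {"job_duration_seconds": 42.5, "job_last_success_timestamp_seconds": 1658491800},
)
```

Passing the request to <code>urllib.request.urlopen</code> would perform the push. PUT replaces every metric previously pushed for the same job, whereas POST only replaces metrics with the same names.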

Network probes (blackbox exporter)

As of Jul 2022 it is possible to run so-called network blackbox probes via Prometheus. Said probes are run from Prometheus hosts themselves, target network services and are used to assert whether the service works from a user/client perspective (hence the "blackbox" terminology).

If your service is part of service::catalog in puppet then adding network probes is trivial in most cases. Add a probes stanza to your service, for example probing /?spec and test for a 2xx response is achieved by the following:

 probes:
   - type: http
     path: /?spec

Refer to the Wmflib::Service::Probe type documentation for more advanced use cases.

Custom checks/probes defined outside service::catalog can be implemented in Puppet via prometheus::blackbox::check::{http,tcp,icmp} abstractions. They will deploy both network probes and related alerts (e.g. when the probe is unsuccessful, or the TLS certificates are about to expire), by default probing both ipv4 and ipv6 address families. The probe's usage largely depends on the use case, ranging from a simple example like below:

 # Probe the phabricator.wikimedia.org vhost, using TLS, and talk to the host(s) this check is deployed to
 prometheus::blackbox::check::http { 'phabricator.wikimedia.org':
     severity => 'page',
 }

To more complex use cases like VRTS, checking responses for specific text, on ipv4, etc:

 prometheus::blackbox::check::http { 'ticket.wikimedia.org':
     team               => 'serviceops-collab',             
     severity           => 'warning',    
     path               => '/otrs/index.pl',
     port               => 1443,
     ip_families        => ['ip4'],      
     force_tls          => true,
     body_regex_matches => ['wikimedia'],
 }

Check the http check documentation for more information.

It is recommended to pick the highest-level check possible for your service (IOW prefer HTTP over TCP for example) to improve signal-to-noise ratio.

Example use

MySQL monitoring is performed by running prometheus-mysqld-exporter on the database machine to be monitored. Metrics are exported via HTTP on port 9104 and fetched by the Prometheus server(s). To preview what metrics are being collected, a fetch can be simulated with:

curl -s localhost:9104/metrics | grep -v '^#'

Grafana dashboards:

Query cheatsheet

Filter for a specific instance

Given values such as

varnish_mgt_child_stop{instance="cp2001:9131",job="varnish-text",layer="backend"}

and a template variable called $server, containing the server hostname, one can filter for the selected instance as follows:

varnish_mgt_child_stop{instance=~"$server:.*",layer="backend"}

Filter by label using multi-values template variables

Given the following two metrics:

varnish_version{job="varnish-upload", ...}
node_uname_info{cluster="cache_upload", ...}

and a multi-value template variable called $cache_type, with the following values: text,upload,misc,canary, it is possible to write a prometheus query filtering the selected cache_types:

node_uname_info{cluster=~"cache_($cache_type)"}
varnish_version{job=~"varnish-($cache_type)"}
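To make the expansion concrete, here is a minimal Python sketch of what the query looks like once the variable is filled in. It assumes Grafana joins the selected values with "|" when interpolating; the exact escaping depends on the Grafana version and data source, and the interpolate helper is hypothetical, written only for this illustration.

```python
import re

def interpolate(query, variable, selected_values):
    """Replace $variable with the selected values joined by "|".

    Assumption: this mimics Grafana's multi-value interpolation only roughly
    and does no regex escaping of the values.
    """
    return query.replace("$" + variable, "|".join(selected_values))

template = 'node_uname_info{cluster=~"cache_($cache_type)"}'
query = interpolate(template, "cache_type", ["text", "upload"])
print(query)

# Prometheus regex matchers are fully anchored, so emulate them with fullmatch.
matcher = re.compile("cache_(text|upload)")
for cluster in ["cache_text", "cache_upload", "cache_misc", "cache_upload_test"]:
    print(cluster, bool(matcher.fullmatch(cluster)))
```

The parentheses in the query keep the alternation scoped to the suffix, so the cache_ prefix applies to every selected value rather than only to the first one.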

Dynamic, query-based template variables

Grafana's templating allows defining template variables based on Prometheus queries.

Given the following metric:

node_uname_info{release="4.9.0-0.bpo.4-amd64", ...}
node_uname_info{release="4.9.0-0.bpo.3-amd64", ...}

Choose Query as the variable Type, the desired Data Source, and specify a query such as the following to extract the values:

 label_values(node_uname_info, release)

Query a metric with high accuracy even if with low precision (e.g. uptime)

Prometheus metrics will never provide high precision, mostly because scraping only happens every minute, so values are only known to within that one-minute scrape interval. However, there are times when you need high accuracy (getting the value at a specific time), even if you don't care what that time is. This is the case, for example, when calculating the uptime of a server: you don't care if you get stale results, as long as they were accurate at the time they were collected. To do so, you can query:

timestamp(node_time_seconds) - node_time_seconds

This way the time will be accurate to the second. If you use the timestamp of the metric or time(), you will get varying times within a minute.

FAQ

How long are metrics stored in Prometheus?

As of June 2020, we have deployed Thanos for long-term storage of metrics. The target retention period for all one-minute metrics is three years, although as of Jul 2022 the one-minute retention has been shortened for capacity reasons (cf. bug T311690). The retention target for five-minute and one-hour aggregated datapoints is still three years.

What are the semantics of rate/irate/increase?

These functions generally take a counter metric (i.e. non-decreasing) and return a "value over time". rate and irate return per-second values (rate averaged over the whole interval, irate computed from the last two samples only), while increase returns the total change over the given interval. See also the in-depth explanation at promlabs.com.
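As a rough illustration of the difference, the following Python sketch computes simplified versions of the three functions over a hand-made list of samples. The sample data and helper names are hypothetical, and the sketch ignores counter resets and the extrapolation to the window boundaries that Prometheus performs, so real query results can differ slightly.

```python
# Simplified illustration of increase/rate/irate over counter samples.
# Assumptions: no counter resets, no extrapolation to the window edges
# (Prometheus handles both), samples are (unix_timestamp, value) pairs.

def increase(samples):
    """Change of the counter between the first and last sample."""
    return samples[-1][1] - samples[0][1]

def rate(samples):
    """Average per-second increase over the whole window."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed

def irate(samples):
    """Per-second increase computed from the last two samples only."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)

# Hypothetical counter scraped once a minute over five minutes.
window = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 280), (300, 400)]

print(increase(window))  # total change over the window
print(rate(window))      # per-second rate averaged over the window
print(irate(window))     # per-second rate from the last two samples
```

With this data the counter stalls for a minute and then jumps, so irate reports a higher value than rate: it only sees the final jump, while rate smooths it over the whole five minutes.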

What best practices should we use for label and metric naming?

We generally follow the same guidelines as Prometheus: https://prometheus.io/docs/practices/naming/. Don't hesitate to reach out to Observability with further questions around metric/label naming.

Replacing Graphite

Another use case imaginable for Prometheus is to replace the current Graphite deployment. This task is less "standalone" than replacing Ganglia and therefore more difficult: Graphite is more powerful and used by more people/services/dashboards. Nevertheless it should be possible to keep Prometheus and Graphite alongside each other and progressively put more data into Prometheus without affecting Graphite users. The top contributors to data that flows into Graphite as of Aug 2016 are Diamond, Statsd and Cassandra.

Statsd

Statsd traffic for the most part flows from machines to statsd.eqiad.wmnet over UDP on port 8125 for aggregation. There are some exceptions (e.g. swift) where statsd aggregation is performed on localhost and then pushed via graphite line-oriented protocol.

Prometheus provides statsd_exporter to receive statsd metrics and turn those into key => value prometheus metrics according to a user-supplied mapping. The resulting metrics are then exposed via HTTP for prometheus server to scrape.
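The idea of such a mapping can be sketched in Python as follows. This is not statsd_exporter's implementation, only a small imitation of its glob-style matching in which each "*" stands for one dot-separated component and "$1", "$2", ... refer to the matched components. The metric names, label names and helper functions are hypothetical.

```python
import re

def compile_mapping(match, name, labels):
    """Turn a glob such as "myapp.api.*.latency" into a compiled mapping."""
    parts = []
    for component in match.split("."):
        # Each "*" captures exactly one dot-separated component.
        parts.append("([^.]+)" if component == "*" else re.escape(component))
    pattern = re.compile(r"\.".join(parts))
    return pattern, name, labels

def apply_mapping(mapping, statsd_name):
    """Return (metric_name, labels) for a statsd name, or None if no match."""
    pattern, name, labels = mapping
    found = pattern.fullmatch(statsd_name)
    if found is None:
        return None
    resolved = {}
    for label, template in labels.items():
        # Replace $1, $2, ... with the corresponding captured component.
        resolved[label] = re.sub(
            r"\$(\d+)", lambda ref: found.group(int(ref.group(1))), template
        )
    return name, resolved

# Hypothetical mapping: the third component becomes the "module" label.
mapping = compile_mapping("myapp.api.*.latency", "myapp_api_latency", {"module": "$1"})

print(apply_mapping(mapping, "myapp.api.query.latency"))
print(apply_mapping(mapping, "myapp.other.metric"))
```

The point of the mapping is that information encoded in the dotted statsd name ends up in labels, which is what makes the resulting metrics usable with Prometheus' multi-dimensional data model.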

One idea to integrate statsd_exporter into our statsd traffic is to put it "inline" between the application and statsd.eqiad.wmnet. In other words we would need to:

  1. Modify statsd_exporter to mirror received udp packets to statsd.eqiad.wmnet and install it on end hosts
  2. Opt-in applications by changing their statsd host from statsd.eqiad.wmnet to localhost
  3. Extend the statsd_exporter mapping file to include mappings for our statsd metrics.

This method works well for applications/languages that are request-scoped (e.g. php) since there isn't necessarily a server process to keep and aggregate metrics in. For services that qualify, the recommended way is to switch to Prometheus client for instrumentation.

If you are migrating a service that uses statsd to k8s, see also Prometheus/statsd_k8s.

Cassandra

Cassandra is hosted on separate Graphite machines due to the number and size of metrics it pushes, particularly in conjunction with Restbase. Whether e.g. a separate Prometheus instance makes sense should also be evaluated separately. With regard to implementation there are two viable options:

JMX

Prometheus jmx_exporter can be used to collect metrics through JMX.

A few notes:

  • Some standard JVM metrics are always collected as DefaultExports; those cannot be ignored in the jmx_exporter configuration. The same metrics could be collected explicitly from their respective MBeans, but we chose to standardize on the default exports.
  • Without whitelist / blacklist, jmx_exporter will iterate through all MBeans and read all their attributes. This can be expensive, or even dangerous depending on the MBeans exposed by the application.
  • The whitelist / blacklist work as:
    • load all mbeans corresponding to the whitelist query,
    • load all mbeans corresponding to the blacklist query,
    • remove all blacklisted mbeans from the list of whitelisted mbeans,
    • iterate over the remaining mbeans, including reading all their attributes.

This implies that an overly broad blacklist query can still have a non-trivial cost, since the blacklisted mbeans are loaded before being discarded.
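The selection steps listed above amount to a set difference, which the following Python sketch illustrates. It is not jmx_exporter code: the MBean names are made up, shell-style globs stand in for JMX object name queries, and select_mbeans is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def select_mbeans(all_mbeans, whitelist, blacklist):
    """Return the MBeans whose attributes would be read.

    Assumption: glob patterns approximate JMX object name queries.
    """
    # Steps 1 and 2: load everything matching the whitelist and the blacklist.
    whitelisted = {m for m in all_mbeans if any(fnmatchcase(m, p) for p in whitelist)}
    blacklisted = {m for m in all_mbeans if any(fnmatchcase(m, p) for p in blacklist)}
    # Step 3: remove blacklisted MBeans from the whitelisted ones.
    # Step 4 (reading the attributes) would iterate over this result.
    return sorted(whitelisted - blacklisted)

mbeans = [
    "java.lang:type=Memory",
    "org.apache.cassandra.metrics:type=Table,name=ReadLatency",
    "org.apache.cassandra.metrics:type=Table,name=WriteLatency",
    "org.apache.cassandra.db:type=Tables",
]

selected = select_mbeans(
    mbeans,
    whitelist=["org.apache.cassandra.metrics:*"],
    blacklist=["*name=WriteLatency"],
)
print(selected)
```

Note that the blacklist pattern is evaluated against every MBean, not only the whitelisted ones, which is where the cost of a broad blacklist comes from.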

List/inspect existing mbeans

Scenario: you want to check JMX MBeans available or generic JVM data in Production from your laptop:

ssh -ND 9099 $some_hostname$
jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=9099

JConsole will then open; select Remote Process and enter the following: $hostname$:port (don't use localhost, it will not work!)

Grafana dashboards

Grafana dashboards will need porting from Graphite to Prometheus metrics; this is likely to be the most labor-intensive part since most (all?) dashboards are hand-curated. While it should be possible to programmatically change statsd metric names into prometheus metric names, the query language is different enough to make this impractical except for very basic cases.

Replacing Watchmouse (CA DX APP)

Prometheus is replacing the third-party monitoring system we often refer to as "watchmouse" (since rebranded to CA DX APP monitoring).

This replacement has been dubbed "pingthing" as a reference to the functionality it has been deployed to replace, essentially static checks of public facing resources. Pingthing checks are driven by prometheus blackbox exporter.

Runbooks

Global view (Thanos) web interface

As of Jul 2020 the Thanos web interface is available at https://thanos.wikimedia.org. This interface offers a global view over Prometheus data and should be preferred for new use cases. Please consult the Thanos page to find out more.

Access Prometheus web interface

Use https://thanos.wikimedia.org to run Prometheus queries across all Prometheus instances in all sites. The old method of SSH port forwarding still works but has been deprecated and replaced by the Thanos web interface. In short, for example for the 'ops' instance (port 9900) in prometheus codfw: ssh -L9900:localhost:9900 prometheus2003.codfw.wmnet then browse http://localhost:9900

To access the Prometheus web interface in beta (deployment-prep), use https://beta-prometheus.wmflabs.org/beta/graph

To access the Prometheus web interface for Cloud Services hardware that is using the cloudmetrics monitoring setup, please follow the instructions at Portal:Cloud_VPS/Admin/Monitoring#Accessing_"labs"_prometheus

List metrics with curl

One easy way to check what metrics are being collected by prometheus on a given machine is to request the metrics via HTTP like prometheus server does at scrape time, e.g. for node-exporter on port 9100:

 curl -s localhost:9100/metrics
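For scripted checks, the text returned by the endpoint is simple enough to parse by hand. The sketch below parses a small hand-written sample of the exposition format rather than live output; the sample values are made up, parse_metrics is a hypothetical helper, and it assumes samples carry no trailing timestamp.

```python
def parse_metrics(text):
    """Map each series (name plus labels) to its float value.

    Assumption: every sample line is "<series> <value>" with no timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # Split on the last space, since label values may contain spaces.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples

# Hand-written sample in the style of node-exporter output.
sample_output = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

for series, value in parse_metrics(sample_output).items():
    print(series, value)
```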

Aggregate metrics from multiple sites

The use case for a "global" view of metrics used to be covered by the global Prometheus instance. Said instance is deprecated and this use case (and more) is covered by Thanos.

Sync data from an existing Prometheus host

When replacing existing Prometheus hosts it is possible to keep existing data by rsync'ing the metrics directory from the old host to the new one. First make sure that Puppet has run successfully on the new host (thus Prometheus is configured) and that Prometheus can reach its targets successfully (i.e. the new host is part of prometheus_nodes for its site). Once all of that is done the rsync can happen, on the new host:

 puppet agent --disable "copying prometheus data"
 export old_host=<hostname>
 export instance_name=ops
 systemctl stop prometheus@${instance_name}
 su -s /bin/bash prometheus
 rsync -vd ${old_host}::prometheus-${instance_name}/ /srv/prometheus/${instance_name}/metrics/
 # do a first rsync pass in parallel for each subdirectory
 /usr/bin/time parallel -j10 -i rsync -a ${old_host}::prometheus-${instance_name}/{}/ {}/ -- /srv/prometheus/${instance_name}/metrics/*
 # once this is completed stop puppet and prometheus on $old_host as well, and repeat the rsync for a final pass.
 rsync -vd ${old_host}::prometheus-${instance_name}/ /srv/prometheus/${instance_name}/metrics/
 /usr/bin/time parallel -j10 -i rsync -a ${old_host}::prometheus-${instance_name}/{}/ {}/ -- /srv/prometheus/${instance_name}/metrics/*
 # once this is completed you can restart prometheus and puppet on both hosts

Prometheus host running out of space

It might happen that Prometheus hosts get close to running out of space on one of their per-instance filesystems. Assuming the underlying volume group has space available (lvs to check what LVs are present and on which VGs, then vgs to check VGs themselves) then it is possible to extend the filesystem online with (e.g. +25G to the prometheus-foo LV on vg0 VG, remove --test once happy).

 lvextend --test --resizefs --size +25G vg0/prometheus-foo

Make sure to:

  • Leave some space available on the VG, to handle cases like this in the future if possible
  • Extend the filesystem on all prometheus hosts in the same site
  • !log your actions for easier traceability

No space available on the volume group

At some point the space on the volume group might be fully allocated. In this case the emergency remedy is to decrease Prometheus retention time via prometheus::server::storage_retention in Puppet, and restart Prometheus with the new settings.

In the unfortunate case that the filesystem is 100% utilized, it is also possible to manually remove storage "blocks" (i.e. directories) from the metrics directory under /srv/prometheus/INSTANCE. Sorting the filenames alphabetically will ensure they are sorted chronologically as well.
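The ordering works because Prometheus names block directories with ULIDs, which begin with a timestamp and therefore sort by creation time. A minimal Python sketch for listing blocks oldest first, run here against a throwaway directory with made-up block names (the helper name is hypothetical, and nothing is deleted):

```python
import os
import re
import tempfile

# ULIDs are 26 characters of Crockford base32 (no I, L, O or U).
ULID_RE = re.compile(r"^[0-9A-HJKMNP-TV-Z]{26}$")

def blocks_oldest_first(metrics_dir):
    """Return block directory names sorted oldest first, skipping e.g. wal."""
    return sorted(
        entry for entry in os.listdir(metrics_dir)
        if ULID_RE.match(entry) and os.path.isdir(os.path.join(metrics_dir, entry))
    )

# Demonstration with made-up block names in a temporary directory.
with tempfile.TemporaryDirectory() as metrics_dir:
    newer = "01F0000000" + "A" * 16
    older = "01EM6Q6A1YPX4G9TEB20J22B2R"
    for name in [newer, older, "wal"]:
        os.mkdir(os.path.join(metrics_dir, name))
    demo_blocks = blocks_oldest_first(metrics_dir)

print(demo_blocks)
```

On a real host the function would be pointed at the instance's metrics directory; the first entries of the result are the candidates for removal.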

Add filesystems for a new instance

Until T163692 is fully resolved, new Prometheus instances require adding LVs to the Prometheus hosts in eqiad/codfw. When provisioning a new instance refer to modules/prometheus/files/provision-fs.sh: add the new instance there and run the script on eqiad/codfw Prometheus hosts.

Add metrics from a new service

Most services which export metrics to Prometheus do so via an HTTP endpoint, running on its own port. This HTTP endpoint can be served by the daemon itself, or by a separate "exporter" process.

Prometheus needs to be told to scrape the HTTP endpoint, which it calls a "target." (A logical grouping of targets is called a "job.") In addition to adding the new job to the Prometheus server, you will need to add a firewall rule exposing the HTTP endpoint.
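To show what such an endpoint boils down to, here is a minimal Python sketch that serves one counter in the text exposition format using only the standard library. The metric name, port and helper names are hypothetical, and a real service would normally use an official Prometheus client library rather than hand-rolled formatting.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter that the service would increment.
REQUESTS_SERVED = {"count": 0}

def render_metrics():
    """Render the current metrics in the Prometheus text exposition format."""
    return (
        "# HELP example_requests_total Requests served by this example.\n"
        "# TYPE example_requests_total counter\n"
        f"example_requests_total {REQUESTS_SERVED['count']}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=9999):
    """Block forever serving /metrics; the port is an arbitrary example."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint, after which curl -s localhost:9999/metrics returns the same kind of output Prometheus would scrape.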

For example Puppet changes that add new jobs, see change 504360 and change 572141.

Stop queries on problematic instances

If a single Prometheus instance is misbehaving (e.g. overloaded) it is possible to temporarily stop queries from reaching that instance by stopping Puppet, commenting out the relevant ProxyPass entry in /etc/apache2/prometheus.d/ and issuing apache2ctl graceful. See also bug T217715.

Prometheus was restarted

The alert on Prometheus uptime exists to notify opsen of the possibility of strange monitoring artifacts occurring, as has happened in the past. If it was just a single restart, and not a crashloop, no action is strictly necessary (but investigating what happened isn't a bad idea; Prometheus isn't supposed to crash or restart).

If this alert is firing for a 'global' Prometheus, it can mean that either the global instance restarted, or that one of the Prometheis scraped by the global instance restarted.

Configuration reload failure

Check for recent changes in Puppet, particularly modifications to monitoring::check_prometheus invocations or to the underlying module/prometheus templates themselves. Hopefully the error message from Prometheus gives you some idea.

k8s cache not updating

As discovered in bug T227478 the Prometheus kubernetes cache can stop updating (reasons TBD). In this case systemctl restart prometheus@k8s "fixes" the issue.

Prometheus job unavailable

As part of bug T187708 there's alerting in place for unavailable Prometheus jobs. This alert means that Prometheus was unable to fetch metrics from most of the job's targets, usually for the following reasons:

  • the targets themselves are effectively down, unreachable or fetching metrics timed out. Could be caused by missing firewall rules on the host, the service is down, etc
  • the target files for Prometheus are incorrect or stale. For example Prometheus is trying to pull metrics from a port/service that's not provisioned on the host anymore. Check /srv/prometheus/INSTANCE/targets on Prometheus hosts and the related Puppet configuration at modules/profile/manifests/prometheus/INSTANCE.pp.

See also the dashboard at https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets.

Prometheus exporters "up" metrics unavailable

Some services don't have native Prometheus metrics support, thus an "exporter" is used that runs alongside the service and converts metrics from the service into Prometheus metrics. It might happen that the exporter itself is up (thus the job is available, see above) but the exporter is unable to contact the service for some reason. Such conditions are reported in metrics such as mysql_up for example by the mysql exporter. See also https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets for dashboard and logs.

Failover Prometheus Pushgateway

The Prometheus Pushgateway needs to run as a singleton to properly track pushed metrics. For this reason the prometheus-pushgateway is active on one host at a time.

To fail over, the steps involved are the following:

  • Change profile::prometheus::pushgateway_host in Puppet to point to another Prometheus host
  • Change the prometheus-pushgateway.discovery.wmnet record in DNS to point to the same host.
  • Run puppet on the old and new host, then on all prometheus hosts in codfw/eqiad to make sure metrics are polled from the new host

Stale file for node-exporter textfile

Certain metrics are periodically generated by dumping Prometheus-formatted plaintext files (extension .prom) into /var/lib/prometheus/node.d/. The processes that generate the files run asynchronously to node-exporter, normally via systemd timers, and such processes can fail to update the files. The alert fires whenever such metric files have failed to be updated; the Icinga alert description will be something like the following:

 cluster=analytics file=debian_version.prom instance=an-worker1101 job=node site=eqiad

This means that an-worker1101 has failed to update debian_version.prom. Debugging such failures usually involves finding out which systemd timer is responsible for generating the file, usually by looking at Puppet, and debugging further from there.
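For reference, a generator process typically writes its .prom file via a temporary file followed by a rename, so that node-exporter never reads a half-written file. The following Python sketch shows that pattern against a temporary directory standing in for the node.d directory; the metric name, file name and helper are hypothetical and do not correspond to an existing generator.

```python
import os
import tempfile

def write_prom_file(directory, filename, lines):
    """Write lines to directory/filename via a temporary file and a rename."""
    # The temporary name does not end in .prom, so it is not picked up as a
    # metrics file while it is being written.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        tmp.write("\n".join(lines) + "\n")
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by owner only
    target = os.path.join(directory, filename)
    os.replace(tmp_path, target)  # atomic when both paths share a filesystem
    return target

# Demonstration in a throwaway directory instead of the real node.d path.
with tempfile.TemporaryDirectory() as node_d:
    path = write_prom_file(node_d, "example_job.prom", [
        "# HELP example_job_last_success_timestamp_seconds Last successful run.",
        "# TYPE example_job_last_success_timestamp_seconds gauge",
        "example_job_last_success_timestamp_seconds 1700000000",
    ])
    with open(path) as prom_file:
        written = prom_file.read()
    leftovers = sorted(os.listdir(node_d))

print(written)
```

If the generator crashes before the rename, the previous file is left untouched and simply grows stale, which is the condition this alert detects.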

Pingthing Non-23xx HTTP response

A URL checked by the blackbox/pingthing Prometheus job is returning an HTTP status code outside the 2xx/3xx ranges. The URL contained in the instance label should be checked for problems.