
Prometheus

From Wikitech-static
Revision as of 13:44, 30 August 2016 by imported>Filippo Giunchedi (→‎Use cases)

What is it?

Prometheus is a free software ecosystem for monitoring and alerting, with a focus on reliability and simplicity. See also the Prometheus overview and the Prometheus FAQ.

Prometheus has several interesting features that are missing from what we have now, among others:

multi-dimensional data model
Metrics have a name and several key=value pairs to better model what the metric is about. e.g. to measure varnish requests in the upload cache in eqiad we'd have a metric like http_requests_total{cache="upload",site="eqiad"}.
a powerful query language
Makes it possible to ask complex questions, e.g. when debugging problems or drilling down for root cause during outages. Building on the example above, and assuming the metric also carries a status label, the query topk(3, sum(http_requests_total{status=~"5.."}) by (cache)) would return the top 3 caches (text/upload/misc) with the most errors (status matches the regexp "5..", i.e. any 5xx code; regexp matches in Prometheus are fully anchored).
pull metrics from targets
Prometheus is primarily based on a pull model, in which the prometheus server has a list of targets it should scrape metrics from. The pull protocol is HTTP based: simply put, the target returns a list of "<metric> <value>" lines. Pushing metrics is supported too, see also http://prometheus.io/docs/instrumenting/pushing/.
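
To make the data model and the query example above more concrete, the following Python sketch emulates what a "sum by cache, then top 3" query computes over labelled samples. It is only an illustration of the semantics, not how Prometheus evaluates queries internally; the sample values, the status label and the function name top_error_caches are made up for this example.

```python
from collections import defaultdict

# Made-up samples: (metric name, labels, value). Values are illustrative only.
SAMPLES = [
    ("http_requests_total", {"cache": "upload", "site": "eqiad", "status": "200"}, 9000.0),
    ("http_requests_total", {"cache": "upload", "site": "eqiad", "status": "503"}, 12.0),
    ("http_requests_total", {"cache": "text", "site": "eqiad", "status": "500"}, 40.0),
    ("http_requests_total", {"cache": "text", "site": "codfw", "status": "503"}, 5.0),
    ("http_requests_total", {"cache": "misc", "site": "eqiad", "status": "404"}, 7.0),
    ("http_requests_total", {"cache": "misc", "site": "eqiad", "status": "502"}, 3.0),
]


def top_error_caches(samples, k=3):
    """Sum the 5xx request counts per cache and return the k largest sums."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        # Select only the series whose status label starts with "5".
        if name == "http_requests_total" and labels["status"].startswith("5"):
            # "sum(...) by (cache)": drop every label except cache.
            totals[labels["cache"]] += value
    # "topk(k, ...)": keep the k groups with the highest value.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]
```

With the made-up samples above, top_error_caches(SAMPLES) ranks text first, followed by upload and misc.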

The Prometheus POC (as per User:Filippo_Giunchedi/Prometheus_POC) has been running in Labs for some time. During FQ1 2016-2017 we'll be extending the Prometheus deployment to production, as outlined in the Technical Operations goals.

Architecture

Each prometheus server is configured to scrape a list of targets (i.e. HTTP endpoints) at a certain frequency, in our case starting at 60s. All metrics are stored on the local disk with a per-server retention period (minimum of 4 months for the initial goal).

All targets to be scraped are grouped into jobs, depending on the purpose that those targets serve. For example the job to scrape all host-level data for a given location using node-exporter will be called node and each target will be listed as hostname:9100. Similarly there could be jobs for varnish, mysql, etc.

Each prometheus server is meant to be stand-alone and to poll targets in the same failure domain as the server itself (e.g. the same datacenter, the same vlan and so on). This keeps monitoring local to the datacenter and avoids spotty metrics during cross-datacenter connectivity blips. (See also Federation)

Prometheus single server.png

Exporters

The endpoint being polled by the prometheus server and answering the GET requests is typically called an exporter, e.g. the host-level metrics exporter is node-exporter.

Each exporter serves the current snapshot of metrics when polled by the prometheus server; the exporter itself keeps no metric history. Further, the exporter usually runs on the same host as the service or host it is monitoring.
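
As a rough illustration of the exporter behaviour described above, here is a minimal sketch that uses only the Python standard library. It is not how node-exporter or any packaged exporter is implemented; the demo_* metric names and port 9999 are made up for this example. Every GET to /metrics builds a fresh snapshot and nothing is stored between scrapes.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics():
    """Build a fresh snapshot in the Prometheus text format on every call."""
    lines = [
        "# HELP demo_time_seconds Current Unix time on the exporter host.",
        "# TYPE demo_time_seconds gauge",
        "demo_time_seconds %f" % time.time(),
        "# HELP demo_cpu_seconds_total CPU time used by this process.",
        "# TYPE demo_cpu_seconds_total counter",
        "demo_cpu_seconds_total %f" % time.process_time(),
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # No history is kept: the body is recomputed for each scrape.
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Serve /metrics until interrupted (port 9999 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; a real exporter would more likely be built on the Prometheus client library for its language rather than formatting the text by hand.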

Storage

Why just stand-alone prometheus servers with local storage and not clustered storage? The idea behind a single prometheus server is one of reliability: a monitoring system must be more reliable than the systems it is monitoring. It is certainly easier to get local storage right and reliable than clustered storage, which is especially important when collecting operational metrics.

See also prometheus storage documentation for a more in-depth explanation and storage space requirements.

High availability

With local storage being the basic building block we can still achieve high-availability by running more than one server in parallel, each configured the same and polling the same set of targets. Queries for data can be routed via LVS in an active/standby fashion.

Prometheus HA server.png

Backups

For efficiency reasons, prometheus spools chunks of datapoints in memory for each metric before flushing them to disk. This makes it harder to perform backups online by simply copying the files on disk. The issue of having consistent backups is also discussed in prometheus #651.

Notwithstanding the above, it should be possible to back up the prometheus local storage files as-is by archiving the storage directory with tar before regular (bacula) backups. Since the backup is taken online it will contain some inconsistencies; upon restoring the backup, Prometheus will run crash recovery on its storage at startup.
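
The archiving step could look roughly like the sketch below, which uses Python's tarfile module instead of the tar command. The function name and the archive naming scheme are assumptions for illustration, and the storage and destination paths have to be supplied by the caller since they depend on the local setup.

```python
import tarfile
import time
from pathlib import Path


def archive_storage(storage_dir, dest_dir):
    """Archive storage_dir into a timestamped .tar.gz under dest_dir.

    When run against a live server the archive is not guaranteed to be
    consistent, and files that shrink while being read can make tarfile
    raise an OSError; callers may want to retry in that case.
    """
    storage = Path(storage_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: prometheus-<timestamp>.tar.gz
    archive = dest / ("prometheus-%s.tar.gz" % time.strftime("%Y%m%dT%H%M%S"))
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the storage directory's own name.
        tar.add(storage, arcname=storage.name)
    return archive
```

The resulting file can then be picked up by the regular backup run like any other file.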

To back up a consistent/clean state, at the moment prometheus needs to be shut down gracefully. When running an active/standby configuration, the backup can therefore be taken on the standby prometheus to minimize its impact. Note that the shutdown will result in gaps in the standby server's metrics for the duration of the shutdown.

Failure recovery

In the event of a prometheus server ending up with unusable local storage (disk failed, FS failed, corruption, etc.), failure recovery can take the form of:

  • start with empty storage: the metric history on the local server is completely lost, and a full history is available again only once the metric retention period has passed.
  • recover from backups: restore the storage directory from the last good backup.
  • copy data from a similar server: when deployed in pairs it is possible to copy/rsync the storage directory onto the failed server. This will likely result in gaps in the recent history though (see also Backups).

Federation

Each prometheus server is able to act as a target to another prometheus server by means of federation. Our use case for this feature is primarily hierarchical federation, namely to have a 'global' prometheus that aggregates datacenter-level metrics from prometheus in each datacenter.

See also federation documentation
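
For illustration, a federating server scrapes the /federate endpoint of another Prometheus and selects the series it wants with one or more match[] parameters. The sketch below only builds such a URL; the helper name, hostname and selectors are hypothetical and not taken from our configuration.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Return the /federate URL selecting the series given by matchers.

    Each matcher is an instant vector selector such as '{job="node"}';
    one match[] parameter is emitted per matcher.
    """
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return "%s/federate?%s" % (base_url.rstrip("/"), query)
```

For example, federate_url("http://prometheus.example.org:9090", ['{job="node"}']) yields a URL that a 'global' Prometheus could be pointed at, with the selector percent-encoded in the query string.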

Service Discovery

Prometheus supports different kinds of discovery through its configuration. For example, role::prometheus::labs_project implements auto-discovery of all instances for a given labs project. file_sd_config is used to continuously monitor a set of configuration files for changes, and the script prometheus-labs-targets is run periodically to write the list of instances to the relevant configuration file. The file_sd files are reloaded automatically by prometheus, so new instances will be auto-discovered and have their instance-level metrics collected.
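
The sketch below shows one way a periodic script could write such a target file. It is not the actual prometheus-labs-targets script: the function name, the instance names and the label are assumptions, and the output is the JSON form accepted by file_sd (a list of groups, each with "targets" and optional "labels"). The file is replaced atomically so that Prometheus never reads a half-written list.

```python
import json
import os
import tempfile


def write_targets(path, instances, port=9100, labels=None):
    """Atomically write a file_sd JSON file listing instance:port targets."""
    groups = [{
        "targets": sorted("%s:%d" % (name, port) for name in instances),
        "labels": labels or {},
    }]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it over
    # the final path; the .tmp suffix keeps it out of *.json file patterns.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(groups, tmp, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file with mode 0600
    os.replace(tmp_path, path)
    return groups
```

A call such as write_targets("targets.json", ["host1.example", "host2.example"], labels={"project": "demo"}) would produce a single target group with both hosts on port 9100.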

While file-based service discovery works, Prometheus also supports higher-level discovery mechanisms, for example for Kubernetes (see also role::prometheus::tools).

Use cases

MySQL

MySQL monitoring is performed by running prometheus-mysqld-exporter on the database machine to be monitored. Metrics are exported via HTTP on port 9104 and fetched by the prometheus server(s). To preview what metrics are being collected, a fetch can be simulated with:

curl -s localhost:9104/metrics | grep -v '^#'

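
The same preview can be sketched in Python, which may be handy when the output needs further processing. The function names are made up for this example; the URL mirrors the curl command above. Unlike grep -v '^#', this version also drops empty lines.

```python
from urllib.request import urlopen


def strip_comments(text):
    """Return the sample lines, dropping '#' comment lines and blank lines."""
    return [line for line in text.splitlines()
            if line and not line.startswith("#")]


def fetch_samples(url="http://localhost:9104/metrics", timeout=5):
    """Fetch the exporter's current snapshot and return only sample lines."""
    with urlopen(url, timeout=timeout) as response:
        return strip_comments(response.read().decode("utf-8"))
```

Running fetch_samples() on the database machine returns one string per exported sample, in the same "<metric> <value>" form the prometheus server sees.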

Dashboards

Per group / shard / role overview
https://grafana.wikimedia.org/dashboard/db/mysql-aggregated
Per server drilldown
https://grafana.wikimedia.org/dashboard/db/mysql

Ganglia

One of the initial use cases for Prometheus is to provide a service at least as good as Ganglia. For host-level metrics we're using prometheus-node-exporter and grouping hosts based on the $cluster puppet variable.

Dashboards

Per cluster overview
https://grafana.wikimedia.org/dashboard/db/prometheus-by-ganglia-cluster

Replacing Ganglia

As of Aug 2016 Prometheus is deployed in WMF's main locations: codfw and eqiad. To achieve feature-parity with Ganglia we'd need to expand Prometheus deployment to more locations, more machines and more metrics.

more locations
To fully replace Ganglia we'd need to deploy one (or two) prometheus servers in caching DCs too, similar to what we're doing with the ganglia aggregators. In practice this would mean running the server on the ulsfo and esams bastions; as of Aug 2016 resources on both (i.e. disk space and memory) seem available. To have aggregated stats available it is also possible to deploy a "global" Prometheus server (in eqiad/codfw) that federates from each DC-local Prometheus.
more machines
Increase the number of machines from which we collect host metrics to 100% for each location Prometheus is deployed to, for jessie and trusty distributions.
more metrics
The current Ganglia deployment includes metrics other than machine-level ones, namely the gmond plugins listed below and committed to puppet.git. Some of those can be replaced by existing exporters listed at https://prometheus.io/docs/instrumenting/exporters/ while others will require some porting to prometheus' python client (packaged as python-prometheus-client). Each prometheus exporter will require some deployment/packaging work, namely creating packages (preferably using Debian native go packaging, or fpm as outlined at Prometheus/Exporters), adding puppet integration and instructing prometheus to poll the additional exporters.

Ganglia plugins

apache_status.py
Parses apache's status page, similar to https://github.com/neezgee/apache_exporter
gdnsd.py
Parses gdnsd JSON stats from localhost:3506/json, will require porting to prometheus python client
varnish.py
Parses varnish's JSON, similar to https://github.com/jonnenauha/prometheus_varnish_exporter
vhtcpd.py
Parses metrics from /tmp/vhtcpd.stats and will require porting
mysql.py
Already replaced by prometheus-mysqld-exporter
elasticsearch_monitoring.py
Parses metrics from localhost:9200, replacement could be based off something like https://github.com/Braedon/prometheus-es-exporter or https://github.com/ewr/elasticsearch_exporter
hhvm_mem.py
Parses JSON from localhost:9002/memory.json, will require porting to prometheus python client
hhvm_health.py
Ditto, for localhost:9002/check-health
gmond_memcached.py
Similar to https://github.com/prometheus/memcached_exporter
ocg.py
Parses stats from http://localhost:8000/?command=health, OCG is on its way out though
osm.py
Parses stats from /srv/osmosis/state.txt, from OSM's ganglia.py
postgresql.py
Similar to https://github.com/wrouesnel/postgres_exporter
gmond_jenkins.py
Similar to https://github.com/lovoo/jenkins_exporter
udp2log_socket.py
Counts sockets from udp2log, still used/useful?
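
For the plugins above that read JSON stats and are marked as requiring porting (e.g. gdnsd.py or hhvm_mem.py), the core of a port could look roughly like the sketch below: flatten the nested JSON into metric names and values. This is a standard-library sketch only, and the JSON layout in the usage note is hypothetical, not the real output of any of those services. An actual port would more likely expose the values through python-prometheus-client rather than formatting text by hand.

```python
import json
import re


def flatten(stats, prefix):
    """Yield (metric_name, value) for every numeric leaf in nested JSON."""
    for key, value in sorted(stats.items()):
        # Metric names may only contain letters, digits and underscores.
        name = re.sub(r"[^a-zA-Z0-9_]", "_", "%s_%s" % (prefix, key))
        if isinstance(value, dict):
            yield from flatten(value, name)
        elif isinstance(value, (int, float)):
            yield name, float(value)
        # Strings, lists and nulls are skipped: they are not sample values.


def to_exposition(json_text, prefix):
    """Convert a JSON stats document to '<metric> <value>' lines."""
    stats = json.loads(json_text)
    return "".join("%s %s\n" % (name, value)
                   for name, value in flatten(stats, prefix))
```

With a made-up document such as {"uptime": 10, "udp": {"reqs": 5}}, to_exposition(text, "gdnsd") produces one line for gdnsd_udp_reqs and one for gdnsd_uptime.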