Prometheus
What is it?
Prometheus is a free software ecosystem for monitoring and alerting, with a focus on reliability and simplicity. See also the prometheus overview and prometheus FAQ.
There are a few interesting features that are missing from what we have now, among others:
- multi-dimensional data model
Metrics have a name and several key=value pairs to better model what the metric is about. For example, to measure varnish requests in the upload cache in eqiad we'd have a metric like http_requests_total{cache="upload",site="eqiad"}.
- a powerful query language
Makes it possible to ask complex questions, e.g. when debugging problems or drilling down for root cause during outages. From the example above, the query topk(3, sum(http_requests_total{status=~"^5"}) by (cache)) would return the top 3 caches (text/upload/misc) with the most errors (status matches the regexp "^5").
- pull metrics from targets
Prometheus is primarily based on a pull model, in which the prometheus server has a list of targets it should scrape metrics from. The pull protocol is HTTP-based and, simply put, the target returns a list of "<metric> <value>" lines. Pushing metrics is supported too, see also http://prometheus.io/docs/instrumenting/pushing/.
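As a quick illustration of that format, the sketch below uses the Python client to render a registry in the exposition format a target would return when scraped; the metric here is made up for the example.

from prometheus_client import CollectorRegistry, Gauge, generate_latest

# Hypothetical metric, only for showing the exposition format.
registry = CollectorRegistry()
g = Gauge('http_requests_in_flight', 'Requests currently being served',
          ['cache', 'site'], registry=registry)
g.labels(cache='upload', site='eqiad').set(42)

# Prints lines like: http_requests_in_flight{cache="upload",site="eqiad"} 42.0
print(generate_latest(registry).decode())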
After the Prometheus POC (as per User:Filippo_Giunchedi/Prometheus_POC) has been running in Labs for some time, during FQ1 2016-2017 we'll be extending the Prometheus deployment to production, as outlined in the Technical Operations goals.
Architecture
Each prometheus server is configured to scrape a list of targets (i.e. HTTP endpoints) at a certain frequency, in our case starting at 60s. All metrics are stored on the local disk with a per-server retention period (minimum of 4 months for the initial goal).
All targets to be scraped are grouped into jobs, depending on the purpose that those targets serve. For example, the job to scrape all host-level data for a given location using node-exporter will be called node and each target will be listed as hostname:9100. Similarly there could be jobs for varnish, mysql, etc.
Each prometheus server is meant to be stand-alone and to poll targets in the same failure domain as the server itself as appropriate (e.g. the same datacenter, the same vlan and so on). For example this allows keeping the monitoring local to the datacenter and avoids spotty metrics upon cross-datacenter connectivity blips. (See also Federation)
Exporters
The endpoint being polled by the prometheus server and answering the GET requests is typically called exporter, e.g. the host-level metrics exporter is node-exporter.
Each exporter serves the current snapshot of metrics when polled by the prometheus server; there is no metric history kept by the exporter itself. Further, the exporter usually runs on the same host as the service or host it is monitoring.
Storage
Why just stand-alone prometheus servers with local storage and not clustered storage? The idea behind a single prometheus server is one of reliability: a monitoring system must be more reliable than the systems it is monitoring. It is certainly easier to get local storage right and reliable than clustered storage, which is especially important when collecting operational metrics.
See also prometheus storage documentation for a more in-depth explanation and storage space requirements.
High availability
With local storage being the basic building block we can still achieve high availability by running more than one server in parallel, each configured the same way and polling the same set of targets. Queries for data can be routed via LVS in an active/standby fashion.
Backups
For efficiency reasons, prometheus spools chunks of datapoints in memory for each metric before flushing them to disk. This makes it harder to perform backups online by simply copying the files on disk. The issue of having consistent backups is also discussed in prometheus #651.
Notwithstanding the above, it should be possible to back up the prometheus local storage files as-is by archiving the storage directory with tar before regular (bacula) backups. Since the backup is done online it will contain some inconsistencies; upon restoring the backup, Prometheus will crash-recover its storage at startup.
To perform backups of a consistent/clean state, at the moment prometheus needs to be shut down gracefully; therefore, when running an active/standby configuration, the backup can be taken on the standby prometheus to minimize impact. Note that the shutdown will result in gaps in the standby prometheus server's data for the duration of the shutdown.
Failure recovery
In the event of a prometheus server having unusable local storage (disk failed, FS failed, corruption, etc.), failure recovery can take the form of:
start with empty storage: this of course means a complete loss of metric history for the local server, though the situation will fully recover once the metric retention period has passed.
- recover from backups: restore the storage directory to the last good backup
- copy data from a similar server: when deployed in pairs it is possible to copy/rsync the storage directory onto the failed server, this will likely result in gaps in the recent history though (see also Backups)
Federation and multiple DCs
Each prometheus server is able to act as a target to another prometheus server by means of Prometheus federation.
Our use case for this feature is primarily hierarchical federation, namely to have a 'global' prometheus that aggregates datacenter-level metrics from the prometheus in each datacenter.
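Concretely, the global instance scrapes the /federate endpoint of each dc-local server, selecting which series to pull via match[] parameters (see the federation docs linked above). Below is a hedged sketch of what such a scrape looks like, assuming a hypothetical dc-local hostname:

import requests

# The /federate endpoint and match[] selector are standard; the hostname is made up.
resp = requests.get('http://prometheus.eqiad.example.org:9090/federate',
                    params={'match[]': ['{job="node"}']})
for line in resp.text.splitlines():
    if not line.startswith('#'):  # skip HELP/TYPE comments
        print(line)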
The global instance is what we would normally use in grafana as the "datasource" for dashboards to get an overview of all sites and aggregated metrics. To drill down further and get more details it is possible to use the datacenter-local datasource and dashboards.
Server location
In the federation setup described above, the various Prometheus servers are logically separated, though physically they can share one or more machines. As of Nov 2016 the dc-local Prometheus runs in two VMs for each of eqiad/codfw (instance named "ops") and we're in the process of provisioning real hardware.
An open question at this time is where to host the dc-local Prometheus servers for caching centers; there are essentially two options:
- Local to the site
- Remote, e.g. codfw polling ulsfo and eqiad polling esams
The local option offers some advantages since all sites are logically the same and all polling for monitoring purposes is kept local to the site. Only the global instance would reach out to remote sites and thus could be affected by cross-DC network unavailability.
This is significant especially during outages: the global instance would show a drop in global aggregates while the dc-local instance can keep collecting high-resolution data from site-local machines.
One disadvantage of the local option (as of Nov 2016) is having to run Prometheus on the bastion, alongside other services like tftp/installserver, for sites where we lack dedicated internal machines (e.g. ulsfo).
Service Discovery
Prometheus supports different kinds of discovery through its configuration.
For example, role::prometheus::labs_project implements auto-discovery of all instances for a given labs project: file_sd_config is used to continuously monitor a set of configuration files for changes, and the script prometheus-labs-targets is run periodically to write the list of instances to the relevant configuration file. The file_sd files are reloaded automatically by prometheus, so new instances will be auto-discovered and have their instance-level metrics collected.
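For illustration, here is a minimal sketch of the kind of file such a generator script could write; the path, project and instance names are made up, while the "targets"/"labels" structure is the standard file_sd JSON format.

import json

# Hypothetical list of instances; a real script would discover these via the labs APIs.
instances = ['instance-01.myproject.eqiad.wmflabs', 'instance-02.myproject.eqiad.wmflabs']

file_sd = [{
    'targets': ['%s:9100' % name for name in instances],  # node-exporter port
    'labels': {'project': 'myproject'},
}]

# Hypothetical path; it has to match the files glob in the prometheus job's file_sd_config.
with open('/srv/prometheus/targets/node_myproject.json', 'w') as f:
    json.dump(file_sd, f, indent=2)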
While file-based service discovery works, Prometheus also supports higher-level discovery, for example for Kubernetes (see also role::prometheus::tools).
Adding new metrics
Direct service instrumentation
Service metrics are most beneficial when services are directly instrumented with one of the Prometheus clients, e.g. the Python client. Metrics are then exposed via HTTP, commonly at /metrics, on the service's HTTP port (in the common case) or on a separate port if the service isn't HTTP to begin with.
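A minimal sketch of direct instrumentation with the Python client follows; the service, metric names, label and port are hypothetical.

import time
from prometheus_client import start_http_server, Counter, Summary

REQUESTS = Counter('myservice_requests_total', 'Requests handled', ['method'])
LATENCY = Summary('myservice_request_latency_seconds', 'Request latency in seconds')

@LATENCY.time()
def handle_request(method='GET'):
    REQUESTS.labels(method=method).inc()
    time.sleep(0.01)  # stand-in for real work

if __name__ == '__main__':
    start_http_server(8000)  # serves /metrics on a separate port
    while True:
        handle_request()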
Service exporters
For cases where services can't be directly instrumented (i.e. whitebox monitoring isn't possible), a sidecar exporter application can be run alongside the service; it queries the service using whatever mechanism is available and exposes prometheus metrics via the client. This is the case for example for varnish_exporter parsing varnishstat -j or apache_exporter parsing apache's mod_status page.
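The pattern those exporters follow can be sketched with the Python client's custom collector API; everything service-specific below (status URL, JSON keys, port) is hypothetical.

import json
import time
import urllib.request
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class MyServiceCollector(object):
    """Fetch stats from a service we can't instrument and translate them at scrape time."""
    def collect(self):
        with urllib.request.urlopen('http://localhost:8080/status.json') as resp:
            stats = json.load(resp)
        yield GaugeMetricFamily('myservice_connections',
                                'Open connections reported by the service',
                                value=stats.get('connections', 0))

if __name__ == '__main__':
    REGISTRY.register(MyServiceCollector())
    start_http_server(9180)  # arbitrary exporter port
    while True:
        time.sleep(60)  # metrics are generated on demand at scrape time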
Machine-level metrics
Another class of metrics is everything related to the machine itself rather than a particular service. Collecting those often involves calling a subprocess and parsing the result, typically from a cronjob. In these cases the simplest thing to do is to drop plaintext files on the machine's filesystem for node-exporter to pick up and expose over HTTP. This mechanism is named textfile and the python client, for example, has support for it (see sample textfile collector usage). This is most likely the mechanism we could use to replace most of the custom collectors we have for Diamond.
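A hedged sketch of a cronjob using the Python client's textfile support; the output path and metric name are assumptions, and the directory must match whatever node-exporter's textfile collector is configured to read.

from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

registry = CollectorRegistry()
g = Gauge('my_cronjob_last_success_timestamp_seconds',
          'Unixtime the cronjob last completed successfully', registry=registry)
g.set_to_current_time()

# Hypothetical textfile collector directory; node-exporter picks up *.prom files from it.
write_to_textfile('/var/lib/prometheus/node.d/my_cronjob.prom', registry)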
Ephemeral jobs
Yet another case involves service-level ephemeral jobs that are not quite long-lived enough to be queried via HTTP. For those jobs there's a push mechanism to be used: metrics are pushed to Prometheus' pushgateway via HTTP and subsequently scraped by Prometheus from the gateway itself. This method is similar to statsd in its simplicity but should be used with care, see also best practices on when to use the pushgateway. Good use cases could be MW's maintenance jobs: tracking how long the job took and when it last succeeded; if the job isn't tied to a particular machine it is usually a good candidate.
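For example, a maintenance job could record its last success roughly as follows; the pushgateway hostname and job name are made up (9091 is the pushgateway's default port).

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
g = Gauge('mw_maintenance_last_success_timestamp_seconds',
          'Unixtime the maintenance job last succeeded', registry=registry)
g.set_to_current_time()

# Hypothetical pushgateway address; prometheus then scrapes the gateway itself.
push_to_gateway('pushgateway.example.org:9091', job='mw_maintenance', registry=registry)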
Use cases
MySQL
MySQL monitoring is performed by running prometheus-mysqld-exporter on the database machine to be monitored. Metrics are exported via http on port 9104 and fetched by the prometheus server(s); to preview what metrics are being collected, a fetch can be simulated with:
curl -s localhost:9104/metrics | grep -v '^#'
Dashboards
- Per group / shard / role overview
- https://grafana.wikimedia.org/dashboard/db/mysql-aggregated
- Per server drilldown
- https://grafana.wikimedia.org/dashboard/db/mysql
Ganglia
One of the initial use cases for Prometheus is to provide at least as good a service as Ganglia. For host-level metrics we're using prometheus-node-exporter and grouping hosts based on the $cluster puppet variable.
Dashboards
- Per cluster overview
- https://grafana.wikimedia.org/dashboard/db/prometheus-by-ganglia-cluster
Replacing Ganglia
As of Aug 2016 Prometheus is deployed in WMF's main locations: codfw and eqiad. To achieve feature-parity with Ganglia we'd need to expand Prometheus deployment to more locations, more machines and more metrics.
- more locations
To fully replace Ganglia we'd need to deploy one (or two) prometheus servers in caching DCs too, similar to what we're doing with the ganglia aggregators. In practice this would mean running the server on the ulsfo and esams bastions; as of Aug 2016 resources on both seem available (i.e. disk space and memory). To have aggregated stats available it is also possible to deploy a "global" Prometheus server (in eqiad/codfw) that federates from each DC-local Prometheus.
- more machines
- Increase the number of machines from which we collect host metrics to 100% for each location Prometheus is deployed to, for jessie and trusty distributions.
- more metrics
The current Ganglia deployment includes metrics other than machine-level ones, namely the gmond plugins listed below and committed to puppet.git. Some of those can be replaced by existing exporters listed at https://prometheus.io/docs/instrumenting/exporters/ while others will require some porting to prometheus' python client (packaged as python-prometheus-client). Each prometheus exporter will require some deployment/packaging work, namely creating packages (preferably using Debian native go packaging, or fpm as outlined at Prometheus/Exporters), plus puppet integration and instructing prometheus to poll the additional exporters.
Ganglia plugins
- apache_status.py
- Parses apache's status page, similar to https://github.com/neezgee/apache_exporter
- gdnsd.py
- Parses gdnsd JSON stats from localhost:3506/json, will require porting to prometheus python client
- varnish.py
- Parses varnish's JSON, similar to https://github.com/jonnenauha/prometheus_varnish_exporter
- vhtcpd.py
- Parses metrics from /tmp/vhtcpd.stats and will require porting
- mysql.py
- Already replaced by prometheus-mysqld-exporter
- elasticsearch_monitoring.py
- Parse metrics from localhost:9200, replacement could be based off something like https://github.com/Braedon/prometheus-es-exporter or https://github.com/ewr/elasticsearch_exporter
- hhvm_mem.py
- Parse json from localhost:9002/memory.json, will require porting to prometheus python client
- hhvm_health.py
- Ditto, for localhost:9002/check-health
- gmond_memcached.py
- Similar to https://github.com/prometheus/memcached_exporter
- ocg.py
- Parses stats from http://localhost:8000/?command=health, OCG is on its way out though
- osm.py
- Parse stats from /srv/osmosis/state.txt, from OSM's ganglia.py
- postgresql.py
- Similar to https://github.com/wrouesnel/postgres_exporter
- gmond_jenkins.py
- Similar to https://github.com/lovoo/jenkins_exporter
- udp2log_socket.py
- Counts sockets from udp2log, still used/useful?
- varnishkafka_ganglia.py
- Parse json from /var/cache/varnishkafka/varnishkafka.stats.json
- kafkatee_ganglia.py
- Similar to varnishkafka_ganglia.py, parses json stats from /var/cache/kafkatee/kafkatee.stats.json
Replacing Graphite
Another use case imaginable for Prometheus is to replace the current Graphite deployment. This task is less "standalone" than replacing Ganglia and therefore more difficult: Graphite is more powerful and used by more people/services/dashboards. Nevertheless it should be possible to keep Prometheus and Graphite alongside each other and progressively put more data into Prometheus without affecting Graphite users. The top contributors to data that flows into Graphite as of Aug 2016 are Diamond, Statsd and Cassandra.
Diamond
Diamond runs on each machine in the fleet, collecting local data and sending the resulting metrics via TCP using the carbon line-oriented protocol. See also add Prometheus support to Diamond, though this might not be trivial as there needs to be a mapping from flat metric names to key => value pairs.
Similarly to Ganglia, there are custom collectors in use that would need equivalent functionality using Prometheus clients/exporters. For some simple results (e.g. exit code / single output from commands) it is easier to write metrics to a text file for node_exporter to pick up and present together with machine-level metrics.
- extendedexim.py
- Parse exim's paniclog and queue stats by calling exim -bpr
- localcrontab.py
- Report the number of users' crontabs, mainly used in tools
- minimalpuppetagent.py
- Report puppet stats from last_run_summary.yaml
- nagios.py
- Execute nagios commands locally and report the exit code
- nginx.py
- Collect nginx basic metrics from nginx's status page
- blazegraph.py
- Parse XML from localhost:9999
- cherry-pick-counter-collector.py
- Report the number of cherry-pick patches in a given git repo
- etherpad.py
- Parse localhost:9001 and report stats
- hhvm_apc.py
- Parse localhost:9002/dump-apc-info and report stats
- ircd_stats.py
- Parse MOTD from local irc server
- libvirtkvm.py
- Parse libvirt local KVM stats and expose per-instance stats
- memcached
- See memcached in ganglia above
- nf_conntrack_counter.py
- Report sysctl net.netfilter.nf_conntrack_count
- nfsd.py
- Parse and report stats from /proc/net/rpc/nfsd and /proc/fs/nfsd/pool_stats
- nfsiostat.py
- Emulate iostat for NFS mount points using /proc/self/mountstats
- nutcracker.py
- Parse json from nutcracker stats, though nutcracker might be on its way out and replaced by mcrouter
- openldap.py
- Parse openldap metrics from local ldap server
- powerdns.py / powerdns_recursor.py
- Parse metrics from rec_control
- pybal_state.py
- Parse PyBal's pools info from localhost:9090
- rcstream diamond_collector.py
- Parse RCStream stats from localhost:10080
- rabbitmq.py
- Collect rabbitmq queue stats, for openstack
- redisstat.py
- Collect redis stats from multiple instances
- sge.py
- Collect metrics from gridengine
- sshsessions.py
- Collect number of lines from who
- varnishstatus.py
Collect varnish stats from varnishtop, used in beta only?
- wdqs_updater.py
- Collect jmx stats exported by jolokia at http://localhost:8778
- wmfelastic.py
Pared-down collector for elasticsearch, exports basic stats and not per-index
Statsd
Statsd traffic for the most part flows from machines to statsd.eqiad.wmnet over UDP on port 8125 for aggregation. There are some exceptions (e.g. swift) where statsd aggregation is performed on localhost and then pushed via graphite line-oriented protocol.
Prometheus provides statsd_exporter to receive statsd metrics and turn those into key => value prometheus metrics according to a user-supplied mapping. The resulting metrics are then exposed via HTTP for prometheus server to scrape.
One idea to integrate statsd_exporter into our statsd traffic is to put it "inline" between the application and statsd.eqiad.wmnet. In other words we would need to:
- Modify statsd_exporter to mirror received udp packets to statsd.eqiad.wmnet and install it on end hosts
- Opt-in applications by changing their statsd host from statsd.eqiad.wmnet to localhost
- Extend the statsd_exporter mapping file to include mappings for our statsd metrics.
This method works well for applications/languages that are request-scoped (e.g. php) since there isn't necessarily a server process to keep and aggregate metrics in. For services that do qualify (i.e. long-lived processes), the recommended way is to switch to the Prometheus client for instrumentation.
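As a hedged illustration of the opt-in step above (switching the statsd host to localhost), an application keeps emitting plain statsd lines over UDP, just to a local statsd_exporter; the metric name below is made up, and 9125 is assumed to be statsd_exporter's default statsd listening port.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Same statsd line protocol as before, only the destination changes.
sock.sendto(b'MediaWiki.example.requests:1|c', ('localhost', 9125))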
Cassandra
Cassandra is hosted on separate Graphite machines due to the number and size of metrics it pushes, particularly in conjunction with Restbase. It should be evaluated separately whether e.g. a dedicated prometheus instance makes sense. With respect to implementation there are a few viable options:
- Scrape JMX cassandra metrics with https://github.com/prometheus/jmx_exporter, either externally over a JMX connection or as a "java agent" running alongside cassandra
- Add Prometheus java client (https://github.com/prometheus/client_java) to cassandra-metrics-collector (https://github.com/wikimedia/cassandra-metrics-collector) alongside Graphite support.
- Add Prometheus java client to creole (https://github.com/eevans/creole)
Dashboards
Grafana dashboards will need porting from Graphite to Prometheus metrics; this is likely to be the most labor-intensive part since most (all?) dashboards are hand-curated. While it should be possible to programmatically change statsd metric names into prometheus metric names, the query language is different enough to make this impractical except for very basic cases.