Thanos
What is it?
Thanos is a CNCF (Sandbox) Prometheus-compatible system that provides global-view queries and long-term storage of metrics. See also the project's homepage at https://thanos.io. It is used at the Foundation to enhance the Prometheus deployment; the Thanos interface is available to SRE/NDA users at https://thanos.wikimedia.org
Thanos is composed of orthogonal components; as of Aug 2021 the following components are deployed at the Foundation:
- Sidecar: co-located with an existing Prometheus server, it implements the "Store API", meaning the Prometheus local storage is available to other Thanos components for querying. Additionally, the sidecar can upload local time series blocks (i.e. generated by Prometheus’ tsdb library) to object storage for long-term retention.
- Querier: receives Prometheus queries and reaches out to all configured Store APIs, merging the results as needed. Querier is fully stateless and scales horizontally.
- Store: exposes the "Store API", but instead of talking to the local Prometheus storage it talks to a remote object storage. The data found in the object storage bucket resembles what’s written on local disk by Prometheus, i.e. each directory represents one block of time series data spanning a certain time period.
- Compactor: takes care of the same process Prometheus performs on local disk, but on the object storage blocks, namely joining one or more blocks together for efficient querying and space savings.
- Rule: reaches out to all Thanos components and runs periodic queries, both for recording and alerting rules. Alerts that require a "global view" are handled by this component. See also https://wikitech.wikimedia.org/wiki/Alertmanager#Local_and_global_alerts for more information on alerts.
The following diagram illustrates the logical view of Thanos operations (the "data flow") and their protocols:
Use cases
Global view
Thanos query enables the so-called "global view" for metric queries: queries are sent out to the Prometheus servers in all sites and the results are merged and deduplicated as needed. Thanos is aware of Prometheus HA pairs (replicas) and is thus able to "fill the gaps" for missing data (e.g. as a result of maintenance).
Thanos query is available internally in production at http://thanos-query.discovery.wmnet and in Grafana as the "thanos" datasource. Metric results carry two additional labels, site and prometheus, identifying which site and which Prometheus instance the metrics come from. For example the query count by (site) (node_boot_time_seconds) will result in aggregated host counts:
{site="eqiad"} 831 {site="codfw"} 667 {site="esams"} 29 {site="ulsfo"} 24 {site="eqsin"} 24
Similarly, this query returns how many targets each Prometheus instance is currently scraping across all sites: count by (prometheus) (up)
{prometheus="k8s"} 386 {prometheus="k8s-staging"} 39 {prometheus="ops" } 9547 {prometheus="analytics"} 241 {prometheus="services"} 103
Internally each Prometheus instance also exports a replica label (a, b) for Thanos to pick up while deduplicating results. Deduplication can be turned off at query time; doing so for the examples above would result in doubled counts.
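The same queries can also be issued directly against the Thanos query HTTP API for ad-hoc checks. A minimal sketch, assuming the standard Prometheus-compatible /api/v1/query endpoint and the Thanos dedup parameter (dedup=false shows the non-deduplicated, doubled counts):
curl -sG 'http://thanos-query.discovery.wmnet/api/v1/query' --data-urlencode 'query=count by (site) (node_boot_time_seconds)' --data-urlencode 'dedup=false'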
Long-term storage and downsampling
Each Thanos sidecar uploads blocks to object storage; in the Foundation's deployment we are using OpenStack Swift with the S3 compatibility API. The Swift cluster is independent of the Swift cluster used for media storage (ms-* hosts) and is available at https://thanos-swift.discovery.wmnet. Data is replicated (without encryption) four times spanning codfw and eqiad (multi-region in Swift parlance), thus making the service fully multi-site.
Metric data uploaded to object storage is retained raw (i.e. not downsampled) and is periodically downsampled to 5m and 1h resolutions for fast results over long periods of time. See also more information about downsampling at https://thanos.io/components/compact.md/#downsampling-resolution-and-retention
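To inspect which blocks have been uploaded, the upstream thanos CLI can list the bucket contents. This is a sketch only; the objstore config path is an assumption (any file containing the Swift/S3 credentials will do):
thanos tools bucket ls --objstore.config-file=/etc/thanos/objstore.yaml
The bucket web interface described below exposes the same information in a browser.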
Deployment
The current (Jun 2020) Thanos deployment in Production is illustrated by the following diagram:
File:Thanos deployment view.svg
Web interfaces
The query interface of Thanos is available (SSO-authenticated) at https://thanos.wikimedia.org and can be used to run queries for exploration purposes. The same queries can then be used e.g. within Grafana or in alerts.
The underlying block storage viewer is exposed at https://thanos.wikimedia.org/bucket/ and allows inspecting which blocks are stored and their time ranges. It is occasionally useful to investigate problems with the Thanos compactor.
Ports in use
Each Thanos component listens on two ports: gRPC for inter-component communication and HTTP to expose Prometheus metrics. Aside from the metrics use case, the only HTTP port used by external systems is the one for Thanos query (10902, proxied on port 80 by Apache).
Component | HTTP | gRPC
---|---|---
Query | 10902 | 10901
Compact | 12902 | N/A
Store | 11902 | 11901
Sidecar | Prometheus port + 10000 | Prometheus port + 20000
Bucket web | 15902 | N/A
Rule | 17902 | 17901
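Each HTTP port above serves Prometheus metrics. A quick, hedged check on a Thanos host (10902 shown for the query component; substitute the port from the table for other components):
curl -s http://localhost:10902/metrics | grep '^thanos_'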
Porting dashboards to Thanos
This section outlines strategies to port existing Grafana dashboards to Thanos. See also bug T256954 for the tracking task.
- single "namespace"
- these dashboard display a "thing" which is uniquely named across sites, for example host overview dashboard. In this case it is sufficient to default the datasource to "thanos". The key here being the fact that there's no ambiguity for hostnames across sites.
- multiple sites in the same panel
- these dashboard use multiple datasources mixed in the same panel. the migration to Thanos is straightforward: use thanos as the single datasource and run a single query: the results can be aggregated by site label as needed.
- overlapping names across sites
- the dashboard displays a "thing" deployed with the same name across sites. For example clusters are uniquely identified by their name plus site. In these cases there is usually a datasource variable to select the site-local Prometheus. Especially when first migrating a dashboard it is important to be able to go back to this model and bypass Thanos. Such "backwards compatibility" is achieved by the following steps:
- Introduce a template variable $site with datasource thanos and query label_values(node_boot_time_seconds, site)
- Change the existing datasource variable instance name filter to include thanos as well. The thanos datasource will be the default for the dashboard.
- Change the remaining query-based variables to include a site=~"$site" selector, as needed. For example the cluster variable is based on the label_values(node_boot_time_seconds{site=~"$site"}, cluster) query. Ditto for instance variable, the query is label_values(node_boot_time_seconds{cluster="$cluster",site=~"$site"}, instance).
- Add the same site=~"$site" selector to all queries in panels that need it; without the selector, panel queries will return data for all sites. A before/after sketch follows this list.
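For illustration, a hypothetical host overview panel query would change as follows (metric and labels are examples only):
Before (site-local Prometheus datasource): node_load1{cluster="$cluster", instance="$instance"}
After (thanos datasource): node_load1{cluster="$cluster", instance="$instance", site=~"$site"}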
The intended usage is to select the site via the site variable and no longer via the datasource (which must default to thanos). If for some reason querying through Thanos doesn't work, it is possible to temporarily fall back to querying Prometheus directly:
- Input .* as the site
- Select the desired site-local Prometheus from the datasource dropdown.
Operations
Pool / depool a site
The read path is served by thanos-query.discovery.wmnet and the write path by thanos-swift.discovery.wmnet. Each can be pooled/depooled individually and both are generally active/active.
thanos-swift
The actual depool is a confctl away (use set/pooled=true to pool again):
confctl --object-type discovery select 'dnsdisc=thanos-swift,name=SITE' set/pooled=false
The Thanos sidecar maintains persistent connections to object storage. If draining such connections is desired then a roll-restart of thanos-sidecar is sufficient:
cumin -b1 'R:thanos::sidecar' 'systemctl restart thanos-sidecar@*'
thanos-query
There are generally no persistent connections (i.e. from Grafana) to the thanos-query endpoint, thus the following is sufficient to pool a site (use set/pooled=false to depool):
confctl --object-type discovery select 'dnsdisc=thanos-query,name=SITE' set/pooled=true
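To verify the result, confctl can read back the current state of the discovery object; a minimal sketch:
confctl --object-type discovery select 'dnsdisc=thanos-query,name=SITE' get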
Alerts
More than one Thanos compact running
Thanos compact has not run / is halted
The compactor has run into some trouble while compacting (e.g. out of disk space) and has halted compaction. The thanos-compact service logs will have more detailed information about the failure; see also the thanos-compact dashboard. The compactor runs as a singleton; the active host is defined by profile::thanos::compact_host in Hiera.
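A hedged starting point for debugging on the active compact host, assuming the thanos-compact systemd unit and the compact HTTP port from the table above (the thanos_compact_halted metric reports whether compaction is halted):
journalctl -u thanos-compact --since '1 hour ago'
curl -s http://localhost:12902/metrics | grep thanos_compact_halted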
Thanos <component> has disappeared from Prometheus discovery
Thanos <component> has high percentage of failures
Thanos <component> has high latency
Thanos sidecar cannot connect to Prometheus
Thanos sidecar is unhealthy
Thanos query has high gRPC client errors
The thanos-query service is experiencing errors; consulting its logs will reveal the detailed cause.
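A sketch for a first look at those logs, assuming the service runs as the thanos-query systemd unit:
journalctl -u thanos-query --since '1 hour ago' | grep -i error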
Thanos sidecar is failing to upload blocks
The Thanos sidecar component has failed to upload blocks to object storage. The alert indicates which Prometheus instance, host and site the error is occurring in. The next action is to inspect the Thanos sidecar logs (unit thanos-sidecar@INSTANCE, one per Prometheus instance) for the upload error. The alert will clear once the affected thanos-sidecar unit is restarted.
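For example, with the hypothetical instance ops on the affected host, checking the logs and restarting the unit would look roughly like:
journalctl -u thanos-sidecar@ops --since '1 hour ago' | grep -i upload
systemctl restart thanos-sidecar@ops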