
Thanos


Thanos is a CNCF (Sandbox) Prometheus-compatible system with global view queries and long-term storage of metrics. See also the project's homepage at https://thanos.io. It is used at the Foundation to enhance the Prometheus deployment. The Thanos interface is available to SRE/NDA at https://thanos.wikimedia.org

Components

Thanos is composed of orthogonal components. As of Aug 2021 the following components are deployed at the Foundation:

Sidecar
co-located with an existing Prometheus server; implements the "Store API", meaning the local Prometheus storage is available to other Thanos components for querying. Additionally, the sidecar can upload local time series blocks (i.e. those generated by Prometheus' tsdb library) to object storage for long-term retention.
Querier
receives Prometheus queries and reaches out to all configured Store APIs, merging the results as needed. Querier is fully stateless and scales horizontally.
Store
exposes the "Store API", but instead of talking to the local Prometheus storage it talks to a remote object storage. The data found in the object storage bucket resembles what’s written on the local disk by Prometheus, i.e. each directory represents one block of time series data spanning a certain time period.
Compactor
takes care of the same process Prometheus does on local disk but on the object storage blocks, namely joining one or more blocks together for efficient querying and space savings.
Rule
reaches out to all Thanos components and runs periodic queries, both for recording and alerting rules. Alerts that require a "global view" are handled by this component. See also https://wikitech.wikimedia.org/wiki/Alertmanager#Local_and_global_alerts for more information on alerts.

Data flow

The following diagram illustrates the logical view of Thanos operations (the "data flow") and their protocols:

File:Thanos logical view.svg

Use cases

Global view

Thanos query enables the so-called "global view" for metric queries. In other words, metric queries are fanned out to the Prometheus instances in all sites and the results are merged and deduplicated as needed. Thanos is aware of Prometheus HA pairs (replicas) and is thus able to "fill the gaps" for missing data (e.g. as a result of maintenance).

Thanos query is available internally in production at http://thanos-query.discovery.wmnet and in Grafana as the "thanos" datasource. Metric results carry two additional labels, site and prometheus, identifying which site and which Prometheus instance the metrics come from. For example the query count by (site) (node_boot_time_seconds) results in aggregated host counts per site:

{site="eqiad"}  831
{site="codfw"}  667
{site="esams"}   29
{site="ulsfo"}   24
{site="eqsin"}   24

Similarly, this query returns how many targets each Prometheus instance is currently scraping across all sites: count by (prometheus) (up)

{prometheus="k8s"}           386
{prometheus="k8s-staging"}    39
{prometheus="ops"}          9547
{prometheus="analytics"}     241
{prometheus="services"}      103

Internally each Prometheus instance also exports a replica label (a, b) for Thanos to use when deduplicating results. Deduplication can be turned off at query time; doing so for the examples above would result in roughly doubled counts.

Long-term storage and downsampling

Each Thanos sidecar uploads blocks to object storage; in the Foundation's deployment this is OpenStack Swift with the S3 compatibility API. The Swift cluster is independent of the media-storage Swift cluster (ms-* hosts) and is available at https://thanos-swift.discovery.wmnet. Data is replicated (without encryption) across codfw and eqiad ("multi-region" in Swift parlance), making the service fully multi-site.

Metric data uploaded to object storage is retained raw (i.e. not downsampled) and is also periodically downsampled to 5m and 1h resolutions for fast results over long time ranges. See also more information about downsampling at https://thanos.io/components/compact.md/#downsampling-resolution-and-retention

Deployment

The current (Jun 2020) Thanos deployment in Production is illustrated by the following diagram:

File:Thanos deployment view.svg

Web interfaces

The query interface of Thanos is available (SSO-authenticated) at https://thanos.wikimedia.org and can be used to run queries for exploration purposes. The same queries can then be used e.g. within Grafana or in alerts.

The underlying block storage viewer is exposed at https://thanos.wikimedia.org/bucket/ and allows inspection of which blocks are stored and their time ranges. It is occasionally useful to investigate problems with the Thanos compactor.

The thanos-rule component interface can be browsed at https://thanos.wikimedia.org/rule/ . From there you can explore the current global alerts and aggregation rules.

Ports in use

Each Thanos component listens on two ports: gRPC for inter-component communication and HTTP to expose Prometheus metrics. Aside from the metrics use case, the only HTTP port used by external systems is that of Thanos query (10902, proxied on port 80 by Apache).

Component    HTTP                     gRPC
Query        10902                    10901
Compact      12902                    N/A
Store        11902                    11901
Sidecar      Prometheus port + 10000  Prometheus port + 20000
Bucket web   15902                    N/A
Rule         17902                    17901
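As an illustration of the sidecar row above, the sidecar ports are derived by fixed offsets from the Prometheus instance port. The instance port used below (9900) is a hypothetical example, not a real assignment:

```shell
# Hypothetical Prometheus instance port; real assignments vary per instance
PROM_PORT=9900
SIDECAR_HTTP=$((PROM_PORT + 10000))   # sidecar HTTP (metrics) port
SIDECAR_GRPC=$((PROM_PORT + 20000))   # sidecar gRPC (Store API) port
echo "http=${SIDECAR_HTTP} grpc=${SIDECAR_GRPC}"   # http=19900 grpc=29900
```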

Porting dashboards to Thanos

This section outlines strategies to port existing Grafana dashboards to Thanos. See also bug T256954 for the tracking task.

single "namespace"
these dashboards display a "thing" which is uniquely named across sites, for example the host overview dashboard. In this case it is sufficient to default the datasource to "thanos". The key here is that there is no ambiguity in hostnames across sites.
multiple sites in the same panel
these dashboards mix multiple datasources in the same panel. The migration to Thanos is straightforward: use thanos as the single datasource and run a single query; the results can be aggregated by the site label as needed.
overlapping names across sites
the dashboard displays a "thing" deployed with the same name across sites; for example, clusters are uniquely identified by their name plus their site. In these cases there is usually a datasource variable to select the site-local Prometheus. Especially when first migrating a dashboard it is important to be able to go back to this model and bypass Thanos. Such "backwards compatibility" is achieved by the following steps:
  1. Introduce a template variable $site with datasource thanos and query label_values(node_boot_time_seconds, site)
  2. Change the existing datasource variable instance name filter to include thanos as well. The thanos datasource will be the default for the dashboard.
  3. Change the remaining query-based variables to include a site=~"$site" selector, as needed. For example the cluster variable is based on the label_values(node_boot_time_seconds{site=~"$site"}, cluster) query. Ditto for instance variable, the query is label_values(node_boot_time_seconds{cluster="$cluster",site=~"$site"}, instance).
  4. Add the same site=~"$site" selector to all queries in panels that need it; without the selector, panel queries will return data for all sites.

The intended usage is to select the site via the site variable and no longer via the datasource (which must default to thanos). If for some reason querying through Thanos doesn't work, it is possible to temporarily fall back to querying Prometheus directly:

  1. Input .* as the site
  2. Select the desired site-local Prometheus from the datasource dropdown.

Operations

Pool / depool a site

The read path is served by thanos-query.discovery.wmnet and the write path by thanos-swift.discovery.wmnet. Each can be pooled/depooled individually and both are generally active/active. Note that while thanos-query is owned by Observability and runs on titan* hosts, thanos-swift (i.e. the object storage) is owned by Data Persistence and runs on thanos-fe* / thanos-be* hosts.

thanos-query

There are generally no persistent connections (e.g. from Grafana) to the thanos-query endpoint, thus the following is sufficient to pool or depool (use set/pooled=false to depool):

 confctl --object-type discovery select 'dnsdisc=thanos-query,name=SITE' set/pooled=true

thanos.wikimedia.org

The thanos-web.discovery.wmnet record/discovery is used to serve thanos.w.o; this is an SSO-backed service. To pool or depool a site for thanos.wikimedia.org you can use (set/pooled=false to depool):

confctl --object-type discovery select 'dnsdisc=thanos-web,name=SITE' set/pooled=true

Delete blocks / free up space

It is possible to force Thanos to clean up blocks of metrics ahead of their scheduled retention time. Keep in mind that immediate deletion can cause some cached lists of blocks to go stale, though this is normally harmless.

Note: make sure thanos-compact is not running and that puppet is disabled (otherwise puppet will restart thanos-compact)!

Deletion is achieved with the following steps:

# force retention to be applied immediately, marking blocks for deletion
root@titan1001:~$ thanos tools bucket retention --objstore.config-file /etc/thanos-compact/objstore.yaml --retention.resolution-raw RAW-RETENTIONw --retention.resolution-5m FIVE-MINUTES-RETENTION-IN-WEEKSw --retention.resolution-1h ONE-HOUR-RETENTION-IN-WEEKSw
# Force immediate cleanup of all marked blocks
# This can take a while - consider running in screen/tmux
root@titan1001:~$ thanos tools bucket cleanup --objstore.config-file /etc/thanos-compact/objstore.yaml --delete-delay 0h

Then re-enable puppet and/or restart thanos-compact.
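The whole procedure can be sketched as one script. This is a non-authoritative sketch: with DRY_RUN=1 (the default here) each command is only printed, not executed; the retention values are illustrative placeholders to set per incident; and disable-puppet/enable-puppet are assumed to be the usual WMF puppet helper scripts.

```shell
# Sketch only: with DRY_RUN=1 (default) every command is printed, not executed
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

RAW_RETENTION=54w     # illustrative placeholder; set per incident
RES5M_RETENTION=260w  # illustrative placeholder
RES1H_RETENTION=260w  # illustrative placeholder

run disable-puppet "applying thanos retention early"  # puppet would restart thanos-compact
run systemctl stop thanos-compact
# Force retention to be applied immediately, marking blocks for deletion
run thanos tools bucket retention \
    --objstore.config-file /etc/thanos-compact/objstore.yaml \
    --retention.resolution-raw "$RAW_RETENTION" \
    --retention.resolution-5m "$RES5M_RETENTION" \
    --retention.resolution-1h "$RES1H_RETENTION"
# Force immediate cleanup of all marked blocks (can take a while)
run thanos tools bucket cleanup \
    --objstore.config-file /etc/thanos-compact/objstore.yaml --delete-delay 0h
run systemctl start thanos-compact
run enable-puppet "applying thanos retention early"
```

Running with DRY_RUN=0 would execute the commands for real; reviewing the printed plan first is the point of the guard.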

Upgrade to a new version

The guide below assumes you have the new upstream version Debian package built and uploaded internally already. Check out the build instructions in operations/debs/thanos in the debian/README.source file.

cumin 'R:Package ~ "thanos"' 'apt -qy update'
cumin -b1 -s300 'O:titan' 'apt -y install thanos' # staggered upgrade to give thanos-store time to come back up
cumin -b1 -s30  'O:prometheus' 'apt -y install thanos && systemctl restart thanos-sidecar@*'

Add new thanos-fe hosts (Object Storage frontends)

The hosts run only the Swift frontend (proxy) as of Nov 2023 (bug T341488). The procedure to add brand-new hosts is as follows (see also bug T336348):

  • Provision the hosts with thanos::frontend role
  • Add the new hosts to ACLs and such (https://gerrit.wikimedia.org/r/918387)
  • Run puppet on all thanos-fe / thanos-be hosts
  • Make sure the hosts are healthy monitoring-wise
  • At this point the hosts are ready to go and can be effectively pooled on a traffic/swift level
  • Add hosts to swift-memcache and conftool (https://gerrit.wikimedia.org/r/918418)
  • Run puppet on thanos-fe* hosts, and roll-restart swift-proxy on one host at a time
  • Pool the new hosts:
    • confctl select name=FQDN set/weight=100
    • confctl select name=FQDN set/pooled=yes

Add new titan hosts (Thanos components)

The hosts run all Thanos stateless components, whereas object storage runs on thanos-fe/thanos-be hosts. The procedure to add brand new hosts is as follows:

  • Provision the hosts with titan role
  • Add the new hosts to ACLs and such, by grepping puppet to see where titan hosts are used already
  • Run puppet on all existing titan hosts
  • Make sure the hosts are healthy monitoring-wise
  • At this point the hosts are ready to go and can be effectively pooled
  • Add hosts to conftool-data and merge the change
  • Pool the new hosts:
    • confctl select name=FQDN set/weight=100
    • confctl select name=FQDN set/pooled=yes

Alerts

Thanos components (e.g. query, compact, etc.) are hosted on titan* hosts, whereas thanos-fe/thanos-be hosts run only the generic Swift/S3 object storage.

Service thanos-query:443 has failed probes

This is a paging alert and means blackbox probes are failing when run against thanos-query. The most likely cause is a "query of death" scenario leading to an OOM of the thanos-query service/unit. Check the Titan cluster overview dashboard to verify it is indeed a case of OOM, assuming the dashboard can load under these circumstances. These situations self-recover most of the time once systemd restarts the service and probes start succeeding again. If self-recovery is not happening, consider depooling thanos-query from the affected site (confctl commands listed above) to allow recovery. As a last resort, reboot the affected hosts.

On one of the affected titan hosts (see also note above re: titan vs thanos) you can examine the access log to identify potential queries-of-death with the following:

thanos-query-log-explore /var/log/apache2/other_vhosts_access.log

The command prints a list of: timestamp + requested time span + query + reply given to the client.

A query of death can manifest itself, for example, as a very long time span (on the order of multiple weeks or months); problematic queries will typically show up just before Apache starts replying 500 to clients.
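If thanos-query-log-explore is unavailable, a rough manual check of the time span is possible. This sketch assumes query_range requests carry Unix-epoch start/end parameters in the logged URL; the log line below is synthetic, for illustration only:

```shell
# Synthetic access-log line (epoch start/end roughly 116 days apart)
cat > /tmp/sample_access.log <<'EOF'
thanos.wikimedia.org:443 10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] "GET /api/v1/query_range?query=up&start=1690000000&end=1700000000&step=60 HTTP/1.1" 200 123
EOF

# Extract the start/end epochs and print the requested span in days
start=$(sed -n 's/.*[?&]start=\([0-9]*\).*/\1/p' /tmp/sample_access.log)
end=$(sed -n 's/.*[?&]end=\([0-9]*\).*/\1/p' /tmp/sample_access.log)
echo "span_days=$(( (end - start) / 86400 ))"   # span_days=115
```

Pointing the same extraction at /var/log/apache2/other_vhosts_access.log would surface requests with suspiciously long spans.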

More than one Thanos compact running

Thanos compact has not run / is halted

The compactor has run into trouble while compacting (e.g. out of disk space) and has halted compaction. The thanos-compact service logs will have more detailed information about the failure (e.g. journalctl -u thanos-compact | grep caller=compact). See also the thanos-compact dashboard. The compactor runs as a singleton; the active host is defined by profile::thanos::compact_host in Hiera.

Thanos <component> has disappeared from Prometheus discovery

Thanos <component> has high percentage of failures

Thanos <component> has high latency

Thanos sidecar cannot connect to Prometheus

Thanos sidecar is unhealthy

Thanos query has high gRPC client errors

The thanos-query service is experiencing errors; consulting its logs will reveal the detailed cause.

Thanos sidecar is failing to upload blocks

The Thanos sidecar component has failed to upload blocks to object storage. The alert indicates the Prometheus instance, host and site where the error is occurring. The next action is to inspect the Thanos sidecar logs (unit thanos-sidecar@INSTANCE, one per Prometheus instance) for the upload error. The alert will clear once the affected thanos-sidecar unit is restarted.

Thanos sidecar no connection to started Prometheus

This alert fires when the Thanos Sidecar cannot access Prometheus, even though Prometheus appears healthy and has reloaded the Write-Ahead Log (WAL). This issue might occur when Prometheus consumes a large amount of memory replaying the WAL, leading to it being terminated by the out-of-memory (OOM) killer.

Sample: (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.

Steps to Remediate

  1. Stop Prometheus Service: systemctl stop prometheus@k8s Wait for the service to stop completely. This might take 1-2 minutes.
  2. Move the WAL Directory: Rename the wal directory to archive it and allow Prometheus to start with a fresh WAL. Replace the timestamp with the current date and time for record-keeping. mv /srv/prometheus/k8s/metrics/wal /srv/prometheus/k8s/metrics/wal-$(date -Im)
  3. Start Prometheus Service: systemctl start prometheus@k8s Prometheus should now start successfully without the old WAL data.
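The archival rename in step 2 can be exercised safely in a scratch directory first. This sketch only demonstrates the naming convention, with a temporary directory standing in for /srv/prometheus/k8s/metrics:

```shell
# Use a scratch directory in place of the real metrics path
metrics_dir=$(mktemp -d)
mkdir "${metrics_dir}/wal"

# Same timestamped rename as the remediation step, for record-keeping
mv "${metrics_dir}/wal" "${metrics_dir}/wal-$(date -Im)"
ls "${metrics_dir}"   # e.g. wal-2024-01-15T12:00+00:00
```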

Notes

  • This remediation involves the loss of data that was in the WAL. Ensure that this is acceptable before proceeding.
  • Monitor the Prometheus and Thanos Sidecar after performing these steps to ensure they are functioning correctly.
  • Phabricator task T354399 contains information about other times this alert triggered.

Thanos sidecar is dropping large queries

The alert indicates that large queries (i.e. fetching many series / datapoints) have been issued and rejected by the Thanos sidecar (i.e. the component that asks Prometheus for data). The limits are in place to protect Thanos itself from being swamped with large amounts of data and running out of memory. If the alert persists, there are likely dashboards or users of thanos.w.o issuing such queries. See also the section above on inspecting the query log to identify said large queries.

Metrics Retention

At the Foundation, Thanos keeps 54 weeks of raw metric data and 5 years of 5m- and 1h-resolution data under normal circumstances. If there is object storage space pressure, both the raw retention and the 5m retention might be shortened.

Backfilling Metrics

Although the metrics used to create a recording rule may have long historical data available, the new metric created by the rule will only contain history from the rule's deployment onward. In cases like SLOs we prefer to have a reasonable metric history (roughly a quarter) available at the time of onboarding and changes. To accomplish this we can backfill metrics, using the current recording rules to generate old metric data for a given set of recording rules.

1) Generate backfill blocks for the desired recording rules using promtool create-blocks-from rules.

In our example we'll use pyrra recording rules, however this could be any recording rule config. Note: this can take hours to complete; it's recommended to run it inside screen or tmux.

titan1001:~/tmp/backfill/citoid-success-ratio-v1$ time promtool tsdb create-blocks-from rules --start=2025-07-15T00:00:00Z --end=2025-09-02T05:55:00Z --output-dir='output/' --url=http://thanos-query.discovery.wmnet /etc/pyrra/output-rules/wikikube-citoid-success-ratio.yaml

This will generate a set of blocks containing historical data for the provided recording rules. However, promtool creates numerous overlapping blocks, which we must first compact before uploading to Thanos.

2) Compact the backfill blocks generated above locally using an ad-hoc prometheus instance.

As mentioned above, promtool will create overlapping blocks that will break thanos compaction if uploaded directly. To remedy this we'll compact them locally using an ad-hoc prometheus instance.

Additionally, we will append unique external labels to this ad-hoc prometheus instance. We'll use two external labels:

replica: backfill-abcv1  # where abcv1 is a unique identifier for the backfill being run currently
recorder: backfill  # to note that these metrics were recorded by a manual backfill as opposed to e.g. thanos-rule

These external labels are critical as Thanos requires unique external labels for blocks which overlap in time; without them, our backfill blocks will fail to upload. The replica label is used for deduplication such that overlapping metrics prefer "live" data and fall back to backfilled data.

titan1001:~/tmp/backfill/citoid-success-ratio-v1$ cat minimal.yml
global:
  scrape_interval: 1h
  external_labels:
    replica: backfill-abcv1
    recorder: backfill
scrape_configs: []

titan1001:~/tmp/backfill/citoid-success-ratio-v1$ prometheus --config.file=./minimal.yml --storage.tsdb.allow-overlapping-blocks --storage.tsdb.path=./output --storage.tsdb.retention.time=3650d

Prometheus will generate a lot of output as it compacts blocks. When this output slows to an idle, we are ready to upload the compacted blocks to Thanos using a sidecar. Note: the ad-hoc Prometheus instance needs to keep running, as it is a dependency for the sidecar.

3) Upload our compacted backfill blocks to Thanos using a sidecar

titan1001:~/tmp/backfill/citoid-success-ratio-v1$ sudo thanos sidecar --tsdb.path=./output --objstore.config-file=/etc/thanos-store/objstore.yaml --http-address="0.0.0.0:6666" --shipper.ignore-unequal-block-size --grpc-address 0.0.0.0:6699 --shipper.upload-compacted

Be patient as the initial step of fetching all blocks from Thanos will take some time before you start seeing uploads logged. If you encounter overlapping blocks errors, ensure that your external labels (specifically the replica label) are unique.

4) Flush Thanos cache

If backfill data doesn't appear in queries shortly after uploading, Thanos' memcached instances likely need flushing. Perform rolling memcached flushes across the titan cluster:

titan1001:~$ printf "flush_all\r\nquit\r\n" | nc -q1 localhost 11211
OK
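To roll the flush across the whole cluster, a small loop over the titan hosts helps. The host list below is illustrative (not the real inventory), and the sketch only prints each remote command rather than executing it:

```shell
# Illustrative host list; echoes the per-host command instead of running it
for host in titan1001.eqiad.wmnet titan2001.codfw.wmnet; do
  echo "ssh ${host} -- \"printf 'flush_all\\r\\nquit\\r\\n' | nc -q1 localhost 11211\""
done
```

Dropping the echo (and running one host at a time, verifying the OK reply) performs the actual rolling flush.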