The metricsinfra Cloud VPS project is intended to host Prometheus-based monitoring tooling that any Cloud VPS project can use. As of writing (July 2021), there is a single proof-of-concept Prometheus instance that monitors prometheus-node-exporter and other statically defined targets on certain pre-defined projects.
Prometheus scrapes metrics and stores them on local disk. As of August 2021, storing data for 335 nodes with a retention period of 720 hours (30 days) consumes about 40G of disk space. Each scrape target (instance, service, etc.) will likely be scraped by two Prometheus nodes to keep short-term on-disk data redundant. Long-term metrics would ideally be stored in Swift or other object storage and retrieved using a tool such as Thanos or Cortex.
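The figures above allow a rough back-of-the-envelope estimate: 40G across 335 nodes is about 0.12G per node for 30 days of retention. The sketch below extrapolates from that, assuming disk usage scales linearly with node count, retention and replica count, and that every node exposes a similar number of series. Neither assumption has been verified here, and the function name and the 1000-node example are purely illustrative.

```python
# Rough linear extrapolation from the figures quoted in the text.
# Assumption: disk usage grows linearly with nodes, retention and replicas.
OBSERVED_NODES = 335
OBSERVED_DISK_G = 40.0       # "about 40G" as of August 2021
OBSERVED_RETENTION_H = 720   # 30 days


def estimate_disk_g(nodes, retention_hours=OBSERVED_RETENTION_H, replicas=1):
    """Estimate total disk usage for a given number of scraped nodes."""
    per_node_hour = OBSERVED_DISK_G / (OBSERVED_NODES * OBSERVED_RETENTION_H)
    return per_node_hour * nodes * retention_hours * replicas


if __name__ == "__main__":
    # Hypothetical fleet size, scraped by two Prometheus nodes for redundancy.
    total = estimate_disk_g(1000, replicas=2)
    print(f"estimated disk usage across both replicas: {total:.0f}G")
```

Real usage depends on the number of series per target and on how well the data compresses, so this is only useful as an order-of-magnitude check.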
Alerts are sent out with Prometheus Alertmanager, which as of writing supports email and IRC alerts. There is currently only one Alertmanager instance running (metricsinfra-alertmanager-1), but the puppetization makes adding a second, redundant one fairly easy. Some components (notably the IRC relay) will need manual failover, but the dashboard and most alert sending should fail over automatically if one node fails. There is a Karma dashboard on prometheus-alerts.wmcloud.org, which generally lets project members silence alerts for their own project.
prometheus-configurator (client) and prometheus-manager (backend) dynamically generate configuration files for the software in the metricsinfra stack. They are backed by a Trove database. A user interface is planned but has not been created yet.
Thanos/Cortex/other tool for scaling up
Storing Prometheus data on just one node isn't ideal: it lacks redundancy and makes upgrading difficult. Ideally this gap is filled by Thanos, Cortex, or another similar tool that lets us aggregate data from multiple Prometheus instances, and ideally from long-term object storage too.
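To illustrate why aggregating over two instances helps, here is a deliberately simplified sketch of merging one series from two replicas so that a gap on one side is filled from the other. It is not how Thanos or Cortex actually deduplicate (real replicas do not scrape at identical timestamps, and those tools work with replica labels); the function name and the sample data are hypothetical.

```python
def merge_replicas(primary, secondary):
    """Merge two {timestamp: value} series, preferring the primary replica.

    Samples missing from the primary (for example while it was down for
    maintenance) are filled in from the secondary.
    """
    merged = dict(secondary)
    merged.update(primary)  # primary wins where both replicas have a sample
    return sorted(merged.items())


if __name__ == "__main__":
    # Hypothetical samples: replica A was restarted and missed t=30 and t=45.
    replica_a = {0: 1.0, 15: 1.1, 60: 1.4}
    replica_b = {0: 1.0, 15: 1.1, 30: 1.2, 45: 1.3, 60: 1.4}
    for timestamp, value in merge_replicas(replica_a, replica_b):
        print(timestamp, value)
```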
Work to do
- Taavi's long term end goals:
- scrape and store basic metrics from all Cloud VPS instances in all projects and have sensible default alert rules
- allow any Cloud VPS project administrator to set arbitrary prometheus scrape targets and alerting rules for their project in a self-service fashion
- Metrics and monitoring related work that we likely want to pursue in the future, but that is out of metricsinfra scope for now:
- Monitoring and alerting for individual Toolforge tools
- Log aggregation and search for anyone
Prometheus configuration tooling
Hopefully in the future Cloud VPS project administrators can self-manage Prometheus targets and alert rules for their project. As of August 2021 the configuration is created using two Python apps, prometheus-configurator (client) and prometheus-manager (backend), that use data stored in a Trove database to generate configuration files for Prometheus and Alertmanager.
- TODO: API and UI to manage config
- TODO: deal with security groups (task T288108)
- TODO (long-term): Allow managing config via puppet manifests on target instances
- TODO (long-term): set up Prometheus push gateway for individual Cloud VPS tenants to use
- TODO (long-term): set up Prometheus blackbox exporter for individual Cloud VPS tenants to use
- and possibly monitor all web proxies by default?
- TODO: custom webhooks
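As a rough illustration of the kind of work the configuration tooling does, the sketch below turns database-style rows into the JSON structure that Prometheus file-based service discovery (`file_sd_configs`) reads. This is not the actual prometheus-configurator or prometheus-manager code: the row layout, the `build_file_sd` function, the host names and the output file name are all hypothetical, and the real tooling may generate configuration differently.

```python
import json
from collections import defaultdict


def build_file_sd(rows):
    """Group scrape targets by project into file_sd target groups.

    Each row is assumed to look like
    {"project": str, "host": str, "port": int}.
    """
    by_project = defaultdict(list)
    for row in rows:
        by_project[row["project"]].append(f'{row["host"]}:{row["port"]}')
    return [
        {"targets": sorted(targets), "labels": {"project": project}}
        for project, targets in sorted(by_project.items())
    ]


if __name__ == "__main__":
    # Hypothetical rows standing in for the contents of the Trove database.
    example_rows = [
        {"project": "alpha", "host": "vm-2.alpha.example", "port": 9100},
        {"project": "beta", "host": "vm-1.beta.example", "port": 9100},
        {"project": "alpha", "host": "vm-1.alpha.example", "port": 9100},
    ]
    # Hypothetical output path; Prometheus re-reads file_sd files on change.
    with open("targets.json", "w") as handle:
        json.dump(build_file_sd(example_rows), handle, indent=2)
```

Attaching a `project` label to every target group is one way alerts could later be routed to, or silenced by, the members of the matching project.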
Ideally we would monitor the basic metrics from all VMs.
- TODO: Deploy Prometheus as an active-active pair and use Thanos Querier to aggregate results from it, so that if we have to perform maintenance on one node we still have metrics for that time
- TODO (long-term): Deploy Thanos Store to keep long-term metrics in CloudSwift (when we have that)
- TODO (long-term): calculate how much space monitoring all the VMs would take
- TODO (long-term): figure out if one Prometheus instance (or replica pair) can handle every VM. If not, we need to modify the config tooling to split things up (probably just split the projects in half; hopefully Thanos will be able to query everything)
- TODO: figure out how to deal with security group rules
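If the projects do need to be split between Prometheus instances, one possible approach is to assign each project to a shard by hashing its name, so the assignment stays stable between configuration runs. This is only a sketch of one option, not something the text above commits to (it suggests simply splitting the projects in half); `shard_for`, `assign_shards` and the project names are hypothetical.

```python
import hashlib


def shard_for(project, shard_count):
    """Map a project name to a shard index in a stable, repeatable way."""
    digest = hashlib.sha256(project.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def assign_shards(projects, shard_count):
    """Return {shard index: [project, ...]} covering every project once."""
    shards = {index: [] for index in range(shard_count)}
    for project in sorted(projects):
        shards[shard_for(project, shard_count)].append(project)
    return shards


if __name__ == "__main__":
    # Hypothetical project names, split across two Prometheus replica pairs.
    example_projects = ["project-a", "project-b", "project-c", "project-d"]
    for index, members in assign_shards(example_projects, 2).items():
        print(index, members)
```

A hash-based split does not balance by VM count, so a few very large projects could still overload one shard; weighting by the number of instances per project would be a natural refinement.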