The '''metricsinfra''' [[Help:Cloud VPS|Cloud VPS]] project is planned to contain [[Prometheus]]-based monitoring tooling that can be used by any VPS project. As of writing (July 2021), there is a single proof-of-concept Prometheus instance that monitors prometheus-node-exporter and other statically defined targets on a set of pre-defined projects.
== User guide ==
See [[Portal:Cloud VPS/Admin/Monitoring#Monitoring for Cloud VPS]]
== Components ==
* [[toolforge:openstack-browser/project/metricsinfra|List of all instances and other project details]]
=== Prometheus ===
Prometheus scrapes metrics and stores them on local disk. As of August 2021, storing data for 335 nodes with a retention period of 720 hours (30 days) consumes about 40 GB of disk space. Each scrape target (instance, service, etc.) will likely be scraped by two Prometheus nodes to keep short-term on-disk data redundant. Long-term metrics would ideally be stored in Swift or other object storage and retrieved using Thanos, Cortex, or a similar tool.
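For rough capacity planning, those figures work out to a bit over 100 MB of on-disk data per node per 30 days. Below is a minimal sketch of that arithmetic; the fleet-wide VM count used for extrapolation is a made-up placeholder, not an actual number.

<syntaxhighlight lang="python">
# Back-of-the-envelope estimate based on the August 2021 numbers above:
# ~40 GB of local TSDB data for 335 nodes at 720 h (30 days) retention.
RETENTION_HOURS = 720
OBSERVED_NODES = 335
OBSERVED_DISK_GB = 40

per_node_gb = OBSERVED_DISK_GB / OBSERVED_NODES  # roughly 0.12 GB per node per 30 days

# Hypothetical: what if we scraped every Cloud VPS instance?
ASSUMED_TOTAL_VMS = 1000  # placeholder only; see the "Scaling up" TODOs

estimated_total_gb = per_node_gb * ASSUMED_TOTAL_VMS

print(f"~{per_node_gb * 1024:.0f} MB per node per {RETENTION_HOURS} hours")
print(f"~{estimated_total_gb:.0f} GB for {ASSUMED_TOTAL_VMS} VMs (single copy, before replication)")
</syntaxhighlight>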
=== Alert manager ===
Alerts are sent out with Prometheus Alertmanager, which currently supports email and IRC notifications. As of writing, there is only one Alertmanager instance running (metricsinfra-alertmanager-1), but the puppetization makes adding a second, redundant one fairly easy. Some components (notably the IRC relay) will need manual failover, but the dashboard and most alert sending should fail over automatically should one node fail for whatever reason. There is a Karma dashboard at [https://prometheus-alerts.wmcloud.org prometheus-alerts.wmcloud.org], but it only lets a hardcoded list of users perform actions. This is an [https://github.com/prymitive/karma/issues/3361 upstream limitation] and will hopefully be fixed at some point.
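As an illustration of what such a setup looks like, the sketch below prints the general shape of an Alertmanager configuration with an email receiver and a webhook receiver (webhooks are how an IRC relay is normally wired in). All names, addresses, and URLs are placeholders; the actual metricsinfra configuration is generated by the tooling described below and may differ.

<syntaxhighlight lang="python">
# Illustrative only: not the real metricsinfra Alertmanager configuration.
import yaml

alertmanager_config = {
    "route": {
        "receiver": "default",
        # Group alerts so one project's alert storm becomes a single notification.
        "group_by": ["alertname", "project"],
    },
    "receivers": [
        {
            "name": "default",
            "email_configs": [{"to": "cloud-admin-feed@example.org"}],      # placeholder address
            "webhook_configs": [{"url": "http://localhost:9119/alerts"}],   # placeholder IRC relay endpoint
        }
    ],
}

print(yaml.safe_dump(alertmanager_config, sort_keys=False))
</syntaxhighlight>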
=== prometheus-configurator ===
[https://gerrit.wikimedia.org/r/admin/repos/cloud/metricsinfra/prometheus-configurator prometheus-configurator] (client) and [https://gerrit.wikimedia.org/r/admin/repos/cloud/metricsinfra/prometheus-manager prometheus-manager] (backend) dynamically generate configuration files for the software in the metricsinfra stack. The configuration data is stored in a Trove database. A user interface is planned, but it has not been created yet.
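The sketch below shows the kind of transformation involved: a per-project list of scrape targets (hard-coded here; in metricsinfra it lives in the database behind prometheus-manager) rendered into a Prometheus <code>scrape_configs</code> section. This is an illustrative sketch, not the actual prometheus-configurator code, and the host names are placeholders.

<syntaxhighlight lang="python">
# Illustrative sketch: render per-project targets into Prometheus scrape configs.
import yaml

# Placeholder data; metricsinfra keeps this in a Trove database.
projects = {
    "tools": ["tools-example-1.tools.eqiad1.wikimedia.cloud:9100"],
    "metricsinfra": ["metricsinfra-prometheus-1.metricsinfra.eqiad1.wikimedia.cloud:9100"],
}

scrape_configs = [
    {
        "job_name": f"{project}_node",
        "static_configs": [
            # Attach a project label so alerts can be grouped and routed per project.
            {"targets": targets, "labels": {"project": project}},
        ],
    }
    for project, targets in projects.items()
]

print(yaml.safe_dump({"scrape_configs": scrape_configs}, sort_keys=False))
</syntaxhighlight>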
=== Thanos/Cortex/other tool for scaling up ===
Storing Prometheus data on just one node isn't ideal: it lacks redundancy and makes upgrades difficult. This gap would ideally be filled by [https://thanos.io/ Thanos], [https://cortexmetrics.io/ Cortex], or another similar tool that lets us aggregate data from multiple Prometheus instances and, eventually, from long-term object storage as well.
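For reference, the usual Thanos pattern for an HA pair is to give each Prometheus replica a distinct external label and have Thanos Querier deduplicate on that label. The sketch below shows that convention only; the label names and instance names are assumptions, not the current metricsinfra setup.

<syntaxhighlight lang="python">
# Sketch of the external-label convention Thanos deduplication relies on.
import yaml

def prometheus_global_config(replica_name: str) -> dict:
    """Global section for one Prometheus replica in an active-active pair."""
    return {
        "global": {
            "external_labels": {
                "site": "metricsinfra",    # placeholder identifying label
                "replica": replica_name,   # differs per replica; used for deduplication
            }
        }
    }

for replica in ("prometheus-1", "prometheus-2"):
    print(f"# {replica}")
    print(yaml.safe_dump(prometheus_global_config(replica), sort_keys=False))

# Thanos Querier would then be started with something like
# --query.replica-label=replica so queries see one deduplicated copy.
</syntaxhighlight>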
== Work to do ==
=== Goals ===
{{tracked|T266050}}
* Taavi's long-term end goals:
** scrape and store basic metrics from all Cloud VPS instances in all projects, with sensible default alert rules
** allow any Cloud VPS project administrator to set arbitrary Prometheus scrape targets and alerting rules for their project in a self-service fashion
* Metrics and monitoring related work that we likely want to pursue in the future, but which is out of scope for metricsinfra for now:
** Monitoring and alerting for individual Toolforge tools
** Log aggregation and search for anyone


=== Prometheus configuration tooling ===
{{tracked|T284993}}
Hopefully, Cloud VPS project administrators will eventually be able to self-manage Prometheus targets and alert rules for their projects. As of August 2021, the configuration is created using two Python apps, [https://gerrit.wikimedia.org/r/admin/repos/cloud/metricsinfra/prometheus-configurator prometheus-configurator] (client) and [https://gerrit.wikimedia.org/r/admin/repos/cloud/metricsinfra/prometheus-manager prometheus-manager] (backend), which use data stored in a Trove database to generate configuration files for Prometheus and Alertmanager.
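The tooling is also meant to cover alert rules. The sketch below shows what a generated per-project Prometheus rules file could look like, to illustrate the file format only; the <code>InstanceDown</code> rule and its thresholds are generic examples, not necessarily the defaults metricsinfra ships.

<syntaxhighlight lang="python">
# Illustrative sketch of a generated per-project alerting rules file.
import yaml

def rules_file_for_project(project: str) -> dict:
    return {
        "groups": [
            {
                "name": f"{project}_defaults",
                "rules": [
                    {
                        "alert": "InstanceDown",  # example rule, not necessarily a shipped default
                        "expr": f'up{{project="{project}"}} == 0',
                        "for": "5m",
                        "labels": {"severity": "warn", "project": project},
                        "annotations": {
                            "summary": "{{ $labels.instance }} is not responding to scrapes",
                        },
                    }
                ],
            }
        ]
    }

print(yaml.safe_dump(rules_file_for_project("tools"), sort_keys=False))
</syntaxhighlight>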


* TODO: API and UI to manage config
* TODO: deal with security groups ({{phab|T288108}})
* TODO (long-term): Allow managing config via puppet manifests on target instances


=== Data gathering ===
* TODO (long-term): set up [https://prometheus.io/docs/practices/pushing/ Prometheus push gateway] for individual Cloud VPS tenants to use
* TODO (long-term): set up [https://github.com/prometheus/blackbox_exporter Prometheus blackbox exporter] for individual Cloud VPS tenants to use
** and possibly monitor all web proxies by default?
 
=== Alerting ===
* TODO: Allow project members/admins to ack/silence alerts of that project ([[phab:T285055]])
* TODO: custom webhooks


=== Scaling up ===
Ideally we would monitor the basic metrics from all VMs.
* TODO: Deploy Prometheus as an active-active pair and use Thanos Querier to aggregate results from it; that way, if we have to perform maintenance on one node, we still have metrics for that time
* TODO (long-term): Deploy Thanos Store to keep long-term metrics in CloudSwift (when we have that)
* TODO (long-term): calculate how much space monitoring all the VMs would take
* TODO (long-term): figure out if one Prometheus instance (or replica pair) can handle every VM; if not, we need to modify the config tooling to split things up (probably just split the projects in half, hopefully Thanos will be able to query everything)
* TODO: figure out how to deal with security group rules