You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "Nova Resource:Metricsinfra/Documentation"

imported>Majavah
(few notes about metricsinfra current state and plans)
 
imported>Majavah
(more work)
Line 1: Line 1:
The '''metricsinfra''' [[Help:Cloud VPS|Cloud VPS]] project is planned to contain [[Prometheus]]-based monitoring tooling that can be used on any VPS project. As of writing (July 2021), there is a proof-of-concept level singular Prometheus instance that monitors prometheus-node-exporter and other statically defined targets on certain pre-defined projects.
== Components ==
* [[toolforge:openstack-browser/project/metricsinfra|List of all instances and other project details]]
=== Prometheus ===
=== Alert manager ===
=== prometheus-configurator ===
=== Thanos/Cortex/other tool for scaling up ===
== Work to do ==
=== Goals ===
{{tracked|T266050}}
The '''metricsinfra''' [[Help:Cloud VPS|Cloud VPS]] project is planned to contain [[Prometheus]]-based monitoring tooling that can be used on any VPS project. As of writing (June 2021), there is a proof-of-concept level singular Prometheus instance that monitors prometheus-node-exporter on certain pre-defined projects.
* Taavi's long term end goals:
** scrape and store basic metrics from all Cloud VPS instances in all projects and have sensible default alert rules
** allow any Cloud VPS project administrator to set arbitrary prometheus scrape targets and alerting rules for their project in a self-service fashion
* Metrics and monitoring related work that we likely want to pursue in the future, but are out of metricsinfra scope for now:
** Monitoring and alerting for individual Toolforge tools
** Log aggregation and search for anyone


== Prometheus configuration tooling ==
=== Prometheus configuration tooling ===
{{tracked|T284993}}
Hopefully in the future Cloud VPS project administrators can self-manage Prometheus targets and alert rules for their project. [[User:Majavah|Majavah]] is writing a Python program ([https://gerrit.wikimedia.org/r/admin/repos/cloud/metricsinfra/prometheus-configurator prometheus-configurator]) to handle that: as of writing (June 2021) it takes a simple config file and based on that creates and maintains Prometheus configuration (including custom targets and in the near future alerts). In the future it can be expanded to load the configuration from a database and expose an API or a user interface that allows for self-service management.
Hopefully in the future Cloud VPS project administrators can self-manage Prometheus targets and alert rules for their project. [[User:Majavah|Majavah]] is writing a Python program ([https://gerrit.wikimedia.org/r/admin/repos/cloud/metricsinfra/prometheus-configurator prometheus-configurator]) to handle that: as of writing (June 2021) it takes a simple config file and based on that creates and maintains Prometheus configuration (including custom targets and alerts). In the future it can be expanded to load the configuration from a database and expose an API or a user interface that allows for self-service management.


* TODO: database to persist config in
Line 12: Line 32:
* TODO (long-term): Make the app automatically open up necessary security group rules


== Alerting ==
=== Data gathering ===
* TODO (long-term): set up [https://prometheus.io/docs/practices/pushing/ Prometheus push gateway] for individual Cloud VPS tenants to use
* TODO (long-term): set up [https://github.com/prometheus/blackbox_exporter Prometheus blackbox exporter] for individual Cloud VPS tenants to use
** and possibly monitor all web proxies by default?
 
=== Alerting ===
* TODO: split alertmanager from prometheus nodes to their own, add HA
* TODO: Allow project members/admins to ack/silence alerts of that project ([[phab:T285055]])


== Scaling up ==
=== Scaling up ===
Ideally we would monitor the basic metrics from all VMs.



Revision as of 17:50, 7 July 2021

The metricsinfra Cloud VPS project is planned to contain Prometheus-based monitoring tooling that can be used on any VPS project. As of writing (July 2021), there is a single proof-of-concept Prometheus instance that monitors prometheus-node-exporter and other statically defined targets in certain predefined projects.

Components

Prometheus

Alert manager

prometheus-configurator

Thanos/Cortex/other tool for scaling up

Work to do

Goals

  • Taavi's long-term end goals:
    • scrape and store basic metrics from all Cloud VPS instances in all projects and have sensible default alert rules
    • allow any Cloud VPS project administrator to set arbitrary Prometheus scrape targets and alerting rules for their project in a self-service fashion
  • Metrics- and monitoring-related work that we likely want to pursue in the future but that is out of metricsinfra scope for now:
    • Monitoring and alerting for individual Toolforge tools
    • Log aggregation and search for anyone

Prometheus configuration tooling

The hope is that Cloud VPS project administrators will eventually be able to self-manage Prometheus targets and alert rules for their project. Majavah is writing a Python program (prometheus-configurator) to handle that: as of writing (June 2021), it takes a simple config file and uses it to create and maintain the Prometheus configuration (including custom targets and alerts). In the future, it can be expanded to load the configuration from a database and to expose an API or a user interface that allows for self-service management.

  • TODO: database to persist config in
  • TODO: API to manage config
  • TODO (long-term): UI to manage config
  • TODO (long-term): Allow managing config via puppet manifests on target instances
  • TODO (long-term): Make the app automatically open up necessary security group rules
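
As a rough illustration of the "simple config in, Prometheus configuration out" idea described above, the following sketch turns a per-project mapping into target groups in the JSON format that Prometheus file-based service discovery reads. It is not the actual prometheus-configurator code: the input layout, the function name `build_file_sd`, and the `project` and `service` label names are all assumptions made for this example.

```python
from __future__ import annotations

import json
from typing import Any


def build_file_sd(projects: dict[str, list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Convert a hypothetical per-project config into file_sd target groups.

    `projects` maps a project name to a list of entries shaped like
    {"name": "node", "targets": ["host:9100", ...]} (an assumed layout).
    """
    groups = []
    # Sort everything so the generated file is stable between runs.
    for project, entries in sorted(projects.items()):
        for entry in entries:
            groups.append(
                {
                    "targets": sorted(entry["targets"]),
                    # Label names are assumptions, not the real tool's schema.
                    "labels": {"project": project, "service": entry["name"]},
                }
            )
    return groups


if __name__ == "__main__":
    example = {
        "tools": [{"name": "node", "targets": ["tools-a.example:9100"]}],
    }
    # The output could be written to a file referenced from file_sd_configs.
    print(json.dumps(build_file_sd(example), indent=2))
```

Generating service discovery files rather than rewriting the main configuration is only one possible design; it has the advantage that Prometheus picks up changes to such files without a reload.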

Data gathering

Alerting

  • TODO: split alertmanager from prometheus nodes to their own, add HA
  • TODO: Allow project members/admins to ack/silence alerts of that project (phab:T285055)
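
One way the per-project silencing could work is for a small service to build the silence on the user's behalf and always pin a matcher on the project, so a member cannot silence alerts belonging to another project. The sketch below only builds the payload and sends nothing. The field names follow the silence object of the Alertmanager v2 API as best understood here and should be checked against the deployed version; the `project` label on alerts and the function name `build_silence` are assumptions.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def build_silence(
    project: str,
    alertname: str,
    hours: float,
    created_by: str,
    comment: str,
    now: Optional[datetime] = None,
) -> dict[str, Any]:
    """Build a silence payload scoped to a single project (hypothetical helper)."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Assumes every alert carries a `project` label; this matcher is
            # set by the service, never taken from user input.
            {"name": "project", "value": project, "isRegex": False},
            {"name": "alertname", "value": alertname, "isRegex": False},
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }
```

A caller would first verify that the user is a member or admin of `project`, then submit the payload to Alertmanager.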

Scaling up

Ideally we would monitor the basic metrics from all VMs.

  • TODO: Deploy Prometheus as an active-active pair and use Thanos Querier to aggregate results from both nodes. That way, if we have to perform maintenance on one node, we still have metrics for that time
  • TODO (long-term): Deploy Thanos Store to keep long-term metrics in CloudSwift (when we have that)
  • TODO (long-term): calculate how much space monitoring all the VMs would take
  • TODO (long-term): figure out whether one Prometheus instance (or replica pair) can handle every VM. If not, we need to modify the config tooling to split things up (probably just split the projects in half; hopefully Thanos will be able to query everything)
  • TODO: figure out how to deal with security group rules
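
For the space calculation and the possible split of projects mentioned above, a back-of-the-envelope sketch might look like the following. The disk estimate uses the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). All input numbers are placeholders rather than measurements, and the hash-based assignment is just one hypothetical way to split projects between two instances.

```python
import hashlib


def estimate_disk_bytes(
    instances: int,
    series_per_instance: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,
) -> float:
    """Estimate TSDB disk usage from the documented rule of thumb."""
    samples_per_second = instances * series_per_instance / scrape_interval_s
    retention_seconds = retention_days * 86400
    return samples_per_second * retention_seconds * bytes_per_sample


def shard_for_project(project: str, shards: int = 2) -> int:
    """Assign a project to a shard deterministically (hypothetical scheme)."""
    # hashlib is used instead of hash() so the result is stable across runs.
    digest = hashlib.sha256(project.encode("utf-8")).hexdigest()
    return int(digest, 16) % shards


if __name__ == "__main__":
    # Placeholder inputs: 1000 VMs, 1000 series each, 60 s scrape interval.
    size = estimate_disk_bytes(1000, 1000, 60, retention_days=15)
    print(f"approx. {size / 1e9:.1f} GB for 15 days")
    print("tools ->", shard_for_project("tools"))
```

Real numbers would need the actual VM count and the series count per node exporter, which the existing proof-of-concept instance could report.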