
Nova Resource:Metricsinfra/Documentation

Revision as of 17:50, 7 July 2021 by imported>Majavah (more work)

The metricsinfra Cloud VPS project is planned to contain Prometheus-based monitoring tooling that can be used for any Cloud VPS project. At the time of writing (July 2021), it consists of a single proof-of-concept Prometheus instance that monitors prometheus-node-exporter and other statically defined targets in certain predefined projects.

Components

Prometheus

Alertmanager

prometheus-configurator

Thanos/Cortex/other tool for scaling up

Work to do

Goals

  • Taavi's long-term end goals:
    • scrape and store basic metrics from all Cloud VPS instances in all projects and have sensible default alert rules
    • allow any Cloud VPS project administrator to set arbitrary Prometheus scrape targets and alerting rules for their project in a self-service fashion
  • Metrics- and monitoring-related work that we likely want to pursue in the future, but that is out of metricsinfra scope for now:
    • Monitoring and alerting for individual Toolforge tools
    • Log aggregation and search for anyone

Prometheus configuration tooling

The aim is for Cloud VPS project administrators to manage Prometheus targets and alert rules for their own project themselves. Majavah is writing a Python program (prometheus-configurator) to handle that. At the time of writing (June 2021), it takes a simple config file and uses it to create and maintain the Prometheus configuration, including custom targets and alerts. In the future it can be expanded to load the configuration from a database and to expose an API or a user interface for self-service management.

  • TODO: database to persist config in
  • TODO: API to manage config
  • TODO (long-term): UI to manage config
  • TODO (long-term): Allow managing config via Puppet manifests on target instances
  • TODO (long-term): Make the app automatically open up necessary security group rules
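The config generation described above could look roughly like the following minimal sketch. It is not the actual prometheus-configurator code: the input structure (`SIMPLE_CONFIG`), the project and host names, and the `project` label are all assumptions made up for illustration. The output uses the JSON target-group format that Prometheus file-based service discovery (`file_sd_configs`) reads.

```python
import json

# Hypothetical input. The real prometheus-configurator config format is not
# described on this page, so this structure and these names are assumptions.
SIMPLE_CONFIG = {
    "projects": {
        "exampleproject": {
            "targets": [
                "instance-2.exampleproject.example:9100",
                "instance-1.exampleproject.example:9100",
            ],
        },
        "otherproject": {
            "targets": ["web-1.otherproject.example:9100"],
        },
    }
}


def build_file_sd_groups(config):
    """Turn the simple config into a list of file_sd target groups.

    Each group has the shape Prometheus expects from file-based service
    discovery: {"targets": [...], "labels": {...}}.
    """
    groups = []
    for project, settings in sorted(config["projects"].items()):
        targets = sorted(settings.get("targets", []))
        if not targets:
            # Skip projects without targets instead of writing empty groups.
            continue
        groups.append({
            "targets": targets,
            # Assumed label, so that alerts can later be tied to a project.
            "labels": {"project": project},
        })
    return groups


def render_file_sd(config):
    """Serialize the target groups deterministically, so that unchanged
    input produces byte-identical output and no needless reloads."""
    return json.dumps(build_file_sd_groups(config), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_file_sd(SIMPLE_CONFIG))
```

Sorting projects and targets keeps the generated file stable between runs, which makes it cheap to detect whether anything actually changed before rewriting the file.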

Data gathering

Alerting

  • TODO: move Alertmanager off the Prometheus nodes onto its own instances and add high availability (HA)
  • TODO: Allow project members/admins to acknowledge or silence alerts for their own project (phab:T285055)
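As a rough illustration of per-project silencing, the sketch below builds the JSON body that the Alertmanager v2 API accepts when creating a silence (`POST /api/v2/silences`). It assumes that every alert carries a `project` label, which this page does not confirm, and the function name and example values are hypothetical. Checking that the caller really is a member of the project would have to happen in a separate layer in front of Alertmanager and is not shown.

```python
import json
from datetime import datetime, timedelta, timezone


def build_project_silence(project, created_by, comment, hours, now=None):
    """Build a silence body that matches every alert of one project.

    Assumes alerts are labelled with project=<name>; that label is an
    assumption of this sketch, not something the page confirms.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Exact (non-regex) match on the assumed project label.
            {"name": "project", "value": project, "isRegex": False},
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    body = build_project_silence(
        "exampleproject", "example-user", "planned maintenance", hours=2
    )
    print(json.dumps(body, indent=2))
```

Using an exact matcher rather than a regular expression keeps a silence from accidentally covering another project whose name happens to share a prefix.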

Scaling up

Ideally we would monitor the basic metrics from all VMs.

  • TODO: Deploy Prometheus as an active-active pair and use Thanos Querier to aggregate results from both nodes, so that if we have to perform maintenance on one node we still have metrics for that time
  • TODO (long-term): Deploy Thanos Store to keep long-term metrics in CloudSwift (when we have that)
  • TODO (long-term): calculate how much space monitoring all the VMs would take
  • TODO (long-term): figure out whether one Prometheus instance (or replica pair) can handle every VM. If not, we need to modify the config tooling to split things up (probably just split the projects in half; hopefully Thanos will be able to query everything)
  • TODO: figure out how to deal with security group rules
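For the space calculation, a back-of-the-envelope sketch can use the rule of thumb from the Prometheus storage documentation: needed disk space is roughly retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample on average. The VM count, series per VM, scrape interval and retention used below are placeholder numbers, not measurements from Cloud VPS.

```python
def estimate_disk_bytes(vm_count, series_per_vm, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Estimate Prometheus disk usage in bytes.

    Follows the documented rule of thumb:
        retention_seconds * samples_per_second * bytes_per_sample
    The default of 2 bytes per sample is the upper end of the documented
    1-2 byte average, so the result leans towards overestimating.
    """
    # Every series yields one sample per scrape interval.
    samples_per_second = vm_count * series_per_vm / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Placeholder inputs, chosen only to show the arithmetic.
    total = estimate_disk_bytes(
        vm_count=1000, series_per_vm=500,
        scrape_interval_s=60, retention_days=30,
    )
    print(f"{total / 1e9:.1f} GB")
```

With these placeholder inputs the estimate comes to about 43 GB per Prometheus node. An active-active pair stores the data twice, so the total would double.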