The metricsinfra Cloud VPS project is intended to host Prometheus-based monitoring tooling that any Cloud VPS project can use. As of writing (June 2021), there is a single proof-of-concept Prometheus instance that monitors prometheus-node-exporter on certain predefined projects.
Prometheus configuration tooling
Hopefully, Cloud VPS project administrators will in the future be able to self-manage Prometheus targets and alert rules for their own projects. Majavah is writing a Python program (prometheus-configurator) to handle that. As of writing (June 2021), it takes a simple config file and uses it to create and maintain the Prometheus configuration (including custom targets and, in the near future, alerts). Later it could be expanded to load the configuration from a database and expose an API or a user interface for self-service management.
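To illustrate the idea of turning a simple per-project config into Prometheus configuration, here is a minimal sketch. It is not the actual prometheus-configurator code: the input layout (`projects`, `node_targets`, `custom_jobs`), the host names and the function name are all hypothetical. Only the output keys (`scrape_configs`, `job_name`, `static_configs`, `targets`, `labels`) are standard Prometheus configuration fields, and 9100 is the default prometheus-node-exporter port.

```python
import json

# Hypothetical input format; the real prometheus-configurator config may differ.
SIMPLE_CONFIG = {
    "projects": {
        "example-project": {
            # node exporter targets (default port 9100)
            "node_targets": ["instance-1.example-project.example:9100"],
            # custom, project-defined scrape jobs: job name -> targets
            "custom_jobs": {
                "myapp": ["instance-1.example-project.example:8080"],
            },
        },
    },
}


def build_prometheus_config(simple_config):
    """Translate the simple per-project config into a scrape_configs section."""
    scrape_configs = []
    for project, settings in sorted(simple_config["projects"].items()):
        # Every project gets a "node" job, followed by its custom jobs.
        jobs = {"node": settings.get("node_targets", [])}
        jobs.update(settings.get("custom_jobs", {}))
        for job, targets in jobs.items():
            if not targets:
                continue  # skip jobs that have nothing to scrape
            scrape_configs.append({
                "job_name": f"{project}_{job}",
                "static_configs": [{
                    "targets": sorted(targets),
                    # Label every series with the project it belongs to.
                    "labels": {"project": project},
                }],
            })
    return {"scrape_configs": scrape_configs}


if __name__ == "__main__":
    # Printed as JSON for inspection; a real tool would presumably write
    # a YAML file for Prometheus to load.
    print(json.dumps(build_prometheus_config(SIMPLE_CONFIG), indent=2))
```

Keeping the translation in a pure function like this would make it straightforward to swap the config file for a database or API later, since only the code that produces the input dictionary would need to change.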
- TODO: database to persist config in
- TODO: API to manage config
- TODO (long-term): UI to manage config
- TODO (long-term): Allow managing config via puppet manifests on target instances
- TODO (long-term): Make the app automatically open up necessary security group rules
- TODO: split Alertmanager off from the Prometheus nodes onto its own nodes, add HA
- TODO: Allow project members/admins to ack/silence alerts of that project (phab:T285055)
Ideally we would monitor the basic metrics from all VMs.
- TODO: Deploy Prometheus as an active-active pair and use Thanos Querier to aggregate results from both nodes, so that if we have to perform maintenance on one node we still have metrics for that time
- TODO (long-term): Deploy Thanos Store to keep long-term metrics in CloudSwift (when we have that)
- TODO (long-term): calculate how much space monitoring all the VMs would take
- TODO (long-term): figure out whether one Prometheus instance (or replica pair) can handle every VM; if not, we need to modify the config tooling to split things up (probably just split the projects in half; hopefully Thanos will be able to query everything)
- TODO: figure out how to deal with security group rules
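As a rough starting point for the space and splitting questions above, the sketch below applies the sizing rule from the Prometheus storage documentation (needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample) and shows one possible way to assign projects to Prometheus instances using a stable hash. The VM count, series per VM, scrape interval, retention and project names are placeholder assumptions, not measured values, and both function names are hypothetical.

```python
import hashlib


def estimate_disk_bytes(vm_count, series_per_vm, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Estimate Prometheus disk usage with the documented rule of thumb.

    bytes_per_sample defaults to the upper end of the 1-2 bytes range.
    """
    samples_per_second = vm_count * series_per_vm / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample


def assign_shard(project, shard_count=2):
    """Map a project name to a shard index that is stable across runs."""
    digest = hashlib.sha256(project.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    # Placeholder inputs: 1000 VMs, 1000 series each, 60 s scrapes, 30 days.
    estimate = estimate_disk_bytes(1000, 1000, 60, 30)
    print(f"estimated disk usage: {estimate / 1e9:.1f} GB")

    # Hypothetical project names, split across two Prometheus instances.
    for name in ["example-a", "example-b", "example-c"]:
        print(name, "-> shard", assign_shard(name))
```

A hash-based split keeps each project on the same instance without storing any extra state, but it does not balance by project size, so a split weighted by VM count may turn out to be a better fit.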