Portal:Cloud VPS/Admin/Monitoring: Difference between revisions
imported>Jhedden |
Revision as of 16:55, 5 May 2020
This page is currently a draft. More information and discussion about changes to this draft can be found on the talk page.
This page describes how monitoring works as deployed and managed by the WMCS team, for both Cloud VPS and Toolforge.
Deployment
There are 2 physical servers:
- cloudmetrics1002.eqiad.wmnet --- currently master
- cloudmetrics1001.eqiad.wmnet --- currently cold standby
Both servers have the puppet role role::wmcs::monitoring (modules/role/manifests/wmcs/monitoring.pp) applied, which prepares them to collect metrics using a software stack composed of carbon, graphite and friends.
Ideally, both servers would collect and serve metrics at the same time as a cluster, but right now only the master is actually in service. The cold standby fetches metrics from the master using rsync (/srv/carbon/whisper/), so in case of a failover the service could be rebuilt without much loss of metrics.
These bits are located at modules/profile/manifests/wmcs/monitoring.pp.
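As a rough illustration of the standby sync described above, the following Python sketch builds an rsync command line that pulls the whisper directory from the master. The helper name and the `-a` flag are assumptions made for illustration; the real invocation is defined in the puppet profile and may differ.

```python
import shlex

MASTER = "cloudmetrics1002.eqiad.wmnet"  # current master, per the server list above
WHISPER_DIR = "/srv/carbon/whisper/"     # directory the cold standby copies


def build_sync_command(master=MASTER, path=WHISPER_DIR):
    """Return an rsync argv that copies `path` from `master` to the same local path.

    Hypothetical helper: the -a (archive) flag is an assumption,
    not taken from the puppet manifest.
    """
    return ["rsync", "-a", f"{master}:{path}", path]


if __name__ == "__main__":
    # Only print the command, so the sketch is safe to run anywhere.
    print(shlex.join(build_sync_command()))
```

The command is printed rather than executed, so it can be reviewed before being run on the standby.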
Grafana-labs Graphite-labs
The DNS records grafana-labs.discovery.wmnet and graphite-labs.discovery.wmnet define the active web server servicing requests. These records are managed in the DNS git repo at /dns/browse/master/templates/wmnet and configured on trafficserver in hieradata/common/profile/trafficserver/backend.yaml.
Metrics Retention
Our metrics retention policy is 90 days. There are two cronjobs for the _graphite user that are running on labmon1001 for this task:
archive-deleted-instances
: Moves data from deleted instances to /srv/carbon/whisper/archived_metrics
delete-old-instance-archives
: Deletes archived data that is older than 90 days
This prevents the /srv partition from becoming full.
The archive-instances script logs operations to /var/log/graphite/instance-archiver.log
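The 90-day rule can be sketched in Python as follows. This is not the actual cron script: it is a minimal illustration that assumes age is judged by file modification time, and it only lists candidates instead of deleting anything. The function name `find_expired` is hypothetical.

```python
import os
import time

RETENTION_DAYS = 90  # retention policy stated above


def find_expired(root, now=None, retention_days=RETENTION_DAYS):
    """Return the files under `root` whose mtime is older than the retention window."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds in a day
    expired = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: modification time decides whether a file is "old".
            if os.path.getmtime(path) < cutoff:
                expired.append(path)
    return sorted(expired)
```

Calling `find_expired("/srv/carbon/whisper/archived_metrics")` would return the archived files that fall outside the retention window, which can then be reviewed before any cleanup.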
Monitoring for Cloud VPS
The Cloud VPS project "metricsinfra" provides the base infrastructure and services for multi-tenant instance monitoring on Cloud VPS.
Adding new projects
The hiera key `profile::wmcs::prometheus::metricsinfra::projects:` defines which projects have monitoring enabled.
profile::wmcs::prometheus::metricsinfra::projects:
  # List of projects that are monitored by the metricsinfra prometheus server. Each project can be
  # configured with an optional notify_email list of addresses that will receive alert notifications
  # in addition to WMCS admins.
  - name: <project>
    notify_email:
      - user@example.org
      - another@example.org
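To make the meaning of the optional notify_email list concrete, here is a small Python sketch that takes the same structure (as it would look after the YAML is parsed) and works out who would be notified for each project. The admin address, the second project name and the helper name are hypothetical placeholders; how metricsinfra actually builds its receiver list is not shown here.

```python
WMCS_ADMINS = ["wmcs-admins@example.org"]  # hypothetical placeholder address

# Same shape as the hiera key above, once parsed into Python data.
projects = [
    {"name": "tools", "notify_email": ["user@example.org", "another@example.org"]},
    {"name": "other-project"},  # hypothetical project without notify_email
]


def recipients_by_project(projects, admins=WMCS_ADMINS):
    """Map each project name to the addresses that would receive its alerts."""
    result = {}
    for project in projects:
        extra = project.get("notify_email", [])  # the key is optional
        # Admins always come first; dict.fromkeys drops duplicates and keeps order.
        result[project["name"]] = list(dict.fromkeys(admins + extra))
    return result


if __name__ == "__main__":
    for name, addresses in recipients_by_project(projects).items():
        print(name, addresses)
```

A project without notify_email falls back to the admin list alone, which matches the "in addition to WMCS admins" wording of the config comment.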
Managing notifications
To silence existing or expected (downtime) notifications you can use the `amtool` command on the metricsinfra prometheus server.
View active notifications
prometheus01:~$ sudo amtool alert
Alertname                      Starts At                Summary
InstanceDown                   2020-04-29 19:10:26 UTC
PuppetAgentFailures            2020-04-29 19:20:26 UTC
WidespreadPuppetAgentFailures  2020-04-29 19:20:26 UTC
You can add a `query` to filter alerts
prometheus01:~$ sudo amtool alert query project=tools
Alertname            Starts At                Summary
PuppetAgentDisabled  2020-04-29 23:12:26 UTC
PuppetAgentDisabled  2020-04-29 23:12:26 UTC
You can use the same query syntax to silence notifications
prometheus01:~$ sudo amtool silence add project=tools -c "Silence all tools projects alerts" -d 30d
3e68bf51-63f6-4406-a009-e6765acf5d8e
View all silences
prometheus01:~$ sudo amtool silence query
ID                                    Matchers       Ends At                  Created By  Comment
3e68bf51-63f6-4406-a009-e6765acf5d8e  project=tools  2020-06-04 14:39:38 UTC  root        Silence all tools projects alerts
Expire (remove) a silence
prometheus01:~$ sudo amtool silence expire 3e68bf51-63f6-4406-a009-e6765acf5d8e
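For scripted downtimes, the silence command shown above could be assembled programmatically. The following Python sketch only builds the argument list, reusing the `-c` and `-d` flags from the example; the helper name `silence_add_command` is hypothetical, and nothing is executed.

```python
import shlex


def silence_add_command(matchers, comment, duration):
    """Build the argv for `amtool silence add` from label matchers, a comment and a duration."""
    if not matchers:
        # Refuse to build a silence that would match every alert.
        raise ValueError("at least one matcher is required")
    args = ["sudo", "amtool", "silence", "add"]
    args += [f"{label}={value}" for label, value in matchers.items()]
    args += ["-c", comment, "-d", duration]
    return args


if __name__ == "__main__":
    cmd = silence_add_command({"project": "tools"}, "Silence all tools projects alerts", "30d")
    # Print a shell-quoted command line for review.
    print(shlex.join(cmd))
```

Requiring at least one matcher is a design choice in this sketch, meant to avoid silencing everything by accident.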
Links
- Prometheus dashboard: https://prometheus.wmflabs.org/cloud
- Prometheus active alerts: https://prometheus.wmflabs.org/cloud/alerts
- Grafana alert overview: https://grafana-labs.wikimedia.org/d/woLx6H6Wz/metricsinfra-alerts
- Grafana project overview: https://grafana-labs.wikimedia.org/d/8Npp-46Zz/project-overview
- Grafana instance details: https://grafana-labs.wikimedia.org/d/000000590/instance-details
Monitoring for Toolforge
There are metrics for every node in the Toolforge cluster.
Dashboards and handy links
For an overview of what is going on in the Cloud VPS infrastructure, open these links:
Datacenter | What | Mechanism | Comments | Link |
---|---|---|---|---|
eqiad | NFS servers | icinga | labstore1xxx servers | [1] |
eqiad | Cloud VPS main services | icinga | service servers, non virts | [2] |
codfw | Cloud VPS labtest servers | icinga | all physical servers | [3] |
eqiad | Toolforge basic alerts | grafana | some interesting metrics from Toolforge | [4] |
eqiad | Toolforge grid status | custom tool | jobs running on Toolforge's grid | [5] |
any | cloud servers | icinga | all physical servers with the cloudXXXX naming scheme | [6] |
eqiad | Cloud VPS eqiad1 capacity | grafana | capacity planning | [7] |
eqiad | labstore1004/labstore1005 | grafana | load & general metrics | [8] |
eqiad | Cloud VPS eqiad1 | grafana | load & general metrics | [9] |
eqiad | Cloud VPS eqiad1 | grafana | internal openstack metrics | [10] |
eqiad | Cloud VPS eqiad1 | grafana | hypervisor metrics from openstack | [11] |
eqiad | Cloud VPS memcache | grafana | cloudservices servers | [12] |
eqiad | Toolforge | grafana | Arturo's metrics | [13] |
eqiad | Cloud HW eqiad | icinga | Icinga group for WMCS in eqiad | [14] |
eqiad | Toolforge, new kubernetes cluster | prometheus/grafana | Generic dashboard for the new Kubernetes cluster | [15] |
eqiad | Toolforge, new kubernetes cluster, namespaces | prometheus/grafana | Per-namespace dashboard for the new Kubernetes cluster | [16] |
eqiad | Toolforge, new kubernetes cluster, ingress | prometheus/grafana | dashboard about the ingress for the new kubernetes cluster | [17] |
eqiad | Toolforge | prometheus/grafana | dashboard showing a table with basic information about all VMs in the tools project | [18] |