Difference between revisions of "Portal:Cloud VPS/Admin/Monitoring"

imported>Jhedden
imported>BryanDavis
(→‎Managing notifications: prettier example formatting)
Revision as of 16:47, 18 May 2020

This page describes how monitoring works as deployed and managed by the WMCS team, for both Cloud VPS and Toolforge.

Deployment

There are 2 physical servers:

Both servers have the puppet role role::wmcs::monitoring (modules/role/manifests/wmcs/monitoring.pp) applied, which gets them ready to collect metrics using a software stack composed of carbon, graphite and friends.

Ideally both servers would collect and serve metrics at the same time as a cluster, but right now only the master actually works. The cold standby fetches metrics from the master (/srv/carbon/whisper/) using rsync, so in case of a failover we could rebuild the service without losing many metrics.

These bits are located at modules/profile/manifests/wmcs/monitoring.pp.

Grafana-labs Graphite-labs

The DNS records grafana-labs.discovery.wmnet and graphite-labs.discovery.wmnet define the active web server serving requests. These records are managed in the DNS git repo at /dns/browse/master/templates/wmnet and configured on trafficserver in hieradata/common/profile/trafficserver/backend.yaml.

Metrics Retention

Our metrics retention policy is 90 days. There are two cronjobs for the _graphite user that are running on labmon1001 for this task:

  • archive-deleted-instances: Moves data from deleted instances to /srv/carbon/whisper/archived_metrics
  • delete-old-instance-archives: Deletes archived data that is older than 90 days

This prevents the /srv partition from becoming full.

The archive-instances script logs operations to /var/log/graphite/instance-archiver.log
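The 90-day rule can be illustrated with a short Python sketch. This is not the actual cron script: the helper name `expired_archives` and the use of file modification time as the age of an archive are assumptions made for illustration only.

```python
import time
from pathlib import Path

RETENTION_DAYS = 90  # retention policy stated above


def expired_archives(archive_dir, now=None, retention_days=RETENTION_DAYS):
    """Return entries in archive_dir whose mtime is older than the retention window.

    Hypothetical helper: the real delete-old-instance-archives job may
    determine age differently. Nothing is deleted here.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    return sorted(
        path for path in Path(archive_dir).iterdir()
        if path.stat().st_mtime < cutoff
    )
```

Calling `expired_archives("/srv/carbon/whisper/archived_metrics")` would only list candidates for deletion, which makes it safe to run as a dry check.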

Monitoring for Cloud VPS

The Cloud VPS project "metricsinfra" provides the base infrastructure and services for multi-tenant instance monitoring on Cloud VPS.

Adding new projects

The hiera key `profile::wmcs::prometheus::metricsinfra::projects:` defines which projects have monitoring enabled.

profile::wmcs::prometheus::metricsinfra::projects:
  # List of projects that are monitored by the metricsinfra prometheus server. Each project can be
  # configured with an optional notify_email list of addresses that will receive alert notifications
  # in addition to WMCS admins.
  - name: <project>
    notify_email:
      - user@example.org
      - another@example.org
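If the project list is generated rather than edited by hand, a small Python sketch along these lines could render it. The function name `render_projects` is hypothetical, and the sketch assumes project names and addresses are plain strings that need no YAML quoting.

```python
def render_projects(projects):
    """Render a list of project dicts as a hiera YAML fragment.

    Each dict needs "name"; "notify_email" is an optional list of addresses.
    Assumes values are simple strings that need no YAML quoting.
    """
    lines = ["profile::wmcs::prometheus::metricsinfra::projects:"]
    for project in projects:
        lines.append(f"  - name: {project['name']}")
        emails = project.get("notify_email", [])
        if emails:
            lines.append("    notify_email:")
            lines.extend(f"      - {address}" for address in emails)
    return "\n".join(lines) + "\n"
```

A project without `notify_email` is rendered with only its `name`, matching the description of the list as optional.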

Managing notifications

To silence existing or expected (downtime) notifications you can use the `amtool` command on the metricsinfra prometheus server (prometheus01.metricsinfra.eqiad.wmflabs).

View active notifications

prometheus01:~$ sudo amtool alert
Alertname                      Starts At                Summary  
InstanceDown                   2020-04-29 19:10:26 UTC           
PuppetAgentFailures            2020-04-29 19:20:26 UTC           
WidespreadPuppetAgentFailures  2020-04-29 19:20:26 UTC

You can add a `query` to filter alerts

prometheus01:~$ sudo amtool alert query project=tools
Alertname            Starts At                Summary  
PuppetAgentDisabled  2020-04-29 23:12:26 UTC           
PuppetAgentDisabled  2020-04-29 23:12:26 UTC
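For scripting around these listings, the table can be turned into records with a few lines of Python. This is a rough sketch that assumes the layout seen in the sample output (columns separated by two or more spaces, and only trailing columns left empty); the function name `parse_amtool_table` is made up for this example.

```python
import re


def parse_amtool_table(text):
    """Parse a simple amtool table into a list of dicts keyed by header.

    Assumes columns are separated by two or more spaces and that only
    trailing columns (such as Summary) may be empty. The shell prompt
    line must not be part of the input.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    headers = re.split(r"\s{2,}", lines[0])
    rows = []
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line)
        cells += [""] * (len(headers) - len(cells))  # pad missing trailing cells
        rows.append(dict(zip(headers, cells)))
    return rows
```

An empty column in the middle of a row would shift the remaining cells, so this approach is only suitable for output shaped like the examples on this page.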

You can use the same query syntax to silence notifications

prometheus01:~$ sudo amtool silence add project=tools -c "Silence all tools projects alerts" -d 30d
3e68bf51-63f6-4406-a009-e6765acf5d8e
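When silences are created from a script, building the argument list explicitly avoids quoting mistakes in the comment. The following Python sketch uses only the `-c` and `-d` flags shown in the command above; the helper name `silence_add_command` and its insistence on a non-empty comment are choices of this example.

```python
import shlex


def silence_add_command(matchers, comment, duration):
    """Build the argv for `sudo amtool silence add` from label matchers.

    matchers is a dict such as {"project": "tools"}; comment and duration
    are passed with the -c and -d flags used in the example above.
    """
    if not matchers:
        raise ValueError("at least one matcher is required")
    if not comment:
        raise ValueError("a comment is required so the silence can be traced")
    argv = ["sudo", "amtool", "silence", "add"]
    argv += [f"{label}={value}" for label, value in matchers.items()]
    argv += ["-c", comment, "-d", duration]
    return argv


if __name__ == "__main__":
    # Print the command as a shell-safe string for review before running it.
    print(shlex.join(silence_add_command(
        {"project": "tools"}, "Silence all tools projects alerts", "30d")))
```

The returned list can be handed to `subprocess.run` as is, so no shell parsing of the comment text takes place.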

View all silences

prometheus01:~$ sudo amtool silence query
ID                                    Matchers       Ends At                  Created By  Comment                            
3e68bf51-63f6-4406-a009-e6765acf5d8e  project=tools  2020-06-04 14:39:38 UTC  root        Silence all tools projects alerts
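The `Ends At` value is the creation time plus the duration given with `-d`. As a worked check in Python, assuming the silence above was created at 2020-05-05 14:39:38 UTC (inferred as 30 days before `Ends At`, not stated on this page): the helper `silence_end` is hypothetical, and the units it accepts are a choice of this sketch rather than a statement of what amtool accepts.

```python
from datetime import datetime, timedelta, timezone

# Units handled by this sketch only; not a list of what amtool supports.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def silence_end(start, duration):
    """Return start plus a single-unit duration such as "30d" or "2h"."""
    value, unit = duration[:-1], duration[-1]
    if unit not in UNITS or not value.isdigit():
        raise ValueError(f"unsupported duration: {duration!r}")
    return start + timedelta(**{UNITS[unit]: int(value)})


if __name__ == "__main__":
    created = datetime(2020, 5, 5, 14, 39, 38, tzinfo=timezone.utc)  # assumed
    print(silence_end(created, "30d").strftime("%Y-%m-%d %H:%M:%S UTC"))
```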

Expire (remove) a silence

prometheus01:~$ sudo amtool silence expire 3e68bf51-63f6-4406-a009-e6765acf5d8e

Links

  • Prometheus dashboard: https://prometheus.wmflabs.org/cloud

Monitoring for Toolforge

There are metrics for every node in the Toolforge cluster.

Dashboards and handy links

If you want an overview of what's going on in the Cloud VPS infrastructure, open these links:

Datacenter | What | Mechanism | Comments | Link
eqiad | NFS servers | icinga | labstore1xxx servers | [1]
eqiad | Cloud VPS main services | icinga | service servers, non virts | [2]
codfw | Cloud VPS labtest servers | icinga | all physical servers | [3]
eqiad | Toolforge basic alerts | grafana | some interesting metrics from Toolforge | [4]
eqiad | Toolforge grid status | custom tool | jobs running on Toolforge's grid | [5]
any | cloud servers | icinga | all physical servers with the cloudXXXX naming scheme | [6]
eqiad | Cloud VPS eqiad1 capacity | grafana | capacity planning | [7]
eqiad | labstore1004/labstore1005 | grafana | load & general metrics | [8]
eqiad | Cloud VPS eqiad1 | grafana | load & general metrics | [9]
eqiad | Cloud VPS eqiad1 | grafana | internal openstack metrics | [10]
eqiad | Cloud VPS eqiad1 | grafana | hypervisor metrics from openstack | [11]
eqiad | Cloud VPS memcache | grafana | cloudservices servers | [12]
eqiad | Toolforge | grafana | Arturo's metrics | [13]
eqiad | Cloud HW eqiad | icinga | Icinga group for WMCS in eqiad | [14]
eqiad | Toolforge, new kubernetes cluster | prometheus/grafana | Generic dashboard for the new Kubernetes cluster | [15]
eqiad | Toolforge, new kubernetes cluster, namespaces | prometheus/grafana | Per-namespace dashboard for the new Kubernetes cluster | [16]
eqiad | Toolforge, new kubernetes cluster, ingress | prometheus/grafana | dashboard about the ingress for the new kubernetes cluster | [17]
eqiad | Toolforge | prometheus/grafana | dashboard showing a table with basic information about all VMs in the tools project | [18]

See also