You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "Portal:Cloud VPS/Admin/Monitoring"

From Wikitech-static
imported>Arturo Borrero Gonzalez
imported>Majavah
(→‎Managing notifications: add minimal CLI bit back)
 
(17 intermediate revisions by 7 users not shown)
Line 7: Line 7:
There are 2 physical servers:
There are 2 physical servers:


* [[Labmon1001 | labmon1001.eqiad.wmnet]] --- currently master
* [[Cloudmetrics1001 | cloudmetrics1002.eqiad.wmnet]] --- currently master
* labmon1002.eqiad.wmnet --- currently cold standby
* cloudmetrics1001.eqiad.wmnet --- currently cold standby


Both servers get applied the [[puppet]] role '''labs::monitoring''' ([[phab:source/operations-puppet/browse/production/modules/role/manifests/labs/monitoring.pp |modules/role/manifests/labs/monitoring.pp]]), which get them ready
Both servers get applied the [[puppet]] role '''role::wmcs::monitoring''' ([[phab:source/operations-puppet/browse/production/modules/role/manifests/wmcs/monitoring.pp |modules/role/manifests/wmcs/monitoring.pp]]), which get them ready
to collect metrics using a software stack composed of carbon, [[graphite]] and friends.
to collect metrics using a software stack composed of carbon, [[graphite]], [[Prometheus]] and friends.


Although the ideal would be for both servers to collect and serve metrics at the same time using a cluster approach, right now only the master actually works. The cold standby fetch metrics using rsync from the master ('''/srv/carbon/whisper/'''),
Although the ideal would be for both servers to collect and serve metrics at the same time using a cluster approach, right now only the master actually works. The cold standby fetch metrics using rsync from the master ('''/srv/carbon/whisper/'''),
so in case of a failover we could rebuild the service without much metrics loss.
so in case of a failover we could rebuild the service without much metrics loss.


These bits are located at [[phab:source/operations-puppet/browse/production/modules/profile/manifests/labs/monitoring.pp |modules/profile/manifests/labs/monitoring.pp]].
These bits are located at [[phab:source/operations-puppet/browse/production/modules/profile/manifests/wmcs/monitoring.pp |modules/profile/manifests/wmcs/monitoring.pp]].


== Metrics life-cycle ==
== Grafana-labs Graphite-labs==


There is a cron job with a custom script ('''archive-instances''') that rotates (archives) old metrics from CloudVPS projects.
The DNS records grafana-labs.discovery.wmnet and graphite-labs.discovery.wmnet define the active web server servicing requests. This entry is managed in the DNS git repo at /dns/browse/master/templates/wmnet and configured on trafficserver in [[phab:source/operations-puppet/browse/production/hieradata/common/profile/trafficserver/backend.yaml |hieradata/common/profile/trafficserver/backend.yaml]].
Then, archived metrics get deleted if they are older than 90 days by means of a cronjob.


Both the '''/srv/''' partition and the archive directory '''/srv/carbon/whisper/archived_metrics/''' could grow out of control if these mechanisms are not properly adjusted.
== Accessing "labs" prometheus ==
Our monitoring for physical servers is a mix of production Prometheus/[[Thanos]] and the Prometheus setup on the cloudmetrics100x servers.  These are mentioned in https://grafana.wikimedia.org as "eqiad prometheus/labs".  To access the servers directly in order to troubleshoot what the scrapes are coming up with and more quickly construct queries, you can set up an ssh proxy like so:


These bits are located at:
<code>ssh -L 8000:prometheus-labmon.eqiad.wmnet:80 cloudmetrics1001.eqiad.wmnet</code>


* [[phab:source/operations-puppet/browse/production/modules/graphite/manifests/labs/archiver.pp | modules/graphite/manifests/labs/archiver.pp]]
And then point your web browser to http://localhost:8000/labs to bring up the Prometheus web interface. You can then construct and execute PromQL queries as needed per the upstream docs. Note, that sometimes a copied grafana query will not work because it has a grafana variable in it. Just watch out for things with a "$name" format, since that's not PromQL.
* [[phab:source/operations-puppet/browse/production/modules/graphite/files/archive-instances | modules/graphite/files/archive-instances]]


The script logs operations to '''/var/log/graphite/instance-archiver.log'''.
 
== Metrics Retention ==
 
Our metrics retention policy is 90 days. There are two cronjobs for the <code>_graphite</code> user that are running on <code>labmon1001</code> for this task:
 
* <code>archive-deleted-instances</code>: Moves data from deleted instances to <code>/srv/carbon/whisper/archived_metrics</code>
* <code>delete-old-instance-archives</code>: Deletes archived data that is older than 90 days
 
This prevents the <code>/srv</code> partition from becoming full.
 
* [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/graphite/manifests/wmcs/archiver.pp modules/graphite/manifests/wmcs/archiver.pp]
* [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/graphite/files/archive-instances modules/graphite/files/archive-instances]
 
The <code>archive-instances</code> script logs operations to <code>/var/log/graphite/instance-archiver.log</code>


== Monitoring for Cloud VPS ==
== Monitoring for Cloud VPS ==


There are metrics per project.
The Cloud VPS project "metricsinfra" provides the base infrastructure and services for multi-tenant instance monitoring on Cloud VPS. Technical documentation for the setup is at [[Nova Resource:Metricsinfra/Documentation]].
 
=== Adding new projects ===
The monitoring configuration is mostly kept in a Trove database. There is no interface for more user-friendly management yet, but for now you can ssh to <code>metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.cloud</code> and use <code>sudo -i mariadb</code> to edit the database by hand.
 
=== Managing notifications ===
Some hardcoded accounts (WMCS staff and some trusted volunteers) can use [https://prometheus-alerts.wmcloud.org prometheus-alerts.wmcloud.org] to create and edit silences. In the future the same interface will work for all project administrators for their own projects.
 
Alternatively to silence existing or expected (downtime) notifications you can use the `amtool` command on any metricsinfra alertmanager server (currently for example metricsinfra-alertmanager-1.metricsinfra.eqiad1.wikimedia.cloud). For example to silence all Toolsbeta alerts you could use:
<syntaxhighlight lang="shell-session">
metricsinfra-alertmanager-1:~$ sudo amtool silence add project=toolsbeta -c "per T123456" -d 30d
3e68bf51-63f6-4406-a009-e6765acf5d8e
</syntaxhighlight>
 
=== Links ===
* Prometheus dashboard: https://prometheus.wmflabs.org/cloud
* Prometheus active alerts: https://prometheus.wmflabs.org/cloud/alerts
* Grafana alert overview: https://grafana-labs.wikimedia.org/d/woLx6H6Wz/metricsinfra-alerts
* Grafana project overview: https://grafana-labs.wikimedia.org/d/8Npp-46Zz/project-overview
* Grafana instance details: https://grafana-labs.wikimedia.org/d/000000590/instance-details


== Monitoring for Toolforge ==
== Monitoring for Toolforge ==
Line 57: Line 88:
| labstore1xxx servers
| labstore1xxx servers
| [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=labsnfs_eqiad&style=hostservicedetail]
| [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=labsnfs_eqiad&style=hostservicedetail]
|-
! eqiad
| NFS Server Statistics
| grafana
| labstore and cloudstore NFS operations, connections and various details
| [https://grafana.wikimedia.org/d/ykpqNajZk/cloud-nfs-stats?orgId=1]
|-
|-
! eqiad
! eqiad
Line 75: Line 112:
| some interesting metrics from Toolforge
| some interesting metrics from Toolforge
| [https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts]
| [https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts]
|-
! eqiad
| ToolsDB (Toolforge R/W MariaDB)
| grafana
| Database metrics for ToolsDB servers
| [https://grafana-labs.wikimedia.org/d/000000273/tools-mariadb?orgId=1]
|-
|-
! eqiad
! eqiad
Line 80: Line 123:
| custom tool
| custom tool
| jobs running on Toolforge's grid
| jobs running on Toolforge's grid
| [https://tools.wmflabs.org/admin/oge/status]
| [https://sge-status.toolforge.org/]
|-
|-
! any
! any
Line 105: Line 148:
| load & general metrics
| load & general metrics
| [https://grafana.wikimedia.org/dashboard/db/cloudvps-eqiad1]
| [https://grafana.wikimedia.org/dashboard/db/cloudvps-eqiad1]
|-
! eqiad
| Cloud VPS eqiad1
| grafana
| internal openstack metrics
| [https://grafana.wikimedia.org/dashboard/db/wmcs-openstack-eqiad1]
|-
! eqiad
| Cloud VPS eqiad1
| grafana
| hypervisor metrics from openstack
| [https://grafana.wikimedia.org/d/aJgffPPmz/wmcs-openstack-eqiad1-hypervisor?orgId=1&var-hypervisor=cloudvirt1014]
|-
! eqiad
| Cloud VPS memcache
| grafana
| cloudservices servers
| [https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=wmcs&var-instance=All]
|-
! eqiad
| openstack database backend (per host)
| grafana
| mariadb/galera on cloudcontrols
| [https://grafana.wikimedia.org/d/tN1aK6MGk/cloudcontrol-mysql]
|-
! eqiad
| openstack database backend (aggregated)
| grafana
| mariadb/galera on cloudcontrols
| [https://grafana.wikimedia.org/d/8KPwK6GMk/cloudcontrol-mysql-aggregated]
|-
! eqiad
| Toolforge
| grafana
| Arturo's metrics
| [https://grafana-labs.wikimedia.org/dashboard/db/arturo-toolforge-dashboard]
|-
! eqiad
| Cloud HW eqiad
| icinga
| Icinga group for WMCS in eqiad
| [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?servicegroup=wmcs_eqiad&style=overview]
|-
! eqiad
| Toolforge, new kubernetes cluster
| prometheus/grafana
| Generic dashboard for the new Kubernetes cluster
| [https://grafana-labs.wikimedia.org/d/toolforge-kubernetes/toolforge-kubernetes?refresh=1m&orgId=1]
|-
! eqiad
| Toolforge, new kubernetes cluster, namespaces
| prometheus/grafana
| Per-namespace dashboard for the new Kubernetes cluster
| [https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources]
|-
! eqiad
| Toolforge, new kubernetes cluster, ingress
| prometheus/grafana
| dashboard about the ingress for the new kubernetes cluster
| [https://grafana-labs.wikimedia.org/d/R7BPaEbWk/toolforge-ingress]
|-
! eqiad
| Toolforge
| prometheus/grafana
| dashboard showing a table with basic information about all VMs in the tools project
| [https://grafana-labs.wikimedia.org/d/mbEvbK2Wz/toolforge-vm-table?orgId=1&refresh=1m]
|-
! eqiad
| Toolforge email server
| prometheus/grafana
| dashboard showing data about Toolforge exim email server
| [https://grafana-labs.wikimedia.org/d/HcDsu-WGk/toolforge-email-dashboard]
|- class="sortbottom"
|- class="sortbottom"
! Datacenter
! Datacenter
Line 119: Line 234:
[[Category:Cloud Services admin|Wikimedia VPS]]
[[Category:Cloud Services admin|Wikimedia VPS]]
[[Category:VPS admin|Wikimedia VPS]]
[[Category:VPS admin|Wikimedia VPS]]
[[Category:Toolforge]]

Latest revision as of 17:20, 7 August 2021

This page describes how monitoring works as deployed and managed by the WMCS team, for both Cloud VPS and Toolforge.

Deployment

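There are 2 physical servers:

  • cloudmetrics1002.eqiad.wmnet (currently master)
  • cloudmetrics1001.eqiad.wmnet (currently cold standby)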

Both servers have the puppet role role::wmcs::monitoring (modules/role/manifests/wmcs/monitoring.pp) applied, which gets them ready to collect metrics using a software stack composed of carbon, graphite, Prometheus and friends.

Ideally both servers would collect and serve metrics at the same time using a cluster approach, but right now only the master actually works. The cold standby fetches metrics from the master (/srv/carbon/whisper/) using rsync, so in case of a failover we could rebuild the service without losing many metrics.

These bits are located at modules/profile/manifests/wmcs/monitoring.pp.

Grafana-labs Graphite-labs

The DNS records grafana-labs.discovery.wmnet and graphite-labs.discovery.wmnet define the active web server servicing requests. This entry is managed in the DNS git repo at /dns/browse/master/templates/wmnet and configured on trafficserver in hieradata/common/profile/trafficserver/backend.yaml.

Accessing "labs" prometheus

Our monitoring for physical servers is a mix of production Prometheus/Thanos and the Prometheus setup on the cloudmetrics100x servers. The latter is listed in https://grafana.wikimedia.org as "eqiad prometheus/labs". To access the servers directly, for example to troubleshoot what the scrapes are coming up with or to construct queries more quickly, you can set up an ssh proxy like so:

ssh -L 8000:prometheus-labmon.eqiad.wmnet:80 cloudmetrics1001.eqiad.wmnet

Then point your web browser to http://localhost:8000/labs to bring up the Prometheus web interface. You can then construct and execute PromQL queries as needed per the upstream docs. Note that a query copied from Grafana will sometimes not work because it contains a Grafana variable. Watch out for anything in a "$name" format, since that is not PromQL.
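The same check can be scripted. The following is a minimal sketch, not part of the WMCS tooling: grafana_vars and prom_query are hypothetical helper names, the variable detection is a simple heuristic for the "$name" and "${name}" forms, and the API path under /labs is an assumption based on the web interface URL above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they assume the ssh tunnel on localhost:8000 is open.

# Print every Grafana-style variable ($name or ${name}) found in a query,
# one per line. Prints nothing when the query looks like plain PromQL.
grafana_vars() {
    printf '%s\n' "$1" \
        | { grep -oE '\$([A-Za-z_][A-Za-z0-9_]*|\{[A-Za-z_][A-Za-z0-9_]*\})' || true; } \
        | sort -u
}

# Run an instant query through the tunnel using the Prometheus HTTP API.
# Refuses to send the query if it still contains Grafana variables.
prom_query() {
    local query=$1
    if [ -n "$(grafana_vars "$query")" ]; then
        echo "query still contains Grafana variables:" >&2
        grafana_vars "$query" >&2
        return 1
    fi
    # Assumption: the API is served under the same /labs prefix as the web UI.
    curl -sG 'http://localhost:8000/labs/api/v1/query' \
        --data-urlencode "query=${query}"
}
```

For example, prom_query 'up' would return the JSON result of the instant query, while prom_query 'up{instance=~"$host"}' fails early and names the offending variable.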


Metrics Retention

Our metrics retention policy is 90 days. There are two cronjobs for the _graphite user that are running on labmon1001 for this task:

  • archive-deleted-instances: Moves data from deleted instances to /srv/carbon/whisper/archived_metrics
  • delete-old-instance-archives: Deletes archived data that is older than 90 days

This prevents the /srv partition from becoming full.

The archive-instances script logs operations to /var/log/graphite/instance-archiver.log.
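To check what the 90-day rule would currently match, a dry run along these lines may help. This is only a sketch and not the actual cron job (that lives in the puppet module); list_expired_archives is a hypothetical helper, and using file modification time as the age criterion is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints matching files and deletes nothing.
# Usage: list_expired_archives [DIR] [DAYS]
list_expired_archives() {
    local dir=${1:-/srv/carbon/whisper/archived_metrics}
    local days=${2:-90}
    # -mtime +N matches files whose age, rounded down to whole days,
    # is greater than N.
    find "$dir" -type f -mtime "+${days}" -print
}
```

Running list_expired_archives with no arguments lists archived files past the retention window; if that list keeps growing, the deletion job is probably not running.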

Monitoring for Cloud VPS

The Cloud VPS project "metricsinfra" provides the base infrastructure and services for multi-tenant instance monitoring on Cloud VPS. Technical documentation for the setup is at Nova Resource:Metricsinfra/Documentation.

Adding new projects

The monitoring configuration is mostly kept in a Trove database. There is no interface for more user-friendly management yet, but for now you can ssh to metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.cloud and use sudo -i mariadb to edit the database by hand.

Managing notifications

Some hardcoded accounts (WMCS staff and some trusted volunteers) can use prometheus-alerts.wmcloud.org to create and edit silences. In the future the same interface will work for all project administrators for their own projects.

Alternatively, to silence existing or expected (downtime) notifications, you can use the `amtool` command on any metricsinfra alertmanager server (currently, for example, metricsinfra-alertmanager-1.metricsinfra.eqiad1.wikimedia.cloud). For example, to silence all Toolsbeta alerts you could use:

metricsinfra-alertmanager-1:~$ sudo amtool silence add project=toolsbeta -c "per T123456" -d 30d
3e68bf51-63f6-4406-a009-e6765acf5d8e
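If silences are added often, a small wrapper can enforce that every silence comment references a task. This is a sketch only: silence_project and the AMTOOL override are hypothetical, and the task format check assumes Phabricator-style IDs such as T123456.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around "amtool silence add".
# Usage: silence_project PROJECT TASK [DURATION]
# Set AMTOOL=echo to preview the arguments instead of creating a silence.
silence_project() {
    local project=$1 task=$2 duration=${3:-1d}
    # Require a task ID so the comment explains why the silence exists.
    if ! [[ $task =~ ^T[0-9]+$ ]]; then
        echo "task must look like T123456" >&2
        return 1
    fi
    # Intentionally unquoted so the default "sudo amtool" splits into two words.
    ${AMTOOL:-sudo amtool} silence add "project=${project}" \
        -c "per ${task}" -d "${duration}"
}
```

amtool also provides the silence query and silence expire subcommands, which list active silences and remove one by the ID printed when it was created.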

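Links

  • Prometheus dashboard: https://prometheus.wmflabs.org/cloud
  • Prometheus active alerts: https://prometheus.wmflabs.org/cloud/alerts
  • Grafana alert overview: https://grafana-labs.wikimedia.org/d/woLx6H6Wz/metricsinfra-alerts
  • Grafana project overview: https://grafana-labs.wikimedia.org/d/8Npp-46Zz/project-overview
  • Grafana instance details: https://grafana-labs.wikimedia.org/d/000000590/instance-details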

Monitoring for Toolforge

There are metrics for every node in the Toolforge cluster.

Dashboards and handy links

If you want to get an overview of what's going on in the Cloud VPS infra, open these links:

Datacenter | What | Mechanism | Comments | Link
eqiad | NFS servers | icinga | labstore1xxx servers | [1]
eqiad | NFS Server Statistics | grafana | labstore and cloudstore NFS operations, connections and various details | [2]
eqiad | Cloud VPS main services | icinga | service servers, non virts | [3]
codfw | Cloud VPS labtest servers | icinga | all physical servers | [4]
eqiad | Toolforge basic alerts | grafana | some interesting metrics from Toolforge | [5]
eqiad | ToolsDB (Toolforge R/W MariaDB) | grafana | Database metrics for ToolsDB servers | [6]
eqiad | Toolforge grid status | custom tool | jobs running on Toolforge's grid | [7]
any | cloud servers | icinga | all physical servers with the cloudXXXX naming scheme | [8]
eqiad | Cloud VPS eqiad1 capacity | grafana | capacity planning | [9]
eqiad | labstore1004/labstore1005 | grafana | load & general metrics | [10]
eqiad | Cloud VPS eqiad1 | grafana | load & general metrics | [11]
eqiad | Cloud VPS eqiad1 | grafana | internal openstack metrics | [12]
eqiad | Cloud VPS eqiad1 | grafana | hypervisor metrics from openstack | [13]
eqiad | Cloud VPS memcache | grafana | cloudservices servers | [14]
eqiad | openstack database backend (per host) | grafana | mariadb/galera on cloudcontrols | [15]
eqiad | openstack database backend (aggregated) | grafana | mariadb/galera on cloudcontrols | [16]
eqiad | Toolforge | grafana | Arturo's metrics | [17]
eqiad | Cloud HW eqiad | icinga | Icinga group for WMCS in eqiad | [18]
eqiad | Toolforge, new kubernetes cluster | prometheus/grafana | Generic dashboard for the new Kubernetes cluster | [19]
eqiad | Toolforge, new kubernetes cluster, namespaces | prometheus/grafana | Per-namespace dashboard for the new Kubernetes cluster | [20]
eqiad | Toolforge, new kubernetes cluster, ingress | prometheus/grafana | dashboard about the ingress for the new kubernetes cluster | [21]
eqiad | Toolforge | prometheus/grafana | dashboard showing a table with basic information about all VMs in the tools project | [22]
eqiad | Toolforge email server | prometheus/grafana | dashboard showing data about Toolforge exim email server | [23]

See also