Portal:Cloud VPS/Admin/Monitoring
imported>Arturo Borrero Gonzalez
Revision as of 13:11, 13 January 2020
This page is currently a draft.
More information and discussion about changes to this draft on the talk page.
This page describes how monitoring works as deployed and managed by the WMCS team, for both Cloud VPS and Toolforge.
There are 2 physical servers:
- cloudmetrics1002.eqiad.wmnet --- currently master
- cloudmetrics1001.eqiad.wmnet --- currently cold standby
Both servers have the puppet role role::wmcs::monitoring (modules/role/manifests/wmcs/monitoring.pp) applied, which prepares them to collect metrics using a software stack composed of carbon, graphite and friends.
Ideally, both servers would collect and serve metrics at the same time as a cluster, but right now only the master is active. The cold standby fetches metrics from the master (/srv/carbon/whisper/) using rsync, so in case of a failover we could rebuild the service without losing many metrics.
This logic is defined in modules/profile/manifests/wmcs/monitoring.pp.
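As a rough illustration of the standby sync described above, the following minimal sketch builds an rsync command that pulls the whisper tree from the master. The function name and the choice of `--archive` are assumptions made for illustration; the real invocation lives in the puppet profile and may differ (for example, it may use an rsync daemon module rather than a remote-shell path).

```python
from typing import List

# Hostname and path are taken from the text above.
MASTER = "cloudmetrics1002.eqiad.wmnet"
WHISPER_DIR = "/srv/carbon/whisper/"


def build_rsync_command(master: str = MASTER, path: str = WHISPER_DIR) -> List[str]:
    """Return an rsync argument list that copies the whisper tree from the master.

    Hypothetical helper: --archive (recursive copy preserving timestamps and
    permissions) is an assumed flag, not taken from the actual puppet code.
    The trailing slash on the source copies the directory contents rather
    than the directory itself.
    """
    return ["rsync", "--archive", f"{master}:{path}", path]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe anywhere.
    print(" ".join(build_rsync_command()))
```

On the cold standby, a command like this would be run periodically; after a failover, the standby would then hold data at most one sync interval old.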
Our metrics retention policy is 90 days. There are two cronjobs for the _graphite user that are running on labmon1001 for this task:
- archive-deleted-instances: Moves data from deleted instances to
- delete-old-instance-archives: Deletes archived data that is older than 90 days
This prevents the /srv partition from becoming full.
archive-instances script logs operations to
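The two-step retention scheme above (archive first, delete after 90 days) can be sketched as follows. This is not the actual cron scripts: the function names, the directory layout (one sub-directory per instance), and the use of directory modification time as the age of an archive are all assumptions made for illustration.

```python
import os
import shutil
import time

RETENTION_DAYS = 90  # retention policy stated in the text


def archive_deleted_instances(whisper_root, archive_root, live_instances):
    """Move metric directories of instances that no longer exist into the archive.

    Assumes one sub-directory per instance under whisper_root (hypothetical layout).
    Returns the names that were moved.
    """
    moved = []
    for name in sorted(os.listdir(whisper_root)):
        src = os.path.join(whisper_root, name)
        if os.path.isdir(src) and name not in live_instances:
            shutil.move(src, os.path.join(archive_root, name))
            moved.append(name)
    return moved


def delete_old_archives(archive_root, now=None, retention_days=RETENTION_DAYS):
    """Delete archived directories older than the retention window.

    Age is judged by the directory's mtime, which is an assumption of this sketch.
    Returns the names that were deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    deleted = []
    for name in sorted(os.listdir(archive_root)):
        path = os.path.join(archive_root, name)
        if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
            shutil.rmtree(path)
            deleted.append(name)
    return deleted
```

Keeping the archive step separate from the delete step gives a grace period in which data from an instance deleted by mistake can still be recovered.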
Monitoring for Cloud VPS
There are metrics per project.
Monitoring for Toolforge
There are metrics for every node in the Toolforge cluster.
Dashboards and handy links
If you want an overview of what's going on in the Cloud VPS infrastructure, open these links:
|eqiad||NFS servers||icinga||labstore1xxx servers|||
|eqiad||Cloud VPS main services||icinga||service servers, non virts|||
|codfw||Cloud VPS labtest servers||icinga||all physical servers|||
|eqiad||Toolforge basic alerts||grafana||some interesting metrics from Toolforge|||
|eqiad||Toolforge grid status||custom tool||jobs running on Toolforge's grid|||
|any||cloud servers||icinga||all physical servers with the cloudXXXX naming scheme|||
|eqiad||Cloud VPS eqiad1 capacity||grafana||capacity planning|||
|eqiad||labstore1004/labstore1005||grafana||load & general metrics|||
|eqiad||Cloud VPS eqiad1||grafana||load & general metrics|||
|eqiad||Cloud VPS eqiad1||grafana||internal openstack metrics|||
|eqiad||Cloud VPS eqiad1||grafana||hypervisor metrics from openstack|||
|eqiad||Cloud VPS memcache||grafana||cloudservices servers|||
|eqiad||Cloud HW eqiad||icinga||Icinga group for WMCS in eqiad|||
|eqiad||Toolforge, new kubernetes cluster||prometheus/grafana||Generic dashboard for the new Kubernetes cluster|||
|eqiad||Toolforge, new kubernetes cluster, namespaces||prometheus/grafana||Per-namespace dashboard for the new Kubernetes cluster|||
|eqiad||Toolforge, new kubernetes cluster, ingress||prometheus/grafana||dashboard about the ingress for the new kubernetes cluster|||