Portal:Cloud VPS/Admin/Monitoring

Revision as of 09:52, 14 May 2019

This page describes how monitoring works as deployed and managed by the WMCS team, for both Cloud VPS and Toolforge.

Deployment

There are 2 physical servers:

  • labmon1001.eqiad.wmnet: currently the active master
  • labmon1002.eqiad.wmnet: currently cold standby
Both servers have the puppet role role::wmcs::monitoring applied (modules/role/manifests/wmcs/monitoring.pp), which prepares them to collect metrics using a software stack composed of carbon, graphite and friends.

Although ideally both servers would collect and serve metrics at the same time in a cluster arrangement, right now only the master actually works. The cold standby fetches metrics from the master using rsync (/srv/carbon/whisper/), so in case of a failover we could rebuild the service without much metrics loss.
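The standby's sync described above can be sketched as follows. This is a minimal illustration, not the puppet-managed implementation: the host names and whisper path come from this page, while the function name and the exact rsync flags are assumptions.

```python
def whisper_sync_command(master="labmon1001.eqiad.wmnet",
                         src="/srv/carbon/whisper/",
                         dest="/srv/carbon/whisper/"):
    """Return the rsync command the cold standby could run (e.g. via
    subprocess.run) to mirror the master's whisper metric files."""
    # -a preserves timestamps/permissions; --delete drops files that
    # no longer exist on the master, keeping the mirror exact.
    return ["rsync", "-a", "--delete", f"{master}:{src}", dest]
```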

These bits are located at modules/profile/manifests/wmcs/monitoring.pp.

Metrics Retention

Our metrics retention policy is 90 days. Two cron jobs run as the _graphite user on labmon1001 for this task:

  • archive-deleted-instances: Moves data from deleted instances to /srv/carbon/whisper/archived_metrics
  • delete-old-instance-archives: Deletes archived data that is older than 90 days

This prevents the /srv partition from becoming full.
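Conceptually, delete-old-instance-archives amounts to the sketch below, assuming it simply removes archived whisper files whose modification time exceeds the 90-day retention window. The function name and details are illustrative assumptions; the real job is a puppet-managed script.

```python
import os
import time

def delete_old_archives(root, max_age_days=90, now=None):
    """Delete files under `root` whose mtime is older than
    max_age_days; return the list of deleted paths."""
    cutoff = (now if now is not None else time.time()) - max_age_days * 86400
    deleted = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)  # older than the retention window
                deleted.append(path)
    return deleted
```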

These bits are located at modules/graphite/manifests/wmcs/archiver.pp and modules/graphite/files/archive-instances. The archive-instances script logs operations to /var/log/graphite/instance-archiver.log.

Monitoring for Cloud VPS

There are metrics per project.
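Per-project metrics can be pulled out of Graphite's standard /render HTTP API. The helper below is illustrative only: the base URL and the idea of addressing a project by a metric namespace are assumptions, not the documented scheme.

```python
from urllib.parse import urlencode

def render_url(base, target, start="-24h", fmt="json"):
    """Build a Graphite /render URL for the given target expression,
    e.g. render_url(base, "someproject.*.cpu")."""
    return f"{base}/render?" + urlencode(
        {"target": target, "from": start, "format": fmt})
```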

Monitoring for Toolforge

There are metrics for every node in the Toolforge cluster.

Dashboards and handy links

If you want to get an overview of what's going on in the Cloud VPS infra, open these links:

Datacenter | What                      | Mechanism   | Comments                                              | Link
eqiad      | NFS servers               | icinga      | labstore1xxx servers                                  | [1]
eqiad      | Cloud VPS main services   | icinga      | service servers, non virts                            | [2]
codfw      | Cloud VPS labtest servers | icinga      | all physical servers                                  | [3]
eqiad      | Toolforge basic alerts    | grafana     | some interesting metrics from Toolforge               | [4]
eqiad      | Toolforge grid status     | custom tool | jobs running on Toolforge's grid                      | [5]
any        | cloud servers             | icinga      | all physical servers with the cloudXXXX naming scheme | [6]
eqiad      | Cloud VPS eqiad1 capacity | grafana     | capacity planning                                     | [7]
eqiad      | labstore1004/labstore1005 | grafana     | load & general metrics                                | [8]
eqiad      | Cloud VPS eqiad1          | grafana     | load & general metrics                                | [9]
eqiad      | Cloud VPS eqiad1          | grafana     | internal openstack metrics                            | [10]
eqiad      | Cloud VPS eqiad1          | grafana     | hypervisor metrics from openstack                     | https://grafana.wikimedia.org/d/aJgffPPmz/wmcs-openstack-eqiad1-hypervisor?orgId=1&var-hypervisor=cloudvirt1014
eqiad      | Cloud VPS memcache        | grafana     | cloudservices servers                                 | https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=wmcs&var-instance=All
eqiad      | Toolforge                 | grafana     | Arturo's metrics                                      | https://grafana-labs.wikimedia.org/dashboard/db/arturo-toolforge-dashboard
eqiad      | Cloud HW eqiad            | icinga      | Icinga group for WMCS in eqiad                        | https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?servicegroup=wmcs_eqiad&style=overview

See also