Usability VM cluster monitoring wishes
Revision as of 04:30, 4 July 2015 by imported>Alex Monk
This page contains historical information. It is probably no longer true.
Information on the VMs is at http://ryandlane.com/wiki/Category:Servers
Monitoring
Ganglia
- Put tesla in misc group in Ganglia
- Requires trickery: tesla runs ESXi, so it can't run gmond for the box itself (only for individual VMs); monitoring through SNMP is the only option
- Put all VMs in a VM group in Ganglia
Nagios
- Check for CPU > 95% on every VM and on tesla itself
- Check for real memory usage > 75% on every VM and on tesla itself
- Disk space checks on every VM and on tesla itself
- VMs should have a lower percentage threshold, say 80% rather than the usual 95% or 97%, because they have small root partitions
- tesla's threshold should be 90%
- HTTP checks on all VMs that serve HTTP
- prototype.wikimedia.org is set up already; commons.prototype.wikimedia.org isn't. Currently no other VMs serve HTTP
- HTTP check on grid.tesla.usability.wikimedia.org:4444/console (Selenium server)
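The CPU, memory and disk thresholds listed above can be sketched roughly as follows. This is only an illustration in Python, not an actual Nagios plugin or configuration: the function name `check_host`, the `kind` labels and the way usage percentages are passed in are all assumptions. Only the threshold values come from the list, and the exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
# Threshold values taken from the wishlist; everything else is hypothetical.
CPU_CRIT = 95.0                          # CPU usage, percent
MEM_CRIT = 75.0                          # real memory usage, percent
DISK_CRIT = {"vm": 80.0, "host": 90.0}   # VMs have small root partitions

OK, CRITICAL = 0, 2                      # conventional Nagios plugin exit codes


def check_host(kind, cpu_pct, mem_pct, disk_pct):
    """Return (exit_code, message) for one machine.

    kind is "vm" for a virtual machine or "host" for tesla itself
    (hypothetical labels chosen for this sketch).
    """
    problems = []
    if cpu_pct > CPU_CRIT:
        problems.append(f"CPU {cpu_pct:.0f}% > {CPU_CRIT:.0f}%")
    if mem_pct > MEM_CRIT:
        problems.append(f"memory {mem_pct:.0f}% > {MEM_CRIT:.0f}%")
    if disk_pct > DISK_CRIT[kind]:
        problems.append(f"disk {disk_pct:.0f}% > {DISK_CRIT[kind]:.0f}%")
    if problems:
        return CRITICAL, "CRITICAL: " + ", ".join(problems)
    return OK, "OK"
```

With these numbers, a disk that is 85% full would be CRITICAL on a VM but still OK on tesla, which is the asymmetry the list asks for.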
Notification
- Notify Ryan, Roan and possibly ops people by SMS when any CRITICAL status on the Nagios checks above persists for more than 5 minutes. The delay is meant to prevent text messages when Nagios merely flaps; hopefully this is possible.
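The "persists for more than 5 minutes" rule can be illustrated with a small sketch. Nagios itself normally gets this effect through soft and hard states (the `max_check_attempts` and `retry_interval` settings), so the Python below is not how Nagios implements it. It only shows the intended logic, and the class name `PersistenceGate` and its interface are hypothetical.

```python
class PersistenceGate:
    """Signal a notification only after CRITICAL has lasted long enough.

    Hypothetical helper for illustration; timestamps are plain seconds
    supplied by the caller.
    """

    def __init__(self, hold_seconds=300):
        self.hold_seconds = hold_seconds   # 5 minutes, per the wishlist
        self.critical_since = None         # when the current CRITICAL run began
        self.notified = False              # whether this run already notified

    def observe(self, status, now):
        """Feed one check result; return True when an SMS should be sent."""
        if status != "CRITICAL":
            # Any recovery resets the timer, so a flapping check never fires.
            self.critical_since = None
            self.notified = False
            return False
        if self.critical_since is None:
            self.critical_since = now
        if not self.notified and now - self.critical_since > self.hold_seconds:
            self.notified = True
            return True
        return False
```

A check that goes CRITICAL, recovers after a few minutes and fails again restarts the five-minute clock, so only a sustained outage would trigger a text message, and it would do so once per outage.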