Usability VM cluster monitoring wishes

From Wikitech-static
Revision as of 04:30, 4 July 2015 by imported>Alex Monk

Information on the VMs is at



  • Put tesla in misc group in Ganglia
    • Requires trickery: tesla runs ESXi and hence can't run gmond for the box itself (only for individual VMs), so monitoring through SNMP is the only option
  • Put all VMs in a VM group in Ganglia


  • Check for CPU > 95% on every VM and on tesla itself
  • Check for real memory usage > 75% on every VM and on tesla itself
  • Disk space checks on every VM and on tesla itself
    • VMs should have a lower percentage threshold, say 80% rather than the usual 95% or 97%, because they have small root partitions
    • tesla's threshold should be 90%
  • HTTP checks on all VMs that serve HTTP
    • set up already, isn't. Currently no other VMs serving HTTP
  • HTTP check on (Selenium server)
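
The CPU, memory and disk thresholds wished for above could be captured in one small table. The following is only a sketch in Python, not actual Nagios configuration: the names THRESHOLDS and is_critical are hypothetical, "host" stands for tesla, and treating a value exactly at the threshold as non-critical is an assumption (the list only says "> 95%" and "> 75%"; for disk it gives no comparison at all).

```python
# Hypothetical table of the wished-for thresholds, in percent used.
# "vm" applies to every VM, "host" to tesla itself.
THRESHOLDS = {
    "cpu": {"vm": 95.0, "host": 95.0},
    "memory": {"vm": 75.0, "host": 75.0},
    "disk": {"vm": 80.0, "host": 90.0},  # VMs have small root partitions
}


def is_critical(metric: str, machine_kind: str, percent_used: float) -> bool:
    """Return True when usage is strictly above the wished-for threshold.

    Assumption: a value exactly equal to the threshold is not critical.
    """
    return percent_used > THRESHOLDS[metric][machine_kind]
```

For example, is_critical("disk", "vm", 85.0) is true while is_critical("disk", "host", 85.0) is false, which reflects the lower disk threshold for VMs.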


  • Notify Ryan, Roan and possibly ops people by SMS when any CRITICAL status on the Nagios checks above persists for more than 5 minutes (to avoid sending text messages when a check merely flaps; hopefully this is possible).
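
The "persists for more than 5 minutes" rule can be illustrated with a small state tracker. This is a sketch of the logic only, under stated assumptions: Nagios would normally handle this through its own retry and notification settings, and the class CriticalTracker, its update method and the status strings are hypothetical names chosen for illustration. The sketch also assumes one SMS per CRITICAL episode, which the wish above does not specify.

```python
from typing import Optional

PERSISTENCE_SECONDS = 5 * 60  # the 5 minutes from the wish above


class CriticalTracker:
    """Tracks one check and decides when a CRITICAL state is worth an SMS."""

    def __init__(self, persistence_seconds: float = PERSISTENCE_SECONDS) -> None:
        self.persistence_seconds = persistence_seconds
        self.critical_since: Optional[float] = None  # start of current episode
        self.notified = False  # whether this episode already triggered an SMS

    def update(self, status: str, now: float) -> bool:
        """Record a check result; return True if an SMS should be sent now."""
        if status != "CRITICAL":
            # Any recovery resets the episode, so a flapping check never
            # accumulates enough continuous CRITICAL time to notify.
            self.critical_since = None
            self.notified = False
            return False
        if self.critical_since is None:
            self.critical_since = now
        elapsed = now - self.critical_since
        if not self.notified and elapsed > self.persistence_seconds:
            self.notified = True  # assumption: at most one SMS per episode
            return True
        return False
```

With timestamps in seconds, a check that goes CRITICAL, recovers a minute later and goes CRITICAL again only triggers a message once the second CRITICAL stretch has itself lasted more than 300 seconds.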