This page contains historical information. It is probably no longer true.
Proposed monitoring procedure
- Check Nagios for new alerts.
- Fix simple issues such as daemons that need restarting or servers that can be rebooted remotely.
- Record any issues that need on-site attention under datacenter tasks.
- Pass responsibility for any more complex software issues to a competent staff member.
- Capacity check. Make sure key metrics such as application CPU utilisation and disk space usage are not approaching dangerous limits.
- Publish a report detailing the times at which Nagios was checked, the issues noted, and any people notified. Alternatively, make this information continuously available for weekly review.
- Another team member should check the report and confirm that the monitoring was carried out to an appropriate standard.
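
The capacity check step above could be partly automated. The following is a minimal sketch in Python, not part of the original procedure: the threshold values, the metric names, and the helper functions `disk_usage_fraction` and `approaching_limits` are all assumptions chosen for illustration, and the CPU figure is a placeholder that would in practice come from whatever metrics source is in use.

```python
import shutil

# Hypothetical warning thresholds, as a fraction of capacity.
# These values are assumptions, not taken from the procedure.
WARN_THRESHOLDS = {"disk": 0.85, "cpu": 0.80}


def disk_usage_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def approaching_limits(metrics, thresholds=WARN_THRESHOLDS):
    """Return the sorted names of metrics at or above their warning threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    )


if __name__ == "__main__":
    # The CPU value is a placeholder; only disk usage is measured here.
    metrics = {"disk": disk_usage_fraction("/"), "cpu": 0.42}
    print(approaching_limits(metrics))
```

Any metric name returned by `approaching_limits` would be a candidate for the report described above.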
Every one to two months:
- Capacity review. Analyse capacity metrics and report your findings. Notify the team of upcoming performance bottlenecks which might require hardware purchases.
- Report any long-term issues which have been left unfixed.
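
One way to support the capacity review is to fit a straight line to recent usage samples and estimate when a limit would be reached. The sketch below uses an ordinary least-squares fit; the function name `days_until_limit`, the sample format, and the assumption of roughly linear growth are all illustrative choices and are not part of the original procedure.

```python
def days_until_limit(samples, limit):
    """Estimate the days remaining until a linear trend reaches limit.

    samples is a list of (day, value) pairs. Returns the number of days
    after the latest sample at which the fitted line reaches limit, or
    None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)

    # Least-squares slope and intercept of value against day.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None

    last_day = max(x for x, _ in samples)
    day_at_limit = (limit - intercept) / slope
    # Clamp to zero if the trend has already crossed the limit.
    return max(0.0, day_at_limit - last_day)
```

For example, with disk usage of 50%, 60% and 70% on days 0, 10 and 20, `days_until_limit([(0, 50), (10, 60), (20, 70)], 90)` gives an estimate of 20 days. A straight-line fit is only a rough guide, so the result should be treated as a prompt for further analysis rather than a purchasing deadline.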