This page contains historical information. It is probably no longer true.
GridMonitor is a replacement for the ageing servmon monitor. A GridMonitor installation consists of an arbitrary number of instances, one per node, linked in a full mesh (i.e. every node is connected to every other node). One instance is elected at random as the master node, possibly adjusted by a weight that makes preferred masters more or less likely to be chosen. Each instance monitors its local node; any unexpected occurrence raises an audit event, which is logged to the master node and optionally triggers an action to resolve the issue. The master node handles logging, for example to a logfile or to an IRC channel.
Because monitoring is distributed across the nodes, there is no single point of failure, and the crash of one node will not cause the entire system to fail. If the master node fails, a new master is elected.
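The weighted random election described above could be sketched as follows. This is only an illustration under stated assumptions: the function name `elect_master`, the node names, and the weight values are hypothetical, and the sketch does not cover how the nodes in the mesh would agree on the result.

```python
import random


def elect_master(weights, alive, rng=None):
    """Pick a master at random among live nodes, biased by a per-node weight.

    weights: dict of node name -> weight (hypothetical; missing nodes default to 1.0)
    alive:   set of node names currently reachable in the mesh
    """
    rng = rng or random.Random()
    # Nodes with a weight of zero or less are never eligible.
    candidates = sorted(n for n in alive if weights.get(n, 1.0) > 0)
    if not candidates:
        raise RuntimeError("no live node is eligible to become master")
    node_weights = [weights.get(n, 1.0) for n in candidates]
    return rng.choices(candidates, weights=node_weights, k=1)[0]


# Hypothetical nodes: srv1 is a preferred master, so it gets a higher weight.
weights = {"db1": 1.0, "db2": 1.0, "srv1": 3.0}
alive = {"db1", "db2", "srv1"}

master = elect_master(weights, alive)
# When the master fails, it is dropped from the live set and a new one is elected.
new_master = elect_master(weights, alive - {master})
```

Re-running the election over the remaining live nodes is what keeps the loss of the master from taking the whole system down.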
It would be possible to implement a system in GridMonitor that observes, for example, the typical DB load at certain times of day and raises an audit event if the load falls outside the usual range. This would allow it to adapt automatically to its environment without having to be reconfigured.
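One possible way to learn a "usual range" per hour of day is to keep a running mean and standard deviation for each hour and flag values that fall too far from the mean. The sketch below assumes this approach; the class name `HourlyBaseline`, the `tolerance` and `min_samples` parameters, and the sample load figures are all hypothetical and not part of GridMonitor.

```python
import math
from collections import defaultdict


class HourlyBaseline:
    """Learns typical values per hour of day and flags outliers."""

    def __init__(self, tolerance=3.0, min_samples=5):
        self.tolerance = tolerance               # allowed deviation, in standard deviations
        self.min_samples = max(2, min_samples)   # need at least 2 samples for a variance
        # hour -> [count, mean, sum of squared deviations] (Welford's algorithm)
        self._stats = defaultdict(lambda: [0, 0.0, 0.0])

    def observe(self, hour, value):
        """Fold one measurement into the running statistics for that hour."""
        stats = self._stats[hour]
        stats[0] += 1
        delta = value - stats[1]
        stats[1] += delta / stats[0]
        stats[2] += delta * (value - stats[1])

    def is_unusual(self, hour, value):
        """Return True if value is outside the learned range for that hour."""
        count, mean, m2 = self._stats[hour]
        if count < self.min_samples:
            return False  # not enough history yet, so do not raise an event
        stddev = math.sqrt(m2 / (count - 1))
        return abs(value - mean) > self.tolerance * max(stddev, 1e-9)


baseline = HourlyBaseline()
for load in [10, 12, 11, 9, 10, 11]:   # made-up DB load samples seen at 14:00
    baseline.observe(14, load)

if baseline.is_unusual(14, 50):
    print("audit event: DB load outside the usual range for 14:00")
```

Because the statistics are updated with every observation, the range follows the environment over time and no fixed threshold needs to be configured.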
Other options would include warnings about atypical login times or IP addresses for users, ...
New querybane format:
<querybane>
  <rules>
    <rule>
      <name>long-running sleeping threads</name>
      <servers>
        <allservers />
      </servers>
      <min-threads>1</min-threads>
      <min-last-threads>1</min-last-threads>
      <lowest-position>1</lowest-position>
      <users>
        <user>wikiuser</user>
      </users>
      <commands>
        <command>Sleep</command>
      </commands>
      <min-run-time>900</min-run-time>
      <query-types></query-types> <!-- empty string = any -->
    </rule>
  </rules>
</querybane>
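To illustrate how a rule in this format might be read and applied, the sketch below parses a trimmed copy of the rule with Python's standard `xml.etree.ElementTree` and filters a list of thread records. The interpretation is an assumption: it treats `<users>`, `<commands>` and `<min-run-time>` as filters on the user, command and running time of each thread, treats an empty list as "any", and ignores the other elements, whose meaning is not described here. The functions `load_rules` and `matching_threads` and the sample thread records (shaped like rows from MySQL's `SHOW PROCESSLIST`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the rule above, keeping only the elements this sketch interprets.
RULES_XML = """
<querybane>
  <rules>
    <rule>
      <name>long-running sleeping threads</name>
      <users><user>wikiuser</user></users>
      <commands><command>Sleep</command></commands>
      <min-run-time>900</min-run-time>
    </rule>
  </rules>
</querybane>
"""


def load_rules(xml_text):
    """Parse querybane XML into a list of plain dicts, one per <rule>."""
    rules = []
    for rule in ET.fromstring(xml_text.strip()).iter("rule"):
        rules.append({
            "name": rule.findtext("name", default=""),
            "users": {u.text for u in rule.iter("user")},
            "commands": {c.text for c in rule.iter("command")},
            "min_run_time": int(rule.findtext("min-run-time", default="0")),
        })
    return rules


def matching_threads(rule, threads):
    """Return the threads a rule applies to (assumed semantics: empty set = any)."""
    return [
        t for t in threads
        if (not rule["users"] or t["User"] in rule["users"])
        and (not rule["commands"] or t["Command"] in rule["commands"])
        and t["Time"] >= rule["min_run_time"]
    ]


# Made-up thread records for illustration only.
threads = [
    {"Id": 1, "User": "wikiuser", "Command": "Sleep", "Time": 1200},
    {"Id": 2, "User": "wikiuser", "Command": "Query", "Time": 2000},
    {"Id": 3, "User": "wikiuser", "Command": "Sleep", "Time": 30},
    {"Id": 4, "User": "root", "Command": "Sleep", "Time": 5000},
]

rule = load_rules(RULES_XML)[0]
for thread in matching_threads(rule, threads):
    print(rule["name"], "matched thread", thread["Id"])
```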