Monitoring is a multi-faceted topic covering testing and collecting metrics to audit the availability and performance of networks, servers, and clustered applications, as well as processing collected data for fault detection, notification, graphing, capacity planning, and analytics.
- Operations: to discover and diagnose problems or attacks/compromises, and for capacity planning
- Developers: for debugging, for discovery of problems with features/systems
- Other departments: to track business metrics
- Users: to check status of outages/components
- Availability monitoring
- Performance monitoring
- Business metric analytics
- Security incident detection
- Security auditing
- Easy to add metrics
- Easy for non-ops to make alerts and graphs
- Detect aberrant behavior without defining static thresholds for each metric, using Holt-Winters forecasting or similar
- Clustering strategy for capacity, HA, and multi-datacenter support
- Consistent and consolidated metrics collection and storage: one agent, one storage engine
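As a rough illustration of threshold-free aberrant behavior detection, the sketch below flags samples that fall outside a band around an exponentially weighted moving average. This is a deliberately simpler stand-in for Holt-Winters (it has no trend or seasonal component), and the class name, parameter defaults, and sample values are all assumptions made for the example.

```python
import math


class EwmaAnomalyDetector:
    """Flag values outside a band around an exponentially weighted
    moving average. Unlike Holt-Winters, there is no seasonal term."""

    def __init__(self, alpha=0.3, band=3.0, warmup=5):
        self.alpha = alpha    # smoothing factor, 0 < alpha <= 1 (assumed default)
        self.band = band      # allowed deviation, in standard deviations
        self.warmup = warmup  # samples to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def observe(self, value):
        """Return True if value is aberrant, then fold it into the model."""
        if self.mean is None:
            self.mean = float(value)
            self.seen = 1
            return False
        deviation = value - self.mean
        limit = self.band * math.sqrt(self.var)
        aberrant = self.seen >= self.warmup and abs(deviation) > limit
        # Incremental updates of the weighted mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.seen += 1
        return aberrant


# Hypothetical request-rate samples with one obvious spike at the end.
samples = [100, 102, 98, 101, 99, 100, 101, 99, 100, 500]
detector = EwmaAnomalyDetector()
flagged = [i for i, v in enumerate(samples) if detector.observe(v)]
print(flagged)  # -> [9]
```

No per-metric threshold is configured here; the band adapts to whatever level and variability each metric shows. A metric with daily or weekly cycles would still need a seasonal model such as Holt-Winters to avoid false alarms.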
Monitoring is both art and science. The art of monitoring involves making value judgements. Qualified users should be able to perform these tasks without making changes to Puppet:
- define computed metrics
- define dependencies
- define graphs and dashboards
- define alert conditions
The science of monitoring should be handled by software that is managed by Puppet and agnostic to the metrics themselves:
- collect the data
- transport the data
- store the data
- generate events based on the data
- generate alerts based on conditions
- generate notifications based on dependencies
- display the data: provide UI, draw graphs
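The distinction between alerts (based on conditions) and notifications (based on dependencies) can be sketched as follows. This is a minimal illustration, assuming alerts are a set of target names and dependencies are a simple mapping; the function name and the host names are hypothetical, and real state engines model this in their own ways.

```python
def notifications(alerts, depends_on):
    """Return the alerting targets that should actually notify someone.

    alerts:     set of target names whose alert condition is currently true
    depends_on: dict mapping a target to the targets it depends on

    A target is silenced when anything it depends on, directly or
    transitively, is itself alerting.
    """
    def has_failed_dependency(target, seen):
        for parent in depends_on.get(target, ()):
            if parent in seen:
                continue  # guard against dependency cycles
            seen.add(parent)
            if parent in alerts or has_failed_dependency(parent, seen):
                return True
        return False

    return sorted(t for t in alerts if not has_failed_dependency(t, {t}))


# Hypothetical topology: web1 needs db1, which needs switch-a.
depends_on = {"web1": ["db1"], "db1": ["switch-a"], "web2": ["switch-b"]}
alerts = {"web1", "db1", "switch-a", "web2"}
print(notifications(alerts, depends_on))  # -> ['switch-a', 'web2']
```

Four alert conditions are true, but only two notifications go out: the switch failure explains the alerts on db1 and web1, while web2 is alerting with a healthy upstream and so is reported on its own.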
Contemporary design of a large-scale monitoring infrastructure divides the task among several subcomponents:
- Agent / sensor / collector: A daemon or service normally run on each node to be monitored, periodically collecting metrics from the kernel and application-specific plugins and forwarding them to a broker or directly to a storage engine. These should replace Gmond, NSCA, NRPE.
- Collectd (C) and Diamond (Python) are both popular packages which deliver similar sets of metrics.
- Aggregator / broker: Routes all locally collected metrics to a queue, event processor, or storage engine. May batch or summarize the data.
- Statsd: Summarizes data (sum, avg) to minimize network load, which may or may not be acceptable. Can cache metrics in case of network failure.
- Collectd: Native networking supports multicast, crypto, and proxying. Accepts statsd-format metrics via an input plugin. Caching is optional with AMQP output.
- RabbitMQ: Collectd, Diamond, and Statsd can all output via AMQP to RabbitMQ, decoupling metric submission from metric processing.
- Kafka: Already in use for Analytics, offers log replay. Perhaps this service should be used for all metrics. Would require writing plugins for Statsd or Collectd.
- Storage engine: RRD files, the popular metric storage of the last decade, are showing their age. Two challengers have appeared:
- Carbon/Whisper: Graphite's time series database service and storage format
- OpenTSDB: An open source implementation similar to Google's Borgmon storage engine, runs atop HBase.
- State engine / event processor: In addition to narrowly focused event processing packages, this functionality together with a poller forms the core of many monolithic monitoring packages. Events can be processed as they are received by the storage engine, or asynchronously by examining stored metrics:
- Icinga: Can actively poll services, indirectly poll via check_carbon, or operate asynchronously via NSCA, event broker plugins, etc. Clustering design contains a SPOF called the Central Monitoring Server.
- Icinga + Mod-Gearman: Gearman implements a custom agent and broker with the goal of accelerating active checks. Unclear if polling is delegated to workers. Retains central server SPOF.
- Riemann: Purpose-built event processor. Able to listen to events sent over Carbon's plaintext protocol (precisely how does this work?), as well as record events to Graphite. Documentation on scaling is not encouraging.
- Sensu: A custom agent plus "monitoring router", which uses RabbitMQ and Redis to avoid SPOFs. Running more than one server is supported and recommended.
- Shinken: Modular rewrite of Nagios in Python which seems to have no SPOFs.
- Notifier: Icinga, Shinken, and Sensu all have built-in notification systems based on email gateways. The leading alternative is PagerDuty, a paid service that offers an API and an SLA.
- Visualizer: If all monitoring metrics are stored in Carbon, adding dashboards is easy. Finding one which presents a structured way to drill down from cluster to host to metric in a manner similar to Ganglia has been a bit more challenging.
- Grafana: Aims to present data from Carbon similarly to Kibana's presentation of Logstash data. Dashboards may be defined interactively as well as via Puppet file templates, to provide customized views as well as per-cluster views.
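To make the hand-off between agent, aggregator, and storage engine more concrete, the sketch below formats metrics in the StatsD line format and in Carbon's plaintext protocol, and shows the kind of sum/avg summarization an aggregator performs before forwarding. It is an illustration only: the function names, metric paths, and sample values are assumptions, and the ports shown are the conventional defaults (8125/UDP for StatsD, 2003/TCP for Carbon plaintext), which a given deployment may change.

```python
import socket
from collections import defaultdict


def statsd_line(name, value, kind="c"):
    """Format one StatsD datagram: c = counter, g = gauge, ms = timer."""
    return f"{name}:{value}|{kind}"


def carbon_line(path, value, timestamp):
    """Format one line of Carbon's plaintext protocol: path value timestamp."""
    return f"{path} {value} {int(timestamp)}\n"


def summarize(samples):
    """Reduce raw (name, value) samples to a sum and average per name,
    roughly what a StatsD-style aggregator does at each flush."""
    grouped = defaultdict(list)
    for name, value in samples:
        grouped[name].append(value)
    return {n: {"sum": sum(v), "avg": sum(v) / len(v)} for n, v in grouped.items()}


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a StatsD daemon (assumed local)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def send_carbon(lines, host="127.0.0.1", port=2003):
    """Send plaintext lines to a Carbon listener over TCP (assumed local)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical samples collected during one flush interval.
samples = [("web.requests", 3), ("web.requests", 5), ("web.latency_ms", 120)]
summary = summarize(samples)
lines = [carbon_line(f"servers.web1.{name}.sum", stats["sum"], 1700000000)
         for name, stats in summary.items()]
print(statsd_line("web.requests", 1))
print("".join(lines), end="")
```

The summarization step is where detail is lost: three samples become two stored points, which reduces network and storage load but discards the individual values. Whether that trade-off is acceptable depends on the metric.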
Targets may be described in terms of layers, and sources within those layers.
- Linux kernel
- System daemons (ntpd, puppet, sshd, ...)
- Applications (Apache, Kafka, ...)
- Cluster service:
- External Storage
A separate article has been created to identify all targets: Monitoring sources
A separate article has been created to list the dozens of relevant tools discussed or in use: Monitoring package survey
- Routers and PDUs are monitored via SNMP by LibreNMS and Torrus
- Routers syslog to LibreNMS
- Network latency is measured by Smokeping
- Router config changes are watched by RANCID
- External reachability monitored by Nimsoft Cloud Monitor
- Physical info stored in Netbox, manually input
- Icinga for availability monitoring via active checks as well as NRPE and NSCA on select hosts
- Syslog goes to Logstash in PMTPA, with no alerting
- Cluster services:
- Database dashboard: dbtree
- Performance dashboards: Grafana, Graphite, Gdash
- Webrequest analytics infrastructure: Kraken (limited deployment)
- Which apps deliver stats to Carbon? Which apps use jmxtrans? Where is sqstat used? Do they all communicate via Statsd?
- Which apps deliver logs to Logstash?
- Anomaly detection
- Store data in Carbon/Whisper rather than RRD wherever possible
- Send all logs to Logstash
- Consolidate all logs, generate alerts
- Clients may submit metrics without pre-configuration of monitoring server to accept them
- IDS/IPS, send stats to Carbon, generate alerts
- Eliminate Ganglia and NRPE (make redundant with Graphite et al.)
- Eliminate Torrus (appears redundant with LibreNMS)
- Use a broker to decouple metric submission from processing and storage
- Agents or broker cache and resubmit data in case of network outage
- Is it worth the effort to dump historical data from RRDs to import into Carbon?
- Virtualization-aware monitoring to eliminate separate Icinga instance for Labs?
- Anomaly detection: Skyline, Oculus
- Performance dashboard: Grafana
- IDS/IPS: Fail2ban
- Logging: all logs delivered to Logstash, use Graphite output to send stats to Carbon
- State engine: Icinga + Mod-Gearman, Sensu, or Shinken
- Stats collecting agent: Collectd, or Statsd + Diamond
- Network monitoring: Dbeacon
- Chase's presentation on all-python monitoring using Diamond, Pystatsd, and Carbon
- The State of Open Source Monitoring - Jason Dixon
- "#monitoringsucks" - A grassroots campaign for better monitoring tools, distilled in these blog posts
- "Counter-rant" response to #monitoringsucks - Dave Josephsen
- Why Monitoring Sucks -- For Now - Cliff Moon
- "Monalytics: Online Monitoring and Analytics for Managing Large Scale Data Centers"