From Wikitech-static
Revision as of 15:50, 17 February 2020 by imported>CDanis (→‎IRC notification: bring up to date with the past N years)

Icinga is host and service monitoring software that uses a binary daemon, some CGI scripts for the web interface, and binary plugins to check various things. Basically, it is automated testing of our site that screams and sends up alarms when it fails. It originated as a fork of the earlier project "Nagios", from which WMF transitioned in 2013.

It can be set to monitor services such as ssh, squid status, and the mysql socket, as well as the number of users logged in, load, and disk usage. There are two levels of alarms (warning, critical) and the notification system is fully customizable (groups of users, notified by email / irc / pager, stop notifying after x alarms...).
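As a rough illustration of the two alarm levels, the sketch below classifies a measured value against a warning and a critical threshold and maps the result to the exit codes that Nagios-compatible plugins conventionally return (0 for OK, 1 for WARNING, 2 for CRITICAL). The function name and the thresholds are made up for this example and are not taken from our configuration.

```python
# Conventional exit codes for Nagios-compatible check plugins.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Return the plugin exit code for a metric where higher is worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


# Hypothetical disk usage readings, in percent, with made-up thresholds.
for usage in (42, 85, 97):
    print(usage, classify(usage, warn=80, crit=95))
```

A real check would exit with the returned code and print a one-line status message for the web interface.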

Our installation can be found at which is currently an alias to machine icinga1001.

(April 2013) The rest of this page needs to be updated for Icinga.

Quick summary

  • To set downtime or acknowledge alerts you need to log in, which is done over HTTPS
  • Nagios configuration files are automatically generated by /home/w/conf/nagios/conf.php (on any host with NFS home mounted) and synced over to Spence.
  • MRTG was set up and will display Nagios usage data (useful to see if Nagios is actually doing what it is supposed to be doing)
  • Ganglia and Wikitech are loosely integrated with most hosts (G and W icons next to the host name respectively) and will display the Ganglia data of that host or its associated wikitech page if SPOF.
  • There is an icinga-wm bot in #wikimedia-operations that will echo whatever Icinga alerts on (see below)
  • Merlin (Module for Endless Redundancy and Load balancing In Nagios) was installed and configured but is not being used at this time.
  • Nagios has shortcuts in the side panel for most of Wikimedia's infrastructure monitoring
  • On Spence, the nagios install is located in /usr/local/nagios.
  • On Spence, the nagios config files are located in /etc/nagios (but you probably don't want to edit anything there)
  • On Spence, the HTTP interface is configured in /usr/local/nagios/share


On the server

Install package from source found at (both core and plugins packages are needed)

After installing, do this:

cp /home/wikipedia/conf/nagios/* /etc/nagios/
service nagios start

and you're away.

On each client


Debian

apt-get update
apt-get -y install nagios-nrpe-server nagios-plugins
scp fenari:/home/wikipedia/conf/nagios/nrpe-debian.cfg /etc/nagios/nrpe.cfg
invoke-rc.d nagios-nrpe-server restart


Solaris

pkgadd -d
pkgadd -d

The right answers are: all, yes, all, yes, yes.

mv /lib/svc/method/nagios-nrpe /lib/svc/method/nagios-nrpe.old
sed 's/nrpe.cfg -dn/nrpe.cfg -d/' /lib/svc/method/nagios-nrpe.old > /lib/svc/method/nagios-nrpe
chmod a+x /lib/svc/method/nagios-nrpe
scp fenari:/home/wikipedia/conf/nagios/nrpe-solaris.cfg /etc/opt/ts/nrpe/nrpe.cfg
scp fenari:/home/wikipedia/conf/nagios/check-zfs /opt/local/bin/
svcadm -v enable nrpe

If you're installing on a server with no internet access, you can use a local path to the pkg file instead.



  * Logo: yeah... had to put it somewhere to show our 'leetness, so it is in /usr/local/nagios/share/images on Spence
  * Theme: I prefer a black theme. This is controlled in the CSS in /usr/local/nagios/share/stylesheets
  * Links to other services: this is controlled by /usr/local/nagios/share/side.php


  * Merlin

Merlin is an addon for Nagios that provides ease of integration and redundancy across multiple Nagios instances. Usually, we will want to have a Nagios installation in each datacenter, and each instance should be able to talk to the others, share data, and act as a backup should one fail. This is in essence what Merlin offers. The interesting thing about Merlin is that it stores everything in a MySQL DB, from host config to statuses. This is a lot easier to parse than Nagios' own files, which is why it was installed in the first place. However, at this moment nothing is making use of Merlin, and it is just there 'in case'. Find more information about Merlin at

  * Ganglia / wikitech integration

I wrote a little perl script that parses Ganglia data from Spence (/var/lib/ganglia) and tries its best to match up Ganglia hostnames with Nagios host definitions. In most cases it will work as advertised. The same goes for Wikitech. Most servers don't have a Wikitech entry associated with them, but some do. Most legacy systems and SPOFs should have an entry.

This script is located on Nagios in /etc/nagios/ and runs automagically when sync is called.

Paging / Alert System Details

Our Icinga installation sends SMS pages via email to SMS gateways. The file listing each contact is in the 'nagios' directory in the private repo modules. Look for 'contacts.cfg'. All alerts are emailed to a

  • Most USA-based cellular carriers offer an email-to-SMS gateway. Use these if they are offered, as they tend not to charge for use.
    • T-Mobile =, Verizon =
  • International (outside the USA) cellular carriers will need to use the email address.
  • If the Icinga master changes, the login/sender details will need to be changed on the AQL Portal with user/pass in pwstore. Failure to change this will result in no alerts going out via AQL.
    • Their system is set up so that not just anyone can email via their gateway; it has to be set up for the host sending the alerts.


There are two ways to set up monitoring: using the old PHP script, and using Puppet.

PHP script

There's a configurator script for adding hosts, host groups, services and service groups at /home/wikipedia/conf/nagios/conf.php. Run it somewhere with PHP CLI installed, e.g. fenari. The configurator writes to a file called hosts.cfg in the current directory.

cd /home/wikipedia/conf/nagios

Most host groups (the ones in $hostGroups) are based on dsh node group files. This is preferred for maintainability reasons if such a node group exists; otherwise you can list miscellaneous hosts inline using $listedHosts. Some service groups (e.g. Apache and Squid) are just replicas of the host groups; others (such as Lucene and Memcached) are taken from the MediaWiki configuration. Services may also be listed inline using $listedServices, but again, this is not preferred.

Other configuration should be done by editing the *.cfg files on NFS and then copying to spence. Keeping two up-to-date copies like this protects us against failure of the monitoring host or NFS. (Note: the Sync command actually replicates every .cfg to Spence)

If nagios refuses to restart due to a configuration error, you can get more information by running this on the monitoring host (Spence):

nagios -v /etc/nagios/nagios.cfg

The error messages can be cryptic at times.


Puppet is being integrated with Nagios as well, in file manifests/nagios.pp. To monitor the availability of a host, simply define the following anywhere under its node definition (i.e. in site.pp or included classes):

monitor_host { $hostname: }

To monitor a service, e.g. SSH, use something like the following:

monitor_service { "ssh": description => "SSH status", check_command => "check_ssh" }

Custom Checks

Custom checks can be found scattered throughout the operations/puppet git repo. Large concentrations of them can be found in these paths:

Many custom checks send HTTP requests to check the health of services. It's important that such checks follow meta::User-Agent policy to identify their traffic, and so that we don't inadvertently block monitoring requests when we need to shut off / ratelimit harmful bot traffic. The User-Agent sent by Icinga checks should be wmf-icinga/<script_name> (
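As an illustration of the User-Agent convention, here is a minimal sketch of an HTTP health check written with Python's standard library. The script name check_example, the function names, and the choice to treat every failure as CRITICAL are assumptions made for this example; existing checks in operations/puppet may be structured quite differently.

```python
import urllib.error
import urllib.request


def build_request(url, script_name):
    """Build a request that identifies itself as an Icinga check."""
    return urllib.request.Request(
        url, headers={"User-Agent": f"wmf-icinga/{script_name}"}
    )


def check_http(url, script_name, timeout=10):
    """Return (exit_code, message) using Nagios plugin exit codes."""
    request = build_request(url, script_name)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 0, f"OK: HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return 2, f"CRITICAL: HTTP {err.code}"
    except OSError as err:
        # Covers URLError, connection failures and timeouts.
        return 2, f"CRITICAL: {err}"
```

A wrapper script would call check_http(url, "check_example"), print the message, and exit with the returned code.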


See Alerts (with notifications via Icinga).

To monitor a Grafana dashboard's alerts, use something like the following:

monitoring::grafana_alert { 'db/my-dashboard': contact_group => 'my-team' }


To add a user or update a password:

  1. Log in to Spence
  2. Run htpasswd /usr/local/nagios/etc/htpasswd.users <user>

IRC notification

Icinga appends messages to several different files /var/log/icinga/irc*.log, and ircecho (which runs as a systemd service) maps lines appended there to channels.

Sometimes the bot gets wedged; a systemctl restart ircecho will likely fix it.


Hostgroups are configured in the operations/puppet repository, in hieradata/common/monitoring.yaml

Acknowledgement logic

From Nagios Wiki (but this was just on Google Cache and the original site seemed gone, so pasted it here)

  • There is a difference between sticky and non-sticky acknowledgements
From Nagios 3.2.3.

Assuming you have a service with notifications enabled for all states with a max retry attempts of 1, these are the notifications you should get based on the following transitions:

  1. service in OK
  2. service goes into WARNING. Notification sent
  3. non-sticky acknowledgement applied
  4. service goes into CRITICAL. Acknowledgement removed. Notification sent
  5. non-sticky acknowledgement applied
  6. service goes into WARNING. Acknowledgement removed. Notification sent
  7. non-sticky acknowledgement applied
  8. service goes into CRITICAL. Acknowledgement removed. Notification sent
  9. service goes into OK. Recovery notification sent

This is the flow if sticky acknowledgements are used:

  1. service in OK
  2. service goes into WARNING. Notification sent
  3. sticky acknowledgement applied
  4. service goes into CRITICAL. No notification sent
  5. service goes into WARNING. No notification sent
  6. service goes into CRITICAL. No notification sent
  7. service goes into OK. Recovery notification sent
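The two flows above can be replayed with a small simulation. This is only a sketch of the behaviour as described on this page, not Nagios or Icinga code; the function name and the event encoding are invented for the example.

```python
def notifications(events, sticky):
    """Replay state changes and acknowledgements; return what would be sent.

    events is a list of "OK", "WARNING" or "CRITICAL" (state changes) and
    "ACK" (an acknowledgement of the current problem).
    """
    state, acked, sent = "OK", False, []
    for event in events:
        if event == "ACK":
            acked = state != "OK"  # only a problem state can be acknowledged
            continue
        if event == state:
            continue  # no state change, nothing to decide
        if event == "OK":
            acked = False  # recovery always clears the acknowledgement
            sent.append("OK: recovery notification sent")
        else:
            if not sticky:
                acked = False  # non-sticky acks are dropped on any state change
            verdict = "no notification sent" if acked else "notification sent"
            sent.append(f"{event}: {verdict}")
        state = event
    return sent


non_sticky = ["WARNING", "ACK", "CRITICAL", "ACK", "WARNING", "ACK", "CRITICAL", "OK"]
for line in notifications(non_sticky, sticky=False):
    print(line)
```

Calling the same function with sticky=True and a single "ACK" after the first "WARNING" reproduces the second flow, where only the first problem and the recovery produce notifications.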

Scheduling downtimes with a shell command

Modern approach centralized:

  • From one of the cluster management hosts (cumin[12]001 as of August 2019) run the sre.hosts.downtime cookbook. See sudo cookbook sre.hosts.downtime -h for all the related info.

Modern approach:

  • Check which is the icinga host (host
  • SSH to the Icinga host and run the command below as root or with sudo
  • /usr/local/bin/icinga-downtime -h short-hostname -r "why are you rebooting this host"
    This form of the command schedules a downtime of 2 hours.
  • Add -d num_seconds if you want to schedule a different length downtime
  • Output from the command will show you that icinga processes downtimes for the host (one log entry) and for all services on that host (second log entry).
  • There is no script for removing downtime, so choose a reasonable length.
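Since -d takes seconds, it is easy to get the length wrong when thinking in hours. The helper below is a small sketch that assembles the icinga-downtime command line from a duration in hours. It uses only the -h, -r and -d options mentioned above; the function name and the example reason are made up.

```python
import shlex


def downtime_argv(short_hostname, reason, hours=None):
    """Build the icinga-downtime argument list; hours=None keeps the default."""
    argv = ["/usr/local/bin/icinga-downtime", "-h", short_hostname, "-r", reason]
    if hours is not None:
        argv += ["-d", str(int(hours * 3600))]  # -d expects seconds
    return argv


# Print a shell-safe command line for a 4 hour downtime.
print(shlex.join(downtime_argv("cp4021", "kernel upgrade reboot", hours=4)))
```

The printed line can be pasted into a root shell on the Icinga host, or the list can be handed to subprocess.run there.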

Old approach:

Put multiple hosts into a scheduled downtime, starting now and lasting 3 days. The Nagios command file is a named pipe at /var/lib/nagios/rw/nagios.cmd. Example used on Labs Nagios:

now=$(date +%s); end=$((now + 259200))  # 259200 seconds = 3 days
for host in huggle-wa-w1 puppet-lucid turnkey-1 pad2 webserver-lcarr asher1 dumpster01 dumps-4 ; do
  printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;259200;Dzahn;down to save memory on virt3 having RAM issues\n' \
    "$now" "$host" "$now" "$end" > /var/lib/nagios/rw/nagios.cmd
done

After a few seconds you should see something like this in /var/log/icinga/icinga.log (on icinga1001)

[1332220596] HOST DOWNTIME ALERT: dumpster01;STARTED; Host has entered a period of scheduled downtime
Command Format:

quote: If the "fixed" argument is set to one (1), downtime will start and end at the times specified by the "start" and "end" arguments. Otherwise, downtime will begin between the "start" and "end" times and last for "duration" seconds. The "start" and "end" arguments are specified in time_t format (seconds since the UNIX epoch). The specified host downtime can be triggered by another downtime entry if the "trigger_id" is set to the ID of another scheduled downtime entry. Set the "trigger_id" argument to zero (0) if the downtime for the specified host should not be triggered by another downtime entry.
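To make the field order concrete, the sketch below builds one SCHEDULE_HOST_DOWNTIME line in the same order as the shell example on this page: host, start, end, fixed, trigger_id, duration, author, comment. The function name and its defaults are assumptions for illustration; it only returns the string and leaves writing it to the command pipe to the caller.

```python
import time


def schedule_host_downtime(host, duration, author, comment,
                           now=None, fixed=True, trigger_id=0):
    """Return one external command line scheduling downtime from now."""
    start = int(time.time()) if now is None else int(now)
    end = start + duration  # with fixed=1 the downtime runs from start to end
    return (
        f"[{start}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
        f"{int(fixed)};{trigger_id};{duration};{author};{comment}\n"
    )


# Three days of downtime for one host, using the current time as the start.
print(schedule_host_downtime("dumpster01", 259200, "Dzahn", "RAM issues"), end="")
```

Passing now explicitly keeps the output reproducible, which is handy when checking the start and end arithmetic before writing to the pipe.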

Removing downtimes with a shell command

All downtimes related to a host, including all its services, can be removed as follows. Note that the host variable is the name reported by hostname (not the FQDN returned with --fqdn). For example: cp4021.

echo -n "[$(date +'%s')] DEL_DOWNTIME_BY_HOST_NAME;$host" > /var/lib/icinga/rw/icinga.cmd
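Because the command expects the short hostname, passing an FQDN is an easy mistake to make. The sketch below strips any domain part before building the DEL_DOWNTIME_BY_HOST_NAME line. The function name and the example.org domain are hypothetical, and unlike the echo -n form above it terminates the line with a newline, which is an assumption about what the command pipe prefers.

```python
import time


def del_downtime_command(hostname, now=None):
    """Return the external command removing all downtimes for a host."""
    short = hostname.split(".", 1)[0]  # keep only the part before the first dot
    timestamp = int(time.time()) if now is None else int(now)
    return f"[{timestamp}] DEL_DOWNTIME_BY_HOST_NAME;{short}\n"


# Both spellings of the host produce the same command.
print(del_downtime_command("cp4021.example.org", now=0), end="")
print(del_downtime_command("cp4021", now=0), end="")
```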

Adding a new contact

If you want to add a new contact to Icinga, you either need root privileges on production servers or need to ask somebody who has them (members of the Operations team).

To request it from the Operations team, create a Phabricator ticket with the tags "Operations", "Monitoring", "Icinga" and describe the contact you want to be added.

You should state which notification method you want for this contact (email, IRC, SMS) and whether you want 24/7 notifications or specific time periods only. If you want a specific time period, you can pick (or add) one from ./modules/nagios_common/files/timeperiods.cfg in the public puppet repo.

Here is an example contact and other options you have.

Once you have your new contact you can use it as a member of a contactgroup in any manifest in the public puppet repo. For this part you no longer need access to the private repo; you can just upload changes to Gerrit and find somebody to merge them.

Disabling notifications programmatically

There are many scenarios in which we may want a role to run its puppet logic to fully or partially provision its configuration, but not create alerts. Examples of this are:

  • Hosts in the process of being installed, set up and not yet servicing real traffic (but need to run puppet before they are 100% ready)
  • Decommissioned hosts
  • Hosts with hardware problems for an extended period of time, and for which manual downtime would be inappropriate.
  • Canary hosts (like cp1008 aka pink unicorn)
  • Spare systems that are running but are not really doing anything (aka spare::system role)

For discussions about this topic, see task T151632

To disable notifications on a host, set profile::base::notifications in Hiera to disabled (it defaults to enabled). This is intended as a temporary measure (even if extended in time); if no check should exist when the server is in full production, just do not add it in the first place or change its LEVEL.

Failover Icinga between the active and passive servers

[As of Jan. 2018] Icinga is currently installed in an active/passive configuration on (eqiad, usually active) and (codfw, usually passive). Use to check which one is the active one at any given time with: dig +short

To failover between the two servers, follow these steps:

  • Prepare a Puppet patch similar to
  • Prepare a DNS patch similar to Confirm the TTL is low (5M) to allow for quick reverting if needed.
  • If a new server name, double check that the email sender address is whitelisted in the AQL portal. That is the Mail2SMS service we use to turn email notifications into pages. Find the credentials in the "aql" file in pwstore.
  • Announce the failover on IRC some time in advance, and plan to avoid SWAT, Puppet SWAT and other ongoing maintenance or outages. Also avoid the time when root's crontab on the passive host syncs Icinga state files (currently at minute 33 of each hour).
  • Check that on the passive host the NSCA process doesn't have a tremendous number of subprocesses. As a precaution, stop it, verify that all child processes were killed (or killall them) and start it again.
  • Log on #wikimedia-operations the start of the failover
  • Disable Puppet on both hosts: sudo cumin 'A:icinga' "disable-puppet 'Failover Icinga - $USER'"
  • Merge, submit and puppet-merge the Puppet patch
  • Enable and run Puppet on the previously active server to make it passive, check for errors: sudo run-puppet-agent -e "Failover Icinga - $USER"
  • On the previously passive server, run the script to sync Icinga state files: sudo sync_icinga_state
  • Merge and submit the DNS patch and deploy the change, see DNS#authdns-update or ask in #wikimedia-traffic
  • On the previously passive server, ensure that the DNS is updated, if not wait for the TTL to expire: dig +short
  • Enable and run Puppet on the previously passive server to make it active, check for errors: sudo run-puppet-agent -e "Failover Icinga - $USER"
  • Check that is running properly. If you see a red message "Notifications are disabled", your browser is most likely still pointing to the old active server. Flush the DNS cache or wait for the TTL to expire.
  • Log on #wikimedia-operations the end of the failover

Check validity of Icinga's config

sudo /usr/sbin/icinga -v /etc/icinga/icinga.cfg

Meta-monitoring of Icinga itself

We're currently externally monitoring Icinga with a custom script. For the details see Wikitech-static#Meta-monitoring.


To avoid alerts from external meta-monitoring, meta-monitoring should be disabled on Wikitech-static before restarting normally with systemctl. Details on how to disable meta-monitoring can be found here: Service_restarts#Icinga

IRC bot

How to add some but not all notifications to a specific IRC channel.

The class used is profile::icinga::ircbot which uses ::ircecho and is included in profile::icinga. Server, nickname and port are configured in Hiera in hieradata/role/common/alerting_host.yaml. The tcpircbot class is unrelated though also included on the Icinga server.

  • Create 2 custom notification commands (modules/nagios_common/templates/notification_commands.cfg.erb), notify-service-by-irc-<YOUR CHANNEL> and notify-host-by-irc-<YOUR CHANNEL>: one for services and one for hosts. Copy the command line from existing "by-irc" commands but make sure the output gets appended to a new log file, /irc-<YOUR CHANNEL>.log
  • Create a new Icinga contact (private repo, modules/secret/secrets/nagios/contacts.cfg), irc-<YOUR CHANNEL>. Copy an existing "irc-" contact but adjust host_notification_command and service_notification_command to use your new commands (and logfile).
  • Create a new Icinga contactgroup (public repo, modules/nagios_common/files/contactgroups.cfg) for your team (e.g. datacenter ops) and add the special contact you created to it (and optionally the human members of your group so they get notified too). Check that the Icinga config is ok after running puppet (icinga -v /etc/icinga/icinga.cfg)
  • In puppet, identify the monitoring::service / nrpe::monitor_service classes that should notify this channel (or make new ones) and add the new contactgroup as a parameter to them.
  • On the Icinga server go to /var/log/icinga/ and check if the new logfile has been created (you may have to touch it manually the very first time) and alerts get logged to it
  • Configure ircecho (modules/profile/manifests/icinga/ircbot.pp) to map the logfile to your IRC channel ($ircecho_logs, see existing examples) (restart ircecho?)