Analytics/Systems/Managing systemd timers
Revision as of 14:55, 5 July 2019
In phab:T172532 we decided to move away as much as possible from cron jobs since we were struggling to alarm when a job failed. Most of our recurrent jobs are running on an-coord1001 (analytics cluster coordinator), and the majority of them are systemd timers.
Note: Timers are different from cron jobs, notably in that they automatically restart failed jobs. If you need a timer to be stopped and executed later, first disable the timer (see the section below), then kill the currently running job, and finally re-enable the timer at will.
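The stop-and-resume procedure just described can be sketched as follows; name-of-the-timer is a placeholder, substitute the real unit name:

```shell
# Stop the timer and any in-flight job, then resume later.
# "name-of-the-timer" is a placeholder for the real unit name.
sudo systemctl disable name-of-the-timer.timer   # don't start again at boot
sudo systemctl stop name-of-the-timer.timer      # remove from scheduling
sudo systemctl kill name-of-the-timer.service    # terminate the running job, if any

# ... later, when it is safe to resume ...
sudo systemctl enable name-of-the-timer.timer
sudo systemctl start name-of-the-timer.timer
```

Remember that puppet must stay disabled while the timer is stopped (see the "Disable a Systemd timer" section).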
Anatomy of a Systemd timer
A systemd timer is composed of two parts: a service unit and the timer itself. Let's pick an example:
elukey@an-coord1001:~$ sudo systemctl cat hdfs-balancer.timer
# /lib/systemd/system/hdfs-balancer.timer
[Unit]
Description=Periodic execution of hdfs-balancer.service
[Timer]
Unit=hdfs-balancer.service
# Accuracy sets the maximum time interval around the execution time we want to allow
AccuracySec=15sec
OnCalendar=*-*-* 06:00:00
RandomizedDelaySec=0
[Install]
WantedBy=multi-user.target
elukey@an-coord1001:~$ sudo systemctl cat hdfs-balancer.service
# /lib/systemd/system/hdfs-balancer.service
[Unit]
Description=Run the HDFS balancer script to keep HDFS blocks replicated in the most redundant and efficient way.
[Service]
User=hdfs
ExecStart=/usr/local/bin/hdfs-balancer
In this case, we are checking how the HDFS balancer timer is configured. The .timer file references a systemd unit (the .service file), which is the one doing the work, and a pattern of execution (in this case, every day at 6 AM UTC). The .service file defines a command, together with other info like environment variables and the user that executes it. The command that .service points to can be generated by puppet, so peek into it if you need to see the details, for example: sudo -u hdfs cat /usr/local/bin/refinery-sqoop-mediawiki
We alarm on the return code of the .service systemd unit: if it gets executed and the script referenced by ExecStart returns a non-zero exit code, Nagios will alert us.
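When debugging such an alert by hand, the unit's last result and exit code can also be inspected directly with systemctl; hdfs-balancer is used here just as the running example:

```shell
# Show whether the last run succeeded and the main process' exit code.
systemctl show -p Result,ExecMainStatus hdfs-balancer.service

# Fuller picture: unit state, last run time, and recent log lines.
systemctl status hdfs-balancer.service
```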
All these configuration files may seem a bit confusing, but they are convenient from an operations point of view: a systemd .service can be started/restarted at any time, basically forcing the execution of its script. In this case:
sudo systemctl start hdfs-balancer.service
forces the hdfs-balancer to execute. Where do we find logs? At the moment the majority of them are only in journald, but they will be logged to files as well as soon as possible. If you want to check journald for a particular .service:
# start from the first log registered
sudo journalctl -u hdfs-balancer
# start from a specific moment of time
sudo journalctl -u hdfs-balancer --since '3 days ago'
# show only a specific timeframe for a unit
sudo journalctl -u hdfs-balancer --since "2018-10-21 23:15:00" --until "2018-10-21 23:20:00"
# tail -f equivalent
sudo journalctl -u hdfs-balancer --follow
An example of debugging for db1107: https://phabricator.wikimedia.org/T207165
Operations
List all timers running on the host
systemctl list-timers
Example of output from stat1006:
elukey@stat1006:~$ systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
Wed 2019-03-27 06:43:00 UTC 2min 54s left Wed 2019-03-27 06:28:00 UTC 12min ago prometheus-node-exporter-apt.timer prometheus-node-exporter-apt.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-cx.timer reportupdater-cx.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-edit-beta-features.timer reportupdater-edit-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-ee-beta-features.timer reportupdater-ee-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-ee.timer reportupdater-ee.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-flow-beta-features.timer reportupdater-flow-beta-features.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-flow.timer reportupdater-flow.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-language.timer reportupdater-language.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-mt_engines.timer reportupdater-mt_engines.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-page-creation.timer reportupdater-page-creation.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-pingback.timer reportupdater-pingback.service
Wed 2019-03-27 07:00:00 UTC 19min left Wed 2019-03-27 06:00:00 UTC 40min ago reportupdater-published_cx2_translations.timer reportupdater-published_cx2_translations.service
Wed 2019-03-27 07:09:00 UTC 28min left Wed 2019-03-27 06:39:00 UTC 1min 6s ago phpsessionclean.timer phpsessionclean.service
Wed 2019-03-27 07:11:26 UTC 31min left Tue 2019-03-26 07:11:26 UTC 23h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Wed 2019-03-27 07:49:31 UTC 1h 9min left Wed 2019-03-27 02:59:18 UTC 3h 40min ago apt-daily.timer apt-daily.service
Thu 2019-03-28 06:37:18 UTC 23h left Wed 2019-03-27 06:20:06 UTC 19min ago apt-daily-upgrade.timer apt-daily-upgrade.service
It is easy to see all the timings for a certain systemd timer and the units it runs. For example, the reportupdater-cx job has two units:

reportupdater-cx.timer -> basically the equivalent of the cron definition (you can inspect it via systemctl cat reportupdater-cx.timer and see its config)
reportupdater-cx.service -> the command to run, together with its environment etc., called/managed by the above .timer unit.
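On hosts with many timers, list-timers also accepts a shell-style pattern to restrict the output; for example, using the units above:

```shell
# Show only the reportupdater timers from the full listing.
systemctl list-timers 'reportupdater-*'
```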
Check which user runs a timer
Everything always starts with systemctl list-timers (as explained above) to find the .timer unit to check. Keeping the stat1006 example above, let's say we want to check which user runs reportupdater-cx.timer: we'll need to inspect the related .service unit, since that is the "real" command executed (the .timer one is only the cron-like definition, and we don't care who runs it):
elukey@stat1006:~$ systemctl cat reportupdater-cx
# /lib/systemd/system/reportupdater-cx.service
[Unit]
Description=Report Updater job for cx
[Service]
User=stats
SyslogIdentifier=reportupdater-cx
ExecStart=/usr/bin/python /srv/reportupdater/reportupdater/update_reports.py -l info /srv/reportupdater/jobs/limn-language-data/cx /srv/reportupdater/output/metrics/cx
You need to look for the User=something definition. In this case, the unit is run by stats.
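If you only want the value, grepping for the User= line works too. The snippet below is a self-contained sketch: it writes a copy of the unit to a temporary file as a stand-in for the real unit file, since on a real host you would grep the output of systemctl cat instead:

```shell
# Extract the User= setting from a (stand-in) service unit file.
cat > /tmp/example.service <<'EOF'
[Unit]
Description=Report Updater job for cx
[Service]
User=stats
EOF

# Print just the user name.
grep '^User=' /tmp/example.service | cut -d= -f2
# prints: stats
```

On a real host the equivalent is something like systemctl cat reportupdater-cx.service | grep '^User='.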
Disable a Systemd timer
Very important: to keep the timer disabled/stopped, puppet needs to be disabled for the time being, otherwise its next run will re-enable the timer.
sudo systemctl disable name-of-the-timer.timer
sudo systemctl stop name-of-the-timer.timer
The former prevents the timer from being started after boot, the latter removes it from the regular scheduling. After you have done the above steps, verify with the following command that the timer is no longer listed:
systemctl list-timers | grep name-of-the-timer
Finally don't forget to check if an icinga alarm was fired, and follow the next section accordingly if needed.
Shut off / Reset Icinga alert already fired
It might happen that an alert has fired for a timer but you just want to shut it off, because you have already taken care of the problem (for example, by executing a manual run). This is what needs to be done:
sudo systemctl reset-failed name-of-the-timer.service
It is important to reset the service unit, not the timer one, since it is only the former that causes the alarm (the timer unit's state is not important at this stage).
In the case of timers that run multiple commands, like refinery-sqoop-whole-mediawiki, see if you can finish running the commands manually, then reset the alert as above. There's nothing special about these timers that helps resume operation.
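To see which units on the host are currently in the failed state (and hence still feeding the alert), systemctl can list them:

```shell
# List all units currently in the "failed" state.
systemctl --failed

# After reset-failed, the unit should no longer appear here.
```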
Check logs for a systemd timer
The first step is always to use systemctl list-timers, as explained above, to gather the name of the timer to check. From the stat1006 example above, let's pick reportupdater-cx.timer. The logs are available by simply doing:
# Note that we are not using the .timer unit, but the .service one!
# The .service suffix is assumed if none is specified.
sudo journalctl -u reportupdater-cx
Usually for important jobs we also log to disk, since journald by default logs to a tmpfs partition (which stays in memory and gets flushed on each reboot). To see what the log configuration is, two things can be done:
- Check the puppet definition of the systemd timer, it will list the log directory.
- Check rsyslog.d configuration on the host itself.
Example for the former:
systemd::timer::job { "reportupdater-${title}":
ensure => $ensure,
description => "Report Updater job for ${title}",
command => "/usr/bin/python ${::reportupdater::path}/update_reports.py -l info ${query_path} ${output_path}",
interval => {
'start' => 'OnCalendar',
'interval' => $interval
},
user => $::reportupdater::user,
monitoring_enabled => $monitoring_enabled,
monitoring_contact_groups => 'analytics',
logging_enabled => true,
logfile_basedir => $::reportupdater::log_path, # <== the log directory
logfile_name => 'syslog.log',
logfile_owner => $::reportupdater::user,
logfile_group => $::reportupdater::user,
logfile_perms => 'all',
syslog_force_stop => true,
syslog_identifier => "reportupdater-${title}",
}
About the latter: why rsyslog, wasn't it journald? We have rsyslog hooked into journald so we can add specific rules to create log files. For example, let's check the reportupdater-cx job:
elukey@stat1006:~$ ls /etc/rsyslog.d/
10-exporter.conf 20-reportupdater-edit-beta-features.conf 20-reportupdater-flow-beta-features.conf 20-reportupdater-mt-engines.conf 20-reportupdater-published-cx2-translations.conf
10-puppet-agent.conf 20-reportupdater-ee-beta-features.conf 20-reportupdater-flow.conf 20-reportupdater-page-creation.conf 30-remote-syslog.conf
20-reportupdater-cx.conf 20-reportupdater-ee.conf 20-reportupdater-language.conf 20-reportupdater-pingback.conf README
elukey@stat1006:~$ cat /etc/rsyslog.d/20-reportupdater-cx.conf
# rsyslogd(8) configuration file for services.
# This file is managed by Puppet.
:programname, startswith, "reportupdater-cx" /srv/reportupdater/log/reportupdater-cx/syslog.log
& stop
elukey@stat1006:~$ ls -l /srv/reportupdater/log/reportupdater-cx/syslog.log
-rw-r--r-- 1 stats stats 647027 Mar 27 06:00 /srv/reportupdater/log/reportupdater-cx/syslog.log
In this case, the journald logs are also replicated to /srv/reportupdater/log/reportupdater-cx/syslog.log.
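To follow that file live, the on-disk equivalent of journalctl --follow is simply (path taken from the example above):

```shell
# Follow the on-disk copy of the reportupdater-cx logs.
tail -f /srv/reportupdater/log/reportupdater-cx/syslog.log
```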