Cumin
Automation and orchestration framework written in Python
Features
For a general description of Cumin's features, see https://github.com/wikimedia/cumin/blob/master/README.md
For a complete list of the latest changes, see https://github.com/wikimedia/cumin/blob/master/CHANGELOG.md
A TL;DR summary of the Cumin features relevant to its usage inside WMF:
- Select target hosts by hostname and/or by querying PuppetDB for any applied Puppet resource or fact. Only one main resource per PuppetDB query can be specified, and it is not possible to mix resource and fact selection in the same query. To overcome this limit it is possible to use the general grammar and combine multiple subqueries to achieve the same result (see the example at the end of the PuppetDB host selection section below).
- Execute any number of arbitrary commands via SSH on the selected target hosts in parallel, in an orchestrated way (see below), grouping the output for the hosts that produce the same output.
- Can be used directly as a CLI or as a Python 2 library (Python 3 support depends on the ClusterShell library, where it is a work in progress).
- In the near future a higher-level tool will be developed that will use Cumin and other libraries to perform common automation and orchestration tasks inside WMF.
Host selection
Our production configuration uses PuppetDB as the default backend, meaning that by default each host selection query is parsed as a PuppetDB query, and only if the parsing fails is it re-parsed with the general grammar. This allows using everyday queries without additional syntax, while leaving the full power of composing subqueries available when needed.
For WMCS, instead, the default configuration uses OpenStack as the default backend.
When using the CLI, the --dry-run option is useful to just check which hosts match the query without executing any command; if no commands are specified, this option is enabled automatically.
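For example, to check which hosts a query matches before running anything (a sketch; the wdqs2* pattern is just an illustrative target taken from the examples below):
$ sudo cumin --dry-run 'wdqs2*'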
PuppetDB host selection
- Match hosts by exact FQDN: einsteinium.wikimedia.org matches the host with that FQDN; einsteinium.wikimedia.org,neodymium.eqiad.wmnet matches a comma-separated list of FQDNs.
- Match hosts by FQDN with simple globbing: wdqs2* matches all the hosts with a hostname starting with wdqs2, hence all the Wikidata Query Service hosts in codfw; wdqs2*.codfw.wmnet is a more formal way to specify it. wdqs2* or pc2* matches the same as the above plus codfw's Parser Cache hosts; it is basically a set union.
- Match hosts by hostname using the ClusterShell NodeSet syntax: db[2016-2019,2023,2028-2029,2033].codfw.wmnet defines a specific list of hosts in a compact format. cp[2001-2026].codfw.wmnet and cp[2021-2026].codfw.wmnet matches only 6 hosts, cp[2021-2026].codfw.wmnet; it is basically a set intersection.
- Puppet Fact selection: F:memorysize_mb ~ "^[2-3][0-9][0-9][0-9][0-9]" selects all the hosts that have between 20000MB and 39999MB of RAM as exported by facter. F:lsbdistid = Ubuntu and analytics* selects all the hosts with a hostname starting with analytics that have Ubuntu as the OS.
- Puppet Resource selection. Any host reachable by Cumin includes the profile::cumin::target Puppet class, to which some variables and tags were added in order to expose to PuppetDB the datacenter, the cluster and all the roles applied to each host. See its usage in some of these examples:
- R:File = /etc/ssl/localcerts/api.svc.eqiad.wmnet.chained.crt selects all the hosts in which Puppet manages this specific file resource.
- R:Service::Node selects all the hosts that have the Service::Node resource included, as it works for custom-defined resources too.
- R:Class = Mediawiki::Nutcracker and *.eqiad.wmnet selects all the hosts that have the Puppet class Mediawiki::Nutcracker applied and a hostname ending in .eqiad.wmnet; that is a quick hack to select a single datacenter when there are no hosts of the .wikimedia.org type involved.
- R:class = profile::cumin::target and R:class%cluster = cache_upload and R:class%site = codfw allows overcoming the above limitation and selects all the hosts in the codfw datacenter that are part of the cache_upload cluster.
- R:class = role::cache::misc or R:class = role::cache::maps selects all the hosts that have either the role cache::misc or the role cache::maps.
- R:class = profile::cumin::target and R:class%site = codfw and (R:class@tag = role::cache::maps or R:class@tag = role::cache::misc) this syntax allows mixing a selection over roles with specific sites and clusters.
- R:Class ~ "(?i)role::cache::(upload|maps)" and *.ulsfo.wmnet selects all the cache upload and maps hosts in ulsfo; the (?i) allows performing the query in a case-insensitive way (our installation of PuppetDB uses PostgreSQL as a backend and the regex syntax is backend-dependent) without having to uppercase the first letter of each segment of the class path.
- R:Class = Role::Mariadb::Groups and R:Class%mysql_group = core and R:Class%mysql_role = slave selects all the hosts that have the Role::Mariadb::Groups class with the parameter mysql_group set to core and the parameter mysql_role set to slave.
- Special all hosts matching: * !!!ATTENTION: use extreme caution with this selector!!!
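As noted in the features summary, facts and resources cannot be mixed in a single PuppetDB query, but two PuppetDB subqueries can be combined with the general grammar. A sketch reusing the class and fact from the examples above:
$ sudo cumin --dry-run 'P{R:Class = Mediawiki::Nutcracker} and P{F:lsbdistid = Ubuntu}'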
OpenStack backend
- project:deployment-prep: selects all the hosts in the deployment-prep (a.k.a. beta) project.
- project:deployment-prep name:kafka: selects all the hosts in the deployment-prep project that have kafka in the name. OpenStack performs a regex search.
- project:deployment-prep name:"^deployment-kafka[0-9]+$": selects all the hosts in the deployment-prep project that match the regex.
- Additional key:value parameters can be added, separated by space, according to the OpenStack list-servers API.
- Special all hosts in all projects matching: * !!!ATTENTION: use extreme caution with this selector!!!
- To mix multiple selections the general grammar can be used, as in the example after this list: O{project:project1} or O{project:project2}
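For example, to run a command on the matching hosts from a WMCS Cumin master (a sketch; the project and command are illustrative):
$ sudo cumin 'O{project:deployment-prep name:kafka}' 'uptime'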
General grammar host selection
- Backend query: anything inside I{}, where I is the backend identifier, is treated as a subquery to be parsed and executed with the chosen backend to gather its results. The available backend identifiers are:
- P{}: PuppetDB backend
- O{}: OpenStack backend
- D{}: Direct backend
- Aliases: aliases are defined in /etc/cumin/aliases.yaml and the file is provisioned by Puppet. To use an alias in the query just use A:alias_name, where alias_name is the key in the aliases.yaml file. It will be replaced with its value before parsing the query. The alias replacement is recursive, to allow nesting aliases.
- Aggregation: the subqueries can be aggregated through the boolean operators and, or, and not, xor, and with parentheses () for maximum flexibility.
- Example: P{R:Class = Mediawiki::Nutcracker} and (D{host[10-20]} or A:alias_name)
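Such a query can be passed directly as the host selection argument of the CLI. A sketch (alias_name is a placeholder for an alias actually defined in /etc/cumin/aliases.yaml):
$ sudo cumin --dry-run 'P{R:Class = Mediawiki::Nutcracker} and (D{host[10-20]} or A:alias_name)'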
Command execution
There are various options that control how the command execution is performed. Keep in mind that by default Cumin considers a command successful if it exits with a status code of 0, and failed otherwise.
- Success threshold (default: 100%): consider the current parallel execution a failure only if the percentage of successful hosts is below this threshold. Useful when running multiple commands and/or executing in batches. Take into account that during the execution of a single command, if no batches were specified, the command is executed on all the hosts and the success threshold is checked only at the end. By default Cumin expects 100% success; a single failure marks the whole execution as failed. The CLI option is -p 0-100, --success-percentage 0-100.
- Execute in batches (default: no batches, no sleep): by default Cumin schedules the execution in parallel on all the selected hosts. It is possible to execute in batches instead. Cumin's batch execution mode uses a sliding window of size N with an optional sleep of S seconds between hosts, with this workflow:
- It starts executing on the first batch of N hosts.
- As soon as one host finishes the execution, if the success threshold is still met, it schedules the execution on the next host in S seconds.
- At most N hosts will be executing the commands in parallel, and the success threshold is checked at each host completion.
- The CLI options are -b BATCH_SIZE, --batch-size BATCH_SIZE and -s BATCH_SLEEP, --batch-sleep BATCH_SLEEP; their default values are the number of target hosts for the size and 0 seconds for the sleep.
- Mode of execution (no default): when executing multiple commands, Cumin requires a mode of execution to be specified. In the CLI there are two available modes: sync and async. In the library, in addition to those two modes, a custom one can also be specified. The CLI option is -m {sync,async}, --mode {sync,async} (a combined example follows this list).
- sync execution:
- execute the first command in parallel on all the hosts, also honoring the batch and success threshold parameters.
- at the end of the execution, if the success threshold is met, start with the execution of the second command, and so on.
- This ensures that the first command was executed successfully on all hosts before proceeding with the next one. Typical usage is when orchestrating changes across a cluster.
- async execution:
- execute all the commands in sequence on each host, independently from one host to the other, also honoring the batch and success threshold parameters.
- The execution on any given host is interrupted at the first command that fails.
- It is roughly equivalent to executing a single command of the form command1 && command2 && ... && commandN.
- Ignore exit codes: there are situations in which the exit status of an executed command is not important (like when debugging stuff with grep), and showing it as a failure just makes the output harder to read. In those cases the -x, --ignore-exit-codes option can be used, which assumes that every command executed was successful. !!!ATTENTION: use caution with this option!!!
- Timeout (default: unlimited): an optional timeout applied to the execution of each command on each host; by default Cumin doesn't time out. The CLI option is -t TIMEOUT, --timeout TIMEOUT.
- Global timeout (default: unlimited): an optional global timeout applied to the whole Cumin execution; by default Cumin doesn't time out. The CLI option is --global-timeout GLOBAL_TIMEOUT.
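Putting some of these options together, a rolling execution across a cluster could look like this sketch (the alias and the service name are illustrative placeholders):
# Run two commands in sync mode, 5 hosts at a time, sleeping 30 seconds
# between hosts and tolerating up to 5% of failed hosts.
$ sudo cumin -b 5 -s 30 -p 95 -m sync 'A:alias_name' 'run-puppet-agent -q' 'systemctl is-active nginx.service'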
Output handling
Cumin's output can be modified using the options below. At the moment all of them can be used only when a single command is executed. This limitation will be fixed in a future release.
- Formatted output: it's possible to tell Cumin to print the output of the executed commands in a more parsable way, using the -o {txt,json}, --output {txt,json} option. When using this option the separator _____FORMATTED_OUTPUT_____ is printed after the normal Cumin output, followed by the output of the executed commands in the desired format, for each host; the usual Cumin de-duplication of output does not apply to the formatted output. To extract just the formatted output, append 2> /dev/null | awk 'x==1 { print $0 } /_____FORMATTED_OUTPUT_____/ { x=1 }' to the Cumin command; this workaround will not be needed in a future release. If you want to keep the stderr output, just skip the /dev/null redirection (a complete example follows this list). The available formats are:
- txt: prepends ${HOSTNAME}: to each line of output for that host, keeping the existing newlines.
- json: prints a JSON dictionary where the keys are the hostnames and the values are strings with the whole output of each host.
- Interactive mode: if you want to manipulate the results with the power of Python, use the -i, --interactive option: Cumin will drop into a Python REPL session at the end of the execution, with direct access to Cumin's objects for further processing.
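For example, a complete invocation that extracts only the JSON-formatted output, combining the -o option with the awk filter described above (the query and command are illustrative):
$ sudo cumin -o json 'wdqs2*' 'uname -r' 2> /dev/null | awk 'x==1 { print $0 } /_____FORMATTED_OUTPUT_____/ { x=1 }'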
WMF installation
Production infrastructure
In the WMF production infrastructure, Cumin masters are installed via Puppet's Role::Cumin::Master role, which is currently included in the Role::Cluster::Management role. Cumin can be executed on any of those hosts and requires sudo privileges or being root. Cumin can access as root any production host that includes the Profile::Cumin::Target profile (all production hosts as of now); hence it is a very powerful but also potentially very dangerous tool, so be very careful while using it. The current Cumin masters from where it can be executed are:
| Cumin master hosts in production |
| --- |
| neodymium.eqiad.wmnet |
| sarin.codfw.wmnet |
The default Cumin backend is configured to be PuppetDB and the default transport is ClusterShell (SSH). Cumin's capability to query PuppetDB as a backend allows selecting hosts in a very powerful and precise way, querying for any Puppet resource or fact.
If running commands only on hosts in one of the DCs where there is a Cumin master, consider running it from the local Cumin master to slightly speed up the execution.
WMCS Cloud VPS infrastructure
In the WMCS infrastructure, Cumin masters are installed via Puppet's Profile::Openstack::Main::Cumin::Master profile, which is currently included in the Role::Labs::Puppetmaster::Frontend role. Cumin can be executed on any of those hosts and requires sudo privileges or being root. Cumin can access as root any Cloud VPS instance that includes the Profile::Openstack::Main::Cumin::Target profile (all Cloud VPS instances as of now); hence it is a very powerful but also potentially very dangerous tool, so be very careful while using it. The current Cumin masters from where it can be executed are:
| Cumin master hosts in WMCS Cloud VPS |
| --- |
| labpuppetmaster1001.wikimedia.org |
| labpuppetmaster1002.wikimedia.org |
Cumin CLI examples in the WMF production infrastructure
Run Puppet discarding the output
To run Puppet on a set of hosts without getting the output, relying just on the exit code, one host at a time, sleeping 5 seconds between one host and the next, and proceeding to the next host only if the current one succeeded. Do not use puppet agent -t, because that includes the --detailed-exitcodes option, which returns exit codes > 0 also in successful cases:
$ sudo cumin -b 1 -s 5 'wdqs2*' 'run-puppet-agent -q'
3 hosts will be targeted:
wdqs[2001-2003].codfw.wmnet
Confirm to continue [y/n]? y
===== NO OUTPUT =====
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [02:24<00:00, 46.03s/hosts]
FAIL | | 0% (0/3) [02:24<?, ?hosts/s]
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'run-puppet-agent -q'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Run Puppet keeping the output
$ sudo cumin -b 1 -s 5 'wdqs2*' 'run-puppet-agent'
Disable Puppet
To disable Puppet in a consistent way, waiting for the completion of any in-flight Puppet runs:
$ sudo cumin 'wdqs2*' "disable-puppet 'Reason why was disabled - T12345 - ${USER}'"
Enable Puppet
To enable Puppet only on the hosts where it was disabled with the same message:
$ sudo cumin 'wdqs2*' "enable-puppet 'Reason why was disabled - T12345 - ${USER}'"
Run Puppet only if last run failed
It might happen that a change merged in Puppet causes Puppet to fail on a number of hosts. Once the issue is fixed, without needing to wait for the next scheduled Puppet run, an easy way to quickly fix Puppet on all the failed hosts is to run the following command. It will exit immediately if the last Puppet run was successful, and run Puppet only on the hosts where it failed and is, of course, enabled. The -p 95 option takes into account that some hosts might be down/unreachable, without making Cumin fail. Remove the -q if you want to get the output, although it might be very verbose depending on the number of hosts that failed the last run:
sudo cumin -b 15 -p 95 '*' 'run-puppet-agent -q --failed-only'
Check if systemd service is running
$ sudo cumin 'P{R:class = role::mediawiki::appserver::api} and A:codfw' 'systemctl is-active hhvm.service'
55 hosts will be targeted:
mw[2120-2147,2200-2223,2251-2253].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(55) mw[2120-2147,2200-2223,2251-2253].codfw.wmnet
----- OUTPUT of 'systemctl is-active hhvm.service' -----
active
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████| 100% (55/55) [00:00<00:00, 148.77hosts/s]
FAIL | | 0% (0/55) [00:00<?, ?hosts/s]
100.0% (55/55) success ratio (>= 100.0% threshold) for command: 'systemctl is-active hhvm.service'.
100.0% (55/55) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Reboot
Given that Cumin uses SSH as a transport, just running reboot will most likely leave the connection hanging and it will not return properly to Cumin. To overcome this and issue a proper reboot through Cumin (or SSH in general, for that matter), use one of the following commands (a full rolling-reboot example follows):
# Issue a reboot detaching standard input, output and error in background and exiting with a 0 exit code.
'nohup reboot &> /dev/null & exit'
# Schedule a reboot in 1 minute from now.
'echo "reboot" | at -M now + 1 minute'
Check TLS certificate
Print a TLS certificate from all the hosts that have that specific Puppet-managed file, to ensure that it is the same on all hosts and to verify its details. The expected output, in case all the hosts have the same certificate, is a single block with the certificate content, with the number and list of the hosts that have it on top:
$ sudo cumin 'R:File = /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt' 'openssl x509 -in /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt -text -noout'
Check TLS private key
Ensuring that the private key of a certificate matches the certificate itself on all the hosts that have a specific certificate can be done in two ways:
- Using the async mode, only one line of output is expected: the matching MD5 for all the hosts, for both the certificate and the private key.
- Using the sync mode instead, 2 lines of grouped output are expected, one for the first command and one for the second one, leaving it to the user to match them.
$ sudo cumin -m async 'R:File = /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt' 'openssl pkey -pubout -in /etc/ssl/private/api.svc.codfw.wmnet.key | openssl md5' 'openssl x509 -pubkey -in /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt -noout | openssl md5'
55 hosts will be targeted:
mw[2120-2147,2200-2223,2251-2253].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(110) mw[2120-2147,2200-2223,2251-2253].codfw.wmnet
----- OUTPUT -----
(stdin)= c51627f0b52a4dc70d693acdfdf4384a
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████| 100% (55/55) [00:00<00:00, 89.83hosts/s]
FAIL | | 0% (0/55) [00:00<?, ?hosts/s]
100.0% (55/55) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Check MySQL semi-sync replication status
Check semi-sync replication status (number of connected clients) on all core mediawiki master databases:
$ sudo cumin 'R:Class = Role::Mariadb::Groups and R:Class%mysql_group = core and R:Class%mysql_role = master' "mysql --skip-ssl -e \"SHOW GLOBAL STATUS like 'Rpl_semi_sync_master_clients'\""
Troubleshooting Production issues
PuppetDB is down
When PuppetDB is not working for some reason (host down, software problems, etc.), Cumin will fail to match hosts based on compound expressions. The Direct backend will still work with the --backend direct option or using the global grammar syntax with D{}, but it might make sense to fall back to the secondary PuppetDB host. That is easily done in /etc/cumin/config.yaml: in the puppetdb section, amend:
host: nitrogen.eqiad.wmnet
to be:
host: nihal.codfw.wmnet
and check whether Cumin is working again.
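Until PuppetDB is back, hosts can still be targeted by name through the Direct backend. A sketch, with an illustrative NodeSet and command:
$ sudo cumin 'D{db[2016-2019,2023].codfw.wmnet}' 'uptime'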