Revision as of 13:19, 7 April 2017 by imported>Volans (formatting)

Automation and orchestration framework written in Python


For a general description of Cumin's features, see

The TL;DR summary of Cumin's features, relevant to its usage inside WMF:

  • Select target hosts by hostname and/or by querying PuppetDB for any applied Puppet resource or fact. At the moment only one main resource per query can be specified, and mixed queries for resources and facts are not supported; see the DISCLAIMER below.
  • Execute any number of arbitrary commands via SSH on the selected target hosts in parallel, in an orchestrated way (see below), grouping the output for the hosts that produce the same output.
  • Can be used directly as a CLI or as a Python 2 library.
  • In the near future a higher-level tool will be developed that uses Cumin and other libraries to perform common automation and orchestration tasks inside WMF.

Host selection

When using the CLI, the --dry-run option is useful to check which hosts match the query without executing any command, although a command must still be specified on the command line. This requirement will be removed in a future release.

  • Match hosts by exact FQDN:
    • a single host with its FQDN, e.g. neodymium.eqiad.wmnet
    • a comma-separated list of FQDNs
  • Match hosts by FQDN with a simple globbing:
    • wdqs2* matches all the hosts whose hostname starts with wdqs2, hence all the Wikidata Query Service hosts in codfw. wdqs2*.codfw.wmnet is a more formal way to specify it.
    • wdqs2* or pc2* matches the same hosts as above, plus codfw's Parser Cache hosts.
  • Match hosts by hostname using the ClusterShell NodeSet syntax:
    • db[2016-2019,2023,2028-2029,2033].codfw.wmnet defines a specific list of hosts in a compact format.
  • Puppet Fact selection:
    • F:memorysize_mb ~ "^[2-3][0-9][0-9][0-9][0-9]" selects all the hosts that have between 20000MB and 39999MB of RAM as exported by facter.
    • F:lsbdistid = Ubuntu and analytics* selects all the hosts whose hostname starts with analytics and that run Ubuntu as their OS.
  • Puppet Resource selection. Any host reachable by Cumin includes the profile::cumin::target Puppet class, to which some variables and tags were added in order to expose to PuppetDB the datacenter, the cluster and all the roles applied to each host. See its usage in some of these examples:
    • R:File = /etc/ssl/localcerts/api.svc.eqiad.wmnet.chained.crt selects all the hosts in which Puppet manages this specific file resource
    • R:Class = Mediawiki::Nutcracker and *.eqiad.wmnet selects all the hosts that have the Puppet class Mediawiki::Nutcracker applied and a hostname ending in .eqiad.wmnet; a quick hack to restrict the selection to a single datacenter.
    • R:class = profile::cumin::target and R:class%cluster = cache_upload and R:class%site = codfw overcomes the above limitation, selecting all the hosts in the codfw datacenter that are part of the cache_upload cluster.
    • R:class = role::cache::misc or R:class = role::cache::maps selects all the hosts that have either the role cache::misc or the role cache::maps.
    • R:class = profile::cumin::target and R:class%site = codfw and (R:class@tag = role::cache::maps or R:class@tag = role::cache::misc) this syntax allows mixing a selection over roles with specific sites and clusters.
    • R:Class ~ "(?i)role::cache::(upload|maps)" and *.ulsfo.wmnet selects all the cache upload and maps hosts in ulsfo; the (?i) makes the query case-insensitive (our PuppetDB installation uses PostgreSQL as a backend and the regex syntax is backend-dependent), without having to uppercase the first letter of each class path component.
    • R:Class = Role::Mariadb::Groups and R:Class%mysql_group = core and R:Class%mysql_role = slave selects all the hosts that have the Role::Mariadb::Groups class applied with the parameter mysql_group set to core and the parameter mysql_role set to slave.
  • Special all-hosts matcher: * — ATTENTION: use extreme caution with this selector!
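
The ClusterShell NodeSet bracket syntax shown above can be approximated in plain Python. The sketch below is a simplified, illustrative re-implementation (the real expansion is done by the ClusterShell library, which Cumin uses as its transport); expand_nodeset is a hypothetical helper, not part of Cumin, and it supports only a single bracket group per pattern.

```python
import re

def expand_nodeset(pattern):
    """Expand a ClusterShell-style pattern like
    'db[2016-2019,2023].codfw.wmnet' into a list of hostnames.
    Simplified sketch: supports one bracket group per pattern."""
    m = re.search(r'\[([^\]]+)\]', pattern)
    if m is None:
        return [pattern]  # plain FQDN, nothing to expand
    prefix, suffix = pattern[:m.start()], pattern[m.end():]
    hosts = []
    for part in m.group(1).split(','):
        if '-' in part:
            lo, hi = part.split('-')
            width = len(lo)  # preserve any zero padding
            for n in range(int(lo), int(hi) + 1):
                hosts.append('%s%0*d%s' % (prefix, width, n, suffix))
        else:
            hosts.append(prefix + part + suffix)
    return hosts

print(expand_nodeset('db[2016-2018,2023].codfw.wmnet'))
# → ['db2016.codfw.wmnet', 'db2017.codfw.wmnet',
#    'db2018.codfw.wmnet', 'db2023.codfw.wmnet']
```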

Command execution

There are various options to control how the command execution is performed. Keep in mind that Cumin considers an executed command successful if it has an exit status code of 0, and a failure otherwise.
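
That success criterion can be illustrated with a plain subprocess call (this is just a sketch of the exit-status rule, not Cumin's actual transport):

```python
import subprocess

def run_ok(command):
    """Cumin-style success check: a command succeeded
    if and only if its exit status is 0."""
    return subprocess.call(command, shell=True) == 0

print(run_ok('true'))    # True: exit status 0
print(run_ok('exit 3'))  # False: any non-zero status is a failure
```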

  • Success threshold (default: 100%): consider the current parallel execution a failure only if the percentage of successful hosts falls below this threshold. Useful when running multiple commands and/or executing in batches. Take into account that during the execution of a single command, if no batches were specified, the command is executed on all the hosts and the success threshold is checked only at the end. By default Cumin expects 100% success: a single failure marks the whole execution as failed. The CLI option is -p 0-100, --success-percentage 0-100.
  • Execute in batches (default: no batches, no sleep): by default Cumin schedules the execution in parallel on all the selected hosts. It is possible to execute in batches instead. Cumin's batch execution uses a sliding window of size N with an optional sleep of S seconds between hosts, with this workflow:
    • It starts executing on the first batch of N hosts
    • As soon as one host finishes the execution, if the success threshold is still met, the execution is scheduled on the next host after S seconds.
    • At most N hosts will be executing the commands in parallel, and the success threshold is checked at each host completion.
    • The CLI options are -b BATCH_SIZE, --batch-size BATCH_SIZE and -s BATCH_SLEEP, --batch-sleep BATCH_SLEEP; their defaults are the number of target hosts for the size and 0 seconds for the sleep.
  • Mode of execution (no default): when executing multiple commands, Cumin requires a mode of execution to be specified. In the CLI there are two available modes: sync and async. In the library, in addition to those two, a custom mode can also be specified. The CLI option is -m {sync,async}, --mode {sync,async}.
    • sync execution:
      • execute the first command in parallel on all hosts, also considering the batch and success threshold parameters.
      • at the end of the execution, if the success threshold is met, start with the execution of the second command, and so on.
      • This ensures that the first command was executed successfully on all hosts before proceeding with the next. Typical usage is orchestrating changes across a cluster.
    • async execution:
      • execute all the commands in sequence on each host, independently of the other hosts, also considering the batch and success threshold parameters.
      • The execution on any given host is interrupted at the first command that fails.
      • It is roughly equivalent to executing a single command of the form command1 && command2 && ... && commandN.
  • Timeout (default: unlimited): an optional global timeout for the whole Cumin execution; by default there is no timeout. The CLI option is -t TIMEOUT, --timeout TIMEOUT.
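
The sliding-window batch workflow described above can be sketched as follows. This is an illustrative, sequential simulation, not Cumin's actual scheduler: there is no sleep, hosts run one at a time rather than in parallel, and run_on_host is a stand-in callback; only the window/threshold logic mirrors the described behavior.

```python
def run_in_batches(hosts, run_on_host, batch_size, success_pct=100):
    """Simulate Cumin's sliding-window batching: start with a batch
    of N hosts, and schedule the next pending host only while the
    success ratio among completed hosts meets the threshold."""
    done, ok = 0, 0
    window = list(hosts[:batch_size])   # first batch of N hosts
    pending = list(hosts[batch_size:])  # not yet scheduled
    while window:
        host = window.pop(0)
        ok += 1 if run_on_host(host) else 0
        done += 1
        # a host completed: schedule the next one only if the
        # success threshold still holds
        if pending and 100.0 * ok / done >= success_pct:
            window.append(pending.pop(0))
    return ok, done  # successes, hosts actually executed

# Toy run: the third host fails, so with the default 100% threshold
# no further hosts are scheduled; only in-flight ones complete.
hosts = ['mw%d' % i for i in range(1, 7)]
print(run_in_batches(hosts, lambda h: h != 'mw3', batch_size=2))
# → (3, 4): mw5 and mw6 were never executed
```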

WMF installation

Production infrastructure

In the WMF production infrastructure, Cumin masters are installed via Puppet's Role::Cumin::Master role, which is currently included in the Role::Cluster::Management role. Cumin can be executed on any of those hosts and requires sudo privileges or being root. Cumin can access as root any production host that includes the Profile::Cumin::Target profile (all production hosts as of now); it is therefore a very powerful but also potentially very dangerous tool: be very careful while using it. The current Cumin masters from which it can be executed are:

Cumin master hosts

The default Cumin backend is configured to be PuppetDB and the default transport ClusterShell (SSH). Cumin's ability to query PuppetDB as a backend allows selecting hosts in a very powerful and precise way, querying for any Puppet resource or fact, with some limitations that will be removed; see the DISCLAIMER above.

If running commands only on hosts in one of the datacenters where there is a Cumin master, consider running it from the local Cumin master to slightly speed up the execution.

Cumin CLI examples in the WMF infrastructure

  • Run Puppet on a set of hosts without getting the output, relying only on the exit code, one host at a time, sleeping 5 seconds between one host and the next, and proceeding to the next host only if the current one succeeded. Do not use puppet agent -t because it includes the --detailed-exitcodes option, which returns exit codes > 0 also in successful cases:
$ sudo cumin -b 1 -s 5 'wdqs2*' 'run-puppet-agent -q'
3 hosts will be targeted:
Confirm to continue [y/n]? y
===== NO OUTPUT =====
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [02:24<00:00, 46.03s/hosts]
FAIL |                                                                                                         |   0% (0/3) [02:24<?, ?hosts/s]
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'run-puppet-agent -q'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
  • To run Puppet and get the output use instead:
$ sudo cumin -b 1 -s 5 'wdqs2*' 'run-puppet-agent'
  • To disable Puppet in a consistent way, waiting for the completion of any in-flight Puppet runs:
$ sudo cumin 'wdqs2*' "disable-puppet 'Reason why was disabled - T12345 - ${USER}'"
  • To enable Puppet only on the hosts where it was disabled with the same message:
$ sudo cumin 'wdqs2*' "enable-puppet 'Reason why was disabled - T12345 - ${USER}'"
  • Verify if a systemd service is running in a cluster:
$ sudo cumin 'R:class = role::mediawiki::appserver::api and *.codfw.wmnet' 'systemctl is-active hhvm.service'
55 hosts will be targeted:
Confirm to continue [y/n]? y
===== NODE GROUP =====
(55) mw[2120-2147,2200-2223,2251-2253].codfw.wmnet
----- OUTPUT of 'systemctl is-active hhvm.service' -----
PASS |███████████████████████████████████████████████████████████████████████████████████████████████| 100% (55/55) [00:00<00:00, 148.77hosts/s]
FAIL |                                                                                                         |   0% (0/55) [00:00<?, ?hosts/s]
100.0% (55/55) success ratio (>= 100.0% threshold) for command: 'systemctl is-active hhvm.service'.
100.0% (55/55) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
  • Print a TLS certificate from all the hosts that have that specific Puppet-managed file, to ensure it is the same on all hosts and to verify its details. The expected output, in case all the hosts have the same certificate, is a single block with the certificate content, with the number and list of hosts that have it on top:
$ sudo cumin 'R:File = /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt' 'openssl x509 -in /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt -text -noout'
  • Ensure that the private key of a certificate matches the certificate itself on all the hosts that have a specific certificate. This can be done in two ways:
    • Using the async mode, only one line of output is expected: the matching MD5 for all the hosts, for both the certificate and the private key.
    • Using the sync mode, 2 lines of grouped output are expected, one for the first command and one for the second, leaving the user to match them.
$ sudo cumin -m async 'R:File = /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt' 'openssl pkey -pubout -in /etc/ssl/private/api.svc.codfw.wmnet.key | openssl md5' 'openssl x509 -pubkey -in /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt -noout | openssl md5'
55 hosts will be targeted:
Confirm to continue [y/n]? y
===== NODE GROUP =====
(110) mw[2120-2147,2200-2223,2251-2253].codfw.wmnet
----- OUTPUT -----
(stdin)= c51627f0b52a4dc70d693acdfdf4384a
PASS |████████████████████████████████████████████████████████████████████████████████████████████████| 100% (55/55) [00:00<00:00, 89.83hosts/s]
FAIL |                                                                                                         |   0% (0/55) [00:00<?, ?hosts/s]
100.0% (55/55) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
  • Check semi-sync replication status (number of connected clients) on all core mediawiki master databases:
$ sudo cumin 'R:Class = Role::Mariadb::Groups and R:Class%mysql_group = core and R:Class%mysql_role = master' "mysql --skip-ssl -e \"SHOW GLOBAL STATUS like 'Rpl_semi_sync_master_clients'\""