'''<big>Automation and orchestration framework written in Python</big>'''


{{TOCright}}
==Resources==


*For a general description of Cumin's features, see the [https://doc.wikimedia.org/cumin/master/introduction.html documentation introduction].
*For a complete list of the latest changes, see the [https://doc.wikimedia.org/cumin/master/release.html release notes] page.
*For an introductory video see the talk made at FOSDEM 2018: [https://fosdem.org/2018/schedule/event/cumin_automation/ Cumin: Flexible and Reliable Automation for the Fleet].
==Features==
The '''TL;DR''' quick summary of Cumin features relevant to usage inside WMF:
*'''Select''' target hosts in the production environment by hostname and/or by querying PuppetDB for any applied Puppet Resource or Fact. Only one main Resource per PuppetDB query can be specified, and it is not possible to mix resource and fact selection in the same query. To overcome this limit it is possible to use the general grammar and combine multiple subqueries to achieve the same result. For the [[Portal:Cloud_VPS|WMCS Cloud VPS]] environment the OpenStack API is used. Please note that when using multiple host selection queries from the CLI the entire set of queries should be enclosed in quotes, as shown in the [[Cumin#Cumin CLI examples in the WMF production infrastructure|examples below]].
*'''Execute''' any number of arbitrary commands via SSH on the selected target hosts in parallel, in an orchestrated way (see below), grouping the output for the hosts that produce the same output.
*Can be used directly as a '''CLI''' or as a '''Python 3 library.'''
*A higher-level tool that performs common automation and orchestration tasks inside WMF, and that also exposes Cumin as a library, is available: see [[Spicerack]].
 
==Host selection==
Our '''production''' configuration uses PuppetDB as the default backend, meaning that by default each host selection query is parsed as a PuppetDB query, and only if the parsing fails is it re-parsed with the general grammar. This allows everyday queries to be written without additional syntax, while leaving the full power of composing subqueries available when needed.


For WMCS instead the default configuration uses OpenStack as the default backend.

When using the CLI, the <code>--dry-run</code> option is useful to check which hosts match the query without executing any command; if no commands are specified this option is enabled automatically.


===PuppetDB host selection===
Please note that when using multiple host selection queries from the CLI the entire set of queries should be enclosed in quotes, as shown in [[Cumin#Cumin CLI examples in the WMF production infrastructure|the examples below]].

*Match hosts by exact '''FQDN''':
**<code>einsteinium.wikimedia.org</code> with its FQDN
**<code>einsteinium.wikimedia.org,neodymium.eqiad.wmnet</code> comma-separated list of FQDNs
*Match hosts '''by FQDN''' with a simple '''globbing''':
**<code>'''wdqs2*'''</code> matches all the hosts with hostname starting with <code>wdqs2</code>, hence all the Wikidata Query Service hosts in codfw. <code>wdqs2*.codfw.wmnet</code> is a more formal way to specify it.
**<code>'''wdqs2* or pc2*'''</code> matches the same as above plus codfw's Parser Cache hosts; it's basically a set union.
*Match hosts '''by hostname''' using the '''[http://clustershell.readthedocs.io/en/latest/api/NodeSet.html#ClusterShell.NodeSet.NodeSet ClusterShell NodeSet] syntax''':
**<code>'''db[2016-2019,2023,2028-2029,2033].codfw.wmnet'''</code> defines a specific list of hosts in a compact format.
**<code>'''cp[2001-2026].codfw.wmnet and cp[2021-2026].codfw.wmnet'''</code> matches only 6 hosts, <code>cp[2021-2026].codfw.wmnet</code>; it's basically a set intersection. For example, with A=[2001-2026] and B=[2021-2026], the elements of A that are also in B are 2021 through 2026: the intersection of A and B.
*'''Puppet Fact''' selection:
**<code>'''F:memorysize_mb >= 24000 and F:memorysize_mb <= 64000'''</code> selects all the hosts that have between 24000MB and 64000MB of RAM as exported by facter.
**<code>'''F:filesystems ~ xfs'''</code> selects all the hosts that have an XFS filesystem, while using <code>'''~ "^xfs$"'''</code> would be equivalent to <code>'''= xfs'''</code>.
**<code>'''F:lsbdistid = Debian and analytics*'''</code> selects all the hosts with hostname starting with <code>analytics</code> that have Debian as OS.
**<code>'''F:virtual = physical'''</code> selects all physical hosts, whereas <code>'''* and not F:virtual = physical'''</code> does the opposite.
*'''Puppet Resource''' selection. Any host reachable by Cumin includes the <code>profile::cumin::target</code> Puppet class, to which some variables and tags were added in order to expose to PuppetDB the datacenter, the cluster and all the roles applied to each host. See its usage in some of these examples:
**<code>'''R:File = /etc/ssl/localcerts/api.svc.eqiad.wmnet.chained.crt'''</code> selects all the hosts in which Puppet manages this specific file resource.
**<code>'''R:Service::Node'''</code> selects all the hosts that have the <code>Service::Node</code> resource included, as it works for custom-defined resources too.
**<code>'''R:Class = Nginx'''</code> selects all the hosts that have the Puppet Class <code>Nginx</code> applied.
**<code>'''R:Class = mediawiki::web::prod_sites'''</code> selects all the hosts that have the Puppet Class <code>mediawiki::web::prod_sites</code> applied.
**<code>'''C:nginx and *.eqiad.wmnet'''</code> uses the Class shortcut and selects all the hosts that have the Puppet Class <code>Nginx</code> applied and the hostname ending in <code>.eqiad.wmnet</code>; this is a quick hack to select a single datacenter if there are no hosts of the type <code>.wikimedia.org</code> involved.
**<code>'''P:cumin::target%cluster = cache_upload and R:class%site = codfw'''</code> overcomes the above limitation and selects all the hosts in the <code>codfw</code> datacenter that are part of the <code>cache_upload</code> cluster, using the shortcut for profiles <code>P:</code>.
**<code>'''O:cache::upload or O:cache::text'''</code> selects all the hosts that have either the role <code>cache::upload</code> or the role <code>cache::text</code>, using the shortcut for roles <code>O:</code>.
**<code>'''P:cumin::target%site = codfw and (R:class@tag = role::cache::text or R:class@tag = role::cache::upload)'''</code> this syntax allows mixing a selection over roles with specific sites and clusters.
**<code>'''R:Class ~ "(?i)role::cache::(upload|text)" and *.ulsfo.wmnet'''</code> selects all the cache upload and text hosts in ulsfo; the <code>(?i)</code> allows performing the query in a case-insensitive mode (our implementation of PuppetDB uses [[PostgreSQL]] as a backend and the regex syntax is backend-dependent) without having to uppercase the first letter of each class path.
**<code>'''O:Mariadb::Groups%mysql_group = core and R:Class%mysql_role = slave'''</code> selects all the hosts that have the <code>Role::Mariadb::Groups</code> class with the parameter <code>mysql_group</code> set to <code>core</code> and the parameter <code>mysql_role</code> set to <code>slave</code>. Currently you cannot filter based on boolean parameters ([[phab:T161545|T161545]]).
*To mix Puppet Resources and Puppet Facts in the same query, or to combine multiple Puppet Resources or multiple Puppet Facts, use the [[Cumin#Global grammar host selection|Global Grammar]] host selection explained below.
*Special '''all hosts''' matching: <code>*</code> '''!!!ATTENTION: use extreme caution with this selector!!!'''
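For instance, to preview what one of the queries above matches without running anything (a sketch; the hosts returned depend on the current PuppetDB data):<syntaxhighlight lang="shell-session">
$ sudo cumin --dry-run 'C:nginx and *.eqiad.wmnet'
</syntaxhighlight>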
 
===OpenStack backend===
 
*<code>'''project:deployment-prep'''</code>: selects all the hosts in the <code>deployment-prep</code> (a.k.a. beta) project.
*<code>'''project:deployment-prep name:kafka'''</code>: selects all the hosts in the <code>deployment-prep</code> project that have <code>kafka</code> in the name. OpenStack does a regex search on its side, so <code>kafka</code> here will also match <code>akafka</code>, <code>kafkaz</code> and <code>akafkaz</code>. If you want to match exactly <code>kafka</code>, you need to use <code>name:"^kafka$"</code>.
*<code>'''project:deployment-prep name:"^deployment-kafka[0-9]+$"'''</code>: selects all the hosts in the <code>deployment-prep</code> project that match the regex.
*Additional <code>'''key:value'''</code> parameters can be added, separated by space, according to the [https://developer.openstack.org/api-ref/compute/#list-servers OpenStack list-servers API].  Valid options are 'reservation_id', 'name', 'status', 'image', 'flavor', 'ip', 'changes-since', 'all_tenants'.
*Special '''all hosts in all projects''' matching: <code>*</code> '''!!!ATTENTION: use extreme caution with this selector!!!'''
*To mix multiple selections the general grammar can be used: <code>O{project:project1} or O{project:project2}</code>
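For instance, from a Cloud VPS Cumin master, a hedged sketch of previewing one of the queries above (the matched hosts depend on the current state of the project):<syntaxhighlight lang="shell-session">
$ sudo cumin --dry-run 'project:deployment-prep name:"^deployment-kafka[0-9]+$"'
</syntaxhighlight>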
 
===SSH Known Hosts files backend===
 
*<code>'''cumin1001.eqiad.wmnet'''</code>: Simple selection of a specific FQDN
*<code>'''mw13[15-22].eqiad.wmnet,mw2222.codfw.wmnet'''</code>: ClusterShell syntax for hosts expansion and comma-separated multiple FQDNs
*<code>'''*.wikimedia.org'''</code>: ClusterShell syntax for hosts globbing
*<code>'''mw13*.eqiad.wmnet or (mw22*.codfw.wmnet and not (mw2222* or mw2224*))'''</code>: A complex selection
*<code>'''*.*'''</code>: All FQDN hostnames; this avoids also including the short hostnames present in the known hosts files.
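A minimal sketch of selecting through this backend explicitly with the <code>--backend</code> CLI option (it is also usable via the <code>K{}</code> global grammar syntax described below):<syntaxhighlight lang="shell-session">
$ sudo cumin --dry-run --backend knownhosts '*.wikimedia.org'
</syntaxhighlight>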
 
===Direct backend===
With the direct backend each hostname is used verbatim, so it must be the FQDN of the host in order to work.

*<code>'''cumin1001.eqiad.wmnet'''</code>: Simple selection of a specific FQDN
*<code>'''mw13[15-22].eqiad.wmnet,mw2222.codfw.wmnet'''</code>: ClusterShell syntax for hosts expansion and comma-separated multiple FQDNs
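A minimal sketch, using the <code>D{}</code> global grammar syntax described below to force the direct backend from the CLI:<syntaxhighlight lang="shell-session">
$ sudo cumin --dry-run 'D{mw13[15-22].eqiad.wmnet}'
</syntaxhighlight>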
 
===HostFile backend (Enabled only in Cloud VPS)===
 
*This is a custom backend enabled through Cumin's plugin features only in Cloud VPS (<code>labs-puppetmaster</code> and other VPS cumin masters)
*It replicates the functionality provided by <code>clush</code>'s <code>--hostfile</code> or <code>--machinefile</code> option, and may be used to specify a path to a file containing a list of single hosts, node sets or node groups, separated by newlines.
*The host selection query looks like: <code>F{/home/user/hosts.list}</code> and can be mixed with other backends using the general Cumin grammar.
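A minimal sketch, assuming a newline-separated list of hosts at <code>/home/user/hosts.list</code> (the hostnames are illustrative):<syntaxhighlight lang="shell-session">
$ cat /home/user/hosts.list
deployment-kafka01.deployment-prep.eqiad1.wikimedia.cloud
deployment-kafka02.deployment-prep.eqiad1.wikimedia.cloud
$ sudo cumin --dry-run 'F{/home/user/hosts.list}'
</syntaxhighlight>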
 
===Global grammar host selection===


*'''Backend query''': anything inside <code>I{}</code>, where <code>I</code> is the backend identifier, is treated as a subquery to be parsed and executed with the chosen backend to gather its results. The available backend identifiers are:
**'''<code>P{}</code>''': PuppetDB backend
**'''<code>O{}</code>''': OpenStack backend
**<code>'''K{}'''</code>: SSH Known hosts files backend
**'''<code>D{}</code>''': Direct backend
**<code>'''F{}'''</code>: HostFile backend (available only in Cloud VPS)
*'''Aliases:''' aliases are defined in <code>/etc/cumin/aliases.yaml</code> and the file is provisioned by Puppet. To use an alias in the query just use <code>'''A:alias_name'''</code>, where <code>alias_name</code> is the key in the <code>aliases.yaml</code> file. It will be replaced with its value before parsing the query. '''The alias replacement is recursive to allow nesting aliases.'''
*'''Aggregation''': the subqueries can be aggregated through the boolean operators <code>'''and'''</code>, <code>'''or'''</code>, <code>'''and not'''</code>, <code>'''xor'''</code> and with parentheses <code>'''()'''</code> for maximum flexibility.
*'''Examples''':
**<code>'''(P{O:Ganeti} or P{O:Gerrit}) and P{F:is_virtual = true}'''</code>: all hosts with the <code>Ganeti</code> or <code>Gerrit</code> Puppet role and the Puppet fact <code>is_virtual</code> with value <code>true</code>.
**<code>'''P{O:Ganeti} and A:eqiad'''</code>: all the hosts with the <code>Ganeti</code> Puppet role in the <code>eqiad</code> datacenter.
**<code>'''P{C:Mediawiki::Nutcracker} and (P{host[10-20]*} or A:alias_name)'''</code>: all hosts with the <code>Mediawiki::Nutcracker</code> Puppet class and matching either a given alias <code>alias_name</code> or hostnames that start with <code>host[10-20]</code>.
**<code>'''O{project:deployment-prep} and not D{deployment-logstash02.deployment-prep.eqiad1.wikimedia.cloud,deployment-imagescaler01.deployment-prep.eqiad1.wikimedia.cloud}'''</code>: all hosts in the <code>deployment-prep</code> OpenStack project except the two listed explicitly by FQDN.
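Remember that from the CLI the whole mixed query must be enclosed in quotes, for example:<syntaxhighlight lang="shell-session">
$ sudo cumin --dry-run '(P{O:Ganeti} or P{O:Gerrit}) and P{F:is_virtual = true}'
</syntaxhighlight>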


==Host list manipulation==
On all Cumin master hosts the ClusterShell CLI tool <code>nodeset</code> is installed to allow easy manipulation of host lists, from the ClusterShell syntax to any arbitrary syntax. See <code>man nodeset</code> for more details. A typical example is to expand the host list: <code>nodeset -e -S '\n' HOST_LIST</code> (where <code>HOST_LIST</code> is the list of hosts returned by Cumin).
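A quick sketch of expanding a compact host list (the hosts are illustrative):<syntaxhighlight lang="shell-session">
$ nodeset -e -S '\n' 'mw[2251-2253].codfw.wmnet'
mw2251.codfw.wmnet
mw2252.codfw.wmnet
mw2253.codfw.wmnet
</syntaxhighlight>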


==Command execution==
There are various options that control how the command execution is performed. Keep in mind that by default Cumin considers an executed command successful if it has an exit status code of 0, and a failure otherwise.

*'''Success threshold (default: 100%)''': consider the current parallel execution a failure only if the percentage of success is below this threshold. Useful when running multiple commands and/or using the execution in batches. Take into account that during the execution of a single command, if no batches were specified, the command will be executed on all the hosts and the success threshold checked only at the end. By default Cumin expects 100% success, so a single failure marks the whole execution as failed. The CLI option is <code>'''-p 0-100''', --success-percentage 0-100</code>.
*'''Execute in batches (default: no batches, no sleep)''': by default Cumin schedules the execution in parallel on all the selected hosts. It is possible to execute in batches instead. The batch execution mode of Cumin uses a sliding window of size '''N''' with an optional sleep of '''S''' seconds between hosts, with this workflow:
**It starts executing on the first batch of '''N''' hosts.
**As soon as one host finishes the execution, if the success threshold is still met, it schedules the execution on the next host in '''S''' seconds.
**At most '''N''' hosts will be executing the commands in parallel and the success threshold is checked at each host completion.
**The CLI options are <code>'''-b BATCH_SIZE''', --batch-size BATCH_SIZE</code> and <code>'''-s BATCH_SLEEP''', --batch-sleep BATCH_SLEEP</code>; their default values are the number of hosts for the size and 0 seconds for the sleep.

*'''Mode of execution (no default)''': when executing multiple commands, Cumin requires a mode of execution to be specified. In the CLI there are two available modes: '''sync''' and '''async'''. In the library, in addition to those two modes, one can also specify a custom one. The CLI option is <code>'''-m {sync,async}''', --mode {sync,async}</code>.
**'''sync execution''':
***execute the first command in parallel on all hosts, also considering the batch and success threshold parameters.
***at the end of the execution, if the success threshold is met, start with the execution of the second command, and so on.
***This ensures that the first command was executed successfully on all hosts before proceeding with the next. Typical usage is when orchestrating changes across a cluster.
**'''async execution''':
***execute all the commands in sequence on each host, independently from each other, also considering the batch and success threshold parameters.
***The execution on any given host is interrupted at the first command that fails.
***It is roughly equivalent to an execution with a single command of the form <code>command1 && command2 && ... && commandN</code>.
*'''Ignore exit codes''': there are situations in which the exit status of an executed command is not important (like when debugging stuff with grep) and showing it as a failure just makes the output harder to read. In those cases the <code>'''-x''', --ignore-exit-codes</code> option can be used, which assumes that every command executed was successful. '''!!!ATTENTION: use caution with this option!!!'''
*'''Timeout (default unlimited):''' an optional timeout to be applied to the execution of each command on each host; by default Cumin doesn't time out. The CLI option is <code>'''-t TIMEOUT''', --timeout TIMEOUT</code>.
*'''Global timeout (default unlimited):''' an optional global timeout for the whole execution with Cumin; by default Cumin doesn't time out. The CLI option is <code>'''--global-timeout GLOBAL_TIMEOUT'''</code>.
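As a sketch combining these options, using the Puppet run command shown in the examples below (targets illustrative): run on at most 5 hosts at a time, sleeping 30 seconds between hosts and tolerating up to 5% of failures:<syntaxhighlight lang="shell-session">
$ sudo cumin -b 5 -s 30 -p 95 'wdqs2*' 'run-puppet-agent -q'
</syntaxhighlight>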


==Output handling==
Cumin's output can be modified using these options. At the moment all these options can be used only when a single command is executed. ''This limitation will be fixed in a future release.''
*'''Formatted output''': it's possible to tell Cumin to print the output of the executed commands in a more parsable way, using the <code>'''-o {txt,json}''', --output {txt,json}</code> option. When using this option the separator <code>_____FORMATTED_OUTPUT_____</code> will be printed after the normal Cumin output, followed by the output of the executed commands in the desired format for each host; the usual Cumin de-duplication of output does not apply to the formatted output. To extract just the formatted output you can append <code>| awk 'x==1 { print $0 } /_____FORMATTED_OUTPUT_____/ { x=1 }'</code> to the Cumin command in the general case. ''This limitation will be fixed in v5.0.0''. If <code>stderr</code> is not needed you can combine various options to make the extraction even simpler, see the example below. The available formats are:
**'''<code>txt</code>''': using this format will prepend <code>${HOSTNAME}:</code> to each line of output for that host, keeping the existing newlines.
**<code>'''json'''</code>: using this format will print a JSON dictionary where the keys are the hostnames and the value is a string with the whole output of the host.<syntaxhighlight lang="shell">
cumin --force --no-progress --no-color -o txt 'A:cumin' 'date' 2>/dev/null | tail -n "+2" | tee -a example.out
</syntaxhighlight>
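Similarly, a hedged sketch of extracting the <code>json</code> output with the generic <code>awk</code> snippet above and listing the matched hostnames (assuming <code>jq</code> is available on the Cumin master):<syntaxhighlight lang="shell">
cumin --force -o json 'A:cumin' 'date' 2>/dev/null | awk 'x==1 { print $0 } /_____FORMATTED_OUTPUT_____/ { x=1 }' | jq 'keys'
</syntaxhighlight>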
 
*'''Interactive mode''': if you want to manipulate the results with the power of Python, using the <code>'''-i''', --interactive</code> option Cumin will drop into a Python REPL session at the end of the execution, with direct access to Cumin's objects for further processing.


==WMF installation==


===Production infrastructure===
In the WMF production infrastructure, Cumin masters are installed via Puppet's <code>Role::Cumin::Master</code> role, which is currently included in the <code>Role::Cluster::Management</code> role. Cumin can be executed on any of those hosts and requires '''sudo''' privileges or being root. Cumin can access as root any production host that includes the <code>Profile::Cumin::Target</code> profile (all production hosts as of now), hence it is a very powerful but also a potentially very dangerous tool: '''be very careful''' while using it. The current Cumin masters from which it can be executed are:
{| class="wikitable"
{| class="wikitable"
!Cumin master hosts in production
!Cumin master hosts in production
|-
|-
|<code>neodymium.eqiad.wmnet</code>
|<code>cumin1001.eqiad.wmnet</code>
|-
|-
|<code>sarin.codfw.wmnet</code>
|<code>cumin2002.codfw.wmnet</code>
|}
|}
The default Cumin backend is configured to be PuppetDB and the default transport ClusterShell (SSH). The capability of Cumin to query PuppetDB as a backend allows selecting hosts in a very powerful and precise way, querying for any Puppet resources or facts.

If running commands on hosts in only one of the DCs where there is a Cumin master, consider running it from the local Cumin master to slightly speed up the execution.


Please note that the OpenStack backend is currently not available for the Cumin masters in production.
 
 
===Rootless production infrastructure===
In addition to the Cumin masters available for users with global root, there is also a separate installation for rootless operation based on Kerberos, which gets installed via the <code>cluster::managementunpriv</code> role. Rootless Cumin can only access production hosts which have been enabled for rootless Cumin, see below. The current rootless Cumin masters are:
{| class="wikitable"
!Cumin master hosts in production
|-
|<code>cuminunpriv1001.eqiad.wmnet</code>
|}
 
After logging in, you need to run <code>kinit</code>, which activates your Kerberos ticket. Following that you can run commands on Kerberos-enabled hosts (see below) as your standard user:
 
<pre>
jmm@cuminunpriv1001:~$ cumin A:installserver 'uname -v'
7 hosts will be targeted:
apt[1001,2001].wikimedia.org,install[1003,2003,3001,4001,5001].wikimedia.org
Ok to proceed on 7 hosts? Enter the number of affected hosts to confirm or "q" to quit 7
===== NODE GROUP =====
(7) apt[1001,2001].wikimedia.org,install[1003,2003,3001,4001,5001].wikimedia.org
----- OUTPUT of 'uname -v' -----
#1 SMP Debian 4.19.171-2 (2021-01-30)
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (7/7) [00:03<00:00,  1.32hosts/s]
FAIL |                                                                                                                                                |  0% (0/7) [00:03<?, ?hosts/s]
100.0% (7/7) success ratio (>= 100.0% threshold) for command: 'uname -v'.
100.0% (7/7) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
</pre>
 
==== Making hosts manageable with rootless Cumin ====
Rootless Cumin can only access production hosts which include the <code>profile::base::cuminunpriv</code> profile. In addition a host principal/keytab needs to be created, following the docs at [[Analytics/Systems/Kerberos#Create_a_keytab_for_a_service]] (the example given there for sretest1001 creates the necessary keytab/principal).
 
Finally you need to make a Puppet change to the role of the service:
* Include <code>profile::base::cuminunpriv</code>
* Pass the following via Hiera:
 
<pre>
profile::base::ssh_server_settings:
  enable_kerberos: true
</pre>
 
===WMCS Cloud VPS infrastructure===
There are dedicated Cumin masters in the [[Portal:Cloud VPS/Admin/Cloudinfra|restricted cloudinfra project]] for Cloud VPS admin use. Cumin can access any Cloud VPS VM (except special VMs managed by [[Trove]]); hence it is a very powerful but also a potentially very dangerous tool, '''be very careful''' while using it.
 
The current Cumin masters from which it can be executed are:
 
{| class="wikitable"
{| class="wikitable"
!Cumin master hosts in WMCS Cloud VPS
!Cumin master hosts in Cloud VPS
|-
|-
|<code>labpuppetmaster1001.wikimedia.org</code>
|<code>cloud-cumin-03.cloudinfra.eqiad1.wikimedia.cloud</code>
|-
|-
|<code>labpuppetmaster1002.wikimedia.org</code>
|<code>cloud-cumin-04.cloudinfra.eqiad1.wikimedia.cloud</code>
|}
|}

Please note that the PuppetDB backend is currently not available for the cloud-wide Cumin masters in Cloud VPS.
 
'''WARNING''': Be careful if you are specifying a batch size (<code>-b</code>) in Cumin. Values above '''3''' will probably cause timeouts while establishing the SSH connections.
 
===WMCS Cloud VPS single project installation===
Independently of the above global installations, Cumin can also be easily installed inside a Cloud VPS project, see the detailed instructions in [[Help:Cumin master]].
 
==Cumin CLI examples in the WMF production infrastructure==
 
====Run Puppet discarding the output====
To run '''Puppet''' on a set of hosts '''without getting the output''', relying just on the exit code, one host at a time, sleeping 5 seconds between one host and the next, and proceeding to the next host only if the current one succeeded. '''Do not use''' <code>puppet agent -t</code> because that includes the <code>--detailed-exitcodes</code> option that returns exit codes > 0 also in successful cases:<syntaxhighlight lang="shell-session">
$ sudo cumin -b 1 -s 5 'wdqs2*' 'run-puppet-agent -q'
</syntaxhighlight>


====Run Puppet keeping the output====
<syntaxhighlight lang="shell-session">
$ sudo cumin -b 1 -s 5 'wdqs2*' 'run-puppet-agent'
</syntaxhighlight>


====Disable Puppet====
To '''disable Puppet''' in a consistent way, waiting for the completion of any in-flight Puppet runs:<syntaxhighlight lang="shell-session">
$ sudo cumin 'wdqs2*' "disable-puppet 'Reason why was disabled - T12345 - ${USER}'"
</syntaxhighlight>


====Enable Puppet====
To '''enable Puppet''' only on the hosts where it was disabled with the same message:<syntaxhighlight lang="shell-session">
$ sudo cumin 'wdqs2*' "enable-puppet 'Reason why was disabled - T12345 - ${USER}'"
</syntaxhighlight>


====Run Puppet only if last run failed====
It might happen that a change merged in Puppet causes Puppet to fail on a number of hosts. Once the issue is fixed, without the need to wait for the next Puppet run, an easy way to quickly fix Puppet on all the failed hosts is to run the following command. It will exit immediately if the last Puppet run was successful, and run Puppet only on the hosts where it failed and is, of course, enabled. The <code>-p 95</code> option takes into account that some hosts might be down/unreachable without making Cumin fail. Remove the <code>-q</code> if you want to get the output, although it might be very verbose depending on the number of hosts that failed the last run:<syntaxhighlight lang="shell">
sudo cumin -b 15 -p 95 '*' 'run-puppet-agent -q --failed-only'
</syntaxhighlight>


====Check if systemd service is running====
<syntaxhighlight lang="shell-session">
$ sudo cumin 'A:mw-api and A:codfw' 'systemctl is-active php7.2-fpm.service'
64 hosts will be targeted:
mw[2251-2253,2261-2262,2283-2300,2302,2304,2306,2308,2317-2324,2326,2328,2330,2332,2334,2350,2352,2354,2356,2358,2360,2362,2364,2366,2368,2370,2372,2374,2376,2396-2405].codfw.wmnet
Ok to proceed on 64 hosts? Enter the number of affected hosts to confirm or "q" to quit 64
===== NODE GROUP =====
(64) mw[2251-2253,2261-2262,2283-2300,2302,2304,2306,2308,2317-2324,2326,2328,2330,2332,2334,2350,2352,2354,2356,2358,2360,2362,2364,2366,2368,2370,2372,2374,2376,2396-2405].codfw.wmnet
----- OUTPUT of 'systemctl is-act...p7.2-fpm.service' -----
active
================
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (64/64) [00:00<00:00, 95.26hosts/s]
FAIL |                                                                                                                     |  0% (0/64) [00:00<?, ?hosts/s]
100.0% (64/64) success ratio (>= 100.0% threshold) for command: 'systemctl is-act...p7.2-fpm.service'.
100.0% (64/64) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
</syntaxhighlight>


====Reboot====
Given that Cumin uses SSH as a transport, just running '''reboot''' will most likely leave the connection hanging and it will not properly return to Cumin. To overcome this, use the '''reboot-host''' command (which nohups the reboot command in a wrapper), shipped via Puppet:<syntaxhighlight lang="shell">
$ cumin <hosts> 'reboot-host'
</syntaxhighlight>


====Check TLS certificate====
Print a TLS certificate from all the hosts that have that specific Puppet-managed file, to ensure that it is the same on all hosts and to verify its details. The expected output, in case all the hosts have the same certificate, is a single block with the certificate content, with the number and list of the hosts that have it on top:<syntaxhighlight lang="shell-session">
$ sudo cumin 'R:File = /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt' 'openssl x509 -in /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt -text -noout'
</syntaxhighlight>


====Check TLS private key====
Ensuring that the private key of a certificate matches the certificate itself on all the hosts that have a specific certificate can be done in two ways:
*Using the '''async''' mode, only one line of output is expected: the matching MD5 for all the hosts for both the certificate and the private key.
*Using the '''sync''' mode, 2 lines of grouped output are expected instead, one for the first command and one for the second one, leaving the user to match those.
<syntaxhighlight lang="shell-session">
$ sudo cumin -m async 'R:File = /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt' 'openssl pkey -pubout -in /etc/ssl/private/api.svc.codfw.wmnet.key | openssl md5' 'openssl x509 -pubkey -in /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt -noout | openssl md5'
</syntaxhighlight>


====Check MySQL semi-sync replication status====
Check semi-sync replication status (number of connected clients) on all core MediaWiki master databases:<syntaxhighlight lang="shell-session">
$ sudo cumin 'O:Mariadb::Groups%mysql_group = core and R:Class%mysql_role = master' "mysql --skip-ssl -e \"SHOW GLOBAL STATUS like 'Rpl_semi_sync_master_clients'\""
</syntaxhighlight>


====Upgrade Debian packages====

From time to time it is necessary to roll out new versions of Debian packages. If the version isn't explicitly stated in Puppet with <tt>ensure</tt> then the package won't normally be upgraded.
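As a rough illustrative sketch only (the package name and targets are assumptions, and the batching and threshold should be adapted to the rollout):<syntaxhighlight lang="shell">
sudo cumin -b 15 -p 95 'A:stretch' 'apt-get update -qq && apt-get install -y --only-upgrade openssl'
</syntaxhighlight>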


====Check EtcdConfig for MediaWiki hosts====
Check that the part of <code>mediawiki-config</code> that comes from etcd, and is exposed as <code>wmf-config</code> in siteinfo, is in sync with etcd, by checking that the <code>lastModifiedIndex</code> is the latest one. The <code>lastModifiedIndex</code> is '''expected to be different in the two etcd clusters''' (eqiad vs codfw).<syntaxhighlight lang="shell">
$ sudo cumin 'A:mw or A:mw-api' "curl -sx \$(hostname -f):80 -H'X-Forwarded-Proto: https' 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2' | jq -r '.query.general[\"wmf-config\"].wmfEtcdLastModifiedIndex'"
</syntaxhighlight>To check the actual content of the wmf-config section of the configuration, just remove the <code>.wmfEtcdLastModifiedIndex</code> at the end.


====Target all hosts running Stretch====
<syntaxhighlight lang="shell">
$ sudo cumin 'A:stretch'
</syntaxhighlight>


=Troubleshooting Production issues=


====PuppetDB is down====
 
If PuppetDB is not working for some reason (host down, software problems, etc.) Cumin will fail to match hosts based on compound expressions. The Known Hosts and Direct backends will still work using the global grammar syntax with <code>K{}</code> (see [[Cumin#SSH Known Hosts files backend]]) and <code>D{}</code> respectively; alternatively the <code>--backend knownhosts</code> and <code>--backend direct</code> options can be used. It might make sense in any case to fall back to the secondary PuppetDB host. For a quick fix, disable Puppet and edit <code>/etc/cumin/config.yaml</code>: in the <code>puppetdb</code> section, amend the:
 
     host: puppetdb1001.eqiad.wmnet


to be: 


     host: puppetdb2001.codfw.wmnet
 
and try to see if Cumin is working again. For a permanent fix adjust the Hiera variable <code>profile::cumin::master::puppetdb_host</code>.
 
=How to contribute to Cumin=
 
====Did you find a bug?====
 
*Check if it was already reported or resolved looking at [[phab:maniphest/query/jyF7TiSa75j3/|Cumin All Issues]] on Wikimedia's Phabricator. Open issues can be listed at [[phab:maniphest/query/d4JlAydvp_Il/|Cumin Open Issues]].
 
*If you're unable to find a related issue, open a new one in Phabricator:
**For '''WMF''' '''Production''' and '''WMCS''' Cloud VPS global installations, use [https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=Operations-Software-Development&title=Cumin%3A%20&subscribers=Volans this template].
**For any Cumin usage '''outside WMF''' infrastructure and for WMCS Cloud VPS '''single project installations''', use [https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=Operations-Software-Development&title=Cumin%3A%20&subscribers=Volans&description=%7C%20Cumin%20version%20%7C%20%3F%3F%3F%20%7C%0A%7C%20Python%20version%20%7C%20%3F%3F%3F%20%7C%0A%7C%20Operating%20System%20%7C%20%3F%3F%3F%20%7C%0A%0AIssue%3A%0A%3F%3F%3F%0A%0AConfiguration%20and%20aliases%20(REMOVE%20ANY%20PRIVATE%20INFORMATION)%3A%0A%60%60%60%0A%3F%3F%3F%0A%60%60%60%0A%0ASteps%20to%20reproduce%3A%0A%23%20%3F%3F%3F%0A this template].
 
====Contributing to the code====
 
*Cumin's development is done on [https://gerrit.wikimedia.org/r/#/projects/operations/software/cumin,dashboards/default:recent Wikimedia's Gerrit]. A [[mw:Developer access|developer account]] is needed to send a pull request.
*If it's a patch to fix a bug, make sure that a bug was already reported and confirmed, see the above section.
*If it's a patch for a new feature, make sure to open an issue first and that there is consensus to add this new feature to Cumin.
*Check the [https://doc.wikimedia.org/cumin/master/development.html Development page] in Cumin's documentation to have a quick overview of its structure.
*Make sure that all the tests are passing, including the integration tests, if possible.
*Make sure to reference the existing open issue(s) related to the patch adding a line with <code>Bug: T00001</code> as the last line of your commit message (replacing the issue ID with the correct one(s)).
*If you need help, feel free to contact [[mw:User:RCoccioli_(WMF)]].
 
==See Also==


*[[Spicerack]]
*[[Spicerack/Cookbooks]]


[[Category:Software]]
[[Category:Deployment]]
[[Category:SRE Infrastructure Foundations]]

Host selection

Our production configuration uses PuppetDB as the default backend, meaning that by default each host selection query is parsed as a PuppetDB query, and only if the parsing fails is it re-parsed with the general grammar. This allows using everyday queries without additional syntax, while keeping the full power of composing subqueries when needed.

For WMCS, the default configuration uses the OpenStack backend instead.

When using the CLI, the --dry-run option is useful to check which hosts match the query without executing any command; if no commands are specified, this option is enabled automatically.

Please note that when using multiple host selection queries from the CLI, the entire set of queries should be enclosed in quotes, as shown in the examples below.

PuppetDB host selection

  • Match hosts by exact FQDN:
    • einsteinium.wikimedia.org matches that host by its FQDN
    • einsteinium.wikimedia.org,neodymium.eqiad.wmnet matches a comma-separated list of FQDNs
  • Match hosts by FQDN with simple globbing:
    • wdqs2* matches all the hosts with hostname starting with wdqs2, hence all the Wikidata Query Service hosts in codfw. wdqs2*.codfw.wmnet is a more formal way to specify it.
    • wdqs2* or pc2* matches the same hosts as above plus codfw's Parser Cache hosts; it's basically a set union.
  • Match hosts by hostname using the ClusterShell NodeSet syntax:
    • db[2016-2019,2023,2028-2029,2033].codfw.wmnet defines a specific list of hosts in a compact format.
    • cp[2001-2026].codfw.wmnet and cp[2021-2026].codfw.wmnet matches only 6 hosts, cp[2021-2026].codfw.wmnet; it's basically a set intersection.

For example, if we set A=[2001-2026] and B=[2021-2026], the elements of A that are also in B are 2021-2026: this is called the intersection of A and B.

  • Puppet Fact selection:
    • F:memorysize_mb >= 24000 and F:memorysize_mb <= 64000 selects all the hosts that have between 24000MB and 64000MB of RAM as exported by facter.
    • F:filesystems ~ xfs selects all the hosts that have an XFS filesystem, while using ~ "^xfs$" would have been equivalent to = xfs.
    • F:lsbdistid = Debian and analytics* selects all the hosts with hostname that starts with analytics that have Debian as OS.
    • F:virtual = physical selects all physical hosts, whereas * and not F:virtual = physical will do the opposite.
  • Puppet Resource selection. Any host reachable by Cumin includes the profile::cumin::target Puppet class, to which some variables and tags were added in order to expose to PuppetDB the datacenter, the cluster and all the roles applied to each host. See its usage in some of these examples:
    • R:File = /etc/ssl/localcerts/api.svc.eqiad.wmnet.chained.crt selects all the hosts in which Puppet manages this specific file resource
    • R:Service::Node selects all the hosts that have the Service::Node resource included, as it works for custom-defined resources too
    • R:Class = Nginx selects all the hosts that have the Puppet Class Nginx applied.
    • R:Class = mediawiki::web::prod_sites selects all the hosts that have the Puppet Class mediawiki::web::prod_sites applied.
    • C:nginx and *.eqiad.wmnet uses the Class shortcut and selects all the hosts that have the Puppet Class Nginx applied and the hostname ending in .eqiad.wmnet, that is a quick hack to select a single datacenter if there are no hosts of the type .wikimedia.org involved.
    • P:cumin::target%cluster = cache_upload and R:class%site = codfw allows to overcome the above limitation and selects all the hosts in the codfw datacenter that are part of the cache_upload cluster, using the shortcut for profiles P:.
    • O:cache::upload or O:cache::text selects all the hosts that have either the role cache::upload or the role cache::text, using the shortcut for roles O:.
    • P:cumin::target%site = codfw and (R:class@tag = role::cache::text or R:class@tag = role::cache::upload) this syntax allows to mix a selection over roles with specific sites and clusters.
    • R:Class ~ "(?i)role::cache::(upload|text)" and *.ulsfo.wmnet selects all the cache upload and text hosts in ulsfo, the (?i) allow to perform the query in a case-insensitive mode (our implementation of PuppetDB uses PostgreSQL as a backend and the regex syntax is backend-dependent) without having to set uppercase the first letter of each class path.
    • O:Mariadb::Groups%mysql_group = core and R:Class%mysql_role = slave selects all the hosts that have the R:Class = Role::Mariadb::Groups class with the parameter mysql_group with value core and the parameter mysql_role with value slave. Currently you cannot filter based on boolean parameters (T161545)
  • To mix in the same query Puppet Resources and Puppet Facts or combine multiple Puppet Resources or multiple Puppet Facts, use the Global Grammar host selection explained below.
  • Special all hosts matching: * !!!ATTENTION: use extreme caution with this selector!!!
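
For example, a hedged sketch (the query itself is illustrative) combining a Fact selection with a hostname glob; with --dry-run and no command, Cumin only reports the matching hosts:

$ sudo cumin --dry-run 'F:virtual = physical and wdqs2*'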

OpenStack backend

  • project:deployment-prep: selects all the hosts in the deployment-prep (a.k.a. beta) project.
  • project:deployment-prep name:kafka: selects all the hosts in the deployment-prep project that have kafka in the name. OpenStack does a regex search on its side, so kafka here will also match akafka, kafkaz and akafkaz. If you want to match exactly kafka, you need to use name:^kafka$.
  • project:deployment-prep name:"^deployment-kafka[0-9]+$": selects all the hosts in the deployment-prep project that matches the regex.
  • Additional key:value parameters can be added, separated by space, according to the OpenStack list-servers API. Valid options are 'reservation_id', 'name', 'status', 'image', 'flavor', 'ip', 'changes-since', 'all_tenants'.
  • Special all hosts in all projects matching: * !!!ATTENTION: use extreme caution with this selector!!!
  • To mix multiple selections the general grammar can be used: O{project:project1} or O{project:project2}
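
For example, a minimal sketch run from a WMCS-wide Cumin master, where OpenStack is the default backend, just listing the matching hosts:

$ sudo cumin --dry-run 'project:deployment-prep name:"^deployment-kafka[0-9]+$"'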

SSH Known Hosts files backend

  • cumin1001.eqiad.wmnet: Simple selection of a specific FQDN
  • mw13[15-22].eqiad.wmnet,mw2222.codfw.wmnet: ClusterShell syntax for hosts expansion and comma-separated multiple FQDNs
  • *.wikimedia.org: ClusterShell syntax for hosts globbing
  • mw13*.eqiad.wmnet or (mw22*.codfw.wmnet and not (mw2222* or mw2224*)): A complex selection
  • *.*: All FQDNs, to avoid also including the short hostnames present in the known hosts files.
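
As an illustrative sketch, this backend can also be forced from the CLI (uptime here is just a placeholder command):

$ sudo cumin --backend knownhosts '*.wikimedia.org' 'uptime'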

Direct backend

When using the direct backend, each hostname is used verbatim, so it must be the FQDN of the host in order to work.

  • cumin1001.eqiad.wmnet: Simple selection of a specific FQDN
  • mw13[15-22].eqiad.wmnet,mw2222.codfw.wmnet: ClusterShell syntax for hosts expansion and comma-separated multiple FQDNs
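
Equivalently, the direct backend can be selected inline with the global grammar; a minimal sketch (uptime is a placeholder command):

$ sudo cumin 'D{cumin1001.eqiad.wmnet}' 'uptime'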

HostFile backend (Enabled only in Cloud VPS)

  • This is a custom backend enabled through Cumin's plugin features only in Cloud VPS (labs-puppetmaster and other VPS cumin masters)
  • It replicates the functionality provided by clush's --hostfile or --machinefile option, and may be used to specify a path to a file containing a list of single hosts, node sets or node groups, separated by newlines.
  • The host selection query looks like: F{/home/user/hosts.list} and can be mixed with other backends using the general Cumin grammar.
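
A minimal sketch, assuming a file with one host or node set per line (the hostnames are illustrative):

$ cat /home/user/hosts.list
deployment-logstash02.deployment-prep.eqiad1.wikimedia.cloud
deployment-imagescaler01.deployment-prep.eqiad1.wikimedia.cloud
$ sudo cumin 'F{/home/user/hosts.list}' 'uptime'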

Global grammar host selection

  • Backend query: anything inside I{}, where I is the backend identifier, is treated as a subquery to be parsed and executed with the chosen backend to gather its results. The available backend identifiers are:
    • P{}: PuppetDB backend
    • O{}: OpenStack backend
    • K{}: SSH Known hosts files backend
    • D{}: Direct backend
    • F{}: HostFile backend (available only in Cloud VPS)
  • Aliases: aliases are defined in /etc/cumin/aliases.yaml and the file is provisioned by Puppet. To use an alias in the query just use A:alias_name, where alias_name is the key in the aliases.yaml file. It will be replaced with its value before parsing the query. The alias replacement is recursive, to allow nested aliases.
  • Aggregation: the subqueries can be aggregated through the boolean operators (and, or, and not, xor) and grouped with parentheses () for maximum flexibility.
  • Examples:
    • (P{O:Ganeti} or P{O:Gerrit}) and P{F:is_virtual = true}: all hosts with the Ganeti or Gerrit Puppet role and the Puppet fact is_virtual with value true.
    • P{O:Ganeti} and A:eqiad: all the hosts with the Ganeti Puppet role in the eqiad datacenter.
    • P{C:Mediawiki::Nutcracker} and (P{host[10-20]*} or A:alias_name): all hosts with the Mediawiki::Nutcracker Puppet class and matching either a given alias alias_name or hostnames that start with host[10-20].
    • O{project:deployment-prep} and not D{deployment-logstash02.deployment-prep.eqiad1.wikimedia.cloud,deployment-imagescaler01.deployment-prep.eqiad1.wikimedia.cloud}: all hosts in the deployment-prep OpenStack project but not the two listed there explicitly by FQDN.
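
When running such queries from the CLI, remember to enclose the whole query in quotes; an illustrative sketch (uptime is a placeholder command):

$ sudo cumin '(P{O:Ganeti} or P{O:Gerrit}) and P{F:is_virtual = true}' 'uptime'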

Host list manipulation

On all Cumin master hosts the ClusterShell CLI tool nodeset is installed to allow easy manipulation of the host list from the ClusterShell syntax to any arbitrary syntax. See man nodeset for more details. A typical example is to expand the host list: nodeset -e -S '\n' HOST_LIST (where HOST_LIST is the list of hosts returned by Cumin).
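
For example, expanding one of the node sets used above, one host per line:

$ nodeset -e -S '\n' 'db[2016-2019].codfw.wmnet'
db2016.codfw.wmnet
db2017.codfw.wmnet
db2018.codfw.wmnet
db2019.codfw.wmnet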

Command execution

There are various options that allow controlling how the command execution is performed. Keep in mind that by default Cumin considers any executed command successful if it has an exit status code of 0, and a failure otherwise.

  • Success threshold (default: 100%): consider the current parallel execution a failure only if the percentage of success is below this threshold. Useful when running multiple commands and/or using the execution in batches. Take into account that during the execution of a single command, if no batches were specified, the command will be executed on all the hosts and the success threshold checked only at the end. By default Cumin expects 100% success: a single failure marks the whole execution as failed. The CLI option is -p 0-100, --success-percentage 0-100.
  • Execute in batches (default: no batches, no sleep): by default Cumin schedules the execution in parallel on all the selected hosts. It is possible to execute in batches instead. The batch execution mode of Cumin is a sliding window of size N with an optional sleep of S seconds between hosts, with this workflow:
    • It starts executing on the first batch of N hosts
    • As soon as one host finishes the execution, if the success threshold is still met, schedule the execution on the next host in S seconds.
    • At most N hosts will be executing the commands in parallel and the success threshold is checked at each host completion.
    • The CLI options are -b BATCH_SIZE, --batch-size BATCH_SIZE and -s BATCH_SLEEP, --batch-sleep BATCH_SLEEP and their default values are the number of hosts for the size and 0 seconds for the sleep.
  • Mode of execution (no default): when executing multiple commands, Cumin requires specifying a mode of execution. In the CLI there are two available modes: sync and async. In the library, in addition to those two modes, a custom one can also be specified. The CLI option is -m {sync,async}, --mode {sync,async}. See the combined example after this list.
    • sync execution:
      • execute the first command in parallel on all hosts, also considering the batch and success threshold parameters.
      • at the end of the execution, if the success threshold is met, start with the execution of the second command, and so on.
      • This ensures that the first command was executed successfully on all hosts before proceeding with the next one. Typical usage is when orchestrating changes across a cluster.
    • async execution:
      • execute all the commands in sequence on each host, independently of each other, also considering the batch and success threshold parameters.
      • The execution on any given host is interrupted at the first command that fails.
      • It is roughly equivalent to executing a single command of the form command1 && command2 && ... && commandN.
  • Ignore exit codes: there are situations in which the exit status of an executed command is not important (like when debugging with grep) and showing it as a failure just makes the output harder to read. In those cases the -x, --ignore-exit-codes option can be used, which assumes that every executed command was successful. !!!ATTENTION: use caution with this option!!!
  • Timeout (default unlimited): an optional timeout to be applied to the execution of each command in each host, by default Cumin doesn't timeout. The CLI option is -t TIMEOUT, --timeout TIMEOUT.
  • Global timeout (default unlimited): an optional global timeout to the whole execution with Cumin, by default Cumin doesn't timeout. The CLI option is --global-timeout GLOBAL_TIMEOUT.
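
Putting the options above together, an illustrative sketch (hosts and commands are placeholders) that runs two commands in sync mode, 5 hosts at a time, sleeping 30 seconds between hosts and tolerating up to 10% of failures:

$ sudo cumin -b 5 -s 30 -p 90 -m sync 'wdqs2*' 'command1' 'command2'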

Output handling

Cumin's output can be modified using the following options. At the moment these options can be used only when a single command is executed; this limitation will be fixed in a future release.

  • Formatted output: it's possible to tell Cumin to print the output of the executed commands in a more parsable way, using the -o {txt,json}, --output {txt,json} option. When using this option the separator _____FORMATTED_OUTPUT_____ will be printed after the normal Cumin output, followed by the output of the executed commands in the desired format, for each host. The usual Cumin de-duplication of output does not apply to the formatted output. To extract just the formatted output you can append | awk 'x==1 { print $0 } /_____FORMATTED_OUTPUT_____/ { x=1 }' to the Cumin command in the general case; this limitation will be fixed in v5.0.0. If stderr is not needed, you can combine various options to make the extraction even simpler, see the example below. The available formats are:
    • txt: using this format will prepend the ${HOSTNAME}: to each line of output for that host, keeping the existing newlines.
    • json: using this format will print a JSON dictionary where the keys are the hostnames and the value is a string with the whole output of the host.
      cumin --force --no-progress --no-color -o txt 'A:cumin' 'date' 2>/dev/null | tail -n "+2" | tee -a example.out
      
  • Interactive mode: if you want to manipulate the results with the power of Python, using the -i, --interactive option Cumin will drop into a Python REPL session at the end of the execution, with direct access to Cumin's objects for further processing.
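
Similarly, a hedged sketch of extracting the json formatted output (the trailing jq . is just an illustrative consumer of the JSON):

$ sudo cumin --force --no-progress --no-color -o json 'A:cumin' 'date' 2>/dev/null | awk 'x==1 { print $0 } /_____FORMATTED_OUTPUT_____/ { x=1 }' | jq .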

WMF installation

Production infrastructure

In the WMF Production infrastructure, Cumin masters are installed via Puppet's Role::Cumin::Master role, which is currently included in the Role::Cluster::Management role. Cumin can be executed on any of those hosts and requires sudo privileges or being root. Cumin can access as root any production host that includes the Profile::Cumin::Target profile (all production hosts as of now), hence it is a very powerful but also potentially very dangerous tool; be very careful while using it. The current Cumin masters from where it can be executed are:

Cumin master hosts in production
cumin1001.eqiad.wmnet
cumin2002.codfw.wmnet

The default Cumin backend is configured to be PuppetDB and the default transport ClusterShell (SSH). Cumin's capability to query PuppetDB as a backend allows selecting hosts in a very powerful and precise way, querying for any Puppet resource or fact.

If running commands only on hosts in one of the DCs where there is a Cumin master, consider running them from the local Cumin master to slightly speed up the execution.

Please note that the OpenStack backend is currently not available for Cumin masters in production.


Rootless production infrastructure

In addition to the Cumin masters available to users with global root, there's also a separate installation for rootless operation based on Kerberos, which gets installed via the cluster::managementunpriv role. Rootless Cumin can only access production hosts which have been enabled for rootless Cumin, see below. The current rootless Cumin masters are:

Cumin master hosts in production
cuminunpriv1001.eqiad.wmnet

After logging in, you need to run "kinit", which activates your Kerberos ticket. Following that you can run commands on Kerberos-enabled hosts (see below) as your standard user:

jmm@cuminunpriv1001:~$ cumin A:installserver 'uname -v'
7 hosts will be targeted:
apt[1001,2001].wikimedia.org,install[1003,2003,3001,4001,5001].wikimedia.org
Ok to proceed on 7 hosts? Enter the number of affected hosts to confirm or "q" to quit 7
===== NODE GROUP =====
(7) apt[1001,2001].wikimedia.org,install[1003,2003,3001,4001,5001].wikimedia.org
----- OUTPUT of 'uname -v' -----
#1 SMP Debian 4.19.171-2 (2021-01-30)
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (7/7) [00:03<00:00,  1.32hosts/s]
FAIL |                                                                                                                                                |   0% (0/7) [00:03<?, ?hosts/s]
100.0% (7/7) success ratio (>= 100.0% threshold) for command: 'uname -v'.
100.0% (7/7) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Making hosts manageable with rootless Cumin

Rootless Cumin can only access production hosts which include the profile::base::cuminunpriv profile. In addition, a host principal/keytab needs to be created, following the docs at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_keytab_for_a_service (the example given there for sretest1001 creates the necessary keytab/principal).

Finally you need to make a Puppet change to the role of the service:

  • Include profile::base::cuminunpriv
  • Pass the following via Hiera:
profile::base::ssh_server_settings:
  enable_kerberos: true

WMCS Cloud VPS infrastructure

There are dedicated Cumin masters in the restricted cloudinfra project for Cloud VPS admin use. Cumin can access any Cloud VPS VM (except special VMs managed by Trove), hence it is a very powerful but also potentially very dangerous tool; be very careful while using it.

The current Cumin masters from where it can be executed are:

Cumin master hosts in Cloud VPS
cloud-cumin-03.cloudinfra.eqiad1.wikimedia.cloud
cloud-cumin-04.cloudinfra.eqiad1.wikimedia.cloud

Please note that the PuppetDB backend is currently not available for the cloud-wide cumin masters in Cloud VPS.

WARNING: Be careful if you are specifying a batch size (-b) in Cumin. Values above 3 will probably cause timeouts while establishing the SSH connections.
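
For example, a conservative sketch that keeps the batch size at 3 (uptime is a placeholder command):

$ sudo cumin -b 3 'project:deployment-prep' 'uptime'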

WMCS Cloud VPS single project installation

Independently of the above global installations, Cumin can also be easily installed inside a Cloud VPS project; see the detailed instructions in Help:Cumin master.

Cumin CLI examples in the WMF production infrastructure

Run Puppet discarding the output

To run Puppet on a set of hosts without getting the output, relying just on the exit code, one host at a time, sleeping 5 seconds between one host and the next, and proceeding to the next host only if the current one succeeded. Do not use puppet agent -t because that includes the --detailed-exitcodes option, which returns exit codes > 0 also in successful cases:

$ sudo cumin -b 1 -s 5 'wdqs2*' 'run-puppet-agent -q'
3 hosts will be targeted:
wdqs[2001-2003].codfw.wmnet
Confirm to continue [y/n]? y
===== NO OUTPUT =====
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [02:24<00:00, 46.03s/hosts]
FAIL |                                                                                                         |   0% (0/3) [02:24<?, ?hosts/s]
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'run-puppet-agent -q'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Run Puppet keeping the output

$ sudo cumin -b 1 -s 5 'wdqs2*' 'run-puppet-agent'

Disable Puppet

To disable Puppet in a consistent way, waiting for the completion of any in-flight Puppet runs:

$ sudo cumin 'wdqs2*' "disable-puppet 'Reason why was disabled - T12345 - ${USER}'"

Enable Puppet

To enable Puppet only on the hosts where it was disabled with the same message:

$ sudo cumin 'wdqs2*' "enable-puppet 'Reason why was disabled - T12345 - ${USER}'"

Run Puppet only if last run failed

It might happen that a change merged in Puppet causes Puppet to fail on a number of hosts. Once the issue is fixed, without the need to wait for the next Puppet run, an easy way to quickly fix Puppet on all the failed hosts is to run the following command. It will exit immediately if the last Puppet run was successful and run Puppet only on the hosts where it failed and is, of course, enabled. The -p 95 option takes into account that some hosts might be down/unreachable without making Cumin fail. Remove the -q if you want to get the output, although it might be very verbose depending on the number of hosts that failed the last run:

sudo cumin -b 15 -p 95 '*' 'run-puppet-agent -q --failed-only'

Check if systemd service is running

$ sudo cumin 'A:mw-api and A:codfw' 'systemctl is-active php7.2-fpm.service'
64 hosts will be targeted:
mw[2251-2253,2261-2262,2283-2300,2302,2304,2306,2308,2317-2324,2326,2328,2330,2332,2334,2350,2352,2354,2356,2358,2360,2362,2364,2366,2368,2370,2372,2374,2376,2396-2405].codfw.wmnet
Ok to proceed on 64 hosts? Enter the number of affected hosts to confirm or "q" to quit 64
===== NODE GROUP =====
(64) mw[2251-2253,2261-2262,2283-2300,2302,2304,2306,2308,2317-2324,2326,2328,2330,2332,2334,2350,2352,2354,2356,2358,2360,2362,2364,2366,2368,2370,2372,2374,2376,2396-2405].codfw.wmnet
----- OUTPUT of 'systemctl is-act...p7.2-fpm.service' -----
active
================
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (64/64) [00:00<00:00, 95.26hosts/s]
FAIL |                                                                                                                      |   0% (0/64) [00:00<?, ?hosts/s]
100.0% (64/64) success ratio (>= 100.0% threshold) for command: 'systemctl is-act...p7.2-fpm.service'.
100.0% (64/64) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Reboot

Given that Cumin uses SSH as a transport, just running reboot will most likely leave the connection hanging and it will not properly return to Cumin. To overcome this, use the reboot-host command (which nohups the reboot command in a wrapper), which we ship via Puppet:

$ cumin <hosts> 'reboot-host'

Check TLS certificate

Print a TLS certificate from all the hosts that have that specific Puppet-managed file, to ensure that it is the same on all hosts and to verify its details. The expected output, in case all the hosts have the same certificate, is a single block with the certificate content, with the number and list of the hosts that have it on top:

$ sudo cumin 'R:File = /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt' 'openssl x509 -in /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt -text -noout'

Check TLS private key

Ensuring that the private key of a certificate matches the certificate itself on all the hosts that have a specific certificate can be done in two ways:

  • Using the async mode, only one line of output is expected: the matching MD5 hash for all the hosts, for both the certificate and the private key.
  • Using the sync mode instead, 2 lines of grouped output are expected, one for the first command and one for the second, leaving it to the user to match them.
$ sudo cumin -m async 'R:File = /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt' 'openssl pkey -pubout -in /etc/ssl/private/api.svc.codfw.wmnet.key | openssl md5' 'openssl x509 -pubkey -in /etc/ssl/localcerts/api.svc.codfw.wmnet.chained.crt -noout | openssl md5'
55 hosts will be targeted:
mw[2120-2147,2200-2223,2251-2253].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(110) mw[2120-2147,2200-2223,2251-2253].codfw.wmnet
----- OUTPUT -----
(stdin)= c51627f0b52a4dc70d693acdfdf4384a
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████| 100% (55/55) [00:00<00:00, 89.83hosts/s]
FAIL |                                                                                                         |   0% (0/55) [00:00<?, ?hosts/s]
100.0% (55/55) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Check MySQL semi-sync replication status

Check semi-sync replication status (number of connected clients) on all core mediawiki master databases:

$ sudo cumin 'O:Mariadb::Groups%mysql_group = core and R:Class%mysql_role = master' "mysql --skip-ssl -e \"SHOW GLOBAL STATUS like 'Rpl_semi_sync_master_clients'\""

Upgrade Debian packages

From time to time it is necessary to roll out new versions of Debian packages. If the version isn't explicitly stated in Puppet with ensure, then the package won't normally be upgraded.

NB Make sure to test the upgrade on a few selected nodes first.

NB2 Even better, use debdeploy in production to run package upgrades.

cumin 'HOSTS' 'DEBIAN_FRONTEND=noninteractive apt-get -q -y --assume-no -o \
  DPkg::Options::="--force-confdef" -o DPkg::Options::="--force-confold" \
  install PACKAGE'

Check EtcdConfig for MediaWiki hosts

Check that the part of mediawiki-config that comes from etcd, and is exposed as wmf-config in siteinfo, is in sync with etcd by checking that the lastModifiedIndex is the latest one. The lastModifiedIndex is expected to be different in the two etcd clusters (eqiad vs codfw).

$ sudo cumin 'A:mw or A:mw-api' "curl -sx \$(hostname -f):80 -H'X-Forwarded-Proto: https' 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2' | jq -r '.query.general[\"wmf-config\"].wmfEtcdLastModifiedIndex'"

To check the actual content of the wmf-config section of the configuration, just remove the .wmfEtcdLastModifiedIndex at the end.
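
That is, an illustrative variant of the command above:

$ sudo cumin 'A:mw or A:mw-api' "curl -sx \$(hostname -f):80 -H'X-Forwarded-Proto: https' 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2' | jq -r '.query.general[\"wmf-config\"]'"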

Target all hosts running Stretch

$ sudo cumin 'A:stretch'

Troubleshooting Production issues

PuppetDB is down

If PuppetDB is not working for some reason (host down, software problems, etc.) Cumin will fail to match hosts via PuppetDB queries. The Known Hosts and Direct backends will still work using the global grammar syntax with K{} (see Cumin#SSH Known Hosts files backend) and D{} respectively. Alternatively the --backend knownhosts and --backend direct options can also be used. It might make sense in any case to fall back to the secondary PuppetDB host. For a quick fix, disable Puppet and edit /etc/cumin/config.yaml: in the puppetdb section, amend the:

   host: puppetdb1001.eqiad.wmnet

to be:

   host: puppetdb2001.codfw.wmnet

and try to see if cumin is working again. For a permanent fix adjust the hiera variable profile::cumin::master::puppetdb_host.
