Wikimedia Cloud Services team/EnhancementProposals/Operational Automation

Revision as of 17:58, 23 December 2020 by David Caro

THIS PROPOSAL WRITE-UP IS IN PROGRESS


We currently use Puppet to automate most of our tasks, but it has its limitations. We still need a tool to automate, collect and review all our operational procedures. Some examples of such procedures are:

  • Adding a new member to the etcd cluster of a Toolforge instance.
  • Bootstrapping a new Toolforge instance.
  • Searching for the host where a backup is kept.
  • Provisioning a new cloudvirt node.
  • Re-imaging all, or a subset of, the cloudvirt nodes.
  • Managing non-automated upgrades.
  • Taking down a cloudvirt node for maintenance.

Problem statement

All these tasks still require manual operations, following a runbook when one is available. Such runbooks easily get outdated, and the procedures are prone to human error and require considerable attention to execute.


After reviewing several automation tools (Spicerack, SaltStack, Puppet, ...) and doing a quick POC, Ansible seems to be the most appropriate tool at this time.

This proposal is therefore to have a repository with all the playbooks, roles, modules and collections (explained in more detail below). The generic modules and roles can later be moved to other repositories for sharing if they turn out to be useful for other groups.

Repository structure

   ├── ansible.cfg
   ├── ansible_collections
   ├── collections.yml
   ├── inventory.ini
   ├── playbooks
   ├── requirements.txt
   └── requirements.yml


ansible.cfg: the file with the generic configuration, for example the path to the inventory.
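As an illustration, a minimal ansible.cfg for this layout could look like the following sketch (the values are assumptions, not the actual POC contents):

```ini
[defaults]
# Path to the default inventory file
inventory = ./inventory.ini
# Where custom collections live and where ansible-galaxy installs external ones
collections_paths = ./ansible_collections
```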


inventory.ini: the default list of hosts to act on. Currently it contains only the control node, as that is the host from which we execute actions on the cloud.
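For example, a minimal inventory could look like this (the hostname is hypothetical):

```ini
# Hosts we run playbooks against; currently just the control node
[control]
cloud-control-01.example.org
```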


playbooks: folder with the set of top-level playbooks, one for each task. If any of these tasks should be reused by another playbook, it is better to move them to a role and just import that role.
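As a sketch, such a top-level playbook wrapping a role could look like this (playbook, role and variable names are made up for illustration):

```yaml
# playbooks/add_etcd_member.yml (hypothetical)
- name: Add a new member to an etcd cluster
  hosts: control
  roles:
    # The actual logic lives in a role, so other playbooks can reuse it
    - role: etcd_member
      vars:
        new_node: "{{ new_node }}"
```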


requirements.txt: the Python modules needed to run the playbooks.


collections.yml / requirements.yml: the list of external collections needed to run the playbooks.
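Such a collection requirements file for ansible-galaxy usually looks like the following (the listed collection and version are only illustrative):

```yaml
collections:
  # Example external collection; the real list depends on the playbooks
  - name: community.general
    version: ">=1.0.0"
```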


ansible_collections: the tree with all the custom collections, and the place where ansible-galaxy will install the external dependencies.

Collections have their own plugins (modules and libraries) and roles (sets of tasks and variables). Currently there is no support for reusing playbooks through collections.
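The external collections can then be installed into that tree with ansible-galaxy, roughly like this (assuming the dependency list lives in requirements.yml):

```shell
# Install the external collections listed in the requirements file
# into the local ansible_collections tree
ansible-galaxy collection install -r requirements.yml -p ansible_collections
```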



Credentials

One of the advantages of Ansible is that it runs directly on your laptop, which allows you to reuse your local SSH and OpenStack credentials without having to install them on any other machine. For the POC a plaintext file (../passwordfile) is used, but we could use an encrypted file, a prompt, or some other mechanism.
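For example, the plaintext file could be replaced with an ansible-vault encrypted one; the commands below are a sketch of that standard mechanism (the playbook name is hypothetical):

```shell
# Encrypt the existing plaintext password file in place
ansible-vault encrypt ../passwordfile

# Ansible then prompts for the vault password when running a playbook
ansible-playbook playbooks/some_playbook.yml --ask-vault-pass
```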


Idempotency

These scripts are focused on operational procedures, so it would not be a big issue if they were not idempotent. Whenever possible though, especially at the module/role level, we should try to make them idempotent: running the same task twice should be possible and, when it makes sense (e.g. adding node <new_node> to the cluster), the task should do nothing if there is nothing to do.
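Most Ansible modules are idempotent by themselves; for raw commands a guard can skip the work when it is already done. A sketch of such a task, with illustrative module arguments:

```yaml
# Hypothetical role tasks: only add the node if it is not already a member
- name: Get the current cluster members
  command: etcdctl member list
  register: member_list
  changed_when: false  # a read-only query never changes state

- name: Add the new node to the cluster
  command: "etcdctl member add {{ new_node }}"
  when: new_node not in member_list.stdout  # nothing to do if already present
```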


Sharing code

If we manage to create generic enough modules/roles/playbooks, we can easily move them to their own repositories and share them, either through the ansible-galaxy repository or just by sharing the modular repositories.

There is also the possibility of creating some playbooks for users of the cloud. Though the current idea is to automate our own operational toil, being able to share the modules/roles opens that possibility too (though it might be worth trying to avoid users having to use them).

Unattended automation

It should be relatively easy to set up a host with access to OpenStack (probably in the infra project, as it needs OpenStack/SSH credentials) to run the exact same scripts using a bot account, for example as a first step in automatic disaster recovery.
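As a sketch, such a host could run the playbooks unattended from a crontab entry using the bot account's credentials (all names here are hypothetical):

```shell
# Run the recovery playbook every 10 minutes as the bot account,
# reading the vault password from a file instead of prompting
*/10 * * * * ansible-playbook playbooks/disaster_recovery.yml --vault-password-file /etc/wmcs-bot/vault-pass
```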

Playing with the POC

For details on how to test the POC check the file in the patch: