Wikimedia Cloud Services team/EnhancementProposals/Operational Automation

We currently use Puppet to automate most of our tasks, but it has its limitations. We still need a tool to '''automate, collect and review all our operational procedures'''. Some examples of such procedures are:
* Adding a new member to a Toolforge instance etcd cluster.
* Bootstrapping a new Toolforge instance.
* Searching for the host where a backup is kept.
* Provisioning a new cloudvirt node.
* Re-imaging all or a subset of the cloudvirt nodes.
* Managing non-automated upgrades.
* Taking down a cloudvirt node for maintenance.


== Problem statement ==
All these tasks still require manual operations, following a runbook when one is available. These runbooks easily get outdated, and the manual steps are prone to human error and require considerable attention to execute.


== Proposal ==
After reviewing several automation tools (spicerack, saltstack, puppet, ...) and doing a quick POC of the two most relevant candidates (see [[gerrit:647735]] and [[gerrit:658637]]; a summary of the experience is at https://etherpad.wikimedia.org/p/ansible_vs_spicerack), I've decided to propose Spicerack as the de facto tool for WMCS operational task automation.
 
=== Collaboration ===
The main advantage of choosing spicerack is collaboration with the rest of the SRE teams. <s>This comes with both the duty and privilege of becoming co-maintainers for spicerack and related projects, allowing us to have a say in the direction of the project and the use cases that will be supported. With the duty of driving, reviewing and maintaining the projects for all the users (including other SRE teams).</s>
 
=== Structure ===
The Spicerack ecosystem is split into several projects:
 
==== Cumin ====
[[Cumin]] is the lowermost layer. Built on top of [https://clustershell.readthedocs.io/en/latest/ ClusterShell], it takes care of translating host expressions into lists of hosts, running commands on them (using whichever strategy is selected) and returning the results.
 
This library should be pretty stable and require little to no changes.
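For illustration, here is a minimal sketch of driving Cumin from Python, loosely following its documented example; the config path, host expression and command are placeholders, not part of this proposal:

<syntaxhighlight lang="python">
# Minimal sketch of Cumin's Python API (placeholder config path, host expression and command).
from cumin import Config, query, transport, transports

config = Config('/etc/cumin/config.yaml')                # transport/backend configuration
hosts = query.Query(config).execute('cloudvirt100[1-3].eqiad.wmnet')  # host expression -> hosts

target = transports.Target(hosts)
worker = transport.Transport.new(config, target)
worker.commands = [transports.Command('uptime')]
worker.handler = 'sync'                                  # run the command in lockstep on all hosts
exit_code = worker.execute()

for nodes, output in worker.get_results():
    print(nodes, output.message().decode())
</syntaxhighlight>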
 
==== Wmflib ====
[[Python/Wmflib|Wmflib]] is a bundle of generic, commonly used functions related to the Wikimedia Foundation; it includes some helpful decorators (e.g. <code>retry</code>) and similar tools.
 
This library should be used for generic functions that are not bound to the Spicerack library and can be reused in other, non-Spicerack Wikimedia-related projects.
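As an example of the kind of generic helper Wmflib provides, a hedged sketch using its <code>retry</code> decorator (the port and timing values are illustrative only):

<syntaxhighlight lang="python">
# Sketch of wmflib's retry decorator: keep retrying a check with exponential backoff.
# The etcd client port (2379) and the timings are illustrative only.
import socket
from datetime import timedelta

from wmflib.decorators import retry


@retry(tries=5, delay=timedelta(seconds=10), backoff_mode='exponential', exceptions=(OSError,))
def wait_for_etcd(fqdn: str) -> None:
    """Retry until a TCP connection to the etcd client port succeeds."""
    socket.create_connection((fqdn, 2379), timeout=5).close()
</syntaxhighlight>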
 
==== Spicerack ====
[[Spicerack]] is the core library. It contains the more Wikimedia-specific modules and a CLI (<code>cookbook</code>) to interact with different services (e.g. toolforge.etcd), and it is meant to hold the core logic for any interaction with those services.
 
Here we will have to add, especially at the beginning, some libraries to interact with our services; this is also where most of the code reuse and collaboration will happen. We should always keep in mind that code added here can be used by other groups around the Foundation. Code in this library will be thoroughly tested, and no merges should happen without review.
 
==== Cookbooks ====
The [[Spicerack/Cookbooks]] repo contains the main recipes to execute. The only logic there should be orchestration; any service-management code should eventually be moved to the Spicerack library described above.
 
This repository of cookbooks will be shared with the rest of the SRE group, but our specific cookbooks will go under the <code>cookbooks/wmcs</code> directory.
Any helper library should go under <code>cookbooks/wmcs/__init__.py</code>, and we should periodically consider moving as much of that code as possible into Spicerack.
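As a rough illustration of that split, here is a minimal sketch of a module-style cookbook (one of the supported cookbook interfaces) whose only logic is orchestration; the cookbook's purpose, query and command are invented for the example:

<syntaxhighlight lang="python">
"""Check the uptime of a single WMCS node (illustrative example, not a real cookbook)."""
import argparse

from spicerack import Spicerack


def argument_parser() -> argparse.ArgumentParser:
    """Define the command-line arguments of the cookbook."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('--fqdn', required=True, help='FQDN of the node to check')
    return parser


def run(args: argparse.Namespace, spicerack: Spicerack) -> int:
    """Orchestration only: resolve the host through Cumin and run a command on it."""
    remote_host = spicerack.remote().query(f'D{{{args.fqdn}}}')  # direct backend query
    remote_host.run_sync('uptime')
    return 0
</syntaxhighlight>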
 
=== Execution of the cookbooks ===
As of now, these cookbooks can be run locally. I'm actively considering how to provide a host/VM/... with a Spicerack + Cumin setup for easy usage and for running long cookbooks, but for the time being we can start locally.
 
If your cookbooks only access bare-metal machines, you can already run them on the cumin hosts used for wiki operations (e.g. cumin1001.eqiad.wmnet), but those hosts have no access to the VMs as of this writing.
 
==== Local setup ====
To run the cookbooks locally, you will need to create a virtualenv and install the dependencies, for example:
 
<syntaxhighlight lang="shell-session">
$ python3 -m venv ~/.venvs/cookbooks
$ source ~/.venvs/cookbooks/bin/activate
</syntaxhighlight>
 
Then clone the cookbooks repo (see https://gerrit.wikimedia.org/r/admin/repos/operations/cookbooks):
 
<syntaxhighlight lang="shell-session">
$ git clone "https://$USER@gerrit.wikimedia.org/r/a/operations/cookbooks"
$ cd cookbooks
$ pip install -e .
</syntaxhighlight>
 
'''NOTE''': as of 2021-08-18 we are currently using the '''wmcs''' branch of the repo
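If you cloned the repo with the commands above, switch to that branch before running anything (e.g. <code>git checkout wmcs</code> from the repository root).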
 
'''NOTE''': as of 2022-07-21 you can use the installation script instead of manually creating the config files described below ([https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/wmcs/utils/generate_wmcs_config.sh utils/generate_wmcs_config.sh])
 
Then create the relevant config files. The first one is for Cumin; put it wherever you prefer (I recommend <code>~/.config/spicerack/cumin_config.yaml</code>), with the contents:
 
<syntaxhighlight lang=yaml>
---
transport: clustershell
log_file: cumin.log  # feel free to change this to another path of your choosing
default_backend: direct
environment: {}
clustershell:
    ssh_options:
        # needed for vms that repeat a name
        - |
            -o StrictHostKeyChecking=no
            -o "UserKnownHostsFile=/dev/null"
            -o "LogLevel=ERROR"
</syntaxhighlight>
 
And another one for Spicerack itself (the <code>cookbook</code> CLI); I'll use <code>~/.config/spicerack/cookbook_config.yaml</code>:
 
<syntaxhighlight lang="yaml">
---
# adapt to wherever you cloned the repo
cookbooks_base_dir: ~/Work/wikimedia/operations/cookbooks
logs_base_dir:  /tmp/spicerack_logs
instance_params:
    # pending a pip release: cumin_config: ~/.config/spicerack/cumin-config.yaml
    # for now you'll have to use the full path, change YOURUSER with your actual user
    cumin_config: /home/YOURUSER/.config/spicerack/cumin_config.yaml
</syntaxhighlight>
 
 
With those config files in place you can now run the client. From the root of the <code>operations/cookbooks</code> repository, you can list all the cookbooks:


<syntaxhighlight lang=shell-session>
$ cookbook -c ~/.config/spicerack/cookbook_config.yaml --list
</syntaxhighlight>
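To run a specific cookbook instead of listing them, pass its dotted name (plus any of its own arguments) in place of <code>--list</code>, for example <code>cookbook -c ~/.config/spicerack/cookbook_config.yaml wmcs.toolforge.grid.lib.get_cluster_status</code> (a name taken from the examples in the next section).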


=== Cookbook naming proposal ===


'''NOTE''': see the latest conventions in the [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/wmcs/cookbooks/wmcs/README README file]


Some naming rules:
# All the commonly used cookbooks for a service (toolforge, vps, ceph, ...) should be at the top level of that service, even if they are just a proxy for a lower-level one.
# Individual cookbooks that provide just a functionality or a subset of a functionality are considered ''library'' and should be grouped in a final '''.lib.''' submodule.
# Group cookbook packages by service, then technology, then subcomponent; if they are common across many services, then just by technology.
# Use meaningful, specific, non-generic names: a verb and, if needed, a noun (err on the side of being extra explicit).
# Reuse the explicitly defined keywords, see the list below.


Some known services and packages are:
* <code>openstack</code>: lower layer of the infrastructure for the Cloud VPS service.
* <code>vps</code>: operations with the Openstack APIs.
* <code>nfs</code>: NFS-related stuff.
* <code>toolforge</code>: everything Toolforge.
* <code>toolforge.grid</code>: everything Toolforge grid infrastructure.
* <code>toolforge.k8s</code>: everything Toolforge Kubernetes infrastructure.
* <code>toolforge.k8s.etcd</code>: everything Toolforge Kubernetes etcd infrastructure.


Some well-known keywords:
* <code>ensure</code>: makes sure a condition is met, and acts to fulfill it if not (see the sketch after this list).
* <code>create</code>: every time a cookbook with this keyword runs, a new resource is created.
* <code>remove</code>: every time a cookbook with this keyword runs, a resource is deleted.
* <code>scale</code>: every time a cookbook with this keyword runs, a given service is scaled up (ex. a node is created and pooled).
* <code>downscale</code>: every time a cookbook with this keyword runs, a given service is scaled down (ex. a node is drained and removed).
* <code>join</code>: used when a resource is configured to be part of a service, cluster or similar. It may or may not be pooled.
* <code>pool</code>: start scheduling load on the resource.
* <code>depool</code>: stop scheduling new load on the resource; already running load might keep running.
* <code>drain</code>: remove any workload running on the resource, and prevent new ones from getting scheduled.
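To make the <code>ensure</code> semantics concrete, a purely hypothetical sketch (none of these helpers exist; it only illustrates the check-then-act, idempotent pattern):

<syntaxhighlight lang="python">
# Hypothetical sketch of "ensure" semantics: check the current state first and only
# act when needed, so running the cookbook twice is safe.
from typing import List


class EtcdCluster:
    """Stand-in for a real cluster client; only here to make the example self-contained."""

    def __init__(self, members: List[str]) -> None:
        self._members = members

    def members(self) -> List[str]:
        return list(self._members)

    def add_member(self, fqdn: str) -> None:
        self._members.append(fqdn)


def ensure_etcd_member(cluster: EtcdCluster, fqdn: str) -> None:
    """Make sure fqdn is a member of the cluster, adding it only if it is missing."""
    if fqdn in cluster.members():
        return  # condition already met: nothing to do
    cluster.add_member(fqdn)
</syntaxhighlight>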


A good example:
<pre>
wmcs.toolforge.scale_grid_exec
wmcs.toolforge.scale_grid_webgen
wmcs.toolforge.scale_grid_weblight

wmcs.toolforge.grid.lib.get_cluster_status
wmcs.toolforge.grid.lib.reconfigure

wmcs.toolforge.grid.node.lib.create_join_pool
wmcs.toolforge.grid.node.lib.join
wmcs.toolforge.grid.node.lib.depool
wmcs.toolforge.grid.node.lib.pool
wmcs.toolforge.grid.node.lib.depool_remove
</pre>


A bad example:


<pre>
wmcs.toolforge.scale                             <-- WRONG: scale what?
wmcs.toolforge.reboot                            <-- WRONG: reboot what?
wmcs.toolforge.reboot_node                       <-- WRONG: this should probably be wmcs.toolforge.xxxx.node.lib.reboot instead

wmcs.toolforge.grid.lib.add                      <-- WRONG: add what?
wmcs.toolforge.grid.lib.configuration            <-- WRONG: configure what?

wmcs.toolforge.grid.node.lib.create_exec_node    <-- WRONG: this should probably be an entry-level cookbook (i.e. wmcs.toolforge.create_exec_node)
</pre>


== Note ==
'''NOTE''': This is a very premature proposal and this workflow will be improved; feel free to start any questions/discussions on the talk page or ping me directly ([[User:David Caro|David Caro]] ([[User talk:David Caro|talk]]) 17:21, 5 February 2021 (UTC)).
