Wikimedia Cloud Services team/EnhancementProposals/Operational Automation: Difference between revisions
imported>BryanDavis (→Local setup: show python3 native venv management in example command) |
imported>David Caro (Incomplete random thoughts) |
Revision as of 11:49, 11 January 2022
We currently use Puppet to automate most of our tasks, but it has its limitations. We still need a tool to automate, collect and review all our operational procedures. Some examples of such procedures are:
- Adding a new member to a toolforge instance etcd cluster.
- Bootstrapping a new toolforge instance.
- Searching for the host where a backup is kept.
- Provisioning a new cloudvirt node.
- Re-imaging all or a subset of the cloudvirt nodes.
- Managing non-automated upgrades.
- Taking down a cloudvirt node for maintenance.
Problem statement
All these tasks still require manual operations, following a runbook whenever one is available. Runbooks easily become outdated, are prone to human error, and require considerable attention to execute.
Proposal
After reviewing several automation tools (spicerack, saltstack, puppet, ...), and doing a quick POC (see gerrit:647735 and gerrit:658637) for the two most relevant (a summary of the experience is at https://etherpad.wikimedia.org/p/ansible_vs_spicerack), I've decided to propose spicerack as the de facto tool for automating WMCS operational tasks.
Collaboration
The main advantage of choosing spicerack is collaboration with the rest of the SRE teams. This comes with both the duty and the privilege of becoming co-maintainers of spicerack and related projects: we get a say in the direction of the project and the use cases that will be supported, and in exchange we take on the duty of driving, reviewing and maintaining the projects for all the users (including other SRE teams).
Structure
The Spicerack ecosystem is split into several projects:
Cumin
Cumin is the lowermost layer. Built on top of ClusterShell, it takes care of translating host expressions to hosts, running commands on them (using whatever strategy is selected) and returning the results.
This library should be pretty stable and require little to no changes.
Wmflib
Wmflib is a bundle of generic, commonly used functions related to the Wikimedia Foundation. It has some helpful decorators (e.g. retry) and similar tools.
This library should be used for generic functions that are not bound to the spicerack library and can be reused in other non-spicerack wikimedia related projects.
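To illustrate the kind of helper meant here, below is a minimal, self-contained sketch of a retry decorator. It is not wmflib's implementation, and the parameter names (tries, delay, exceptions) are assumptions made for this example; check the wmflib documentation for the real signature.

```python
import functools
import time


def retry(tries=3, delay=0.0, exceptions=(Exception,)):
    """Hypothetical stand-in for a retry decorator (not wmflib's actual API)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts, let the caller see the error
                    time.sleep(delay)

        return wrapper

    return decorator


calls = {"count": 0}


@retry(tries=3)
def flaky_check():
    """Fail twice, then succeed, to mimic a service that is slow to come up."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("service not ready yet")
    return "ok"


result = flaky_check()
```

A helper like this has no dependency on Spicerack, which is the reason it would belong in a generic library rather than next to the cookbooks.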
Spicerack
Spicerack is the core library. It contains more Wikimedia-specific libraries and a CLI (cookbook) to interact with different services (e.g. toolforge.etcd), and is meant to store the core logic for any interaction with the services.
Here we will have to add, especially at the beginning, some libraries to interact with our services. This is also where most of the code reuse and collaboration will happen, so we should always keep in mind what could be used by other groups around the foundation. Code in this library will be considerably tested, and no merges should happen without review.
Cookbooks
The Spicerack/Cookbooks repo contains the main recipes to execute. The only logic there should be orchestration, and any service management code should eventually be moved to the Spicerack library described above.
This repository of cookbooks will be shared with the rest of the SRE group, but our specific cookbooks will go under the cookbooks/wmcs directory.
Any helper library should go under cookbooks/wmcs/__init__.py, and we should periodically consider moving as much of the code from there to Spicerack as possible.
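As a rough illustration of "orchestration only", the sketch below follows the module-style cookbook layout (a __title__, an argument_parser() function and a run(args, spicerack) function). The cookbook itself, the node name, the placeholder command and the FakeSpicerack class with its run_on method are all invented for this example so that it runs without Spicerack installed; they are not Spicerack's API.

```python
import argparse

__title__ = "Depool a grid node (illustrative sketch)"


def argument_parser():
    """Declare the command-line arguments of the cookbook."""
    parser = argparse.ArgumentParser(description=__title__)
    parser.add_argument("--node", required=True, help="Name of the node to depool")
    return parser


def run(args, spicerack):
    """Orchestration only: decide what to do and delegate the actual work."""
    spicerack.run_on(args.node, "echo depool")  # placeholder command, hypothetical helper
    return 0  # an exit code of 0 signals success


class FakeSpicerack:
    """Invented stand-in so the sketch runs without Spicerack installed."""

    def __init__(self):
        self.executed = []

    def run_on(self, host, command):
        # Record the call instead of touching any real host.
        self.executed.append((host, command))


fake = FakeSpicerack()
args = argument_parser().parse_args(["--node", "grid-node-01"])
exit_code = run(args, fake)
```

Keeping run() this thin is what makes it practical to later move the service-specific part into the Spicerack library.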
Execution of the cookbooks
As of now, these cookbooks can be run locally. I'm actively considering how to provide a host/VM/... with a spicerack + cumin setup for easy usage and for running long cookbooks, but for now we can start locally.
If your cookbooks are only accessing bare metal machines, you can already run them on the cumin hosts for the wiki operations (ex. cumin1001.eqiad.wmnet), but those hosts have no access to the VMs as of writing this.
Local setup
To run the cookbooks locally, you will need to create a virtualenv and install the dependencies, for example:
$ python3 -m venv spicerack
$ source spicerack/bin/activate
Then clone the cookbooks repo (see https://gerrit.wikimedia.org/r/admin/repos/operations/cookbooks):
$ git clone "https://$USER@gerrit.wikimedia.org/r/a/operations/cookbooks"
$ cd cookbooks
$ pip install -e .
NOTE: as of 2021-08-18 we are currently using the wmcs branch of the repo
Then create the relevant config files. The first one is for cumin and can live wherever you prefer; I recommend ~/.config/spicerack/cumin_config.yaml, with the contents:
---
transport: clustershell
log_file: cumin.log  # feel free to change this to another path of your choosing
default_backend: direct
environment: {}
clustershell:
    ssh_options:
        # needed for vms that repeat a name
        - |
            -o StrictHostKeyChecking=no
            -o "UserKnownHostsFile=/dev/null"
            -o "LogLevel=ERROR"
And another for spicerack itself (the cookbook CLI); I'll use ~/.config/spicerack/cookbook_config.yaml:
---
# adapt to wherever you cloned the repo
cookbooks_base_dir: ~/Work/wikimedia/operations/cookbooks
logs_base_dir: /tmp/spicerack_logs
instance_params:
    # pending a pip release: cumin_config: ~/.config/spicerack/cumin-config.yaml
    # for now you'll have to use the full path, change YOURUSER with your actual user
    cumin_config: /home/YOURUSER/.config/spicerack/cumin_config.yaml
With those config files in place, you can now run the client. From the root of the operations/cookbooks repository, you can list all the cookbooks:
$ cookbook -c ~/.config/spicerack/cookbook_config.yaml --list
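Since the cumin_config entry currently needs a full path, the two files can also be generated with a small script that fills in the absolute paths. This is only a sketch: write_configs is a hypothetical helper, not part of spicerack or cumin, and the demo at the end writes into a temporary directory rather than ~/.config/spicerack.

```python
import tempfile
from pathlib import Path

CUMIN_TEMPLATE = """\
---
transport: clustershell
log_file: cumin.log
default_backend: direct
environment: {}
clustershell:
    ssh_options:
        - |
            -o StrictHostKeyChecking=no
            -o "UserKnownHostsFile=/dev/null"
            -o "LogLevel=ERROR"
"""

COOKBOOK_TEMPLATE = """\
---
cookbooks_base_dir: {cookbooks_dir}
logs_base_dir: /tmp/spicerack_logs
instance_params:
    cumin_config: {cumin_config}
"""


def write_configs(config_dir, cookbooks_dir):
    """Write both config files and return their absolute paths (hypothetical helper)."""
    config_dir = Path(config_dir).expanduser().resolve()
    config_dir.mkdir(parents=True, exist_ok=True)
    cumin_path = config_dir / "cumin_config.yaml"
    cookbook_path = config_dir / "cookbook_config.yaml"
    # The cumin config is written verbatim; only the cookbook config has placeholders.
    cumin_path.write_text(CUMIN_TEMPLATE)
    cookbook_path.write_text(
        COOKBOOK_TEMPLATE.format(
            cookbooks_dir=Path(cookbooks_dir).expanduser().resolve(),
            cumin_config=cumin_path,
        )
    )
    return cumin_path, cookbook_path


# Demo in a throwaway directory so nothing under the home directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    cumin_path, cookbook_path = write_configs(Path(tmp) / "spicerack", Path(tmp) / "cookbooks")
    generated = cookbook_path.read_text()
```

For real use you would call write_configs("~/.config/spicerack", "<path to your clone>") and pass the resulting cookbook_config.yaml to cookbook -c.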
Cookbook naming proposal
Some naming rules:
- (dcaro) All the commonly used cookbooks for a service* (toolforge, vps, ceph, ...) should be at the top level of that service, even if they are just a proxy for a lower-level one (this allows easy discovery of the entry points):
  - wmcs.toolforge.add_k8s_control_node
  - wmcs.toolforge.add_k8s_worker_node
- Group by service, then technology, then subcomponent; if they are common to many services, then just by technology:
  - wmcs.toolforge.grid. cookbooks for the toolforge grid
    - wmcs.toolforge.grid.queues. cookbooks to manage grid queues specific to the toolforge grid instance
  - wmcs.toolforge.k8s. cookbooks for toolforge k8s
  - wmcs.vps. cookbooks for generic operations on Cloud VPS
  - wmcs.openstack. cookbooks related to the openstack infrastructure (e.g. cloudvirts)
  - wmcs.k8s. cookbooks related to generic k8s infrastructure
- Use meaningful names (dcaro: this seems not specific enough, and quite subjective; maybe add more pointers like the ones below):
  - wmcs.toolforge.grid.reconfigure, a cookbook that reconfigures the grid
  - wmcs.toolforge.grid.depool_node, a cookbook that depools a grid node
  - (dcaro) Use a verb for the cookbook name (the last section), and if needed a subject:
    - Bad example without a verb: wmcs.toolforge.grid.configuration
    - Good example with a verb: wmcs.toolforge.grid.configure
    - Better example with verb and subject: wmcs.toolforge.grid.configure_queue_size
  - (dcaro) Avoid using generic terms by themselves:
    - Bad example: wmcs.toolforge.grid.setup or wmcs.toolforge.grid.add
    - Good example: wmcs.toolforge.grid.bootstrap_cluster or wmcs.toolforge.grid.set_initial_configuration
- Mind some keywords:
  - enroll, self-explanatory, example:
    - wmcs.toolforge.grid.add_node_to_cluster could be named node_enroll; the cluster keyword is redundant (already implied by the grid package).
      - dcaro: I don't agree with this being self-explanatory. For me, grid does not straight away mean cluster (maybe I'm missing some implicit knowledge there) and enroll does not straight away mean joining a cluster.
  - create, create a resource that didn't exist before, example:
    - wmcs.toolforge.grid.create_node, creates a new node
  - ensure, make sure a resource is in the desired state, and act otherwise, example:
    - wmcs.toolforge.grid.queues.ensure_clean_of_errors, makes sure the grid queues are clean of errors
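Some of these rules are mechanical enough to be checked automatically. The sketch below is one possible way to do that; check_cookbook_name, the verb list and the generic-term list are assumptions made for this example (taken from the good and bad examples in the list), not an agreed convention or an existing tool.

```python
# Terms called out as too generic when used on their own (from the bad examples).
GENERIC_TERMS = {"setup", "add"}
# Leading verbs seen in the good examples; a real list would need to be agreed on.
KNOWN_VERBS = {"add", "bootstrap", "configure", "create", "depool", "ensure", "reconfigure", "set"}


def check_cookbook_name(name):
    """Return a list of naming problems for a dotted cookbook name (hypothetical helper)."""
    problems = []
    parts = name.split(".")
    if parts[0] != "wmcs":
        problems.append("name should start with 'wmcs.'")
    action = parts[-1]  # the last section is the cookbook name itself
    if action in GENERIC_TERMS:
        problems.append("'%s' is too generic by itself" % action)
    elif action.split("_")[0] not in KNOWN_VERBS:
        problems.append("'%s' does not start with a known verb" % action)
    return problems


bad_noun = check_cookbook_name("wmcs.toolforge.grid.configuration")
bad_generic = check_cookbook_name("wmcs.toolforge.grid.setup")
good = check_cookbook_name("wmcs.toolforge.grid.configure_queue_size")
```

A check like this cannot judge whether a name is meaningful, so it would only complement review, not replace it.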
(dcaro): * Here services are ill-defined; a proposal would be:
- openstack -> infra for the vps service
- vps -> having a project in openstack
- toolforge.k8s -> k8s infra for the toolforge service
- toolforge.grid -> grid infra for the toolforge service
- toolforge.jobs -> periodic jobs (grid + jobservice)
- toolforge.webservice -> webservices (grid + k8s)
Though I might be mixing user with admin cookbooks/services here...
(dcaro): I'm still thinking about how to distinguish between overly specific cookbooks that deal with just a detail of a process and user entry point cookbooks, and if/how to make them implementation-unaware (e.g. would toolforge.k8s.scale_up be ok?)
Note
NOTE: This is a very premature proposal and this workflow will be improved. Feel free to start any questions/discussions on the talk page or ping me directly (David Caro (talk) 17:21, 5 February 2021 (UTC)).