Puppet/Pontoon

Reach out to godog / Filippo Giunchedi on #wikimedia-sre if you'd like more information and/or assistance.

Quickstart (five minutes)

To get started with Pontoon you'll need a Puppet server and a name for your stack: copy modules/pontoon/files/bootstrap.sh to a newly provisioned Cloud VPS host and execute it as ./bootstrap.sh <stack name>. When the script has finished you'll be presented with instructions on what to do next. After that, read how to enroll a new host, and how to make roles work in Pontoon to help you get started.

Problem statement

We, as SRE at the Foundation, work together to keep the production infrastructure up and running. A significant chunk of our work relates to making changes to our public Puppet repository. Routine changes vary widely in how intrusive they are to the infrastructure. Consider the following:

  • fine-tuning parameters for production services
  • deploying new services
  • rolling out operating system upgrades

Generally speaking, every change introduces risks that we have learned to accept. We have also deployed suitable mitigations to those risks such as:

  • code reviews
  • the Puppet compiler
  • Puppet realms other than 'production'.

In an ideal world we would be able to minimize risks on every change before going to production. Being able to test changes within a testbed stack (i.e. a "virtual production") greatly reduces risks and enables experimentation in a safe way.

Today

Setting up such stacks is possible today, but certainly not in a "disposable" fashion.

The word disposable in this context means that the stacks should be easy to set up and tear down, and are isolated from one another (i.e. self-contained as much as possible). For all intents and purposes the stacks resemble production, but receive less (possibly zero) user traffic. Each stack also carries data that should be initialized and is stack-specific (e.g. private data).

SRE teams today set up WMCS instances to test changes, and roles are assigned via Horizon. Hiera data comes from different sources (Horizon and Puppet), and lookups within hieradata don't work the same way as in production. The result is duplication of multiple variables and, often, banging Hiera data and variables together until Puppet runs successfully.

This works but it requires duplication of variables and the resulting patch can't be applied to production as-is. In a perfect world we would have role assignment done the exact same way as production and Hiera variables looked up the same: a common default and override only what changes (e.g. domain names, hostnames, etc).

Pontoon

Pontoon (in the k8s nautical theme: a recreational floating device) explores the idea of disposable stacks as similar to production as possible. The key idea and goal is that the Puppet code base should not depend on hardcoded production-specific values.

Pontoon features include:

  • Role assignment happens by mapping a role to a list of hostnames that need such role.
  • The role mapping is used by Pontoon to drive its Puppet external node classifier (ENC). The ENC also supplies extra variables generated from the mapping.
  • Hostnames listed in the mapping will have their Puppet certificates automatically signed on the first Puppet run.
  • Load balancing and service discovery compatibility with production. See also the Services page for more information.

The explicit role-to-hostnames mapping enables meta-programming Puppet, which in turn enables replacing lists of hostnames (e.g. firewall rules) in Hiera with variables containing "all hosts for role foobar" at catalog compile time.
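To make the mapping idea concrete, here is a minimal Python sketch of how a role mapping can be turned into a per-host role and into "all hosts for role" variables. It is not Pontoon's actual ENC: the function names, the second hostname and the rule that flattens "::" to "_" when building variable names are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical rolemap, mirroring the role -> hostnames structure of rolemap.yaml.
ROLEMAP: Dict[str, List[str]] = {
    "graphite::frontend": [
        "myfrontend-01.graphite.eqiad1.wikimedia.cloud",
        "myfrontend-02.graphite.eqiad1.wikimedia.cloud",  # hypothetical second host
    ],
}


def variable_suffix(role: str) -> str:
    # Assumption: "::" in a role name is flattened to "_" to build a variable name.
    return role.replace("::", "_")


def role_for_host(rolemap: Dict[str, List[str]], fqdn: str) -> Optional[str]:
    """Return the role assigned to fqdn, or None if the host is not in the stack."""
    for role, hosts in rolemap.items():
        if fqdn in hosts:
            return role
    return None


def derived_variables(rolemap: Dict[str, List[str]]) -> Dict[str, object]:
    """Build the host-list and 'first host' variables for every role in the mapping."""
    variables: Dict[str, object] = {}
    for role, hosts in rolemap.items():
        suffix = variable_suffix(role)
        variables[f"__hosts_for_{suffix}"] = list(hosts)
        if hosts:
            # The first host listed for a role is treated as its "master".
            variables[f"__master_for_{suffix}"] = hosts[0]
    return variables
```

With this shape, adding or removing a host in the mapping is the only change needed for every consumer of the derived variables to pick it up at the next catalog compilation.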

As of April 2020 the implementation consists of the following:

  • A standalone puppet server with the production hiera-rchy. Two additional lookup files are provided too: one at the top of the hierarchy to be able to override production defaults and one at the bottom to be able to supply Pontoon-specific values, possibly auto-generated. The latter file (auto.yaml) is used by Pontoon to work around some Hiera limitations (namely that variables from the ENC are strings, whereas in some cases we need lists).
  • The standalone puppet server is driven by Pontoon's ENC.
  • Realm is 'labs', to keep subsystems like authentication working as expected.
  • Stack-specific data (e.g. the "root of trust", the Puppet CA, etc) must be initialized manually.

Benefits

Why go through all this trouble of isolated stacks similar to production? There are several benefits to having a bespoke testbed:

  • Lower overhead and faster iteration cycles for new ideas and services.
  • Increased confidence that a patch will work as expected once applied in production. For example being able to test distribution upgrades in isolation.
  • External reusability of the Puppet code base is improved as a bonus/side effect: we're factoring out assumptions about production and third parties can recreate similar environments. Similarly, contributing to the Puppet code base itself is made easier if recreating an isolated "mini-production" is possible without jumping through too many hoops.

Demo

See the asciinema recording at https://asciinema.org/a/WqirmPmHSlHa0LzN5dZLxgCAR

  • In the demo video I'm testing the migration of the graphite::production role to Buster. To do so, I'm adding a freshly provisioned Buster WMCS instance to a self-hosted Pontoon puppet server.
  • I'm confirming which role I want (graphite::production in this case) and proceeding to change the 'observability' stack. I'm adding the new role and then assigning the newly provisioned host to it.
  • I'm then committing the change and pushing the repo's HEAD to the Pontoon puppet server as if it was the 'production' branch.
  • Next, I'm taking over (i.e. enrolling) the graphite host. The enroll script needs to know which stack to use (to locate the puppet server) and the hostname to enroll. The script will then log in, make the necessary adjustments (namely changing the puppet server and deleting the host's certificates) and kick off a puppet run.
  • Note that auto signing is disabled yet the puppet server issues the certificate because the host is present in the stack file and thus authorized/recognized.
  • The puppet run then proceeds as expected and graphite::production is applied. Some failures are to be expected: for example, custom Debian packages that are not yet available in buster-wikimedia, or that become installable only after the first puppet run (when apt sources are changed).
  • Next I run apt update manually and validate that another puppet agent run is possible.
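The authorization rule shown in the demo (a host gets its certificate only if it appears in the stack file) boils down to a membership test against the role mapping. The following Python sketch illustrates that rule only; it is not how Pontoon's puppet server implements it, and the function name and sample data are hypothetical.

```python
from typing import Dict, List


def is_authorized(rolemap: Dict[str, List[str]], certname: str) -> bool:
    """Return True if certname is listed under any role in the stack's mapping."""
    return any(certname in hosts for hosts in rolemap.values())


# Hypothetical stack contents for illustration.
rolemap = {
    "graphite::production": ["graphite-01.graphite.eqiad1.wikimedia.cloud"],
}
```

A host that is missing from the mapping is simply not recognized, which is why blanket auto signing can stay disabled.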

Howto

This section outlines how to try out Pontoon yourself. The idea is to replicate production's model of one-host-per-role, in other words have a Pontoon server and several agent instances. Development happens locally on your workstation via a checkout of puppet.git and changes are pushed to the Pontoon server.

Server bootstrap

The puppet server is the first host to set up in a stack, therefore it has to be bootstrapped. Bootstrapping is more complicated than enrolling subsequent hosts into an existing stack. Most details about the process are coded in modules/pontoon/files/bootstrap.sh and instructions are provided once bootstrap has completed.

To start your new stack you will need the following:

  1. A local checkout of puppet.git
  2. A Cloud VPS Buster instance (g3.cores1.ram2.disk20 flavor or above). It is recommended to name the host after the function/role plus an integer, e.g. puppet-01 in this case
  3. A name for your new stack. It is recommended to pick a short name after your team and the stack's function, e.g. o11y-alerts
  4. The bootstrap.sh script present on the host to bootstrap. The script can be scp'd from your local puppet.git checkout, must be executable and in your user's $HOME. For convenience's sake on the host you can install the current version with:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/pontoon/files/bootstrap.sh?format=text' | base64 -d > bootstrap.sh && chmod a+x bootstrap.sh

With the requirements above in place, you can proceed with the bootstrap:

  1. SSH as your user to the host
  2. Issue sudo ./bootstrap.sh your_stack_name

The bootstrap script will complete in approximately two minutes if everything goes well. After completion you will need to finalize the bootstrap locally on your computer by creating the new stack in puppet and committing the result. The script will print out instructions for you to get started with your new stack (also saved on the host at /etc/README.pontoon).

Add a new host

This section outlines how to add a new (non Puppet server) host to an existing Pontoon stack.

  1. Pick a name for your instance (e.g. myfrontend-01) and provision a new instance in Horizon. Make sure the correct Horizon project is selected in the web interface. We'll use the graphite project in this example. If you don't know the required instance specs yet, start with the smallest available.
  2. Add the host's FQDN and its role to your stack's modules/pontoon/files/STACK/rolemap.yaml, e.g.
 graphite::frontend:
   - myfrontend-01.graphite.eqiad1.wikimedia.cloud
  3. Commit the result and push to your STACK's Pontoon server:
 git commit -m "pontoon: add host" modules/pontoon/files/STACK/rolemap.yaml
 git push -f pontoon-STACK HEAD:production
  4. Enroll the new host. The script will take care of waiting for the host to be accessible, deleting the current puppet SSL keypair and flipping the host to the Pontoon server. Run this on your development machine:
 modules/pontoon/files/enroll.py --stack STACK myfrontend-01.graphite.eqiad1.wikimedia.cloud
  5. Puppet agent failures are likely at this point: tweak puppet/hiera locally as needed and push to the server as above. See the Hiera section for more information.
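The commit, push and enroll steps above always take the same shape for a given stack and host, so they can be generated rather than typed. Below is a minimal Python sketch that only builds and prints the command lines from this section; the helper name enroll_commands is hypothetical, and the stack name is the example one from the bootstrap section.

```python
import shlex
from typing import List


def enroll_commands(stack: str, fqdn: str) -> List[List[str]]:
    """Return the argv lists for committing the rolemap, pushing it and enrolling fqdn."""
    rolemap = f"modules/pontoon/files/{stack}/rolemap.yaml"
    return [
        ["git", "commit", "-m", "pontoon: add host", rolemap],
        ["git", "push", "-f", f"pontoon-{stack}", "HEAD:production"],
        ["modules/pontoon/files/enroll.py", "--stack", stack, fqdn],
    ]


if __name__ == "__main__":
    # Print shell-quoted commands for review; nothing is executed here.
    for argv in enroll_commands(
        "o11y-alerts", "myfrontend-01.graphite.eqiad1.wikimedia.cloud"
    ):
        print(shlex.join(argv))
```

Printing instead of executing keeps the sketch safe to run anywhere; the commands are meant to be run from the root of your puppet.git checkout after editing rolemap.yaml.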

Join an existing stack

When the stack exists already (i.e. there's a rolemap.yaml file in modules/pontoon/files/STACK) you can join it by:

  1. Set up the remote server to act as a git remote: modules/pontoon/files/config.py --stack STACK setup-remote
  2. List the commands you need to run locally to configure the remote: modules/pontoon/files/config.py --stack STACK git-config-remote

The last command will configure remotes such as pontoon-STACK, ready to accept your changes. The Pontoon server reads changes from the production branch, so remember to force-push your changes there, e.g. git push -f pontoon-STACK HEAD:production. Make sure to read team collaboration and git branches to learn more about the branch structure for multiple teams.

Hiera

One of the key goals of Pontoon's "look and feel" is to be as close as possible to production. To this end, there are two guidelines to keep in mind when writing your stack’s hiera:

  • Minimal: only variables differing from production should be in your stack’s hiera (e.g. resource limits). If you are setting a variable with the same value as production, include it in production only and not in your stack.
  • Generic: group your hiera settings files by the functionality they enable. Shared settings files are also available to be included in your stack for common functionality (e.g. puppetdb.yaml, prometheus.yaml, etc)

Caveats and limitations

Writing a stack’s hiera can be as straightforward as setting a few variables, however there are a few caveats to keep in mind:

  • Replace lists of hostnames with their role when possible. To do so, use "%{alias('__hosts_for_ROLE')}" as your variable’s value. The result will be expanded at lookup time with a list of hosts running the role in rolemap.yaml. Not having hardcoded hostnames truly makes hiera settings generic with respect to a particular stack and thus shareable with other stacks. There's also a crude "master election" available: "%{alias('__master_for_ROLE')}" will expand to a string with the first host running ROLE in rolemap.yaml.
  • Only one role at a time can be expanded and used as a value: the alias function call must be the only value. No concatenation of role hostlist variables is possible from within hiera.
  • Sometimes you’ll have to hardcode hostnames, for example nested data structures with each host in a role being the hash’s key.
  • No interpolation of host lists via alias() is possible. For example, variables requiring a list of host:port will require hardcoded hostnames, or splitting 'port' into its own variable.
  • Per-host hiera overrides are available, however generic settings are preferred.
  • You will have to make compromises on production features to enable. This problem usually manifests when first porting your role(s) to Pontoon. Ideally your stack enables all production (sub)systems that are relevant to you. Sometimes though having all subsystems available is not possible or practical. In these cases consider disabling the system/feature via your stack’s hiera. TODO include examples
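The "alias must be the only value" caveat is easy to trip over, so a small check can help before pushing. The Python sketch below flags top-level string values that mix an alias() call with any other text; the function name and the sample keys are hypothetical, and nested data structures are deliberately not inspected.

```python
import re
from typing import Any, Dict, List

# Matches a value that is exactly one alias() interpolation and nothing else.
WHOLE_ALIAS = re.compile(r"""^%\{alias\((['"])[^'"]+\1\)\}$""")
# Matches the start of an alias() interpolation anywhere in a string.
ANY_ALIAS = re.compile(r"%\{alias\(")


def misused_aliases(settings: Dict[str, Any]) -> List[str]:
    """Return the keys whose string value combines alias() with other text."""
    bad = []
    for key, value in settings.items():
        if not isinstance(value, str):
            continue
        if ANY_ALIAS.search(value) and not WHOLE_ALIAS.match(value):
            bad.append(key)
    return bad


# Hypothetical settings for illustration.
settings = {
    "hypothetical::hosts": "%{alias('__hosts_for_ROLE')}",         # fine: alias alone
    "hypothetical::endpoint": "%{alias('__master_for_ROLE')}:80",  # flagged: extra text
    "hypothetical::port": 80,                                      # not a string, skipped
}
```

In the flagged case the usual fix is the one described above: keep the alias as the whole value and move the port into its own variable.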

Lookup order

Your stack’s hiera sits above production and thus takes precedence over it. All other production functionality (e.g. role lookups) will be performed as usual. The relevant files and paths (in the order they are looked up, first match wins) are the following:

modules/pontoon/files/STACK/hiera/hosts/
This path allows for host-specific hiera settings if desired. Similarly to production, HOSTNAME.yaml will be searched for hiera settings.
modules/pontoon/files/STACK/hiera/
This is the main path for hiera overrides for your STACK. This path takes precedence over production's hiera. All *.yaml files in this directory will be searched for variables, irrespective of their name. Typically files are named after the general area/service that they affect, and/or which feature they enable. In some cases the files are generic and shared among stacks with symlinks; for example puppetdb.yaml contains the minimal settings for a functional puppetdb in Pontoon, and the file links to the shared puppetdb.yaml.
hieradata/pontoon.yaml
Common to all Pontoon stacks, changes to this file shouldn't be needed.
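The "first match wins" order above can be pictured as walking a list of layers from most to least specific. The following Python sketch models each layer as a plain dictionary; it is an illustration of the lookup order only, not Hiera itself, and the keys and values are hypothetical. All *.yaml files of the stack directory are collapsed into one layer here, since their relative order is not specified in this page.

```python
from typing import Any, Dict, List, Tuple

Layer = Tuple[str, Dict[str, Any]]


def lookup(key: str, layers: List[Layer]) -> Tuple[str, Any]:
    """Return (source, value) from the first layer that defines key."""
    for source, data in layers:
        if key in data:
            return source, data[key]
    raise KeyError(key)


# Layers in lookup order; contents are hypothetical.
LAYERS: List[Layer] = [
    ("modules/pontoon/files/STACK/hiera/hosts/HOSTNAME.yaml", {"hypothetical::workers": 1}),
    ("modules/pontoon/files/STACK/hiera/*.yaml", {"hypothetical::workers": 2,
                                                   "hypothetical::domain": "example.org"}),
    ("hieradata/pontoon.yaml", {}),
    ("production hiera", {"hypothetical::workers": 16,
                          "hypothetical::domain": "wikimedia.org",
                          "hypothetical::enabled": True}),
]
```

A key defined only in production falls through every Pontoon layer untouched, which is what keeps a stack's hiera minimal.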

Making roles work in Pontoon

Read this section if you have added a new role to your stack and things are broken (e.g. Puppet fails).

There are only a few failure classes to think about:

  1. Undefined hiera variables. Check the common hiera settings files in modules/pontoon/files/settings for the missing values. If the values are not set already you’ll need to add them. See the Hiera section on how to do that.
  2. Services on the host are unhealthy. Typical causes: the service’s dependencies haven’t been bootstrapped yet (e.g. databases or users are missing), the service can’t reach its dependencies (see also Puppet/Pontoon/Services for details on services in Pontoon), or private material is missing (TODO section on private).

Debugging and fixing these issues will also help find production bugs (e.g. a reimaged host will yield the same error, or the same problem will show up when porting a role to a new Debian distribution). Typical bootstrap problems are directory initialization (puppetdb, trafficserver) or service dependencies (trafficserver). Keep in mind that fixing some of these issues might require hacks, and that's okay given that Pontoon is not production and a hack enabling automation is better than manually bootstrapping and fixing services.

Team collaboration and git branches

A Pontoon stack is likely to be shared among multiple people, often in the same team. Ideally we are able to run an unmodified production branch on the Pontoon server, however there are exceptions that warrant having stack-specific branches. As of March 2021 the workflow for such branches is the following:

  1. The branch is pushed under the sandbox/ namespace, to allow for force-push. For example sandbox/filippo/pontoon-o11y is the branch for the observability stack. Note that you'll also need to allow access to ldap/ops at https://gerrit.wikimedia.org/r/admin/repos/operations/puppet,access
  2. Such branches should be periodically rebased on top of production and force-pushed. Note that the Pontoon server will also rebase its local production to keep up with updates. As with any self-hosted Puppet server the rebasing can fail, thus it is important to keep the sandbox branches rebased.
  3. The stack branch is force-pushed as production to the Pontoon server, as explained in the Howto section.