Puppet/Pontoon

Reach out to godog / Filippo Giunchedi on #wikimedia-sre if you'd like more information and/or assistance.

Problem statement

We, as SRE at the Foundation, work together to keep the production infrastructure up and running. A significant chunk of our work relates to making changes to our public Puppet repository. Routine changes vary widely in how intrusive they are to the infrastructure; consider the following:

  • fine-tuning parameters for production services
  • deploying new services
  • rolling out operating system upgrades

Generally speaking, every change introduces risks that we have learned to accept. We have also deployed suitable mitigations for those risks, such as:

  • code reviews
  • the Puppet compiler
  • Puppet realms other than 'production'.

In an ideal world we would be able to minimize risks on every change before going to production. Being able to test changes within a testbed stack (i.e. a "virtual production") greatly reduces risks and enables experimentation in a safe way.

Today

Setting up such stacks is possible today, but certainly not in a "disposable" fashion.

The word disposable in this context means that the stacks should be easy to set up and tear down, and are isolated from one another (i.e. self-contained as much as possible). For all intents and purposes the stacks resemble production, but receive less (possibly zero) user traffic. Each stack also carries stack-specific data (e.g. private data) that must be initialized.

SRE teams today set up WMCS instances to test changes, and roles are assigned via Horizon. Hiera data comes from different sources (Horizon and Puppet), and lookups within hieradata don't behave the same way as in production. The result is duplication of multiple variables and, often, banging Hiera data and variables together until Puppet runs successfully.

This works, but it requires duplication of variables and the resulting patch can't be applied to production as-is. In a perfect world we would have role assignment done the exact same way as in production and Hiera variables looked up the same way: a common default, overriding only what changes (e.g. domain names, hostnames, etc.).

Pontoon

Pontoon (in the k8s nautical theme: a recreational floating device) explores the idea of disposable stacks as similar to production as possible. The key idea and goal is that the Puppet code base should not depend on hardcoded production-specific values.

Pontoon features include:

  • Role assignment happens by mapping a role to a list of hostnames that need that role.
  • The role mapping is used by Pontoon to drive its Puppet external node classifier (ENC). The ENC also supplies extra variables generated from the mapping.
  • Hostnames listed in the mapping will have their Puppet certificates automatically signed on the first Puppet run.

The explicit role-to-hostnames mapping enables meta-programming Puppet, which in turn makes it possible to replace lists of hostnames in Hiera (e.g. firewall rules) with variables containing "all hosts for role foobar" at catalog compile time.
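For illustration, here is a minimal sketch of the data involved: a rolemap entry, and the kind of output a Puppet ENC returns for a host (standard ENC YAML with classes and parameters). The parameter name below is hypothetical, not necessarily what Pontoon's ENC actually emits.

 # rolemap entry: role -> hosts that should receive it
 graphite::production:
   - graphite-01.graphite.eqiad1.wikimedia.cloud

 # sketch of what the ENC could return for that host
 classes:
   - role::graphite::production
 parameters:
   pontoon_stack: o11y    # hypothetical parameter derived from the mapping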

In practice

As of April 2020 the implementation consists of the following:

  • A standalone puppet server with the production Hiera hierarchy. Two additional lookup files are provided as well: one at the top of the hierarchy to override production defaults, and one at the bottom to supply Pontoon-specific values, possibly auto-generated. The latter file (auto.yaml) is used by Pontoon to work around some Hiera limitations (namely that variables from the ENC are strings, whereas in some cases we need lists).
  • The standalone puppet server is driven by Pontoon's ENC.
  • The realm is 'labs', to keep subsystems like authentication working as expected.
  • Stack-specific data (e.g. the "root of trust", the Puppet CA, etc.) must be initialized manually.
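To illustrate why auto.yaml exists: it can carry list-valued data derived from the role mapping, something the string-only ENC parameters cannot express. A sketch, with a hypothetical key name:

 # auto.yaml (sketch): auto-generated, list-valued data for the stack
 profile::foobar::hosts:
   - foobar-01.myproject.eqiad1.wikimedia.cloud
   - foobar-02.myproject.eqiad1.wikimedia.cloud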

Benefits

Why go through all the trouble of building isolated stacks similar to production? There are several benefits to having a bespoke testbed:

  • Lower overhead to experiment with new ideas and services.
  • Increased confidence that a patch will work as expected once applied in production. For example being able to test distribution upgrades in isolation.
  • External reusability of the Puppet code base is improved as a bonus/side effect: we're factoring out assumptions about production and third parties can recreate similar environments. Similarly, contributing to the Puppet code base itself is made easier if recreating an isolated "mini-production" is possible without jumping through too many hoops.

Demo

See the asciinema recording at https://asciinema.org/a/WqirmPmHSlHa0LzN5dZLxgCAR

  • In the demo video I'm testing the migration of the graphite::production role to Buster. To do so, I'm adding a freshly provisioned Buster WMCS instance to a self-hosted Pontoon puppet server.
  • I'm confirming which role I want (graphite::production in this case) and proceeding to change the 'observability' stack: adding the new role and then assigning the newly provisioned host to it.
  • I'm then committing the change and pushing the repo's HEAD to the pontoon puppet server as if it were the 'production' branch.
  • Next, I'm taking over (i.e. enrolling) the graphite host. The enroll script needs to know which stack to use (to locate the puppet server) and the hostname to enroll. The script will then log in, make the necessary adjustments (namely changing the puppet server and deleting the host's certificates) and kick off a puppet run.
  • Note that auto signing is disabled yet the puppet server issues the certificate because the host is present in the stack file and thus authorized/recognized.
  • The puppet run then proceeds as expected and graphite::production is applied. Some failures are to be expected: for example, custom Debian packages not yet available in buster-wikimedia, and failures right after the first puppet run (when apt sources are changed).
  • Next I run apt update manually and validate that another puppet agent run is possible.

Howto

This section outlines how to try out Pontoon yourself. The idea is to replicate production's model of one-host-per-role, in other words have a Pontoon server and several agent instances. Development happens locally on your workstation via a checkout of puppet.git and changes are pushed to the Pontoon server.

Server bootstrap

The puppet server is the first host to set up in a stack, therefore it has to be bootstrapped. Bootstrapping is more complicated than enrolling subsequent hosts into an existing stack. Most details of the process are coded in modules/pontoon/files/bootstrap.sh, and instructions are provided once bootstrap has completed.

To start your new stack you will need the following:

  1. A local checkout of puppet.git
  2. A Cloud VPS Buster instance (small flavor or above)
  3. A name for your new stack. It is recommended to pick a short name based on your team and the stack's function, e.g. o11y-alerts
  4. The bootstrap.sh script present on the host to bootstrap. The script can be scp'd from your local puppet.git checkout; it must be executable and located in your user's $HOME. For convenience, you can install the current version on the host with:
curl -sS https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/pontoon/files/bootstrap.sh?format=text | base64 -d > bootstrap.sh && chmod a+x bootstrap.sh

With the requirements above in place, you can proceed with the bootstrap:

  1. SSH as your user to the host
  2. Issue sudo ./bootstrap.sh your_stack_name

The bootstrap script will complete in approximately two minutes if everything goes well. After completion you will need to finalize the bootstrap locally on your computer by creating the new stack in puppet and committing the result. The script will print out instructions for you to get started with your new stack (also saved on the host at /etc/README.pontoon).
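The authoritative finalization steps are the ones printed by bootstrap.sh (and saved in /etc/README.pontoon); as a rough sketch, with illustrative paths and messages, they amount to something like:

 # on your workstation, inside your puppet.git checkout (illustrative)
 mkdir -p modules/pontoon/files/your_stack_name
 $EDITOR modules/pontoon/files/your_stack_name/rolemap.yaml   # map puppetmaster::pontoon to the new server
 git add modules/pontoon/files/your_stack_name
 git commit -m "pontoon: add your_stack_name stack"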

Local setup to join an existing stack

In this case you want to start collaborating on an existing stack, thus the steps involved are the following:

  1. Figure out the server's FQDN, typically the host with the puppetmaster::pontoon role applied, for example puppetserver01.proj.eqiad1.wikimedia.cloud
  2. Configure the server to be able to act as a remote for your user's git push commands. See instructions at Help:Standalone_puppetmaster#Push_using_a_single_branch
  3. You can start pushing with git push -f project_puppetmaster HEAD:production
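As a hedged example of step 2, assuming the repository lives at /var/lib/git/operations/puppet on the Pontoon server (the linked help page is the authoritative reference):

 git remote add project_puppetmaster ssh://puppetserver01.proj.eqiad1.wikimedia.cloud/var/lib/git/operations/puppet
 git push -f project_puppetmaster HEAD:production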

Add a new host

This section outlines how to add a new (non Puppet server) host to an existing Pontoon stack.

  1. Add the host's FQDN and its role to your stack's modules/pontoon/files/STACK/rolemap.yaml, e.g.:
 graphite::production:
   - graphite-01.graphite.eqiad1.wikimedia.cloud
  2. Commit the result and push to the Pontoon server:
 git commit -m "pontoon: add host" modules/pontoon/files/STACK/rolemap.yaml
 git push -f pontoon HEAD:production
  3. Provision a new instance in Horizon with the FQDN you added above. Make sure the instance is created in the correct Horizon project (graphite in the example above).
  4. Enroll the new host. The script takes care of waiting for the host to be accessible, deleting the current puppet SSL keypair and flipping the host to the Pontoon server. Run this on your development machine:
 modules/pontoon/files/enroll.py --stack STACK graphite-01.graphite.eqiad1.wikimedia.cloud
  5. Puppet agent failures are likely; tweak puppet/hiera locally as needed and push to the server as above. Pontoon-specific hiera variables must live in hieradata/pontoon.yaml, while values specific to your stack can live in modules/pontoon/files/STACK/hiera/. All .yaml files in that directory will be considered.

Stack Hiera

One of the key goals of Pontoon is for its "look and feel" to be as close as possible to production's. There are still a few exceptions to be handled, for example during transitions. In practice this means having a stack-specific hiera directory to be able to override settings in Pontoon. The overrides are meant to be kept to a minimum; for example, production's default settings shouldn't be repeated in Pontoon's hiera.

The relevant files and paths are:

hieradata/pontoon.yaml
Common to all Pontoon stacks, changes to this file shouldn't be needed
modules/pontoon/files/STACK/hiera/
This is the main path for hiera overrides for your STACK. This path takes precedence over production's hiera. All *.yaml files in this directory will be searched for variables, irrespective of their name. Typically files are named after the general area/service that they affect, and/or the feature they enable. In some cases the files are generic and shared among stacks via symlinks; for example puppetdb.yaml contains the minimal settings for a functional puppetdb in Pontoon, and the file links to ../../puppetdb.yaml.
modules/pontoon/files/STACK/hiera/hosts/
This path allows for host-specific hiera settings if desired. Similarly to production, HOSTNAME.yaml will be searched for hiera settings.
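For example, a stack override could look like the following (the file name and the Hiera key are hypothetical; only settings that actually differ from production belong here):

 # modules/pontoon/files/o11y/hiera/graphite.yaml (hypothetical example)
 profile::graphite::hostname: 'graphite-01.graphite.eqiad1.wikimedia.cloud'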


Team collaboration and git branches

A Pontoon stack is likely to be shared among multiple people, often in the same team. Ideally we would be able to run an unmodified production branch on the Pontoon server; however, there are exceptions that warrant having stack-specific branches. As of March 2021 the workflow for such branches is the following:

  1. The branch is pushed under the sandbox/ namespace, to allow for force-push. For example sandbox/filippo/pontoon-o11y is the branch for the observability stack.
  2. Such branches should be periodically rebased on top of production and force-pushed. Note that the Pontoon server will also rebase its local production to keep up with updates. As with any self-hosted Puppet server the rebasing can fail, thus it is important to keep the sandbox branches rebased.
  3. The stack branch is force-pushed as production to the Pontoon server, as explained in the Howto section.
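A sketch of that workflow with plain git commands, assuming 'origin' is Gerrit and 'pontoon' is the remote pointing at the stack's Pontoon server:

 # rebase the sandbox branch on top of the latest production
 git fetch origin
 git rebase origin/production sandbox/filippo/pontoon-o11y
 git push -f origin sandbox/filippo/pontoon-o11y
 # deploy the rebased branch to the Pontoon server as its 'production'
 git push -f pontoon sandbox/filippo/pontoon-o11y:production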


Frontend services

One of the use cases for Pontoon is to be able to prototype new services and features. Most often, infrastructure services in production take the form of <service>.wikimedia.org. Such services fall into two broad categories:

Proxied by edge CDN
Most common type of service: the edge CDN terminates TLS, optionally caches, and reverse-proxies new TLS connections to <service>.discovery.wmnet.
No edge CDN involved
The service shouldn't depend on edge CDN (e.g. alerting), thus the host has a public IP for <service>.wikimedia.org and TLS connections are terminated on the host.

Prototyping such services must be easy in Pontoon: setting up a new service should be a few lines of configuration, and the service's backends configuration must work as-is in production.

To this end, the following concepts are introduced, some of which are useful in production as well:

$public_domain variable
Meant to indicate the "external" or "public" domain under which services are expected to run. Usually in the form of <service>.<public_domain>. For production the public_domain is obviously wikimedia.org, and for Pontoon stacks it is going to be some form of third-level domain, e.g. monitoring.wmflabs.org in observability's case. The idea and intended usage is to be able to stop hardcoding 'wikimedia.org' in configurations and make realistic testing/prototyping easier to achieve.
$public_domain is related to, but different from, $domain: the latter refers to the network domain and can vary between internal and external, whereas the former is a logical/administrative domain used to host services. The variable can and should be used in production as well as Pontoon. See also gerrit review
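As a hedged illustration, assuming the value is exposed as a plain Hiera key named public_domain (the exact lookup mechanism may differ):

 # production (illustrative)
 public_domain: 'wikimedia.org'
 # observability Pontoon stack (illustrative)
 public_domain: 'monitoring.wmflabs.org'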

In the service::catalog entry for the service the following keys are introduced:

public_endpoint
The service's name as available publicly under $public_domain, typically the same as the service's name (e.g. thanos). The value can be useful in production as well, see also gerrit review.
role
The puppet role that is used to run the service. In the example above, thanos::frontend is used as the service's role. The value can be useful in production too, however generating a list of hosts for a given role carries a dependency on PuppetDB. See also gerrit review.


Implementation

The information above is used to drive the Pontoon frontend and provide a simple but accurate emulation of the traffic received in production. The implementation in WMCS uses a single VM running the pontoon::frontend role, with the following requirements:

  1. A floating IP associated with the VM running the pontoon::frontend role.
  2. A DNS wildcard A record for *.$public_domain pointing to the floating IP.

With these requirements met, adding a new public service fooservice.$public_domain in Pontoon boils down to the following service::catalog entry:

foo:
    encryption: true
    role: 'foo::frontend'
    public_endpoint: 'fooservice'
    port: 443
    description: A service for Foos
    # required keys below (but not strictly needed for Pontoon purposes)
    ...

For each service with a public_endpoint the Pontoon frontend will:

  1. Acquire a letsencrypt certificate for fooservice.$public_domain
  2. Redirect http://fooservice.$public_domain to https://fooservice.$public_domain
  3. Reverse-proxy TLS to all hosts running the role foo::frontend on port 443, sending foo.discovery.wmnet as SNI.
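A quick way to check the behavior above from any machine, with hypothetical hostnames:

 # expect an HTTP redirect to https:// (step 2)
 curl -sI http://fooservice.monitoring.wmflabs.org | head -1
 # expect the proxied backend's response over TLS (steps 1 and 3)
 curl -sI https://fooservice.monitoring.wmflabs.org | head -1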

An example modification adding all observability services to service::catalog with their respective public_endpoints can be found at this review. The full implementation of the concepts described above is part of the pontoon-lb Gerrit topic.