Puppet
- This page is about how to install, configure, and manage puppet. For documentation about writing puppet code, see Puppet usage.
Puppet is the main configuration management tool used on the Wikimedia clusters (see puppet for dummies on the blog).
puppet agent is the client daemon that runs on all servers and manages machines with configuration information gathered from puppetmasterd.
puppet agent
Installation of the puppet service is handled via our automated installation. No production-ready machine should have puppet installed manually.
If you're not using the wmf-reimage script (STRONGLY DISCOURAGED), initial root login can be done from any puppetmaster frontend with sudo /usr/local/sbin/install-console HOSTNAME. The script uses the /root/.ssh/new_install ssh key and therefore also works while debian-installer is running during a PXE install.
Communication with the puppetmaster server happens over SSL, with signed certificates. To sign the certificate of a newly installed machine, log in on the current ca_server (at the moment, puppetmaster1001.eqiad.wmnet) and run:
puppet cert sign clienthostname
To check the list of outstanding, unsigned certificates, use:
puppet cert list
Reinstalls
When a server gets reinstalled, the existing certs/keys on the puppetmaster will not match the freshly generated keys on the client, and puppet will not work. Our automated reimaging script should be used in every case.
The manual steps are as follows:
- Before a server runs puppet for the first time (again), run the following command on the puppetmaster host to erase all history of that server:
puppet node clean clienthostname
However, if this is done after puppet agent has already run, and therefore has already generated new keys, it is not sufficient. To fix this situation, run the following command on the client (not the puppetmaster!) to erase the newly generated keys/certificates:
find /var/lib/puppet -name "$(hostname -f)*" -exec rm -f {} \;
SANs for puppet certs
If you want to add SANs to your puppet certificate, you can do that with the following:
- Make sure that puppet is already successfully running on the instance whose certificates need to have SANs added
- Set the base::puppet::dns_alt_names puppet variable to a comma-separated list of domains you'd like to be in the SAN (see the upstream documentation of the dns_alt_names setting for details; a sketch of the resulting agent config is shown after this list).
- Run puppet a couple of times to make sure that /etc/puppet/puppet.conf has a line under [agent] setting dns_alt_names
- On the puppetmaster, revoke the current certificate for the host with puppet cert clean <fqdn-of-host>
- On the client, clean out all the certs with rm -rf /var/lib/puppet/ssl
- Run puppet agent -tv on the client to regenerate a certificate and submit it to the puppetmaster for signing. This CSR will include the SANs you configured above.
- On the server, sign the CSR with puppet cert --allow-dns-alt-names sign <fqdn-of-host>.
Done!
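For illustration only (the hostnames below are invented), once the setting has propagated, the [agent] section of /etc/puppet/puppet.conf on the client should end up looking roughly like this, and it is this file that makes the regenerated CSR carry the alternative names:
[agent]
    dns_alt_names = service.example.wmnet,service.svc.eqiad.wmnet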
Misc
Sometimes you want to purge info for a host from the puppet db. The below will do it for you:
puppet node clean fqdn
on the puppet master. All references to the host, i.e. the node entry and all the facts going with it, will be removed. It is important to note that the SSL certificate will be removed as well, so you will need to regenerate and sign a new certificate afterwards.
Puppetmaster
As of late 2016 we have a 3-layer infrastructure for our puppet masters:
- Each main datacenter has its own 3-layer infrastructure, linked with the one in the other datacenter.
- The first layer is the puppetmaster frontend, running on one machine per DC. It listens on port 8140, only accepts connections via HTTPS, and proxies them to backends for everything besides static content. Some specific requests, like certificate signing requests and requests for volatile data, are redirected to the local backend (see below) if the server is the current designated master; if it's not, requests are proxied to the frontend on the current master.
- The second layer consists of the puppetmaster backends, which listen on port 8141 and run the puppetmaster application via apache/mod_passenger. One backend instance is also installed on each frontend server. This is what does most of the server-side work (a quick way to check both layers by hand is sketched after this list).
- The third layer is PuppetDB, where the backend application stores agent-provided facts, compiled catalogs and resources; it can also be queried by the masters to fetch information about nodes (e.g. exported resources). The PuppetDB architecture is fairly complex in itself, so it is explained in more detail below.
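A quick way to check both layers by hand, assuming the stock Puppet 3 HTTP API paths and the upstream default auth.conf (which allows unauthenticated retrieval of the CA certificate; our access rules may be stricter), is to ask each port for the CA certificate:
curl -sk https://puppetmaster1001.eqiad.wmnet:8140/production/certificate/ca | head -1
curl -sk https://puppetmaster1001.eqiad.wmnet:8141/production/certificate/ca | head -1
Both the frontend on port 8140 and the backend on port 8141 should answer with a PEM block starting with -----BEGIN CERTIFICATE-----.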
PuppetDB
PuppetDB is a Clojure application that exposes a somewhat-RESTful interface to retrieve information about puppet catalogs, resources and facts. At the time of writing we're using PuppetDB version 2.3, which is the last version compatible with puppet 3.x. PuppetDB uses Postgres to store its data.
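For a quick look at what PuppetDB knows, the v3 query API (the stable one in PuppetDB 2.3) can be queried with curl. This assumes the default plaintext listener on port 8080, reachable from the puppetdb host itself; our configuration may bind or restrict it differently:
curl -s http://localhost:8080/v3/nodes | head
curl -s http://localhost:8080/v3/nodes/<fqdn-of-host>/facts | head
The first call lists the nodes PuppetDB knows about, the second returns the stored facts for one node.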
Our PuppetDB infrastructure is built for scaling-out and high availability as follows:
- Each datacenter has one puppetdb server, which at the moment hosts both the clojure application and the Postgres server.
- Queries from the puppetmasters in one datacenter normally flow to the local puppetdb application.
- Read-only queries go to the local postgres server; writes are done by connecting (over SSL) to whichever postgres instance is the master.
- Postgres instances are set up in a master/read-only-slaves configuration, with the slaves replicating from the master (a quick way to check replication status is sketched after this list).
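A quick way to check replication status, assuming a standard Postgres streaming-replication setup (nothing here is specific to the puppetdb role):
sudo -u postgres psql -c 'SELECT client_addr, state FROM pg_stat_replication;'   # on the master: one row per connected slave
sudo -u postgres psql -c 'SELECT pg_is_in_recovery();'                           # on a slave: returns t while replicating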
Making changes
This page may be outdated or contain incorrect details. Please update it if you can.
For the gerrit and pre-gerrit patch stages, see the Git/Gerrit doc page.
Updating operations/puppet for production nodes
For security purposes, changes made to the puppet git repository are not immediately applied to nodes. In order to get approved puppet changes live on production systems, you must fetch and review the changes one last time on palladium. This final visual check is crucial to making sure that malicious puppet changes don't sneak their way in, as well as making sure that you don't deploy something that wasn't ready to be deployed.
The operations/puppet repository is hosted on palladium at /var/lib/git/operations/puppet. This working copy has hooks to update strontium and other puppetmasters.
"puppet-merge" is a wrapper script designed to formalize the merge steps while making it possible to review actual diffs of submodules when they change. When there are submodule changes, puppet-merge will clone the /var/lib/git/operations/puppet working copy to a tmp directory, do the merge and submodule update, and then show a manual file diff between /var/lib/git/operations/puppet and the temporary clone. This allows for explicit inspection of exactly what is about to be done to the codebase, even when there are submodule changes.
$ cd /var/lib/git/operations/puppet   # optional, puppet-merge will work properly from anywhere
$ puppet-merge
# diff is shown...
Merge these changes? (yes/no)? yes
Merging a4678c710573006249e86d311198b94cc3889382...
git merge --ff-only a4678c710573006249e86d311198b94cc3889382
Updating 8b0e19d..a4678c7
Fast-forward
 files/puppet/puppet-merge | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
From https://gerrit.wikimedia.org/r/p/operations/puppet
   8b0e19d..a4678c7  production -> origin/production
Merge made by the 'recursive' strategy.
 files/puppet/puppet-merge | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Running git clean to clean any untracked files.
git clean -dffx
HEAD is now a4678c710573006249e86d311198b94cc3889382.
Once the changes are merged, they will be put into place by puppet on the relevant nodes during their next puppet run.
Noop test run on a node
You can do a dry run of your changes using:
# puppet agent --noop --test --debug
This will give you (among other things) a list of all the changes it would make.
Trigger a run on a node
Just run:
# puppet agent --test
Debugging
Using
# puppet agent --test --trace --debug
You get maximum output from puppet.
You can see a list of the classes that are being included on a given puppet host by checking the file /var/lib/puppet/state/classes.txt.
With --evaltrace, puppet will show the resources as they are being evaluated:
# puppet agent -tv --evaltrace
info: Class[Apt::Update]: Starting to evaluate the resource
info: Class[Apt::Update]: Evaluated in 0.00 seconds
info: /Stage[first]/Apt::Update/Exec[/usr/bin/apt-get update]: Starting to evaluate the resource
notice: /Stage[first]/Apt::Update/Exec[/usr/bin/apt-get update]/returns: executed successfully
info: /Stage[first]/Apt::Update/Exec[/usr/bin/apt-get update]: Evaluated in 16.24 seconds
info: Class[Apt::Update]: Starting to evaluate the resource
info: Class[Apt::Update]: Evaluated in 0.01 seconds
...
Most of the puppet configuration parameters can be passed as long options (e.g. evaltrace can be passed as --evaltrace).
Errors
Occasionally you may see puppet fill up disks, which then results in yaml errors during puppet runs. If so, you can run the following on the puppet master, but do so very, very carefully:
cd /var/lib/puppet && find . -name "*<servername>*.yaml" -delete
Check .erb template syntax
"ERB files are easy to syntax check. For a file mytemplate.erb, run"
erb -x -T '-' mytemplate.erb | ruby -c
Troubleshooting
puppet master spewing 500s
It might happen that there's a storm of puppet failures; this is usually due to the clients not being able to talk to the master(s). If that happens, first identify the failing puppet master (there should be a nagios check on HTTP looking for 200s). Once on the puppet master, check that the apache children are present, in particular mod_passenger's passenger-spawn-server, and that there are "master" processes running. Their stdout/stderr are connected to /var/log/apache2/error.log, so that will provide some guidance; if e.g. passenger-spawn-server crashed, it is usually sufficient to restart apache.
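A rough sketch of those checks, using standard tools only (the exact process names can vary with the passenger version):
pgrep -af passenger-spawn-server        # is the passenger spawn server alive?
pgrep -af master                        # are the "master" processes mentioned above present?
tail -n 50 /var/log/apache2/error.log   # their stdout/stderr end up here
sudo systemctl restart apache2          # usually enough if the spawn server crashed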
puppet-merge fails to sync on secondary
Sometimes puppet-merge might fail to sync on the secondary for whatever reason (see also https://phabricator.wikimedia.org/T128895). This is easily fixed by ssh-ing into the server where the command failed and running:
sudo puppet-merge
Private puppet
Our main puppet repo is publicly visible and accepts (via gerrit review) volunteer submissions. Certain information (passwords, keys, etc.) cannot be made public, and lives in a separate, private puppet repository.
The private repository is stored on puppetmaster1001 in /srv/private. It is not managed by gerrit or subject to review; changes are made there by logging in, editing and committing directly on the puppetmaster. Changes to /srv/private are distributed to the other puppetmasters automatically via a post-commit hook. The puppet master pulls private data from /var/lib/git/operations/private, but you don't need to edit there; it is synced automatically by the post-commit hook in /srv/private.
The data in the private repository is highly sensitive and should not ever be copied onto your local machine or to anywhere outside of a puppetmaster system.
Nowadays, most things in the private repo should be class parameters defined with Puppet Hiera. Those reside under private/hieradata and have the big advantage that they don't need to be replicated in a second repository (see below).
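For illustration only (the file path and key below are invented, and whether sudo is needed depends on how /srv/private is owned), a typical change looks roughly like this on the puppetmaster:
cd /srv/private
sudo vi hieradata/common.yaml                  # e.g. add: profile::someservice::db_password: 'the-real-secret'
sudo git add hieradata/common.yaml
sudo git commit -m "someservice: update db password"
# the post-commit hook then distributes the commit to the other puppetmasters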
Public (fake) private puppet repo
In order to satisfy puppet dependencies while retaining security, there is also a 'labs private' repo which the labs puppetmaster uses in place of the actual, secure private repo. The labs private repo lives on Gerrit and consists mainly of disposable keys and dummy passwords. For hieradata in the private repo, labs can in most cases be satisfied with class defaults or with data you put in labs.yaml in the public hiera repository.
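Continuing the invented key from the previous section, the labs private repo (or labs.yaml) would carry a harmless placeholder for the same parameter, for example:
profile::someservice::db_password: 'notarealsecret'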
puppet git submodules
Some puppet modules are managed as git submodules, for maximizing the pain and frustration of the developer; the fact that this also allows episodic code sharing between production puppet and other environments (e.g. vagrant, third parties) is a plus.
troubleshooting
If submodules need to get merged into the main puppet.git repo, a manual cleanup is needed, or git pull will fail with:
error: The following untracked working tree files would be overwritten by checkout:
In that case you'll need to remove whichever modules/SUBMODULE files were there and try pulling again (a sketch follows).
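A rough sketch of that cleanup, assuming the leftover checkout lives at modules/SOMEMODULE (substitute the module name that git reports):
cd /var/lib/git/operations/puppet
git status                    # lists the untracked files blocking the pull
rm -rf modules/SOMEMODULE     # remove the leftover submodule checkout
git pull                      # retry; the files now come in as regular tracked content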
Todo
- More secure certificate signing
- Better, more automated version control
- Better tools for adding/maintaining node definitions
tickets
Some selected "puppetize" tickets that are open: