Streamlined Service Delivery Design/research
There is an effort underway to introduce CD to WMF, which started early in 2017. The exact shape and scope of this effort are still under discussion. User:Lars Wirzenius was hired in October 2018 to help with this.
This document outlines Lars's understanding and thinking so far. It's meant to aid discussion of this topic within the Release Engineering team and among others interested, to show where Lars is in the dark and where Lars has understood things correctly, and to reveal things that need discussion. If you disagree with something here, please contact Lars and set him straight.
This document discusses Continuous Deployment and Continuous Delivery, which are closely related but different practices. For most of this document, the differences between the two don't matter. The acronym CD is used for both.
WMF provides a large number of web sites. There are around a thousand wikis, implemented with MediaWiki, plus other sites. MediaWiki sites often use backing services, which are often microservices.
- Example of microservice: Mathoid renders custom markup in wikitext to MathML and images.
- Databases and load balancers are other types of backing services.
Running a MediaWiki site counts as a service for this document, even if MediaWiki doesn't fit into the model of a microservice with an HTTP API. Deploying MediaWiki should ideally be similar to deploying backing services.
Changes to MediaWiki sites are deployed in two ways: the SWAT and the Train. SWAT is mainly for configuration changes, small bug-fixes for user-visible problems, and other low-risk changes. The Train is for other changes, most importantly changes to MediaWiki core and extensions.
SWAT happens a couple of times a day (European afternoon, North American afternoon), and Train on Tuesday through Thursday most weeks. The Train changes group 0 ("minor sites") on Tuesday, group 1 ("medium sites") on Wednesday, and group 2 ("big sites") on Thursday, in the hope of catching errors on smaller sites before they affect large numbers of people.
The goal is to move services individually to a CD model. One microservice has been moved (Mathoid). Three more are destined to be moved by the end of 2018 (Graphoid, Zotero, Blubberoid). This means they will be deployed and updated using the new Delivery Pipeline for WMF in the future.
- FIXME: Link to Deployment Pipeline
Currently deployment is triggered manually, but aided by some scripting, and happens daily (SWAT) or weekly (Train). Many of the mechanical steps of deployment have been scripted (scap), but the process still requires manual work. Part of the manual work is checking that the service still works after an upgrade. Part of it is change review, and in some cases small fixes to the changes. There also seems to be a lot of ad hoc communication happening between developers and the Release Engineering team. Deployment is labor-intensive, and as such too error-prone.
- FIXME: link to scap
Scap uses canary servers when deploying: a small subset of the servers are deployed to first, and then logs and other things are checked for problems. If everything goes well, the rest of the servers are deployed to as well.
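The canary approach described above can be sketched roughly as follows. This is not how scap is implemented; the function name, the `deploy` and `looks_healthy` callbacks, and the server names are all hypothetical, chosen only to illustrate the order of steps.

```python
from typing import Callable, Sequence


def deploy_with_canaries(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    looks_healthy: Callable[[str], bool],
    canary_count: int = 1,
) -> bool:
    """Deploy to a few canary servers first, then to the rest if all is well.

    Returns True if the deployment reached every server, False if it
    stopped after the canaries.
    """
    canaries = list(servers[:canary_count])
    rest = list(servers[canary_count:])

    for server in canaries:
        deploy(server)

    # Stop here if any canary shows problems; the rest stay untouched.
    if not all(looks_healthy(server) for server in canaries):
        return False

    for server in rest:
        deploy(server)
    return True


# Hypothetical usage: record which servers were deployed to.
deployed = []
ok = deploy_with_canaries(
    ["mw1", "mw2", "mw3"],
    deploy=deployed.append,
    looks_healthy=lambda server: True,
)
```

In practice the health check would look at logs and other signals, as the paragraph above says; here it is a placeholder that always succeeds.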
There is some automated testing of running services. service-checker consumes an OpenAPI specification for HTTP endpoints to check, along with their expected responses. In addition, the logstash log collection service is queried to see if the rate of errors changes after deployment.
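The error-rate comparison could look something like the sketch below. It assumes the error and request counts have already been fetched from the log collection service; the query itself is not shown, and the function names and the `tolerance` parameter are hypothetical.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that ended in an error."""
    return errors / requests if requests else 0.0


def error_rate_got_worse(
    before: tuple[int, int],
    after: tuple[int, int],
    tolerance: float = 0.01,
) -> bool:
    """Compare (errors, requests) counts from before and after a deployment.

    The tolerance allows for normal noise; its value here is an
    arbitrary assumption.
    """
    return error_rate(*after) > error_rate(*before) + tolerance


# Hypothetical counts: 5 errors in 1000 requests before, 40 in 1000 after.
worse = error_rate_got_worse(before=(5, 1000), after=(40, 1000))
```

A real check would also need to decide how long a window to compare and which log messages count as errors.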
There is also some monitoring of services, which SRE sets up and maintains. Monitoring alerts relevant parties when it notices something breaking. It should be noted that monitoring is not the same as testing: a test suite tells you what aspect of a service doesn't work ("front page doesn't say Wikimedia"), while monitoring tells you when some measured aspect of a service falls outside an expected range ("too many 500 status codes in the log file"). Both are needed. Test suites are especially useful when making changes to the service code ("do the things that the test suite tests still work?"); monitoring is needed when something changes without deployment (server catches fire).
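To make the distinction concrete, here is a minimal sketch of the two examples from the paragraph above. The function names and the 5% threshold are hypothetical.

```python
def check_front_page(body: str) -> None:
    """A test: checks one specific behaviour and names what is broken."""
    assert "Wikimedia" in body, "front page doesn't say Wikimedia"


def too_many_server_errors(status_codes: list[int], max_share: float = 0.05) -> bool:
    """A monitoring check: is a measurement outside its expected range?"""
    if not status_codes:
        return False
    server_errors = sum(1 for code in status_codes if code == 500)
    return server_errors / len(status_codes) > max_share


# Hypothetical inputs: a fetched page body and status codes from a log file.
check_front_page("Welcome to Wikimedia")
alert = too_many_server_errors([200] * 9 + [500])
```

The test fails with a message that points at the broken behaviour; the monitoring check only says that a number has left its usual range, and a human then has to find out why.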
Known problems with current status
All services (except the one already moved to the delivery pipeline) run on bare metal, or in virtual machines. This limits how well WMF can react to fluctuations in traffic, and thus increases hosting costs, due to having to over-allocate hardware resources (each service needs to have all the resources it needs for peak traffic). A more container-based system could save on hosting costs by not allocating hardware for the peak load time of each particular service, and instead sharing hardware among different services according to load.
- Question: not sure how big a problem this is.
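The possible saving comes from the fact that different services rarely peak at the same time. A small worked example, with entirely made-up load numbers and hypothetical service names, compares the two ways of sizing hardware:

```python
def dedicated_capacity(loads: dict[str, list[int]]) -> int:
    """Each service gets enough hardware for its own peak."""
    return sum(max(series) for series in loads.values())


def shared_capacity(loads: dict[str, list[int]]) -> int:
    """All services share hardware sized for the peak of their combined load."""
    return max(sum(at_one_time) for at_one_time in zip(*loads.values()))


# Made-up load, in "servers needed", sampled at four points in time.
loads = {
    "service-a": [2, 2, 8, 3],
    "service-b": [7, 3, 2, 2],
}

dedicated = dedicated_capacity(loads)  # 8 + 7
shared = shared_capacity(loads)        # largest of 9, 5, 10, 5
```

With these numbers, dedicated sizing needs 15 servers and shared sizing needs 10. This ignores safety margins and the overhead of the container platform itself, so it says nothing about how large the real saving would be.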
Software development for WMF needs is slow, because deployment slows it down. At the same time, deployment is risky because the deployment process doesn't keep pace with development; it can't keep pace because it's socially and cognitively too expensive.
Software development can be modelled as loops within loops. Development goes faster, and the software developing entity is more productive, when loops are iterated faster (or at least that's been my experience over the decades). Removing friction and obstacles and automating steps within a loop helps. The innermost loop is the "edit, build, test" cycle. Deployment is in the "build, deploy, test" loop. Making deployment easier and faster will help the overall software development productivity of Wikimedia and its community.
The WMF sites stay up thanks to a lot of manual effort spent on review. This is aided by some automated testing, deployment tooling, monitoring, and many users (both WMF staff and in the community) who eagerly report problems. There seem to be no major issues with quality and level of service, but it seems to require a lot more human effort than might be necessary. The friction coefficient of the deployment loop is high.
While WMF doesn't have a profit motive, it seems nevertheless that making processes smoother and more automated would be beneficial for WMF and the Wikimedia community in general, by freeing people to do more amazing things and spend less effort on mundane, repetitive deployment stuff. Better tools are force multipliers for brains.
Setting up entirely new services is a medium-big project. This seems like it should be less of an effort.
All services run in containers, hosted by Kubernetes.
This is still under discussion and possibly controversial, so Lars proposes two phases:
Phase 1: Continuous delivery: All the mechanical steps of deployment are automated using scap (or other tooling). Deployment is triggered when a deployer or release engineer runs the deployment script. There's still communication with developers, and keeping an eye on things, as part of the process. Changes that reach the beginning of the delivery pipeline will have been reviewed and approved already. The Train will be replaced by a more SWAT-like approach of changes getting deployed every workday.
Phase 2: Continuous deployment: Deployments are fully automated, except for change review. The deployment pipeline has one or more gates at which manual review is done. The deployment pipeline is triggered when a developer requests a change to be reviewed and merged.
There are automated tests for all services. The tests have sufficient coverage and quality that if tests pass, the Release Engineering team has confidence that the sites and services work for our users.
A possible cultural change: when anything breaks, and it should have been caught by automated tests, tests are changed to catch it in the future.
SRE has responsibility for keeping Kubernetes running, as well as any other infrastructure (DNS, databases, etc). SRE also sets requirements for what runs in production: security, version traceability, testability, monitoring, and more.
Question: Does SRE want to review the code running in production, or its configuration, and changes to either? Probably not, but check with them.
SRE handles databases, and their configuration seems to be at least partially in the MediaWiki configuration and code. SRE will probably want to review any changes to that.
Release engineering team
The Release Engineering team has responsibility for providing and maintaining tooling to do deployments automatically, running automation to make deployments happen frequently, and reviewing changes to production (a sanity check, if nothing else).
Service developers (community and WMF)
The developers of the services have responsibility for writing and maintaining the service code, documenting service configuration, and writing and maintaining automated tests for the running service.
To facilitate this, developers get quick feedback from automated tests if anything seems hinky, so that they can fix it before the Release Engineering team gets involved. Quick here means within minutes. This is achieved by CI (Jenkins).
Possible cultural change: If any reviewer finds anything to fix, even if only extra whitespace, the change is made by the developer. (But silly, simple things like whitespace can be tested for automatically, and such tests are run before a human reviewer ever looks at the change.)
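As an illustration of how simple such an automated check can be, here is a minimal sketch that finds lines with trailing whitespace. The function name is hypothetical, and a real setup would more likely use an existing linter than custom code.

```python
def trailing_whitespace_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines that end in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


# Hypothetical file content: lines 2 and 4 end in whitespace.
sample = "ok\nbad \n\tfine\nalso bad\t\n"
problems = trailing_whitespace_lines(sample)
```

A check like this can run in CI on every change, so the reviewer only sees changes that already pass it.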
Possible cultural change: If there is a problem, the sites can be rolled back to a known-working version easily. When this happens, the automated tests get improved so they'll catch that problem in the future. The responsibility for improving the tests lies with both the developers and the Release Engineering team. Canary servers are still used, to look for problems that only happen under production conditions.
The smaller and safer a change is, the less effort it is to get it deployed to production. For example, a change to translations should be possible to do within an hour after having been approved by a reviewer. (That's a goal, not a requirement or a promise.)
Overview of planned solution
All services which can be run in a container are run in a container. Persistent data, such as databases, will stay on bare metal. Containers will communicate with them over the network. We will start by moving microservices into containers, and move MediaWiki last, probably not before late 2019.
All changes are built and tested by CI (Jenkins), which also builds container images. Such images get deployed into test instances, and such instances are tested.
This happens for changes to the master branch, as well as for developer branches. Phabricator and Gerrit are used to track changes, as before. Each change is automatically built and tested, and once tests pass, it is submitted by the developer for review by a human. If the reviewer accepts the change, it is merged into the master branch. Changes are handled in a way that notices when they work individually, but break together.
All configuration also gets deployed from git, and merged into the master branch using the same process.
No human ever changes the master branch directly. (If possible, the git server will be configured to prevent that.)