This is a page to document and discuss ideas for improving / replacing the train deployment process.
We must replace this broken process with something sane and healthy.
Warning: what follows is essentially a rant and much of it is not backed up by evidence. Even so, I believe every one of my points to be true and important for consideration.
I assert that the MediaWiki deployment train is a severe mental health hazard for deployers. I say this in all seriousness and without hyperbole.
I say this as the person who has run the train more than anyone else, and as someone who came very close to terminating my employment at the Foundation on more than one occasion due to extreme pressure and stress directly resulting from train deployment duties. But mental health isn't the only reason, and maybe not even the most compelling. Let me list just a few of the major reasons that we must do something else.
- The current process involves ~300 patches going out each week (and this number has been increasing steadily for as far back as we have data)
- The person deploying generally has zero familiarity with the specific changes they are deploying. Furthermore, the volume of patches prohibits us from even gaining a superficial understanding of all of these changes.
- The deployers (along with SREs) are ultimately the ones who have to deal with the immediate fallout from whatever they deploy. Sometimes that means a "routine" deployment turns into an all-hands-on-deck emergency incident with hours or days of recovery and incident reporting responsibilities.
- By insulating developers from the production concerns around deploying and operating their code, we enable developers to be more reckless and less knowledgeable about the context in which their code is running.
- By taking away a key responsibility, we rob developers of an important connection to their work. When developers deploy their own changes and see the results in production, there is a feeling of satisfaction and empowerment that is really powerful. It builds confidence and a feeling of connection to your work that is valuable beyond any words I can use to describe it.
- The process we use, instead of empowering developers, instills a feeling of helplessness where it feels incredibly difficult to actually get things done. This feeling is demotivating and demoralizing. Worse, it's not at all equitable, because how hard it is depends largely on your position, social connections, self-confidence, assertiveness and many other subtle "soft" factors. This doesn't support our developers in having the most impact and doing their best work. It actively works against them. The people most likely to have a negative experience are the ones least able to make a change or speak up and be heard about it. I have had this sort of experience, and I am a fairly outspoken white male with a lot of pre-existing self-confidence and a whole lot of work experience prior to working here. If I hadn't had all of that, I could easily have given up any number of times.
- Any delay in deployment of a change ultimately slows down the entire development process.
- When a change introduces a new bug or adds new log spam to production error logs, the person most qualified to deal with that is the person who wrote the change. Currently, a deployer has to track that person down to get the production issue addressed. Unfortunately the deployer doesn't know which one of the ~300 patches was responsible. The person who wrote the patch might not be around anymore by the time it goes out. They could be asleep, on vacation or on an airplane. It is often difficult to track down the right person or team and get attention on the issue in a timely manner.
- When errors show up in production, a deployer often has no way to tell whether the error is significant or just log spam. Even log spam is a problem, but it's hard to know whether the problem warrants immediate rollback or is minor enough to raise in Phabricator and wait. The window of time between flagging an issue and the point where it starts to impact our deployment schedule is small, so the train is often delayed or rolled back for issues that the engineer who wrote the patch could trivially solve with a quick followup patch. If developers deployed their own changes, they would see the issue immediately and could resolve it right away.
- There are a bunch more issues with the current process and more important arguments for changing it but I'm running out of steam for today.
...To be continued.
See also: my historical but still somewhat-relevant ranting from 2015: https://phabricator.wikimedia.org/T89945
Small incremental changes
These are the less drastic ideas, which could be implemented in small steps without overhauling the process.
Deploy smaller batches more frequently
Just like the current train but with smaller batches deployed more frequently. As a starting point, consider daily instead of weekly.
Just stop cutting train branches! Everything must be landed in production via backports to the existing production branch.
Simply stop doing trains. Everything is deployed via backport windows using the current backport process. Although this wouldn't require drastic changes to processes or infrastructure, it would become a bottleneck for getting patches merged and deployed. At a minimum, this would require some automation to help usher patches through the backport and deployment process with minimal overhead. This is similar to what I proposed years ago; see T89945, and for even more context: https://phabricator.wikimedia.org/project/board/2117/query/all/
Use commit metadata to raise visibility of risky patches
Perhaps Gerrit hashtags or keywords in the commit message could be used to indicate risky patches and automatically surface them in a situational awareness dashboard.
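To make the commit-message variant of this idea a bit more concrete, here is a rough sketch of how risky patches might be picked out. Everything in it is an assumption for illustration only: the `Risk:` footer convention, the `Patch` type and the `risky_patches` helper are hypothetical and do not correspond to any existing tooling or agreed-upon convention.

```python
import re
from dataclasses import dataclass

# Hypothetical convention: authors add a footer line such as "Risk: high"
# to the commit message. Nothing like this is standardized today.
RISK_FOOTER = re.compile(r"^Risk:[ \t]*(high|medium|low)[ \t]*$",
                         re.IGNORECASE | re.MULTILINE)


@dataclass
class Patch:
    change_id: str  # illustrative identifier for the change
    message: str    # full commit message


def risk_level(message):
    """Return the declared risk level in lowercase, or None if absent."""
    match = RISK_FOOTER.search(message)
    return match.group(1).lower() if match else None


def risky_patches(patches, levels=("high",)):
    """Select the patches whose declared risk is one of `levels`."""
    return [p for p in patches if risk_level(p.message) in levels]


if __name__ == "__main__":
    queue = [
        Patch("I1", "Tidy up comments\n\nChange-Id: I1"),
        Patch("I2", "Rewrite session storage\n\nRisk: High\nChange-Id: I2"),
    ]
    for patch in risky_patches(queue):
        print(patch.change_id, risk_level(patch.message))
```

A dashboard could run something like this over the patches queued for the next deployment and list the flagged ones first. Patches with no footer are treated as unflagged here, which is itself a design choice that would need discussion.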
Large and disruptive changes
These ideas are larger and more disruptive, maybe even transformative to the point where the current process would be almost completely replaced with a new and different process.
Deploy Batches of Patches, one per team
In this model, each team would be in charge of their own deployment velocity. Instead of merging directly to main, teams would have their merges batched and landed all together as soon as they give the green light for their batch to go to production.
Someone from each team would be responsible for monitoring their batch and rolling back anything that breaks production. Release Engineering would provide automation and support for the process.
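As a thought experiment, the per-team batching described above might look roughly like the following. This is only a sketch under stated assumptions: `TeamBatch`, `BatchQueue` and all of their methods are hypothetical names invented here, and real automation would need to handle conflicts between batches, testing and rollback, none of which is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class TeamBatch:
    team: str
    patches: list = field(default_factory=list)
    approved: bool = False  # set once the team gives the green light


class BatchQueue:
    """Holds merged-but-undeployed patches, grouped by owning team."""

    def __init__(self):
        self._batches = {}

    def merge(self, team, patch_id):
        # Instead of landing directly on main, the patch joins its team's batch.
        batch = self._batches.setdefault(team, TeamBatch(team))
        batch.patches.append(patch_id)

    def approve(self, team):
        # The team signals that its current batch may go to production.
        if team in self._batches:
            self._batches[team].approved = True

    def release_ready(self):
        # Hand over every approved batch and remove it from the queue.
        ready = [b for b in self._batches.values() if b.approved and b.patches]
        for batch in ready:
            del self._batches[batch.team]
        return ready


if __name__ == "__main__":
    queue = BatchQueue()
    queue.merge("team-a", "patch-1")
    queue.merge("team-b", "patch-2")
    queue.merge("team-a", "patch-3")
    queue.approve("team-a")
    for batch in queue.release_ready():
        print(batch.team, batch.patches)
```

The point of the sketch is that each team's batch moves independently: an unapproved batch simply stays queued and does not hold up anyone else's deployment.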