You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Heterogeneous deployment/Train deploys: Difference between revisions

From Wikitech-static
imported>Krinkle
(merged dash)
imported>Brennen Bearnes
(→‎Places we monitor: link new errors dashboard, a bit more context on logs, remove channels we don't do much with.)
 
(66 intermediate revisions by 17 users not shown)
Line 1: Line 1:
[[File:MTR_CSR_Sifang_EMU_in_Shek_Kong_Stabling_Sidings_201710.jpg|thumb|500x500px|Bring new code in a fast, safe and efficient way!]]
{{Navigation MediaWiki deployment}}[[File:Trainbows_Not_Painbows1.svg|frameless|none|500px|alt=Trainbows not Painbows]]
{{Navigation MediaWiki deployment}}
<br>__TOC__
== Pairing on the Train ==


As of October 2019, there are two people assigned to each week's train: One as primary, and one as backup. These are rough guidelines for sharing the work, and should be improved as we learn more.
==Weekly steps==
 
=== Monday: Sync up with your deployment partner===
 
As of October 2019, there are two people assigned to each week's train: One as primary, and one as backup. These are rough guidelines for sharing the work, and should be improved as we learn more.


* On Monday, communicate with your partner and establish how you'll collaborate over the course of the week.
* On Monday, communicate with your partner and establish how you'll collaborate over the course of the week.
Line 10: Line 13:
** It seems to work well to have the primary do the work of cutting the branch, syncing wikis, etc., while the backup keeps an eye on logs, works on improvements to deploy tooling, and is generally an extra pair of eyes for the whole process.
** It seems to work well to have the primary do the work of cutting the branch, syncing wikis, etc., while the backup keeps an eye on logs, works on improvements to deploy tooling, and is generally an extra pair of eyes for the whole process.
** If you are in doubt about any part of the process and it's during your partner's working hours, consult them first and get their help in resolving your questions.
** If you are in doubt about any part of the process and it's during your partner's working hours, consult them first and get their help in resolving your questions.
* If one member of the pair is in the European window and one is in the American window, both train deployment windows should be reserved on the [[Deployments]] calendar. This gives a backup deployer a defined window for moving the train forward outside the primary's working hours, if it becomes necessary.
* If one member of the pair is in the European window and one is in the American window, both train deployment windows should be reserved on the [[Deployments]] calendar. This gives a backup deployer a defined window for moving the train forward outside the primary's working hours, if it becomes necessary.
* If the train is blocked or there are any other issues, communicate the transfer of responsibility on the train blocker ticket by assigning it to the responsible party and leaving a note.
* If the train is blocked or there are any other issues, communicate the transfer of responsibility on the train blocker ticket by assigning it to the responsible party and leaving a note.


== Breakage ==
===Tuesday: New branch creation and deploy===
====Before the deploy window====


There will be times when this process does not go smoothly. There are [[Deployments/Holding_the_train|guidelines]] for what to do when that happens.
All pre-deploy steps have been automated.


In general, '''if an unexplained error occurs within 1 hour of a train deployment, always roll back the train'''. Rolling back is especially important when there are several possible causes for ongoing issues, because it eliminates the train as one of them.
* [https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/ Branch cut] happens on releases-jenkins
* <code>scap stage-train auto</code> is run by a cron job


=== Rollback ===
Refer to [[#Troubleshooting_automated_jobs]] if something goes wrong.


Rolling back a wikiversion change should be quick. Roll back production before you send patches up to Gerrit, since waiting on Jenkins may take a while:
;During the deploy window
 
{| class="wikitable"
<syntaxhighlight lang="shell-session">
! colspan="2" |Step
USERNAME@deploy1001:/srv/mediawiki-staging$ git revert $(git log -1 --format=%H -- wikiversions.json)
!host
USERNAME@deploy1001:/srv/mediawiki-staging$ scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'
!command
USERNAME@deploy1001:/srv/mediawiki-staging$ # Now that you've synced the revert, push patches up to gerrit, you have to run git commit --amend to get the changeid
!example
USERNAME@deploy1001:/srv/mediawiki-staging$ git commit --amend
|-
USERNAME@deploy1001:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master/[VERSION]%l=Code-Review+2
|0-0
|'''Create and auto-merge/deploy the group0 patch'''
|deploy1002
| colspan="2" |<syntaxhighlight lang="shell-session">
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap deploy-promote group0
Promote group0 from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
</syntaxhighlight>
</syntaxhighlight>
 
|-
Example:
|0-1
 
|'''Verify production has indeed switched'''
<syntaxhighlight lang="shell-session">
|[[mw:Special:Version|MediaWiki.org]]
USERNAME@deploy1001:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master/1.34.0-wmf.0%l=Code-Review+2
| colspan="2" |Verify that [[mw:Special:Version|mediawikiwiki]] has switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
|-
| 0-2
|'''Monitor production logs'''
|logstash ''etc.''
| colspan="2" |Monitor irc and [[logstash]] and/or [[Wikimedia binaries#logspam-watch|logspam-watch]] for problems, see [[#Places to Watch for Breakage]]
|-
|0-3
|'''Update roadmap page'''
|[[mw:MediaWiki 1.40/Roadmap]]
|Change the <code>Deployed to group</code> (if you're using VisualEditor) or the 3rd parameter of the <code>WMFReleaseTableRow</code> template (if you're using the wikitext editor) to <code>0</code> (deployed to group0)
|<syntaxhighlight lang="text">
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|0}}
</syntaxhighlight>
</syntaxhighlight>
|}


*Wait for the patch to merge, then fetch it back down to the deployment server
===Wednesday: group0 to group1 deploy===


*[[#Update roadmap]].
;Meta / coordination
 
Attend the Train Log Triage meeting with members of the Core Platform Team and others.
=== Places to Watch for Breakage ===
{| class="wikitable"
 
! colspan="2" |Step
Train deployers should check for breakage while rolling out the train, since they are effectively the first line of defense for train deploys. Some of the places to watch for breakage:
!host
 
!command
* IRC
!example
** primary channel is {{irc|wikimedia-operations}}
|-
** useful channels are {{irc|mediawiki-core}} {{irc|wikimedia-dev}}
|1-0
** for more channels see [https://www.mediawiki.org/wiki/MediaWiki_on_IRC MediaWiki on IRC] and [https://meta.wikimedia.org/wiki/IRC/Channels IRC/Channels]
|'''Create and auto-merge/deploy the group1 patch'''
* [[Mwlog1001|mwlog1001]]
|deploy1001
** [[Wikimedia_binaries#logspam-watch|logspam-watch]]
| colspan="2" |<syntaxhighlight lang="shell-session">
** Logfiles in <code>/srv/mw-log</code>
USERNAME@deploy1001:/srv/mediawiki-staging/$ scap deploy-promote group1
*[https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors Logstash MediaWiki Errors]
Promote group1 from [PREVIOUS-VERSION] to [VERSION] [y/N]
*Logstash "mediawiki-new-errors" dashboard (linked from logstash front page)
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
**[https://logstash.wikimedia.org/app/kibana#/dashboard/dfcf7b70-1aaa-11e9-b4bc-db12fe15ab31 Showing only timeout errors] (see T204871)
* Group-specific Logstash Dashboards:
** [https://logstash.wikimedia.org/app/kibana#/dashboard/group0 group0]
** [https://logstash.wikimedia.org/app/kibana#/dashboard/group1 group1]
* [https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?refresh=5m&orgId=1 Grafana Varnish error-rate dashboard] (HTTP 5XX % should have 3+ 0s after the decimal point, e.g. 0.0001%)
* [https://grafana.wikimedia.org/d/000000612/frontend-responses-nginx-vs-varnish?orgId=1&from=now-15m&to=now Grafana Frontend Responses NGINX vs Varnish]
* [https://grafana.wikimedia.org/d/000000102/production-logging Grafana Production Logging]
* [https://grafana.wikimedia.org/d/000000566/overview?panelId=15&fullscreen&orgId=1&from=now-7d&to=now Minerva Client Errors] - Browser JS errors count (only wikipedias on mobile)
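The "3+ 0s after the decimal point" rule of thumb for the Varnish dashboard amounts to an HTTP 5XX rate below 0.001%. A minimal sketch of that comparison, assuming the percentage has been read off the dashboard by hand (the function name is hypothetical, and nothing here queries Grafana):

<syntaxhighlight lang="bash">
# Hypothetical helper: succeed if a 5XX percentage has at least three zeros
# after the decimal point, i.e. it is below 0.001 (percent).
error_rate_ok() {
    # awk does the floating-point comparison; "+ 0" forces numeric context.
    awk -v rate="$1" 'BEGIN { exit !(rate + 0 < 0.001) }'
}

error_rate_ok 0.0001 && echo "0.0001% is within the usual range"
error_rate_ok 0.02   || echo "0.02% is elevated, investigate"
</syntaxhighlight>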
 
=== If the train is blocked===
 
*A task will be assigned to you, for example [https://phabricator.wikimedia.org/T191059 T191059] (1.32.0-wmf.13 deployment blockers)
*Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.
 
'''Checklist'''
 
If there are blocking tasks, please do the following:
 
*Make sure all tasks blocking train are set to <code>UBN!</code> priority in phabricator
*Comment on the task asking for an ETA or whether this can be solved by reverting a recent commit.
*Send e-mail to:
**[https://lists.wikimedia.org/mailman/listinfo/ops ops@lists.wikimedia.org]
**[https://lists.wikimedia.org/mailman/listinfo/wikitech-l wikitech-l@lists.wikimedia.org]
**Subject: <code>[Train] {version} status update</code>
**Body<syntaxhighlight lang="text">The {version} version of MediaWiki is blocked[0].
 
The new version is deployed to {group(s){0,1,2}}[1], but can proceed no
further until these issues are resolved:
 
* {Phab task name} - {phab task link}
 
Once these issues are resolved train can resume. If these issues are
resolved on a Friday the train will resume Monday.
 
Thank you for your help resolving these issues!
 
-- Your humble train toiler
 
[0]. <{link to phab task for train}>
[1]. <https://tools.wmflabs.org/versions/></syntaxhighlight>
*Add relevant people (see [https://www.mediawiki.org/wiki/Developers/Maintainers Developers/Maintainers]) to the blocking task
* Ping relevant people in IRC
* Once train is unblocked be sure to thank the folks who helped unblock it
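
The status e-mail above can be filled in from a few variables. The following is only a sketch: the function name and argument order are hypothetical, the Subject line is printed above the body purely for convenience, and actually sending the message is left to your mail client.

<syntaxhighlight lang="bash">
# Hypothetical helper: print the "[Train] status update" e-mail on stdout.
# Usage: train_blocked_email VERSION DEPLOYED_TO TRAIN_TASK_URL BLOCKER...
# Each BLOCKER is one "{Phab task name} - {phab task link}" string.
train_blocked_email() {
    local version="$1" deployed_to="$2" train_task="$3"
    shift 3
    cat <<EOF
Subject: [Train] ${version} status update

The ${version} version of MediaWiki is blocked[0].

The new version is deployed to ${deployed_to}[1], but can proceed no
further until these issues are resolved:

$(printf '* %s\n' "$@")

Once these issues are resolved train can resume. If these issues are
resolved on a Friday the train will resume Monday.

Thank you for your help resolving these issues!

-- Your humble train toiler

[0]. <${train_task}>
[1]. <https://tools.wmflabs.org/versions/>
EOF
}
</syntaxhighlight>

For example, <code>train_blocked_email 1.32.0-wmf.13 group0 https://phabricator.wikimedia.org/T191059 "[TASK NAME] - [TASK LINK]"</code> prints a draft that can be pasted into a mail client and reviewed before sending.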
 
==Monday: Sync up with your deployment partner==
 
See the [[#Pairing on the Train|train pairing]] section above.
 
==Tuesday: New branch creation and deploy==
 
The new branch can be created in Gerrit from anywhere.
 
===Before the deploy window===
 
Depending on how practiced you are and where you choose to run commands (full clones of mediawiki-core from outside the cluster can take a while) the steps will typically take 45 to 90 minutes.
 
====Setup====
 
The script to cut a branch is run on your local machine (as of Jan 2020).
 
'''Local <code>.netrc</code> setup'''
 
Create a .netrc file in your home directory with the following content.
 
<syntaxhighlight lang="shell-session">
you@yourlaptop:~$ vim .netrc
machine gerrit.wikimedia.org login [USERNAME] password [PASSWORD]
</syntaxhighlight>
 
Username and password can be obtained from Gerrit:
 
* In the new UI, go to [https://gerrit.wikimedia.org/r/settings/#HTTPCredentials HTTP Credentials], copy the Username, and click Generate new password.
* In the old UI, go to [https://gerrit.wikimedia.org/r/#/settings/http-password HTTP Password], copy the Username, and click Generate Password.
 
{{note|type=error|In both cases, the generated password is different from your Gerrit password.}}
 
Make sure the .netrc file is readable only by you.
 
<syntaxhighlight lang="shell-session">
you@yourlaptop:~$ chmod go-rwx .netrc
</syntaxhighlight>
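
To confirm that the permissions took effect, a check along these lines can help. This is a sketch only: the function name is hypothetical, and it assumes GNU <code>stat</code> (as on Debian); the BSD/macOS <code>stat</code> takes different flags.

<syntaxhighlight lang="bash">
# Hypothetical helper: succeed only if FILE grants nothing to group or other.
netrc_is_private() {
    local file="$1" mode
    mode=$(stat -c '%a' "$file") || return 2   # GNU stat prints e.g. "600"
    case "$mode" in
        # The last two octal digits are the group and other permissions;
        # a mode of 000 is printed as a bare "0".
        0|*00) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example: netrc_is_private ~/.netrc || echo "fix the permissions on ~/.netrc"
</syntaxhighlight>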
 
'''Clone or update <code>mediawiki/tools/release</code>.'''
 
<syntaxhighlight lang="shell-session">
USERNAME@yourlaptop:~$ git clone https://gerrit.wikimedia.org/r/mediawiki/tools/release
</syntaxhighlight>
 
To run branch.py you need to have the pygerrit2 library installed for Python3. In Debian 10 (buster), the python3-pygerrit2 package works.
 
====Create the new branch in Gerrit====
 
<syntaxhighlight lang="shell-session">
you@yourlaptop:~/release/make-release/ $ ./branch.py --core --core-bundle wmf_core --bundle wmf_branch --branchpoint HEAD --core-version [VERSION] [WMF BRANCH]
</syntaxhighlight>
 
<syntaxhighlight lang="shell-session">
you@yourlaptop:~/release/make-release/ $ ./branch.py --core --core-bundle wmf_core --bundle wmf_branch --branchpoint HEAD --core-version 1.34.0-wmf.0 wmf/1.34.0-wmf.0
</syntaxhighlight>
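
Because the version string appears twice in that command, a small wrapper can reduce the chance of a typo. The sketch below only prints the command. The function name is hypothetical, and the version pattern is inferred from the examples on this page (such as 1.34.0-wmf.0) rather than from any official specification.

<syntaxhighlight lang="bash">
# Hypothetical helper: print (not run) the branch.py invocation for a version.
branch_command() {
    local version="$1"
    # Versions on this page look like 1.34.0-wmf.0 or 1.35.0-wmf.14.
    if ! printf '%s\n' "$version" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+-wmf\.[0-9]+$'; then
        echo "unexpected version format: $version" >&2
        return 1
    fi
    echo "./branch.py --core --core-bundle wmf_core --bundle wmf_branch" \
         "--branchpoint HEAD --core-version $version wmf/$version"
}
</syntaxhighlight>

Review the printed command before running it from <code>release/make-release/</code>.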
 
In {{irc|wikimedia-operations}}, drop a quick log note that you've kicked off the branch process so that others know it's underway, ''e.g.'':
 
<syntaxhighlight lang="irc">
!log 1.35.0-wmf.14 was branched at fb16374c5bdb9d14729f358fb81638fc91640b4f for T233862
</syntaxhighlight>
 
The script will create a release patch, [https://gerrit.wikimedia.org/r/c/mediawiki/core/+/564687 like this one], under your gerrit account. You must C+2 this, and wait for it to merge, to proceed.
 
====tmux or screen====
 
Now that the branch has been cut on your local machine, the remainder of the work will be done on the deployment host: '''deploy1001.eqiad.wmnet'''
 
Some scripts run for 10-60 minutes so consider using tmux or screen.
 
If you prefer tmux:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:~$ tmux new -s train
...
USERNAME@deploy1001:~$ exit
</syntaxhighlight>
 
If you need to leave in the middle you can do <code>ctrl-b d</code> to detach and <code>tmux a -t train</code> to attach.
 
If you prefer screen:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:~$ screen -D -RR train
...
USERNAME@deploy1001:~$ exit
</syntaxhighlight>
 
If you need to leave in the middle you can do <code>ctrl-a d</code> to detach and <code>screen -r train</code> to attach.
 
In either the tmux or the screen session, you'll want to start an ssh-agent and load your local key there:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:~$ eval $(ssh-agent)
USERNAME@deploy1001:~$ ssh-add .ssh/id_ed25519
</syntaxhighlight>
 
====Clone new branch====
This command will create a new <code>/srv/mediawiki-staging/php-[VERSION]</code> directory:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ scap prep [VERSION]
</syntaxhighlight>
 
Example:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ scap prep 1.34.0-wmf.0
</syntaxhighlight>
 
This should only take a couple of minutes.
 
====Apply security patches====
 
*Patches should be named sequentially in the order that they will cleanly apply (e.g. <code>01-T[NUMBER].patch</code>, <code>02-T[NUMBER].patch</code>)
*Check and apply each patch in both <code>/srv/patches/[VERSION]/core</code> and <code>/srv/patches/[VERSION]/extensions/[NAME]</code> to the new core checkout and extensions, respectively.
 
Check existing patches:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:~$ tree /srv/patches/[VERSION]
/srv/patches/[VERSION]
├── core
│  ├── 01-T[NUMBER].patch
│  └── 02-T[NUMBER].patch
└── extensions
    └── [EXTENSION]
        ├── 01-T[NUMBER].patch
        └── 02-T[NUMBER].patch
</syntaxhighlight>
 
=====Core=====
 
*You can check a core patch to see if it will apply cleanly with
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/php-[VERSION]$ git apply --check --3way /srv/patches/[VERSION]/core/[NUMBER]-T[NUMBER].patch
</syntaxhighlight>
 
*If the patch checks out, apply and commit it with
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/php-[VERSION]$ git am --3way /srv/patches/[VERSION]/core/[NUMBER]-T[NUMBER].patch
</syntaxhighlight>
 
=====Extension=====
 
* For an extension:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/php-[VERSION]/extensions/[EXTENSION]$ git apply --check --3way /srv/patches/[VERSION]/extensions/[EXTENSION]/[NUMBER]-T[NUMBER].patch
 
USERNAME@deploy1001:/srv/mediawiki-staging/php-[VERSION]/extensions/[EXTENSION]$ git am --3way /srv/patches/[VERSION]/extensions/[EXTENSION]/[NUMBER]-T[NUMBER].patch
</syntaxhighlight>
 
*If the patch fails to apply, investigate whether it's due to a conflict (<code>git status</code>) or the patch having been merged since the new branch cut (search <code>git log</code> for the commit, etc.). If it turns out to be the latter, remove the patch file from the <code>/srv/patches/[VERSION]</code> directory.
*If you need extra help, contact Security Team ([https://wikimediafoundation.org/role/staff-contractors/ Wikimedia Foundation], [https://www.mediawiki.org/wiki/Wikimedia_Security_Team MediaWiki], [https://office.wikimedia.org/wiki/Contact_list#Security Office Wiki]), currently {{ircnick|bawolff|Brian}} and {{ircnick|Reedy|Sam}} in IRC.
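
When there are several patches, the dry-run check can be looped over a directory in filename order. This is a sketch, not part of the deployment tooling: the function name is hypothetical, and each patch is checked independently against the current checkout, so a patch that depends on an earlier, not yet applied patch may be reported as failing.

<syntaxhighlight lang="bash">
# Hypothetical helper: dry-run every *.patch file in a directory.
# Usage: check_patches /absolute/patch/dir /path/to/git/checkout
# The patch directory must be absolute because git -C changes directory first.
check_patches() {
    local patch_dir="$1" repo="$2" patch failed=0
    # The glob expands in sorted order, so 01-... is checked before 02-...
    for patch in "$patch_dir"/*.patch; do
        [ -e "$patch" ] || continue    # the directory contains no patches
        if git -C "$repo" apply --check --3way "$patch" 2>/dev/null; then
            echo "OK   $(basename "$patch")"
        else
            echo "FAIL $(basename "$patch")"
            failed=1
        fi
    done
    return "$failed"
}
</syntaxhighlight>

For example, <code>check_patches /srv/patches/[VERSION]/core /srv/mediawiki-staging/php-[VERSION]</code> lists the core patches that still need attention. Nothing is applied, so <code>git am --3way</code> is still required afterwards.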
 
====Create patches to update wikiversions.json====
 
Create group0 to [VERSION] patch:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ scap update-wikiversions group0 [VERSION]
USERNAME@deploy1001:/srv/mediawiki-staging/$ git add wikiversions.json
USERNAME@deploy1001:/srv/mediawiki-staging/$ git commit -m "Group0 to [VERSION]"
</syntaxhighlight>
 
Example:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ scap update-wikiversions group0 1.34.0-wmf.0
USERNAME@deploy1001:/srv/mediawiki-staging/$ git add wikiversions.json
USERNAME@deploy1001:/srv/mediawiki-staging/$ git commit -m "Group0 to 1.34.0-wmf.0"
</syntaxhighlight>
 
====Send staged patches to Gerrit for review====
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ git push origin HEAD:refs/for/master/[VERSION]
</syntaxhighlight>
 
Example:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ git push origin HEAD:refs/for/master/1.34.0-wmf.0
</syntaxhighlight>
 
====Discard changes to working directory and index====
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ git reset --hard origin/master
</syntaxhighlight>
 
====Clean up old stuff====
 
[[:mw:MediaWiki 1.34/Roadmap]] is a good place to find when a branch was created.
 
List all branches:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ find . -maxdepth 1 -type d -name 'php-*' -print
</syntaxhighlight>
 
Find old branches, more than 7 days old:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ find . -mindepth 2 -maxdepth 2 -type f -path './php-*/README' -ctime +7 -exec dirname {} \;
</syntaxhighlight>
 
For all branches more than 7 days old, drop everything with:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ scap clean --delete [some old version from find -ctime +7 output above]
</syntaxhighlight>
 
Example:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ scap clean --delete 1.34.0-wmf.0
</syntaxhighlight>
 
Active branches are visible on the [https://tools.wmflabs.org/versions/ Wikimedia MediaWiki versions] page.
 
'''Deleting a branch is a full sync of that directory, and can take 10-15 minutes each.'''
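
The <code>find</code> command above prints directory names such as <code>./php-1.34.0-wmf.0</code>, while <code>scap clean --delete</code> takes the bare version. A sketch that bridges the two follows. The function names are hypothetical, and the script only lists candidates without deleting anything.

<syntaxhighlight lang="bash">
# Turn "./php-1.34.0-wmf.0" style paths (one per line on stdin) into versions.
php_dirs_to_versions() {
    sed 's|.*/php-||'
}

# Hypothetical helper: list versions whose README is older than DAYS (default 7).
# Usage: list_old_versions /srv/mediawiki-staging [DAYS]
list_old_versions() {
    local staging="$1" days="${2:-7}"
    find "$staging" -mindepth 2 -maxdepth 2 -type f -path '*/php-*/README' \
        -ctime +"$days" -exec dirname {} \; | php_dirs_to_versions
}
</syntaxhighlight>

Compare the output with the active branches before passing any version to <code>scap clean --delete</code>.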
 
====Sync to cluster and verify on testwiki====
 
* Edit <code>/srv/mediawiki-staging/wikiversions.json</code> and set <code>testwiki</code> to <code>php-[VERSION]</code>
*Do not commit and push to Gerrit, only make this change locally on the deployment server
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ vim wikiversions.json
</syntaxhighlight>
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ git diff
...
-    "testwiki": "php-[VERSION-1]",
+    "testwiki": "php-[VERSION]",
...
</syntaxhighlight>
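
Besides reading the diff, the edited entry can be checked directly. This sketch assumes <code>wikiversions.json</code> keeps one <code>"wiki": "php-..."</code> pair per line, as the diff above suggests. The function name is hypothetical, and a real JSON parser would be more robust.

<syntaxhighlight lang="bash">
# Hypothetical helper: print the php-[VERSION] value recorded for one wiki.
# Usage: wiki_version wikiversions.json testwiki
wiki_version() {
    local file="$1" wiki="$2"
    # Match a line of the form:    "testwiki": "php-1.34.0-wmf.0",
    sed -n "s/^[[:space:]]*\"${wiki}\":[[:space:]]*\"\\(php-[^\"]*\\)\".*/\\1/p" "$file"
}
</syntaxhighlight>

Running <code>wiki_version wikiversions.json testwiki</code> in <code>/srv/mediawiki-staging</code> should print <code>php-[VERSION]</code> after the edit and the previous version again once the local change has been reverted.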
 
*Run [[scap]] to (re)build localization caches and sync changes across the cluster.
*🐌 Note: this step may take on the order of 70-80 minutes.
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ scap sync "testwiki to php-[VERSION] and rebuild l10n cache"
</syntaxhighlight>
 
Example:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ scap sync "testwiki to php-1.34.0-wmf.0 and rebuild l10n cache"
</syntaxhighlight>
 
*Verify version change on [https://test.wikipedia.org/wiki/Special:Version testwiki] (Installed software, Product: MediaWiki, Version: [VERSION]) and l10n cache ([https://test.wikipedia.org/wiki/Special:Version Special:Version] should not look like [https://test.wikipedia.org/wiki/Special:Version?uselang=qqx Special:Version?uselang=qqx])
 
This can take well over an hour. Opening or reloading the version page on testwiki after the scap sync command can take a minute or two.
 
*Revert local changes
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ git checkout -- wikiversions.json
</syntaxhighlight>
 
====Update deploy notes====
 
*Deploy notes are automatically generated by the [https://integration.wikimedia.org/ci/job/train-deploy-notes Train Deploy Notes] Jenkins job after you cut the branch
*Be sure to check that the appropriate Changelog was created at <code><nowiki>https://www.mediawiki.org/wiki/MediaWiki_[VERSION]/Changelog</nowiki></code>. Example: [https://www.mediawiki.org/wiki/MediaWiki_1.34/wmf.4/Changelog MediaWiki 1.34/wmf.4/Changelog]
 
====Wait for deploy window====
All of the changes above can be done at any time prior to the actual deployment window.
 
===During the deploy window===
 
====Switch group0 wikis to [VERSION]====
 
*CR+2 <code>group0 to [VERSION]</code> patch in Gerrit that you submitted earlier
*Wait for Gerrit/Zuul/Jenkins to merge the patch(es)
* Pull patch(es) to deployment server
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ git fetch
</syntaxhighlight>
 
*Check diff to ensure it is what you expect (this should show a bunch of version changes in wikiversions.json for group0 wikis)
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ git diff HEAD..origin/master
</syntaxhighlight>
 
*Apply changes
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ git rebase origin/master
</syntaxhighlight>
 
*Sync the change across the cluster
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ scap sync-wikiversions "group0 to [VERSION]"
</syntaxhighlight>
 
Example:
 
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ scap sync-wikiversions "group0 to 1.34.0-wmf.0"
</syntaxhighlight>
</syntaxhighlight>
 
|-
*Verify that [[:mw:Special:Version|mediawikiwiki]] switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
| 1-1
*Monitor irc and [[Logstash|logstash]] and/or [[Wikimedia_binaries#logspam-watch|logspam-watch]] for problems, see [[#Places to Watch for Breakage]]
|'''Verify production has indeed switched'''
 
|[[wikt:Special:Version|English Wiktionary]]
====Update roadmap====
| colspan="2" |Verify that [[wikt:Special:Version|the English Wiktionary]] (and other group1 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
 
|-
*Change the <code>Deployed to group</code> (if you're using VisualEditor) or the 3rd parameter of the <code>WMFReleaseTableRow</code> template (if you're using the wikitext editor) to <code>0</code> (deployed to group0) at [[:mw:MediaWiki 1.35/Roadmap]].
| 1-2
 
|'''Monitor production logs'''
For wikitext editor, change
|logstash ''etc.''
 
| colspan="2" |Monitor irc and [[logstash]] and/or [[Wikimedia binaries#logspam-watch|logspam-watch]] for problems, see [[#Places to Watch for Breakage]]
<syntaxhighlight lang="text">
|-
|1-3
|'''Update roadmap page'''
|[[mw:MediaWiki 1.40/Roadmap]]
|Change the <code>Deployed to group</code> (if you're using VisualEditor) or the 3rd parameter of the <code>WMFReleaseTableRow</code> template (if you're using the wikitext editor) to <code>1</code> (deployed to group1)
|<syntaxhighlight lang="text">
{{WMFReleaseTableHead}}
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|[VERSION]|[DATE]|}}
{{WMFReleaseTableRow|12|2018-07-10|1}}
...
...
{{WMFReleaseTableFooter}}
{{WMFReleaseTableFooter}}
</syntaxhighlight>
</syntaxhighlight>
 
|}
to
===Thursday: group{0,1} to all deploy===
 
{| class="wikitable"
<syntaxhighlight lang="text">
!
{{WMFReleaseTableHead}}
!Step
{{WMFReleaseTableRow|[VERSION]|[DATE]|0}}
!host
...
!command
{{WMFReleaseTableFooter}}
!example
|-
|2-0
|'''Create and auto-merge/deploy the group2 patch'''
|deploy1001
| colspan="2" |<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging/$ scap deploy-promote all
Promote all from [PREVIOUS-VERSION] to [VERSION] [y/N]
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
</syntaxhighlight>
</syntaxhighlight>
 
|-
Example:
|2-1
 
|'''Verify production has indeed switched'''
<syntaxhighlight lang="text">
|[[w:Special:Version|English Wikipedia]]
| colspan="2" |Verify that [[w:Special:Version|the English Wikipedia]] (and other group2 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
|-
| 2-2
|'''Monitor production logs'''
|logstash ''etc.''
| colspan="2" |Monitor irc and [[logstash]] and/or [[Wikimedia binaries#logspam-watch|logspam-watch]] for problems, see [[#Places to Watch for Breakage]]
|-
|2-3
|'''Update roadmap page'''
|[[mw:MediaWiki 1.40/Roadmap]]
|Change the <code>Deployed to group</code> (if you're using VisualEditor) or the 3rd parameter of the <code>WMFReleaseTableRow</code> template (if you're using the wikitext editor) to <code>2</code> (deployed to all)
|<syntaxhighlight lang="text">
{{WMFReleaseTableHead}}
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|0}}
{{WMFReleaseTableRow|12|2018-07-10|2}}
...
...
{{WMFReleaseTableFooter}}
{{WMFReleaseTableFooter}}
</syntaxhighlight>
</syntaxhighlight>
|}


==== Terminate ssh-agents ====
==Breakage==


Terminate the ssh-agent you started earlier so it doesn't linger on after you log out:
There will be times when this process does not go smoothly. There are [[Deployments/Holding_the_train|guidelines]] for what to do when that happens.


<syntaxhighlight lang="shell-session">
In general, '''if an unexplained error occurs within 1 hour of a train deployment, always roll back the train'''. Rolling back is especially important when there are several possible causes for ongoing issues, because it eliminates the train as one of them.
pgrep -u "$USER" -laf ssh-agent # list all of your ssh-agent processes
pkill -u "$USER" -f ssh-agent  # kill all your ssh-agent processes
pgrep -u "$USER" -laf ssh-agent # did they all die?
</syntaxhighlight>


On each subsequent day of the train, you need to start a new ssh-agent and kill it afterwards.
===Rollback===


==Wednesday: group0 to group1 deploy==
Rolling back a wikiversion change should be quick. Roll back production before you send patches up to Gerrit, since waiting on Jenkins may take a while:


==== Switch group1 wikis to [VERSION]====
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ git revert $(git log -1 --format=%H -- wikiversions.json)
USERNAME@deploy1001:/srv/mediawiki-staging$ scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'


Use the <code>release/bin/deploy-promote</code> script to update <code>wikiversions.json</code>
# Now that you've synced the revert, push patches up to gerrit, you have to run git commit --amend to get the changeid
 
# Ideally, you should also add the train blocker task id to the Bug: field for this commit
<syntaxhighlight lang="shell-session">
USERNAME@deploy1001:/srv/mediawiki-staging$ git commit --amend
USERNAME@deploy1001:~$ ./release/bin/deploy-promote
USERNAME@deploy1001:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=[VERSION],l=Code-Review+2
Promote group1 from [PREVIOUS-VERSION] to [VERSION] [y/N]
</syntaxhighlight>
</syntaxhighlight>


The script automatically applies Code-Review +2 to the patch in Gerrit. Once CI has merged the patch, hit enter at the 2nd prompt.
Example:


<syntaxhighlight lang="shell-session">
<syntaxhighlight lang="shell-session">
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
USERNAME@deploy1001:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=1.34.0-wmf.0,l=Code-Review+2
</syntaxhighlight>
</syntaxhighlight>


After the script run is complete, group1 wikis should be running [VERSION].
*Wait for the patch to merge, then fetch it back down to the deployment server


The above should take about five minutes, including the waiting time for Gerrit/CI.
*[[#Update roadmap]].


====Update roadmap====
===Places to Watch for Breakage===


*Change the <code>Deployed to group</code> (if you're using VisualEditor) or the 3rd parameter of the <code>WMFReleaseTableRow</code> template (if you're using the wikitext editor) to <code>1</code> (deployed to group1) at [[:mw:MediaWiki 1.35/Roadmap]].
Train deployers should check for breakage while rolling out the train, since they are effectively the first line of defense for train deploys.


For wikitext editor, change
Given limited resources, it is not possible to monitor every dashboard during the train. A limited set of signals is actively monitored, and a much larger set may be monitored as needed.


<syntaxhighlight lang="text">
====Places we monitor====
{{WMFReleaseTableRow|[VERSION]|[DATE]|0}}
These are the places Release Engineering actively monitors during the train.
</syntaxhighlight>


to
*IRC
**Primary channel is {{irc|wikimedia-operations}}. This is where official deployment communications happen, alerts are broadcast, etc.
**For more channels see [[mw:MediaWiki_on_IRC|MediaWiki on IRC]] and [[metawiki:IRC/Channels|IRC/Channels]]


<syntaxhighlight lang="text">
*Logs
{{WMFReleaseTableRow|[VERSION]|[DATE]|1}}
**Current [[mwlog]] ([[mwlog1001]] or [[mwlog2002]], depending on primary datacenter):
</syntaxhighlight>
***[[Wikimedia_binaries#logspam-watch|logspam-watch]]
***Logfiles can be found in <code>/srv/mw-log</code>
**Logstash
***[https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors mediawiki-errors] dashboard gives the full firehose of almost all errors
***[https://logstash.wikimedia.org/app/dashboards#/view/c7013c90-a487-11ec-be91-b3435f0c0c49 MediaWiki New Errors ECS] is a workboard with known issues filtered out, useful for surfacing new breakage
**See the [[phab:tag/wikimedia-production-error/|Wikimedia-production-error workboard]] for known issues


Example:
*[https://grafana.wikimedia.org/ Grafana]
**[https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-1h&to=now&refresh=30s Application Servers RED Dashboard]


<syntaxhighlight lang="text">
====Other places to look====
{{WMFReleaseTableRow|12|2018-07-10|1}}
</syntaxhighlight>


==Thursday: group{0,1} to all deploy==
These links are not actively monitored by Release Engineering, but may be useful for troubleshooting and investigation of problems with the train.


==== Switch all wikis to [VERSION]====
*Logstash [https://logstash.wikimedia.org/app/dashboards#/view/AXDBY8Qhh3Uj6x1zCF56 mw-client-errors] dashboard
**New errors appearing more than 1000 times in a 12 hour period should be considered blockers
**See also [https://grafana.wikimedia.org/d/000000566/overview?viewPanel=16&orgId=1 Grafana dashboard] with summary of average error rate over time
*[https://grafana.wikimedia.org/ Grafana]
**[https://grafana.wikimedia.org/d/000000503/varnish-http-errors?refresh=5m&orgId=1 Varnish http-errors dashboard] (HTTP 5XX % should have 3+ 0s after the decimal point, e.g. 0.0001%)
**[https://grafana.wikimedia.org/d/000000612/frontend-responses-nginx-vs-varnish?orgId=1&from=now-15m&to=now Frontend Responses NGINX vs Varnish]
**[https://grafana.wikimedia.org/d/000000102/production-logging Production Logging]
**[https://grafana.wikimedia.org/d/000000566/overview?panelId=15&fullscreen&orgId=1&from=now-7d&to=now Minerva Client Errors] - Browser JS errors count (only wikipedias on mobile)
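The two numeric rules of thumb above (a new client error seen more than 1000 times in 12 hours is a blocker; the HTTP 5XX percentage should have at least three zeros after the decimal point) can be written as simple checks. This is only a sketch: the function names are hypothetical, and the inputs are assumed to be numbers read off the dashboards by hand, since no dashboard API is assumed here.

```shell
# Hypothetical helpers; inputs are values read manually from the dashboards.

# Succeeds if a new client error count over a 12 hour period
# exceeds the 1000-occurrence blocker threshold.
is_client_error_blocker() {
    local count_12h="$1"
    [ "$count_12h" -gt 1000 ]
}

# Succeeds if the HTTP 5XX percentage has at least three zeros after the
# decimal point, i.e. is below 0.001 (0.0001 passes, 0.002 does not).
is_5xx_rate_healthy() {
    local percent="$1"
    awk -v p="$percent" 'BEGIN { exit !(p < 0.001) }'
}
```

For example, <code>is_5xx_rate_healthy 0.0001 && echo ok</code> would print <code>ok</code>, while a reading of 0.002 would fail the check.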


The Thursday deploy is very similar to the Wednesday deploy; the only procedural difference is the target group.
===If the train is blocked===


Use the <code>release/bin/deploy-promote all</code> script to update <code>wikiversions.json</code>
*A task will be assigned to you, for example [[phab:T191059|T191059]] (1.32.0-wmf.13 deployment blockers) (you can see that week's task at https://train-blockers.toolforge.org)
*Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.


<syntaxhighlight lang="shell-session">
'''Checklist'''
USERNAME@deploy1001:~$ ./release/bin/deploy-promote all
Promote all from [PREVIOUS-VERSION] to [VERSION] [y/N]
</syntaxhighlight>


The script automatically applies Code-Review +2 to the patch in Gerrit. Once CI has merged the patch, hit enter at the 2nd prompt.
If there are blocking tasks, please do the following:


<syntaxhighlight lang="shell-session">
* Make sure all tasks blocking the train are set to <code>UBN!</code> priority in Phabricator
Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions
*Comment on the task asking for an ETA, or whether the problem can be solved by reverting a recent commit.
</syntaxhighlight>
*Send e-mail to:
**[[mail:ops|ops@lists.wikimedia.org]]
**[[mail:wikitech-l|wikitech-l@lists.wikimedia.org]]
** Ping private [https://app.slack.com/client/T024KLHS4/C01R06P8D1B/ #engineering-all Slack channel]
**Subject: <code>[Train] {version} status update</code>
**Body<syntaxhighlight lang="text">The {version} version of MediaWiki is blocked[0].


After the script run is complete, '''all wikis''' should be running [VERSION].
The new version is deployed to {group(s){0,1,2}}[1], but can proceed no
further until these issues are resolved:


====Update roadmap====
* {Phab task name} - {phab task link}


*Change the <code>Deployed to group</code> (if you're using VisualEditor) or the 3rd parameter of the <code>WMFReleaseTableRow</code> template (if you're using the wikitext editor) to <code>2</code> (deployed to all wikis) at [[:mw:MediaWiki 1.35/Roadmap]].
Once these issues are resolved train can resume. If these issues are
resolved on a Friday the train will resume Monday.


For wikitext editor, change
Thank you for your help resolving these issues!


<syntaxhighlight lang="text">
-- Your humble train toiler
{{WMFReleaseTableRow|[VERSION]|[DATE]|1}}
</syntaxhighlight>


to
[0]. <{link to phab task for train}>
[1]. <https://versions.toolforge.org/></syntaxhighlight>
*Add relevant people (see [[mw:Developers/Maintainers|Developers/Maintainers]]) to the blocking task
*Ping relevant people in IRC
*Once train is unblocked be sure to thank the folks who helped unblock it


<syntaxhighlight lang="text">
=== Troubleshooting automated jobs ===
{{WMFReleaseTableRow|[VERSION]|[DATE]|2}}
{| class="wikitable"
</syntaxhighlight>
|+ Troubleshooting pre-sync failure
! What you're seeing
! Likely problem
! How to fix it
|-
| You received an email that indicates the [https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/ automated branch cut job] has failed.
| The job has failed.
| Follow the link in the email to the failed build. Inspect the console and continue below to troubleshoot.
|-
|The failed build console includes the message <code><url> was rejected by a test failure</code>
|The branch-cut change for <code>mediawiki/core</code> has failed in CI.
|Follow the link to the change in Gerrit. Remove any existing +2 vote and re-vote +2 to trigger gate-and-submit. If the change is merged, all is well (but you should report the flaky behavior). If it fails again, continue below to troubleshoot.
|-
| The branch-cut change has failed in CI again (above).
| This is a real test failure.
| Yell for help from developers in Slack (#engineering-all) and/or on IRC (#wikimedia-releng ?). After a fix has been merged into the mainline branch and backported to the version branch, click rebuild last in Jenkins to rerun the branch-cut job.
|-
| You received an email with subject line ''FAIL: train-presync''
| The systemd timer that runs <code>scap stage-train auto</code> has failed.
| Continue below to troubleshoot.
|-
| The email contains <code>.gitmodules does not exist. Did the train branch commit get merged?</code>.
| The [https://releases-jenkins.wikimedia.org/job/Automatic%20branch%20cut/ automated branch cut job] has failed.
| Head to the top of this table and troubleshoot the branch cut failure. Once you've solved the issue, re-run <code>scap stage-train --yes auto</code> on the deployment server.
|-
| The email contains <code>ERROR: git am: error: Failed to merge in the changes</code>.
| Security patches have failed to apply cleanly.
| Ping Security Team on the [https://phabricator.wikimedia.org/T276237 Currently Deployed Security Patches task] in Phabricator or on Slack. Once they've resolved the issue, re-run <code>scap stage-train --yes auto</code> on the deployment server.
|-
| The email contains <code>ssh: connect to host <host> port 22: Connection timed out</code>.
| ?
| ?
|-
| The email contains <code>error: insufficient permission for adding an object to repository database .git/objects</code>.
| ?
| ?
|-
| Something else.
| ???
| Get help from your backup conductor and fellow RelEngineers to troubleshoot the failure. '''Once you have solved the issue, be sure to update this section with: what you saw, the root problem, how you fixed it.'''
|}


Example:
===Incident documentation===


<syntaxhighlight lang="text">
*If there were problems during the train, follow instructions at [[Incident documentation]] on incident reports and post-mortem review.
{{WMFReleaseTableRow|12|2018-07-10|2}}
*Use <code>Create report</code> form to create a new page, <code>train-[VERSION]</code>. Example: [[Incident documentation/20181212-Train-1.33.0-wmf.8]].
</syntaxhighlight>
*For Timeline section, events from [https://sal.toolforge.org/production SAL] and Phabricator task are a good start.


==Incident documentation==
== See also ==
*For information about the current status of the versions deployed to the various wikis, see https://versions.toolforge.org/


*If there were problems during the train, follow instructions at [[Incident documentation]] on incident reports and post-mortem review.
==Footnotes==
*Use <code>Create report</code> form to create a new page, <code>train-[VERSION]</code>. Example: [[Incident documentation/20181212-Train-1.33.0-wmf.8]].
<references />
*For Timeline section, events from [https://tools.wmflabs.org/sal/production SAL] and Phabricator task are a good start.


[[Category:How-To]]
[[Category:Deployment]]

Latest revision as of 18:58, 8 May 2023



Weekly steps

Monday: Sync up with your deployment partner

As of October 2019, two people are assigned to each week's train: one as primary and one as backup. These are rough guidelines for sharing the work, and should be improved as we learn more.

  • On Monday, communicate with your partner and establish how you'll collaborate over the course of the week.
    • Posting updates on IRC while your partner is working, and on the train blocker ticket if they're offline, seems to be a useful pattern.
    • Liberal use of video chat for pairing on hard problems is encouraged.
    • It seems to work well to have the primary do the work of cutting the branch, syncing wikis, etc., while the backup keeps an eye on logs, works on improvements to deploy tooling, and is generally an extra pair of eyes for the whole process.
    • If you are in doubt about any part of the process and it's during your partner's working hours, consult them first and get their help in resolving your questions.
  • If one member of the pair is in the European window and one is in the American window, both train deployment windows should be reserved on the Deployments calendar. This gives a backup deployer a defined window for moving the train forward outside the primary's working hours, if it becomes necessary.
  • If the train is blocked or there are any other issues, communicate the transfer of responsibility on the train blocker ticket by assigning it to the responsible party and leaving a note.

Tuesday: New branch creation and deploy

Before the deploy window

All pre-deploy steps have been automated.

  • Branch cut happens on releases-jenkins
  • scap stage-train auto is run by a cron job

Refer to #Troubleshooting_automated_jobs if something goes wrong.

During the deploy window
Step 0-0: Create and auto-merge/deploy the group0 patch
  • Host: deploy1002
  • Command:
    USERNAME@deploy1002:/srv/mediawiki-staging/$ scap deploy-promote group0
    Promote group0 from [PREVIOUS-VERSION] to [VERSION] [y/N]
    Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions

Step 0-1: Verify production has indeed switched
  • Host: MediaWiki.org
  • Command: Verify that mediawikiwiki has switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)

Step 0-2: Monitor production logs
  • Host: logstash etc.
  • Command: Monitor irc and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage

Step 0-3: Update roadmap page
  • Host: mw:MediaWiki 1.40/Roadmap
  • Command: Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 0 (deployed to group0)
  • Example:
    {{WMFReleaseTableHead}}
    {{WMFReleaseTableRow|12|2018-07-10|0}}

Wednesday: group0 to group1 deploy

Meta / coordination

Attend the Train Log Triage meeting with members of the Core Platform Team and others.

Step 1-0: Create and auto-merge/deploy the group1 patch
  • Host: deploy1001
  • Command:
    USERNAME@deploy1001:/srv/mediawiki-staging/$ scap deploy-promote group1
    Promote group1 from [PREVIOUS-VERSION] to [VERSION] [y/N]
    Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions

Step 1-1: Verify production has indeed switched
  • Host: English Wiktionary
  • Command: Verify that the English Wiktionary (and other group1 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)

Step 1-2: Monitor production logs
  • Host: logstash etc.
  • Command: Monitor irc and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage

Step 1-3: Update roadmap page
  • Host: mw:MediaWiki 1.40/Roadmap
  • Command: Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 1 (deployed to group1)
  • Example:
    {{WMFReleaseTableHead}}
    {{WMFReleaseTableRow|12|2018-07-10|1}}
    ...
    {{WMFReleaseTableFooter}}

Thursday: group{0,1} to all deploy

Step 2-0: Create and auto-merge/deploy the group2 patch
  • Host: deploy1001
  • Command:
    USERNAME@deploy1001:/srv/mediawiki-staging/$ scap deploy-promote all
    Promote all from [PREVIOUS-VERSION] to [VERSION] [y/N]
    Now wait for jenkins to merge the patch, then press enter to continue with git pull && scap sync-wikiversions

Step 2-1: Verify production has indeed switched
  • Host: English Wikipedia
  • Command: Verify that the English Wikipedia (and other group2 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)

Step 2-2: Monitor production logs
  • Host: logstash etc.
  • Command: Monitor irc and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage

Step 2-3: Update roadmap page
  • Host: mw:MediaWiki 1.40/Roadmap
  • Command: Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 2 (deployed to all)
  • Example:
    {{WMFReleaseTableHead}}
    {{WMFReleaseTableRow|12|2018-07-10|2}}
    ...
    {{WMFReleaseTableFooter}}
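
For those editing the roadmap in the wikitext editor, the change to the 3rd parameter of WMFReleaseTableRow can be illustrated with a small text filter. This is a sketch only: the function name is hypothetical, it assumes the row sits on a single line with exactly three parameters, and it works on text you paste in rather than editing the wiki page itself.

```shell
# Hypothetical helper: rewrite the third parameter (deployment group) of a
# WMFReleaseTableRow line read from stdin. $1 is the new group: 0, 1, or 2.
set_deployed_group() {
    local group="$1"
    # Capture everything up to the third parameter, replace the single digit
    # that follows, and keep the closing braces.
    sed -E "s/(\{\{WMFReleaseTableRow\|[^|]*\|[^|]*\|)[0-2](\}\})/\1${group}\2/"
}

# Usage: promote the example row from group1 to all wikis (group 2).
echo '{{WMFReleaseTableRow|12|2018-07-10|1}}' | set_deployed_group 2
```

Lines that do not contain a WMFReleaseTableRow template pass through unchanged, so the filter can be applied to the whole table.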

Breakage

There will be times when this process does not go smoothly. There are guidelines for what to do when that happens.

In general, if an unexplained error occurs within 1 hour of a train deployment, always roll back the train. Rolling back is especially important when there are many other possible causes in play, because it eliminates the train as one of them.
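
The 1 hour rule can be stated as a small check. This is a sketch under stated assumptions: the function name is hypothetical, and both times are assumed to be supplied as Unix timestamps (for example, taken from the SAL entry and from the first occurrence of the error in the logs).

```shell
# Hypothetical helper. Arguments are Unix timestamps (seconds since the epoch):
#   $1 = when the train deployment happened
#   $2 = when the unexplained error was first seen
# Succeeds if the error falls within 1 hour (3600 s) after the deployment.
within_rollback_window() {
    local deployed_at="$1" error_at="$2"
    local elapsed=$(( error_at - deployed_at ))
    [ "$elapsed" -ge 0 ] && [ "$elapsed" -le 3600 ]
}
```

The check only covers timing; whether an error counts as "unexplained" remains a human judgment.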

Rollback

Rolling back a wikiversion change should be quick. Roll back production before you send patches up to Gerrit, since waiting on Jenkins may take a while:

USERNAME@deploy1001:/srv/mediawiki-staging$ git revert $(git log -1 --format=%H -- wikiversions.json)
USERNAME@deploy1001:/srv/mediawiki-staging$ scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'

# Now that you've synced the revert, push the patch up to Gerrit. Run git commit --amend first to get the Change-Id.
# Ideally, you should also add the train blocker task id to the Bug: field for this commit
USERNAME@deploy1001:/srv/mediawiki-staging$ git commit --amend
USERNAME@deploy1001:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=[VERSION],l=Code-Review+2

Example:

USERNAME@deploy1001:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=1.34.0-wmf.0,l=Code-Review+2
  • Wait for the patch to merge, then fetch it back down to the deployment server

Places to Watch for Breakage

Train deployers should check for breakage while rolling out the train, because they are effectively the first line of defense for train deploys.

Given limited resources, it is not possible to monitor every dashboard during the train. A limited set of signals is actively monitored, and a much larger set may be monitored.

Places we monitor

These are the places Release Engineering actively monitors during the train.

Other places to look

These links are not actively monitored by Release Engineering, but may be useful for troubleshooting and investigation of problems with the train.

If the train is blocked

  • A task will be assigned to you, for example T191059 (1.32.0-wmf.13 deployment blockers) (you can see that week's task at https://train-blockers.toolforge.org)
  • Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.

Checklist

If there are blocking tasks, please do the following:

  • Make sure all tasks blocking the train are set to UBN! priority in Phabricator
  • Comment on the task asking for an ETA, or whether the problem can be solved by reverting a recent commit.
  • Send e-mail to:
    • ops@lists.wikimedia.org
    • wikitech-l@lists.wikimedia.org
    • Ping private #engineering-all Slack channel
    • Subject: [Train] {version} status update
    • Body
      The {version} version of MediaWiki is blocked[0].
      
      The new version is deployed to {group(s){0,1,2}}[1], but can proceed no
      further until these issues are resolved:
      
      * {Phab task name} - {phab task link}
      
      Once these issues are resolved train can resume. If these issues are
      resolved on a Friday the train will resume Monday.
      
      Thank you for your help resolving these issues!
      
      -- Your humble train toiler
      
      [0]. <{link to phab task for train}>
      [1]. <https://versions.toolforge.org/>
      
  • Add relevant people (see Developers/Maintainers) to the blocking task
  • Ping relevant people in IRC
  • Once train is unblocked be sure to thank the folks who helped unblock it
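
The status-update template in the checklist above can be filled in with a short function so that the wording stays consistent from week to week. This is a sketch, not part of the train tooling: the function name and argument order are hypothetical, and the output still has to be pasted into a mail client by hand.

```shell
# Hypothetical helper that fills in the status-update template.
#   $1 = version
#   $2 = group(s) the new version is currently deployed to
#   $3 = link to the Phabricator task for the train
#   remaining arguments = one "{Phab task name} - {phab task link}" per blocker
render_blocked_email() {
    local version="$1" deployed_to="$2" train_task="$3"
    shift 3
    printf 'Subject: [Train] %s status update\n\n' "$version"
    printf 'The %s version of MediaWiki is blocked[0].\n\n' "$version"
    printf 'The new version is deployed to %s[1], but can proceed no\n' "$deployed_to"
    printf 'further until these issues are resolved:\n\n'
    # printf repeats the format once per remaining argument.
    printf '* %s\n' "$@"
    printf '\nOnce these issues are resolved train can resume. If these issues are\n'
    printf 'resolved on a Friday the train will resume Monday.\n\n'
    printf 'Thank you for your help resolving these issues!\n\n'
    printf -- '-- Your humble train toiler\n\n'
    printf '[0]. <%s>\n' "$train_task"
    printf '[1]. <https://versions.toolforge.org/>\n'
}
```

A call might look like <code>render_blocked_email 1.32.0-wmf.13 group0 https://phabricator.wikimedia.org/T191059 '{Phab task name} - {phab task link}'</code>, with one extra argument per blocking task.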

Troubleshooting automated jobs

Troubleshooting pre-sync failure
What you're seeing | Likely problem | How to fix it
You received an email that indicates the automated branch cut job has failed. | The job has failed. | Follow the link in the email to the failed build. Inspect the console and continue below to troubleshoot.
The failed build console includes the message <url> was rejected by a test failure | The branch-cut change for mediawiki/core has failed in CI. | Follow the link to the change in Gerrit. Remove any existing +2 vote and re-vote +2 to trigger gate-and-submit. If the change is merged, all is well (but you should report the flaky behavior). If it fails again, continue below to troubleshoot.
The branch-cut change has failed in CI again (above). | This is a real test failure. | Yell for help from developers in Slack (#engineering-all) and/or on IRC (#wikimedia-releng ?). After a fix has been merged into the mainline branch and backported to the version branch, click rebuild last in Jenkins to rerun the branch-cut job.
You received an email with subject line FAIL: train-presync | The systemd timer that runs scap stage-train auto has failed. | Continue below to troubleshoot.
The email contains .gitmodules does not exist. Did the train branch commit get merged?. | The automated branch cut job has failed. | Head to the top of this table and troubleshoot the branch cut failure. Once you've solved the issue, re-run scap stage-train --yes auto on the deployment server.
The email contains ERROR: git am: error: Failed to merge in the changes. | Security patches have failed to apply cleanly. | Ping Security Team on the Currently Deployed Security Patches task in Phabricator or on Slack. Once they've resolved the issue, re-run scap stage-train --yes auto on the deployment server.
The email contains ssh: connect to host <host> port 22: Connection timed out. | ? | ?
The email contains error: insufficient permission for adding an object to repository database .git/objects. | ? | ?
Something else. | ??? | Get help from your backup conductor and fellow RelEngineers to troubleshoot the failure. Once you have solved the issue, be sure to update this section with: what you saw, the root problem, how you fixed it.
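
The table above is essentially a lookup from an error message to a likely problem, which can be sketched as a case statement. The function name is hypothetical, only the rows with a known cause are covered, and the matched strings are the ones quoted in the table; real emails may word things differently.

```shell
# Hypothetical helper: map a line from the failure email or build console to
# the "Likely problem" column of the troubleshooting table.
likely_problem() {
    case "$1" in
        *"was rejected by a test failure"*)
            echo "The branch-cut change for mediawiki/core has failed in CI." ;;
        *".gitmodules does not exist"*)
            echo "The automated branch cut job has failed." ;;
        *"git am: error: Failed to merge in the changes"*)
            echo "Security patches have failed to apply cleanly." ;;
        *)
            # Rows marked "?" in the table, and anything not listed, land here.
            echo "Unknown; get help from your backup conductor and fellow RelEngineers." ;;
    esac
}
```

Anything that falls through to the last branch is a candidate for a new row in the table once the root cause is known.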

Incident documentation

See also

Footnotes