Deployments/Holding the train: Difference between revisions

From Wikitech-static
imported>DannyS712
imported>Brennen Bearnes
m (Minor tweaks to non-emergency para.)
(11 intermediate revisions by 5 users not shown)
Line 1: Line 1:
{{Navigation MediaWiki deployment}}
{{Navigation MediaWiki deployment}}
'''Holding the deployment train''' is not something the Release Engineering team takes lightly. When [[Release Engineering|RelEng]] does hold a deployment train, we expect all engineers with relevant expertise to be focused on resolving the issue. A quick resolution is beneficial to all engineers as holding the train, counter-intuitively, can create more problems than it solves. Over time the versions of MediaWiki and extensions that are deployed to the cluster will become more widely divergent from the primary development versions (e.g. <code>master</code>) of the code.
 
'''Holding the deployment train''' is not something the Release Engineering team takes lightly. When [[Release Engineering|RelEng]] holds a deployment train for a production error, we expect all engineers with relevant expertise to be focused on resolving the issue. A quick resolution is beneficial to all engineers as holding the train, counter-intuitively, can create more problems than it solves. Over time the versions of MediaWiki and extensions that are deployed to the cluster will become more widely divergent from the primary development versions (e.g. <code>master</code>) of the code.
 
We ''may'' also pause the train for non-emergency issues which have an expected resolution in the near future. This is done at the discretion of the train conductor, and should be communicated on the blocker task.  It is appropriate for issues where user experience will be substantially improved, actions which save developers and deployers substantial work, and changes which increase deployment safety (e.g. by removing logspam).
 
__TOC__
__TOC__


== Issues that hold the train ==
== Issues that hold the train ==


This is not exhaustive list of things that would cause the train to pause or roll back. As always, it's up to the best judgment of SRE and release engineering, however, the following are representative examples of what we'd take action on.
This is a non-exhaustive list of things that would cause the train to pause or roll back. As always, it's up to the best judgment of SRE and Release Engineering, but the following are representative examples of what we'd take action on:


* Security issues
* Security issues
Line 16: Line 20:
** Page save/update time
** Page save/update time
* Major stylistic problems affecting all pages
* Major stylistic problems affecting all pages
* Significant Error-rate increases (See [[Deployments/Holding the train#Logspam|#Logspam]])
**Complete loss of UI elements critical for reading and top-level navigation on a skin which is default for mobile or desktop users (in practice one of the Vector, Vector 2022, or Minerva skins), e.g. the page has no text, all the links are gone, or styles are not loaded
** Any new error messages that occur frequently enough to be noticed in logstash will block the train.
**Issues on opt-in only skins (Timeless, Monobook, Modern, and CologneBlue) should seldom block the train, but should be fixed promptly (e.g. by the end of the week).
***Exceptions are possible where the editing experience is judged to be severely impacted for a significant fraction of edits (e.g. editing is not loading). When justifying such a blocker, please use data wherever possible.
***Purely cosmetic issues that can easily be patched via site CSS should never block the existing train but should be addressed promptly and potentially block the next train.
**For other issues, avoid basing rollback and block decisions on one individual's judgment.
***Establish a time limit on when a decision needs to be made
***Include the introducer of the bug and the product owner in the decision making, where known.
***Ideally, the product manager of the product with the regression should take responsibility for the decision.
* Error-rate increases (See [[Deployments/Holding the train#Logspam|#Logspam]])
** ''Any'' new error messages that occur frequently enough to be noticed [[Heterogeneous deployment/Train deploys#Places to Watch for Breakage|where deployers watch for breakage]] will block the train.
** If the frequency increases significantly after a deployment then it should be immediately rolled back until the error can be fixed and the branch re-deployed.
** If the frequency increases significantly after a deployment then it should be immediately rolled back until the error can be fixed and the branch re-deployed.
** Even <code>DEBUG</code> / <code>INFO</code>-level logs are a problem. Especially problematic if the frequency of the messages is high enough to put unnecessary load on the logstash servers.
** Even <code>DEBUG</code> / <code>INFO</code>-level logs are a problem. These are especially problematic if the frequency of the messages is high enough to put unnecessary load on the logstash servers.
**[https://grafana.wikimedia.org/d/000000566/overview?viewPanel=16&orgId=1 total client error rate graph is in the red zone] due to an open UBN ticket.
**Newly introduced bugs that show up on [https://logstash.wikimedia.org/app/dashboards#/view/AXDBY8Qhh3Uj6x1zCF56 the mw-client-errors dashboard] or [https://logstash.wikimedia.org/goto/3891b7bd7cb40b984a2ae05c8fe026ba mw-client-error editing dashboard] at a rate of:
***Over 100 errors in a 1 hour period
***Over 1000 errors in a 12 hr period
 
=== Deprecations ===
* '''PHP Deprecation messages''' block the ''following week's'' train.


== What happens in [[SWAT]] while the train is on hold? ==
== What happens during [[backport windows]] while the train is on hold? ==
'''Only simple config changes and emergency fixes are allowed during SWAT while we are reverted.''' This is to reduce the complexity during investigation.
'''Only simple config changes and [[Deployments/Emergencies|emergency fixes]] are allowed during backport windows while we are reverted.''' This is to reduce the complexity during investigation.


Remember, while we are reverted people are diligently diagnosing and debugging issues; any seemingly unrelated change could in fact affect their investigations.
Remember, while we are reverted people are diligently diagnosing and debugging issues; any seemingly unrelated change could in fact affect their investigations.
Line 29: Line 48:
* '''If''' a blocker was found and addressed before 3pm Pacific Tues/Wed/Thur '''THEN'''
* '''If''' a blocker was found and addressed before 3pm Pacific Tues/Wed/Thur '''THEN'''
** the planned deploy/rollout can move forward at that time (deployment schedule permitting)
** the planned deploy/rollout can move forward at that time (deployment schedule permitting)
* '''If''' the new <code>wmf.XX</code> version wasn't deployed to group2 (all wikipedias) on Thursday due to blockers '''THEN'''
* '''If''' the new <code>wmf.XX</code> version wasn't deployed to group2 (all Wikipedias) on Thursday due to blockers '''THEN'''
** If there is a fix available for deploy, RelEng will attempt to get the train back on track to ensure we adhere as closely as possible to the train schedule.
** If there is a fix available for deploy, RelEng will attempt to get the train back on track to ensure we adhere as closely as possible to the train schedule.
** An incident report will be filed to address follow-up actions and process improvements, and,
** An incident report will be filed to address follow-up actions and process improvements, and,
Line 40: Line 59:


== Train "blocker tasks" ==
== Train "blocker tasks" ==
'''What:''' For each weekly train version rollout an accompanying task is filed in Phabricator. They all live in the [[phab:project/board/2770/|#Train-Deployments]] tag.
'''What:''' For each weekly train version rollout an accompanying task is filed in Phabricator. They all live in the [[phab:project/board/2770/|#Train-Deployments]] tag. You can find the current task at https://train-blockers.toolforge.org.


'''Purpose:''' The purpose of these tasks is to track the rollout of the train especially including any blocking issues that may arise (see above). These blocking issues are filed as sub-tasks.
'''Purpose:''' The purpose of these tasks is to track the rollout of the train especially including any blocking issues that may arise (see above). These blocking issues are filed as sub-tasks.
Line 49: Line 68:


====== Priority of blocking (sub) tasks: ======
====== Priority of blocking (sub) tasks: ======
Tasks which block the train from moving forward or cause it to be rolled back are set to UBN! ("Unbreak Now!") priority as getting the train moving again should be the highest priority for the person(s)/team responsible for the code in question.
Tasks which block the train from moving forward or cause it to be rolled back are set to UBN! ("Unbreak Now!") priority, as getting the train moving again should be the highest priority for the person(s)/team responsible for the code in question.


====== Status of blocking (sub) tasks: ======
====== Status of blocking (sub) tasks: ======
Most times a blocking task must be "Resolved" in Phabricator for the train to move forward. A subset of times the task itself is not resolved because the issue has been worked around in another way, for instance when eg: a backport was prepared and merged but that backport is not yet merged in <code>master</code>. The task will normally be closed after that patch is merged into <code>master</code>.
Most times a blocking task must be "Resolved" in Phabricator for the train to move forward. A subset of times the task itself is not resolved because the issue has been worked around in another way, for instance when e.g.: a backport was prepared and merged but that backport is not yet merged in <code>master</code>. The task will normally be closed after that patch is merged into <code>master</code>.


====== Communication on blocking tasks: ======
====== Communication on blocking tasks: ======
The "[[mw:Wikimedia Release Engineering Team/Roles#Train Conductor|train conductor]]" for that week is responsible for commenting on any blocking (sub) tasks with their assumptions on status and impact, especially if they choose to move the train forward with the task not set to "Resolved" for whatever reason. The reason for this commenting (and potential over communication) is to ensure all parties are aware of all assumptions and decisions.
The "[[mw:Wikimedia Release Engineering Team/Roles#Train Conductor|train conductor]]" for that week, or the backup conductor, is responsible for commenting on any blocking (sub) tasks with their assumptions on status and impact, especially if they choose to move the train forward with the task not set to "Resolved" for whatever reason. The reason for this commenting (and potential over communication) is to ensure all parties are aware of all assumptions and decisions.


====== Maintaining the task series in Phabricator: ======
====== Maintaining the task series in Phabricator: ======
Line 61: Line 80:


== Logspam ==
== Logspam ==
[[File:Can of Spam on a log.jpg|thumb|Can of Spam on a log]]


=== What it is ===
=== What it is ===


'''#Logspam''' is the term we use to describe the category of noisy error messages in our logs. These usually do not represent actual error conditions or the errors are being ignored/purposefully not prioritized by the responsible parties (when any exist). Specific error messages that have been identified by [[MediaWikiWiki:Wikimedia Release Engineering Team|Release Engineering]] are tracked in the [[phab:tag/wikimedia-production-error/|#Wikimedia-Production-Error]] Phabricator project.
'''Logspam''' is the term we use to describe the category of noisy error messages in our logs. These don't necessarily represent user-facing error conditions, though oftentimes errors are being ignored or aren't a high priority for the responsible parties (when any exist).
 
Specific error messages that have been identified by deployers and log triagers are tracked in the [[phab:tag/wikimedia-production-error/|#Wikimedia-Production-Error]] Phabricator project.


=== Why it's a problem ===
=== Why it's a problem ===


Logspam is a problem because noisy logs make it more difficult to detect real problems quickly when looking at a log dashboard, for example [https://logstash.wikimedia.org/goto/77501d32e51555547aee4e676fdbfc15 fatalmonitor].
Logspam is a problem because noisy logs make it hard to detect problems quickly when looking at log dashboards.


All deployers need to be able to quickly detect any new problems that are introduced by their newly deployed code. If important error messages are drowned out by this logspam then they might not detect more serious issues.
All deployers need to be able to quickly detect any new problems that are introduced by their newly deployed code. If important error messages are drowned out by logspam then deployers can easily miss more serious issues. If code produces extraneous errors in production logs, then that code is considered broken, ''even if there is no immediate user-facing impact.''


=== Major Causes (and how you can fix them) ===
=== Major Causes (and how you can fix them) ===


==== Incorrectly categorized log messages ====
==== Incorrectly categorized log messages ====
The most common example of this type would be expected (or known) conditions being recorded as exceptional conditions, eg: <code>Debug</code> notices or <code>Warnings</code> being logged as <code>Errors</code>. This is incorrect use of logging and should be corrected.
The most common example of this type would be expected (or known) conditions being recorded as exceptional conditions, e.g.: <code>Debug</code> notices or <code>Warnings</code> being logged as <code>Errors</code>. This is an incorrect use of logging and should be corrected.


==== Notice "Undefined variable", "Undefined index" or "Undefined offset" ====
==== Notice "Undefined variable", "Undefined index", or "Undefined offset" ====
These are a common occurrence in PHP code. Whenever you attempt to access a variable or index of an array that doesn't exist, PHP logs a notice. '''These are coding errors and they need to be fixed'''. It might be that the input is malformed and the error is in the caller; or it might be a mistyped reference; or it might be that the key is allowed to be absent but forgot to access it conditionally.
These are a common occurrence in PHP code. Whenever you attempt to access a variable or index of an array that doesn't exist, PHP logs a notice. '''These are coding errors and they need to be fixed'''. It might be that the input is malformed and the error is in the caller; or it might be a mistyped reference; or it might be that the key is allowed to be absent but the developer forgot to access it conditionally.


== See also ==
== See also ==
* [https://tools.wmflabs.org/versions/ Current train status]
* [https://versions.toolforge.org/ Current train status]
* [https://train-blockers.toolforge.org/ Current train blocker task]


[[Category:Deployment]]
[[Category:Deployment]]

Revision as of 18:18, 23 June 2022

Deployments

Holding the deployment train is not something the Release Engineering team takes lightly. When RelEng holds a deployment train for a production error, we expect all engineers with relevant expertise to be focused on resolving the issue. A quick resolution is beneficial to all engineers as holding the train, counter-intuitively, can create more problems than it solves. Over time the versions of MediaWiki and extensions that are deployed to the cluster will become more widely divergent from the primary development versions (e.g. master) of the code.

We may also pause the train for non-emergency issues which have an expected resolution in the near future. This is done at the discretion of the train conductor, and should be communicated on the blocker task. It is appropriate for issues where user experience will be substantially improved, actions which save developers and deployers substantial work, and changes which increase deployment safety (e.g. by removing logspam).

Issues that hold the train

This is a non-exhaustive list of things that would cause the train to pause or roll back. As always, it's up to the best judgment of SRE and Release Engineering, but the following are representative examples of what we'd take action on:

  • Security issues
  • Data loss
  • Major feature regressions
    • Inability to login/logout/create account for a large portion of users
    • Inability to edit for a large portion of users
  • Performance regressions
    • Page load time
    • Page save/update time
  • Major stylistic problems affecting all pages
    • Complete loss of UI elements critical for reading and top-level navigation on a skin which is default for mobile or desktop users (in practice one of the Vector, Vector 2022, or Minerva skins), e.g. the page has no text, all the links are gone, or styles are not loaded
    • Issues on opt-in only skins (Timeless, Monobook, Modern, and CologneBlue) should seldom block the train, but should be fixed promptly (e.g. by the end of the week).
      • Exceptions are possible where the editing experience is judged to be severely impacted for a significant fraction of edits (e.g. editing is not loading). When justifying such a blocker, please use data wherever possible.
      • Purely cosmetic issues that can easily be patched via site CSS should never block the existing train but should be addressed promptly and potentially block the next train.
    • For other issues, avoid basing rollback and block decisions on one individual's judgment.
      • Establish a time limit on when a decision needs to be made
      • Include the introducer of the bug and the product owner in the decision making, where known.
      • Ideally, the product manager of the product with the regression should take responsibility for the decision.
  • Error-rate increases (See #Logspam)
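
The revision's error-rate criteria include two client-error thresholds for newly introduced bugs: over 100 errors in a 1 hour period, or over 1000 errors in a 12 hr period. The following is a minimal sketch of how such a check could be expressed. The function name, the constant, and the idea of feeding it a list of timestamps are assumptions made for illustration; this is not how the dashboards or the train tooling actually compute the rate.

```python
from datetime import datetime, timedelta

# (window, limit) pairs taken from the thresholds quoted above.
# The names below are hypothetical, not part of any Wikimedia tool.
THRESHOLDS = [
    (timedelta(hours=1), 100),
    (timedelta(hours=12), 1000),
]


def exceeds_blocker_rate(error_times, now):
    """Return True if any window holds more errors than its limit."""
    for window, limit in THRESHOLDS:
        # Count only errors that fall inside the trailing window.
        recent = [t for t in error_times if now - window <= t <= now]
        if len(recent) > limit:  # "over" the limit, so strictly greater
            return True
    return False
```

Both comparisons are strict, matching the word "over" in the thresholds: exactly 100 errors in an hour would not trip the check in this sketch.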

Deprecations

  • PHP Deprecation messages block the following week's train.

What happens during backport windows while the train is on hold?

Only simple config changes and emergency fixes are allowed during backport windows while we are reverted. This is to reduce the complexity during investigation.

Remember, while we are reverted people are diligently diagnosing and debugging issues; any seemingly unrelated change could in fact affect their investigations.

What happens next?

  • If a blocker was found and addressed before 3pm Pacific Tues/Wed/Thur THEN
    • the planned deploy/rollout can move forward at that time (deployment schedule permitting)
  • If the new wmf.XX version wasn't deployed to group2 (all Wikipedias) on Thursday due to blockers THEN
    • If there is a fix available for deploy, RelEng will attempt to get the train back on track to ensure we adhere as closely as possible to the train schedule.
    • An incident report will be filed to address follow-up actions and process improvements, and,
    • A post-mortem will be conducted.
  • If there are issues affecting performance discovered significantly after the current version of MediaWiki and extensions has been deployed to all wikis (group2, Thursday) THEN
    • The current code version will remain on servers; we will not attempt to roll back to a version more than 1 week old, and,
    • The next rollout of the following release will be at the Performance Team's discretion, and,
    • An incident report will be filed to address follow-up actions and process improvements, and,
    • A post-mortem will be conducted.

Train "blocker tasks"

What: For each weekly train version rollout an accompanying task is filed in Phabricator. They all live in the #Train-Deployments tag. You can find the current task at https://train-blockers.toolforge.org.

Purpose: These tasks track the rollout of the train, including any blocking issues that may arise (see above). Blocking issues are filed as sub-tasks.

Blocking (sub) task types:

  • A task which causes an entire revert/rollback to the previously deployed version and which must be addressed before moving forward.
  • A task which prevents the continued rollout of the new version until it is addressed.

Priority of blocking (sub) tasks:

Tasks which block the train from moving forward or cause it to be rolled back are set to UBN! ("Unbreak Now!") priority, as getting the train moving again should be the highest priority for the person(s)/team responsible for the code in question.

Status of blocking (sub) tasks:

In most cases a blocking task must be "Resolved" in Phabricator for the train to move forward. Occasionally the task itself is not resolved because the issue has been worked around in another way, for instance when a backport was prepared and merged but the same patch is not yet merged in master. The task will normally be closed after that patch is merged into master.

Communication on blocking tasks:

The "train conductor" for that week, or the backup conductor, is responsible for commenting on any blocking (sub) tasks with their assumptions on status and impact, especially if they choose to move the train forward with the task not set to "Resolved" for whatever reason. The reason for this commenting (and potential over-communication) is to ensure all parties are aware of all assumptions and decisions.

Maintaining the task series in Phabricator:

Periodically, the release manager will create batches of new tasks in Phabricator for planned upcoming MediaWiki versions. This is accomplished by running the scap task-series plugin. For documentation, see: Deployments/Blocking_Tasks

Logspam

Can of Spam on a log

What it is

Logspam is the term we use to describe the category of noisy error messages in our logs. These don't necessarily represent user-facing error conditions, though oftentimes errors are being ignored or aren't a high priority for the responsible parties (when any exist).

Specific error messages that have been identified by deployers and log triagers are tracked in the #Wikimedia-Production-Error Phabricator project.

Why it's a problem

Logspam is a problem because noisy logs make it hard to detect problems quickly when looking at log dashboards.

All deployers need to be able to quickly detect any new problems that are introduced by their newly deployed code. If important error messages are drowned out by logspam then deployers can easily miss more serious issues. If code produces extraneous errors in production logs, then that code is considered broken, even if there is no immediate user-facing impact.

Major Causes (and how you can fix them)

Incorrectly categorized log messages

The most common example of this type is an expected (or known) condition being recorded as an exceptional one, e.g. Debug notices or Warnings being logged as Errors. This is an incorrect use of logging and should be corrected.
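
As an illustration of choosing the right level, here is a minimal sketch using Python's standard logging module as a stand-in. MediaWiki code is PHP and uses its own logging setup, so the function, the channel name, and the retry scenario are all assumptions chosen only to show the distinction between a handled condition and a real error.

```python
import logging

logger = logging.getLogger("example.fetch")  # hypothetical channel name


def fetch_with_retry(fetch, attempts=3):
    """Call fetch() up to `attempts` times, retrying on TimeoutError."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts:
                # Every retry failed: this is a real error condition.
                logger.error("fetch failed after %d attempts", attempts)
                raise
            # A retry is an expected, handled condition: WARNING, not ERROR.
            logger.warning("fetch timed out (attempt %d), retrying", attempt)
```

A single timeout that is recovered on retry produces one warning and no error, so it does not add to the error counts that deployers watch. Lower-level messages can still be logspam if they are frequent enough.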

Notice "Undefined variable", "Undefined index", or "Undefined offset"

These are a common occurrence in PHP code. Whenever you attempt to access a variable or index of an array that doesn't exist, PHP logs a notice. These are coding errors and they need to be fixed. It might be that the input is malformed and the error is in the caller; or it might be a mistyped reference; or it might be that the key is allowed to be absent but the developer forgot to access it conditionally.
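
The difference between a required key and an optional one can be sketched as follows. This is only an analogy in Python, where a missing key raises KeyError rather than logging a notice; in PHP the corresponding guards would be isset() or the ?? operator. The function and parameter names are hypothetical.

```python
def page_title(params):
    """Required key: a missing 'title' is a caller error, so fail loudly."""
    if "title" not in params:
        raise ValueError("missing required parameter 'title'")
    return params["title"]


def page_limit(params, default=50):
    """Optional key: access it conditionally and fall back to a default."""
    return params.get("limit", default)
```

The first function treats absence as a bug in the caller and reports it explicitly. The second treats absence as allowed and never touches a key that may not exist.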

See also