You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Deployments/Holding the train: Difference between revisions

From Wikitech-static
imported>BryanDavis
m (Reverted edits by Ivey13608 (talk) to last revision by 20after4)
imported>Quiddity
(add cat, clarify link)
Line 1: Line 1:
__TOC__
__TOC__
{{Draft}}


Holding the deployment train is not something that should happen unless there are serious security, performance, or functionality issues. Holding the train, counter-intuitively, can create more problems than it solves as the differences between the versions of MediaWiki and extensions that are deployed to the cluster become more widely divergent from the primary development versions of the code.
'''Holding the deployment train''' is not something the Release Engineering team takes lightly. When the RelEng (Release Engineering) does hold a deployment train we expect all engineers with relevant expertise to be focused on resolving the issue. A quick resolution is beneficial to all engineers as holding the train, counter-intuitively, can create more problems than it solves. Over time the versions of MediaWiki and extensions that are deployed to the cluster will become more widely divergent from the primary development versions (eg: <code>master</code>) of the code.


== Issues that hold the train ==
== Issues that hold the train ==
Line 17: Line 16:
** Page save/update time
** Page save/update time
* Major stylistic problems affecting all pages
* Major stylistic problems affecting all pages
* Error-rate increases (See [[Deployments/Holding the train#Logspam|Logspam]])
* Error-rate increases (See [[Deployments/Holding the train#Logspam|#Logspam]])
** Any new error messages that occur frequently enough to be noticed in logstash will block the train.
** Any new error messages that occur frequently enough to be noticed in logstash will block the train.
** If the frequency increases significantly after a deployment then it should be immediately rolled back until the error can be fixed and the branch re-deployed.
** If the frequency increases significantly after a deployment then it should be immediately rolled back until the error can be fixed and the branch re-deployed.
Line 27: Line 26:
Remember, while we are reverted people are diligently diagnosing and debugging issues; any seemingly unrelated change could in fact effect their investigations.
Remember, while we are reverted people are diligently diagnosing and debugging issues; any seemingly unrelated change could in fact effect their investigations.


== What happens next (modified train scheduled)? ==
== What happens next? ==
* '''If''' a new <code>wmf.XX</code> version wasn't deployed due to blockers for the entire week '''then'''
* '''If''' a blocker was found and addressed before 3pm Pacific Tues/Wed/Thur '''THEN'''
** The following week no new branch will be cut (target getting <code>wmf.XX</code> to all wikis) '''OR''' The following week a new branch will be cut (skipping last week's <code>wmf.XX</code> branch)
** An incident report will be filed to address follow-up actions and process improvements
* '''If''' a blocker was found and addressed before 3pm Pacific '''then'''
** the planned deploy/rollout can move forward at that time (deployment schedule permitting)
** the planned deploy/rollout can move forward at that time (deployment schedule permitting)
* '''If''' there are issues affecting performance discovered ''after'' the current version of MediaWiki and extensions has been deployed '''then'''
* '''If''' the new <code>wmf.XX</code> version wasn't deployed to group2 (all wikipedias) on Thursday due to blockers '''THEN'''
** The current code version will remain on servers—we will not attempt to rollback to a version > 1 week old
** If the fix is available for a deploy on the following Monday, RelEng will attempt to get the train back on schedule that day and then continue with the next train on schedule the following day.
** The next release will remain at the Performance Team's discretion until XXX time, after which a new branch will be cut and rolled out
** An incident report will be filed to address follow-up actions and process improvements, and,
* '''IF'''...'''THEN'''
** A post-mortem will be conducted.
* '''If''' there are issues affecting performance discovered ''significantly after'' the current version of MediaWiki and extensions has been deployed to all wikis (group2, Thursday) '''THEN'''
** The current code version will remain on servers—we will not attempt to rollback to a version > 1 week old, and,
** The next rollout of the following release will be at the Performance Team's discretion, and,
** An incident report will be filed to address follow-up actions and process improvements, and,
** A post-mortem will be conducted.
 
== Train "blocker tasks" ==
'''What:''' For each weekly train version rollout an accompanying task is filed in Phabricator. They all live in the [[phab:project/board/2770/|#Train-Deployments]] tag.
 
'''Purpose:''' The purpose of these tasks is to track the rollout of the train especially including any blocking issues that may arise (see above). These blocking issues are filed as sub-tasks.
 
'''Blocking (sub) tasks types:'''
* A task which causes an entire revert/rollback to the previously deployed version and which must be addressed before moving forward.
* A task which prevents the continued rollout of the new version until it is addressed.
'''Status of blocking (sub) tasks:''' Most times a blocking task must be "Resolved" in Phabricator for the train to move forward. A subset of times the task itself is not resolved because the issue has been worked around in another way, for instance when eg: a backport was prepared and merged but that backport is not yet merged in <code>master</code>. The task will normally be closed after that patch is merged into <code>master</code>.
 
'''Communication on blocking tasks:''' The "[[mw:Wikimedia Release Engineering Team/Roles#Train Conductor|train conductor]]" for that week is responsible for commenting on any blocking (sub) tasks with their assumptions on status and impact, especially if they choose to move the train forward with the task not set to "Resolved" for whatever reason. The reason for this commenting (and potential over communication) is to ensure all parties are aware of all assumptions and decisions.


== Logspam ==
== Logspam ==
Line 57: Line 70:
==== Undefined index notices ====
==== Undefined index notices ====
These are a common occurrence in php code. Whenever you attempt to access an index of an array but the array does not contain the specified key, HHVM will log a notice. '''These are coding errors and they need to be fixed'''. If the array index is not always expected to exist then the code needs to check with <code>isset()</code> or <code>array_key_exists()</code> before referencing the key.
These are a common occurrence in php code. Whenever you attempt to access an index of an array but the array does not contain the specified key, HHVM will log a notice. '''These are coding errors and they need to be fixed'''. If the array index is not always expected to exist then the code needs to check with <code>isset()</code> or <code>array_key_exists()</code> before referencing the key.
[[Category:Deployment]]

Revision as of 21:52, 19 August 2017

Holding the deployment train is not something the Release Engineering team takes lightly. When RelEng (Release Engineering) does hold a deployment train, we expect all engineers with relevant expertise to focus on resolving the issue. A quick resolution benefits all engineers because holding the train, counter-intuitively, can create more problems than it solves: over time, the versions of MediaWiki and extensions that are deployed to the cluster diverge further from the primary development versions (e.g. master) of the code.

Issues that hold the train

This is not an exhaustive list of things that would cause the train to pause or roll back. As always, it's up to the best judgment of operations and release engineering, but the following scenarios are indicative of what we'd take action on.

  • Security issues
  • Data loss
  • Major feature regressions
    • Inability to login/logout/create account for a large portion of users
    • Inability to edit for a large portion of users
  • Performance regressions
    • Page load time
    • Page save/update time
  • Major stylistic problems affecting all pages
  • Error-rate increases (See #Logspam)
    • Any new error messages that occur frequently enough to be noticed in logstash will block the train.
    • If the frequency increases significantly after a deployment then it should be immediately rolled back until the error can be fixed and the branch re-deployed.
    • Even DEBUG / INFO-level logs are a problem, especially if the frequency of the messages is high enough to put unnecessary load on the logstash servers.

What happens in SWAT while the train is on hold?

Only simple config changes and emergency fixes are allowed during SWAT while we are reverted. This is to reduce the complexity during investigation.

Remember, while we are reverted, people are diligently diagnosing and debugging issues; any seemingly unrelated change could in fact affect their investigations.

What happens next?

  • If a blocker was found and addressed before 3pm Pacific Tues/Wed/Thur THEN
    • the planned deploy/rollout can move forward at that time (deployment schedule permitting)
  • If the new wmf.XX version wasn't deployed to group2 (all wikipedias) on Thursday due to blockers THEN
    • If the fix is available for a deploy on the following Monday, RelEng will attempt to get the train back on schedule that day and then continue with the next train on schedule the following day.
    • An incident report will be filed to address follow-up actions and process improvements, and,
    • A post-mortem will be conducted.
  • If there are issues affecting performance discovered significantly after the current version of MediaWiki and extensions has been deployed to all wikis (group2, Thursday) THEN
    • The current code version will remain on the servers; we will not attempt to roll back to a version more than 1 week old, and,
    • The next rollout of the following release will be at the Performance Team's discretion, and,
    • An incident report will be filed to address follow-up actions and process improvements, and,
    • A post-mortem will be conducted.

Train "blocker tasks"

What: For each weekly train version rollout, an accompanying task is filed in Phabricator. These tasks all live in the #Train-Deployments tag.

Purpose: The purpose of these tasks is to track the rollout of the train, including any blocking issues that may arise (see above). These blocking issues are filed as sub-tasks.

Blocking (sub) task types:

  • A task which causes an entire revert/rollback to the previously deployed version and which must be addressed before moving forward.
  • A task which prevents the continued rollout of the new version until it is addressed.

Status of blocking (sub) tasks: In most cases a blocking task must be "Resolved" in Phabricator for the train to move forward. In some cases the task itself is not resolved because the issue has been worked around in another way, for instance when a backport was prepared and merged but the same patch is not yet merged in master. The task will normally be closed after that patch is merged into master.

Communication on blocking tasks: The "train conductor" for that week is responsible for commenting on any blocking (sub) tasks with their assumptions on status and impact, especially if they choose to move the train forward with the task not set to "Resolved" for whatever reason. The reason for this commenting (and potential over-communication) is to ensure all parties are aware of all assumptions and decisions.

Logspam

What it is

#Logspam is the term we use to describe the category of noisy error messages in our logs. These usually do not represent actual error conditions, or the errors are being ignored or purposefully not prioritized by the responsible parties (when any exist). Specific error messages that have been identified by Release Engineering are tracked in the #Wikimedia-Log-Errors Phabricator project.

Why it's a problem

Logspam is a problem because noisy logs make it more difficult to detect real problems quickly when looking at a log dashboard, for example fatalmonitor.

All deployers need to be able to quickly detect any new problems that are introduced by their newly deployed code. If important error messages are drowned out by this logspam then they might not detect more serious issues.

Major Causes (and how you can fix them)

Incorrectly categorized log messages

The most common example of this type is an expected (or known) condition being recorded as an exceptional condition, e.g. debug notices or warnings being logged as errors. This is an incorrect use of logging and should be corrected.
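
As a rough illustration of the distinction above, the following sketch logs an expected condition at debug level and a genuine failure at error level. The `$log` closure, the cache key, and the `$dbConnected` flag are all hypothetical stand-ins invented for this example; real code would use whatever logging library the project already provides.

```php
<?php
// Hypothetical minimal logger: collects [level, message] pairs in memory.
$records = [];
$log = function ( $level, $message ) use ( &$records ) {
    $records[] = [ $level, $message ];
};

// Hypothetical cache lookup. A miss is an expected condition, not a failure.
$cache = [];
$key = 'user:42';
if ( !array_key_exists( $key, $cache ) ) {
    // Miscategorized: $log( 'error', "Cache miss for $key" );
    $log( 'debug', "Cache miss for $key" );
}

// Hypothetical condition the code cannot recover from: a genuine error.
$dbConnected = false;
if ( !$dbConnected ) {
    $log( 'error', 'Could not connect to the database' );
}
```

Choosing the right level does not make a message free: as noted earlier, even DEBUG / INFO-level messages become a problem if they are emitted very frequently.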

Undefined index notices

These are a common occurrence in PHP code. Whenever you attempt to access an index of an array but the array does not contain the specified key, HHVM will log a notice. These are coding errors and they need to be fixed. If the array index is not always expected to exist, then the code needs to check with isset() or array_key_exists() before referencing the key.
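
A minimal sketch of the check described above. The `$params` array, its keys, and the default value of 50 are hypothetical and chosen only for illustration. Note that the two functions differ when a key exists but holds null: isset() reports false, while array_key_exists() reports true.

```php
<?php
// Hypothetical request parameters; the 'limit' key is deliberately absent.
$params = [ 'title' => 'Main Page', 'format' => null ];

// Unguarded access to a missing key is what triggers the notice:
// $limit = $params['limit'];

// Guarded access: fall back to a default when the key is missing.
$limit = isset( $params['limit'] ) ? $params['limit'] : 50;

// isset() is true only if the key exists AND its value is not null.
$formatIsSet = isset( $params['format'] );

// array_key_exists() is true whenever the key exists, even with a null value.
$hasFormat = array_key_exists( 'format', $params );
```

Here `$limit` ends up as the default, `$formatIsSet` is false, and `$hasFormat` is true, so array_key_exists() is the better fit when null is a meaningful value.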