
Performance/Runbook/Monitor production errors

Revision as of 16:04, 10 June 2022 by imported>Ahmon Dancy (fix typo)

This is a guide for monitoring production errors with OpenSearch, and creating or triaging reports of such errors with Phabricator.

Introduction

Terminology

  • OpenSearch (previously known as Elasticsearch): The internal data store that retains log messages for 90 days.
  • Logstash: The processing layer between applications like MediaWiki and OpenSearch.
  • OpenSearch Dashboards (previously known as Kibana): The frontend that lets you query and visualise messages from OpenSearch. Our instance of this for production messages is served at https://logstash.wikimedia.org

Basically, MediaWiki PHP sends all its log messages to Logstash.

To learn more about the intermediary transport and backend infrastructure, refer to Logstash.

To learn more about the dashboard frontend, refer to OpenSearch Dashboards.

OpenSearch

When monitoring, triaging or investigating, we primarily use these two dashboards:

  • mediawiki-errors: This shows MW errors, such as runtime warnings, fatal errors, uncaught exceptions, etc.
  • mediawiki-new-errors: This is a fork of mediawiki-errors, with filters applied to exclude errors already reported to Phabricator.

Privacy policy

It is important to note that logs are retained for only 90 days, in accordance with the Foundation's Privacy Policy. Diagnostic log messages are generally treated as personal information (PII). Diagnostic information is sometimes directly identifiable (e.g. account X encountered Y while on Z), and sometimes only indirectly (page X was requested at timestamp Y, action X was attempted on page Y). The indirect information is still considered personal in our environment, as a precaution against correlation with other datasets or outside knowledge. For example, the fact that "Special:Contributions/Alice?offset=2011-04-01" failed on 2022-01-31 reveals that someone was browsing ten-year-old contributions made by Alice. If this were reported publicly, it may firstly draw unwanted attention to Alice, but more importantly it may reveal to Bob that Cathy was looking for evidence of something Alice and Bob did in 2011, if they were discussing that earlier that same day.

Basically, with a big enough dataset, the fact that something happened at all isn't personally identifiable per se. But in practice, something concrete happening, even without mentioning the account that opened the URL, can be surprisingly narrow. This is especially true when it's a Special page or Talk page, as opposed to the canonical URL of a popular article. See also Anonymity in disclosure (2007) by Bruce Schneier, and Differential privacy.

When reporting a task, share a link to Logstash for the details. Do not copy entire messages or JSON values from Logstash into public tasks. Only publish the following information in public places like Phabricator, IRC, mailing lists, etc:

  1. mwversion
  2. reqId
  3. normalized_message: Generally anonymous by default. Remove any names, titles, tokens, SQL values, timestamps, if they happen to be there.
  4. exception.trace: Generally redacted automatically.
    • Do not report traces with quoted arguments, like "xyz" in foo/Bar.php:1 Foo\Bar::doStuff("xyz").
    • Only report traces that contain type placeholders, such as foo/Bar.php:1 Foo\Bar::doStuff(string).
  5. URL: Must be anonymised so that it contains no input values or other variables, only the route as defined within the software.
    • Good:
      • https://en.wikipedia.org/wiki/*
      • https://wikipedia.org/wiki/ABC?wprov=*
      • https://en.wikipedia.org/w/index.php?title=*&action=*&q=*
      • https://en.wikipedia.org/w/api.php?action=*&q=*
      • https://en.wikipedia.org/w/rest.php/v0/investigate/*?q=*
    • Bad:
      • https://en.wikipedia.org/wiki/User_talk:Alice
      • https://wikipedia.org/wiki/ABC?wprov=123
      • https://en.wikipedia.org/w/index.php?action=search&q=alice+bob+2011
      • https://en.wikipedia.org/w/rest.php/v0/investigate/Alice/q=*
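
As an illustration of the query-string part of this rule, here is a small sketch (not an official tool; the function name is hypothetical) that blanks every query value while keeping the keys. Variable path segments, such as page titles or usernames, still need human judgement:

```python
from urllib.parse import urlsplit, urlunsplit

def anonymise_query(url: str) -> str:
    """Replace every query-string value with '*' so user input is dropped.

    Hypothetical helper for illustration only: it handles the query
    string, while variable path segments still need manual review
    before publishing.
    """
    parts = urlsplit(url)
    if not parts.query:
        return url
    keys = [pair.split("=", 1)[0] for pair in parts.query.split("&")]
    return urlunsplit(parts._replace(query="&".join(f"{key}=*" for key in keys)))

# One of the "Bad" examples above becomes publishable:
print(anonymise_query("https://en.wikipedia.org/w/index.php?action=search&q=alice+bob+2011"))
# → https://en.wikipedia.org/w/index.php?action=*&q=*
```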

Log message context

Log messages have a context of key-value pairs associated with them. Here are some common context keys and what they mean:

  • mwversion: MediaWiki version and wmf branch name (Train deploy). This helps you determine whether an error is new this week, or likely already resolved this week (e.g. if the error is only reported from last week's branch).
  • channel: Indicates which MediaWiki component emitted the message.
  • exception:
  • exception.class: The Exception subclass. This is typically the class name used in the throw statement, e.g. throw new FooException.
  • exception.message: The unredacted message describing the error.
  • exception.file: The exact file name and line number where the exception was thrown (also known as the "call site", and the first line of exception.trace).
  • exception.trace: The full stack trace of the error.
  • url: The URL path of the web request that failed.
  • referrer: The previous or parent URL from a browser. If present and set to a URL with a WMF domain, it likely means that the error was experienced in a browser while browsing the referrer page and then requesting the "url" through a form or JS feature.
  • message: The unredacted message describing the error.
  • normalized_message: Like “message” but in its original source code format, with variable placeholders still intact. This is generally anonymous, as the source code wouldn't contain personal information; rather, the variables tend to be injected via placeholders. The normalized message is also generally optimised for aggregation and filtering, such that all instances of the same specific root cause have the same normalized message.
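
To make the message / normalized_message distinction concrete, here is a hedged sketch of PSR-3-style placeholder expansion (the log message and context keys are invented for illustration):

```python
def render(normalized_message: str, context: dict) -> str:
    """Expand {placeholder} tokens from the log context, PSR-3 style.

    The normalized form is what is generally safe to publish; the
    rendered "message" may contain personal information.
    """
    out = normalized_message
    for key, value in context.items():
        out = out.replace("{" + key + "}", str(value))
    return out

normalized = "Could not save page {title} for user {user}"  # invented example
context = {"title": "User_talk:Alice", "user": "Alice"}
print(render(normalized, context))
# → Could not save page User_talk:Alice for user Alice
```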

Other places to find logs

Run logspam-watch to get a filtered log of errors occurring in real time, including the MW version, the number of occurrences, and the exception message.

Triage process

The below process is tailored for the Train triage meeting, but you can also use this to learn the mechanics and form the basis for your team's own triage process.

Getting started

Open the Wikimedia-production-error workboard in Phabricator. Phabricator holds tasks already filed by others that may need reviewing or validation.

Open the mediawiki-new-errors and mediawiki-errors dashboards in Logstash. The "new" dashboard helps discover new issues; the regular dashboard is for brief investigations, to understand how long an error has been happening.

Review the workboard

New production error tasks will end up in the "Untriaged" column of the Wikimedia-production-error workboard in Phabricator.

If a column for the current month does not yet exist, create one.

Review the tasks in the "Untriaged" column and for each one:

  • Ensure it is tagged with the most relevant component(s) mentioned in the stack trace, e.g. the affected MediaWiki extension, skin, or core feature (see #Determine component).

Reporting an error

When you discover an error on the "mediawiki-new-errors" dashboard in Logstash, first search for any existing task about it.

Search for existing tasks

Search on Phabricator to determine if there is an open task for it already.

  • You can keep a tab open with the Wikimedia-production-error workboard and use "Find in page" to type a few keywords from the Logstash normalized message and find an existing task.
  • You can expand one of the log entries in Logstash, and from the "Phatality" tab, click "Search".

If there is an existing task, consider leaving a comment with the information from the sample you encountered (skip this if the task is less than a month old, or otherwise still has recent active discussion). You can generate a comment with one click: go to the "Phatality" tab within the Logstash log entry and click the "desc" text field, which is ready to copy-paste into a Phabricator comment.

If there is no task, proceed to Create a task; otherwise proceed to Create a filter.

Create a task

  1. Expand one of the log event rows for the error you want to report.
  2. If it is a MediaWiki PHP error, click "Phatality > Submit".
    • Clean up the title by removing any private or other variable information (see #Privacy policy).
    • Check that the task description is non-empty. Usually the generated description is too large to submit automatically. In that case, copy the "desc" text field from the Logstash/Phatality tab and paste it yourself.
  3. If it is not a MediaWiki PHP error, go to Phabricator and open the "New Production Error" form (linked from the sidebar of the Wikimedia-production-error workboard, as well as from the "Create Task" menu).
    • Set a title, e.g. using the log message, possibly anonymised or shortened for brevity.
    • Fill in the task description template, including the "single document" permalink, error message, and stack trace.
  4. Add tag(s) for the most relevant component(s) mentioned in the stack trace, e.g. the affected MediaWiki extension, skin, or core feature (see #Determine component).
  5. Submit the "Create New Task" form.
  6. If Herald did not automatically add a purple team tag, edit the task to add a tag for the relevant team. Use the list at mediawiki.org/Maintainers to find which team owns the component(s), and tag them as well. Leave "Wikimedia-production-error" intact (this places it on the production error workboard). If the component has no maintainer, add "Unstewarded-production-error" instead of a team tag.
  7. Move the task to the current month column on the workboard. (You can quickly add tags and move columns from the comment field at the bottom of the task, via the "Edit task" form, or by drag-and-drop on the Phabricator workboard.)
  8. Unless you have reason to believe this is an old issue, consider marking it as a #Train blocker.

Create a filter

As part of the triaging process, we typically discover new errors on the "mediawiki-new-errors" Logstash dashboard, and report them to Phabricator. After creating the task in Phabricator, we then also add a filter on the Logstash dashboard to exclude all instances of that error.

This helps deployers and other triagers to know whether or not there are new errors at any given point in time, as well as to not report the same error twice or waste time searching for it.

Firstly, decide what to filter on.

We can filter by line of code (exception.file, the start of stack trace) or by error message (normalized_message).

It is generally preferred to exclude by exception.file because these are less likely to accidentally make us blind to an entire category of errors (e.g. "Something somewhere responded with HTTP 500") which can have multiple independent causes from different MW extensions or different backend services. It also has the benefit of naturally expiring by the next train branch and re-activating if the error remains prominent the next week, and so on.

If the value of exception.file is something generic like MWDebug.php or MWExceptionHandler.php, then we fall back to filtering by normalized_message instead.

  1. Expand one of the log event rows relating to the problem we just reported to Phabricator.
  2. Click the negative magnifying glass icon with the hover value "Filter out value", either on field "exception.file" or "normalized_message".
  3. There is now a new filter in the toolbar at the top; click its pencil icon to edit it.
  4. Set its label to the Phabricator task ID with a few keywords from the title, e.g. "T123456 - Bad $oldContent" or "T123456 - ResourceLoader Undefined index".
  5. Click Save to save the filter.
  6. Save the dashboard by clicking Edit in the bar at the top, and then Save. (Not needed for every filter, but remember to do it before you close or leave the dashboard!)
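
Under the hood, a "Filter out value" action amounts to a negated match on the chosen field. As a sketch in raw Query DSL terms (the file path is illustrative, and the exact JSON that Dashboards stores may differ, since it records the negation as filter metadata):

```python
import json

# Illustrative only: what excluding an exception.file value boils down to.
exclude_filter = {
    "query": {
        "bool": {
            "must_not": [
                {"match_phrase": {
                    "exception.file": "/srv/mediawiki/php-…/extensions/Example/Example.php:123"
                }}
            ]
        }
    }
}
print(json.dumps(exclude_filter, indent=2))
```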

Bag of tricks

Determine component

Look through the stack trace, or other log attributes, and find the most relevant component(s) of MediaWiki core and/or which extension an error originated from. This is important as this controls how the task is discovered and acted on by relevant teams. We generally tag against the first and last non-generic component mentioned in the stack trace.

Avoid tagging upstream projects, and instead tag the feature performing the faulty action through the upstream project. The investigating developer may forward it upstream if it seems likely to be a general issue.

Examples of generic code paths:

  • error handling (MWDebug, MWExceptionHandler).
  • backend abstraction (Wikimedia\Rdbms for MySQL, BagOStuff and WANObjectCache for Memcached, Shellbox), e.g. do not tag Wikimedia-Rdbms or MediaWiki-ObjectCache.
  • JobQueue routing (RunSingleJob, EventBus, JobExecutor), e.g. do not tag upstream "MediaWiki-JobQueue" for failed jobs by default.
  • REST API routing (Rest\EntryPoint, Rest\Router, rest.php), e.g. do not tag MediaWiki-REST-API.
  • Action API routing (ApiBase, ApiMain, api.php), e.g. do not tag MediaWiki-Action-API.
  • Task scheduling (DeferredUpdates).
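
The "first and last non-generic component" heuristic can be sketched as follows. The path markers mirror the list above but are a simplification, and real triage judgement stays manual:

```python
# Simplified markers for "generic" code paths, per the list above.
GENERIC_PATH_MARKERS = (
    "/libs/rdbms/",                    # backend abstraction
    "/deferred/",                      # task scheduling
    "MWDebug", "MWExceptionHandler",   # error handling
    "/includes/MediaWiki.php", "index.php", "api.php", "rest.php",
)

def candidate_frames(trace: str) -> tuple[str, str]:
    """Return the first and last stack frame whose file path is not generic.

    These are the frames to inspect when picking component tags.
    """
    frames = []
    for line in trace.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        path = line.split(":", 1)[0]  # file-path part, before the callee
        if not any(marker in path for marker in GENERIC_PATH_MARKERS):
            frames.append(line)
    return frames[0], frames[-1]
```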

The below stack trace is most relevant to FlaggedRevs:

#0 /srv/mediawiki/php-…/includes/libs/rdbms/database/Database.php: Wikimedia\Rdbms\Database->getQueryException()
#3 /srv/mediawiki/php-…/includes/libs/rdbms/database/Database.php: Wikimedia\Rdbms\Database->query(s)
#7 /srv/mediawiki/php-…/extensions/FlaggedRevs/backend/FlaggableWikiPage.php: Wikimedia\Rdbms\DBConnRef->delete()
#8 /srv/mediawiki/php-…/extensions/FlaggedRevs/backend/FlaggedRevs.php: FlaggableWikiPage->clearStableVersion()
#9 /srv/mediawiki/php-…/extensions/FlaggedRevs/backend/FRStableVersionUpdate.php: FlaggedRevs::stableVersionUpdates()
#10 /srv/mediawiki/php-…/includes/deferred/RefreshSecondaryDataUpdate.php: DeferredUpdates::attemptUpdate()
#11 /srv/mediawiki/php-…/includes/deferred/DeferredUpdates.php: RefreshSecondaryDataUpdate->doUpdate()
#17 /srv/mediawiki/php-…/includes/MediaWiki.php: DeferredUpdates::doUpdates()
#18 /srv/mediawiki/php-…/includes/MediaWiki.php: MediaWiki->restInPeace()
#20 /srv/mediawiki/php-…/index.php(50): MediaWiki->run()
#21 /srv/mediawiki/php-…/index.php(46): wfIndexMain()

The below stack trace is most relevant to VisualEditor and MediaWiki-Core-Preferences (aka "user settings"):

#1 /srv/mediawiki/php-…/includes/libs/rdbms/database/DBConnRef.php: Wikimedia\Rdbms\Database->doAtomicSection()
#2 /srv/mediawiki/php-…/includes/libs/rdbms/database/DBConnRef.php: Wikimedia\Rdbms\DBConnRef->__call()
#3 /srv/mediawiki/php-…/includes/user/User.php: Wikimedia\Rdbms\DBConnRef->doAtomicSection()
#4 /srv/mediawiki/php-…/extensions/VisualEditor/includes/Hooks.php: User->saveSettings()
#5 /srv/mediawiki/php-…/includes/deferred/MWCallableUpdate.php: MediaWiki\Extension\VisualEditor\{closure}()
#8 /srv/mediawiki/php-…/includes/deferred/DeferredUpdates.php: DeferredUpdates::run()
#12 /srv/mediawiki/php-…/includes/MediaWiki.php: DeferredUpdates::doUpdates()
#13 /srv/mediawiki/php-…/includes/MediaWiki.php: MediaWiki->restInPeace()
#14 /srv/mediawiki/php-…/index.php: MediaWiki->run()

Train blocker

If a production error is found as part of the deployment process, and was not previously reported to Phabricator or otherwise known to have already been occurring in prior weeks, mark the task as a train blocker. When in doubt, mark it as a train blocker and reach out to Release Engineering to (de)escalate.

  1. Click Edit Parent Task
  2. Add the train blocker task for the week. The task is linked from the Deployments calendar entries, as well as from https://train-blockers.toolforge.org/

Note that there is an exemption for deprecation warnings: these are instead marked as blocking the next week's train blocker task. You can find that task from the current train blocker task by navigating via the "Next" task link there.

Refer to Deployments/Holding the train for more information on blocking the train.

Managing filters

To temporarily hide the effect of a filter on the "mediawiki-new-errors" dashboard, disable the filter instead of removing it. Hover over a filter and click the checkbox-like icon. This way, if you're in the middle of a session to triage errors and have created new filters you want to save, you can still temporarily disable filters without losing their queries and labels.

To delete a filter, click the trashcan icon. Be cautious when deleting: once saved, this will affect everyone's view, and should only be done when the error is obsolete or fixed.

Find obsolete filters

The mediawiki/tools/release repo contains an automation script for checking the task status of each of the filters on the "mediawiki-new-errors" Logstash dashboard. It correlates with the Phabricator API. The code and instructions reside in the check-new-error-tasks directory of said repository.
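
The core of that correlation, extracting the T-number from each filter label so it can be looked up via the Phabricator API, can be sketched like this (the label format follows the convention described above; the function is illustrative, not the actual script):

```python
import re

def task_ids(filter_labels):
    """Extract Phabricator task IDs from labels like 'T123456 - Bad $oldContent'.

    Labels without a T-number are skipped. The resulting IDs are what
    one would look up against the Phabricator API for task status.
    """
    ids = []
    for label in filter_labels:
        match = re.search(r"\bT\d+\b", label)
        if match:
            ids.append(match.group(0))
    return ids

print(task_ids(["T123456 - Bad $oldContent", "temporarily hide noise"]))
# → ['T123456']
```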

Filtering by query string

For trickier filters, it may be helpful to use full text queries. Click on Edit filter, then Edit as Query DSL. As an example, when trying to match an error like:

Wikimedia\Rdbms\DBQueryError: Error 1213: Deadlock found when trying to get lock; try restarting transaction (db1181)
Function: MediaWiki\User\UserOptionsManager::saveOptionsInternal
Query: [query redacted]

You could use:

{
  "query": {
    "query_string": {
      "query": "UserOptionsManager saveOptionsInternal",
      "fields": [
        "exception.message"
      ]
    }
  }
}

This supports some level of wildcard notation - see OpenSearch documentation for details.
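
For instance, a variant of the query above using a wildcard (the term and field here are illustrative, not taken from a real filter):

```python
import json

# '*' matches any run of characters within a term (Lucene query_string syntax).
wildcard_query = {
    "query": {
        "query_string": {
            "query": "saveOptions*",
            "fields": ["exception.message"],
        }
    }
}
print(json.dumps(wildcard_query, indent=2))
```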

After task resolution

After a Phabricator task has been completed, be sure to take a few more steps:

  1. Open the mediawiki-new-errors dashboard afresh.
  2. Delete the filter for this T-number from mediawiki-new-errors by clicking the trashcan icon.
  3. Edit > Save, to save these changes.
