You are browsing a read-only backup copy of Wikitech. The live site can be found at

Performance/Runbook/Kibana Monitoring: Difference between revisions

imported>Nikki Nikkhoui
m (Note that logs are purged after 90 days)
This guide explains how to use Kibana and Phabricator to monitor and report production-level errors and exceptions.
The exception triaging process can start in either Phabricator or Kibana. Phabricator holds tasks already filed by others that may need reviewing and validation. Kibana can be used to identify new issues and exclude known issues from the display.
Any production exception that is reported should end up in the Untriaged column of the Wikimedia Production Error workboard in Phabricator.
== Overview ==
Logging is built on the ELK stack:
* '''Elasticsearch''': Where log events are stored.
* '''Logstash''': Processes log events and ingests them into Elasticsearch.
* '''Kibana''': User interface that reads from Elasticsearch. Lives at
MediaWiki/PHP pushes messages into Logstash, which stores them in Elasticsearch, where they can be viewed via Kibana.
There are 2 main dashboards to keep track of:
* mediawiki-errors
* mediawiki-new-errors
The mediawiki-errors dashboard is used to keep track of errors encountered on hosted WMF wikis. It contains logs for errors and exceptions such as runtime errors, logic errors, exceeded memory limits, timeouts, etc.
The mediawiki-new-errors dashboard is a copy of mediawiki-errors with filters applied to remove errors that already have Phabricator tasks associated with them, and are therefore already reported.
You can manipulate your view of the mediawiki-new-errors dashboard:
* Temporarily disable a filter by hovering over it and clicking the checkbox icon.
* Delete a filter by clicking the trashcan icon. Be cautious when deleting: once saved, this affects everyone’s view, so it should only be done when the error is obsolete or fixed.
Log events have multiple attributes such as channel, reqId, url, wiki, etc. It is important to note that logs are purged after 90 days due to a privacy policy.
A few notable attributes are:
* mwversion: MediaWiki version and wmf branch name. This is useful to determine whether an error is in new code still riding the train, or in fully deployed code.
* channel: Indicates where the error originated.
** exception: Fatal exception (any uncaught exception, or timeout, out of memory, etc.).
** error: Native PHP silent error (such as undefined variables).
* exception.class: The class name of the exception object that led to the problem. This is the object that propagated from a throw statement, and describes the kind of error.
* exception.message: The actual message describing the event.
* exception.file: The exact line of code where the error happened. It is the start of the stack trace, also known as the call site.
* exception.trace: The full stack trace for the event.
* url: HTTP URL of the server request that failed.
* referrer: The previous or parent HTTP url from a browser. If present and set to a url with a WMF domain, it generally indicates that an error was experienced in a browser on one of our user-facing web pages (instead of via a bot or app).
* message: Specific message for this event, including full details and identifiers.
* normalized_message: Like “message” but with variable values replaced by placeholders. Useful for determining how often a particular category of message is occurring.
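The referrer heuristic described above can be sketched in a few lines of Python. This is only an illustration: the list of domain suffixes is a hypothetical, incomplete assumption, and the event is assumed to be a plain dictionary keyed by the attribute names listed above.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list of WMF domain suffixes (an assumption for
# illustration, not an authoritative list).
WMF_DOMAIN_SUFFIXES = ("wikipedia.org", "wikimedia.org", "wiktionary.org")


def looks_like_browser_error(event: dict) -> bool:
    """Return True if the event's referrer is a URL on a WMF domain."""
    referrer = event.get("referrer")
    if not referrer:
        # No referrer: likely a bot, an app, or a direct request.
        return False
    host = urlsplit(referrer).hostname or ""
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in WMF_DOMAIN_SUFFIXES
    )
```

A True result only suggests that the error was experienced in a browser on a user-facing page; it is a hint, not proof.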
===Helpful Filter Attributes===
* normalized_message: Prefer this attribute over the “message” field for filtering. The “message” field contains values that are unique to each request, such as the request ID, whereas normalized_message leaves out values from fields such as “reqId” and “url”. A filter on normalized_message therefore covers different requests that hit the same error.
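To see why normalized_message groups events better than message, consider the following Python sketch. The sample events and the placeholder syntax inside normalized_message are made up for illustration; only the attribute names come from this page.

```python
from collections import Counter

# Hypothetical log events. The exact placeholder format used in
# normalized_message is an assumption.
SAMPLE_EVENTS = [
    {"reqId": "req-1", "message": "[req-1] Bad param: 12",
     "normalized_message": "[{reqId}] Bad param: {value}"},
    {"reqId": "req-2", "message": "[req-2] Bad param: 99",
     "normalized_message": "[{reqId}] Bad param: {value}"},
    {"reqId": "req-3", "message": "[req-3] Timeout after 60s",
     "normalized_message": "[{reqId}] Timeout after {seconds}s"},
]


def count_by_attribute(events, attribute):
    """Count how many events share each value of the given attribute."""
    return Counter(event.get(attribute, "") for event in events)


by_message = count_by_attribute(SAMPLE_EVENTS, "message")
by_normalized = count_by_attribute(SAMPLE_EVENTS, "normalized_message")
```

Here every message value is unique, so counting by message yields three groups of one, while counting by normalized_message puts the two "Bad param" events in the same group.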
===Other places to find logs===
* [[Mwlog1001|mwlog1001]]
Run [[Wikimedia_binaries#logspam-watch|logspam-watch]] to get a filtered, real-time log of occurring errors, including the MW version, the number of occurrences of each log message, and the exception message.
==Reporting An Error==
You notice an error on the mediawiki-new-errors dashboard, signifying it does not yet have an associated Phabricator task. There are 2 steps to take:
# Create a Phabricator task.
# Create a Kibana filter.
===Create a Phabricator Task===
There are two ways to create a Phabricator task: manually, or automatically from an existing Kibana log event.
# '''Manual Creation'''
## Click '''New Task'''
## Click '''New Task''' a second time, and click '''Report Error Code'''
# '''Automatic Creation from Kibana Log'''
## Expand one of the log event rows for the error you want to report
## Click '''Phatality''' > '''Submit'''.
## Check that the necessary information was autofilled.
### '''exception.trace''' - the stack trace. If it was not filled in, manually copy it from the Kibana log.
### '''reqId''' - unique request ID that the developer investigating the Phabricator task can use to find this Kibana log again (as well as to find any other errors that were emitted during the same request).
## Ensure no personally identifiable information (PII) is submitted with the task.
### Replace any of the following with wildcards or placeholders: user name, article name, IP address, exact query parameters, exact timestamps. Be careful not to expose that a certain user was viewing a certain article, or that a certain article was viewed at a certain time.
#### Example: change /w/api.php?format=json&action=query&list=contribs&lc=John to /w/api.php?action=query&list=contribs&…
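The query-parameter redaction in the example above could be approximated with a small Python helper. This is a sketch under assumptions: the allowlist of "safe" parameter names is hypothetical and must be judged case by case, and the helper does not touch the path, so article names in paths such as /wiki/... still need to be redacted by hand.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical allowlist of parameter names considered safe to publish.
SAFE_PARAMS = ("action", "list")


def redact_url(url: str, safe_params=SAFE_PARAMS) -> str:
    """Keep only allowlisted query parameters; mark the rest with an ellipsis."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [f"{key}={value}" for key, value in pairs if key in safe_params]
    if len(kept) < len(pairs):
        # Signal that something was removed without revealing what.
        kept.append("…")
    return parts._replace(query="&".join(kept)).geturl()


redacted = redact_url("/w/api.php?format=json&action=query&list=contribs&lc=John")
```

With the allowlist above, the input URL from the example is reduced to /w/api.php?action=query&list=contribs&…. Always review the result before pasting it into a task; an allowlist cannot know which values are sensitive in a given context.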
====Additional Information====
'''Add Tags'''
; Component : Look through the stack trace, or other log attributes, to try and find the appropriate component(s) to tag with the task. Oftentimes the stack trace will reveal clues about which component of MediaWiki core and/or which extension an error originated from. Try to narrow down the source of the error by identifying the non-generic code path. For example, for the MW API any code inside ApiBase or ApiMain could be considered generic.
; Team : Use the list at [] to find what team(s) own the component(s), and tag them as well. “wikimedia-production-error” is the default tag which adds it to the production errors Phabricator workboard.
'''Optional Steps'''
# If the task seems like it would be detrimental to production environments, you can mark this task as a train blocker. When in doubt, mark as train blocker and reach out to ReleaseEngineering to confirm.
## Click '''Edit Parent Task'''
## Add the train blocker task [[Deployments|for the week]] and add any extra information you can.
### For POST requests to the API, the request parameters are not available via Kibana. To find them, ssh to [[Mwlog1001|mwlog1001]] and grep /srv/mw-log/api.log for the reqId. The matching log lines contain the request parameters. (See if you need access to production servers.)
### Try to find out how frequent the error is, and whether it is likely new this week or was recorded in previous weeks as well. To do this, use the (unfiltered) mediawiki-errors dashboard, and use the query bar to enter something like “message:<part of the error>”. Then search through the last 30 days to see how often and since when it has been happening. Expand a few log event rows to confirm that they are indeed the same issue.
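The reqId lookup in api.log described above is a plain substring search, equivalent to a fixed-string grep. A minimal Python sketch follows; it makes no assumption about the log line format and simply returns every line that mentions the request ID.

```python
from typing import Iterable, Iterator


def lines_for_request(lines: Iterable[str], req_id: str) -> Iterator[str]:
    """Yield every log line that contains the given request ID."""
    for line in lines:
        if req_id in line:
            yield line.rstrip("\n")
```

On mwlog1001 this could be fed from the file named above, for example by opening /srv/mw-log/api.log with errors="replace" and passing the file object as lines. Remember that the matching lines contain raw request parameters, so redact PII before copying anything into a task.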
===Create a Kibana filter===
A new filter should be added to the mediawiki-new-errors dashboard so that we don’t report the same error multiple times and so that new errors stand out.
Decide what to filter.
Expand a specific log event and look for the exception.file field (or normalized_message).
We need to decide whether to exclude similar errors by call site (exception.file, the start of the stack trace) or by error message (normalized_message). Excluding by exception.file is preferred because it cannot accidentally filter out unrelated errors. To create the exclusion filter:
# Expand one of the log event rows relating to the problem we just reported to Phabricator.
# On the exception.file field, click the negative magnifying glass icon with the hover value '''Filter out Value'''. This creates a filter that excludes all log events from this line of code.
## As long as the entry for exception.file is not something generic like MWDebug.php or MWExceptionHandler.php, filter by it. Otherwise, fall back to filtering out by normalized_message instead.
# Click on the new filter’s '''pencil icon''' to edit.
# Set a label.
## Usually a combination of the Phabricator task ID and a very short summary. '''Example''': T231084 Bad $oldContent param.
# Click '''Save''' on the filter.
# Save the dashboard by clicking '''Edit''' in the gray bar at the top, then clicking '''Save'''.
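Conceptually, each exclusion filter is a labeled "must not match" condition on one field. The Python sketch below models that idea and shows how mediawiki-new-errors relates to mediawiki-errors. The class and function names are hypothetical, the sample file path is made up, and the query shape (a bool/must_not clause wrapping match_phrase) is an assumption about what such a filter roughly looks like in Elasticsearch query DSL, not a dump of what Kibana actually stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExclusionFilter:
    label: str  # Phabricator task ID plus a very short summary
    field: str  # "exception.file" preferred, else "normalized_message"
    value: str

    def to_query(self) -> dict:
        """Express the filter as an Elasticsearch bool/must_not clause."""
        return {"bool": {"must_not": [{"match_phrase": {self.field: self.value}}]}}

    def excludes(self, event: dict) -> bool:
        # Simplification: exact equality instead of phrase matching.
        return event.get(self.field) == self.value


def new_errors(events, filters):
    """Return only the events that no exclusion filter matches."""
    return [e for e in events if not any(f.excludes(e) for f in filters)]


# Hypothetical call site; the label is the example from this page.
known = ExclusionFilter(
    label="T231084 Bad $oldContent param",
    field="exception.file",
    value="/hypothetical/path/Example.php:123",
)
events = [
    {"exception.file": "/hypothetical/path/Example.php:123"},
    {"exception.file": "/hypothetical/path/Other.php:7"},
]
remaining = new_errors(events, [known])
```

Only the event from Other.php remains, which is how an already reported error drops out of view and new errors stand out.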
==After Task Resolution==
After a Phabricator task has been completed, be sure to take a few more steps:
# Open the mediawiki-new-errors dashboard afresh.
# Delete the filter for this T-number from mediawiki-new-errors by clicking the trashcan icon.
# '''Edit''' > '''Save''', to save these changes.

Revision as of 22:12, 13 May 2020