
Logs: Difference between revisions

From Wikitech-static
imported>Ahmon Dancy
(Added link to mw:Beta_Cluster#Testing_changes_on_Beta_Cluster)
imported>Krinkle
 
(4 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{Navigation Wikimedia infrastructure|expand=logging}}
{{Navigation Wikimedia infrastructure|expand=logging}}
: ''This page is about server log files. For [[IRC]] channel logs, see e.g. http://wm-bot.wmflabs.org/ ''
: ''This page is about server log files. For [[IRC]] channel logs, see e.g. http://wm-bot.wmflabs.org/ ''
'''Logs''' of several sorts are generated across the cluster and collected in a single [[Locations|location]] replicated on some machines. Privileged users can explore most logs through the [[Kibana]] front-end at https://logstash.wikimedia.org/.
'''Logs''' of several sorts are generated across the cluster and collected in a single [[Locations|location]] replicated on some machines. Privileged users can explore most logs through the [[OpenSearch Dashboards]] front-end at https://logstash.wikimedia.org/.


{{anchor|mw-log}}
{{anchor|mw-log}}


The SRE Observability team is working on a common log format called ECS, see the linked [https://docs.google.com/document/d/1HYHCPvuz93nAYXQSEReUN07HQTQUF_nvltag5H_YZq4/edit#heading=h.vpanev2oq14b doc] and intro slides. ECS documentation can be found at https://doc.wikimedia.org/ecs/
The SRE Observability team is working on a common log format called ECS, see the linked [https://docs.google.com/document/d/1HYHCPvuz93nAYXQSEReUN07HQTQUF_nvltag5H_YZq4/edit#heading=h.vpanev2oq14b doc] and intro slides. ECS documentation can be found at https://doc.wikimedia.org/ecs/
For a quick reference of debugging techniques, see [[Logs/Runbook]].


__TOC__
__TOC__
== <code>[[mwlog1002]]:/srv/mw-log/</code> ==
== <code>[[mwlog1002]]:/srv/mw-log/</code> ==
These record <code>wfDebugLog()</code> and similar calls in MediaWiki (see especially [[mw:Manual:Structured_logging|mw:Structured logging]]). All cluster-wide logs are aggregated here (configured through [[MediaWiki UDP logging|$wmfUdp2logDest]], see also [https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/InitialiseSettings.php?grep=wmgMonologChannels wmgMonologChannels]). There are dozens of log files, which amount to around 15 GB compressed per day [[phabricator:T88393#1161994|as of April 2015]]. Some are not sent to [[logstash]] ([https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/InitialiseSettings.php?grep=%27logstash%27 settings]) and [https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/InitialiseSettings.php?grep=%27sample%27 some are sampled]; log archives are stored for a [https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/logging/mw-log-cleanup variable amount of time], up to 90 days (per [[m:Data_retention_guidelines#To_what_data_do_these_guidelines_apply?|data retention guideline]]). Note that logstash also records the context data for structured logging, so it might contain significantly more information than the files.
These record <code>wfDebugLog()</code> and similar calls in MediaWiki (see especially [[mw:Manual:Structured_logging|mw:Structured logging]]). All cluster-wide logs are aggregated here (configured through [[MediaWiki UDP logging|$wmgUdp2logDest]], see also [https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php wmgMonologChannels]). There are dozens of log files, which amount to around 15 GB compressed per day [[phabricator:T88393#1161994|as of April 2015]]. Some are not sent to [[logstash]] ([https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php settings]) and some are sampled; log archives are stored for a [https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/logging/mw-log-cleanup variable amount of time], up to 90 days (per [[m:Data_retention_guidelines#To_what_data_do_these_guidelines_apply?|data retention guideline]]). Note that logstash also records the context data for structured logging, so it might contain significantly more information than the files.
 
Source: All appserver clusters.
 
Directories:


Source: Cluster-wide
* <code>archive/</code>: Directory holding a limited number of previous days of the same logs (compressed once a day).
*'''<code>exception.log</code>''': Exceptions exposed to users in simplified form include a hexadecimal fingerprint (e.g. in case of <code>"[1903eff7] 2013-06-18 02:39:00: Fatal exception of type MWException"</code>, grep the exception log for "1903eff7" to find the complete stack trace). See [[bugzilla:38095|bug 38095]] for background.
 
**Counts for fatals and exceptions are monitored in [https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1 Grafana].
General channels:
*<code>antispoof.log</code>: Collision check passes and failures from the AntiSpoof extension. This checks for strings that look the same using different Unicode characters (such as spoofed usernames).
*<code>exception.log</code>: Fatal exceptions that receive either a localised "Internal error" page, or a Wikimedia Error page rendered by [[php-wmerrors]].
**Error pages report a request ID, e.g. <code>[d84af39036] 2011-04-01: Fatal exception of type MWException</code>.
**To find the complete stack trace, search for <code>d84af39036</code> in exception.log, or in Grafana under the "mediawiki" dashboard.
*<code>apache2.log</code>: aggregated Apache error logs, see [[#syslog]]
*<code>apache2.log</code>: aggregated Apache error logs, see [[#syslog]]
*<code>api.log</code>: API requests (including URLs and some agent info, like username and IP address). Sampled 1:1000 [[gerrit:179412|from 2014-12-15 to some time in 2015]], flushed every 30 days as of November 2015.
*<code>api.log</code>: API requests and their parameters (including redacted POST payloads, and temporary PII). This was sampled [[gerrit:179412|during 2014-2015]] but no longer is; it is flushed every 30 days as of Nov 2015.
*<code>badpass.log</code>: Failed login attempts to wikis.
Specific components:
 
* <code>antispoof.log</code>: Collision check passes and failures from the AntiSpoof extension. This checks for strings that look the same using different Unicode characters (such as spoofed usernames).
* <code>badpass.log</code>: Failed login attempts to wikis.
*<code>captcha.log</code>: Captcha attempts (both failed and successful attempts).
*<code>captcha.log</code>: Captcha attempts (both failed and successful attempts).
*<code>centralauth.log</code> (2013-05-09–), <code>centralauth-bug39996.log</code>, <code>centralauthrename.log</code> (2014-07-14–): (temporary) debug logs for [[bugzilla:35707]], [[bugzilla:39996]], [[bugzilla:67875]]. In theory, rare events; can include username and page visited/request made.
*<code>centralauth.log</code> (2013-05-09–), <code>centralauth-bug39996.log</code>, <code>centralauthrename.log</code> (2014-07-14–): (temporary) debug logs for [[bugzilla:35707]], [[bugzilla:39996]], [[bugzilla:67875]]. In theory, rare events; can include username and page visited/request made.
Line 23: Line 34:
*<code>CirrusSearchSlowRequests.log</code>: Logs slow requests
*<code>CirrusSearchSlowRequests.log</code>: Logs slow requests
*<code>CirrusSearchChangeFailed.log</code>: Logs update failures
*<code>CirrusSearchChangeFailed.log</code>: Logs update failures
*<code>dberror.log</code>: Database errors (invalid queries, missing tables, deadlocks, lock-wait timeouts, disconnections).
*<code>dbperformance.log</code>: DB transactions that hold DB locks open for a long time while running slow functions.
*<code>exec.log</code>: Errors from shell commands run by MediaWiki via <code>wfShellEx</code> (logs the command and error string).
*<code>external.log</code>: ExternalStore blob fetch failures (see [[External storage]])
*<code>external.log</code>: ExternalStore blob fetch failures (see [[External storage]])
*'''<code>fatal.log</code>''': Fatal PHP errors during web requests, responded to with a [[PHP error pages|Wikimedia Error]] page. ([https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg{{urlencode:[]}}=vanadium.eqiad.wmnet&mreg{{urlencode:[]}}=fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1 aggregated graph]). With [[HHVM]], they are in general under "hhvm" logs, in [[logstash]].[https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm]
*<code>filebackend-ops.log</code>: FileBackendStore operation failures (i.e. backend errors that happen during user file uploads).
*<code>generated-pp-node-count.log</code>: High node count parses that took place (typically for slow parses of very large and complex articles).
*<code>gettingstarted.log</code>: ?
*<code>imagemove.log</code>: Page renames in the File namespace that take place (both failed and successful renames).
*<code>imagemove.log</code>: Page renames in the File namespace that take place (both failed and successful renames).
*<code>memcached-serious.log</code>: Memcached access failures (affects caching and storage of ephemeral data, like rate limiting counters and advisory locks).
*<code>memcached.log</code>: [[Memcached for MediaWiki]] (WANObjectCache, misc ephemeral data, rate limiting counters, advisory locks).
*<code>poolcounter.log</code>: [[PoolCounter]] failures (connection problems, excess queue size, wait timeouts).
*<code>poolcounter.log</code>: [[PoolCounter]] failures (connection problems, excess queue size, wait timeouts).
*<code>redis.log</code>: Redis query and connection failures (might involve sessions, job queues, and some other assorted features).
*<code>redis.log</code>: Redis query and connection failures (might involve sessions, job queues, and some other assorted features).
*<code>redis-jobqueue.log</code>: Redis query and connection failures from JobQueueRedis.
* <code>resourceloader.log</code>: Exceptions related to [[mw:ResourceLoader|ResourceLoader]].
*<code>resourceloader.log</code>: Exceptions related to [[mw:ResourceLoader|ResourceLoader]].
* <code>runJobs.log</code>: Tracks job queue activity, including errors (both failed and successful runs).
*<code>runJobs.log</code>: Tracks job queue activity, including errors (both failed and successful runs).
**Can be used to produce stats on jobs run on the various wikis, e.g. with Tim's <code>perl ~/job-stats.pl runJobs.log</code>.
**Can be used to produce stats on jobs run on the various wikis, e.g. with Tim's <code>perl ~/job-stats.pl runJobs.log</code>.
*'''<code>slow-parse.log</code>''' (since May 2012; 6 months archive)
 
* <code>swift-backend.log</code>: Errors in the SwiftFileBackend class (timeouts and HTTP 500 type errors for file and listing reads/writes).
*<code>slow-parse.log</code> (since May 2012; 6 months archive)
*<code>spam.log</code>: SimpleAntiSpam honeypot hits from bots (attempted user actions are discarded).
*<code>spam.log</code>: SimpleAntiSpam honeypot hits from bots (attempted user actions are discarded).
*<code>swift-backend.log</code>: Errors in the SwiftFileBackend class (timeouts and HTTP 500 type errors for file and listing reads/writes).
*<code>temp-debug.log</code>: Used for temporary logging of misc things during live debug sessions.
*<code>test2wiki.log</code>: Full wfDebug log of [[test2.wikipedia.org]].
*<code>testwiki.log</code>: Full wfDebug log of [[test.wikipedia.org]].
*'''<code>[[thumbnail.log]]</code>''': Failed thumbnail transformations (e.g. missing file, conversion failure, 0-byte output files).
*<code>xff.log</code>: User agent and IP data for POST requests.
*<code>XWikimediaDebug.log</code>: see [[X-Wikimedia-Debug#Debug logging]].
*<code>XWikimediaDebug.log</code>: see [[X-Wikimedia-Debug#Debug logging]].
*<code>zero.log</code> (since May 2013)
* <code>archive</code>: Directory holding historical versions of the above logs (one compressed file per log source per day).


== [[syslog]] ==
== [[syslog]] ==
Line 58: Line 54:
== 5xx errors ==
== 5xx errors ==


5xx errors are available on centrallog1001.eqiad.wmnet:/srv/weblog/webrequest/5xx.json, and in logstash via the [https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X 5xx kibana dashboard].
5xx errors are available on centrallog1001.eqiad.wmnet:/srv/weblog/webrequest/5xx.json, and in logstash via the [https://logstash.wikimedia.org/app/dashboards#/view/Varnish-Webrequest-50X Varnish 5xx Logstash dashboard].


== <code>[[deploy1002]]:/var/log/l10updatelog/l10update.log</code> ==
== <code>[[deploy1002]]:/var/log/l10updatelog/l10update.log</code> ==
Line 65: Line 61:


== <code>[[vanadium]]:/var/log/eventlogging/</code>==
== <code>[[vanadium]]:/var/log/eventlogging/</code>==
* <code>various</code>: Logs of [[mediawikiwiki:Extension:EventLogging|EventLogging]] entries. Potentially useful, in case their transformation into SQL and MongoDB records fails.
* <code>various</code>: Logs of [[mw:Extension:EventLogging|EventLogging]] entries. Potentially useful, in case their transformation into SQL records fails.


==Request logs==
==Request logs==

Latest revision as of 04:19, 4 August 2022

This page is about server log files. For IRC channel logs, see e.g. http://wm-bot.wmflabs.org/

Logs of several sorts are generated across the cluster and collected in a single location replicated on some machines. Privileged users can explore most logs through the OpenSearch Dashboards front-end at https://logstash.wikimedia.org/.

The SRE Observability team is working on a common log format called ECS, see the linked doc and intro slides. ECS documentation can be found at https://doc.wikimedia.org/ecs/

For a quick reference of debugging techniques, see Logs/Runbook.

mwlog1002:/srv/mw-log/

These record wfDebugLog() and similar calls in MediaWiki (see especially mw:Structured logging). All cluster-wide logs are aggregated here (configured through $wmgUdp2logDest, see also wmgMonologChannels). There are dozens of log files, which amount to around 15 GB compressed per day as of April 2015. Some are not sent to logstash (settings) and some are sampled; log archives are stored for a variable amount of time, up to 90 days (per data retention guideline). Note that logstash also records the context data for structured logging, so it might contain significantly more information than the files.

Source: All appserver clusters.

Directories:

  • archive/: Directory holding a limited number of previous days of the same logs (compressed once a day).

General channels:

  • exception.log: Fatal exceptions that receive either a localised "Internal error" page, or a Wikimedia Error page rendered by php-wmerrors.
    • Error pages report a request ID, e.g. [d84af39036] 2011-04-01: Fatal exception of type MWException.
    • To find the complete stack trace, search for d84af39036 in exception.log, or in Grafana under the "mediawiki" dashboard.
  • apache2.log: aggregated Apache error logs, see #syslog
  • api.log: API requests and their parameters (including redacted POST payloads, and temporary PII). This was sampled during 2014-2015 but no longer is; it is flushed every 30 days as of Nov 2015.
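
The request ID lookup described for exception.log can be sketched in Python as follows. This is a minimal illustration, not an official tool: the function name find_request_id is hypothetical, and it only returns lines that contain the ID itself, so it assumes the details of interest share a line with the ID. On the log host, a plain grep does the same job.

```python
from pathlib import Path
from typing import Iterator


def find_request_id(log_path: Path, request_id: str) -> Iterator[str]:
    """Yield every line of the log that mentions the given request ID.

    Hypothetical helper: the real log layout is not specified here, so this
    does a plain substring match, like `grep request_id log_path`.
    """
    # errors="replace" keeps the scan going if the log has stray bytes.
    with log_path.open(encoding="utf-8", errors="replace") as log:
        for line in log:
            if request_id in line:
                yield line.rstrip("\n")


if __name__ == "__main__":
    # Example path taken from the section heading; adjust as needed.
    for match in find_request_id(Path("/srv/mw-log/exception.log"), "d84af39036"):
        print(match)
```

Reading the file line by line keeps memory use flat, which matters for logs of this size.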

Specific components:

  • antispoof.log: Collision check passes and failures from the AntiSpoof extension. This checks for strings that look the same using different Unicode characters (such as spoofed usernames).
  • badpass.log: Failed login attempts to wikis.
  • captcha.log: Captcha attempts (both failed and successful attempts).
  • centralauth.log (2013-05-09–), centralauth-bug39996.log, centralauthrename.log (2014-07-14–): (temporary) debug logs for bugzilla:35707, bugzilla:39996, bugzilla:67875. In theory, rare events; can include username and page visited/request made.
  • CirrusSearch.log: Logs various info concerning Cirrus (update/query failures and various debug info). Cirrus now uses the analytics platform to log search requests (Analytics/Data/Cirrus).
  • CirrusSearchSlowRequests.log: Logs slow requests
  • CirrusSearchChangeFailed.log: Logs update failures
  • external.log: ExternalStore blob fetch failures (see External storage)
  • imagemove.log: Page renames in the File namespace that take place (both failed and successful renames).
  • memcached.log: Memcached for MediaWiki (WANObjectCache, misc ephemeral data, rate limiting counters, advisory locks).
  • poolcounter.log: PoolCounter failures (connection problems, excess queue size, wait timeouts).
  • redis.log: Redis query and connection failures (might involve sessions, job queues, and some other assorted features).
  • resourceloader.log: Exceptions related to ResourceLoader.
  • runJobs.log: Tracks job queue activity, including errors (both failed and successful runs).
    • Can be used to produce stats on jobs run on the various wikis, e.g. with Tim's perl ~/job-stats.pl runJobs.log.
  • swift-backend.log: Errors in the SwiftFileBackend class (timeouts and HTTP 500 type errors for file and listing reads/writes).
  • slow-parse.log (since May 2012; 6 months archive)
  • spam.log: SimpleAntiSpam honeypot hits from bots (attempted user actions are discarded).
  • XWikimediaDebug.log: see X-Wikimedia-Debug#Debug logging.

syslog

The syslog for all application servers can be found in apache2.log on mwlog1001 or in /srv/syslog/apache.log on centrallog1001. This includes things like segmentation faults.

5xx errors

5xx errors are available on centrallog1001.eqiad.wmnet:/srv/weblog/webrequest/5xx.json, and in logstash via the Varnish 5xx Logstash dashboard.
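
As a rough sketch of how such a file might be summarised, the Python below counts records per status code. It rests on two assumptions that this page does not confirm: that 5xx.json holds one JSON object per line, and that the status lives in a field called http_status (a hypothetical name, passed as a parameter so it can be changed).

```python
import json
from collections import Counter
from typing import Iterable


def count_by_status(lines: Iterable[str], field: str = "http_status") -> Counter:
    """Count records per status code in JSON-lines input.

    Assumptions (not confirmed by the source): one JSON object per line,
    and a status field named by `field`.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if isinstance(record, dict) and field in record:
            # Normalise to str so 503 and "503" land in the same bucket.
            counts[str(record[field])] += 1
    return counts


if __name__ == "__main__":
    with open("/srv/weblog/webrequest/5xx.json", encoding="utf-8") as f:
        for status, n in count_by_status(f).most_common():
            print(status, n)
```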

deploy1002:/var/log/l10updatelog/l10update.log

Source: scap

  • l10update.log: Error log for LocalisationUpdate runs.

vanadium:/var/log/eventlogging/

  • various: Logs of EventLogging entries. Potentially useful, in case their transformation into SQL records fails.

Request logs

Logs of any kind of request, e.g. viewing a wiki page, editing, using the API, loading an image.

  • Analytics/Data/Webrequest: "wmf.webrequest" is the name of an unsampled request archive in Hive. We started deleting older wmf.webrequest data in March 2015. We currently keep 62 days.

centrallog1001:/srv/weblog/webrequest

The cache (outer layer) request logs; see Squid logging#Log files.

The 1:1000 sampled logs are used for about 15 monthly and quarterly reports and day to day operations (source).
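
When working with 1:1000 sampled logs, counts have to be scaled back up to estimate real traffic. The Python sketch below shows that arithmetic; the function names are hypothetical, and the error estimate assumes requests are sampled independently at random, so that a sampled count behaves roughly like a Poisson variable. How the sampling is actually done is not stated on this page.

```python
import math


def estimate_total(sampled_count: int, sampling_rate: int = 1000) -> int:
    """Scale a count from 1:sampling_rate sampled logs to an estimated total."""
    return sampled_count * sampling_rate


def relative_error(sampled_count: int) -> float:
    """Approximate relative standard error of the estimate.

    Assumes independent random sampling (Poisson-like counts), for which the
    relative error is about 1 / sqrt(sampled_count).
    """
    if sampled_count <= 0:
        raise ValueError("need at least one sampled request")
    return 1.0 / math.sqrt(sampled_count)


if __name__ == "__main__":
    seen = 400  # made-up number of matching lines in a sampled log
    print(estimate_total(seen), relative_error(seen))
```

The practical point is that rare events are poorly served by sampled logs: with only a handful of sampled hits, the relative error is large.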

Beta cluster

The mw:Beta cluster in labs has a similar logging configuration to production. Various server logs are written to the remote syslog server deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud in /srv/mw-log.

Apache access logs are written to /var/log/apache2/other_vhosts_access.log on each beta cluster host.

See mw:Beta_Cluster#Testing_changes_on_Beta_Cluster for information on how to access the beta logstash web UI.

Mailservers

exim logs are retained for 90 days (see phabricator:T167333).

Dead

Lucene (search)

Each host logs at /a/search/log/log (now less noisy), see Search#Trouble on how to identify which host serves what pool etc.

fenari:/home/wikipedia/syslog

Source: All apaches

  • apache.log: Error log of all apaches (includes stderr of PHP, so PHP Notices, PHP Warnings etc.)
    • Use fatalmonitor to aggregate this into a (tailing) report
    • This has been deprecated in favor of fluorine:/a/mw-log/apache2.log and logstash.

fenari:/var/log/

Source: Machine-specific logs

External links