Logs
Latest revision as of 18:05, 14 September 2021


This page is about server log files. For IRC channel logs, see e.g. http://wm-bot.wmflabs.org/

Logs of several sorts are generated across the cluster and collected in a single location replicated on some machines. Privileged users can explore most logs through the Kibana front-end at https://logstash.wikimedia.org/.

The SRE Observability team is working on a common log format called ECS; see the design doc at https://docs.google.com/document/d/1HYHCPvuz93nAYXQSEReUN07HQTQUF_nvltag5H_YZq4/edit#heading=h.vpanev2oq14b and the intro slides. ECS documentation can be found at https://doc.wikimedia.org/ecs/

mwlog1002:/srv/mw-log/

These record wfDebugLog() and similar calls in MediaWiki (see especially mw:Structured logging). All cluster-wide logs are aggregated here (configured through $wmfUdp2logDest; see also wmgMonologChannels). There are dozens of log files, amounting to around 15 GB compressed per day as of April 2015. Some are not sent to logstash (settings) and some are sampled; log archives are stored for a variable amount of time, up to 90 days (per the data retention guidelines). Note that logstash also records the context data for structured logging, so it may contain significantly more information than the flat files.

Source: Cluster-wide

  • exception.log: Exceptions are exposed to users in simplified form with a hexadecimal fingerprint (e.g. given "[1903eff7] 2013-06-18 02:39:00: Fatal exception of type MWException", grep the exception log for "1903eff7" to find the complete stack trace). See bug 38095 for background.
    • Counts for fatals and exceptions are monitored in Grafana.
  • antispoof.log: Collision check passes and failures from the AntiSpoof extension. This checks for strings that look the same using different Unicode characters (such as spoofed usernames).
  • apache2.log: aggregated Apache error logs, see #syslog
  • api.log: API requests (including URLs and some agent info, like username and IP address). Sampled 1:1000 from 2014-12-15 to some time in 2015, flushed every 30 days as of November 2015.
  • badpass.log: Failed login attempts to wikis.
  • captcha.log: Captcha attempts (both failed and successful attempts).
  • centralauth.log (2013-05-09–), centralauth-bug39996.log, centralauthrename.log (2014-07-14–): (temporary) debug logs for bugzilla:35707, bugzilla:39996, bugzilla:67875. In theory, rare events; can include username and page visited/request made.
  • CirrusSearch.log: Various info concerning Cirrus (update/query failures and assorted debug info). Cirrus now uses the analytics platform to log search requests (Analytics/Data/Cirrus).
  • CirrusSearchSlowRequests.log: Logs slow requests
  • CirrusSearchChangeFailed.log: Logs update failures
  • dberror.log: Database errors (invalid queries, missing tables, deadlocks, lock-wait timeouts, disconnections).
  • dbperformance.log: DB transactions that hold DB locks open for a long time while running slow functions.
  • exec.log: Errors from shell commands run by MediaWiki via wfShellExec (logs the command and error string).
  • external.log: ExternalStore blob fetch failures (see External storage)
  • fatal.log: Fatal PHP errors during web requests, responded to with a Wikimedia error page. With HHVM, these generally appear under the "hhvm" logs in logstash.
  • filebackend-ops.log: FileBackendStore operation failures (i.e. backend errors that happen during user file uploads).
  • generated-pp-node-count.log: High node count parses that took place (typically for slow parses of very large and complex articles).
  • gettingstarted.log: ?
  • imagemove.log: Page renames in the File namespace that take place (both failed and successful renames).
  • memcached-serious.log: Memcached access failures (affects caching and storage of ephemeral data, like rate limiting counters and advisory locks).
  • poolcounter.log: PoolCounter failures (connection problems, excess queue size, wait timeouts).
  • redis.log: Redis query and connection failures (might involve sessions, job queues, and some other assorted features).
  • redis-jobqueue.log: Redis query and connection failures from JobQueueRedis.
  • resourceloader.log: Exceptions related to ResourceLoader.
  • runJobs.log: Tracks job queue activity, including errors (both failed and successful runs).
    • Can be used to produce stats on jobs run on the various wikis, e.g. with Tim's perl ~/job-stats.pl runJobs.log.
  • slow-parse.log (since May 2012; 6 months archive)
  • spam.log: SimpleAntiSpam honeypot hits from bots (attempted user actions are discarded).
  • swift-backend.log: Errors in the SwiftFileBackend class (timeouts and HTTP 500 type errors for file and listing reads/writes).
  • temp-debug.log: Used for temporary logging of misc things during live debug sessions.
  • test2wiki.log: Full wfDebug log of test2.wikipedia.org.
  • testwiki.log: Full wfDebug log of test.wikipedia.org.
  • thumbnail.log: Failed thumbnail transformations (e.g. missing file, conversion failure, 0-byte output files).
  • xff.log: User agent and IP data for POST requests.
  • XWikimediaDebug.log: see X-Wikimedia-Debug#Debug logging.
  • zero.log (since May 2013)
  • archive: Directory holding historical versions of the above logs (one compressed file per log source per day).
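The exception.log workflow described above — pull the bracketed fingerprint out of the user-facing message, then search the log for it — can be sketched as follows. This is a minimal illustration, not tooling that exists on the cluster; the sample message is taken from the exception.log entry above.

```python
import re

# User-facing fatals show an 8-character hex fingerprint in brackets, e.g.
# "[1903eff7] 2013-06-18 02:39:00: Fatal exception of type MWException".
FINGERPRINT_RE = re.compile(r"\[([0-9a-f]{8})\]")

def extract_fingerprint(user_error):
    """Return the hexadecimal fingerprint from a simplified error message, or None."""
    m = FINGERPRINT_RE.search(user_error)
    return m.group(1) if m else None

def matching_lines(log_lines, fingerprint):
    """Filter exception.log lines down to those carrying the given fingerprint."""
    return [line for line in log_lines if fingerprint in line]
```

On the log host itself this amounts to something like grep 1903eff7 /srv/mw-log/exception.log.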

syslog

The syslog for all application servers can be found in apache2.log on mwlog1001, or in /srv/syslog/apache.log on centrallog1001. This includes things like segmentation faults.
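Because segfault reports sit in that aggregated file among ordinary PHP errors, a trivial filter is often useful. A sketch (the sample log line in the test is made up, not a real cluster message):

```python
import re

# Match both kernel-style "segfault at ..." lines and the spelled-out form.
SEGFAULT_RE = re.compile(r"segfault|segmentation fault", re.IGNORECASE)

def segfault_lines(syslog_lines):
    """Keep only the syslog lines that report a segmentation fault."""
    return [line for line in syslog_lines if SEGFAULT_RE.search(line)]
```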

5xx errors

5xx errors are available on centrallog1001.eqiad.wmnet:/srv/weblog/webrequest/5xx.json, and in logstash via the 5xx Kibana dashboard.
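Since 5xx.json is newline-delimited JSON, quick tallies are easy to script. A sketch — note that the field names used here (http_status, uri_host) are assumptions about the record schema, not confirmed by this page:

```python
import json
from collections import Counter

def count_5xx_by_host(lines):
    """Tally 5xx responses per host from newline-delimited JSON log records.

    Field names ('http_status', 'uri_host') are assumed, not authoritative.
    """
    counts = Counter()
    for line in lines:
        rec = json.loads(line)
        if str(rec.get("http_status", "")).startswith("5"):
            counts[rec.get("uri_host", "unknown")] += 1
    return counts
```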

deploy1002:/var/log/l10updatelog/l10update.log

Source: scap

  • l10update.log: Error log for LocalisationUpdate runs.

vanadium:/var/log/eventlogging/

  • various: Logs of EventLogging entries. Potentially useful, in case their transformation into SQL and MongoDB records fails.

Request logs

Logs of any kind of request, e.g. viewing a wiki page, editing, using the API, loading an image.

  • Analytics/Data/Webrequest: "wmf.webrequest" is the name of an unsampled request archive in Hive. We started deleting older wmf.webrequest data in March 2015; we currently keep 62 days.

centrallog1001:/srv/weblog/webrequest

The cache (outer layer) request logs; see Squid logging#Log files.

The 1:1000 sampled logs are used for about 15 monthly and quarterly reports and for day-to-day operations.
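Reports built on the 1:1000 sample have to scale observed counts back up to estimates of the true traffic; the arithmetic is just multiplication by the sampling factor. A sketch, not the actual report code:

```python
def estimate_total(sampled_count, rate=1000):
    """Estimate the true request count from a 1:<rate> sampled log.

    The result is an estimate; sampling error grows as counts shrink.
    """
    return sampled_count * rate
```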

Beta cluster

The mw:Beta cluster in labs has a similar logging configuration to production. Various server logs are written to the remote syslog server deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud in /srv/mw-log.

Apache access logs are written to /var/log/apache2/other_vhosts_access.log on each beta cluster host.

Mailservers

exim logs are retained for 90 days (see phabricator:T167333).

Dead

Lucene (search)

Each host logs to /a/search/log/log (now less noisy); see Search#Trouble for how to identify which host serves which pool.

fenari:/home/wikipedia/syslog

Source: All apaches

  • apache.log: Error log of all apaches (includes stderr of PHP, so PHP notices, PHP warnings, etc.)
    • Use fatalmonitor to aggregate this into a (tailing) report
    • This has been deprecated in favor of fluorine:/a/mw-log/apache2.log and logstash.

fenari:/var/log/

Source: Machine-specific logs

External links

  • Recommendation from the Usage Log Retention Policy Workshop: http://groups.ischool.berkeley.edu/log-mgmt/