Revision as of 22:33, 11 June 2022 by imported>Quiddity (fix)


This page outlines some useful techniques that can be used to help diagnose issues with Wikimedia sites, based on the various application and infrastructure logs available.

The choice of which technique(s) to employ will largely depend on the nature of the situation.

Ad-Hoc Analysis

Sometimes it is useful to be able to perform ad-hoc analysis of a real-time incident, by viewing a live log file of certain events and filtering it according to your needs. The following examples may be adapted to your specific requirements.

Webrequest Sampled

This log file is available on centrallog1001 and centrallog2002 in the file: /srv/log/webrequest/sampled-1000.json

As the name suggests, 1 in 1,000 requests are extracted from the stream in Kafka and retained in this file. Each file contains one day's logs, and 62 days' worth of older logs are stored in /srv/log/webrequest/archive.
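To get oriented before filtering, it can help to inspect the fields of a single record. This is a sketch: the selected fields (.dt, .http_status, .uri_host, .uri_path, .ip, .user_agent) are assumed to be present per the webrequest JSON schema.

```shell
# Pretty-print a few key fields from the most recent sampled record.
tail -n1 /srv/log/webrequest/sampled-1000.json \
  | jq '{dt, http_status, uri_host, uri_path, ip, user_agent}'
```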

A nice summary

$ tail -f sampled-1000.json | /home/legoktm/webreq-filter

Grep-able output

$ jq  -r "[.uri_path,.hostname,.user_agent,.ip] | @csv" /srv/log/webrequest/sampled-1000.json
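The CSV output above can then be piped through the usual text tools. For example (a sketch; the "bot" pattern is purely illustrative), to count the top paths requested by user agents matching a pattern:

```shell
# Extract path/host/UA/IP as CSV, keep rows mentioning "bot" (illustrative
# pattern), then count the most frequently requested paths.
jq -r '[.uri_path,.hostname,.user_agent,.ip] | @csv' /srv/log/webrequest/sampled-1000.json \
  | grep -i 'bot' | cut -d, -f1 | sort | uniq -c | sort -nr | head
```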

Select all public_cloud nets with 429

$ tail -n10000 /srv/log/webrequest/sampled-1000.json | jq -r 'select(.http_status == "429") | select(.x_analytics | contains("public_cloud=1"))'

Select all requests with a specific user_agent and .referer

$ jq -r 'if .user_agent == "-" and .referer == "-" then [.uri_path,.hostname,.user_agent,.ip] else empty end | @csv' /srv/log/webrequest/sampled-1000.json

List of the top 10 IPs by response size

$ jq -r '.ip + " " + (.response_size | tostring)' /srv/log/webrequest/sampled-1000.json| awk '{ sum[$1] += $2 } END { for (ip in sum) print sum[ip],ip }' | sort -nr | head -10
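The same aggregation can also be done entirely in jq, at the cost of slurping the whole file into memory (a sketch):

```shell
# Slurp all records, sum response_size per IP, print the top 10.
jq -rs 'group_by(.ip)
        | map({ip: .[0].ip, total: (map(.response_size) | add)})
        | sort_by(-.total) | .[:10][]
        | "\(.total) \(.ip)"' /srv/log/webrequest/sampled-1000.json
```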

Select logs matching specific HTTP status, datestamp prefix, host, and uri_path, outputting the top query parameters found

$ tail -n300000 /srv/log/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "") | select(.uri_path == "/") | .uri_query' | sort | uniq -c | sort -gr | head

5xx errors

Most of the queries for the sampled-1000 log would work here as well.


$ tail -f /srv/log/webrequest/5xx.json | jq -r "[.uri_host, .uri_path, .uri_query, .http_method, .ip, .user_agent] | @csv"


All IPs which have made more than 100 large requests

$ awk '$2>60000 {print $11}' /var/log/apache2/other_vhosts_access.log | sort | uniq -c | awk '$1>100 {print}'
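A similar filter can be expressed against the sampled JSON log (a sketch; the 60000-byte and 100-request thresholds are illustrative, as in the awk example above):

```shell
# Print IPs that made more than 100 requests with responses over 60000 bytes.
jq -r 'select(.response_size > 60000) | .ip' /srv/log/webrequest/sampled-1000.json \
  | sort | uniq -c | awk '$1 > 100 {print $2}'
```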

Retrospective Analysis

When the situation calls for analysis of more historical data, or to access the complete set of data, the Analytics Systems can help.


Turnilo

Turnilo has access to the sampled webrequest dataset, which is loaded every hour into Druid. As the name suggests, this samples 1 in 128 requests.

Data Lake

The primary source for webrequest logs is the Data Lake and the Analytics/Data Lake/Traffic/Webrequest tables in Hive.

These tables are updated hourly and may be queried using Hive, Presto, or Spark.

Please see for some sample queries using Hive.