Logs/Runbook

Introduction

This page outlines some useful techniques for diagnosing issues with Wikimedia sites, based on the various application and infrastructure logs available.

The choice of which technique(s) to employ will largely depend on the nature of the situation.

Ad-Hoc Analysis

Sometimes it is useful to perform ad-hoc analysis during a live incident by tailing a log file of certain events and filtering it to your needs. The following examples may be adapted to your specific requirements.

Webrequest Sampled

This log file is available on centrallog1001 and centrallog2002 at /srv/log/webrequest/sampled-1000.json.

As the name suggests, 1 in 1,000 requests is extracted from the stream in Kafka and retained in this file. Each file contains one day's logs, and 62 days' worth of older logs are stored in /srv/log/webrequest/archive.
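
For a quick look back in time, the archived files can be searched in place. A minimal sketch, assuming the archives are gzip-compressed (the date-stamped file name here is hypothetical):

$ zcat /srv/log/webrequest/archive/sampled-1000.json-20220610.gz | jq -r 'select(.http_status == "429") | .ip' | sort | uniq -c | sort -nr | head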

Nice summary

$ tail -f sampled-1000.json | /home/legoktm/webreq-filter

Grep-able output

$ jq -r "[.uri_path,.hostname,.user_agent,.ip] | @csv" /srv/log/webrequest/sampled-1000.json
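
The resulting CSV can then be filtered with plain grep; for example, to pick out a single user agent (the string is only illustrative):

$ jq -r "[.uri_path,.hostname,.user_agent,.ip] | @csv" /srv/log/webrequest/sampled-1000.json | grep -F 'python-requests'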

Select all public_cloud nets with 429

$ tail -n10000 /srv/log/webrequest/sampled-1000.json | jq -r 'select(.http_status == "429") | select(.x_analytics | contains("public_cloud=1"))'
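
To see which addresses dominate, reduce the same selection to a per-IP count:

$ tail -n10000 /srv/log/webrequest/sampled-1000.json | jq -r 'select(.http_status == "429") | select(.x_analytics | contains("public_cloud=1")) | .ip' | sort | uniq -c | sort -nr | head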

Select all requests with a specific user_agent and referer (here both "-")

$ jq -r 'if .user_agent == "-" and .referer == "-" then [.uri_path,.hostname,.user_agent,.ip] else empty end | @csv' /srv/log/webrequest/sampled-1000.json
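
The same pattern works for concrete values; for example, to match on a substring of the user agent (the string is only illustrative):

$ jq -r 'if (.user_agent | contains("python-requests")) then [.uri_path,.hostname,.user_agent,.ip] else empty end | @csv' /srv/log/webrequest/sampled-1000.json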

List of the top 10 IPs by response size

$ jq -r '.ip + " " + (.response_size | tostring)' /srv/log/webrequest/sampled-1000.json | awk '{ sum[$1] += $2 } END { for (ip in sum) print sum[ip],ip }' | sort -nr | head -10
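
A similar pipeline gives the top 10 IPs by request count rather than by bytes:

$ jq -r '.ip' /srv/log/webrequest/sampled-1000.json | sort | uniq -c | sort -nr | head -10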

Select logs matching specific HTTP status, datestamp prefix, host, and uri_path, outputting the top query parameters found

$ tail -n300000 /srv/log/webrequest/sampled-1000.json | jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | .uri_query' | sort | uniq -c | sort -gr | head

5xx errors

Most of the queries for the sampled-1000 log will work here as well; the 5xx log lives at /srv/log/webrequest/5xx.json.

Grepable

$ tail -f /srv/log/webrequest/5xx.json | jq -r "[.uri_host, .uri_path, .uri_query, .http_method, .ip, .user_agent] | @csv"
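
To see which endpoints are failing the most, aggregate a recent slice of the log instead of following it live:

$ tail -n10000 /srv/log/webrequest/5xx.json | jq -r '.uri_host + .uri_path' | sort | uniq -c | sort -nr | head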

MediaWiki

All IPs which have made more than 100 large requests

$ awk '$2>60000 {print $11}' /var/log/apache2/other_vhosts_access.log | sort | uniq -c | awk '$1>100 {print}'
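
The field positions used above ($2 for the response size, $11 for the client IP) depend on the LogFormat configured for the vhost; a quick way to check the column numbering awk will see on a given host:

$ head -1 /var/log/apache2/other_vhosts_access.log | tr ' ' '\n' | cat -n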

Retrospective Analysis

When the situation calls for analysis of more historical data, or access to the complete set of data, the Analytics systems can help.

Turnilo

Turnilo has access to the webrequest_sampled_128 dataset, which is loaded into Druid every hour. As the name suggests, this samples 1 in 128 requests.

Data Lake

The primary source for webrequest logs is the Data Lake, specifically the Analytics/Data Lake/Traffic/Webrequest tables in Hive.

These tables are updated hourly and may be queried using Hive, Presto, or Spark.
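
As a minimal sketch (assuming the wmf.webrequest table and the webrequest_source/year/month/day/hour partitioning described on the Analytics/Data Lake/Traffic/Webrequest page), counting 429s by host for a single hour with Hive might look like:

$ hive -e "SELECT uri_host, COUNT(*) AS hits FROM wmf.webrequest WHERE webrequest_source='text' AND year=2022 AND month=6 AND day=10 AND hour=14 AND http_status='429' GROUP BY uri_host ORDER BY hits DESC LIMIT 10;"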

Please see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Sample_queries for some sample queries using Hive.