
User:Jbond/debugging

From Wikitech-static

USE method

http://www.brendangregg.com/USEmethod/use-linux.html

Logs

https://wikitech.wikimedia.org/wiki/Logs

Network

https://wikitech.wikimedia.org/wiki/Network_cheat_sheet#Juniper

Sampled-1000.json on centrallog1001

https://wikitech.wikimedia.org/wiki/Logs/Runbook#Webrequest_Sampled

Example of digging into the data (from cdanis)

$ tail -n300000  /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | .uri_host' | sort | uniq -c | sort -gr
	  45371 www.wikipedia.org
	    728 en.wikipedia.org
	     16 upload.wikimedia.org
	      6 de.wikipedia.org
	      5 pt.wikipedia.org
	      5 fr.wikipedia.org
	      5 es.wikipedia.org
	      4 query.wikidata.org
	      2 nl.wikipedia.org
	      2 ja.wikipedia.org
	      2 api.wikimedia.org
	      1 sv.wikipedia.org
$ tail -n300000  /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org")' | less                                 
$ tail -n300000  /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | .uri_path' | sort | uniq -c | sort -gr
	  45371 /
$ tail -n300000  /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | .uri_query' | sort | uniq -c | sort -gr | head
	      1 ?q=ZZZWF8bdj6hw
	      1 ?q=zZzsfU01A8F4
	      1 ?q=ZzzLH6zEJRvD
	      1 ?q=ZzZiRz0QoPBK
	      1 ?q=zZWIuevTlAOu
	      1 ?q=ZzvAdulyFrRe
	      1 ?q=ZZv96mB4T6WK
	      1 ?q=zzUrTAWa2kA8
	      1 ?q=zzUPPhnOicQ4
	      1 ?q=ZZT8Y8D2gRnE
$ tail -n400000  /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | select(.uri_query|test("^\\?q=[^&]+$")) | .user_agent' | sort | uniq -c | sort -gr
7711 Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
7656 Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko
7567 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3599.0 Safari/537.36
7535 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.18247
7451 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3599.0 Safari/537.36
7451 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3599.0 Safari/537.36
$ tail -n400000  /srv/weblog/webrequest/sampled-1000.json| jq -r 'select(.http_status == "429") | select(.dt | contains("2022-06-10T14:5")) | select(.uri_host == "www.wikipedia.org") | select(.uri_path == "/") | select(.uri_query|test("^\\?q=[^&]+$")) | .tls' | sort | uniq -c | sort -gr
43774 vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new
1597 vers=TLSv1.2;keyx=UNKNOWN;auth=ECDSA;ciph=AES256-GCM-SHA384;prot=h2;sess=new
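The select/test filter chain used above can be sanity-checked against inline JSON, without access to centrallog. A minimal sketch; the two sample records below are fabricated:

```shell
# Two fabricated webrequest-style records; only the first passes the filter.
printf '%s\n' \
  '{"http_status":"429","uri_host":"www.wikipedia.org","uri_query":"?q=abc123"}' \
  '{"http_status":"200","uri_host":"www.wikipedia.org","uri_query":"?q=def456"}' |
  jq -r 'select(.http_status == "429") | select(.uri_query|test("^\\?q=[^&]+$")) | .uri_query'
# prints: ?q=abc123
```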


mw server

List all IPs which have made more than 100 large requests

$ awk '$2>60000 {print $11}' /var/log/apache2/other_vhosts_access.log | sort | uniq -c | awk '$1>100 {print}'

MediaWiki Shell

$ ssh mwmaint1002
$ mwscript maintenance/shell.php --wiki=enwiki

Then

>>> var_dump($wgUpdateRowsPerQuery);
int(100)
=> null
>>>

One-off purge

On mwmaint1002, run:

$ echo 'https://example.org/foo?x=y' | mwscript purgeList.php

re: https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge

LVS Server

Sample 100k packets and list the top talkers

$ sudo tcpdump -i enp4s0f0 -pn -c 100000 | sed -r 's/.* IP6? //;s/\.[^\.]+ .*//' | sort | uniq -c | sort -nr | head -20
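The sed in that pipeline strips everything but the source address; it can be checked against a fabricated tcpdump line (the address below is from the documentation range, not real traffic):

```shell
# First expression drops everything up to and including " IP " (or " IP6 "),
# second drops the port and the rest of the line, leaving the bare address.
line='14:50:01.000000 IP 192.0.2.10.443 > 198.51.100.7.52814: Flags [S]'
echo "$line" | sed -r 's/.* IP6? //;s/\.[^\.]+ .*//'
# prints: 192.0.2.10
```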

Testing a site against a specific LVS

$ curl --connect-to "::text-lb.${site}.wikimedia.org" https://en.wikipedia.org/wiki/Main_Page?x=$RANDOM

CP Server

Query for specific status code

$ sudo varnishncsa -n frontend -g request -q 'RespStatus eq 429'

Custom format with client IP address

$ sudo -i varnishncsa -n frontend -g request -q 'RespStatus eq 429' -F '%{X-Client-IP}i %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{X-Forwarded-Proto}i\"'

Or the much more verbose version

$ sudo varnishlog -n frontend -g request -q 'RespStatus eq 429'

Check the connection tuples for the varnish

$ sudo ss -tan 'sport = :3120' | awk '{print $(NF)" "$(NF-1)}' | sed 's/:[^ ]*//g' | sort | uniq -c

The number of available ephemeral ports (which caps the number of connection tuples) can be read from the file below; if the count from the ss command above approaches this number, there may be a port-exhaustion issue

$ cat /proc/sys/net/ipv4/ip_local_port_range
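A quick arithmetic sketch of that check. The numbers below are hypothetical; on a real host, take low/high from the file above and in_use from the ss command:

```shell
# Hypothetical values: low/high from ip_local_port_range, in_use from ss.
low=32768; high=60999
in_use=28000
available=$((high - low + 1))
echo "using ${in_use} of ${available} ports"
# flag when usage passes roughly 90% of the range
if [ $((in_use * 10)) -ge $((available * 9)) ]; then
  echo "possible port exhaustion"
fi
```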

Checking sites from CP server

You can use curl from the cp servers to make sure you hit the frontend or backend cache and fetch a specific site, with the following commands.

Using ?x=$RANDOM below adds a unique query string, which prevents the request from being served from cache.

frontend

$ curl --connect-to "::$HOSTNAME" https://en.wikipedia.org/wiki/Main_Page?x=$RANDOM

backend

$ curl --connect-to "::$HOSTNAME:3128" -H "X-Forwarded-Proto: https" https://en.wikipedia.org/wiki/Main_Page?x=$RANDOM

Proxied web service

Show all request and response headers on loopback

$ sudo stdbuf -oL -eL /usr/sbin/tcpdump -Ai lo -s 10240 "tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)" | egrep -a --line-buffered ".+(GET |HTTP\/|POST )|^[A-Za-z0-9-]+: " | perl -nle 'BEGIN{$|=1} { s/.*?(GET |HTTP\/[0-9.]* |POST )/\n$1/g; print }'

re: https://serverfault.com/a/633452/464916

Show full body

$ sudo stdbuf -oL -eL /usr/sbin/tcpdump -Ai lo -s 10240 "tcp port 8001 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)"

Pooling

Check the pooled state

Service

$ confctl select service=thumbor get

Host

$ confctl select dc=eqiad,cluster=cache_text,service=varnish-be,name=cp1052.eqiad.wmnet get

Depooling

https://wikitech.wikimedia.org/wiki/Depooling_servers

pybal

Check log files /var/log/pybal.log on lvs servers
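For example, to spot backends that pybal has marked down. The sample log lines below are fabricated for illustration; the real format in /var/log/pybal.log may differ:

```shell
# Fabricated pybal-style log lines, for illustration only.
cat > /tmp/pybal.sample.log <<'EOF'
2022-06-10 14:50:01 [appservers_80] mw1349.eqiad.wmnet (enabled/up/pooled)
2022-06-10 14:50:05 [appservers_80] mw1350.eqiad.wmnet (enabled/down/not pooled)
EOF
grep 'not pooled' /tmp/pybal.sample.log
```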

Postgresql

Display locks

SELECT a.datname,
         l.relation::regclass,
         l.transactionid,
         a.query,
         age(now(), a.query_start) AS "age",
         a.pid
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
ORDER BY a.query_start;

Show queries blocked waiting on a lock

SELECT blocked_locks.pid     AS blocked_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.pid     AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks 
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid
    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted;

Get table sizes

SELECT nspname || '.' || relname AS "relation",
      pg_size_pretty(pg_relation_size(C.oid)) AS "disk size", 
      pg_size_pretty( pg_total_relation_size(nspname || '.' || relname)) AS "size" 
    FROM pg_class C
    LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
   WHERE nspname IN ('public')
    ORDER BY pg_relation_size(C.oid) DESC;

DHCPd

Use the following to capture DHCP traffic for a specific client MAC address. In this example the MAC address is aa:00:00:d9:81:8a; only the last four bytes (00:d9:81:8a) are used in the filter below.

$ sudo tcpdump -i ens5 -vvv -s 1500 '((port 67 or port 68) and (udp[38:4] = 0x00d9818a))'
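The offset works out as follows: the DHCP chaddr field starts 28 bytes into the payload, i.e. at udp[36] after the 8-byte UDP header, so udp[38:4] covers the last four octets of a 6-byte MAC. A small helper (hypothetical, for illustration) to build the comparison value from a MAC:

```shell
# Derive the udp[38:4] comparison value (last four octets) from a MAC.
mac='aa:00:00:d9:81:8a'
hex=$(echo "$mac" | awk -F: '{print $3 $4 $5 $6}')
echo "udp[38:4] = 0x${hex}"
# prints: udp[38:4] = 0x00d9818a
```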

iPXE cli

While booting, press Ctrl+B to drop into the iPXE shell. You may need to use the advanced console connection options.