
Difference between revisions of "Monitoring/VarnishTrafficDrop"

From Wikitech-static
imported>Ema
(→‎Things to do: link the DDoS playbook)
 
imported>Ema
(Ema moved page Monitoring/VarnishTrafficDrop to Monitoring/EdgeTrafficDrop: varnish is an implementation detail)
 
Line 1: Line 1:
''VarnishTrafficDrop'' is a [[Alertmanager|Prometheus alert]] defined in [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-traffic/traffic.yaml traffic.yaml] in the operations/alerts repo. The alert fires when the request rate differs by a significant percentage from the recent past, which may indicate a traffic anomaly.
#REDIRECT [[Monitoring/EdgeTrafficDrop]]
 
=== Things to do ===
Check the dashboard [https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?orgId=1&viewPanel=5&refresh=15m varnish-caching-last-week-comparison] for the affected cluster/site. For example, if the alert says "44% GET drop in text@codfw during the past 30 minutes", select '''text''' as the cluster and '''codfw''' as the site. If the curve shows a clear drop without a preceding increase, as in the image ''Traffic Drop'' on the right, we may have served less traffic than normal because of either an attack or an anomaly in our infrastructure. If the pattern does not seem to recover on its own, page the Traffic team.
[[File:Esams-traffic-drop.png|thumb|Traffic Drop]]
If instead the curve looks like a spike, as shown in the image ''Traffic Spike'' on the right, that is likely due to some anomalous incoming traffic and in general there's not much to worry about.
[[File:Ulsfo-traffic-spike.png|thumb|Traffic Spike]]
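
The drop/spike distinction above comes down to a signed percentage change against a baseline. The following is a minimal sketch of that arithmetic, not the actual alert expression from traffic.yaml: the function names, the idea of comparing against the same window one week earlier, and the 30% threshold are all assumptions made for illustration.

```python
def percent_change(current_rate: float, baseline_rate: float) -> float:
    """Signed percentage change of current_rate relative to baseline_rate.

    Negative values mean less traffic than the baseline (a drop),
    positive values mean more traffic (a spike).
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (current_rate - baseline_rate) * 100.0 / baseline_rate


def classify(current_rate: float, baseline_rate: float,
             threshold_pct: float = 30.0) -> str:
    """Label a request rate as 'drop', 'spike' or 'normal'.

    threshold_pct is a hypothetical value, not the one used by the alert.
    """
    change = percent_change(current_rate, baseline_rate)
    if change <= -threshold_pct:
        return "drop"
    if change >= threshold_pct:
        return "spike"
    return "normal"
```

For instance, 56 requests per second now against a baseline of 100 is a change of -44%, which this sketch labels a drop, matching the "44% GET drop" wording in the example alert.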
 
Regardless of the shape of the curve, you should do the following:
* Look at the [https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1 load-balancers-lvs] dashboard for the given site. If you see clear spikes there, it's probably a DoS attack. See the [https://docs.google.com/document/d/1SeXdegjsfL94R6XYB1I4Uv8yjCPH1tVXeL0taJF0NNs/edit#heading=h.4clcms8g96vx (D)DoS Playbook].
* Take a look at [https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&var-site=ulsfo&var-cache_type=varnish-text varnish-aggregate-client-status-codes] for the relevant site/cluster to learn more about the type of traffic, in particular whether any specific method/status code stands out.
* Dig into webrequest_sampled_128 on [https://turnilo.wikimedia.org/ Turnilo] for the specific details of the type of requests causing the spike.
* Let ''#wikimedia-traffic'' know about your findings.
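
To jump straight to the right dashboard view, the cluster and site can be pulled out of the alert text and turned into a URL. This is a rough sketch: the message pattern is inferred only from the single example "44% GET drop in text@codfw during the past 30 minutes", and the rule that <code>var-cache_type</code> is <code>varnish-</code> followed by the cluster name is an assumption based on the one link above. The helper names are hypothetical.

```python
import re
from urllib.parse import urlencode

# Pattern inferred from one example alert message; real messages may differ.
ALERT_RE = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)% (?P<method>[A-Z]+) drop in "
    r"(?P<cluster>[\w-]+)@(?P<site>\w+)"
)

STATUS_CODES_DASHBOARD = (
    "https://grafana.wikimedia.org/d/myRmf1Pik/"
    "varnish-aggregate-client-status-codes"
)


def parse_alert(summary: str) -> dict:
    """Extract percentage, HTTP method, cluster and site from an alert summary."""
    match = ALERT_RE.search(summary)
    if match is None:
        raise ValueError(f"unrecognised alert summary: {summary!r}")
    return {
        "percent": float(match["pct"]),
        "method": match["method"],
        "cluster": match["cluster"],
        "site": match["site"],
    }


def status_codes_url(cluster: str, site: str) -> str:
    """Build the status-codes dashboard URL for a cluster/site pair.

    Assumes var-cache_type is 'varnish-<cluster>', as in the example link.
    """
    query = urlencode({
        "orgId": 1,
        "var-site": site,
        "var-cache_type": f"varnish-{cluster}",
    })
    return f"{STATUS_CODES_DASHBOARD}?{query}"
```

Usage: <code>status_codes_url(**{k: parse_alert(msg)[k] for k in ("cluster", "site")})</code> gives the dashboard link for the cluster and site named in <code>msg</code>.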
 
 
[[Category:Runbooks]]
[[Category:SRE Traffic]]

Latest revision as of 13:07, 22 October 2021