
Monitoring/VarnishTrafficDrop

Revision as of 13:27, 12 October 2021 by imported>Ema (→‎Things to do: link the DDoS playbook)

VarnishTrafficDrop is a Prometheus alert defined in traffic.yaml in the operations/alerts repo. The alert fires when the request rate differs by a significant percentage from the recent past, which may indicate a traffic anomaly.
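
To make "percentage difference in request rate" concrete, here is a minimal Python sketch of the arithmetic. It is not the rule from traffic.yaml: the function names, the choice of baseline, and the 40% threshold are all assumptions made for illustration.

```python
def percent_change(current_rate: float, baseline_rate: float) -> float:
    """Signed percentage change of current_rate relative to baseline_rate.

    Negative values mean a drop, positive values mean an increase.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (current_rate - baseline_rate) * 100.0 / baseline_rate


def is_significant(current_rate: float, baseline_rate: float,
                   threshold_pct: float = 40.0) -> bool:
    """True if the rate moved by at least threshold_pct in either direction.

    threshold_pct is a hypothetical value, not the one used by the real alert.
    """
    return abs(percent_change(current_rate, baseline_rate)) >= threshold_pct
```

For instance, a cluster that served 100 requests per second in the baseline window and now serves 56 has a change of -44%, which this sketch would flag under the assumed threshold.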

Things to do

Check the dashboard varnish-caching-last-week-comparison for the affected cluster/site. For example, if the alert says "44% GET drop in text@codfw during the past 30 minutes", select text as the cluster and codfw as the site. If the curve shows a clear drop without a previous increase, as in the image Traffic Drop on the right, we may have served less traffic than normal because of either an attack or an anomaly in our infrastructure. If the pattern does not seem to recover on its own, page the Traffic team.

If instead the curve looks like a spike, as shown in the image Traffic Spike on the right, that is likely due to some anomalous incoming traffic and in general there's not much to worry about.
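
When choosing the cluster and site on the dashboard, the two values can be read straight out of the alert text. The sketch below parses the example message quoted above; the regular expression is an assumption based on that single example, and parse_alert is a hypothetical helper rather than part of any existing tooling.

```python
import re

# Pattern inferred from the single example message on this page (an assumption).
ALERT_RE = re.compile(
    r"(?P<percent>\d+(?:\.\d+)?)% (?P<method>[A-Z]+) drop in "
    r"(?P<cluster>[\w-]+)@(?P<site>[\w-]+) "
    r"during the past (?P<minutes>\d+) minutes"
)


def parse_alert(summary: str):
    """Return the alert fields as a dict, or None if the text does not match."""
    match = ALERT_RE.search(summary)
    if match is None:
        return None
    return {
        "percent": float(match.group("percent")),
        "method": match.group("method"),
        "cluster": match.group("cluster"),
        "site": match.group("site"),
        "minutes": int(match.group("minutes")),
    }


fields = parse_alert("44% GET drop in text@codfw during the past 30 minutes")
print(fields["cluster"], fields["site"])
```

Returning None on a mismatch, instead of raising, keeps the helper safe to use if the alert wording differs from the example.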

Regardless of the shape of the curve, you should do the following:

  • Look at the load-balancers-lvs dashboard for the given site. If you see clear spikes there, it's probably a DoS attack. See the (D)DoS Playbook.
  • Take a look at varnish-aggregate-client-status-codes for the relevant site/cluster to learn more about the type of traffic, in particular whether any specific method/status code stands out.
  • Dig into webrequest_sampled_128 on Turnilo for the specific details of the type of requests causing the spike.
  • Let #wikimedia-traffic know about your findings.