Monitoring/EdgeTrafficDrop
Latest revision as of 16:08, 27 July 2022
EdgeTrafficDrop is a Prometheus Alertmanager alert defined in traffic.yaml (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-traffic/traffic.yaml) in the operations/alerts repo. The alert fires when the request rate differs by a significant percentage from the recent past, which may indicate a traffic anomaly.
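To make "percentage difference compared to the recent past" concrete, here is a minimal Python sketch of that kind of comparison. It is not the rule from traffic.yaml: the function names, the sample request rates, and the 40% threshold are assumptions chosen purely for illustration.

```python
def percent_change(current_rate: float, baseline_rate: float) -> float:
    """Return the signed percentage change of current_rate relative to baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (current_rate - baseline_rate) / baseline_rate * 100.0


def is_traffic_drop(current_rate: float, baseline_rate: float,
                    threshold_pct: float = 40.0) -> bool:
    """True when the rate fell by at least threshold_pct percent.

    threshold_pct is a hypothetical value, not the one used by the real alert.
    """
    return percent_change(current_rate, baseline_rate) <= -threshold_pct


# Hypothetical numbers: 56,000 req/s now versus 100,000 req/s in the comparison window.
change = percent_change(56_000, 100_000)
print(f"{change:.0f}% change")
```

With these made-up numbers the change is a 44% decrease, which would count as a drop under the assumed threshold.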
Things to do
Check the dashboard varnish-caching-last-week-comparison for the affected cluster/site. For example, if the alert says "44% GET drop in text@codfw during the past 30 minutes", select text as the cluster and codfw as the site. If the curve shows a clear drop with no preceding increase, as in the image Traffic Drop on the right, we may have served less traffic than normal because of an attack or an anomaly in our infrastructure. If the pattern does not seem to recover on its own, page the Traffic team.
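If you want to pull the cluster and site out of the alert text programmatically, something like the sketch below may help. The regular expression is inferred from the single example message above, so the real alert summary may not match it exactly; DropAlert and parse_drop_alert are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class DropAlert(NamedTuple):
    percent: int
    method: str
    cluster: str
    site: str
    minutes: int


# Pattern inferred from one example message; treat it as an assumption.
ALERT_RE = re.compile(
    r"(?P<percent>\d+)% (?P<method>[A-Z]+) drop in "
    r"(?P<cluster>[\w-]+)@(?P<site>[\w-]+) during the past (?P<minutes>\d+) minutes"
)


def parse_drop_alert(message: str) -> Optional[DropAlert]:
    """Extract the fields needed to pick the dashboard cluster and site, or None."""
    m = ALERT_RE.search(message)
    if m is None:
        return None
    return DropAlert(
        percent=int(m["percent"]),
        method=m["method"],
        cluster=m["cluster"],
        site=m["site"],
        minutes=int(m["minutes"]),
    )


alert = parse_drop_alert("44% GET drop in text@codfw during the past 30 minutes")
print(alert.cluster, alert.site)  # prints: text codfw
```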
If instead the curve looks like a spike, as shown in the image Traffic Spike on the right, the cause is likely anomalous incoming traffic and in general there is not much to worry about.
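The two curve shapes described above can be told apart roughly as in the following Python sketch. It is only an illustration of the reasoning, not how the dashboard or the alert works: classify_shape, the 20% tolerance, and the sample rates are all assumptions, and the baseline is taken to be the request rate at the same time last week.

```python
from typing import Sequence


def classify_shape(recent_rates: Sequence[float], last_week_rate: float,
                   tolerance: float = 0.2) -> str:
    """Classify a recent request-rate series as 'spike', 'drop', or 'normal'.

    A series that rose well above last week's rate at any point is a spike
    (the later "drop" is just the return to normal). A series that never rose
    but now sits well below last week's rate is a genuine drop.
    tolerance is a hypothetical value.
    """
    if not recent_rates:
        raise ValueError("recent_rates must not be empty")
    peak = max(recent_rates)
    latest = recent_rates[-1]
    if peak > last_week_rate * (1 + tolerance):
        return "spike"
    if latest < last_week_rate * (1 - tolerance):
        return "drop"
    return "normal"


# Hypothetical series of request rates, oldest first.
print(classify_shape([100, 180, 190, 105], last_week_rate=100))
print(classify_shape([100, 98, 60, 55], last_week_rate=100))
```

The first series rises before falling back, so it is treated as a spike; the second only falls, so it is treated as a drop.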
Regardless of the shape of the curve, you should do the following:
- Look at the load-balancers-lvs dashboard for the given site. If you see clear spikes there, it's probably a DoS attack. See the (D)DoS Playbook.
- Take a look at varnish-aggregate-client-status-codes for the relevant site/cluster to learn more about the type of traffic, in particular whether any specific method/status code stands out.
- Dig into webrequest_sampled_128 on Turnilo for the specific details of the type of requests causing the spike.
- Let #wikimedia-traffic know about your findings.
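When digging through sampled request data, it can help to rank request types by estimated volume. The Python sketch below assumes, as the dataset name suggests, that webrequest_sampled_128 holds roughly 1 in 128 requests, so counts are multiplied by 128 to estimate real volume. The row layout, the field names http_method and http_status, and the function name are hypothetical; Turnilo itself is used through its web interface rather than through code like this.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple

SAMPLE_FACTOR = 128  # assumption based on the dataset name


def top_request_types(rows: Iterable[Mapping[str, str]],
                      n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most common (method, status) pairs with estimated request counts."""
    counts = Counter((row["http_method"], row["http_status"]) for row in rows)
    return [(key, count * SAMPLE_FACTOR) for key, count in counts.most_common(n)]


# Hypothetical sampled rows exported for the affected site/cluster.
rows = [
    {"http_method": "GET", "http_status": "200"},
    {"http_method": "GET", "http_status": "200"},
    {"http_method": "GET", "http_status": "200"},
    {"http_method": "GET", "http_status": "429"},
    {"http_method": "GET", "http_status": "429"},
    {"http_method": "POST", "http_status": "200"},
]

for (method, status), estimate in top_request_types(rows, n=2):
    print(method, status, estimate)
```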