
Incident documentation/2018-08-08 Network


Summary

Topology changes made to improve the redundancy and stability of the switch stack asw2-a-eqiad caused it to drop ~1/3 of the packets transiting through its members for about 1h. This packet loss caused internal services to time out and retry. The exact user-facing impact is TBD, but it included at least an increase in 5xx errors.

Timeline (UTC)

17:14 - First topology change made

17:43 - Last topology change made (T201145#4489225)

17:47 - First Icinga alerts, some high API latencies, puppetfails, etc. IRC spam is bad, but no major pages or signs of broader user-facing issues yet.

18:07 - Replaced fpc1-fpc3 link for T201095 (Unaware of the alerts)

18:10 - Started investigating asw2-a-eqiad

18:30 - Disabled fpc1-fpc3 link

18:33 - Minor user-facing disturbances begin showing up as a low-but-unusual rate of 503s

18:42 - 503 rate begins climbing significantly, reaching ~5% of all cache_text requests at peak (probably roughly all of the misses and passes, e.g. logged-in traffic, with only cache hits being served). Grafana

18:47 - Disabled fpc2-fpc4 link

18:47 - First Icinga recoveries

18:50 - 503 burst that began at 18:42 comes back to normal near-zero rate.

19:18 - eqiad front edge depooled in DNS, to stabilize and reduce risk during follow-on investigation and fixups (takes 10 minutes for DNS TTLs to expire as this comes into effect)
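
The intervals implied by the timeline above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper names (`at`, `minutes_between`) are hypothetical, and the timestamps are copied from the timeline entries, all assumed to fall on 2018-08-08 UTC.

```python
from datetime import datetime, timedelta

def at(hhmm):
    """Parse an HH:MM timeline entry (assumed to be on 2018-08-08, UTC)."""
    return datetime.strptime("2018-08-08 " + hhmm, "%Y-%m-%d %H:%M")

def minutes_between(start, end):
    """Whole minutes elapsed between two HH:MM timeline entries."""
    return int((at(end) - at(start)).total_seconds() // 60)

# First topology change -> first Icinga alerts
first_change_to_alert = minutes_between("17:14", "17:47")
# Last topology change -> first Icinga alerts
last_change_to_alert = minutes_between("17:43", "17:47")
# First Icinga alerts -> first Icinga recoveries (the "about 1h" in the summary)
alert_to_recovery = minutes_between("17:47", "18:47")
# Duration of the significant 503 burst
burst_minutes = minutes_between("18:42", "18:50")
# DNS depool plus the 10-minute TTL expiry noted in the timeline
depool_effective = (at("19:18") + timedelta(minutes=10)).strftime("%H:%M")

print(first_change_to_alert, last_change_to_alert,
      alert_to_recovery, burst_minutes, depool_effective)
```

The gap between the first alerts and the first recoveries matches the roughly one hour of packet loss stated in the summary.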

Conclusions

  • Virtual Chassis are black boxes, which makes it more difficult to investigate issues
  • Topology changes included a cable move, which makes a rollback more difficult
  • Our current topologies are unsupported. This outage revealed that any change, even one toward a more supported configuration, can have bad consequences.
  • Logging the work in SAL as it was done could have reduced the response time
  • This event caused a driver issue on new cp1* servers, causing their link to be up on the switch side, but down on the server side

Actionables

  • Status: Unresolved - Fix asw2-a-eqiad topology phab:T201145
  • Status: TODO - Repool eqiad front edge traffic once eqiad is stable