Incident documentation/2021-10-22 eqiad return path timeouts

document status: draft

Summary

Users had trouble connecting to the eqiad datacenter (caches/wikis as well as developer tools like Phabricator, Gerrit, etc.) because of a return path issue after datacenter maintenance (see T293726#7454820 for an explanation of how the maintenance went wrong). Because other networking maintenance was happening at the same time, it took a while to diagnose the underlying problem and gain confidence that we had fixed the right thing. In addition, the page only went out over IRC, not through the VictorOps pager, because the outage itself had broken roughly half of our outbound paths to the internet.
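To make the failure mode concrete, here is a minimal toy sketch of a return path (asymmetric routing) problem. It is not our actual routing configuration; the network names, route preferences, and the Python model itself are illustrative assumptions. The point it shows is that routers forward replies via a single best path, so if that best path happens to be the mis-patched Equinix IXP port, replies are dropped and clients see TCP timeouts even though their requests still reach eqiad.

    # Toy model of a return-path (asymmetric routing) failure.
    # Names and preferences below are made up for illustration only.
    return_routes = {
        # client network -> outbound paths from eqiad, most preferred first
        "client-net-A": ["equinix-ixp", "transit"],  # best route learned over the IXP peering
        "client-net-B": ["transit"],
    }
    blackholed = {"equinix-ixp"}  # traffic out the mis-patched port is silently dropped

    def reply_delivered(client_net: str) -> bool:
        best_path = return_routes[client_net][0]  # replies go out only the best path
        return best_path not in blackholed

    for net in return_routes:
        print(net, "->", "ok" if reply_delivered(net) else "tcp.timed_out (no reply)")

    # Disabling the Equinix IXP BGP sessions (the 20:41 fix) withdraws those
    # routes, so replies for client-net-A fall back to transit and recover.

This also suggests why only some users were affected: only client networks whose best return route went over that IXP saw timeouts.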

[Figure: Varnish traffic chart showing the dip in overall traffic during the outage and again when eqiad was temporarily depooled.]


Impact: From 20:06 to 20:48, some users whose traffic is geographically routed to eqiad had trouble connecting and received timeout errors. Based on traffic dashboards, we lost about 7k requests/second. For about 10 minutes, eqiad was depooled, sending users to a different datacenter and increasing latencies.
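For rough scale, the arithmetic below assumes the ~7k requests/second shortfall held for the entire 20:06 to 20:48 window, so it is an upper bound (impact briefly cleared between 20:41 and 20:45, and the per-second figure is itself an estimate from the dashboards):

    lost_rps = 7_000            # estimated lost requests per second, from traffic dashboards
    window_s = 42 * 60          # 20:06 to 20:48
    print(lost_rps * window_s)  # roughly 17.6 million requests over the whole window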

[Figure: Network Error Logging dashboard showing the outage. The initial recovery happens at 20:42 but regresses minutes later because the router change was not fully committed.]
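For context, this dashboard and the paging alert in the timeline are driven by reports that supporting browsers send us via the W3C Network Error Logging (NEL) mechanism. A report for this outage would look roughly like the sketch below; the URL and numeric values are invented for illustration, but "tcp.timed_out" is the error type the alert keyed on.

    # Approximate shape of one browser-submitted NEL report (illustrative values)
    example_nel_report = {
        "age": 0,
        "type": "network-error",
        "url": "https://en.wikipedia.org/wiki/Main_Page",  # example page, not from real data
        "body": {
            "type": "tcp.timed_out",   # the failure type spiking during this incident
            "phase": "connection",     # the TCP connection itself never completed
            "elapsed_time": 30000,     # milliseconds the browser waited (made up)
            "method": "GET",
            "status_code": 0,          # no HTTP response was ever received
            "sampling_fraction": 1.0,  # fraction of such requests that get reported
        },
    }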

Timeline

  • 19:38 Icinga reports issues with 1 interface going down on cr3-ulsfo
  • 19:39 Icinga reports 2 interfaces down on cr2-eqord and 1 down on cr2-codfw
  • 19:50 Telia mails us saying “we Suspected Cable fault in St Louis and your circuits are affected” and lists circuits IC-313592 and IC-314534 (this turned out to be unrelated: a red herring in the investigation)
  • 20:06 a cable is patched in eqiad (T293726#7452237) - incident & user impact begin here
  • 20:09 <AntiComposite> reports of timeouts from a few users on Discord
  • 20:10 <dontpanic> oh, I'm not the only one
  • 20:14 Icinga starts reporting RIPE Atlas probes as down
  • 20:15:04 <mutante> XioNoX: users report issues right after a cable was patched in Eqiad but things work for me
  • 20:15:20 <mutante> jclark-ctr: are you working with someone on that cable thing?
  • 20:15 <+icinga-wm> PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
  • 20:24 Incident opened, Kunal becomes IC
  • ... mostly looking into Telia transport links being down (red herring)
  • 20:28 <bblack> the transport one (IC-314534) the interface appears to be down, so that's good at that level
  • 20:32 <cdanis> saturation on transport links shouldn't be causing these flavors of NELs nor the RIPE Atlas alert
  • 20:35 <bblack> https://librenms.wikimedia.org/device/device=2/tab=port/port=11600/ <bblack> equinix peering in eqiad, there's a dropoff in traffic, probably from telia fiber cut impacting other peers there?
  • 20:39 < bblack> I think we've got some peers over that exchange which we're still advertising in one or both directions with, but are affected by telia somehow and the peering traffic is borked.
  • 20:41 < cdanis> it's a return path issue for sure
  • 20:41 <XioNoX> !log disable sessions to equinix eqiad IXP (SAL entry: https://sal.toolforge.org/log/GIK8qXwB8Fs0LHO5kUJR). User impact temporarily ends.
  • 20:41 <+icinga-wm> RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 74, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
  • 20:42 <greg-g> it's back <greg-g> my mtr is happy now
  • 20:42 multiple users start to report things are working now and they can reach Gerrit
  • 20:45 <greg-g> can't connect to the eqiad lb again. The change auto-rolls back on the router; user impact begins again.
  • 20:45 <+icinga-wm> RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28
  • 20:46 <cdanis> XioNoX: you did commit confirmed but didn't confirm <cdanis> it auto rolled back (Junos "commit confirmed" activates a change but automatically rolls it back unless a follow-up plain commit confirms it before the timer expires)
  • 20:46 Brandon depools eqiad in DNS (SAL entry: https://sal.toolforge.org/log/bCTDqXwB1jz_IcWuZmcx)
  • 20:47 <XioNoX> cdanis: er, yeah commiting for real. User impact ends again and stays fixed.
  • 20:48 users report recoveries, again
  • 20:49 <bblack> it takes several minutes for most to see a real impact from the dns-level depool, so any immediate recoveries are probably from re-committing the exchange fix
  • 20:50 <mutante> are we sure this has nothing to do with it? "eqiad: patch 2nd Equinix IXP - https://phabricator.wikimedia.org/T293726 (Jclark-ctr) Cable has been run shows link."
  • 20:52 <XioNoX> yeah, that's most likely it
  • 20:52 <cdanis> bug post is at 20:06 <cdanis> the NEL reports start at 20:06 exactly
  • 20:54 <XioNoX> I disabled the interface on cr1, going to re-enabled the active on on cr2 (SAL entry)
  • 20:57 Cause identified and fixed, incident over
  • 20:57 <bblack> !log re-pooling eqiad in DNS (SAL entry)

Actionables

  • T293726: Investigate & document why adding a new patch cable to the Equinix IX ports caused return path issues
  • T294166: NEL alert included #page in IRC but did not page through VictorOps (it did eventually page through VO, at 21:07)
    • This was because our outbound paths to any internet destinations best reached via the Equinix IX were broken from (half of?) eqiad for the duration of the incident.
    • Icinga also missed reporting when "BGP status on cr2-eqiad" went down, only reporting its recovery; our tentative theory is that it was also affected by the networking issues.
    • Working around this is difficult, but we should think about what to do.
  • Remind people that using Klaxon when they're relatively confident something is wrong is preferred to waiting for automated monitoring to page.
    • It's possible that Klaxon would also not work for the same users -- Klaxon is hosted in eqiad on alert1001.wikimedia.org, with the codfw installation being a passive backup host.
    • It would have required someone on IRC to relay the issue through Klaxon; the issue was relayed on IRC, there was just a lack of actually using the tool.