Incident documentation/2021-11-18 codfw ipv6 network: Difference between revisions
imported>Ayounsi (added details) |
imported>Krinkle mNo edit summary |
Revision as of 12:22, 3 December 2021
document status: in-review
Summary
After one of the codfw row B spine switches (asw-b7-codfw) was preemptively replaced because of signs of disk failure, the new switch silently discarded IPv6 traffic (both through and within the switch).
As this switch was a spine, ~50% of the traffic toward that row (from cr2) transited through it.
At the time, row B hosted the load balancer in front of upload-lb.codfw, so the impact on upload-lb was the most visible.
Monitoring alerted, and the interface between asw-b7-codfw and cr2-codfw was disabled, forcing traffic through the cr1<->asw-b2-codfw link, which resolved the upload-lb issue.
Replacing the switch did not solve the underlying IPv6 issue, which showed that the hardware was not at fault. Forcing a virtual-chassis master failover resolved what we think was a Junos (switch operating system) bug.
Note that at the time of the issue, our Juniper support contract had expired, preventing us from opening a JTAC case.
Impact: For 8 minutes, the Codfw cluster experienced partial loss of IPv6 connectivity for upload.wikimedia.org. Thanks to Happy Eyeballs there was no visible user impact (or, at worst, a slight latency increase). The Codfw cluster generally serves Mexico and parts of the US and Canada. The upload.wikimedia.org service serves photos and other media/document files, such as those displayed in Wikipedia articles.
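To illustrate why Happy Eyeballs would hide a silent IPv6 black hole from users, here is a minimal simulation. It is not taken from the incident and opens no real connections: the function names, the timings, and the simplified racing logic are all assumptions made for illustration. The 250 ms attempt delay is the default recommended by RFC 8305; real clients may use other values and also fall back sooner when IPv6 fails explicitly.

```python
import asyncio
from typing import Optional

# Assumption: RFC 8305's recommended "Connection Attempt Delay" of 250 ms.
ATTEMPT_DELAY = 0.25


async def attempt(family: str, connect_time: Optional[float]) -> str:
    """Simulated connection attempt (hypothetical helper, no real sockets).

    connect_time=None models packets that are silently discarded,
    so the attempt never completes.
    """
    if connect_time is None:
        await asyncio.Event().wait()  # blocks until cancelled
    await asyncio.sleep(connect_time)
    return family


async def happy_eyeballs(v6_time: Optional[float],
                         v4_time: Optional[float],
                         delay: float = ATTEMPT_DELAY) -> str:
    """Start IPv6 first, start IPv4 after `delay`, return the first winner."""

    async def delayed_v4() -> str:
        await asyncio.sleep(delay)
        return await attempt("ipv4", v4_time)

    tasks = [
        asyncio.create_task(attempt("ipv6", v6_time)),
        asyncio.create_task(delayed_v4()),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancel the losing attempt and wait for the cancellation to settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()
```

With healthy IPv6, `asyncio.run(happy_eyeballs(0.02, 0.02))` picks the IPv6 attempt. With IPv6 black-holed, `asyncio.run(happy_eyeballs(None, 0.02))` still succeeds over IPv4, only later by roughly the attempt delay, which matches the "slight latency increase" described above.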
Documentation:
- Original maintenance/incident task, T295118
Actionables
- Icinga check for ipv6 host reachability, T163996
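As a rough idea of what an IPv6 reachability check could look like, the sketch below follows the standard Nagios/Icinga plugin convention of exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). It is not the check tracked in T163996; the function names, the TCP-connect approach, and the default port and timeout are assumptions chosen for illustration.

```python
import socket

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def tcp6_probe(host: str, port: int, timeout: float) -> None:
    """Attempt a TCP connect over IPv6 only (hypothetical helper).

    Raises socket.gaierror if the name has no IPv6 address and
    OSError (including timeouts) if the connect fails.
    """
    infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    sockaddr = infos[0][4]
    with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)


def check_ipv6(host: str, port: int = 443, timeout: float = 5.0,
               probe=tcp6_probe):
    """Return (exit_code, message) for an IPv6-only reachability probe."""
    try:
        probe(host, port, timeout)
    except socket.gaierror as exc:
        # gaierror is a subclass of OSError, so it must be handled first.
        return UNKNOWN, f"UNKNOWN - cannot resolve IPv6 address for {host}: {exc}"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host} port {port} unreachable over IPv6: {exc}"
    return OK, f"OK - {host} port {port} reachable over IPv6"
```

A plugin wrapper would print the message and exit with the returned code. Because the probe forces `AF_INET6`, a host that still answers over IPv4 would not mask an IPv6-only failure like the one in this incident. The `probe` parameter exists only so the logic can be exercised without network access.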