
Incident documentation/2021-11-18 codfw ipv6 network: Difference between revisions

{{irdoc|status=review}} <!--
The status field should be one of:
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review.
-->


==Summary==
After preemptively replacing one of the codfw row B spine switches (asw-b7-codfw) due to signs of disk failure, the new switch silently discarded IPv6 traffic (both traffic transiting the switch and traffic to the switch itself).


As this switch was a spine, ~50% of the traffic toward that row (from cr2) was transiting through it.
 
Row B was, at the time, the row hosting the load balancer in front of upload-lb.codfw, so this was the most visible impact.
 
Monitoring alerted, and the interface between asw-b7-codfw and cr2-codfw was disabled, forcing traffic through the cr1<->asw-b2-codfw link and resolving the upload-lb issue.
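For reference, disabling a link like this on a Juniper switch is a one-line configuration change. A sketch of the CLI session (the interface name is a placeholder; the actual cr2-facing port is not recorded in this document):

```
configure
set interfaces et-0/0/48 disable    # placeholder port name
commit and-quit
```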
 
Replacing the switch didn't solve the underlying IPv6 issue, showing that it was not a hardware fault. Forcing a virtual-chassis mastership failover resolved what we believe was a Junos (the switch operating system) bug.
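The mastership failover can be forced from the Junos CLI. On an EX-series virtual chassis (assuming a backup routing engine is present and ready), the command is:

```
request chassis routing-engine master switch
```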
 
Note that at the time of the issue our Juniper support contract had expired, preventing us from opening a JTAC case.
 
 
'''Impact''': Thanks to [[:en:Happy_Eyeballs|Happy Eyeballs]] there was no visible user impact (or, at worst, a slight latency increase).
 
For 8 minutes, we had a partial loss of IPv6 connectivity (which affects a subset of Internet providers) in the [[Codfw cluster]] (which serves a subset of regions) for upload.wikimedia.org.<!-- Reminder: No private information on this page! -->
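The reason clients shrugged off the IPv6 loss: Happy Eyeballs (RFC 8305) has dual-stack clients race IPv6 and IPv4 connection attempts and use whichever succeeds first. A minimal toy model of that fallback logic (sequential rather than the real staggered concurrent attempts; addresses are RFC 3849/5737 documentation prefixes, not real WMF IPs):

```python
import socket

def pick_address(candidates, reachable):
    """Toy model of Happy Eyeballs fallback: candidates are
    (family, address) pairs sorted IPv6-first (per RFC 6724);
    the first address that is actually reachable wins. Real
    implementations race staggered concurrent connection
    attempts instead of probing one at a time."""
    for family, addr in candidates:
        if addr in reachable:
            return family, addr
    return None

# With IPv6 silently blackholed (as in this incident), clients
# quietly fall back to the IPv4 address.
candidates = [
    (socket.AF_INET6, "2001:db8::1"),  # tried first, unreachable
    (socket.AF_INET, "192.0.2.1"),     # working fallback
]
assert pick_address(candidates, {"192.0.2.1"}) == (socket.AF_INET, "192.0.2.1")
```

CPython exposes the real mechanism directly: `asyncio.open_connection(..., happy_eyeballs_delay=0.25)` (Python 3.8+) staggers the per-family connection attempts.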


'''Documentation''':
*Original maintenance/incident task, [[phab:T295118|T295118]]


==Actionables==


* Icinga check for ipv6 host reachability, [[phab:T163996|T163996]]

Revision as of 14:25, 2 December 2021
