As part of switching primary Labs DNS service from holmium to labservices1001, Andrew caused an outage of labs-recursor0 and labs-recursor1, thus breaking resolution of internal Labs IPs from about 05:22 to about 05:31.
- [previously] Andrew is moving Labs designate and DNS services from holmium to labservices1001, in pursuit of the ultimate renaming of holmium.
- [05:07] Andrew merges a patch which is meant, via hiera, to exchange the assignments of labs-recursor0 and labs-recursor1. This is consistent with the 'primary service on labservices1001' goal, but ignores the fact that labs-recursor0's IP is routed to the rack containing holmium and /not/ routed to the rack containing labservices1001, and likewise for the IP for labs-recursor1. So, this patch should have broken DNS immediately. It did not, due to a typo in the patch which accidentally assigned the IP for labs-recursor0 to both holmium and labservices1001. Consequently labs-recursor0 still works and labs-recursor1 does not.
- [05:17] Andrew merges a second patch which corrects the typo. At this point labs-recursor0 is assigned to labservices1001 and labs-recursor1 to holmium, both unroutable.
- [05:22] The above patch is applied, and the first diamond alerts start showing up about DNS resolution failure.
- [05:30] Andrew realizes the source of the problem and submits a patch returning the IPs to their original homes.
- [05:32] The above patch is merged, and normal service is restored.
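
The failure in the timeline above comes down to a service IP being assigned to a host in a rack that the IP is not routed to. A minimal sketch of a pre-merge sanity check for that condition follows. Everything in it is an assumption for illustration: the IP addresses are documentation placeholders, the rack names are invented, the function `check_assignments` is hypothetical, and the routing table encodes one reading of the timeline (labs-recursor0's IP routed to holmium's rack, labs-recursor1's IP routed to labservices1001's rack). It is not the actual hiera data or any existing tooling.

```python
# Hypothetical data: placeholder IPs and invented rack names, not real Labs values.
RECURSOR0_IP = "192.0.2.10"  # stands in for the labs-recursor0 address
RECURSOR1_IP = "192.0.2.11"  # stands in for the labs-recursor1 address

# Service IP -> rack that the IP is routed to (assumed from the timeline).
ROUTED_RACK = {RECURSOR0_IP: "rack-a", RECURSOR1_IP: "rack-b"}

# Host -> rack that physically contains it (rack names are invented).
HOST_RACK = {"holmium": "rack-a", "labservices1001": "rack-b"}


def check_assignments(assignments):
    """Return a list of problems for a host -> service IP mapping."""
    problems = []
    owners = {}
    for host, ip in sorted(assignments.items()):
        owners.setdefault(ip, []).append(host)
        # An IP only works on a host that sits in the rack the IP is routed to.
        if ROUTED_RACK.get(ip) != HOST_RACK.get(host):
            problems.append(f"{ip} assigned to {host} but not routed to its rack")
    for ip, hosts in sorted(owners.items()):
        # The same IP on two hosts is the typo from the first patch.
        if len(hosts) > 1:
            problems.append(f"{ip} assigned to multiple hosts: {', '.join(hosts)}")
    for ip in sorted(ROUTED_RACK):
        # A service IP that no host carries means that recursor is down.
        if ip not in owners:
            problems.append(f"{ip} is not assigned to any host")
    return problems


# The three states described in the timeline.
original = {"holmium": RECURSOR0_IP, "labservices1001": RECURSOR1_IP}
first_patch = {"holmium": RECURSOR0_IP, "labservices1001": RECURSOR0_IP}
second_patch = {"holmium": RECURSOR1_IP, "labservices1001": RECURSOR0_IP}

for name, state in [("original", original),
                    ("first patch", first_patch),
                    ("second patch", second_patch)]:
    print(name, check_assignments(state))
```

Under these assumptions the original assignment yields no problems, the first patch is flagged for the duplicated and unrouted labs-recursor0 IP and the unassigned labs-recursor1 IP, and the second patch is flagged for two unroutable assignments.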