
Incident documentation/2021-11-02 Cloud VPS networking

Revision as of 18:08, 10 January 2022 by Krinkle

document status: in-review

Summary

After a kernel upgrade on several Cloud VPS network components (cloudnet and cloudgw servers; see T291813), we found problems with Toolforge NFS in Kubernetes. LDAP connections were later found to be affected as well. Eventually the problem turned out to affect all ingress traffic to the network edge for cloud VMs, except VMs with floating IPs, which were unaffected. The issue was resolved by rolling back the kernel upgrade.

Impact: For about 1 hour and 40 minutes, Toolforge services and VMs in Cloud VPS may have experienced connectivity issues.

Actionables

  • Improve automated testing and monitoring of cloud networking, T294955
  • Set up static route for cr-codfw, T295288
  • Avoid keepalived flaps when rebooting servers, T294956