Incident documentation/20161221-ToollabsKubernetesAPI

imported>Alexandros Kosiaris
''A non-user-affecting Kubernetes API outage lasting about 1 hour''
== Summary ==
''Kubernetes was down for about 1 hour due to a faulty puppet patch, a deployment script with little error handling, and human error''
== Timeline ==
* Dec 21 2016 ~12:10 UTC Alex merges the first (kube-apiserver) of the 3 patches, and' on tools-puppetmaster
* Dec 21 2016 ~12:15 UTC Alex merges the kube-scheduler one after the kube-apiserver one is successful
* Dec 21 2016 ~12:16 UTC Alex notes kube-scheduler failing to start with an error saying --leader-elect=true is not a valid parameter. He realizes something has gone wrong version-wise and finds that /usr/local/bin/kube-scheduler is now a symlink to /usr/bin/kube-scheduler. That is a mistake in the patch; it should have been the other way around. /usr/bin/kube-scheduler is a leftover on tools-k8s-master-01 from a previous, unrelated deployment method.
* Dec 21 2016 12:20 UTC Alex runs /usr/local/bin/deploy-master to restore the correct version, since that is the deployment method originally used for that binary. The script has a bug and truncates /usr/local/bin/kube-apiserver, but that goes unnoticed for a while.
* Dec 21 2016 12:21 UTC Alex decides to roll back the kube-apiserver puppet patch as well. That turns out to be a mistake: puppet restarts the service, and the now-empty binary fails to start. Both services are now down.
* Dec 21 2016 12:25 UTC Alerting in #wikimedia-labs; Alex continues the investigation
* Dec 21 2016 12:26 UTC Yuvi offers help
* Dec 21 2016 12:30 UTC After some discussion on IRC, Yuvi kicks off a new kubernetes build to get a working binary
* Dec 21 2016 ~12:35 UTC Alex realizes that puppet clientbucket can save some work and gets kube-scheduler restored. kube-apiserver is still missing, though.
* Dec 21 2016 ~12:45 UTC Since the API is down anyway, Alex tries the 3rd patch while waiting for the build to finish, just to see if it works (it did!).
* Dec 21 2016 13:03 UTC Yuvi's build is partially done. It failed due to a disk space issue, but it at least generated the kube-apiserver, kube-scheduler and kube-controller-manager binaries, though not kubectl.
* Dec 21 2016 ~13:05 UTC Yuvi runs deploy-master, which deploys the above binaries and truncates kubectl
* Dec 21 2016 13:09 UTC Alex sees everything running now on tools-puppetmaster
* Dec 21 2016 13:10 UTC Yuvi notes kubectl not returning anything. Some confusion ensues.
* Dec 21 2016 13:16 UTC The truncated binary is noticed and a copy is brought in from a worker.
* Dec 21 2016 13:21 UTC Everything is up and running
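Two of the failures in the timeline (a symlink pointing the wrong way and zero-length binaries) are cheap to detect before a service is restarted. The following is a minimal sketch of such a pre-flight check, not something that was in use at the time of the incident. The function name <code>check_binary</code> and its parameters are hypothetical.

```python
import os


def check_binary(path, expected_target=None):
    """Return a list of human-readable problems with the binary at `path`.

    `expected_target` is optional: when given, `path` must be a symlink
    that resolves to the same file as `expected_target`.
    """
    problems = []

    # os.path.exists() follows symlinks, so a dangling link also lands here.
    if not os.path.exists(path):
        problems.append(f"{path}: missing or dangling symlink")
        return problems

    if expected_target is not None:
        if not os.path.islink(path):
            problems.append(f"{path}: expected a symlink to {expected_target}")
        elif os.path.realpath(path) != os.path.realpath(expected_target):
            problems.append(
                f"{path}: points to {os.path.realpath(path)}, "
                f"expected {expected_target}"
            )

    # A truncated binary shows up as a zero-length file.
    if os.path.getsize(path) == 0:
        problems.append(f"{path}: file is empty (truncated?)")

    if not os.access(path, os.X_OK):
        problems.append(f"{path}: not executable")

    return problems
```

Calling <code>check_binary("/usr/local/bin/kube-apiserver")</code> before restarting the service would return a non-empty list for a truncated binary, and passing <code>expected_target</code> would flag a symlink pointing in the wrong direction.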
== Conclusions ==
* Half-baked deployment scripts are land mines waiting to be stepped on. That specific one is going away.
* More communication before deployment of puppet changes can be useful
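To illustrate the first conclusion: the report does not say how deploy-master copied binaries, only that it truncated the destination when the new binary was unavailable. One common way to avoid that failure mode is to validate the source first, write to a temporary file in the destination directory, and swap it in with an atomic rename only after the copy has been verified. The sketch below assumes a local source file, and <code>install_binary</code> is a hypothetical name, not part of any existing script.

```python
import os
import shutil
import tempfile


def install_binary(src, dest, mode=0o755):
    """Copy `src` to `dest` without ever leaving `dest` truncated.

    If anything fails, the existing `dest` is left untouched.
    """
    # Refuse to deploy a missing or empty source (e.g. from a failed build).
    if not os.path.isfile(src) or os.path.getsize(src) == 0:
        raise ValueError(f"refusing to deploy missing or empty source: {src}")

    # The temp file must live on the same filesystem as dest so that
    # os.replace() is an atomic rename.
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix=".deploy-")
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())

        # Catch short copies (for example, a full disk) before swapping.
        if os.path.getsize(tmp) != os.path.getsize(src):
            raise OSError(f"short copy while deploying {src}")

        os.chmod(tmp, mode)
        os.replace(tmp, dest)  # the old binary stays in place until here
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Note that <code>os.replace</code> replaces the directory entry at <code>dest</code>, so if <code>dest</code> is a symlink, the link itself is replaced rather than its target.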
== Actionables ==
* Have Labs use the Debian packages for Kubernetes
[[Category:Incident documentation]]

Latest revision as of 17:45, 8 April 2022