{{Kubernetes nav}}
:''For information about Kubernetes in the Toolforge environment, see [[Help:Toolforge/Kubernetes]].''
'''[[w:Kubernetes|Kubernetes]]''' (often abbreviated '''k8s''') is an open-source system for automating the deployment and management of applications running in [[W:Operating-system-level virtualization|containers]]. This page collects notes and docs on the Kubernetes setup in the Foundation production environment.
== Clusters ==
The list of currently maintained clusters in WMF, split by realm and team, is at [[Kubernetes/Clusters]].


== Packages ==
We deploy Kubernetes in WMF production using Debian packages where appropriate. There is an upgrade policy in place defining the timeframe and versions we run at any point in time; see [[Kubernetes/Kubernetes Infrastructure upgrade policy]]. For more technical information on how we build the Debian packages, have a look at [[Kubernetes/Packages]].

== Images ==
For how our images are built and maintained, have a look at [[Kubernetes/Images]].

== Services ==
A service in Kubernetes is an "abstract way to expose an application running on a set of workloads as a network service".
* Learn more about [[Deployment pipeline/Migration/Tutorial|Migrating a service to Kubernetes]] and the [[Deployment pipeline]] generally.
* [[Deployments on kubernetes|Deployments on Kubernetes]]


== Debugging ==
For a quick intro to the debugging actions one can take during a problem in production, look at [[Kubernetes/Helm]]. There will also be a guide posted under [[Kubernetes/Kubectl]].
== Administration ==
=== Create a new cluster ===
Documentation for creating a new cluster is in [[Kubernetes/Clusters/New]].
=== Add a new service ===
To add a new service named '''service-foo''' to the clusters of the '''main''' group:

#Ensure the service has its ports registered at [[Service ports]].
#Create deployment users/tokens in the puppet private repo (you can use a randomly generated 22-character [A-Za-z0-9] password) and in the public repo. You need to edit the <code>hieradata/common/profile/kubernetes.yaml</code> file in the private repository, specifically the <code>profile::kubernetes::infrastructure_users</code> key, as in the example below:<syntaxhighlight lang="yaml">
profile::kubernetes::infrastructure_users:
    main:
        client-infrastructure:
            token: <REDACTED>
            groups: [system:masters]
...
+       service-foo:
+            token: <YOUR_TOKEN>
</syntaxhighlight>You also need to tell the deployment server how to set up the kubeconfig files, which is done by modifying the <code>profile::kubernetes::deployment_server::services</code> hiera key (<code>hieradata/common/profile/kubernetes/deployment_server.yaml</code>) as in the example below:<syntaxhighlight lang="yaml">
profile::kubernetes::deployment_server::services:
  main:
    mathoid:
      usernames:
        - name: mathoid
...
+    service-foo:
+        usernames:
+           - name: service-foo
+              owner: mwdeploy
+              group: wikidev
+              mode: "0640"
</syntaxhighlight>Please note that the owner/group/mode here refer to the file permissions of your kubeconfig file ("/etc/kubernetes/service-foo-<cluster_name>.config"), determining which users/groups will be able to use this kubeconfig. Typically for normal service users you don't need to define them, as the defaults are correct.
#Ask Service Ops for the private data for your service. This is done by adding an entry for service-foo under <code>profile::kubernetes::deployment_server_secrets::services</code> in the private repository (<code>role/common/deployment_server.yaml</code>). Secrets will most likely be needed for all clusters, including staging.
#Add a Kubernetes namespace. Example commit:
#* '''Kubernetes namespace:''' deployment-charts https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/693124
#At this point, you can safely merge the changes (after '''somebody from Service Ops validates''' them). After merging, it is important to run the commands in the next step, to avoid impacting other people rolling out changes later on.
#Set up the service in the staging-codfw cluster (and then in the other clusters):
'''On a cumin server:'''
 sudo cumin -b 4 -s 2 kubemaster* 'run-puppet-agent'
'''On deploy1002:'''
 sudo run-puppet-agent
 sudo -i
 cd /srv/deployment-charts/helmfile.d/admin_ng/
 helmfile -e staging-codfw -i apply
The helmfile command above should show you a diff in namespaces/quotas/etc. related to your new service. If you don't see a diff (or you see unrelated changes), ping somebody from the Service Ops team! Check that everything is ok:
  kube_env $YOUR-SERVICE-NAME staging-codfw
  kubectl get ns
  kubectl get pods
You should be able to see info about your namespace, and <code>kubectl get pods</code> should show a tiller pod.
'''Repeat for the staging-eqiad, eqiad and codfw clusters even if you aren't ready to fully deploy your service. Leaving undeployed things will impede further operations by other people.'''
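A minimal sketch of that repetition (assuming the same <code>admin_ng</code> helmfile layout works for all four clusters; review each diff interactively before confirming):
<syntaxhighlight lang="bash">
# On deploy1002, as root: apply the admin_ng helmfile to the remaining clusters.
cd /srv/deployment-charts/helmfile.d/admin_ng/
for cluster in staging-eqiad eqiad codfw; do
    helmfile -e "${cluster}" -i apply
done
</syntaxhighlight>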
 
==== Deploy a service to staging ====
At this point you should have a chart for your service (TODO: link to docs?), and will need to set up a <code>helmfile.d/services</code> directory in the {{Gitweb|project=operations/deployment-charts}} repository for the deployment. You can copy the structure (helmfile.yaml, values.yaml, values-staging.yaml, etc.) from {{Gitweb|project=operations/deployment-charts|file=helmfile.d/services/_example_}} and customize as needed.
 
You can now deploy the new service to staging for real. Don't worry about TLS (if needed): in staging, a default TLS configuration is added for your service automatically. Things are slightly different for production (see below).
 
'''On deploy1002:'''
  cd /srv/deployment-charts/helmfile.d/services/service-foo
  helmfile -e staging -i apply
The command above will show a diff related to the new service; make sure that everything looks fine, then answer yes to proceed.
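After the apply, a quick sanity check that the release actually came up might look like this (a sketch; assumes <code>kube_env</code> accepts <code>staging</code> as the environment name and points kubectl at the service's namespace):
<syntaxhighlight lang="bash">
# Use the service's credentials for the staging cluster and list its pods.
kube_env service-foo staging
kubectl get pods
</syntaxhighlight>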
 
==== Testing a service ====
Now we can test the service in staging. Use the handy endpoint <code>http(s)://staging.svc.eqiad.wmnet:$YOUR-SERVICE-PORT</code> to quickly check that everything works as expected.
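For example (a sketch; <code>4242</code> and <code>/healthz</code> are placeholders for your registered port and a path your service actually serves):
<syntaxhighlight lang="bash">
# Smoke test the staging endpoint (use -k if the staging certificate is not
# trusted locally).
curl -kv "https://staging.svc.eqiad.wmnet:4242/healthz"
</syntaxhighlight>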
==== Deploy a service to production ====
#Create certificates for the new service, if it has an HTTPS endpoint (remember that this step is handled automatically for staging, but not for production).
#[[Kubernetes/Enabling TLS|Enable TLS for Kubernetes deployments]]
#At this point, you need to update the admin config for eqiad and codfw (if you have configs for both of course):
#*On deploy1002: <code>sudo -i; cd /srv/deployment-charts/helmfile.d/admin/codfw/; kube_env admin codfw; ./cluster-helmfile.sh -i apply</code>
#*On deploy1002: <code>sudo -i; cd /srv/deployment-charts/helmfile.d/admin/eqiad/; kube_env admin eqiad; ./cluster-helmfile.sh -i apply</code>
#Then the final step, namely deploying the new service:
#*On deploy1002: <code>cd /srv/deployment-charts/helmfile.d/services/service-foo; helmfile -e codfw -i apply</code>
#*On deploy1002: <code>cd /srv/deployment-charts/helmfile.d/services/service-foo; helmfile -e eqiad -i apply</code>
The service can now be accessed via the registered port on any of the kubernetes nodes (for manual testing).
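For instance (a sketch; node name, port and path are placeholders, the port being the one registered in [[Service ports]]):
<syntaxhighlight lang="bash">
# Manual test against one worker; any kubernetes node should answer on the
# registered service port.
curl -kv "https://kubernetes1001.eqiad.wmnet:4242/healthz"
</syntaxhighlight>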


If you need the service to be easily accessible from outside of the cluster, you might also want to [[Add a new load balanced service|add a new load balanced service]].

=== Rebooting a worker node ===
==== The unpolite way ====
To reboot a worker node, you can just reboot it in our environment. The platform will understand the event and respawn the pods on other nodes. However, the system does not automatically rebalance itself afterwards: pods are not rescheduled onto the node after it has been rebooted.

==== The polite way (recommended) ====
If you feel like being more polite, use <code>kubectl drain</code>: it configures the worker node to no longer accept new pods and moves the existing pods to other workers. Draining the node will take time; rough numbers on 2019-12-11 were around 60 seconds.
<syntaxhighlight lang="shell-session">
# kubectl drain --ignore-daemonsets kubernetes1001.eqiad.wmnet
# kubectl describe pods --all-namespaces | awk '$1=="Node:" {print $NF}' | sort -u
kubernetes1002.eqiad.wmnet/10.64.16.75
kubernetes1003.eqiad.wmnet/10.64.32.23
kubernetes1004.eqiad.wmnet/10.64.48.52
kubernetes1005.eqiad.wmnet/10.64.0.145
kubernetes1006.eqiad.wmnet/10.64.32.18
# kubectl get nodes
NAME                         STATUS                     ROLES     AGE       VERSION
kubernetes1001.eqiad.wmnet   Ready,SchedulingDisabled   <none>    2y352d    v1.12.9
kubernetes1002.eqiad.wmnet   Ready                      <none>    2y352d    v1.12.9
kubernetes1003.eqiad.wmnet   Ready                      <none>    2y352d    v1.12.9
kubernetes1004.eqiad.wmnet   Ready                      <none>    559d      v1.12.9
kubernetes1005.eqiad.wmnet   Ready                      <none>    231d      v1.12.9
kubernetes1006.eqiad.wmnet   Ready                      <none>    231d      v1.12.9
</syntaxhighlight>
When the node has been rebooted, it can be configured to accept pods again using <code>kubectl uncordon</code>, e.g. <code>kubectl uncordon kubernetes1001.eqiad.wmnet</code>.
The pods are not rebalanced automatically, i.e. the rebooted node is initially free of pods.
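To see how the pods are currently spread across workers, a variation of the pipeline above counts pods per node (a sketch):
<syntaxhighlight lang="bash">
# A freshly rebooted/uncordoned node will show up with few or no pods.
kubectl describe pods --all-namespaces | awk '$1=="Node:" {print $NF}' | sort | uniq -c
</syntaxhighlight>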
=== Restarting calico-node ===
calico-node maintains a BGP session with the core routers. If you intend to restart this service, use the following procedure:
# Drain the node with <code>kubectl drain</code>, as shown above.
# Run <code>systemctl restart calico-node</code> on the Kubernetes worker.
# Wait for the BGP sessions on the core routers to re-establish.
# Uncordon the node with <code>kubectl uncordon</code>, as shown above.
You can use the following command on the core routers to check BGP status (use <code>match 64602</code> for codfw):
<syntaxhighlight lang="shell-session">
# show bgp summary | match 64601     
10.64.0.121          64601        220        208      0      2      32:13 Establ
10.64.0.145          64601    824512    795240      0      1 12w1d 21:45:51 Establ
10.64.16.75          64601        161        152      0      2      23:25 Establ
10.64.32.18          64601    824596    795247      0      2 12w1d 21:46:45 Establ
10.64.32.23          64601        130        123      0      2      18:59 Establ
10.64.48.52          64601    782006    754152      0      3 11w4d 11:13:52 Establ
2620:0:861:101:10:64:0:121      64601        217        208      0      2      32:12 Establ
2620:0:861:101:10:64:0:145      64601    824472    795240      0      1 12w1d 21:45:51 Establ
2620:0:861:102:10:64:16:75      64601        160        152      0      2      23:25 Establ
2620:0:861:103:10:64:32:18      64601    824527    795246      0      1 12w1d 21:46:45 Establ
2620:0:861:103:10:64:32:23      64601        130        123      0      2      18:59 Establ
2620:0:861:107:10:64:48:52      64601    782077    754154      0      2 11w4d 11:14:13 Establ
</syntaxhighlight>
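A condensed sketch of the procedure above for a single worker (the hostname is an example; run the kubectl steps from a host with admin credentials and the restart on the worker itself):
<syntaxhighlight lang="bash">
# 1. Drain the worker.
kubectl drain --ignore-daemonsets kubernetes1001.eqiad.wmnet
# 2. On the worker: restart calico-node.
sudo systemctl restart calico-node
# 3. After the BGP sessions on the core routers are re-established
#    ("show bgp summary | match 64601" as above), uncordon the worker.
kubectl uncordon kubernetes1001.eqiad.wmnet
</syntaxhighlight>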


=== Restarting specific components ===
kube-controller-manager and kube-scheduler are components of the API server. In production, multiple instances run and perform an election via the API to determine which one is the master. Restarting both is without grave consequences, so it is safe to do. However, both are critical components required for the overall cluster to function smoothly: kube-scheduler is crucial for node failovers, pod evictions, etc., while kube-controller-manager packs multiple controller components and is critical for responding to pod failures, depools, etc. The commands are:
 sudo systemctl restart kube-controller-manager
 sudo systemctl restart kube-scheduler

=== Restarting the API server ===
The API server is behind LVS in production, so it is fine to restart it as long as enough time is given between restarts across the cluster.
 sudo systemctl restart kube-apiserver


===Reinitialize a complete cluster===
If, for whatever reason, we need to reinitialize a Kubernetes cluster on a new etcd backing store, the following steps can be used as a guideline. They might also help in understanding how the clusters are set up and how to set up new ones.
#Create puppet change, pointing k8s (and calico) to the new etcd cluster, see:
##{{Gerrit|558355}} and {{Gerrit|558473}}
#Populate IPPool and BGP nodes in the new calico etcd backend
##On a random node of the kubernetes cluster:<syntaxhighlight lang="bash">
cp /etc/calico/calicoctl.cfg .
# Modify the etcdEndpoints config in ./calicoctl.cfg to point to new etcd
# Set asNumber (64601 for eqiad, 64603 for codfw)
calicoctl config set asNumber 6460X --config=calicoctl.cfg
calicoctl config set nodeToNodeMesh off --config=calicoctl.cfg
# FIXME: This assumes we still have access to the old etcd to read bgppeer
#        and ippool data from.
calicoctl get -o yaml bgppeer | calicoctl create -f - --config=calicoctl.cfg
calicoctl get -o yaml ippool | calicoctl create -f - --config=calicoctl.cfg
# Create a basic default profile for the kube-system namespace in order to
# allow kube-system tiller to talk to the kubernetes API to deploy the
# calico-policy-controller (avoid catch-22).
#
# When the calico-policy-controller is started, it will sync things and this
# simple profile will be updated and set up correctly.
calicoctl create -f - --config=calicoctl.cfg <<_EOF_
- apiVersion: v1
  kind: profile
  metadata:
    name: k8s_ns.kube-system
    tags:
    - k8s_ns.kube-system
  spec:
    egress:
    - action: allow
      destination: {}
      source: {}
    ingress:
    - action: allow
      destination: {}
      source: {}
_EOF_
</syntaxhighlight>
#Schedule downtime for
##services running on the cluster
##kubernetes nodes and master<syntaxhighlight lang="bash">
sudo cookbook sre.hosts.downtime -r 'Reinitialize eqiad k8s cluster with new etcd' -t TXXX -H 4 'A:eqiad and (A:kubernetes-masters or A:kubernetes-workers)'
</syntaxhighlight>
#Depool services from discovery/edge caches
#Delete all helmfile managed namespaces (to be sure we see errors/missing things early)
#Disable puppet on master and k8s nodes<syntaxhighlight lang="bash">
sudo cumin 'A:eqiad and (A:kubernetes-masters or A:kubernetes-workers)' "disable-puppet 'Reinitialize eqiad k8s cluster with new etcd - TXXXX'"
</syntaxhighlight>
#Stop apiserver and calico node on k8s nodes
#Merge puppet changes
#Enable and run puppet on the k8s nodes
#Enable puppet on 1 apiserver and run it
#Disable puppet on apiserver again
#Edit <code>/etc/default/kube-apiserver</code> to disable PodSecurityPolicy controller
#Start API server (running without PodSecurityPolicy controller now)
#Run <code>deployment-charts/helmfile.d/admin/initialize_cluster.sh</code> for the cluster
#Restart kubelet on all kubernetes nodes<syntaxhighlight lang="bash">
sudo cumin 'A:eqiad and A:kubernetes-workers' 'systemctl restart kubelet'
</syntaxhighlight>
#Enable puppet on kubernetes masters again and run it. This will restart API server with PodSecurityPolicy controller
#Run <code>helmfile.d/admin/eqiad/cluster-helmfile.sh</code>
#Deploy all services via a for loop and <code>helmfile sync</code> commands (see the sketch below)
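A sketch of that final loop, assuming every directory under <code>helmfile.d/services/</code> (except the <code>_example_</code> template) should be synced to the reinitialized cluster, here eqiad:
<syntaxhighlight lang="bash">
# On the deployment server: sync all service helmfiles into the cluster.
cd /srv/deployment-charts/helmfile.d/services/
for svc in */; do
    [ "${svc}" = "_example_/" ] && continue  # skip the template directory
    (cd "${svc}" && helmfile -e eqiad sync)
done
</syntaxhighlight>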


=== Switch the active staging cluster (eqiad<->codfw) ===
We have one staging cluster per DC, mostly to separate staging of Kubernetes and its components from staging of the services running on top of it. To provide staging services during work on one of the clusters, we can (manually) switch between the DCs.
*Make sure all service deployments are up to date after the switch (e.g. deploy them all).
=== Managing pods, jobs and cronjobs ===
Commands should be run from the [[Deployment_server|deployment servers]] (at the time of this writing [[deploy1002]]).
You need to set the correct context, for example:
kube_env admin eqiad
Other choices are codfw, staging-eqiad and staging-codfw.
The management command is called [[kubectl]].
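For example, to switch to the staging-codfw cluster and confirm the context works (a sketch):
<syntaxhighlight lang="bash">
# Select the admin credentials for staging-codfw, then list namespaces as a
# quick sanity check.
kube_env admin staging-codfw
kubectl get ns
</syntaxhighlight>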
==== Listing cronjobs, jobs and pods ====
kubectl get cronjobs -n <namespace>
kubectl get jobs -n <namespace>
kubectl get pods -n <namespace>
==== Deleting a job ====
kubectl delete job <job id>
==== Updating the docker image run by a CronJob ====
The relationship between the resources is the following:
Cronjob --spawns--> Job(s) --spawns--> Pod(s)
Note: Technically speaking, it's a tight control loop that lives in kube-controller-manager that does the spawning part, but adding that to the above would make this more confusing.
Under normal conditions, a Docker image version will be updated when a new deploy happens, and the CronJob will have the new version. However, jobs already created by the CronJob will not be stopped until they have run to completion.
When the job finishes, the cronjob will create new job(s), which in turn will create new pod(s).
Depending on the CronJob's schedule and the job's run time, there might be a window where, despite the new deployment, the old job is still running.
Deleting the kubernetes pod created by the job itself will NOT work, i.e. the job will still exist and it will create a new pod (which will still have the old image).
So, if we are dealing with a long-running Kubernetes Job, one can get the same effect by deleting the Job created by the CronJob.
[[phab:T280076]] is an example where this was needed.
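A sketch of that sequence (namespace and job name are placeholders):
<syntaxhighlight lang="bash">
# Find the job the CronJob spawned, then delete it; the CronJob will create a
# fresh job (and pods) with the newly deployed image on its next scheduled run.
kubectl get jobs -n my-namespace
kubectl delete job my-cronjob-27489600 -n my-namespace
</syntaxhighlight>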
==== Checking which image version a cronjob is using ====
kubectl describe pod <pod in question>
(look for Image:)
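Alternatively, a jsonpath query prints just the image without the full describe output (a sketch; pod and namespace names are placeholders):
<syntaxhighlight lang="bash">
# Show the container image(s) a pod is running.
kubectl get pod my-cronjob-27489600-abcde -n my-namespace \
    -o jsonpath='{.spec.containers[*].image}'
</syntaxhighlight>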


== See also ==
* [[Kubernetes/Clusters/New|Adding a new Kubernetes cluster]]
* [[Deployments on kubernetes|Deployments on Kubernetes]]
* [[Kubernetes/Kubernetes Education|Kubernetes Education]]


== Toolforge Info ==
For information about Kubernetes in the Toolforge environment, see [[Help:Toolforge/Kubernetes]].

[[Category:Kubernetes]]
[[Category:SRE Service Operations]]
