{{Kubernetes nav}}
:''For information about Kubernetes in the Toolforge environment see [[Help:Toolforge/Kubernetes]].''
'''[[w:Kubernetes|Kubernetes]]''' (often abbreviated '''k8s''') is an open-source system for automating the deployment and management of applications running in [[W:Operating-system-level virtualization|containers]]. This page collects some notes and docs on the Kubernetes setup in the Foundation production environment.
== Clusters ==
The list of currently maintained clusters in WMF, split by realm and team, is at [[Kubernetes/Clusters]].
== Packages ==
We deploy Kubernetes in WMF production using Debian packages where appropriate. There is an upgrade policy in place defining the timeframe and versions we run at any point in time; it is documented at [[Kubernetes/Kubernetes_Infrastructure_upgrade_policy]]. For more technical information on how we build the Debian packages, have a look at [[Kubernetes/Packages]].
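
If you want to check which Kubernetes-related Debian packages (and versions) a given host is running, a generic query like the following is enough; this is just an illustrative check, not a WMF-specific tool:
<syntaxhighlight lang="bash">
# On a worker or control-plane host: list installed Debian packages whose
# name mentions kube or calico, together with the installed versions.
dpkg -l | grep -Ei 'kube|calico'
</syntaxhighlight>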


== Images ==
For how our images are built and maintained, have a look at [[Kubernetes/Images]].
== Services ==


A service in Kubernetes is an ''"abstract way to expose an application running on a set of workloads as a network service"''. That overloads the term, as we also use ''"services"'' to describe how our various in-house developed applications are exposed to the rest of the infrastructure or to the public. It's worthwhile to make sure everyone in a conversation about ''"services"'' means the same thing. Below are some links to basic documentation about both concepts to help differentiate between them.


* https://kubernetes.io/docs/concepts/services-networking/service/


* Learn more about [[Deployment pipeline/Migration/Tutorial | Migrating a service to kubernetes]] and the [[Deployment pipeline]] in general.
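
To see the Kubernetes meaning of the term in practice, you can list the Service objects in a namespace and inspect which pod endpoints back one of them; the namespace and Service names below are placeholders:
<syntaxhighlight lang="bash">
# List the Kubernetes Service objects in a namespace.
kubectl get svc -n <namespace>

# Show the ClusterIP, ports and the pod endpoints behind a single Service.
kubectl describe svc <service-name> -n <namespace>
</syntaxhighlight>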


== Debugging ==
For a quick intro into the debugging actions one can take during a problem in production, have a look at [[Kubernetes/Helm]]. There will also be a guide posted under [[Kubernetes/Kubectl]].
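
Until that guide exists, a minimal generic starting point with plain kubectl (names are placeholders) is:
<syntaxhighlight lang="bash">
# What is (or is not) running in the service's namespace?
kubectl get pods -n <namespace>

# Why is a particular pod unhappy? Shows events, restarts, image and node.
kubectl describe pod <pod-name> -n <namespace>

# Application logs of one container in that pod.
kubectl logs <pod-name> -c <container-name> -n <namespace>
</syntaxhighlight>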


== Administration ==
=== Create a new cluster ===
Documentation for creating a new cluster is in [[Kubernetes/Clusters/New]].
=== Add a new service ===
Documentation on how to deploy a new service can be found at [[Kubernetes/Add_a_new_service]].


=== Rebooting a worker node ===
==== The unpolite way ====
To reboot a worker node, you can just reboot it in our environment. The platform will understand the event and respawn the pods on other nodes. However, the system does not automatically rebalance itself currently (pods are not rescheduled on the node after it has been rebooted).

==== The polite way (recommended) ====
If you feel like being more polite, use <code>kubectl drain</code>; it will configure the worker node to no longer create new pods and move the existing pods to other workers. Draining the node will take time. Rough numbers on 2019-12-11 are around 60 seconds.


<syntaxhighlight lang="shell-session">
# kubectl drain --ignore-daemonsets kubernetes1001.eqiad.wmnet
# kubectl describe pods  --all-namespaces | awk  '$1=="Node:" {print $NF}' | sort -u
kubernetes1002.eqiad.wmnet/10.64.16.75
kubernetes1003.eqiad.wmnet/10.64.32.23
kubernetes1004.eqiad.wmnet/10.64.48.52
kubernetes1005.eqiad.wmnet/10.64.0.145
kubernetes1006.eqiad.wmnet/10.64.32.18
# kubectl get nodes
NAME                         STATUS                     ROLES     AGE       VERSION
kubernetes1001.eqiad.wmnet   Ready,SchedulingDisabled   <none>    2y352d    v1.12.9
kubernetes1002.eqiad.wmnet   Ready                      <none>    2y352d    v1.12.9
kubernetes1003.eqiad.wmnet   Ready                      <none>    2y352d    v1.12.9
kubernetes1004.eqiad.wmnet   Ready                      <none>    559d      v1.12.9
kubernetes1005.eqiad.wmnet   Ready                      <none>    231d      v1.12.9
kubernetes1006.eqiad.wmnet   Ready                      <none>    231d      v1.12.9
</syntaxhighlight>


When the node has been rebooted, it can be configured to accept pods again using '''kubectl uncordon''', e.g.
<syntaxhighlight lang="shell-session">
# kubectl uncordon kubernetes1001.eqiad.wmnet
# kubectl get nodes
NAME                         STATUS    ROLES     AGE       VERSION
kubernetes1001.eqiad.wmnet   Ready     <none>    2y352d    v1.12.9
kubernetes1002.eqiad.wmnet   Ready     <none>    2y352d    v1.12.9
kubernetes1003.eqiad.wmnet   Ready     <none>    2y352d    v1.12.9
kubernetes1004.eqiad.wmnet   Ready     <none>    559d      v1.12.9
kubernetes1005.eqiad.wmnet   Ready     <none>    231d      v1.12.9
kubernetes1006.eqiad.wmnet   Ready     <none>    231d      v1.12.9
</syntaxhighlight>


The pods are not rebalanced automatically, i.e. the rebooted node is free of pods initially.
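
To confirm what (if anything) has been scheduled back onto the rebooted node, you can filter pods by node name; the hostname below is just an example:
<syntaxhighlight lang="bash">
# List all pods currently scheduled on a given node.
kubectl get pods --all-namespaces -o wide \
    --field-selector spec.nodeName=kubernetes1001.eqiad.wmnet
</syntaxhighlight>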


=== Restarting specific components ===
 
kube-controller-manager and kube-scheduler are control-plane components that run alongside the API server. In production, multiple instances run and perform a leader election via the API to determine which one is active. Restarting both is without grave consequences, so it's safe to do. However, both are critical components in that they are required for the overall cluster to function smoothly: kube-scheduler is crucial for node failovers, pod evictions, etc., while kube-controller-manager packs multiple controller components and is critical for responding to pod failures, depools, etc.
 
The commands are:
<syntaxhighlight lang="bash">
sudo systemctl restart kube-controller-manager
sudo systemctl restart kube-scheduler
</syntaxhighlight>
 
=== Restarting the API server ===
 
The API server is behind LVS in production, so it's fine to restart it as long as enough time is given between restarts across the cluster.
<syntaxhighlight lang="bash">
sudo systemctl restart kube-apiserver
</syntaxhighlight>
 
If you need to restart all API servers, it might be wise to start with the ones that are not currently leading the cluster (to avoid multiple leader elections). The current leader is stored in the <code>control-plane.alpha.kubernetes.io/leader</code> annotation of the kube-scheduler endpoint:
<syntaxhighlight lang="bash">
kubectl -n kube-system describe ep kube-scheduler
</syntaxhighlight>
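
If you just want the raw leader record, you can also grep it out of the endpoint object; the <code>holderIdentity</code> field names the host currently holding the lease:
<syntaxhighlight lang="bash">
# The annotation line contains a JSON record; "holderIdentity" names the
# control-plane host that currently holds the kube-scheduler lease.
kubectl -n kube-system get ep kube-scheduler -o yaml | grep holderIdentity
</syntaxhighlight>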
 
=== Switch the active staging cluster (eqiad<->codfw) ===
We do have one staging cluster per DC, mostly to separate staging of kubernetes and components from staging of the services running on top of it. To provide staging services during work on one of the clusters, we can (manually) switch between the DCs:
 
* Switch staging.svc.eqiad.wmnet to point to the new active k8s cluster (we should have a better solution/DNS name for this at some point)
** https://gerrit.wikimedia.org/r/c/operations/dns/+/667982
* Switch the definition of "staging" on the deployment servers:
** https://gerrit.wikimedia.org/r/c/operations/puppet/+/667996
* Switch CI and releases to the other kubestagemaster:
** https://gerrit.wikimedia.org/r/c/operations/puppet/+/668114
** <syntaxhighlight lang="bash">
sudo cumin -b 3 'O:ci::master or O:releases or O:deployment_server' 'run-puppet-agent -q'
</syntaxhighlight>
* Make sure all service deployments are up to date after the switch (e.g. deploy them all).
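
A quick way to verify the switch from a deployment server is to load the staging context for a service (using the <code>kube_env</code> helper described in the next section) and check which datacenter's nodes answer; the service name is a placeholder:
<syntaxhighlight lang="bash">
# Point kubectl at the staging cluster for one of your services.
kube_env <your-service> staging
# The nodes listed should belong to the datacenter staging now points at.
kubectl get nodes
</syntaxhighlight>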
 
=== Managing pods, jobs and cronjobs ===
 
Commands should be run from the [[Deployment_server|deployment servers]] (at the time of this writing [[deploy1002]]).
 
You need to set the correct context, for example:
kube_env <your service> eqiad
Other choices are codfw, staging.
 
The management command is called [[kubectl]]. You may find some more inspiration on kubectl commands at [[Kubernetes/kubectl_Cheat_Sheet]].
 
==== Listing cronjobs, jobs and pods ====
kubectl get cronjobs -n <namespace>
kubectl get jobs -n <namespace>
kubectl get pods -n <namespace>
 
==== Deleting a job ====
kubectl delete job <job id>
 
==== Updating the docker image run by a CronJob ====
 
The relationship between the resources is the following:
 
Cronjob --spawns--> Job(s) --spawns--> Pod(s)
 
Note: Technically speaking, it's a tight control loop that lives in kube-controller-manager that does the spawning part, but adding that to the above would make this more confusing.
 
Under normal conditions, a Docker image version will be updated when a new deploy happens: the CronJob will then reference the new version. However, jobs already created by the CronJob will not be stopped until they have run to completion.
 
When the job finishes, the cronjob will create new job(s), which in turn will create new pod(s).


Depending on the correlation between the CronJob's schedule and the job's run time, there might be a window of time where, despite the new deployment, the old job is still running.

Deleting the pod created by the job will NOT work: the job will still exist and will create a new pod (which will still have the old image).

So, if we are dealing with a long-running Kubernetes Job, one can get the same effect by deleting the Job created by the CronJob.

[[phab:T280076]] is an example where this was needed.
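
Putting the above together, a sketch of the procedure (namespace and job name are placeholders) looks like:
<syntaxhighlight lang="bash">
# Find the CronJob and the job(s) it has spawned.
kubectl get cronjobs -n <namespace>
kubectl get jobs -n <namespace>

# Delete the long-running job itself (not just its pod), so that the next run
# created by the CronJob starts from the newly deployed image.
kubectl delete job <job-name> -n <namespace>
</syntaxhighlight>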
==== Recreate pods (of deployments, daemonsets, statefulsets, ...) ====
Pods which are backed by workload controllers (such as Deployments or DaemonSets) can easily be recreated, without the need to manually delete them, using <code>kubectl rollout</code>. This will make sure that the update strategy specified for the set of pods, as well as disruption budgets etc., are properly honored.
 
To restart all pods of a specific Deployment/DaemonSet:
<syntaxhighlight lang="bash">
kubectl -n NAMESPACE rollout restart [deployment|daemonset|statefulset|...] NAME
</syntaxhighlight>
You may also restart all Pods of all Deployments/DaemonSets in a specific namespace just by omitting the name. The command returns immediately (i.e. it does not wait for the process to complete); the actual rolling restart is carried out in the background for you.


In order to restart workloads across multiple namespaces, one can use something like:
<syntaxhighlight lang="bash">
kubectl get ns -l app.kubernetes.io/managed-by=Helm -o jsonpath='{.items[*].metadata.name}' | xargs -L1 -d ' ' kubectl rollout restart deployment -n
</syntaxhighlight>


This works with or without label filters. The label filter above ensures that, for example, workloads in pre-defined namespaces (like kube-system) do not get restarted.


==== Running a rolling restart of a Helmfile service ====


To rolling-restart a service described by a Helmfile, you don't need to use <code>kubectl</code>; instead, run


<syntaxhighlight lang="bash">
cd /srv/deployment-charts/helmfile.d/services/${SERVICE?}
helmfile -e ${CLUSTER?} --state-values-set roll_restart=1 sync
</syntaxhighlight>


== See also ==
* [[Kubernetes/Clusters/New|Adding a new Kubernetes cluster]]
* [[Kubernetes/Deployments|Deployments on Kubernetes]]
* [[Kubernetes/Kubernetes Education|Kubernetes Education]]

== Toolforge Info ==
* [[Portal:Toolforge/Admin/Kubernetes|Toolforge Kubernetes cluster design and administration]]
* [[Help:Toolforge/Web|Toolforge Kubernetes webservice help]]
* [[Help:Toolforge/Kubernetes|Toolforge Kubernetes general help]]


[[Category:Kubernetes]]
[[Category:SRE Service Operations]]
