{{Kubernetes nav}}
:''For information about Kubernetes in the Toolforge environment see [[Help:Toolforge/Kubernetes]].''
'''[[w:Kubernetes|Kubernetes]]''' (often abbreviated '''k8s''') is an open-source system for automating the deployment and management of applications running in [[W:Operating-system-level virtualization|containers]]. This page collects notes and documentation on the Kubernetes setup in the Foundation production environment.


== Clusters ==


We maintain Kubernetes clusters in both the [[SRE/Production access|production]] and the [[Help:Cloud Services introduction|cloud services]] realms.


Most of the information on this page and its subpages applies to the clusters in the production realm, although some techniques and tools are broadly applicable to other WMF clusters and to Kubernetes in general.


The '''[[Kubernetes/Clusters]]''' page contains the definitive list of currently maintained clusters in the production realm, along with information about who manages them and each cluster's specific purpose.


For information relating to the Kubernetes clusters in the cloud services realm, please see [[Kubernetes#Toolforge info|Toolforge info]].


== Packages ==


We deploy Kubernetes in WMF production using Debian packages where appropriate. An upgrade policy defines the timeframe and the versions we run at any point in time; see [[Kubernetes/Kubernetes_Infrastructure_upgrade_policy]]. For more technical information on how we build the Debian packages, have a look at [[Kubernetes/Packages]].
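For example, one can check which Kubernetes-related Debian packages (and versions) are installed on a given node (a generic sketch, not a WMF-specific procedure; actual package names are not listed here):

<source lang="shell-session">
$ dpkg -l | grep -i kube    # list installed Kubernetes-related Debian packages and their versions
</source>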


== Images ==


For how our images are built and maintained, have a look at [[Kubernetes/Images]].

== Services ==


A service in Kubernetes is an ''"abstract way to expose an application running on a set of workloads as a network service"''. This overloads the term, since we also use ''"services"'' to describe how our various in-house developed applications are exposed to the rest of the infrastructure or to the public. When having a conversation about ''"services"'', it is worth making sure that everyone means the same thing. Below are some links to basic documentation about both concepts to help differentiate between them.


* https://kubernetes.io/docs/concepts/services-networking/service/


* Learn more about [[Deployment pipeline/Migration/Tutorial|Migrating a service to kubernetes]] and the [[Deployment pipeline]] generally.
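To make the upstream concept concrete, a minimal Kubernetes Service manifest looks roughly like this (an illustrative sketch using hypothetical names, not a manifest from our repositories):

<source lang="yaml">
apiVersion: v1
kind: Service
metadata:
  name: example-service    # hypothetical name
spec:
  selector:
    app: example-app       # traffic is routed to pods carrying this label
  ports:
    - port: 8080           # port the Service exposes inside the cluster
      targetPort: 8080     # port the selected pods listen on
</source>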


== Deployment Charts ==

We use a git repository called [[gerrit:plugins/gitiles/operations/deployment-charts/+/refs/heads/master/|operations/deployment-charts]] to manage all of the applications and deployments to Kubernetes clusters in the production realm.

See [[Kubernetes/Deployment Charts]] for more detailed information about the repository structure and its various functions.
It primarily contains [[Helm]] charts and [[Helmfile]] deployments.


The services and deployments that are defined within the repository are a combination of:


* WMF software, running on [[Kubernetes/Images#Services images|service images]] managed by the [[deployment pipeline]]
* WMF forks of third-party software, also running on [[Kubernetes/Images#Services images|service images]] managed by the [[deployment pipeline]]
* WMF builds of third-party software, running on [[Kubernetes/Images#Production images|production images]] and built with [https://doc.wikimedia.org/docker-pkg/ docker-pkg]

See [[Kubernetes/Deployments]] for instructions regarding day-to-day deployment of Kubernetes [[Kubernetes#Services|services]].
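As a rough illustration of what a Helmfile deployment looks like (a simplified sketch with hypothetical names, not the actual layout of operations/deployment-charts):

<source lang="yaml">
# helmfile.yaml -- illustrative only; see the real repository for the actual structure
releases:
  - name: example-service            # hypothetical release name
    namespace: example-service
    chart: wmf-stable/example-chart  # hypothetical chart reference
    values:
      - values.yaml                  # per-deployment configuration overrides
</source>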


== Debugging ==
For a quick introduction to the debugging actions one can take during a problem in production, look at [[Kubernetes/Helm]]. A guide will also be posted under [[Kubernetes/Kubectl]].
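As a starting point, these are the usual first kubectl commands when investigating a misbehaving deployment (a sketch; <code>example-service</code> is a hypothetical namespace):

<source lang="shell-session">
$ kubectl get pods -n example-service                  # list pods and their current state
$ kubectl describe pod <pod-name> -n example-service   # events, restarts, scheduling details
$ kubectl logs <pod-name> -n example-service           # container logs from a specific pod
</source>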


== Administration ==

See [[Kubernetes/Administration]] for collected instructions and runbooks for such tasks as:
 


* [[Kubernetes/Administration#Rebooting worker nodes|Rebooting worker nodes]]
* [[Kubernetes/Administration#Restarting specific components|Restarting specific components]]
* [[Kubernetes/Administration#Managing pods, jobs and cronjobs|Managing pods, jobs, and cronjobs]]
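As a flavour of what those runbooks cover, the polite way to reboot a worker node is to drain it first and uncordon it after the reboot, for example (node name used illustratively; see the runbook for the full procedure):

<source lang="shell-session">
# kubectl drain kubernetes1001.eqiad.wmnet     # evict pods and disable scheduling on the node
# (reboot the node)
# kubectl uncordon kubernetes1001.eqiad.wmnet  # allow pods to be scheduled on it again
</source>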


=== See also ===


* [[Kubernetes/Clusters/New|Adding a new Kubernetes cluster]]
* [[Kubernetes/Deployments|Deployments on Kubernetes]]
* [[Kubernetes/Kubernetes Education|Kubernetes Education]]


== Toolforge info ==
* [[Portal:Toolforge/Admin/Kubernetes|Toolforge Kubernetes cluster design and administration]]
* [[Help:Toolforge/Web|Toolforge Kubernetes webservice help]]
* [[Help:Toolforge/Kubernetes|Toolforge Kubernetes general help]]


[[Category:Kubernetes]]
[[Category:SRE Service Operations]]

''Latest revision as of 17:09, 25 November 2022.''