Kubernetes/Clusters/New
Note: This guide assumes you have a basic understanding of the various kubernetes components. If you don't, please refer to https://kubernetes.io/docs/concepts/overview/components/
Warning: This guide has been written to instruct a WMF SRE; it is NOT meant to be followed by non-SRE people.
Intro
This is a guide for setting up or reinitializing a new cluster from scratch (or almost scratch), using the already existing Wikimedia infrastructure. A quick primer:
A vanilla kubernetes is made up of the following components:
- Control plane
- etcd
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- Node
- kube-proxy
- kubelet
Note that upstream documents also refer to another control-plane component, namely cloud-controller-manager. We don't run cloud-controller-manager as we are not in a cloud.
In our infrastructure the first 3 components (kube-apiserver, kube-controller-manager, kube-scheduler) are assumed to be co-located on the same servers and to talk over localhost. Kubelet and kube-proxy are assumed to be co-located on every kubernetes node. etcd is assumed to be on 3 nodes that are dedicated and different from all the others. Those assumptions might be revisited at some point and things changed; these docs will be updated when that happens.
Our services/main cluster also uses calico as CNI (container networking interface) and helm as a deployment tool. Those are covered as well in the networking and deployment sections.
Versions
Kubernetes versioning is important and brutal. You might want to have a peek at our kubernetes components upgrade policy: Kubernetes/Kubernetes_Infrastructure_upgrade_policy
This guide currently covers kubernetes 1.16, calico 3.16, helm 2.17
Prerequisites
- Make sure you accept the restrictions about the versions above.
- Allocate IP spaces for your cluster.
- Calculate the maximum number of pods you want to support and figure out, using a subnet calculator (e.g. sipcalc; see the example after this list), what IPv4 subnet you require (e.g. if you want 100 pods, 128 pod IPs should be ok, so a /25 is enough; if you plan on a maximum of 1000 pods, you need 4 /24s of 256 IPs each, i.e. a /22). Allocate them as active in Netbox. We can always add more pools later, but with IPv4 it's better to keep things a bit tidy. Don't forget IPv6: allocate a /64. It should be enough regardless of the number of pods and will allow for growth.
- Calculate the maximum number of services you want to have (obviously it will be smaller than the number of pods; unless you plan to expose >250 services, a /24 should be more than enough). Allocate it in Netbox. Don't forget IPv6: allocate a /64. It should be enough regardless of growth.
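To sanity-check a candidate prefix before allocating it in Netbox, sipcalc can be run from any host that has it installed. The prefixes below are made-up placeholders, not real allocations:

# a /25 provides 128 IPv4 addresses, enough for ~100 pods
sipcalc 10.192.75.0/25
# a /22 provides 1024 IPv4 addresses (4 x /24), enough for ~1000 pods
sipcalc 10.192.76.0/22
# the IPv6 allocation is a /64 regardless of pod count
sipcalc 2620:0:860:300::/64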
helmfile.d structure
We use helmfile extensively for all deployments, including creating all the cluster configuration.
Clone "https://gerrit.wikimedia.org/r/operations/deployment-charts" and navigate to helmfile.d/admin_ng/values
hierarchy. The directories there are 1 per cluster. Copy one of those and amend it to fit your cluster.
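A minimal sketch of that workflow, assuming the new cluster is called foo-eqiad (both directory names below are just examples, pick an existing source directory that matches your DC):

git clone "https://gerrit.wikimedia.org/r/operations/deployment-charts"
cd deployment-charts/helmfile.d/admin_ng/values
# copy an existing per-cluster directory and adjust its contents for the new cluster
cp -r staging-codfw foo-eqiad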
Important things that WILL require alteration are:
File calico-values.yaml
# This is before coredns works, we can't rely on internal DNS, so use the external one
kubernetesApi:
  # You must have already a certificate by cergen for that
  host: <myclusterdns e.g. kubestagemaster.svc.codfw.wmnet>
  port: 6443
BGPConfiguration:
  asNumber: 64602
IPPools:
  # These are the IP spaces you reserved for the cluster. It of course varies per DC
  ipv4-1:
    cidr: "myipv4/24"
  ipv6:
    cidr: "myipv6/64"
BGPPeers:
  # This is actually per DC. It represents the IPv4+IPv6 IPs of the core routers. Make sure to have the correct ones (which should happen if you copied the correct DC to start with)
  cr1-codfw-ipv4:
    asNumber: 14907
    peerIP: "208.80.153.192"
  cr2-codfw-ipv4:
    asNumber: 14907
    peerIP: "208.80.153.193"
  cr1-codfw-ipv6:
    asNumber: 14907
    peerIP: "2620:0:860:ffff::1"
  cr2-codfw-ipv6:
    asNumber: 14907
    peerIP: "2620:0:860:ffff::2"
File coredns-values.yaml
# This is before coredns works, we can't rely on internal DNS, so use the external one
kubernetesApi:
  host: <myclusterdns>
  port: 6443
service:
  # This is the cluster level IP that coredns will listen on. It MUST be in the service ip range you reserved previously and it MUST NOT be the very first one (.1) as that is internally used by kubernetes
  clusterIP: X.Y.Z.W
Components
etcd
etcd is a distributed datastore using the Raft algorithm for consensus. It is used by kubernetes to store cluster configuration as well as deployment data. In WMF it is also used for pybal, so there is some operational knowledge of it in-house.
Depending on the criticality of your new cluster, request an odd number (the recommended value is 3) of small VMs on the phabricator vm-requests project via SRE_Team_requests#Virtual_machine_requests_(Production). Then use Ganeti to create those VMs, followed by the guide in the dedicated page Etcd.
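Once the etcd cluster is up, a quick health check along these lines can confirm it works. This is only a sketch assuming an etcd v3 setup: the hostname and port are illustrative, and depending on the TLS setup you may also need --cacert/--cert/--key:

# from one of the new etcd VMs
ETCDCTL_API=3 etcdctl --endpoints=https://$(hostname -f):2379 endpoint health
ETCDCTL_API=3 etcdctl --endpoints=https://$(hostname -f):2379 member list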
Control-plane
Note: Packages have been created and are ready for use on Debian stretch and buster. Other versions aren't currently supported.
Servers
The control plane houses kube-apiserver, kube-controller-manager and kube-scheduler. For this guide kube-controller-manager and kube-scheduler are assumed to talk to the localhost kube-apiserver. If more than one control-plane node exists, those 2 components will perform elections over the API about which is the main one at any given point in time (detection and failover is automatic).
Depending on the criticality of having the control plane always working, request 1 or 2 small VMs on the phabricator vm-requests project. Then use Ganeti to create those VMs.
Puppet/hiera
In our setup puppet roles are the way we instruct hiera to do lookups, but they don't have any functionality themselves (see Puppet_coding#Organization for a primer).
Create a new role for your nodes. The best way forward is to copy role::kubernetes::staging::master and set a proper system::role description. Something like the following should be good enough:
class role::foo::main {
    include ::profile::standard
    include ::profile::base::firewall
    # Sets up docker on the machine
    include ::profile::kubernetes::master

    system::role { 'foo::main':
        description => 'foo control plane server',
    }
}
If you are going to have more than one control plane node, add profile::lvs::realserver to the list of profiles included.
Create the proper hiera files corresponding to your new role, e.g. if your new role is called role::foo::main then you want the following hiera files:
- hieradata/role/common/foo/master.yaml. This is where non-DC specific hiera values go. You can copy hieradata/role/common/kubernetes/staging/master.yaml, make sure to change keys, lvs configuration
- hieradata/role/codfw/foo/master.yaml. This is codfw specific data. Mostly service cluster ip ranges and etcd things should be in there. Make sure to set the correct cluster service IP range that you reserved earlier as well as a list of the etcd hosts you created previously.
- hieradata/role/eqiad/foo/master.yaml. This is eqiad specific data. Same rules apply as above
- Create the corresponding private puppet repo and labs/private tokens. It should be just profile::kubernetes::master::controllermanager_token:. You can obtain them from the repos themselves (remember that labs/private is full of dummy tokens).
- Create the certificates using Cergen in the puppet private repo.
- Put the public cert that was obtained from the above step in the public repo under the files/ssl directory with the proper name.
Apply the above role to your new node(s)
All of the above can be done in 1 patch while using the puppet compiler
LVS
Note: only needed if >1 control plane nodes have been created.
Follow LVS#Add a new load balanced service
Users/tokens
Our user/token populating process is currently hardwired to work across all clusters the same way. You will get all the users that the main services kubernetes clusters have. That is a limitation of our lack of a proper authentication layer that we have not yet solved.
Node (worker)
Warning: Debian stretch is the only distribution+version that is supported by this guide. Debian buster support is tracked at T245272.
This setup is meant to provide (and achieves) a hands-off approach to node provisioning/reprovisioning/imaging etc. That is, from the moment the node is declared ready to be put in service and the puppet role (and respective hiera) has been applied, a single re-image should suffice for the node to register with the API and be ready to receive traffic.
Notes
- The setup has only been tested with the specific partman recipe present in partman/custom/kubernetes-node.cfg. It creates a specific VG that is meant to be deleted and recreated by puppet on first role apply.
- docker is meant to be used as the CRE. Other runtime engines aren't currently supported.
- Currently docker is using the lvm devicemapper graph driver. This graph driver is deprecated. When we move to buster or bullseye we expect to stop using the devicemapper graph driver and rely on the overlay graph driver instead.
- The CNI of choice is calico and it is deployed via a Kubernetes Daemonset. A node component is running on every node and is the one providing connectivity to pods. Failure of that component means pods have no connectivity.
- This setup is tested both with version 4.9 as well as 4.19 of the linux kernel. 4.19 is the recommended one currently.
General Puppet/hiera setup
In our setup puppet roles are the way we instruct hiera to do lookups, but they don't have any functionality themselves (see Puppet_coding#Organization for a primer).
Create a new role for your nodes. The best way forward is to copy role::kubernetes::staging::worker and set a proper system::role description. Something like the following should be good enough:
class role::foo::worker {
    include ::profile::standard
    include ::profile::base::firewall
    include ::profile::base::linux419

    # Sets up docker on the machine
    include ::profile::docker::storage
    include ::profile::docker::engine
    # Setup kubernetes stuff
    include ::profile::kubernetes::node
    # Setup calico
    include ::profile::calico::kubernetes

    system::role { 'foo::worker':
        description => 'foo worker node',
    }
}
In case you expect to expose services via LVS, add profile::lvs::realserver to the list of profiles you include.
Create the proper hiera files corresponding to your new role. e.g. if your new role is called role::foo::worker then you want the following hiera files
- hieradata/role/common/foo/worker.yaml. This is where non-DC specific hiera values go. You can copy hieradata/role/common/kubernetes/staging/worker.yaml. It should mostly not require changes.
- hieradata/role/codfw/foo/worker.yaml. This is codfw specific data. You need to update:
# Enter your control plane DNS
profile::kubernetes::master_fqdn: <foo>
# The list of control-plane nodes. This is used to open up firewall rules
profile::kubernetes::master_hosts:
- main1
- main2
# The IP coredns will listen on. It needs to be in your service IP cluster range. Don't use .1, it's used internally by kubernetes
profile::kubernetes::node::kubelet_cluster_dns: X.Y.Z.W
# Enter your control plane DNS
profile::rsyslog::kubernetes::kubernetes_url: <foo>
- hieradata/role/eqiad/foo/worker.yaml. This is eqiad specific data. Same rules apply as above
Make sure to create the corresponding private puppet repo and labs/private tokens. You don't get to generate them on your own currently as they are shared across all clusters until we can have a better solution, so reuse what the services k8s cluster uses. Things to define:
profile::kubernetes::node::kubeproxy_token: dummytoken1
profile::kubernetes::node::kubelet_token: dummytoken2
profile::rsyslog::kubernetes::token: dummytoken3
profile::calico::kubernetes::calico_cni::token: dummytoken4
profile::calico::kubernetes::calicoctl::token: dummytoken5
Access to restricted docker images
If your nodes need access to restricted docker images (see T273521 for context), you have to provide credentials for the docker registry to your nodes. This can be done by adding the hiera key profile::kubernetes::node::docker_kubernetes_user_password to the file hieradata/role/common/foo/worker.yaml in the private puppet repository.
See Docker-registry#Access_control on how to find the correct password.
Warning: Because of the way docker works, you will need to ensure a puppet run on all docker registry nodes after puppet has run on the kubernetes nodes with docker registry credentials set. See 672537 for details:
sudo cumin -b 2 -s 5 'A:docker-registry' 'run-puppet-agent -q'
Adding Nodes
For adding nodes (based on the generic setup described above) please follow Kubernetes/Clusters/Add_or_remove_nodes
Warning: After the re-image the nodes will NOT be automatically added to the cluster if you have never applied helmfile.d/admin_ng; see Kubernetes/Clusters/New#Apply RBAC rules and PSPs. You only need to do that once.
Apply RBAC rules and PSPs
If you have your helmfile.d/admin_ng ready you can apply at least RBAC and Pod Security Policies.
Note: these commands need to be run as logged-in root (just prefixing them with sudo will not work).
$ deploy100X:/srv/deployment-charts/helmfile.d/admin_ng# helmfile -e <my_cluster> -l name=rbac-rules sync
$ deploy100X:/srv/deployment-charts/helmfile.d/admin_ng# helmfile -e <my_cluster> -l name=pod-security-policies sync
After this stage your nodes will be registered with the API, but they will not be ready to receive pods, because the networking setup of the following sections is still missing.
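You can see that intermediate state from the deployment host (the cluster name is a placeholder; remember the logged-in-root note above):

kube_env admin <my_cluster>
kubectl get nodes
# the nodes should be listed, but typically show as NotReady until calico (see the Networking section below) is deployed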
Label Kubernetes Masters
For some clusters there is the need to add specific node labels to identify roles, for example the master nodes part of the control plane:
Note: these commands need to be run as logged-in root (just prefixing them with sudo will not work). As root, you may need to run kube_env admin somecluster as well:
kubectl label nodes ml-serve-ctrl1001.eqiad.wmnet node-role.kubernetes.io/master=""
kubectl label nodes ml-serve-ctrl1002.eqiad.wmnet node-role.kubernetes.io/master=""
Due to https://github.com/kubernetes/kubernetes/issues/84912#issuecomment-551362981, we cannot add the above labels to the ones set by the Kubelet when registering the node, so this step needs to be done manually when bootstrapping the cluster. The labels will be useful for NetworkPolicies, for example to identify traffic coming from the master nodes towards a certain pod (likely a webhook).
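To verify that the labels were applied (same kube_env context as above):

kubectl get nodes -l node-role.kubernetes.io/master --show-labels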
Networking
First of all, have a look at Network design for how a DC (not a caching pop) is cabled network-wise. It will help you get an idea of what it is you are going to be doing in this section.
What we are going to do in this section is have the nodes talk BGP to the cr*-<site> core routers (aka the juniper routers) and vice versa (it's a bidirectional protocol).
Core routers
- Craft a change like the following https://gerrit.wikimedia.org/r/c/operations/homer/public/+/661055 (before proceeding make sure that the Network Engineers are aware of what you are going to do).
- Apply it to the core routers using Homer
- Don't worry about BGP alerts; the important bit is doing this step and the next one back to back (to establish the BGP sessions).
Calico node/controllers
Now you can deploy all calico components
At this stage you can probably deploy the entire helmfile.d structure in one go, but since RBAC/PSPs are already covered above we are going to just mention calico here.
# remember to root-login via sudo -i first
$ deploy100X:/srv/deployment-charts/helmfile.d/admin_ng$ helmfile -e <my_cluster> -l name=calico-crds sync
$ deploy100X:/srv/deployment-charts/helmfile.d/admin_ng$ helmfile -e <my_cluster> -l name=calico sync
Note: if the second command times out, you may need to sync the namespaces first.
There are dependencies between the 2 so you don't really need to go release by release, but for clarity:
- The CRDs (Custom Resource Definitions) are calico's way of storing its data in the Kubernetes API
- The calico release itself will set up a calico-node pod on every node with hostNetwork: true (that is, it will not have its own IP address but rather share it with the host), 1 calico-typha pod and 1 calico-kube-controllers pod.
If this succeeds, you are almost ready to deploy workloads, but have a look at the 2 rather crucial cluster tools below.
To check if calico works:
root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kube_env admin ml-serve-eqiad
root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl -n kube-system get deployment,daemonset
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/calico-kube-controllers 1/1 1 1 2m29s
deployment.apps/calico-typha 1/1 1 1 2m29s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/calico-node 4 4 4 4 4 kubernetes.io/os=linux 2m29s
And if you want to check on the routers, ssh to one of them (like cr1-eqiad.wikimedia.org) and run the following:
$ show bgp neighbor
[..]
Description: ml-serve1002
Group: Kubemlserve4 Routing-Instance: master
Forwarding routing-instance: master
Type: External State: Established Flags: <Sync>
Last State: OpenConfirm Last Event: RecvKeepAlive
Last Error: None
[..]
You should see an established session for all the k8s workers of your cluster.
Cluster tools
There are 2 cluster level tools you probably want:
CoreDNS
Note: There are many horror stories regarding DNS and kubernetes. Because of those we were late in adopting CoreDNS. As an infrastructure piece it hasn't yet created problems, but we keep an eye on it.
CoreDNS is the deployment and service that provides outgoing DNS resolution to pods as well as internal DNS discovery. It is NOT used by deployments that are hostNetwork: true (e.g. calico-node) in our setup, on purpose.
Assuming you have populated the helmfile.d/admin_ng/values/<cluster>/ hierarchy, CoreDNS can be deployed with:
helmfile -e <mycluster> -l name=coredns sync
To check that everything is up and running as expected:
root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kube_env admin ml-serve-eqiad
root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl -n kube-system get deployment,daemonset
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/calico-kube-controllers 1/1 1 1 15m
deployment.apps/calico-typha 1/1 1 1 15m
deployment.apps/coredns 4/4 4 4 49s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/calico-node 4 4 4 4 4 kubernetes.io/os=linux 15m
After the deployment of the coredns pods, you are free to merge a change like https://gerrit.wikimedia.org/r/c/operations/puppet/+/673985 to configure the coredns service IP on all kubelets/workers.
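If you want to double-check resolution from inside the cluster, a throwaway pod can be used for a quick lookup. This is only a sketch: the image name is a placeholder (pick any image from our registry that ships nslookup), the cluster domain is assumed to be the default cluster.local, and PSP/namespace restrictions may require running it in a namespace that allows it:

kubectl run -i --rm --restart=Never dns-test --image=<some-image-with-nslookup> -- nslookup kubernetes.default.svc.cluster.local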
Eventrouter
Eventrouter aggregates kubernetes events and sends them to logstash.
Deploy it on the deployment host with:
# remember to root-login via sudo -i first
helmfile -e <mycluster> -l name=eventrouter sync
To check that everything is up and running as expected, see what's done for coredns.
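For example (the namespace eventrouter ends up in may vary, hence the cluster-wide listing):

kube_env admin <my_cluster>
kubectl get pods --all-namespaces | grep eventrouter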
Namespaces
Note: Namespaces are populated in a pretty opinionated way in the main clusters, with limitRanges, resourceQuotas and tillers per namespace.
Namespaces are created using helmfile and the main clusters (production + staging) all share them; however, they are overridable per cluster. The main key is at helmfile.d/admin_ng. An example of augmenting it is at helmfile.d/admin_ng/staging
The same structure also holds limitRanges and resourceQuotas. Note that it's a pretty opinionated setup.
Creating them is done with the following command on the deployment host:
# remember to root-login via sudo -i first
helmfile -e staging-codfw -l name=namespaces sync
This means that if you don't want the main namespaces populated (which makes sense), your best bet is to skip running that command. Alternatively, override the main values for your cluster.
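To inspect what was created, including the opinionated per-namespace defaults (the cluster name is a placeholder):

kube_env admin <my_cluster>
kubectl get namespaces
kubectl get limitrange,resourcequota --all-namespaces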
Prometheus
Warning: Before enabling scraping you will need to create the LVM volumes manually (see LVM creation below).
Prometheus talks to the api and discovers the API server, nodes, pods, endpoints and services. In WMF we only scrape the API server, the nodes and the pods. We have 2 nodes per DC doing the scraping. Those will need to be properly configured to scrape the new cluster.
This happens via the files:
- hieradata/role/eqiad/prometheus.yaml
- hieradata/role/codfw/prometheus.yaml
An example stanza is pasted below; hopefully it's self-documenting.
# A hash containing configuration for kubernetes clusters.
profile::prometheus::kubernetes::clusters:
  k8s:
    enabled: true
    master_host: 'kubemaster.svc.codfw.wmnet'
    port: 9906
    class_name: role::kubernetes::worker
  k8s-staging:
    enabled: false
    master_host: 'kubestagemaster.svc.codfw.wmnet'
    port: 9907
    class_name: role::kubernetes::staging::worker

# In the private repo a stanza like the following is required
# profile::prometheus::kubernetes::cluster_tokens:
#   k8s:
#     client_token: eqiaddummy
#   k8s-staging:
#     client_token: eqiaddummystaging
The above will only add the config for the new Prometheus instance on the Prometheus nodes:
- prometheus100[3,4].eqiad.wmnet
- prometheus200[3,4].codfw.wmnet
Please verify on them that the new systemd units are working as expected. Once done, you can follow up with:
- https://gerrit.wikimedia.org/r/c/operations/puppet/+/674279 (This requires a reload for apache2 on the prometheus nodes to pick up the new config. Please sync with Observability before doing anything).
- https://gerrit.wikimedia.org/r/c/operations/puppet/+/674313
The first code review also needs a puppet run on the grafana nodes to pick up the new config. Once done, you should be able to see the new cluster in the Kubernetes Grafana dashboards!
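For the "verify the new systemd units" step above, something along these lines should work on the Prometheus hosts. The unit name is an assumption (it presumes the new instance follows the prometheus@<cluster> naming used for the other instances); adjust it to whatever puppet actually created:

systemctl status prometheus@<my_cluster>
journalctl -u prometheus@<my_cluster> --since '10 minutes ago'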
LVM creation
This is unfortunately currently manual, requiring the creation of lvm volumes on multiple prometheus nodes:
- prometheus100[3,4].eqiad.wmnet
- prometheus200[3,4].codfw.wmnet
Please follow up with a member of the Observability team first to let them know what you are doing, so they are aware.
Decide what kind of speed class and disk space you want (essentially HDD vs SSD) and run the following commands on the correct nodes (the ones having the prometheus role).
Set the size, the name of the k8s cluster and speed class
SIZE=X
CLUSTER_NAME=CLUSTER_NAME
VG=vg-hdd
Then run (careful, this IS NOT idempotent)
lvcreate --size ${SIZE}GB --name prometheus-${CLUSTER_NAME} ${VG}
mkfs.ext4 /dev/mapper/$(echo $VG | sed -e 's/-/--/')-prometheus--$(echo $CLUSTER_NAME | sed -e s/-/--/)
mkdir /srv/prometheus/${CLUSTER_NAME}
echo "/dev/${VG}/prometheus-${CLUSTER_NAME} /srv/prometheus/${CLUSTER_NAME} ext4 defaults 0 0" >> /etc/fstab
mount /srv/prometheus/${CLUSTER_NAME}
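A quick sanity check of the result (plain shell, nothing WMF-specific):

lvs ${VG}
df -h /srv/prometheus/${CLUSTER_NAME}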
And you should be good to go.