Kubernetes/Clusters/Add or remove nodes: Difference between revisions
imported>JMeybohm (→Add to calico: Fix hiera path for calico bgp_peers) |
imported>JMeybohm No edit summary |
Revision as of 09:26, 22 November 2021
This guide assumes you have a basic understanding of the various Kubernetes components. If you don't, please refer to https://kubernetes.io/docs/concepts/overview/components/
This guide is written for WMF SREs; it is NOT meant to be followed by non-SRE people.
Intro
This is a guide for adding or removing nodes from existing Kubernetes clusters.
Debian stretch is the only distribution+version that is supported by this guide. Debian buster support is tracked at T245272
Adding a node
When a Kubernetes cluster is created, a Puppet role for its workers is created as well (see: Kubernetes/Clusters/New#General_Puppet/hiera_setup)
- Apply the proper kubernetes worker puppet role for your cluster to the node in manifests/site.pp.
- Apply the partman recipe partman/custom/kubernetes-node.cfg for the node in modules/install_server/files/autoinstall/netboot.cfg.
Add node specific hiera data
You need to add node-specific data, such as the failure-domain/topology labels. This can be done by creating the file hieradata/hosts/foo-node1001.yaml:
profile::kubernetes::node::kubelet_node_labels:
- failure-domain.beta.kubernetes.io/region=codfw
- failure-domain.beta.kubernetes.io/zone=row-c
You can get the right region and zone values for your node from Netbox, like https://netbox.wikimedia.org/search/?q=foo-node1001
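As a minimal sketch (not part of the Puppet tooling), the content of that host file could be rendered in Python as follows. The function name `render_node_labels` is hypothetical; the hiera key and label names are the ones shown above, and the region and zone values still have to be looked up in Netbox.

```python
def render_node_labels(region, zone):
    """Return hiera YAML text with the kubelet topology labels for one node."""
    labels = [
        f"failure-domain.beta.kubernetes.io/region={region}",
        f"failure-domain.beta.kubernetes.io/zone={zone}",
    ]
    lines = ["profile::kubernetes::node::kubelet_node_labels:"]
    # One YAML list item per label, indented under the hiera key.
    lines += [f"  - {label}" for label in labels]
    return "\n".join(lines) + "\n"
```

Calling `render_node_labels("codfw", "row-c")` yields the same key and two labels as the example above.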
Add node to BGP
Add to homer
Nodes (in the calico setup) need to be able to establish BGP sessions with the core routers. For that, they need to be added as neighbors in config/sites.yaml of the operations/homer/public repository:
eqiad:
  [...]
  foo_neighbors:
    foo_node1001: {4: <Node IPv4>, 6: <Node IPv6>}
You will have to run homer once that change is merged. See: Homer#Running_Homer_from_cumin_hosts_(recommended)
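Before committing the neighbor entry, it can help to check that the two addresses are valid and belong to the right address family. The following is only a sketch using Python's standard `ipaddress` module; the function name `neighbor_addresses` is hypothetical and nothing in homer requires it.

```python
import ipaddress


def neighbor_addresses(ipv4, ipv6):
    """Validate both addresses and return them keyed by IP version."""
    v4 = ipaddress.IPv4Address(ipv4)  # raises ValueError if not valid IPv4
    v6 = ipaddress.IPv6Address(ipv6)  # raises ValueError if not valid IPv6
    # Same shape as the {4: ..., 6: ...} mapping used in sites.yaml.
    return {4: str(v4), 6: str(v6)}
```

Swapping the two arguments by mistake raises a `ValueError` instead of silently producing a broken neighbor entry.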
Add to calico
In addition, all nodes are BGP peers of each other, so we need to extend the hiera key profile::calico::kubernetes::bgp_peers for this Kubernetes cluster with the new node's FQDN in hieradata/role/<DATACENTER>/<CLUSTER>/worker.yaml, e.g.:
profile::calico::kubernetes::bgp_peers:
- cr1-eqiad.wikimedia.org
- cr2-eqiad.wikimedia.org
[...]
- foo-node1001.eqiad.wmnet
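Because every node has to appear in this list exactly once, the update is effectively an idempotent append. A small Python sketch of that logic, assuming the peers have already been read into a list (the function name `add_bgp_peer` is hypothetical):

```python
def add_bgp_peer(peers, fqdn):
    """Return a new peer list with fqdn appended unless it is already present."""
    if fqdn in peers:
        return list(peers)  # unchanged copy, no duplicate entry
    return [*peers, fqdn]
```

Running it twice with the same FQDN leaves the list unchanged after the first call, which keeps repeated edits from introducing duplicate peers.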
Reimage the node
Then use the re-image script to image your nodes, apply Puppet and so on.
Reimaging will bring the node up with kernel 4.9; kernel 4.19 will be installed by Puppet later, so a reboot after the initial Puppet run is required!
Add to conftool/LVS
If the Kubernetes cluster exposes services via LVS (production clusters usually do, staging ones don't), you need to add the node's FQDN to the cluster in conftool-data as well. For eqiad, this goes in conftool-data/node/eqiad.yaml, like:
eqiad:
  foo:
    [...]
    foo-node1001.eqiad.wmnet: [kubesvc]
Done
Please ensure you've followed all necessary steps from Server_Lifecycle#Staged_->_Active
Your node should now join the cluster and have workload scheduled automatically (like calico daemonsets). You can check with:
kubectl get nodes
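To check readiness programmatically rather than by eye, the JSON form of the same command can be parsed. The sketch below assumes the output of `kubectl get nodes -o json` and relies on the standard `Ready` node condition; the function names are hypothetical, and `fetch_node_list` needs kubectl access to the cluster.

```python
import json
import subprocess


def ready_nodes(node_list):
    """Return the names of nodes whose Ready condition has status "True"."""
    ready = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            ready.append(node["metadata"]["name"])
    return ready


def fetch_node_list():
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

The new node has joined successfully once its FQDN shows up in `ready_nodes(fetch_node_list())`.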
Removing a node
Drain workload
The first step in removing a node is to drain the workload from it. This also ensures that the workload still fits on the rest of the cluster:
kubectl drain --ignore-daemonsets foo-node1001.datacenter.wmnet
You can verify success by looking at what is still scheduled on the node:
kubectl describe node foo-node1001.datacenter.wmnet
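The same check can be sketched in Python by listing the pods still bound to the node and ignoring those owned by a DaemonSet, which the drain skips on purpose. This assumes the input is the parsed output of `kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> -o json`; the function name `remaining_workload` is hypothetical.

```python
def remaining_workload(pod_list):
    """Return "namespace/name" for every pod not owned by a DaemonSet."""
    remaining = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        owners = meta.get("ownerReferences", [])
        if any(owner.get("kind") == "DaemonSet" for owner in owners):
            continue  # expected to stay, drain ran with --ignore-daemonsets
        remaining.append(f'{meta["namespace"]}/{meta["name"]}')
    return remaining
```

An empty result suggests the drain succeeded. Note that static pods or pods that have already finished may still be listed, so a non-empty result needs a manual look.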
Decommission
You can now follow the steps outlined in Server_Lifecycle#Active_->_Decommissioned
Make sure to also remove:
- The node specific hiera data (from Kubernetes/Clusters/Add_or_remove_nodes#Add node specific hiera data)
- The BGP config for homer and calico (from Kubernetes/Clusters/Add_or_remove_nodes#Add node to BGP)
Delete the node from Kubernetes API
The final step is to delete the node from Kubernetes:
kubectl delete node foo-node1001.datacenter.wmnet