

From Wikitech-static
Revision as of 17:31, 10 March 2021 by imported>Alexandros Kosiaris (→‎Intro)


This is a guide for setting up or reinitializing a cluster from scratch (or almost from scratch), using the Wikimedia infrastructure that is already in place. A quick primer:

A vanilla kubernetes cluster is made up of the following components:

  • Control plane
    • etcd
    • kube-apiserver
    • kube-controller-manager
    • kube-scheduler
  • Node
    • kube-proxy
    • kubelet

Note that upstream documents also refer to another control-plane component, namely cloud-controller-manager. We don't run cloud-controller-manager as we are not in a cloud.

In our infrastructure 3 of the control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler) are assumed to be collocated on the same servers and to talk over localhost. Kubelet and kube-proxy are assumed to be collocated on every kubernetes node. etcd is assumed to run on 3 dedicated nodes that are separate from all the others. Those assumptions might be revisited at some point; these docs will be updated when that happens.

Our services/main cluster also uses calico as CNI (Container Network Interface) plugin and helm as a deployment tool. Those are covered as well, in the networking and deployment sections.


Kubernetes versioning is important and brutal. You might want to have a peek at our kubernetes components upgrade policy [Kubernetes/Kubernetes_Infrastructure_upgrade_policy].

This guide currently covers kubernetes 1.16, calico 3.16 and helm 2.17.


  • Make sure you accept the restrictions about the versions above.
  • Allocate IP spaces for your cluster.
    • Calculate the maximum number of pods you want to support and use a subnet calculator (e.g. sipcalc) to figure out what IPv4 subnet you require. E.g. if you want 100 pods, 128 pod IPs should be ok, so a /25 is enough. If you plan on a maximum of 1000 pods, you need 4 /24s (256 IPs each, 1024 in total), so a /22. Allocate them as active in Netbox. We can always add more pools afterwards, but with IPv4 it's better to keep things tidy. Don't forget IPv6: allocate a /64. It should be enough regardless of the number of pods and will allow for growth.
    • Calculate the maximum number of services you want to have. Obviously it will be smaller than the number of pods; unless you plan to expose >250 services, a /24 should be more than enough. Allocate it in Netbox. Don't forget IPv6: allocate a /64. It should be enough regardless of growth.
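
The sizing arithmetic above can be sketched in Python with the standard ipaddress module. This is a minimal illustration, not part of the WMF tooling: prefix_for_hosts is a hypothetical helper and 10.0.0.0 is a placeholder base address, not an allocated range.

```python
import ipaddress


def prefix_for_hosts(count: int, version: int = 4) -> int:
    """Return the longest prefix length whose block holds at least `count` addresses."""
    if count < 1:
        raise ValueError("count must be at least 1")
    total_bits = 32 if version == 4 else 128
    # (count - 1).bit_length() is the number of host bits needed, i.e. ceil(log2(count)).
    host_bits = (count - 1).bit_length()
    if host_bits > total_bits:
        raise ValueError("count does not fit in the address family")
    return total_bits - host_bits


# Placeholder base address, used for illustration only.
pod_net = ipaddress.ip_network(f"10.0.0.0/{prefix_for_hosts(1000)}")
print(pod_net, pod_net.num_addresses)
```

For 100 pods the helper returns 25 and for 1000 pods it returns 22, matching the /25 and /22 figures above. The actual ranges still have to be picked and recorded in Netbox by hand.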



etcd is a distributed datastore using the Raft algorithm for consensus. It is used by kubernetes to store cluster configuration as well as deployment data. In WMF it is also used for pybal, so there is some existing operational knowledge.

Depending on the criticality of your new cluster, request an odd number (the recommended value is 3) of small VMs on the phabricator vm-requests project. Then use Ganeti to create those VMs and follow the guide on the dedicated Etcd page.
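
The reason for an odd member count follows from Raft's majority rule: a write needs a quorum of more than half of the members. A small sketch of that arithmetic (the function names are illustrative and not part of any etcd API):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for size in (1, 2, 3, 4, 5):
    print(size, quorum(size), tolerated_failures(size))
```

A 4-member cluster tolerates the same single failure as a 3-member one, so the extra even member adds cost without adding fault tolerance.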



The control plane houses kube-apiserver, kube-controller-manager and kube-scheduler. For this guide kube-controller-manager and kube-scheduler are assumed to talk to the kube-apiserver on localhost. If more than one control-plane node exists, those 2 components will perform elections over the API to decide which instance is the active one at any given point in time (detection and failover are automatic).

Depending on the criticality of having the control plane always working, request 1 or 2 small VMs on the phabricator vm-requests project. Then use Ganeti to create those VMs.


In our setup puppet roles are the way we instruct hiera to do lookups, but they don't have any functionality themselves (see Puppet_coding#Organization for a primer).

Create a new role for your nodes. The best way forward is to copy role::kubernetes::staging::master and set a proper system::role description. If you are going to have only 1 control plane node, remove profile::lvs::realserver as you don't need it.

Create the proper hiera files corresponding to your new role, e.g. if your new role is called role::foo::master then you want the following hiera files:

  • hieradata/role/common/foo/master.yaml. This is where non-DC specific hiera values go. You can copy hieradata/role/common/kubernetes/staging/master.yaml; make sure to change the keys and the lvs configuration.
  • hieradata/role/codfw/foo/master.yaml. This is codfw specific data; you can start from hieradata/role/codfw/kubernetes/staging/master.yaml. Mostly service cluster IP ranges and etcd things should be in there. Make sure to set the correct cluster service IP range that you reserved earlier as well as the list of the etcd hosts you created previously.
  • hieradata/role/eqiad/foo/master.yaml. This is eqiad specific data; you can start from hieradata/role/eqiad/kubernetes/staging/master.yaml. The same rules apply as above.
  • Create the corresponding private puppet repo and labs/private tokens. It should be just profile::kubernetes::master::controllermanager_token:. You can obtain them from the repos themselves (remember that labs/private is full of dummy tokens)
  • Create the certificates using Cergen in the puppet private repo.
  • Put the public cert that was obtained from the above step in the public repo under the files/ssl directory with the proper name.
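
The hiera file naming used in the list above can be derived mechanically from the role name. The following is a small sketch, assuming the convention implied by the existing role::kubernetes::staging::master files (strip the role:: prefix and turn :: into /); hiera_paths is a hypothetical helper, not a puppet tool.

```python
from typing import List, Sequence


def hiera_paths(role: str, sites: Sequence[str] = ("codfw", "eqiad")) -> List[str]:
    """List the common and per-DC hiera files expected for a puppet role."""
    prefix = "role::"
    if not role.startswith(prefix):
        raise ValueError(f"not a role class: {role!r}")
    relative = role[len(prefix):].replace("::", "/")
    paths = [f"hieradata/role/common/{relative}.yaml"]
    paths += [f"hieradata/role/{site}/{relative}.yaml" for site in sites]
    return paths


for path in hiera_paths("role::kubernetes::staging::master"):
    print(path)
```

Running it for a new role name gives a checklist of files to create before sending the patch through the puppet compiler.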

Apply the above role to your new node(s).

All of the above can be done in 1 patch, using the puppet compiler to verify it.


Follow LVS#Add a new load balanced service


Our user/token populating process is currently hardwired to work the same way across all clusters. You will get all the users that the main services kubernetes clusters have. That limitation stems from our lack of a proper authentication layer, which we have not yet solved.

helmfile.d structure

We use helmfile extensively for all deployments, including creating all the cluster configuration.

Clone "" and navigate to helmfile.d/admin_ng/values hierarchy. The directories there are 1 per cluster. Copy one of those and amend it to fit your cluster.

Important things that WILL require alteration are:

File calico-values.yaml

# This is before coredns works, we can't rely on internal DNS, so use the external one
 host: <myclusterdns e.g kubestagemaster.svc.codfw.wmnet> # You must already have a certificate created by cergen for that
 port: 6443
 asNumber: 64602
  # These are the IP spaces you reserved for the cluster. It of course varies per DC
     cidr: "myipv4/24"
     cidr: "myipv6/64"
 # This is actually per DC. It represents the IPv4+IPv6 IPs of the core routers. Make sure to have the correct ones (which should happen if you copied the correct DC to start with)
   asNumber: 14907
   peerIP: ""
   asNumber: 14907
   peerIP: ""
   asNumber: 14907
   peerIP: "2620:0:860:ffff::1"
   asNumber: 14907
   peerIP: "2620:0:860:ffff::2"

File coredns-values.yaml

# This is before coredns works, we can't rely on internal DNS, so use the external one
  host: <myclusterdns>
  port: 6443
  # This is the cluster level IP that coredns will listen on. It MUST be in the service ip range you reserved previously and it MUST NOT be the very first one (.1) as that is internally used by kubernetes
  clusterIP: X.Y.Z.W
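
The two constraints on clusterIP stated in the comment above can be checked before deploying. A minimal sketch using Python's ipaddress module; valid_coredns_ip is a hypothetical helper and 192.0.2.0/24 is a documentation range standing in for the service range you actually reserved.

```python
import ipaddress


def valid_coredns_ip(cluster_ip: str, service_cidr: str) -> bool:
    """True if cluster_ip is inside service_cidr and is not its first (.1) address."""
    network = ipaddress.ip_network(service_cidr)
    address = ipaddress.ip_address(cluster_ip)
    # The first address after the network address is used internally by kubernetes.
    reserved = network.network_address + 1
    return address in network and address != reserved


# Placeholder values, for illustration only.
print(valid_coredns_ip("192.0.2.3", "192.0.2.0/24"))
print(valid_coredns_ip("192.0.2.1", "192.0.2.0/24"))
```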




Cluster tools

There are 2 cluster level tools you probably want:


CoreDNS is the deployment and service that provides outgoing DNS resolution to pods as well as internal DNS discovery. It is NOT used by deployments that are hostNetwork: true (e.g. calico-node) in our setup on purpose.

Assuming you populated helmfile.d/admin_ng/values/<cluster>/, it can be deployed with

helmfile -e <mycluster> -l name=coredns sync


Eventrouter aggregates kubernetes events and sends them to logstash.

Deploy it with

helmfile -e <mycluster> -l name=eventrouter sync


Namespaces are created using helmfile and the main clusters (production + staging) all share them; however, they are overridable per cluster. The main key is at helmfile.d/admin_ng. An example of augmenting it is at helmfile.d/admin_ng/staging.

The same structure also holds limitRanges and resourceQuotas. Note that this is a pretty opinionated setup.

Creating them is done with the following command:

helmfile -e staging-codfw -l name=namespaces sync

This means that if you don't want the main namespaces populated (which makes sense), your best bet is to skip running that command. Alternatively, override the main values for your cluster.
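
The helmfile invocations in this section all share one shape, so they can be generated per cluster. The sketch below only builds the argument lists and runs nothing; the function names and the include_namespaces switch are illustrative assumptions, while the release names and flags are the ones used in the commands above.

```python
from typing import List


def helmfile_sync_cmd(environment: str, release: str) -> List[str]:
    """Build the argv for syncing a single release in one helmfile environment."""
    return ["helmfile", "-e", environment, "-l", f"name={release}", "sync"]


def cluster_tool_cmds(environment: str, include_namespaces: bool = False) -> List[List[str]]:
    """Commands for the cluster tools, optionally followed by the shared namespaces."""
    releases = ["coredns", "eventrouter"]
    if include_namespaces:
        releases.append("namespaces")
    return [helmfile_sync_cmd(environment, release) for release in releases]


for cmd in cluster_tool_cmds("staging-codfw", include_namespaces=True):
    print(" ".join(cmd))
```

Leaving include_namespaces at its default mirrors the advice above to skip the namespaces release when you do not want the main namespaces populated.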


Prometheus talks to the api and discovers the API server, nodes, pods, endpoints and services. In WMF we only scrape the API server, the nodes and the pods. We have 2 nodes per DC doing the scraping. Those will need to be properly configured to scrape the new cluster.

This happens via the files:

  • hieradata/role/eqiad/prometheus.yaml
  • hieradata/role/codfw/prometheus.yaml

An example stanza is pasted below; hopefully it is self-documenting.

# A hash containing configuration for kubernetes clusters.
    enabled: true
    master_host: 'kubemaster.svc.codfw.wmnet'
    port: 9906
    class_name: role::kubernetes::worker
    enabled: false
    master_host: 'kubestagemaster.svc.codfw.wmnet'
    port: 9907
    class_name: role::kubernetes::staging::worker
# In the private repo a stanza like the following is required
# profile::prometheus::kubernetes::cluster_tokens:
# k8s:
#   client_token: eqiaddummy
# k8s-staging:
#   client_token: eqiaddummystaging

LVM creation

This is unfortunately currently manual. Decide what kind of speed class and disk space you want (essentially HDD vs SSD) and run the commands below on the correct nodes (the ones having the prometheus role).

Set the size in GB (SIZE), the name of the k8s cluster (CLUSTER_NAME) and the volume group matching the speed class (VG) as shell variables.


Then run (careful, this IS NOT idempotent)

lvcreate --size ${SIZE}GB --name prometheus-${CLUSTER_NAME} ${VG}
# Use the /dev/VG/LV path: names under /dev/mapper double any hyphen in the LV name
mkfs.ext4 /dev/${VG}/prometheus-${CLUSTER_NAME}
mkdir /srv/prometheus/${CLUSTER_NAME}
echo "/dev/${VG}/prometheus-${CLUSTER_NAME}	/srv/prometheus/${CLUSTER_NAME}	ext4	defaults	0	0" >> /etc/fstab
mount /srv/prometheus/${CLUSTER_NAME}

And you should be good to go.
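
Because the commands above are not idempotent, it can help to check for an existing fstab entry before appending another one. A minimal sketch of such a pre-check; fstab_has_mount is a hypothetical helper, and the sample text and the k8s-example cluster name are made up for illustration.

```python
def fstab_has_mount(fstab_text: str, mount_point: str) -> bool:
    """True if any non-comment fstab line already uses mount_point as its second field."""
    for line in fstab_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


# Sample content for illustration; on a real host read /etc/fstab instead.
sample = (
    "# <device> <mount point> <type> <options> <dump> <pass>\n"
    "/dev/vg0/prometheus-k8s-example\t/srv/prometheus/k8s-example\text4\tdefaults\t0\t0\n"
)
print(fstab_has_mount(sample, "/srv/prometheus/k8s-example"))
```

If the check returns True for your cluster's mount point, the echo into /etc/fstab has already been done and should not be repeated.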