Portal:Toolforge/Admin/Kubernetes
Kubernetes (often abbreviated k8s) is an open-source system for automating the deployment and management of applications running in containers. Kubernetes was selected in 2015 by the Cloud Services team as the replacement for Grid Engine in the Toolforge project.[1] Usage of k8s by Tools began in mid-2016.[2]
Notice: For help on using kubernetes in Toolforge, see the Kubernetes help documentation.
Sub pages
Upstream Documentation
If you need tutorials, information or reference material, check out https://kubernetes.io/docs/home/. The documentation site can be switched to show the version of Kubernetes we currently have deployed.
Cluster Build
The entire build process for reference and reproducibility is documented at Portal:Toolforge/Admin/Kubernetes/Deploying.
Components
K8S components generally fall into two 'planes' - the control plane and the worker plane. You can also find more info about the general architecture of kubernetes (along with a nice diagram!) in the upstream documentation.

The most detailed information on how our setup is built is in the build documentation at Portal:Toolforge/Admin/Kubernetes/Deploying.
Control Plane
Kubernetes control plane nodes make global decisions about the cluster. This is where all the control and scheduling happen. Currently, most of these components (except etcd) run on each of the three control nodes. The three nodes are redundant, load-balanced by the Kubernetes Service object inside the cluster and by haproxy outside it.
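To see both layers of that load balancing from inside the cluster, you can inspect the built-in kubernetes Service and its endpoints. A minimal read-only sketch, assuming a working kubeconfig on a bastion or control node:

$ kubectl get service kubernetes -n default     # the ClusterIP that in-cluster components use to reach the API
$ kubectl get endpoints kubernetes -n default   # the control plane addresses backing that Service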
Etcd
Kubernetes stores all state in etcd - all other components are stateless. The etcd cluster is only accessed directly by the API Server and no other component. Direct access to this etcd cluster is equivalent to root on the entire k8s cluster, so it is firewalled off to be reachable only by the rest of the control plane nodes and the etcd nodes, client certificate verification is used for authentication (puppet is the CA), and secrets are encrypted at rest in our etcd setup.
We currently use a 3 node cluster, named tools-k8s-etcd-[4-6]. They're all smallish Debian Buster instances configured largely by the same etcd puppet code we use in production.
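A quick health check can be run from one of the etcd nodes themselves. This is only a sketch: the certificate paths below are placeholders and depend on how the puppet module lays the files out.

$ sudo ETCDCTL_API=3 etcdctl --endpoints=https://$(hostname -f):2379 \
    --cacert=/path/to/ca.pem --cert=/path/to/client.pem --key=/path/to/client-key.pem \
    endpoint health    # reports healthy/unhealthy and latency per endpoint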
The API Server
This is the heart of the Kubernetes control plane. All communication between all components, whether they are internal system components or external user components, must go through the API server. It is purely a data access layer, containing no logic related to any of the actual end-functionality Kubernetes offers. It offers the following functionality:
- Authentication & Authorization
- Validation
- Read / Write access to all the API endpoints
- Watch functionality for endpoints, which notifies clients when state changes for a particular resource
When you are interacting with the Kubernetes API, this is the server that is serving your requests.
The API server runs on each control plane node, currently tools-k8s-control-1/2/3. It listens on port 6443, using its own internal CA for TLS and authentication, and should be accessed from outside the cluster via the haproxy frontend at k8s.tools.eqiad1.wikimedia.cloud. The localhost insecure port is disabled. All certs for the cluster's use in API server communication are provisioned using the certificates API. Please note that we do use the cluster root CA in the certificates API; the wording in the upstream documentation merely warns users that this is only one way to configure it, and that API can be used for other types of certs as well, if a cluster builder so chooses.
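Two quick, read-only ways to confirm that the API server you reach through the haproxy frontend is healthy, assuming a working kubeconfig:

$ kubectl cluster-info           # shows which API endpoint kubectl is talking to
$ kubectl get --raw='/healthz'   # prints "ok" when the API server is healthy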
Controller Manager
All other cluster-level functions are currently performed by the Controller Manager. For instance, deprecated ReplicationController objects are created and updated by the replication controller (constantly checking & spawning new pods if necessary), and nodes are discovered, managed, and monitored by the node controller. The general idea is one of a 'reconciliation loop' - poll/watch the API server for desired state and current state, then perform actions to make them match.
The Controller Manager also runs on the k8s control nodes, and communicates with the API server over the ClusterIP of the API server with appropriate TLS. It runs in a static pod.
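Because it runs as a static pod, it shows up in the kube-system namespace as a mirror pod on each control node. A sketch for checking it, assuming the usual kubeadm-style component label (pod names follow <component>-<node name>):

$ kubectl -n kube-system get pods -l component=kube-controller-manager -o wide      # one pod per control node
$ kubectl -n kube-system logs kube-controller-manager-tools-k8s-control-1 --tail=50 # hypothetical pod name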
The scheduler
This simply polls the API for Pods with no assigned node, and selects appropriate healthy worker nodes for them. This is also a conceptually very simple reconciliation loop, and it is possible to replace the one we use (the default kube-scheduler) with a custom scheduler (and hence isn't part of Controller Manager).
The scheduler runs on the k8s control nodes in a static Pod and communicates with the API server over mutual TLS like all other components. The scheduler makes decisions via a process of filtering out nodes that are incapable of running tasks, and then scoring the remaining ones according to a complex ranking system. The scoring rules can be controlled somewhat by using scheduling profiles and plugins, but we haven't implemented anything custom in that regard.
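When a pod sits in Pending, the scheduler's filtering and scoring decisions surface as events on that pod. A sketch for inspecting them (tool-example and the pod name are hypothetical):

$ kubectl -n kube-system get pods -l component=kube-scheduler   # the static scheduler pods, one per control node
$ kubectl -n tool-example describe pod example-pod              # the Events section lists FailedScheduling reasons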
Worker plane
The worker plane refers to the components of the nodes on which actual user code is executed in containers. In tools these are named tools-k8s-worker-*, and run as Debian Buster instances.
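To get a quick view of the worker fleet and how loaded an individual node is (the node name below is a placeholder):

$ kubectl get nodes -o wide | grep tools-k8s-worker                              # workers with their IPs and kubelet versions
$ kubectl describe node tools-k8s-worker-NN | grep -A 8 'Allocated resources'    # requests/limits currently placed on that node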
Kubelet
Kubelet is the interface between kubernetes and the container engine (in our case, Docker), deployed via Debian packages rather than static pods. It checks for new pods scheduled on the node it is running on, and makes sure they are running with appropriate volumes / images / permissions. It also performs health checks on the running pods and updates their state in the k8s API. You can think of it as a reconciliation loop where it checks what pods must be running / not-running in its node, and makes sure that matches reality.
This runs on each node and communicates with the k8s API server over TLS, authenticated with a client certificate (puppet node certificate + CA). It runs as root since it needs to communicate with docker, and being granted access to docker is root equivalent.
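Since the kubelet comes from a Debian package, it is managed as an ordinary systemd service on each node, which is the first place to look when pods are stuck creating:

$ sudo systemctl status kubelet        # should be active (running)
$ sudo journalctl -u kubelet -n 100    # recent kubelet log lines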
Kube-Proxy
kube-proxy is responsible for making sure that k8s service IPs work across the cluster. It is effectively an iptables management system. Its reconciliation loop is to get the list of service IPs across the cluster, and make sure NAT rules for all of those exist on the node.
This is run as root, since it needs to use iptables. You can list the rules on any worker node with iptables -t nat -L
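kube-proxy organizes its NAT rules into KUBE-* chains, so it is usually easier to look at those directly than at the full table (output is large; the chain names below are the standard ones used by kube-proxy in iptables mode):

$ sudo iptables -t nat -L KUBE-SERVICES | head -n 20   # per-service ClusterIP dispatch rules
$ sudo iptables -t nat -L KUBE-NODEPORTS               # NodePort handling, e.g. the ingress port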
Docker
We're currently using Docker as our container engine. We place up-to-date docker packages in our thirdparty/k8s repo, and pin versions in puppet. Configuration of the docker service is handled in puppet.
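To confirm which engine version a node is actually running and where the package came from (the package name below is an assumption; check puppet for the exact one we pin):

$ docker version --format '{{.Server.Version}}'   # engine version in use on the node
$ apt-cache policy docker-ce                      # should show the pinned version coming from thirdparty/k8s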
Calico
Calico is the container overlay network and network policy system we use to allow all the containers to think they're on the same network. We currently use a /16 (192.168.0.0), from which each node gets a /24 and allocates an IP per container. It is currently a fairly bare minimum configuration to get the network going.
Calico is configured to use Kubernetes as its datastore, and therefore it effectively shares the same etcd cluster as Kubernetes. It runs on worker nodes as a DaemonSet.
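A quick way to check that the per-node Calico agents are healthy, assuming the standard k8s-app=calico-node label and kube-system namespace from the upstream manifests:

$ kubectl -n kube-system get daemonset calico-node                 # DESIRED and READY should match the node count
$ kubectl -n kube-system get pods -l k8s-app=calico-node -o wide   # one calico-node pod per node, all Running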
Proxy
We need to be able to get http requests from the outside internet to pods running on kubernetes. We have a NGINX ingress controller that handles this behind the main Toolforge proxy. Any time the DynamicProxy setup doesn't have a service listed, the incoming request will be proxied to the haproxy of the Kubernetes control plane on the port specified in the hiera key profile::toolforge::k8s::ingress_port (currently 30000), which forwards the request to the ingress controllers. Currently, the DynamicProxy only actually serves Gridengine web services.
This allows both Gridengine and Kubernetes based web services to co-exist under the tools.wmflabs.org domain and the toolforge.org domain.
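To follow that path by hand, you can look at the ingress controller pods and exercise the ingress port on the haproxy frontend directly. This is a sketch: mytool is a hypothetical tool name, and the request must come from somewhere that can reach the haproxy address:

$ kubectl -n ingress-nginx get pods -o wide      # the nginx ingress controller replicas
$ curl -s -o /dev/null -w '%{http_code}\n' -H 'Host: mytool.toolforge.org' http://k8s.tools.eqiad1.wikimedia.cloud:30000/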
Infrastructure centralized Logging
Warning: This section is totally false at this time. Central logging needs rebuilding.
We aggregate all logs from syslog (so docker, kubernetes components, flannel, etcd, etc) into a central instance from all kubernetes related hosts. This is for simplicity as well as to try to capture logs that would be otherwise lost to kernel issues. You can see these logs in the logging host, which can be found in Hiera:Tools as k8s::sendlogs::centralserver, in /srv/syslog. The current central logging host is tools-logs-01. Note that this is not related to logging for applications running on top of kubernetes at all.
Authentication & Authorization
In Kubernetes, there is no inherent concept of a user object at this time, but several methods of authentication to the API server by end users are allowed. They mostly require some external mechanism to generate OIDC tokens, x.509 certs or similar. The most native and convenient mechanism available to us seemed to be x.509 certificates provisioned using the Certificates API. This is managed with the maintain-kubeusers service that runs inside the cluster.
Services that run inside the cluster that are not managed by tool accounts are generally authenticated with a provisioned service account. Therefore, they use a service account token to authenticate.
Since the PKI structure of certificates is so integral to how everything in the system authenticates itself, further information can be found at Portal:Toolforge/Admin/Kubernetes/Certificates.

Permissions and authorization are handled via role-based access control and pod security policy.

Tool accounts are Namespaced accounts - for each tool we create a Kubernetes Namespace, and inside the namespace they have access to create a specific set of resources (RCs, Pods, Services, Secrets, etc). There are resource based (CPU/IO/Disk) quotas imposed on a per-namespace basis, described here: News/2020_Kubernetes_cluster_migration#What_are_the_primary_changes_with_moving_to_the_new_cluster?. More documentation to come.
Admin accounts
The maintain-kubeusers service creates admin accounts from the $project.admin LDAP group. Admin accounts are basically users with the "view" permission, which allows read access to most (not all) Kubernetes resources. They have the additional benefit of having the ability to impersonate any user in the environment. This can be useful for troubleshooting in addition to allowing the administrators to assume cluster-admin privileges without logging directly into a control plane host.
For example, bstorm is an admin account on toolsbeta Kubernetes and can therefore see all namespaces:
bstorm@toolsbeta-sgebastion-04:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
ingress-admission ingress-admission-55fb8554b5-5sr82 1/1 Running 0 48d
ingress-admission ingress-admission-55fb8554b5-n64xz 1/1 Running 0 48d
ingress-nginx nginx-ingress-64dc7c9c57-6zmzz 1/1 Running 0 48d
However, bstorm cannot write or delete resources directly:
bstorm@toolsbeta-sgebastion-04:~$ kubectl delete pods test-85d69fb4f9-r22rl -n tool-test
Error from server (Forbidden): pods "test-85d69fb4f9-r22rl" is forbidden: User "bstorm" cannot delete resource "pods" in API group "" in the namespace "tool-test"
She can use the kubectl-sudo plugin (which internally impersonates the system:masters group) to delete resources:
bstorm@toolsbeta-sgebastion-04:~$ kubectl sudo delete pods test-85d69fb4f9-r22rl -n tool-test
pod "test-85d69fb4f9-r22rl" deleted"
NFS, LDAP and User IDs
Warning: This section is entirely obsolete. It needs to be replaced with the information on PSPs and PodPresets.
Kubernetes by default allows users to run their containers with any UID they want, including root (0). This is problematic for multiple reasons:
- They can then mount any path in the worker instance as r/w and do whatever they want. This basically gives random users full root on all the instances
- They can mount NFS and read / write all tools' data, which is terrible and unacceptable.
So by default, being able to access the k8s api is the same as being able to access the Docker socket, which is root equivalent. This is bad for a multi-tenant system like ours, where we'd like to have multiple users running in the same k8s cluster.
Fortunately, unlike docker, k8s does allow us to write admission controllers that can place additional restrictions / modifications on what k8s users can do. We utilize this in the form of a UidEnforcer admission controller that enforces the following:
- All namespaces must have a RunAsUser annotation
- Pods (and their constituent containers) can run only with that UID
In addition, we establish the following conventions:
- Each tool gets its own Namespace
- During namespace creation, we add the RunAsUser annotation to match the UID of the tool in LDAP
- Namespace creation / modification is a restricted operation that only admins can perform.
This essentially provides us with a setup where users who can today run a process with user id X with Grid Engine / Bastions are the only people who can continue to do so with K8S as well. This works out great for dealing with NFS permissions and such as well.
Monitoring
Warning: This section needs a large update.
The Kubernetes cluster contains multiple components responsible for cluster monitoring:
- metrics-server (per-container metrics)
- cadvisor (per-container metrics)
- kube-state-metrics (cluster-level metrics)
Data from those services is fed into the Prometheus servers. We have no alerting yet, but that should change at some point.
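metrics-server also backs kubectl top, which is often the quickest way to eyeball usage before reaching for Prometheus (the tool namespace below is hypothetical):

$ kubectl top nodes                   # per-node CPU and memory, served by metrics-server
$ kubectl top pods -n tool-example    # per-pod usage in one tool namespace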
Docker Images
We only allow running images from the Tools Docker registry, which is available publicly (and inside tools) at docker-registry.tools.wmflabs.org. This is for the following purposes:
- Making it easy to enforce our Open Source Code only guideline
- Make it easy to do security updates when necessary (just rebuild all the containers & redeploy)
- Faster deploys, since this is in the same network (vs dockerhub, which is retrieved over the internet)
- Access control is provided totally by us, less dependent on dockerhub
- Provide required LDAP configuration, so tools running inside the container are properly integrated in the Toolforge environment
This is enforced with a K8S Admission Controller, called RegistryEnforcer. It enforces that all containers come from docker-registry.tools.wmflabs.org, including the Pause container.
The decision to follow this approach was last discussed and re-evaluated at Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T302863_toolforge_byoc.
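The registry speaks the standard Docker Registry v2 HTTP API, so you can see what it serves without docker installed. A read-only sketch (the image name is the base image mentioned below):

$ curl -s https://docker-registry.tools.wmflabs.org/v2/_catalog                          # list of repositories
$ curl -s https://docker-registry.tools.wmflabs.org/v2/toolforge-buster-sssd/tags/list   # tags for one image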
Image building
Images are built on the tools-docker-imagebuilder-01 instance, which is set up with appropriate credentials (and a hole in the proxy for the docker registry) to allow pushing. Note that you need to be root to build / push docker containers. Suggest using sudo -i for it - since docker looks for credentials in the user's home directory, and they are only present in root's home directory.
Building base image
We use base images from https://docker-registry.wikimedia.org/ as the starting point for the Toolforge images. There once was a separate process for creating our own base images, but that system is no longer used.
Building toolforge specific images
These are present in the git repository operations/docker-images/toollabs-images. There is a base image called docker-registry.tools.wmflabs.org/toolforge-buster-sssd that inherits from the wikimedia-buster base image but adds the toolforge debian repository + ldap SSSD support. All Toolforge related images should be named docker-registry.tools.wmflabs.org/toolforge-$SOMETHING. The structure should be fairly self explanatory. There is a clone of it in /srv/images/toolforge on the docker builder host.
You can rebuild any particular image by running the build.py script in that repository. If you give it the path inside the repository where a Docker image lives, it'll rebuild all containers that your image lives from and all the containers that inherit from your container. This ensures that any changes in the Dockerfiles are completely built and reflected immediately, rather than waiting in surprise when something unrelated is pushed later on. We rely on Docker's build cache mechanisms to make sure this doesn't slow down incredibly. It then pushes them all to the docker registry.
Example of rebuilding the python2 images:
$ ssh tools-docker-imagebuilder-01.tools.eqiad1.wikimedia.cloud
$ screen
$ sudo su
$ cd /srv/images/toolforge
$ git fetch
$ git log --stat HEAD..@{upstream}
$ git rebase @{upstream}
$ ./build.py --push python2-sssd/base
By default, the script will build the testing tag of any image, which will not be pulled by webservice and it will build with the prefix of toolforge. Webservice pulls the latest tag. If the image you are working on is ready to be automatically applied to all newly-launched containers, you should add the --tag latest argument to your build.py command:
$ ./build.py --tag latest --push python2-sssd/base
You will probably want to clean up intermediate layers after building new containers:
$ docker ps --no-trunc -aqf "status=exited" | xargs docker rm
$ docker images --no-trunc | grep '<none>' | awk '{ print $3 }' | xargs -r docker rmi
All of the web images install our locally managed toollabs-webservice package. When it is updated to fix bugs or add new features the Docker images need to be rebuilt. This is typically a good time to ensure that all apt managed packages are updated as well by rebuilding all of the images from scratch:
$ ssh tools-docker-imagebuilder-01.tools.eqiad1.wikimedia.cloud
$ screen
$ sudo su
$ cd /srv/images/toolforge
$ git fetch
$ git log --stat HEAD..@{upstream}
$ git reset --hard origin/master
$ ./rebuild_all.sh
See Portal:Toolforge/Admin/Kubernetes/Docker-registry for more info on the docker registry setup.
Managing images available for tools
Available images are managed in image-config. Here is how to add a new image:
- Add the new image name in the image-config repository
  - Deploy this change to toolsbeta: cookbook wmcs.toolforge.k8s.component.deploy --git-url https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/
  - Deploy this change to tools: cookbook wmcs.toolforge.k8s.component.deploy --git-url https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/ --project tools --deploy-node-hostname tools-k8s-control-1.tools.eqiad1.wikimedia.cloud
  - Recreate the jobs-api pods in the Toolsbeta cluster, to make them read the new ConfigMap
    - SSH to the bastion: ssh toolsbeta-sgebastion-05.toolsbeta.eqiad1.wikimedia.cloud
    - Find the pod ids: kubectl get pod -n jobs-api
    - Delete the pods, K8s will replace them with new ones: kubectl sudo delete pod -n jobs-api {pod-name}
  - Do the same in the Tools cluster (same instructions, but use login.toolforge.org as the SSH bastion)
- From a bastion, check you can run the new image with webservice {image-name} shell
- From a bastion, check the new image is listed when running toolforge-jobs images
- Update the Toolforge/Kubernetes wiki page to include the new image
Building new nodes
Bastion nodes
Kubernetes bastion nodes provide kubectl access to the cluster, installed from the thirdparty/k8s repo. This is in puppet and no other special configuration is required.
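A quick sanity check after building or rebuilding a bastion (no special privileges needed):

$ apt-cache policy kubectl    # should show the package coming from the thirdparty/k8s repo
$ kubectl version             # client version; also reports the server version if your credentials work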
Worker nodes
Build new worker nodes according to the information at Portal:Toolforge/Admin/Kubernetes/Deploying#worker_nodes. Worker nodes are where user containers/pods are actually executed. They are large nodes running Debian Buster.
Builder nodes
Builder nodes are where you can create new Docker images and upload them to the Docker registry.
You can provision a new builder node with the following:
- Provision a new instance using a name starting with tools-docker-builder-
- Switch the new instance to the tools puppetmaster using the steps below, and run puppet until it has no errors.
- Edit hiera to set docker::builder_host to the new hostname
- Run puppet on the host named by docker::registry in hiera to allow uploading images
Switch to new puppetmaster
You need to switch the node to the tools puppetmaster first. This is common for all roles, because we require secret storage, which is not possible with the default labs puppetmaster. This process should be made easier / simpler at some point, but until then:
- Make sure puppet has run at least once on the new instance. On the second run it will produce a large blob of red error messages about SSL certificates; just run puppet until you get that.
- Run sudo rm -rf /var/lib/puppet/ssl on the new instance.
- Run puppet on the new instance again. This makes puppet create a new certificate signing request and send it to the puppetmaster. If you get SSL certificate errors at this point, an instance with the same name was previously attached to the puppetmaster and was not decommissioned properly; run <code>sudo puppet cert clean $fqdn</code> on the puppetmaster and then repeat this step and the signing step that follows.
- On the puppetmaster (tools-puppetmaster-02.tools.eqiad1.wikimedia.cloud), run sudo puppet cert sign <fqdn>, where fqdn is the FQDN of the new instance. The signing should not be automated away, since we depend on only signed clients having access to the secrets stored on the puppetmaster.
- Run puppet again on the new instance, and it should now run to completion. A condensed command sketch follows this list.
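For reference, a condensed sketch of the commands above. It assumes <fqdn> is a placeholder for the new instance's FQDN, and that puppet agent --test is how you run puppet interactively on these hosts (an assumption about local convention, not something documented here):
# on the new instance
$ sudo rm -rf /var/lib/puppet/ssl
$ sudo puppet agent --test        # sends a fresh certificate signing request
# on tools-puppetmaster-02.tools.eqiad1.wikimedia.cloud
$ sudo puppet cert sign <fqdn>
# back on the new instance
$ sudo puppet agent --test        # should now complete without SSL errors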
Administrative Actions
Perform these actions from a toolforge bastion.
Quota management
Resource quotas and limit ranges in Kubernetes are set at the namespace scope. Default quotas are created for Toolforge tools by the maintain-kubeusers service. The difference between them is that resource quotas cap how much of a resource all pods in the namespace can use collectively, while limit ranges cap how much CPU or RAM a particular container (not pod) may consume.
To view a quota for a tool, use your admin account (your login user if you are in the tools.admin group) and run kubectl -n tool-$toolname get resourcequotas to list them. In most cases there should be only one, with the same name as the namespace. The easiest way to see its contents is to output the quota as YAML, for example kubectl -n tool-cdnjs get resourcequotas tool-cdnjs -o yaml. Likewise, you can check the limit range with kubectl -n tool-$toolname describe limitranges.
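Putting the inspection commands together for a single tool (tool-mytool is a hypothetical namespace used only for illustration):
$ kubectl -n tool-mytool get resourcequotas                       # list quotas (usually exactly one)
$ kubectl -n tool-mytool get resourcequotas tool-mytool -o yaml   # full quota contents
$ kubectl -n tool-mytool describe limitranges                     # per-container defaults and limits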
If you want to update a quota for a user who has completed the quota request process on Phabricator, it is as simple as editing the Kubernetes object. Your admin account needs to impersonate cluster-admin to do this, for example:
$ kubectl sudo edit resourcequota tool-cdnjs -n tool-cdnjs
The same can be done for a limit range.
$ kubectl sudo edit limitranges tool-mix-n-match -n tool-mix-n-match
Requests can be fulfilled by bumping whichever quota item was approved in the request, but do not change the NodePort services quota from 0, because we do not allow NodePort services for technical reasons.
See also Help:Toolforge/Kubernetes#Namespace-wide_quotas
Node management
You can run these as any user on a kubernetes control node (currently tools-k8s-control-{4,5,6}.tools.eqiad1.wikimedia.cloud). It is ok to kill pods on individual nodes - the controller manager will soon notice they are gone and recreate them elsewhere.
Getting a list of nodes
kubectl get node
Cordoning a node
This prevents new pods from being scheduled on it, but does not kill currently running pods there.
kubectl cordon $node_hostname
Depooling a node
This deletes all running pods on that node as well as marking it as unschedulable. The --delete-emptydir-data --force flags allow deleting PAWS containers (since those won't be automatically respawned).
kubectl drain --ignore-daemonsets --delete-emptydir-data --force $node_hostname
Uncordon/Repool a node
Make sure that the node shows up as 'ready' in kubectl get node before repooling it!
kubectl uncordon $node_fqdn
Decommissioning a node
When you are permanently decommissioning a node, you need to do the following:
- Depool the node (same drain command as above): kubectl drain --ignore-daemonsets --delete-emptydir-data --force $node_fqdn
- Remove the node: kubectl delete node $node_fqdn
- Shut down the node using Horizon or openstack commands (a CLI sketch follows this list)
- (optional) Wait a bit if you feel that this node may need to be recovered for some reason
- Delete the node using Horizon or openstack commands
- Clean its puppet certificate: run sudo puppet cert clean $fqdn on the tools puppetmaster
- Remove it from the list of worker nodes in the profile::toolforge::k8s::worker_nodes hiera key for the haproxy nodes (in the tools-k8s-haproxy prefix puppet)
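For the Horizon / openstack steps above, this is a minimal sketch using the standard OpenStack CLI; it assumes you have CLI credentials for the tools project loaded, and tools-k8s-worker-NN is a placeholder instance name, not a real host:
$ openstack server stop tools-k8s-worker-NN     # placeholder name; shuts the instance down
$ openstack server delete tools-k8s-worker-NN   # only once you are sure the node will not be recovered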
Pod management
Administrative actions related to specific pods/tools.
Pods causing too much traffic
Please read Portal:Toolforge/Admin/Kubernetes/Pod_tracing
Custom admission controllers
To get the security features we need in our environment, we have written and deployed a few additional admission webhooks. Since Kubernetes is written in Go, these admission controllers are written in Go as well, to take advantage of the same client libraries and object types. The custom controllers are largely documented in their README files.
See Portal:Toolforge/Admin/Kubernetes/Custom_components
Ingress Admission Webhook
This prevents Toolforge users from creating arbitrary ingresses that might incorrectly or maliciously route traffic. https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/ingress-admission-controller/
Registry Admission Webhook
This webhook controller prevents pods that use images from external repositories from running. It does not apply to kube-system or other namespaces we specify in the webhook config, because those namespaces use images pulled directly from upstream. https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/registry-admission-webhook/
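A rough way to confirm the webhook is enforcing the policy is to try starting a pod from an external image inside a tool namespace and check that admission is denied. This is only a sketch; tool-mytool is a hypothetical namespace and the exact error text may differ:
$ kubectl sudo run registry-test -n tool-mytool --image=docker.io/library/busybox --restart=Never
# expected: the request is rejected by the registry admission webhook instead of creating the pod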
Volume Admission Webhook
This mutating admission webhook mounts NFS volumes into tool pods labelled with toolforge: tool. It replaced Kubernetes PodPresets, which were removed in the 1.20 update. https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/volume-admission-controller/
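To see which pods currently carry the label that triggers the NFS volume injection, a simple label selector query works; this sketch assumes admin access via the kubectl-sudo plugin:
$ kubectl sudo get pods --all-namespaces -l toolforge=tool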
Common issues
SSHing into a new node doesn't work, asks for password
This is usually because the first puppet run hasn't happened yet. Just wait a bit! If that doesn't help, look at the console log for the instance: if it is *not* at a login prompt, read the logs to see what went wrong.
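If you prefer the CLI over Horizon for reading the console log, the OpenStack client can fetch it; a sketch, assuming project credentials are loaded and tools-sgebastion-NN is a placeholder instance name:
$ openstack console log show tools-sgebastion-NN   # placeholder name; dumps the instance console output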
Node naming conventions
Node type | Prefix | How to find active one?
---|---|---
Kubernetes Control Node | tools-k8s-control- | Hiera: profile::toolforge::k8s::control_nodes
Kubernetes worker node | tools-k8s-worker- | Run kubectl get node on a kubernetes control node
Kubernetes etcd | tools-k8s-etcd- | All nodes with the given prefix, usually
Docker Registry | tools-docker-registry- | The node that docker-registry.tools.wmflabs.org resolves to
Docker Builder | tools-docker-builder- | Hiera: docker::builder_host
Bastions | tools-sgebastion | DNS: login.toolforge.org and dev.toolforge.org
Web Proxies | tools-proxy | DNS: tools.wmflabs.org. and toolforge.org. Hiera:
GridEngine worker node (Debian Stretch) | tools-sgeexec-09 |
GridEngine webgrid node (Lighttpd, Stretch) | tools-sgewebgrid-lighttpd-09 |
GridEngine webgrid node (Generic, Stretch) | tools-sgewebgrid-generic-09 |
GridEngine master | tools-sgegrid-master |
GridEngine master shadow | tools-sgegrid-shadow |
Redis | tools-redis | Hiera: active_redis
Mail | tools-mail |
Cron runner | tools-sgecron- | Hiera: active_cronrunner
Elasticsearch | tools-elastic- |