Kubernetes (often abbreviated k8s) is an open-source system for automating the deployment and management of applications running in containers.
Kubernetes components generally fall into two 'planes': the control plane and the worker plane. You can also find more information about the general architecture of Kubernetes (along with a nice diagram!) on GitHub.
This refers to the 'master' components, which provide a unified view of the entire cluster. Currently most of these (except etcd) run on a single node, with HA scheduled to be set up soon.
Kubernetes stores all state in etcd - all other components are stateless. The etcd cluster is accessed directly only by the API server and no other component. Direct access to this etcd cluster is equivalent to root on the entire k8s cluster, so it is firewalled off to be reachable only from the instance running the k8s 'master' (aka the rest of the control plane).
We currently use a 3-node cluster, named
tools-k8s-etcd-0[1-3]. They're all smallish Debian Jessie instances configured by the same etcd puppet code we use in production.
This is the heart of the kubernetes control plane - it mediates access to all state stored in etcd for all other components (both in the control plane & the worker plane). It is purely a data access layer, containing no logic related to any of the actual end-functionality kubernetes offers. It offers the following functionality:
- Authentication & Authorization
- Read / Write access to all the API endpoints
- Watch functionality for endpoints, which notifies clients when state changes for a particular resource
When you are interacting with the kubernetes API, this is the server that is serving your requests.
The API server runs on the k8s master node, currently
tools-k8s-master-01. It listens on port 6443 (with TLS enabled, using the puppet cert for the host). It also listens on localhost without TLS; this insecure bind bypasses all authentication. It runs as the 'kubernetes' user.
All other cluster-level functions are currently performed by the Controller Manager. For instance,
ReplicationController objects are created and updated by the replication controller (which constantly checks & spawns new pods if necessary), and nodes are discovered, managed, and monitored by the node controller. The general idea is that of a 'reconciliation loop': poll/watch the API server for desired state and current state, then perform actions to make them match.
The Controller Manager also runs on the k8s master node, currently
tools-k8s-master-01 and communicates with the API server over the unsecured localhost bind. It runs as the 'kubernetes' user.
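The reconciliation loop described above can be sketched in a few lines of Python. This is purely illustrative (the real controllers are part of the Go-based Controller Manager); the function and action names here are made up:

```python
# Illustrative sketch of a reconciliation loop in the spirit of the
# replication controller: compare desired state with observed state
# and compute the actions needed to converge. Not the real code.

def reconcile(desired_replicas, running_pods):
    """Return the actions needed to make reality match the spec."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: spawn the missing ones.
        return [("spawn", None)] * diff
    if diff < 0:
        # Too many pods: kill the excess.
        excess = -diff
        return [("kill", pod) for pod in running_pods[:excess]]
    return []  # converged: nothing to do

# The real system runs this forever, polling/watching the API server:
#   while True: apply(reconcile(spec, observe())); sleep(...)
```

The key property is that each pass is stateless: the controller never remembers what it did last time, it only compares spec against reality, which is why all state can live in etcd.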
This simply polls the API for pods in an unscheduled state and binds them to a specific node. This too is a conceptually very simple reconciliation loop, and it will be made pluggable later (and hence isn't part of the Controller Manager).
The scheduler runs on the k8s master node and communicates with the API server over the unsecured localhost bind. It runs as the 'kubernetes' user.
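Conceptually the scheduler's loop looks something like the following sketch. Real scheduling considers resources and constraints; this hypothetical version just assigns nodes round-robin to show the "find unscheduled pods, bind them" shape:

```python
# Toy scheduler loop: find pods with no node assigned and bind each
# to a node. Round-robin placement is an illustrative simplification.

from itertools import cycle

def schedule(pods, nodes):
    """Bind every unscheduled pod to a node, round-robin."""
    node_iter = cycle(nodes)
    for pod in pods:
        if pod.get("node") is None:        # unscheduled pod
            pod["node"] = next(node_iter)  # "bind" it to a node
    return pods
```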
The worker plane refers to the components of the nodes on which actual user code is executed in containers. In tools these are named
tools-worker-****, and run as Debian Jessie instances.
Kubelet is the interface between kubernetes and the container engine (in our case, Docker). It checks for new pods scheduled on the node it is running on, and makes sure they are running with appropriate volumes / images / permissions. It also performs health checks on the running pods and updates their state in the k8s API. You can think of it as a reconciliation loop: it checks which pods must (or must not) be running on its node, and makes sure that matches reality.
This runs on each node and communicates with the k8s API server over TLS, authenticated with a client certificate (puppet node certificate + CA). It runs as root since it needs to communicate with docker, and being granted access to docker is root equivalent.
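The kubelet's per-node reconciliation can be sketched as a simple set difference, assuming (hypothetically) that pods and containers can be matched by name:

```python
# Sketch of the kubelet's per-node loop: compare the pods the API
# server says should run on this node against what Docker reports,
# then start/stop as needed. Illustrative only, not the real kubelet.

def kubelet_actions(assigned_pods, running_containers):
    """Return which pods to start and which containers to stop."""
    assigned = set(assigned_pods)
    running = set(running_containers)
    return {
        "start": sorted(assigned - running),  # scheduled but not running
        "stop": sorted(running - assigned),   # running but no longer scheduled
    }
```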
kube-proxy is responsible for making sure that k8s service IPs work across the cluster. We run it in iptables mode, so it uses iptables NAT rules to make this happen. Its reconciliation loop is to fetch the list of service IPs across the cluster and make sure NAT rules for all of them exist on the node.
It runs as root, since it needs to use iptables. You can list the rules on any worker node with
iptables -t nat -L
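kube-proxy's loop is the same reconciliation idea applied to NAT rules. A sketch of the diff it computes (the rule text here is an illustrative shape, not the exact rules kube-proxy writes):

```python
# Illustrative sketch of kube-proxy's sync: given the cluster's service
# IPs and the NAT rules currently on the node, compute which rules to
# add and which stale ones to delete. The rule format is made up.

def proxy_sync(service_ips, existing_rules):
    """Return (rules_to_add, rules_to_delete) for this node."""
    wanted = {f"-t nat -d {ip} -j KUBE-SERVICES" for ip in service_ips}
    have = set(existing_rules)
    return sorted(wanted - have), sorted(have - wanted)
```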
We're currently using Docker as our container engine. We pull from upstream's deb repos directly, and pin versions in puppet. We run it in a slightly different configuration than straight upstream, primarily preventing it from making iptables-related changes (since flannel handles those for us). These changes are made in the systemd unit file we use to replace the upstream-provided one.
Note that we don't have a clear docker upgrade strategy yet.
Flannel is the container overlay network we use to allow all the containers to think they're on the same network. We currently use a /16 (192.168.0.0/16), from which each node gets a /24 and allocates an IP per container. We use the VXLAN backend of flannel, which seems to produce fairly low overhead and avoids userspace proxying. We also have flannel do IP masquerading. We integrate flannel with Docker via our modifications to the Docker systemd unit.
Flannel expects its configuration to come from etcd, so we have a separate etcd cluster (
tools-flannel-etcd-0[1-3]) serving just this purpose.
We use Kubernetes' token auth system to authenticate users. We have a list of user accounts in the primitive CSV format, maintained via puppet + the private puppet repo on the tools puppetmaster (
/var/lib/git/labs/secrets/hieradata/common.yaml on tools-puppetmaster-01). Currently the token is manually copied to an individual user's
~/.kube/config when they ask, but this is clearly not a scalable solution. We need to figure out a new authentication setup at some point in the future.
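For reference, the token file the API server reads is a CSV with one line per account, of the form token,username,uid with an optional quoted groups column. A sketch of parsing it (the sample rows below are fake):

```python
# Sketch of parsing a Kubernetes token auth CSV file:
#   token,username,uid[,"group1,group2"]
# The tokens, usernames, and uids here are invented examples.

import csv
import io

sample = io.StringIO(
    'abc123,tools.mytool,52503\n'
    'def456,some-admin,1000,"system:admins"\n'
)

users = {}
for row in csv.reader(sample):
    token, user, uid = row[0], row[1], row[2]
    groups = row[3].split(",") if len(row) > 3 else []
    users[token] = {"user": user, "uid": uid, "groups": groups}

print(users["abc123"]["user"])   # tools.mytool
```

The API server simply looks up the bearer token presented by a client in this table, which is why distributing tokens by hand into ~/.kube/config works at all.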
We build on top of Kubernetes' Attribute Based Access Control to have four kinds of accounts:
- Namespaced accounts
- Admin accounts
- Infrastructure Readonly Accounts
- Infrastructure Full Access Accounts
Tool accounts are Namespaced accounts - for each tool we create a Kubernetes Namespace, and inside the namespace they have access to create a whitelisted set of resources (RCs, Pods, Services, Secrets). There will be a resource based (CPU/IO/Disk) quota imposed on a per-namespace basis at some point in the future.
Admin accounts have unrestricted access to everything in all namespaces!
Infrastructure Readonly Accounts provide read-only access to all resources in all namespaces. This is currently used for services like prometheus / kube2proxy. Infrastructure Full Access accounts are similar, but also have write access. Both of these types should be replaced with more narrowly scoped accounts in the future.
NFS, LDAP and User IDs
Kubernetes by default allows users to run their containers with any UID they want, including root (0). This is problematic for multiple reasons:
- They can then mount any path on the worker instance read/write and do whatever they want. This essentially gives arbitrary users full root on all the instances
- They can mount NFS and read / write all tools' data, which is terrible and unacceptable.
So by default, being able to access the k8s api is the same as being able to access the Docker socket, which is root equivalent. This is bad for a multi-tenant system like ours, where we'd like to have multiple users running in the same k8s cluster.
Fortunately, unlike docker, k8s does allow us to write admission controllers that can place additional restrictions / modifications on what k8s users can do. We utilize this in the form of a
UidEnforcer admission controller that enforces the following:
- All namespaces must have a RunAsUser annotation specifying a UID
- Pods (and their constituent containers) can run only with that UID
In addition, we establish the following conventions:
- Each tool gets its own Namespace
- During namespace creation, we add the RunAsUser annotation to match the UID of the tool in LDAP
- Namespace creation / modification is a restricted operation that only admins can perform.
This essentially gives us a setup in which the users who can run a process as user id X today via Grid Engine / Bastions are the only ones who can do so on K8S as well. It also keeps NFS file permissions working correctly.
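The check UidEnforcer performs can be sketched as follows. This is a hypothetical Python rendering of the logic (the real admission controller runs inside the API server); the annotation key matches the convention described above:

```python
# Sketch of the UidEnforcer admission check: a pod is admitted only if
# it runs as exactly the UID recorded in its namespace's RunAsUser
# annotation. Function and parameter names are illustrative.

def admit(pod_uid, namespace_annotations):
    """Return True if the pod may run in this namespace."""
    enforced = namespace_annotations.get("RunAsUser")
    if enforced is None:
        return False                       # namespaces must carry the annotation
    return str(pod_uid) == str(enforced)   # pod may only run as that UID
```

Because only admins can create or modify namespaces, a tool cannot change its own annotation, so it can never escalate to another tool's UID or to root.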