'''Kubernetes''' (often abbreviated '''k8s''') is an open-source system for automating the deployment and management of applications running in [[W:Operating-system-level virtualization|containers]]. Kubernetes was selected in 2015 by the Labs team as the replacement for Grid Engine in the Tool Labs project.<ref>[[mailarchive:labs-announce/2015-September/000071.html|[Labs-announce] [Tools] Kubernetes picked to provide alternative to GridEngine]]</ref> Usage of k8s by Tools began in mid-2016.<ref>[[mailarchive:labs-announce/2016-June/000130.html|[Labs-announce] Kubernetes Webservice Backend Available for PHP webservices]]</ref>
 
{{Notice|For help on using kubernetes in Tool Labs, see the [[Help:Tool_Labs/Kubernetes|Kubernetes help]] documentation.}}
 
== Terminology ==
 
Kubernetes comes with its own set of jargon, some of which is listed here.
 
=== Pod [http://kubernetes.io/docs/user-guide/pods/ k8s user guide] ===
 
A pod is a collection of containers that share the network, IPC, and hostname namespaces. This means multiple containers can:
 
# Connect to each other securely via localhost (since that is shared)
# Communicate via traditional Linux IPC mechanisms
 
All containers in a pod will always be scheduled together on the same node, and started / killed together. This makes the pod the smallest unit of deployment in kubernetes.
 
When a pod dies, it may be restarted on the same node depending on its RestartPolicy. This must not be relied on for resilience, however - if the node dies, the pod is gone. Hence you should never really create pods directly; instead use ReplicaSets / Deployments / Jobs to manage them.
 
Each pod gets its own IP address from the overlay network we use.
 
=== Service [http://kubernetes.io/docs/user-guide/services/ k8s user guide] ===
 
Services provide a stable IP by which you can connect to a set of pods. Pods are ephemeral - they come and go and switch IPs as they please, but a service IP is stable from the time of creation and will route in a round-robin fashion to all the pods it serves.
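
As a quick illustration (the tool and service names are made up and the output is only indicative of the kubectl version we run), you can see a service's stable cluster IP and the pod endpoints behind it with <code>kubectl</code>:

<syntaxhighlight lang="shell-session">
$ kubectl get svc --namespace=mytool
NAME      CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
mytool    192.168.12.34   <none>        8000/TCP   3d
$ kubectl describe svc mytool --namespace=mytool | grep Endpoints
Endpoints:              192.168.1.5:8000,192.168.2.7:8000
</syntaxhighlight>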
 
=== ReplicaSet [http://kubernetes.io/docs/user-guide/replicasets/ k8s user guide] ===
 
(replaces [http://kubernetes.io/docs/user-guide/replication-controller/ ReplicationControllers])
 
A lot of kubernetes operates on reconciliation loops that do the following:
 
# Check if a specified condition is met
# If not, perform actions to try to make the specified condition true.
 
For a ReplicaSet, the condition it tries to keep true is that a given number of instances of a particular pod template are always running. So it sits in a loop, checking how many pods with the given specification are running, and starting / killing pods to make sure that matches the expected number. You can use a replica count of 1 to make sure that a pod is always running in at least one instance - this makes it resilient against node failures as well. ReplicaSets are also units of horizontal scaling - you can increase the number of pods by just setting the replica count on the ReplicaSet managing them.
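
For example (the ReplicaSet name and namespace are hypothetical; the output is only indicative), scaling is just a matter of changing the replica count:

<syntaxhighlight lang="shell-session">
$ kubectl get rs --namespace=mytool
NAME        DESIRED   CURRENT   AGE
mytool-rs   1         1         2d
$ kubectl scale rs mytool-rs --namespace=mytool --replicas=3
replicaset "mytool-rs" scaled
</syntaxhighlight>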
 
=== Deployment [http://kubernetes.io/docs/user-guide/deployments/ k8s user guide] ===
 
This is a higher-level object that spawns and manages ReplicaSets. The biggest use for it is that it allows zero-downtime rolling deployments with health checks, but at the moment we aren't using any of those features.
 
== Components ==
 
K8S components are generally in two 'planes' - the control plane and the worker plane. You can also find more info about the general architecture of kubernetes (along with a nice diagram!) on [https://github.com/kubernetes/kubernetes/blob/master/docs/design/architecture.md github].
 
=== Control Plane ===
 
This refers to the 'master' components that provide a unified view of the entire cluster. Currently most of these (except etcd) run on a single node, with HA scheduled to be set up soon{{cn}}.
 
==== Etcd ====
 
Kubernetes stores all state in [[etcd]] - all other components are stateless. The etcd cluster is only accessed directly by the API Server and no other component. Direct access to this etcd cluster is equivalent to root on the entire k8s cluster, so it is firewalled off to only be reachable from the instance running the k8s 'master' (i.e. the rest of the control plane).
 
We currently use a 3 node cluster, named <code>tools-k8s-etcd-0[1-3]</code>. They're all smallish Debian Jessie instances configured by the same etcd puppet code we use in production.
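
A minimal health-check sketch (these are standard etcd v2 commands; depending on how these instances are configured you may need to pass explicit <code>--endpoints</code> and client certificates):

<syntaxhighlight lang="shell-session">
$ ssh tools-k8s-etcd-01.tools.eqiad.wmflabs
$ etcdctl cluster-health
$ etcdctl member list
</syntaxhighlight>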
 
==== API Server ====
 
This is the heart of the kubernetes control plane - it mediates access to all state stored in etcd for all other components (both in the control plane & the worker plane). It is purely a data access layer, containing no logic related to any of the actual end-functionality kubernetes offers. It offers the following functionality:
* Authentication & Authorization
* Validation
* Read / Write access to all the API endpoints
* Watch functionality for endpoints, which notifies clients when state changes for a particular resource
When you are interacting with the kubernetes API, this is the server that is serving your requests.
 
The API server runs on the k8s master node, currently <code>tools-k8s-master-01</code>. It listens on port 6443 (with TLS enabled, using the puppet cert for the host). It also listens on localhost, without TLS and with an insecure bind that bypasses all authentication. It runs as the 'kubernetes' user.
 
It is accessible internally via the domain <code>k8s-master.tools.wmflabs.org</code>, using the *.tools.wmflabs.org certificate. This allows all nodes, including ones that aren't using the custom puppetmaster, to access the k8s master.
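
As an illustrative sketch (the API path is the standard pod-listing endpoint; the tool namespace and the way the token is pulled out of <code>.kube/config</code> are assumptions), you can talk to the API server directly over TLS using the token auth described below:

<syntaxhighlight lang="shell-session">
$ TOKEN=$(grep 'token:' ~/.kube/config | awk '{print $2}')
$ curl -s -H "Authorization: Bearer $TOKEN" \
    https://k8s-master.tools.wmflabs.org:6443/api/v1/namespaces/mytool/pods
</syntaxhighlight>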
 
==== Controller Manager ====
 
All other cluster-level functions are currently performed by the Controller Manager. For instance, <code>ReplicationController</code> objects are created and updated by the replication controller (constantly checking & spawning new pods if necessary), and nodes are discovered, managed, and monitored by the node controller. The general idea is one of a 'reconciliation loop' - poll/watch the API server for desired state and current state, then perform actions to make them match.
 
The Controller Manager also runs on the k8s master node, currently <code>tools-k8s-master-01</code> and communicates with the API server over the unsecured localhost bind. It runs as the 'kubernetes' user.
 
==== Scheduler ====
 
This simply polls the API for [http://kubernetes.io/docs/user-guide/pods/ pods] that are in an unscheduled state & binds them to a specific node. This is also a conceptually very simple reconciliation loop, and will be made pluggable later (and hence isn't part of the Controller Manager).
 
The scheduler runs on the k8s master node and communicates with the API server over the unsecured localhost bind. It runs as the 'kubernetes' user.
 
=== Worker plane ===
 
The worker plane refers to the components of the nodes on which actual user code is executed in containers. In tools these are named <code>tools-worker-****</code>, and run as Debian Jessie instances.
 
==== Kubelet ====
 
Kubelet is the interface between kubernetes and the container engine (in our case, [[W:Docker (software)|Docker]]). It checks for new pods scheduled on the node it is running on, and makes sure they are running with appropriate volumes / images / permissions. It also does the health checks of the running pods & updates their state in the k8s API. You can think of it as a reconciliation loop where it checks what pods must be running / not-running on its node, and makes sure that matches reality.
 
This runs on each node and communicates with the k8s API server over TLS, authenticated with a client certificate (puppet node certificate + CA). It runs as root since it needs to communicate with docker, and being granted access to docker is root equivalent.
 
==== Kube-Proxy ====
 
kube-proxy is responsible for making sure that k8s service IPs work across the cluster. We run it in iptables mode, so it uses iptables NAT rules to make this happen. Its reconciliation loop is to get the list of service IPs across the cluster, and make sure NAT rules for all of them exist on the node.
 
This is run as root, since it needs to use iptables. You can list the rules on any worker node with <code>iptables -t nat -L</code>.
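
For example (the chain name is the one kube-proxy conventionally creates in iptables mode; exact output varies by version and by the services defined):

<syntaxhighlight lang="shell-session">
$ # every kubernetes Service should show up as a jump to a KUBE-SVC-* chain here
$ sudo iptables -t nat -L KUBE-SERVICES | head
</syntaxhighlight>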
 
==== Docker ====
 
We're currently using Docker as our container engine. We pull from upstream's deb repos directly, and pin versions in puppet. We run it in a slightly different configuration than straight upstream, primarily preventing it from making iptables-related changes (since flannel handles those for us). These changes are made in the systemd unit file that we use to replace the upstream-provided one.
 
Note that we don't have a clear docker upgrade strategy yet.
 
==== Flannel ====
 
Flannel is the container overlay network we use to allow all the containers to think they're on the same network. We currently use a /16 (192.168.0.0/16), from which each node gets a /24 and allocates an IP per container. We use the VXLAN backend of flannel, which seems to produce fairly low overhead & avoids userspace proxying. We also have flannel do IP masquerading. We integrate flannel with docker via our modifications to the docker systemd unit.
 
Flannel expects its configuration to come from an etcd, so we have a separate etcd cluster (<code>tools-flannel-etcd-0[1-3]</code>) serving just this purpose.
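
A sketch for inspecting the network configuration flannel reads from its etcd (the key shown is flannel's default prefix and the endpoint URL is illustrative; both may differ in our setup):

<syntaxhighlight lang="shell-session">
$ etcdctl --endpoints http://tools-flannel-etcd-01.tools.eqiad.wmflabs:2379 \
    get /coreos.com/network/config
{"Network": "192.168.0.0/16", "Backend": {"Type": "vxlan"}}
</syntaxhighlight>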
 
=== Proxy ===
 
We need to be able to get http requests from the outside internet to pods running on kubernetes. While normally you would use an [http://kubernetes.io/docs/user-guide/ingress/ Ingress] service for this, we instead have a hacked-up equivalent in <code>kube2proxy.py</code>. This is mostly because there were no non-AWS/GCE ingress providers when we started. This will be replaced with a real ingress provider soon.
 
The script works by doing the following:
 
# Look for all services in all namespaces that have a label <code>tools.wmflabs.org/webservice</code> set to the string <code>"true"</code>
# Add a rule to redis routing <code>tools.wmflabs.org/$servicename</code> to that service's IP address
# This redis rule is interpreted by the code in the dynamicproxy module that we also use for routing gridengine webservices, and requests get routed appropriately.
 
This allows both gridengine-based and kubernetes-based webservices to co-exist under the tools.wmflabs.org domain.
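
To see which services kube2proxy will pick up (the label selector is the one described above; the output depends on which tools are currently running webservices):

<syntaxhighlight lang="shell-session">
$ kubectl get svc --all-namespaces -l tools.wmflabs.org/webservice=true
</syntaxhighlight>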
 
=== Infrastructure centralized logging ===
 
We aggregate all logs from syslog (so docker, kubernetes components, flannel, etcd, etc) on all kubernetes-related hosts into a central instance. This is both for simplicity and to try to capture logs that would otherwise be lost to kernel issues. You can see these logs on the logging host, which can be found in [[Hiera:Tools]] as <code>k8s::sendlogs::centralserver</code>, in <code>/srv/syslog</code>. The current central logging host is <code>tools-logs-01</code>. Note that this is not related to logging for applications running on top of kubernetes at all.
 
== Authentication & Authorization ==
 
We use Kubernetes' [http://kubernetes.io/docs/admin/authentication/ token auth] system to authenticate users. This information is maintained in a CSV format. The source of this info is twofold - the <code>maintain-kubeusers</code> script for tool accounts & puppet for non-tool accounts.
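
The static token file Kubernetes reads is one CSV line per user: token, user name, user id, and optionally a quoted comma-separated list of groups. An illustrative (entirely made-up) entry from <code>/etc/kubernetes/tokenauth</code>:

<syntaxhighlight lang="shell-session">
$ sudo head -1 /etc/kubernetes/tokenauth
0123456789abcdef,tools.mytool,tools.mytool
</syntaxhighlight>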
 
We build on top of Kubernetes' [http://kubernetes.io/docs/admin/authorization/ Attribute Based Access Control] to have three kinds of accounts:
 
# Namespaced accounts (tool accounts)
# Infrastructure Readonly Accounts
# Infrastructure Full Access Accounts
 
Tool accounts are Namespaced accounts - for each tool we create a Kubernetes Namespace, and inside that namespace they have access to create a whitelisted set of resources (RCs, Pods, Services, Secrets, etc). A resource-based (CPU/IO/Disk) quota will be imposed on a per-namespace basis at some point in the future.
 
Infrastructure Readonly Accounts provide read-only access, but to all resources in all namespaces. This is currently used for services like prometheus / kube2proxy. Infrastructure Full Access accounts are similar, but also have write access. These two types should be replaced with far more narrowly scoped accounts in the future.
 
The script <code>maintain-kubeusers</code> is responsible for the following:
# Creating a namespace for each tool with the proper annotation
# Providing a <code>.kube/config</code> file for each tool
# Creating the homedir for each new tool (so that we can do #2 reliably)
# Writing out <code>/etc/kubernetes/abac.json</code> to provide proper access controls for all kinds of accounts
# Writing out <code>/etc/kubernetes/tokenauth</code> to provide proper tokens + user names for all kinds of accounts
# Reading <code>/etc/kubernetes/infrastructure-users</code> provisioned by puppet to know about non-namespaced accounts
 
== NFS, LDAP and User IDs ==
 
Kubernetes by default allows users to run their containers with any UID they want, including root (0). This is problematic for multiple reasons:
 
# They can then mount any path in the worker instance as r/w and do whatever they want. This basically gives random users full root on all the instances.
# They can mount NFS and read / write all tools' data, which is terrible and unacceptable.
 
So by default, being able to access the k8s api is the same as being able to access the Docker socket, which is root equivalent. This is bad for a multi-tenant system like ours, where we'd like to have multiple users running in the same k8s cluster.
 
Fortunately, unlike docker, k8s does allow us to write [http://kubernetes.io/docs/admin/admission-controllers/ admission controllers] that can place additional restrictions / modifications on what k8s users can do. We utilize this in the form of a <code>UidEnforcer</code> admission controller that enforces the following:
 
# All namespaces must have a <code>RunAsUser</code> annotation
# Pods (and their constituent containers) can run only with that UID
 
In addition, we establish the following conventions:
 
# Each tool gets its own Namespace
# During namespace creation, we add the RunAsUser annotation to match the UID of the tool in LDAP
# Namespace creation / modification is a restricted operation that only admins can perform.
 
This essentially provides us with a setup where the users who can today run a process as user id X via Grid Engine / the bastions are the only people who can continue to do so with k8s as well. This also works out great for dealing with NFS permissions and the like.
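
To check what a tool's namespace looks like (the tool name and UID are made up; the annotation key is whatever <code>maintain-kubeusers</code> writes, shown here as described above):

<syntaxhighlight lang="shell-session">
$ kubectl get namespace mytool -o yaml | grep -A 2 annotations
  annotations:
    RunAsUser: "52503"
</syntaxhighlight>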
 
== Monitoring ==
We've decided to use [https://prometheus.io/ Prometheus] for metrics collection & monitoring (and eventually alerting too). There's a publicly visible setup available at https://tools-prometheus.wmflabs.org/tools. There are also dashboards on [https://grafana-labs-admin.wikimedia.org Labs Grafana]. There is a [https://grafana-labs-admin.wikimedia.org/dashboard/db/kubernetes-tool-combined-stats per-tool statistics dashboard], as well as [https://grafana-labs-admin.wikimedia.org/dashboard/db/tools-activity an overall activity dashboard]. We have no alerting yet, but that should change at some point.
 
== Docker Images ==
 
We only allow running images from the Tools Docker registry, which is available publicly (and inside tools) at <code>docker-registry.tools.wmflabs.org</code>. This is for the following reasons:
 
# Making it easy to enforce our Open Source Code only guideline
# Make it easy to do security updates when necessary (just rebuild all the containers & redeploy)
# Faster deploys, since this is on the same network (vs dockerhub, from which images are retrieved over the internet)
# Access control is provided totally by us, less dependent on dockerhub
 
This is enforced with a K8S Admission Controller, called RegistryEnforcer. It enforces that all containers come from docker-registry.tools.wmflabs.org, including the Pause container.
 
=== Image building ===
 
Images are built on the '''tools-docker-builder-05''' instance, which is set up with appropriate credentials (and a hole in the proxy for the docker registry) to allow pushing. Note that you need to be root to build / push docker containers. We suggest using <code>sudo -i</code> for this - docker looks for credentials in the user's home directory, and they are only present in root's home directory.
 
==== Building base image ====
 
We have a base 'wikimedia' image (named <code>docker-registry.tools.wmflabs.org/wikimedia-jessie</code>) that is built using the command <code>build-base-images</code> on the image builder instance. This code uses [https://github.com/andsens/bootstrap-vz bootstrapvz] to build the image and push it to the registry, and the specs can be found in the operations/puppet.git repository under <code>modules/docker/templates/images</code>.
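
A minimal sketch of a base image rebuild (the builder host is the one named above; run as root because the registry credentials live in root's home directory):

<syntaxhighlight lang="shell-session">
$ ssh tools-docker-builder-05.tools.eqiad.wmflabs
$ sudo -i
# build-base-images
</syntaxhighlight>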
 
==== Building toollabs specific images ====
 
These are present in the git repository <code>operations/docker-images/toollabs-images</code>. There is a toollabs base image called <code>docker-registry.tools.wmflabs.org/toollabs-jessie</code> that inherits from the wikimedia-jessie base image but adds the toollabs debian repository + ldap NSS support. All toollabs-related images should be named <code>docker-registry.tools.wmflabs.org/toollabs-$SOMETHING</code>. The structure should be fairly self-explanatory. There is a clone of it in <code>/srv/images/toollabs-images</code> on the docker builder host.
 
You can rebuild any particular image by running the <code>build.py</code> script in that repository. If you give it the path inside the repository where a Docker image lives, it'll rebuild all containers that your image is built from ''and'' all the containers that inherit from your container. This ensures that any changes in the Dockerfiles are completely built and reflected immediately, rather than surprising you when something unrelated is pushed later on. We rely on Docker's build cache mechanisms to keep this from being unbearably slow. It then pushes them all to the docker registry.
 
Example of rebuilding the python2 images:
<syntaxhighlight lang="shell-session">
$ ssh tools-docker-builder-05.tools.eqiad.wmflabs
$ screen
$ sudo su
$ cd /srv/images/toollabs
$ git fetch
$ git log --stat HEAD..@{upstream}
$ git rebase @{upstream}
$ ./build.py --push python2/base
</syntaxhighlight>
 
== Building new nodes ==
 
This section documents how to build new k8s related nodes.
 
=== Bastion nodes ===
 
Kubernetes bastion nodes provide the following:
 
# <code>kubectl</code> access to the cluster
# A running <code>kube-proxy</code> so you can hit kubernetes service IPs
# A running <code>flannel</code> so you can hit kubernetes pod IPs
 
You can provision a new bastion node with the following:
 
# Add the bastion node's fqdn under the <code>k8s::bastion_hosts</code> in [[Hiera:Tools]]. This allows the flannel etcds to open up their firewalls for this bastion so flannel can reach etcd for its operations. You now need to either run puppet on all the flannel etcd hosts (<code>tools-flannel-etcd-0[1-3].tools.eqiad.wmflabs</code>) or wait ~20mins.
# Switch to the new puppetmaster
# Apply the role <code>role::toollabs::k8s::bastion</code> to the instance
# Run puppet
# Run <code>sudo -i deploy-bastion <builder-host> <tag></code>, where builder-host is the host on which kubernetes is built (currently tools-docker-builder-03.tools.eqiad.wmflabs), and tag is the current tag of kubernetes that is deployed (you can find out the tag by looking at the file <code>modules/toollabs/manifests/kubebuilder.pp</code> in the operations/puppet.git repo).
# Run puppet again on the bastion instance.
# Look at the kube-proxy logs and flannel logs to make sure they're up.
# Attempt to hit a pod IP or service IP to make sure it works.
 
=== Worker nodes ===
 
Worker nodes are where user containers/pods are actually executed. They are large nodes running Debian Jessie.
 
Kubernetes worker runs the following:
 
# <code>kubelet</code> to manage the pods, perform health checks, etc
# A running <code>kube-proxy</code> so you can hit kubernetes service IPs
# A running <code>flannel</code> so you can hit kubernetes pod IPs
# A <code>docker</code> daemon that manages the actual containers
 
You can provision a new worker node with the following:
 
# Add the worker node's fqdn under the <code>k8s::worker_hosts</code> in [[Hiera:Tools]]. This allows the flannel etcds to open up their firewalls for this worker so flannel can reach etcd for its operations. You now need to either run puppet on all the flannel etcd hosts (<code>tools-flannel-etcd-0[1-3].tools.eqiad.wmflabs</code>) or wait ~20mins.
# Switch the worker to the new puppetmaster (see [[#Switch to new puppetmaster|below]]), and run puppet until it has no errors.
# Look at the docker, kube-proxy, kubelet and flannel logs to make sure they're up (see the sketch after this list).
# Run <code>kubectl get nodes</code> on the k8s master to make sure that the node is marked as ready.
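
A minimal verification sketch (the worker host name is hypothetical, and the systemd unit names are assumptions about how the services are installed):

<syntaxhighlight lang="shell-session">
$ ssh tools-worker-1001.tools.eqiad.wmflabs
$ sudo journalctl -u docker -u kubelet -u kube-proxy -u flannel --since "1 hour ago" | tail -n 50
$ ssh tools-k8s-master-01.tools.eqiad.wmflabs
$ kubectl get nodes | grep tools-worker-1001
tools-worker-1001.tools.eqiad.wmflabs   Ready     10m
</syntaxhighlight>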
 
=== Builder nodes ===
 
Builder nodes are where you can create new Docker images and upload them to the Docker registry.
 
You can provision a new builder node with the following:
# Provision a new instance using a name starting with <code>tools-docker-builder-</code>
# Switch the node to the new puppetmaster from the steps below, and run puppet until it has no errors.
# Edit hiera to set <code>docker::builder_host</code> to the new hostname
# Run puppet on the host named by <code>docker::registry</code> in hiera to allow uploading images
 
=== Switch to new puppetmaster ===
 
You need to switch the node to the tools puppetmaster first. This is common for all roles, and is needed because we require secret storage, which is impossible with the default labs puppetmaster. This process should be made easier / simpler at some point, but until then...
 
# Make sure puppet has run at least once on the new instance. On the second run, it will produce a large blob of red error messages about SSL certificates, so just run puppet until you get that :)
# Run <code>sudo rm -rf /var/lib/puppet/ssl</code> on the new instance.
# Run puppet on the new instance again. This will make puppet create a new certificate signing request and send it to the puppetmaster. If you get errors similar to [[phab:P3623|this]], it means there already existed an instance with the same name attached to the puppetmaster that wasn't decommissioned properly. You can run <code>sudo puppet cert clean $fqdn</code> on the puppetmaster and then repeat steps 3 and 4.
# On the puppetmaster (<code>tools-puppetmaster-01.tools.eqiad.wmflabs</code>), run <code>sudo puppet cert sign <fqdn></code>, where fqdn is the fqdn of the new instance. The signing should not be automated away, since we depend on only signed clients having access to the secrets we store in the puppetmaster.
# Run puppet again on the new instance, and it should run to completion now!
 
== Administrative Actions ==
 
=== Node management ===
 
You can run these as any user on the kubernetes master node (currently <code>tools-k8s-master-01.tools.eqiad.wmflabs</code>). It is ok to kill pods on nodes - the controller manager will notice they are gone soon and recreate them elsewhere.
 
==== Getting a list of nodes ====
 
<code>kubectl get node</code>
 
==== Depooling a node ====
 
This deletes all running pods on that node as well as marking it as unschedulable. The <code>--delete-local-data --force</code> flags allow deleting PAWS containers (since those won't be automatically respawned).
 
<code>kubectl drain --delete-local-data --force $node_fqdn</code>
 
==== Cordoning a node ====
 
This prevents new pods from being scheduled on it, but does not kill currently running pods there.
 
<code>kubectl cordon $node_fqdn</code>
 
==== Repooling a node ====
 
Make sure that the node shows up as 'ready' in <code>kubectl get node</code> before repooling it!
 
<code>kubectl uncordon $node_fqdn</code>
 
==== Decommissioning a node ====
 
When you are permanently decommissioning a node, you need to do the following:
 
# Depool the node
# Clean its puppet certificate: Run <code>sudo puppet cert clean $fqdn</code> on the tools puppetmaster
# Remove it from the list of worker nodes in [[Hiera:Tools]].
 
== Upgrading Kubernetes ==
 
We try to keep up with kubernetes versions as they come out!
 
=== Code ===
 
We keep our Kubernetes code on Gerrit, in [https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/kubernetes operations/software/kubernetes]. We keep code referenced by tags, and usually it is the upstream version followed by the string 'wmf' and then a monotonically increasing number - so you end up deploying tags such as <code>v1.3.3wmf1</code>. There isn't really much active code review there yet; the repository is set up for direct push, including force push, so be careful :)
 
=== Patches ===
 
We have a bunch of patches on top of upstream, mostly around access control. These are in the repository, but not documented clearly anywhere. TODO: list all the patches, what they do, and why we have them!
 
=== Building ===
 
==== Building debian packages ====
 
We've gone to some lengths to debianize the kubernetes packages in order to provide a way of deploying kubernetes without having to resort to scap3 (which is best suited for service deployment) or ad-hoc hacks. While the end result works, it's not a really good debianization effort and as such not fit for inclusion upstream. The main reason is that it uses docker and downloads docker images off the internet during the build.
 
Requirements are:
 
* A VM/physical box with internet access
* Docker, with enough disk space to host the build container images as well as the containers themselves. Failing that, the error messages will be hugely cryptic. A sane value seems to be around '''20G''' currently.
* Enough memory (failing that will result in cryptic errors). A sane value seems to be around '''6G''' currently.
 
On Tool Labs, you can use the current docker builder instance to do this.
 
The stuff below assumes you have some basic Debian package building knowledge. A quick tutorial can be found at https://wiki.debian.org/BuildingTutorial
 
Here's the rundown (a consolidated sketch follows after this list):
# <code>git clone https://gerrit.wikimedia.org/r/operations/debs/kubernetes</code>
# Take a quick look into the debian subdirectory. That's the stuff we mostly mess with.
# Fetch the upstream version specified in the top debian/changelog entry from the kubernetes releases. Currently that will be at https://github.com/kubernetes/kubernetes/archive/v<version>.tar.gz, NOT at https://github.com/kubernetes/kubernetes/releases or https://get.k8s.io/. We just want the tar.gz file, not the release file - only the source code. All of the above holds true up to the 1.4 series; documentation will be updated for 1.5 when we migrate to it.
# Name it correctly (kubernetes_<upstream-version>.orig.tar.gz) and place it at the same hierarchy level as your git clone.
# Do whatever changes you need to do (restrict yourself to changes in the debian/ directory).
## Whatever patches we apply live in the debian/patches directory (exactly 1 currently). Updating these is done using quilt. Teaching quilt is outside the scope of this page; a good tutorial is at https://wiki.debian.org/UsingQuilt. A quick way to get started is:
### <code>quilt pop</code> to unapply a patch.
### <code>quilt push</code> to apply a patch.
### <code>quilt refresh</code> to update the currently applied patch from the current state of the repo.
# Create a new debian revision using <code>dch -i</code> (assuming you want a new version of the package out).
# Install the prerequisites required by the package: <code>apt-get install dh-systemd</code>.
# Run <code>dpkg-buildpackage -us -uc -sa</code>. The user this runs as should either be in the docker group or use sudo - the kubernetes packaging we use requires docker to build.
# Wait about 40-60 minutes. Fix your errors if there are any, and rinse and repeat.
# Grab the built packages.
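
A consolidated sketch of the steps above (the version number is illustrative - use the one from the top of debian/changelog):

<syntaxhighlight lang="shell-session">
$ git clone https://gerrit.wikimedia.org/r/operations/debs/kubernetes
$ wget -O kubernetes_1.4.6.orig.tar.gz \
    https://github.com/kubernetes/kubernetes/archive/v1.4.6.tar.gz
$ cd kubernetes
$ dch -i                            # new debian revision, if needed
$ sudo apt-get install dh-systemd   # build prerequisite
$ sudo dpkg-buildpackage -us -uc -sa
</syntaxhighlight>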
 
Note that there's a TODO item to actually make this work with git-buildpackage so we don't have to fetch the versions manually but rather have them autogenerated from the git repo.
 
==== Building the old way (deprecated as of March 2017) ====
 
The stuff below is deprecated and kept only for historical reasons.
 
We have a puppet class called <code>toollabs::kubebuilder</code> that provisions the things required to build kubernetes for our use case. These are:
 
# A git clone of <code>operations/software/kubernetes</code> under <code>/srv/build/kubernetes</code>
# Docker (which is required to build kubernetes)
# A build script, <code>build-kubernetes</code>, that checks out the appropriate tag (with our patches on top) and builds kubernetes in the recommended upstream way.
# A web server that serves the built binaries for tools to grab and deploy.
 
This class is included in the docker builder role, so you should use the active docker builder to build kubernetes.
 
So when a new version of kubernetes comes out, you do the following:
 
# Clone the kubernetes repo locally, get the tag for the newest version
# Cherry-pick our patches on top of the new version (TODO provide exact details of patch!)
# Push new tag with cherry picked patches with a new number:
## <code>git tag ${upstreamtagname}wmf${wmfversion}</code>
## <code>git push origin --tags</code>
# Set version number to your new version in kubebuilder.pp in ops/puppet and merge it
# Run puppet on docker builder so the build script picks up the new version
# Run <code>build-kubernetes</code>. This will fetch the new code and build it, and run all E2E tests. Make sure they all pass, including the tests for the patches!
# If they all pass, yay! You have successfully built kubernetes! If not, find out why they failed, fix the patches, and tag with an increased number, push, try again until you succeed.
 
Note that adding the tag to kubebuilder.pp is awkward, but that's the only place we track the 'currently deployed version', so it is important.
 
== Custom admission controllers ==
 
To get the security features we need in our environment, we have written and deployed a few additional [http://kubernetes.io/docs/admin/admission-controllers/ admission controllers]. Since kubernetes is written in Go, so are these admission controllers. They need to live in the same repository as the go source code, and are hence maintained as patches on the upstream kubernetes source.
 
=== UidEnforcer ===
 
This enforces, for each pod:
 
# It belongs in a namespace (already enforced)
# The namespace has an annotation <code>RunAsUser</code> set to a numeric value
# The pod can run only with UID & GID set to the same value as the <code>RunAsUser</code> annotation. If this isn't true, this admission controller will modify the pod specification such that it is.
 
In the code that creates namespaces for tools automatically (<code>maintain-kubeusers</code>), we set this annotation to match the UID of each tool. This prevents users from impersonating other users, and also from running as root.
 
=== RegistryEnforcer ===
 
Enforces that pods can only use containers from a specified docker registry. This is enforced by rejecting all pods that do not start with the configured registry name in their image spec. The registry name is passed in via the commandline flag <code>--enforced-docker-registry</code>. If the value passed in is, say, <code>docker-registry.tools.wmflabs.org</code>, then only images of the form <code>docker-registry.tools.wmflabs.org/<something></code> will be allowed to run on this cluster.
 
=== HostPathEnforcer ===
 
We want to allow users to mount [http://kubernetes.io/docs/user-guide/volumes/#hostpath hostPath] volumes into their containers - this is how our NFS mounts are made available in containers. We have them mounted on the k8s worker nodes (via puppet), and then just hostPath mount them into the containers. This simplifies management a lot, but it also means we have to allow users to mount hostPath volumes into containers. This is a potential security issue, since you could theoretically mount <code>/etc</code> from the host as read-write and do things to it. We already have protection in the form of UidEnforcer, but this is additional protection.
 
It allows you to whitelist paths / path prefixes (comma-separated values to <code>--host-paths-allowed</code> and <code>--host-path-prefixes-allowed</code>), and only those paths / path prefixes are allowed to be mounted as hostPaths. Pods that attempt to mount other paths will be rejected.
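
An illustrative sketch of how these flags might be passed (presumably on the API server, since that is where admission controllers run; the whitelisted paths shown here are made up, not our actual configuration):

<syntaxhighlight lang="shell-session">
$ # hypothetical flags on the kube-apiserver command line
$ kube-apiserver ... \
    --admission-control=...,HostPathEnforcer,... \
    --host-paths-allowed=/var/run/nslcd/socket \
    --host-path-prefixes-allowed=/data/project/,/public/dumps/
</syntaxhighlight>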
 
=== HostAutoMounter ===
 
We want ''some'' host paths to be mounted in ''all'' containers, regardless of whether they were in the spec or not. We currently do this for <code>/var/run/nslcd/socket</code> only, to allow libnss-ldap to work inside containers as-is.
 
This is configured with the commandline parameter <code>--host-automounts</code>, which takes a comma separated list of paths. These paths will be mounted from each host to all containers.
 
=== Deprecation plan ===
We will have to keep an eye on [http://kubernetes.io/docs/user-guide/pod-security-policy/ Pod Security Policy] - it is the feature being built to replace all our custom admission controllers. Once it can do all the things we need to, we should get rid of our custom controllers and switch to it.
 
== Common issues ==
 
=== SSHing into a new node doesn't work, asks for password ===
 
Usually this is because the first puppet run hasn't happened yet. Just wait for a bit! If that doesn't work, look at the console log for the instance - if it is ''not'' at a login prompt, read the logs to see what is up.
 
== Node naming conventions ==
{| class="wikitable"
!Node type
!Prefix
!How to find active one?
|-
|Kubernetes Master
|tools-k8s-master-
|Hiera: <code>k8s::master_host</code>
|-
|Kubernetes worker node
|tools-worker-
|Run <code>kubectl get node</code> on the kubernetes master host
|-
|Kubernetes etcd
|tools-k8s-etcd-
|All nodes with the given prefix, usually
|-
|Flannel etcd
|tools-flannel-etcd-
|All nodes with the given prefix, usually
|-
|Docker Registry
|tools-docker-registry-
|The node that <code>docker-registry.tools.wmflabs.org</code> resolves to
|-
|Docker Builder
|tools-docker-builder-
|Hiera: <code>docker::builder_host</code>
|-
|Bastions
|tools-bastion
|DNS: tools-login.wmflabs.org and tools-dev.wmflabs.org
|-
|Web Proxies
|tools-proxy
|DNS: tools.wmflabs.org.
Hiera: <code>active_proxy_host</code>
|-
|GridEngine worker node
(Ubuntu Precise)
|tools-exec-12
|
|-
|GridEngine worker node
(Ubuntu Trusty)
|tools-exec-14
|
|-
|GridEngine webgrid node
(Lighttpd, Precise)
|tools-webgrid-lighttpd-12
|
|-
|GridEngine webgrid node
(Lighttpd, Trusty)
|tools-webgrid-lighttpd-14
|
|-
|GridEngine webgrid node
(Generic, Trusty)
|tools-webgrid-generic-14
|
|-
|GridEngine master
|tools-grid-master
|
|-
|GridEngine master shadow
|tools-grid-shadow
|
|-
|Redis
|tools-redis
|Hiera: <code>active_redis</code>
|-
|Mail
|tools-mail
|
|-
|Cron runner
|tools-cron-
|Hiera: <code>active_cronrunner</code>
|-
|Elasticsearch
|tools-elastic-
|
|}
 
== References ==
{{Reflist|colwidth=30em}}
