PAWS/Admin

Introduction

PAWS is a Jupyterhub deployment that runs in the PAWS Cloud VPS project. The main Jupyterhub login is accessible at https://hub.paws.wmcloud.org/hub/login; it is a public service that can be authenticated to via Wikimedia OAuth. More end-user information is at PAWS. Beyond a plain Jupyterhub deployment, PAWS also provides easy access to the wiki replicas and to the wikis themselves via the OAuth grant and pywikibot.

Kubernetes cluster

Deployment

The PAWS Kubernetes cluster is built to similar specifications as the Toolforge cluster: puppet prepares the system, and the Kubernetes layer is deployed natively with kubeadm. As such, the deployment is nearly identical to the process described for Toolforge. After you have the base layer of Kubernetes deployed using the procedures outlined for Toolforge and the yaml deployed by puppet for PAWS, you can proceed with deploying Jupyterhub itself.
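
Before moving on to the Jupyterhub deployment, a quick sanity check of the base layer from a control node (these are standard kubectl commands, nothing PAWS-specific):

kubectl get nodes -o wide                  # all nodes should report Ready
kubectl --namespace=kube-system get pods   # control plane pods should be Running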

Upgrading

Upgrading should follow the same schedule and technique as upgrading Toolforge Kubernetes, because this is a similar kubeadm-plus-puppet cluster. Regular upgrades are essential for staying ahead of CVEs (via point releases) and for keeping the cluster's certificates fresh; Kubernetes 1.16+ has tooling for manually refreshing certs, but it is not a situation you want to be in. If one of the control plane pods isn't staying live, remember to check that nothing strange has ended up in the kubeadm-config configmap in the kube-system namespace.
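
A quick way to check both of those things from a control plane node (on older kubeadm releases the certs subcommand lives under kubeadm alpha certs):

sudo kubeadm certs check-expiration                                      # certificate freshness
kubectl --namespace=kube-system get configmap kubeadm-config -o yaml    # kubeadm cluster configuration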

The only special consideration here is to make sure that the Jupyterhub helm chart isn't trying to deploy deprecated objects. Objects that are deprecated in Kubernetes will continue to work, but they will cause issues when upgrading and deploying Jupyterhub. The best fix is usually to get a PR merged upstream or to upgrade our version of Jupyterhub.

Architecture

We opted for a stacked control plane like in the original build, but set it up as a redundant three-node cluster. To maintain HA for both the control plane and the services, two haproxy servers sit in front of the cluster with a floating IP managed by keepalived, which should be capable of automatic failover. DNS simply points at that IP.


A simple diagram is as follows:

PAWS Design.png

General Build

With the exception of the stacked control plane and a few specific services, nearly the entire build re-uses the security and puppet design of Toolforge Kubernetes. By using helm 3, we were able to avoid diverging from secure RBAC and Pod Security Policies. Upgrades should be conducted on the same cycle as Toolforge upgrades, but the component repositories used (which are separated by major k8s version) allow the upgrade schedules to diverge if required. An ingress exists (not on this diagram) for the deploy-hook service, but it is disabled in the first iteration while some kinks in the process are worked out.

Floating IP

The floating IP is our second service using a manually-provisioned Neutron port, with IP 172.16.1.171/32, managed with keepalived using this procedure: Portal:Cloud VPS/Admin/Keepalived. That IP is NAT'd to the public IP 185.15.56.57/32.
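
For orientation, a keepalived VRRP stanza for this kind of floating IP looks roughly like the sketch below; the real configuration is managed by puppet per the linked procedure, and the interface name, router id, and priority here are placeholders:

vrrp_instance paws_vip {
    state BACKUP            # both haproxy nodes start as BACKUP; priority decides the master
    interface eth0          # placeholder interface name
    virtual_router_id 51    # placeholder id
    priority 100
    virtual_ipaddress {
        172.16.1.171/32     # the manually-provisioned Neutron port IP
    }
}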

Ports

At the load balancer layer (haproxy), routing is done by port, either back to the Kubernetes control plane service on the control plane nodes or to the ingresses on the dedicated ingress worker nodes. The control plane is reached at the usual port, TCP 6443, on both the frontend and the backend. The ingress layer is served at the well-known web ports (TCP 80 and 443), which hit the dedicated ingress worker nodes on a NodePort service at port 30000. The Neutron security group paws-loadbalancer prevents internet clients from contacting the k8s API at this time.
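
In haproxy terms the routing amounts to something like the following sketch; the real configuration is generated by puppet, and the server names and backend IPs here are placeholders:

frontend k8s_api
    bind *:6443
    mode tcp
    default_backend k8s_control_plane

frontend web
    # TCP 80/443; in the real config the acme-chief TLS certs are terminated here (see TLS below)
    bind *:80
    bind *:443
    mode tcp
    default_backend ingress_nodes

backend k8s_control_plane
    mode tcp
    server paws-k8s-control-1 <control-1-ip>:6443 check
    server paws-k8s-control-2 <control-2-ip>:6443 check
    server paws-k8s-control-3 <control-3-ip>:6443 check

backend ingress_nodes
    mode tcp
    server paws-k8s-ingress-1 <ingress-1-ip>:30000 check
    server paws-k8s-ingress-2 <ingress-2-ip>:30000 check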

TLS

TLS certificates are issued via acme-chief and distributed to the haproxy load balancer layer. TLS is therefore terminated outside the cluster, and the TLS ingress options in the helm chart are turned off.

Users

The maintain-kubeusers service used in Toolforge runs on PAWS, granting members of the paws.admin group the same privileges that members of the tools.admin group have in Toolforge. The certs for these users are automatically renewed as they come close to their expiration date. Where cluster-admin is required directly rather than through the usual impersonation method, such as for using the helm command directly, root on paws-k8s-control-1/2/3 has that access.
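
The impersonation method referred to here is the same pattern used in the database section later on this page, for example:

kubectl --as admin --as-group system:masters --namespace prod get pods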

Helm

Helm 3 is used to deploy kubernetes applications on the cluster. It is installed by puppet via a Debian package. The community-supported ingress-nginx controller is deployed from its own helm chart, but the ingress objects themselves are all managed in the PAWS helm chart. As this is helm 3, there is no tiller, and RBAC affects what you can do.
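
To see what is currently deployed with helm (run where you have cluster-admin, e.g. as root on a control node), the standard helm 3 commands apply:

helm list --all-namespaces      # releases and their chart versions
helm status paws -n prod        # state of the main PAWS release
helm history paws -n prod       # past revisions, useful before a rollback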

Add a worker

Create a worker following the naming scheme paws-k8s-worker-<number>.

Do the standalone puppet dance: Help:Standalone puppetmaster

  • From the worker:
    • sudo puppet agent -tv
    • sudo rm -rf /var/lib/puppet/ssl
  • From the puppetmaster:
    • sudo -i puppet cert list
    • sudo -i puppet cert sign <client-fqdn>
  • From the worker:
    • sudo puppet agent -tv
  • From a k8s control node:
    • kubeadm token create --print-join-command
  • From the worker:
    • <output from kubeadm command> (a sketch of its shape is below)
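
The join command printed by kubeadm token create --print-join-command looks roughly like this (all values are placeholders); run it as root on the new worker:

sudo kubeadm join <control-plane-endpoint>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>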

General notes

  • The control plane uses a converged or "stacked" etcd system. Etcd runs in containers deployed by kubeadm directly on the control plane nodes. Therefore, it is unwise to ever turn off 2 control plane nodes at once since it will cause problems for the etcd raft election system.
  • The control plane and haproxy nodes are part of separate anti-affinity server groups so that Openstack will not schedule them on the same hypervisor. Worker nodes are placed in a soft anti-affinity server group.
  • Ingress controllers are deployed to dedicated ingress worker nodes, which also take advantage of being in an anti-affinity server group.
  • To see the status of the k8s control plane pods (coredns, kube-proxy, calico, etcd, kube-apiserver, kube-controller-manager), run kubectl --namespace=kube-system get pod -o wide.
  • Prometheus stats and metrics-server are deployed in the metrics namespace during cluster build via kubectl apply -f $yaml-file, just like in the Toolforge deploy documentation.
  • Because of the pod security policies in place, all init containers have been removed from the paws-project version of things. Privileged containers cannot be run inside the prod namespace.

Jupyterhub deployment

Jupyterhub & PAWS Components

Jupyterhub is a set of systems deployed together that provide Jupyter notebook servers per user. The three main subsystems of Jupyterhub are the Hub, the Proxy, and the Single-User Notebook Server. A good overview of these systems is available at http://jupyterhub.readthedocs.io/en/latest/reference/technical-overview.html.

PAWS is a Jupyterhub deployment (Hub, Proxy, Single-User Notebook Server) with some added bells and whistles. Some additional PAWS-specific pods in our deployment are:

PAWS also includes customized versions of some Jupyterhub images:

  • singleuser: Since this is the environment for end users, there is a fair bit going on here. Our image is a replacement for the upstream one. We set the correct UID and directory. We install the jupyterhub/lab code directly from pip, along with PyWikiBot, a small library called ipynb-paws that allows importing a notebook like a python package (along the lines of import paws.$username.$notebook_name), and code from https://github.com/toolforge/nbpawspublic to add a public-link button. There are other customizations as well, since this image is a convenient surface for them. The general goal is to get a notebook up and running for use on the wikis as fast as possible.
  • paws-hub: We build on the upstream Jupyterhub hub image just a touch, adding bits that respect more of the UID settings and adding a custom culling script. The code for doing OAuth is actually inserted via the helm chart instead.

The other custom image is a deploy-hook, which is undergoing some renovations before it is redeployed in the cluster.

Deployment

  • The PAWS repository is at https://github.com/toolforge/paws. It should be cloned locally, and then the git-crypt key needs to be used to unlock the secrets.yaml file. See one of the PAWS admins if you think you should have access to this key.
  • PAWS is built via GitHub Actions triggered by a PR. GitHub Actions will also update values.yaml to match any new container that is built.
  • The command used to deploy it right now, run from inside an unlocked git checkout, is:
helm install paws --namespace prod ./paws -f paws/secrets.yaml -f paws/production.yaml --timeout=50m

If you are deploying to an actual paws cluster, you will also need the ingress controller:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
kubectl create ns ingress-nginx-gen2
helm install -n ingress-nginx-gen2 ingress-nginx-gen2 ingress-nginx/ingress-nginx --values ingress/values.yaml

Apply the Pod Security Policy with kubectl apply -f paws/ingress/nginx-ingress-psp.yaml, and the controllers themselves with kubectl apply -f paws/ingress/nginx-ingress-psp.yaml. Please note that you will need your dedicated ingress worker nodes deployed (puppet looks for the name prefix paws-k8s-ingress-) for this to do anything, because there are tolerations and affinities for those nodes.

If PAWS is already deployed, do not use the "install" command; change it to "upgrade" to deploy changes/updates, such as:

helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml -f paws/production.yaml --timeout=50m

Database

JupyterHub uses a database in Trove to manage the user state. Credentials are in secrets.yaml.
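
The connection settings live in the encrypted secrets.yaml under jupyterhub.hub.db (the sqlite commands below override those keys). A rough sketch of the shape of that setting; the values are placeholders and the mysql+pymysql dialect is an assumption:

jupyterhub:
  hub:
    db:
      url: mysql+pymysql://<user>:<password>@<trove-host>:3306/<database>   # placeholders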

Moving to sqlite

During ToolsDB outages we can change the db to in-memory sqlite without significant impact.

The smoothest way is to do a helm upgrade as root on a control node (as above, in an unlocked checkout) with this command:

helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml -f paws/production.yaml --set=jupyterhub.hub.db.url="sqlite://" --set=jupyterhub.hub.db.type=sqlite

You can roll back to ToolsDB by going into an unlocked checkout of https://github.com/toolforge/paws and running helm again without the overrides:

helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml -f paws/production.yaml

Without using helm

If you don't have an unlocked checkout and you are using your user account on a shell on one of the k8s control plane hosts, you can also manually edit the configmap to do this:

$ kubectl --as admin --as-group system:masters --namespace prod edit configmap hub-config

Write down the existing value, then set hub.db_url to "sqlite://".

Restart the hub with

$ kubectl --as admin --as-group system:masters -n prod delete pod $(kubectl get pods --namespace prod|grep hub|cut -f 1 -d ' ')

To move it back, set hub.db_url to the previous value (if you didn't write it down before you changed it, see /home/bstorm/src/paws/paws/secrets.yaml at jupyterhub.hub.db.url) and restart the hub with

$ kubectl --as admin --as-group system:masters -n prod delete pod $(kubectl get pods --namespace prod|grep hub|cut -f 1 -d ' ')

Common administrative actions

Some common administrative actions are collected below.

Deleting user data in case of spam or credential leaks

In the event that a notebook or file hosted on PAWS needs an admin to remove it immediately (rather than asking the user to delete it), you can access all user data via the NFS share mounted locally on all k8s nodes.

  • SSH to a worker or control node such as paws-k8s-worker-1.paws.eqiad1.wikimedia.cloud.
  • Become root with sudo -i
  • cd /data/project/paws/userhomes: this is the top level of user homes and paws public pages.
  • cd $wiki_user-id, where $wiki_user-id is the numeric id of the user, not the text username.
  • Remove the offending file with rm as needed.
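
Putting those steps together, a minimal sketch (the user id and file name below are placeholders):

ssh paws-k8s-worker-1.paws.eqiad1.wikimedia.cloud
sudo -i
cd /data/project/paws/userhomes/<wiki_user_id>
rm <offending-file>.ipynb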

Stop a running workload in PAWS

Paws-activity.png

Useful if you want to stop a crypto miner or similar.

You need to be an admin inside PAWS.

  1. Log in to PAWS, likely https://hub.paws.wmcloud.org/hub/home
  2. Click the Admin button in the top menu. If you don't have the button, you aren't an admin.
  3. Search in the list for the workload you want to stop
  4. Click the Stop server button

Bonus points if you check the user against https://meta.wikimedia.org/wiki/Special:CentralAuth for additional hints on whether the user is a bad actor.

Prevent a user from using PAWS

As of this writing, the only method we know of is to talk to a CheckUser or in-wiki admin to global-block the user, which breaks the OAuth login that PAWS uses.

TODO: link is probably: https://meta.wikimedia.org/wiki/Meta:Requests_for_help_from_a_sysop_or_bureaucrat