PAWS/Admin

Introduction

PAWS is a Jupyterhub deployment that runs in the PAWS Cloud VPS project. The main Jupyterhub login is accessible at https://hub.paws.wmcloud.org/hub/login, and is a public service that can be authenticated to via Wikimedia OAuth. More end-user info is at PAWS. Beyond the basic Jupyterhub deployment, PAWS also provides easy access to the wiki replicas, to the wikis themselves via the OAuth grant, and to pywikibot.

Kubernetes cluster

Deployment

The PAWS Kubernetes cluster is built to similar specifications as the Toolforge cluster: puppet prepares the systems, and the Kubernetes layer is deployed natively with kubeadm. As such, the deployment is nearly identical to the process described for Toolforge.
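
A rough illustration of the stacked HA control plane pattern in kubeadm (a generic sketch with placeholder values, not the exact PAWS invocation; follow the Toolforge build documentation for the real procedure):

# On the first control plane node, point kubeadm at the load-balanced API
# endpoint (the haproxy/keepalived address described under Architecture below).
sudo kubeadm init --control-plane-endpoint "<haproxy-floating-ip>:6443" --upload-certs

# The remaining control plane nodes join as additional control plane members,
# using the token, CA hash and certificate key printed by the init above.
sudo kubeadm join "<haproxy-floating-ip>:6443" --control-plane \
    --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --certificate-key <certificate-key>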

Architecture

As in the original build, we opted to use a stacked control plane, but this time as a redundant three-node cluster. To maintain HA for both the control plane and the services, two haproxy servers sit in front of the cluster with a floating IP managed by keepalived, which should fail over automatically. DNS simply points at that IP.
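
A quick way to see which haproxy node currently holds the floating IP (a sketch; the haproxy host names here are hypothetical):

for host in paws-k8s-haproxy-1 paws-k8s-haproxy-2; do
    echo "== ${host} =="
    # The keepalived-managed floating IP only appears on the current MASTER.
    ssh "${host}" 'ip -brief -4 address show'
done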


A simple diagram is as follows:

[File: PAWS Design.png — PAWS cluster design diagram]

Apart from the introduction of keepalived, the stacked control plane, and a few specific services, nearly the entire build re-uses the security and puppet design of Toolforge Kubernetes. By using helm 3, we were able to prevent any divergence from secure RBAC and Pod Security Policies. Upgrades should normally be conducted on the same cycle as Toolforge upgrades, but the component repositories used (which are separated by major k8s version) allow the upgrade schedules to diverge if required. An ingress exists (not on this diagram) for the deploy-hook service, but it is disabled in the first iteration while we work out some kinks in the process.
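
A couple of quick checks that the helm 3 and Pod Security Policy setup is as expected (these use the admin impersonation described under Current Setup below):

helm version --short                                             # should report a v3.x client; there is no tiller
kubectl --as=admin --as-group=system:masters get podsecuritypolicies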

Current Setup

  • The maintain-kubeusers service used in Toolforge runs on PAWS, granting the same privileges to admin users in the paws.admin group as members of the tools.admin group get in Toolforge. The certs for these users are automatically renewed as they come close to their expiration date. Where cluster-admin is required directly rather than through the usual impersonation method, such as for using the helm command directly, root@paws-k8s-control-1/2/3 has that access (example commands using the impersonation method follow this list).
  • The three k8s control plane nodes are fully redundant behind a pair of redundant haproxy nodes.
  • The haproxy nodes have a keepalived-maintained IP address that should automatically fail over when one goes down, leaving no single point of failure in the system.
  • The control plane uses a converged or "stacked" etcd system. Etcd runs in containers deployed by kubeadm directly on the control plane nodes. Therefore, it is unwise to ever turn off two control plane nodes at once, since that would cost etcd its raft quorum.
  • The control plane and haproxy nodes are part of separate anti-affinity server groups so that Openstack will not schedule them on the same hypervisor.
  • Ingress controllers are deployed to dedicated ingress worker nodes, which also take advantage of being in an anti-affinity server group.
  • Helm 3 is used to deploy Kubernetes applications on the cluster. It is installed by puppet via a Debian package. The community-supported ingress-nginx is deployed by hand, but the ingress objects are all managed in the helm chart. As this is helm 3, there is no tiller, and RBAC governs what you can do.
  • To see the status of the k8s control plane pods (coredns, kube-proxy, calico, etcd, kube-apiserver, kube-controller-manager), run kubectl --namespace=kube-system get pod -o wide.
  • Prometheus stats and metrics-server are deployed in the metrics namespace during cluster build via kubectl apply -f $yaml-file, just like in the Toolforge deploy documentation.
  • Because of pod security policies in place, all init containers have been removed from the paws-project version of things. Privileged containers cannot be run inside the prod namespace.
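
Some example status checks from a control plane node, using the impersonation method mentioned above (drop the --as/--as-group flags if your access does not require them):

kubectl --as=admin --as-group=system:masters get nodes -o wide
kubectl --namespace=kube-system get pod -o wide
kubectl --as=admin --as-group=system:masters --namespace=metrics get pod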

Jupyterhub deployment

Jupyterhub & PAWS Components

Jupyterhub is a set of systems deployed together that provide Jupyter notebook servers per user. The three main subsystems of Jupyterhub are the Hub, the Proxy, and the Single-User Notebook Server. A really good overview of these systems is available at http://jupyterhub.readthedocs.io/en/latest/reference/technical-overview.html.

PAWS is a Jupyterhub deployment (Hub, Proxy, Single-User Notebook Server) with some added bells and whistles. The additional PAWS-specific parts of our deployment are listed below (an example command for seeing the running components follows the list):

  • db-proxy: Mysql-proxy plugin script to perform simple authentication to the Wiki Replicas. See https://github.com/toolforge/paws/blob/master/images/db-proxy/auth.lua
  • nbserve and render: nbserve is the nginx proxy run from Toolforge in the paws-public tool that handles URL rewriting for paws-public URLs, and render handles the actual rendering of an ipynb notebook as a static page. Together they make paws-public possible.
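
To see the PAWS and JupyterHub components that run inside the cluster (the hub, proxy, db-proxy and user notebook pods; nbserve and render run in Toolforge instead), list the pods in the prod namespace:

kubectl --as=admin --as-group=system:masters --namespace prod get pods -o wide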

Deployment

  • The PAWS repository is at https://github.com/toolforge/paws. It should be cloned locally (currently the rebuilt cluster is running off a fork of this at https://github.com/crookedstorm/paws). Then the git-crypt key needs to be used to unlock the secrets.yaml file; see one of the PAWS admins if you should have access to this key (a sketch of this checkout workflow is at the end of this section).
  • PAWS will be deployed with Travis CI as it is in tools, and the dashboard is at https://travis-ci.org/toolforge/paws. The configuration for the Travis builds is at https://github.com/toolforge/paws/blob/master/.travis.yml, and builds and deploys launch the travis-script.bash script with appropriate parameters. However, this is not going to work at first, so please deploy via helm directly until the deploy-hook and CI setup is revisited.
  • To deploy via helm directly, you need to know some parameters, because the values.yaml file of the helm chart both lacks some sane defaults (TODO) and requires certain parameters no matter what so that it deploys the right versions of the images. At a bare minimum, you will need to know the right images and tags for some of the images. Because the rebuilt cluster is running on sqlite before full deployment, the command currently used to deploy it, run from inside an unlocked git checkout (currently of the forked repo), is:
helm install paws --namespace prod ./paws -f paws/secrets.yaml --set=jupyterhub.hub.db.url="sqlite://" --set=jupyterhub.hub.db.type=sqlite

If already deployed, do not use the "install" command. Change that to "upgrade" to deploy changes/updates, such as:

helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml --set=jupyterhub.hub.db.url="sqlite://" --set=jupyterhub.hub.db.type=sqlite
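
To check what is currently deployed before or after a change, the usual helm 3 commands work (run them wherever you have cluster-admin, e.g. as root on a control plane node):

helm --namespace prod list              # shows the paws release and its current revision
helm --namespace prod status paws       # release status and chart notes
helm --namespace prod history paws      # revision history, useful if a rollback is ever needed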

TODO: When we move off sqlite to toolsdb, drop that part.
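
A sketch of the local checkout workflow from the first bullet above (the key file path is hypothetical; get the git-crypt key from a PAWS admin):

git clone https://github.com/toolforge/paws.git
cd paws
git-crypt unlock /path/to/paws-git-crypt.key    # decrypts paws/secrets.yaml in the working tree
# From here, run the helm install/upgrade commands shown above.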

Database

JupyterHub uses a database to keep user state; currently it uses ToolsDB. It can be switched to sqlite when ToolsDB is having an outage, but that generally doesn't scale as well. It should be moved to its own database server (ideally a Trove system) as soon as possible.

Moving to sqlite

During ToolsDB outages we can change the db to in-memory sqlite without significant impact.

From a control plane node such as paws-k8s-control-1:

kubectl --as=admin --as-group=system:masters --namespace prod edit configmap hub-config

set hub.db_url to "sqlite://"

Restart the hub with kubectl --as=admin --as-group=system:masters -n prod delete pod $(kubectl get pods -n prod|grep hub|cut -f 1 -d ' ')
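
Equivalently, assuming the hub pod carries the standard zero-to-jupyterhub labels (an assumption; verify with kubectl -n prod get pods --show-labels first), a label selector avoids the grep/cut pipeline:

# Assumes the hub pod is labelled component=hub, as in stock zero-to-jupyterhub
kubectl --as=admin --as-group=system:masters -n prod delete pod -l component=hub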

To move it back you can set hub.db_url to the previous value (see /home/bstorm/src/paws/secrets.yaml at jupyterhub.hub.db.url) and restart the hub with kubectl --as=admin --as-group=system:masters -n prod delete pod $(kubectl get pods -n prod|grep hub|cut -f 1 -d ' ') .

If in doubt, you can also reset the cluster with helm by going into an unlocked checkout of https://github.com/toolforge/paws and running helm upgrade paws --namespace prod ./paws -f paws/secrets.yaml.