
Portal:Toolforge/Admin/Kubernetes/jobs: Difference between revisions

{{Tracked|T285944}}
 
This page contains information about the Toolforge Jobs Framework, an architecture to support [[Help:Toolforge/Grid | grid-like jobs]] on Toolforge kubernetes.
 
== The framework ==
 
The framework is called '''Toolforge Jobs Framework''' (or '''TJF'''). The main component is a REST API to ease end user interaction with Toolforge jobs in the kubernetes cluster. The API abstracts away most of the k8s gory details for configuring, removing, managing and reading status on jobs. The abstraction approach is similar to what is being done with [[Help:Toolforge/Web | Toolforge webservices]] (we have the <code>webservice</code> command there), but here most of the business logic lives in an API service.
 
By splitting the software into several components and introducing a stable API, we aim to reduce the maintenance burden: we no longer need to rebuild all Toolforge docker containers every time we change some internal mechanism (as is the case with the <code>tools-webservice</code> package).
 
[[File:Toolforge_jobs.png|center|500px]]
 
The framework consists of 3 components:
* '''jobs-framework-api''' ([https://gerrit.wikimedia.org/r/admin/repos/cloud/toolforge/jobs-framework-api gerrit]) ([https://gerrit.wikimedia.org/g/cloud/toolforge/jobs-framework-api gitiles]) --- uses [https://flask-restful.readthedocs.io flask-restful] and runs inside the k8s cluster as a webservice. Offers the REST API that in turn interacts with the k8s API native objects: <code>CronJob</code>, <code>Job</code> and <code>Deployment</code>.
* '''jobs-framework-cli''' ([https://gerrit.wikimedia.org/r/admin/repos/cloud/toolforge/jobs-framework-cli gerrit]) ([https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/jobs-framework-cli/ gitiles]) --- command line interface to interact with the jobs API service. Typically used by end users in Toolforge bastions.
* '''jobs-framework-emailer''' ([https://gerrit.wikimedia.org/r/admin/repos/cloud/toolforge/jobs-framework-emailer gerrit]) ([https://gerrit.wikimedia.org/g/cloud/toolforge/jobs-framework-emailer gitiles]) --- a daemon that uses [https://github.com/kubernetes-client/python the official k8s python client] and [https://docs.python.org/3/library/asyncio.html asyncio]. It runs inside k8s, listens to pod events, and emails users about their jobs activity.
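
The relationship between a job definition and the three native k8s objects listed for <code>jobs-framework-api</code> can be sketched as follows. This is only an illustration, not the actual API code: the function name <code>k8s_kind_for_job</code> is hypothetical, and the mapping itself (scheduled jobs to <code>CronJob</code>, continuous jobs to <code>Deployment</code>, everything else to <code>Job</code>) is an assumption inferred from the <code>schedule</code> and <code>continuous</code> request parameters.

<syntaxhighlight lang="python">
from typing import Optional


def k8s_kind_for_job(schedule: Optional[str] = None, continuous: bool = False) -> str:
    """Return the k8s object kind assumed to back a job (hypothetical mapping)."""
    if schedule and continuous:
        # The API documents these two parameters as mutually exclusive.
        raise ValueError("schedule and continuous are mutually exclusive")
    if schedule:
        return "CronJob"      # assumption: cron-like jobs
    if continuous:
        return "Deployment"   # assumption: persistent jobs
    return "Job"              # assumption: one-off jobs
</syntaxhighlight>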
 
The REST API is freely usable within Toolforge, from both bastion servers and kubernetes pods. This means that a running job can interact with the Toolforge jobs API to create, read, update and delete (CRUD) other jobs.
 
=== Auth ===
 
[[File:Toolforge_jobs-auth.png|center|500px]]
 
To ensure that Toolforge users only manage their own jobs, TJF uses kubernetes certificates for client authentication. These x509 certificates are automatically managed by <code>maintain-kubeusers</code>, and live in each user's home directory:
 
<syntaxhighlight lang="shell-session">
toolsbeta.test@toolsbeta-sgebastion-04:~$ egrep client-certificate\|client-key .kube/config
    client-certificate: /data/project/test/.toolskube/client.crt
    client-key: /data/project/test/.toolskube/client.key
toolsbeta.test@toolsbeta-sgebastion-04:~$ head -1 /data/project/test/.toolskube/client.crt
-----BEGIN CERTIFICATE-----
toolsbeta.test@toolsbeta-sgebastion-04:~$ head -1 /data/project/test/.toolskube/client.key
-----BEGIN RSA PRIVATE KEY-----
</syntaxhighlight>
 
The <code>jobs-framework-api</code> component needs to know the client certificate '''CommonName'''. With this information, <code>jobs-framework-api</code> can ''impersonate'' the user: it reads the same x509 certificates from the user's home directory and uses them to interact with the kubernetes API. This is effectively a TLS proxy that reuses the original certificate.
 
In the current Toolforge webservice setup, TLS termination is done at the nginx front proxy. The front proxy talks to the backends using plain HTTP, with no simple options for relaying or forwarding the original client TLS certs. That's why the <code>jobs-framework-api</code> doesn't use the main Toolforge ingress setup.
 
This results in two types of connections, as shown in the diagram above:
 
* '''connection type 1''': a user contacts <code>jobs-framework-api</code> using the k8s client TLS certs from their home directory. The TLS connection is established to the <code>ingress-ngnx-jobs</code>, which performs the client-side TLS termination. This can happen from a Toolforge bastion, or from a Job already running inside kubernetes. The connection can be made either using <code>jobs-framework-cli</code> or by contacting <code>jobs-framework-api</code> programmatically by other methods.
* '''connection type 2''': once the CommonName of the original request certificate is validated, <code>jobs-framework-api</code> can load the same k8s client TLS certificate from the user's home, and ''impersonate'' the user to contact the k8s API. For this to be possible, the <code>jobs-framework-api</code> component needs permissions on every user's home directory, much like <code>maintain-kubeusers</code> has.
 
This setup is possible because the x509 certificates are maintained by the <code>maintain-kubeusers</code> component, and because <code>jobs-framework-api</code> runs inside the kubernetes cluster itself and therefore can be configured with enough permissions to read each user's home.
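
From the client side, connection type 1 could look roughly like the sketch below, which uses only the Python standard library. It is not how <code>jobs-framework-cli</code> is implemented. The helper names are hypothetical, the certificate layout follows the <code>.toolskube</code> example shown earlier, and it is an assumption that the server certificate validates against the system CA store.

<syntaxhighlight lang="python">
import ssl
import urllib.request
from pathlib import Path
from typing import Tuple

# toolsbeta endpoint; swap in the tools endpoint as needed.
API_URL = "https://jobs.svc.toolsbeta.eqiad1.wikimedia.cloud:30001/api/v1"


def client_cert_paths(tool_home: str) -> Tuple[Path, Path]:
    """Return (certificate, key) paths inside a tool home (hypothetical helper)."""
    kube_dir = Path(tool_home) / ".toolskube"
    return kube_dir / "client.crt", kube_dir / "client.key"


def build_request(endpoint: str, method: str = "GET") -> urllib.request.Request:
    """Build, but do not send, a request for an API endpoint such as 'list'."""
    url = f"{API_URL}/{endpoint.strip('/')}/"
    return urllib.request.Request(url, method=method)


def list_jobs(tool_home: str) -> bytes:
    """Send GET /api/v1/list/ authenticating with the tool's client certificate."""
    cert, key = client_cert_paths(tool_home)
    # Assumption: the server certificate is trusted by the system CA store.
    context = ssl.create_default_context()
    # Present the x509 client certificate during the TLS handshake.
    context.load_cert_chain(certfile=str(cert), keyfile=str(key))
    with urllib.request.urlopen(build_request("list"), context=context) as resp:
        return resp.read()
</syntaxhighlight>

Calling <code>list_jobs("/data/project/test")</code> from a bastion would then perform the request as the <code>test</code> tool, provided the certificate files exist and are readable.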
 
Additional or alternative authentication mechanisms can be introduced in the future as we detect new use cases.
 
The Toolforge front proxy exists today basically for webservices running in the grid. Once the grid is fully deprecated and we no longer need the front proxy, we could re-evaluate this whole situation and simplify it.
 
=== Ingress & TLS ===
 
The <code>jobs-framework-api</code> doesn't use a kubernetes ingress deployment. Instead, it deploys its own nodeport service in the <code>jobs-api</code> namespace.
 
The nginx proxy in this jobs-specific setup is able to read TLS client certificates and pass the <code>ssl-client-subject-dn</code> HTTP header to the pod running the <code>toolforge-jobs-api</code> webservice.
With this information, <code>toolforge-jobs-api</code> can load the client cert again when talking to the k8s API on behalf of the original user.
 
The way this whole ingress/TLS setup works is as follows:
* The FQDN <code>jobs.svc.toolsbeta.eqiad1.wikimedia.cloud</code> points to the k8s haproxy VIP address.
* The haproxy system listens on 30001/TCP for this jobs-specific ingress (and on 30000/TCP for the general one).
* The haproxy daemon reaches all k8s worker nodes on 30001/TCP, where a nodeport service in the <code>jobs-api</code> namespace redirects packets to the <code>jobs-api</code> deployment.
* The deployment consists of 1 pod with 2 containers: nginx and the <code>jobs-framework-api</code> itself.
* The nginx container handles the TLS termination and proxies the API by means of a socket.
* Once the TLS certs are verified, the proxy injects the HTTP header <code>ssl-client-subject-dn</code> to <code>jobs-framework-api</code>, which contains the <code>CN=</code> information of the original user.
* With the <code>ssl-client-subject-dn</code> header, <code>jobs-framework-api</code> can load the client certificate again from the original user's home on NFS and in turn contact the k8s API using it.
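
The last two steps can be sketched in Python as shown below. This is a minimal illustration rather than the real <code>jobs-framework-api</code> code. The function names are hypothetical, the header value format (for example <code>CN=test,O=toolforge</code>) is an assumption, and so is the idea that the CommonName maps directly to a directory under <code>/data/project</code>.

<syntaxhighlight lang="python">
import re
from pathlib import Path
from typing import Optional, Tuple

# Matches the CN attribute in "CN=test,O=toolforge" or "/O=toolforge/CN=test".
_CN_RE = re.compile(r"(?:^|[,/])\s*CN=([^,/]+)")


def common_name_from_dn(subject_dn: str) -> Optional[str]:
    """Extract the CommonName from an ssl-client-subject-dn header value."""
    match = _CN_RE.search(subject_dn)
    return match.group(1).strip() if match else None


def user_cert_paths(subject_dn: str, project_root: str = "/data/project") -> Tuple[Path, Path]:
    """Return the (certificate, key) paths assumed to belong to the requesting user."""
    cn = common_name_from_dn(subject_dn)
    if cn is None or cn in (".", ".."):
        # Refuse requests without a usable CN instead of guessing a home directory.
        raise ValueError(f"no usable CN in subject DN: {subject_dn!r}")
    kube_dir = Path(project_root) / cn / ".toolskube"
    return kube_dir / "client.crt", kube_dir / "client.key"
</syntaxhighlight>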
 
=== About logs ===
 
Logs produced by jobs should not be made available using <code>kubectl logs</code> because that means the stderr/stdout of the pod is being read and written in the etcd cluster. If left unattended, logs produced by jobs can easily hammer and bring down our etcd clusters.
 
Logs should be stored in each user's NFS home directory, until we come up with some holistic solution at the kubernetes level like https://kubernetes.io/docs/concepts/cluster-administration/logging/
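
One way to keep job output out of <code>kubectl logs</code> is to have the job's shell redirect stdout and stderr to files in the tool's home directory before running the command. The sketch below only illustrates that idea. The helper name <code>wrap_with_log_files</code> and the <code>&lt;job&gt;.out</code> / <code>&lt;job&gt;.err</code> file naming are assumptions, not necessarily what TJF does.

<syntaxhighlight lang="python">
import shlex


def wrap_with_log_files(cmd: str, job_name: str, home: str) -> str:
    """Return a shell snippet that appends the command output to files in `home`."""
    out_file = shlex.quote(f"{home}/{job_name}.out")
    err_file = shlex.quote(f"{home}/{job_name}.err")
    # `exec` with only redirections re-points stdout/stderr of the current shell,
    # so everything the command prints afterwards is appended to the files.
    return f"exec 1>>{out_file} 2>>{err_file}; {cmd}"
</syntaxhighlight>

The resulting string is meant to be passed to a shell, for example as the argument of <code>/bin/sh -c</code> in the container command.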
 
=== Endpoints ===
 
Some relevant URLs:
* https://jobs.svc.tools.eqiad1.wikimedia.cloud:30001/api/v1 --- API endpoint in the '''tools''' project.
* https://jobs.svc.toolsbeta.eqiad1.wikimedia.cloud:30001/api/v1 --- API endpoint in the '''toolsbeta''' project.
* https://jobs.toolforge.org/ --- name-reserved Toolforge tool ([https://toolsadmin.wikimedia.org/tools/id/jobs toolsadmin]) ([https://toolhub.wikimedia.org/tools/toolforge-jobs toolhub])
 
Please note that as of this writing the API endpoints are only available within Toolforge / Cloud VPS (internal IP address, no floating IP).
 
== Deployment and maintenance ==
 
Information on how to deploy and maintain the framework.
 
=== jobs-framework-api ===
 
==== deployment ====
Follow the usual workflow to deploy a custom k8s component. This should really be automated; see [[phab:T291915 | Phabricator T291915: toolforge: automate how we deploy custom k8s components]].
 
==== maintenance ====
 
To see logs, try something like:
 
<syntaxhighlight lang="shell-session">
user@toolsbeta-test-k8s-control-4:~$ sudo -i kubectl logs deployment/jobs-api -n jobs-api nginx
[..]
192.168.17.192 - - [15/Feb/2022:12:57:54 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.81.64 - - [15/Feb/2022:12:59:50 +0000] "GET /api/v1/list/ HTTP/1.1" 200 3 "-" "python-requests/2.21.0"
192.168.17.192 - - [15/Feb/2022:13:00:34 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.81.64 - - [15/Feb/2022:13:01:01 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.17.192 - - [15/Feb/2022:13:01:02 +0000] "POST /api/v1/run/ HTTP/1.1" 409 52 "-" "python-requests/2.21.0"
user@toolsbeta-test-k8s-control-4:~$ sudo -i kubectl logs deployment/jobs-api -n jobs-api webservice
[..]
*** Operational MODE: single process ***
mounting api:app on /
Adding available container: {'shortname': 'tf-bullseye-std', 'image': 'docker-registry.tools.wmflabs.org/toolforge-bullseye-standalone:latest'}
Adding available container: {'shortname': 'tf-buster-std-DEPRECATED', 'image': 'docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest'}
Adding available container: {'shortname': 'tf-golang', 'image': 'docker-registry.tools.wmflabs.org/toolforge-golang-sssd-base:latest'}
Adding available container: {'shortname': 'tf-golang111', 'image': 'docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest'}
Adding available container: {'shortname': 'tf-jdk17', 'image': 'docker-registry.tools.wmflabs.org/toolforge-jdk17-sssd-base:latest'}
[..]
</syntaxhighlight>
 
To verify the API endpoint is up try something like:
 
<syntaxhighlight lang="shell-session">
user@toolsbeta-test-k8s-control-4:~$ curl https://jobs.svc.toolsbeta.eqiad1.wikimedia.cloud:30001/api/v1/list -k
<html>
<head><title>400 No required SSL certificate was sent</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<center>No required SSL certificate was sent</center>
<hr><center>nginx/1.21.0</center>
</body>
</html>
</syntaxhighlight>
 
The 400 error is expected in that example because we're not sending a TLS client certificate, meaning nginx is doing its work correctly.
 
=== jobs-framework-cli ===
 
==== deployment ====
This is a simple Debian package installed on the bastions. See [[Portal:Toolforge/Admin/Packaging]].
 
=== jobs-framework-emailer ===
 
==== deployment ====
Follow the usual workflow to deploy a custom k8s component. This should really be automated; see [[phab:T291915 | Phabricator T291915: toolforge: automate how we deploy custom k8s components]].
 
==== maintenance ====
 
TODO: in development, see [[phab:T286135 | Phabricator T286135: Toolforge jobs framework: email maintainers on job failure]].
 
== API docs ==
 
This section contains concrete details for the API that TJF introduces.
 
'''TODO:''' this section is outdated; we need swagger or similar to keep it up-to-date.
 
==== POST /api/v1/run/ ====
 
Creates a new job in the kubernetes cluster.
{{Collapse top|POST /api/v1/run/ details}}
 
{| class="wikitable sortable"
|+ Request parameters
|-
! name !! datatype !! required !! comment
|-
| name || string || yes || Name identification for the new job
|-
| cmd || string || yes || Job command line to execute, including arguments
|-
| type || string || yes || Job container type
|-
| schedule || string || optional || If present, job will be scheduled. String is cron syntax, like <code>*/1 * * * *</code>. Mutually exclusive with <code>continuous</code>.
|-
| continuous || boolean || optional || If <code>true</code>, job will be persistent. Mutually exclusive with <code>schedule</code>.
|}
 
Example request data:
<syntaxhighlight lang="json">
{
  "name": "myjob",
  "cmd": "./myscript.py --once",
  "type": "toolforge-buster-sssd",
  "schedule": "*/1 * * * *"
}
</syntaxhighlight>
{{Collapse bottom}}
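
The constraints in the request parameter table for <code>POST /api/v1/run/</code> can be checked on the client side before sending a request. The following is a minimal sketch. The function <code>validate_run_request</code> is hypothetical and only mirrors the table, so the server may apply further rules.

<syntaxhighlight lang="python">
from typing import Any, Dict, List

REQUIRED_FIELDS = ("name", "cmd", "type")


def validate_run_request(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a POST /api/v1/run/ payload."""
    errors: List[str] = []
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value:
            errors.append(f"'{field}' is required and must be a non-empty string")
    schedule = data.get("schedule")
    continuous = data.get("continuous", False)
    if schedule is not None and not isinstance(schedule, str):
        errors.append("'schedule' must be a string in cron syntax")
    if not isinstance(continuous, bool):
        errors.append("'continuous' must be a boolean")
    if schedule and continuous is True:
        errors.append("'schedule' and 'continuous' are mutually exclusive")
    return errors
</syntaxhighlight>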
 
==== GET /api/v1/show/{name}/  ====
 
Shows information about a job in the kubernetes cluster.
 
{{Collapse top|GET /api/v1/show/{name}/ details}}
 
{| class="wikitable sortable"
|+ Request parameters
|-
! name !! datatype !! required !! comment
|-
| name || string || yes || Job name identification
|}
 
{| class="wikitable sortable"
|+ Response parameters
|-
! name !! datatype !! comment
|-
| name || string || Job name identification
|-
| cmd || string || Job command line, including arguments
|-
| type || string || Job container type
|-
| schedule || string || Job schedule string in cron syntax, like <code>*/1 * * * *</code>
|-
| continuous || boolean || True if job is persistent
|-
| state || string || Job current state
|}
 
Example response JSON data:
<syntaxhighlight lang="json">
{
  "name": "myjob",
  "cmd": "./myscript.py --once",
  "type": "python",
  "continuous": false,
  "state": "finished"
}
</syntaxhighlight>
{{Collapse bottom}}
 
==== DELETE /api/v1/delete/{name} ====
 
Delete a job in the kubernetes cluster.
 
{{Collapse top|DELETE /api/v1/delete/{name} details}}
{| class="wikitable sortable"
|+ Request parameters
|-
! name !! datatype !! required !! comment
|-
| name || string || yes || Job name identification
|}
{{Collapse bottom}}
 
==== GET /api/v1/list/ ====
 
Shows information about all user jobs in the kubernetes cluster.
 
{{Collapse top|GET /api/v1/list/ details}}
 
There are no request parameters.
 
{| class="wikitable sortable"
|+ Response parameters (list)
|-
! name !! datatype !! comment
|-
| name || string || Job name identification
|-
| cmd || string || Job command line, including arguments
|-
| type || string || Job container type
|-
| schedule || string || Job schedule string in cron syntax, like <code>*/1 * * * *</code>
|-
| continuous || boolean || True if job is persistent
|-
| state || string || Job current state
|}
 
Example response JSON data:
<syntaxhighlight lang="json">
[
  {
    "name": "myjob",
    "cmd": "./myscript.py --once",
    "type": "toolforge-buster-sssd",
    "continuous": false,
    "state": "finished"
  },
  {
    "name": "myotherjob",
    "cmd": "./myotherscript.py --once",
    "type": "toolforge-buster-sssd",
    "schedule": "*/1 * * * *",
    "state": "running"
  }
]
</syntaxhighlight>
{{Collapse bottom}}
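
A client consuming <code>GET /api/v1/list/</code> has to cope with <code>schedule</code> and <code>continuous</code> being absent from some entries, as in the example response. A small sketch of such parsing follows. The <code>JobInfo</code> class and <code>parse_job_list</code> function are hypothetical names, and treating a missing <code>continuous</code> as <code>False</code> is an assumption.

<syntaxhighlight lang="python">
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobInfo:
    name: str
    cmd: str
    type: str
    state: str
    schedule: Optional[str] = None   # only present for scheduled jobs
    continuous: bool = False         # assumption: absent means not continuous


def parse_job_list(body: str) -> List[JobInfo]:
    """Parse the JSON body of a /api/v1/list/ response into JobInfo objects."""
    jobs = []
    for item in json.loads(body):
        jobs.append(JobInfo(
            name=item["name"],
            cmd=item["cmd"],
            type=item["type"],
            state=item["state"],
            schedule=item.get("schedule"),
            continuous=item.get("continuous", False),
        ))
    return jobs
</syntaxhighlight>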
 
==== DELETE /api/v1/flush/ ====
 
Delete all user jobs in the kubernetes cluster.
 
{{Collapse top|DELETE /api/v1/flush/ details}}
There are no request parameters.
{{Collapse bottom}}
 
==== GET /api/v1/containers/ ====
 
Shows information about all containers available for jobs in the kubernetes cluster.
 
{{Collapse top|GET /api/v1/containers/ details}}
 
There are no request parameters.
 
{| class="wikitable sortable"
|+ Response parameters (list)
|-
! name !! datatype !! comment
|-
| name || string || container shortname
|-
| type || string || container URL
|}
 
Example response JSON data:
<syntaxhighlight lang="json">
[
    {
        "name": "tf-buster",
        "type": "docker-registry.tools.wmflabs.org/toolforge-buster-sssd:latest"
    },
    {
        "name": "tf-buster-std",
        "type": "docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest"
    }
]
</syntaxhighlight>
{{Collapse bottom}}
 
 
== See also ==
* [[Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_jobs]] -- where this was initially designed.
* [[Help:Toolforge/Jobs_framework]] -- end user documentation
 
Some upstream kubernetes documentation pointers:
 
* https://kubernetes.io/docs/concepts/workloads/controllers/job/
* https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/
* https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
* https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
* https://kubernetes.io/docs/tasks/job/
* https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/
* https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs
 
Related components:
* [https://gerrit.wikimedia.org/g/operations/software/tools-webservice/+/refs/heads/master operations/software/tools-webservice.git]

Latest revision as of 21:05, 24 May 2022