Wikimedia Cloud Services team/EnhancementProposals/Toolforge jobs

Revision as of 16:45, 25 January 2021 by imported>Arturo Borrero Gonzalez (→‎Not using the framework: typo)

This page contains information on a potential design to support grid-like jobs on Toolforge Kubernetes, with the end goal of helping deprecate GridEngine.

Proposal

This proposal consists of introducing a framework called the Toolforge Jobs Framework (or TJF). It is basically a new API to ease end-user interaction with Toolforge jobs in the Kubernetes cluster. The new API should abstract away most of the k8s gory details of configuring, removing, managing and reading the status of jobs. The abstraction approach is similar to what is done for Toolforge webservices (we have the webservice command there), but with a new twist: the software is decoupled into two components, an API service and a command line interface.

The framework consists precisely of these two components.

The API is freely usable within Toolforge, from both the bastion servers and Kubernetes pods. This means that a running job can interact with the Toolforge jobs API and CRUD other jobs.

There are no plans to introduce backwards support for GridEngine in TJF, given the ultimate goal is to deprecate the old grid.

The two components approach

The TJF is composed of 2 components:

  • toolforge-jobs-api.py --- runs inside the k8s cluster as a webservice. It interacts with the native k8s API objects: CronJob, Job and ReplicationController.
  • toolforge-jobs-cli.py --- interacts with the toolforge-jobs-api service. Typically used by end users on the Toolforge bastions.

By splitting the software into two components and introducing a stable API, we aim to reduce the maintenance burden by not needing to rebuild all Toolforge docker containers every time something in the framework changes.
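To make the split concrete, the sketch below shows the CLI side of that decoupling: the command line tool never touches Kubernetes itself, it only builds an HTTP request for toolforge-jobs-api. The endpoint URL, path and JSON field names are illustrative assumptions, not a finalized API.

```python
import json

# Hypothetical base URL for the toolforge-jobs-api service (assumption).
API_BASE = "https://jobs.svc.toolforge.org/api/v1"


def build_job_request(name, image, command):
    """Return the (method, url, body) triple the CLI would send to the API.

    The CLI stays a thin client: all k8s knowledge lives behind this call.
    """
    body = json.dumps({"name": name, "image": image, "command": command})
    return ("POST", f"{API_BASE}/jobs", body)


# Example: the job from the development notes below, expressed as an API call.
method, url, body = build_job_request(
    "arturo-test-job",
    "docker-registry.tools.wmflabs.org/toolforge-buster-sssd:latest",
    ["./arturo-test-script.sh"],
)
```

Because only this request shape is shared between the two components, the API service can change how it drives Kubernetes without any rebuild of the CLI or the containers.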

k8s abstraction that matches GridEngine experience

We would like to support an experience similar to what users are used to in GridEngine. Given the feature mapping table below, this should be possible in Kubernetes by using the following mechanisms:

  • Job. This object is the basic definition of a workload in the k8s cluster: it makes the cluster run a given task and ensures it finishes as expected.
  • CronJob. This object supports cron-like scheduling of child Job objects.
  • ReplicationController. This object is used to ensure a given Job is always present. It is used to control the execution of continuous tasks, a feature not natively supported by the Job object.
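The selection between these three mechanisms could be sketched as a small mapping from job type to Kubernetes object kind. The function names and minimal manifest fields below are illustrative assumptions; the apiVersion values match the Kubernetes API as of the time of writing (CronJob was still batch/v1beta1).

```python
def k8s_kind_for(schedule=None, continuous=False):
    """Map a Toolforge job type to the Kubernetes object that backs it."""
    if schedule is not None:
        return "CronJob"                # cron-like scheduling of child Jobs
    if continuous:
        return "ReplicationController"  # keeps a continuous task (bot/daemon) running
    return "Job"                        # plain one-off task

# apiVersion per object kind, current at the time of writing.
API_VERSIONS = {
    "Job": "batch/v1",
    "CronJob": "batch/v1beta1",
    "ReplicationController": "v1",
}


def manifest_for(name, schedule=None, continuous=False):
    """Build a minimal (incomplete, illustrative) k8s manifest skeleton."""
    kind = k8s_kind_for(schedule, continuous)
    manifest = {
        "apiVersion": API_VERSIONS[kind],
        "kind": kind,
        "metadata": {"name": name},
    }
    if kind == "CronJob":
        manifest["spec"] = {"schedule": schedule}
    return manifest
```

A one-off job maps to a plain Job, a crontab-style job to a CronJob, and a bot or daemon to a ReplicationController, mirroring the feature mapping table below.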

Auth

To ensure that Toolforge users only manage their own jobs, TJF will use Kubernetes certificates, a similar approach to authentication as the one used by the webservice command. The certificates are managed by maintain-kubeusers.
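On the client side, certificate-based auth boils down to presenting the tool's certificate and key over TLS. A minimal stdlib sketch, assuming the caller supplies the paths to the maintain-kubeusers-issued files (the actual on-disk locations are not specified here):

```python
import ssl


def make_client_context(certfile=None, keyfile=None):
    """Build a TLS context for client-certificate auth (sketch).

    certfile/keyfile would be the per-tool credentials issued by
    maintain-kubeusers; when omitted, the context carries no client identity.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if certfile:
        # Attach the tool's client certificate so the API server can
        # identify the caller and restrict them to their own jobs.
        ctx.load_cert_chain(certfile, keyfile)
    return ctx


# No certificate is loaded in this runnable sketch.
ctx = make_client_context()
```

The Kubernetes API server (or TJF in front of it) then derives the caller's identity from the certificate subject, so no separate password or token handling is needed.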

Not using the framework

Advanced Toolforge users who know how to interact with a Kubernetes API can still use it directly (as with webservices). Using the new TJF is optional; it is provided just as a convenient facility for Toolforge users.

The containers problem

We have custom-built containers for Toolforge webservices: containers for the most common web development frameworks and languages. For practical reasons, a container cannot include each and every language and framework in the universe.

However, users can currently schedule jobs in GridEngine using any language, library or framework installed in our Debian bastions. They can write a script that combines calls to Python, PHP and Perl.

We would need to design and develop a container solution that provides job users with the appropriate runtimes.

Implementation details

TODO: Arturo would like to use Python3 to build TJF.
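Since Python 3 is the intended language, the request/response shape of toolforge-jobs-api can be prototyped with nothing but the standard library. The sketch below stands up a single hypothetical read-only endpoint serving an in-memory job list; the endpoint path and JSON shape are assumptions, and the real service would run inside Kubernetes, authenticate callers and translate requests into k8s API calls.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in data; the real API would query the k8s API instead.
JOBS = [{"name": "arturo-test-job", "status": "Completed"}]


class JobsAPIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/v1/jobs":  # hypothetical endpoint path
            payload = json.dumps(JOBS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep the sketch quiet
        pass


# Serve on an ephemeral port in a background thread, then query ourselves
# the way toolforge-jobs-cli would.
server = HTTPServer(("127.0.0.1", 0), JobsAPIHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/api/v1/jobs"
with urllib.request.urlopen(url) as resp:
    jobs = json.loads(resp.read())
server.shutdown()
```

A production implementation would sit behind TLS with the certificate auth described above, but the round trip is the same: the CLI only ever sees this JSON, never raw Kubernetes objects.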

Development notes

Some random stuff that Arturo has written here.

tools.arturo-test-tool@tools-sgebastion-08:~$ kubectl delete job arturo-test-job ; kubectl apply -f job.yaml 
job.batch "arturo-test-job" deleted
job.batch/arturo-test-job created
tools.arturo-test-tool@tools-sgebastion-08:~$ cat job.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: arturo-test-job
spec:
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: arturo-test
        image: docker-registry.tools.wmflabs.org/toolforge-buster-sssd:latest
        workingDir: /data/project/arturo-test-tool/
        command:
          - ./arturo-test-script.sh
        env:
        - name: HOME
          value: /data/project/arturo-test-tool
        volumeMounts:
        - mountPath: /data/project
          name: home
      restartPolicy: Never
      volumes:
      - hostPath:
          path: /data/project
          type: Directory
        name: home
tools.arturo-test-tool@tools-sgebastion-08:~$ kubectl get jobs
NAME              COMPLETIONS   DURATION   AGE
arturo-test-job   1/1           9s         88s
tools.arturo-test-tool@tools-sgebastion-08:~$ kubectl get pods
NAME                    READY   STATUS      RESTARTS   AGE
arturo-test-job-9lzkc   0/1     Completed   0          94s
tools.arturo-test-tool@tools-sgebastion-08:~$ kubectl logs job/arturo-test-job
arturo test script
Done sleeping

Feature mapping

Each currently supported use case in the grid should have an equivalent feature in kubernetes. This table should help map each one.

Toolforge jobs feature mapping table

Feature                               GridEngine                   Kubernetes
simple one-off job launch             jsub                         native Job API support
get single job status                 qstat                        kubectl describe job
get all jobs status                   qstat                        kubectl + some scripting
delete job                            jstop                        kubectl delete
scheduled jobs                        crontab + jsub               native CronJob API support
continuous job launch (bot, daemon)   jstart                       native ReplicationController API support
concurrency limits                    16 running + 34 scheduled    TBD, several potential mechanisms
get stderr/stdout of a job            files in the NFS directory   files in the NFS directory + kubectl logs <pod>
request additional memory             jsub -mem                    TBD, we may not need this
synchronous run                       jsub -sync y                 TBD, no native support
making sure a job only runs once      jsub -once                   native Job API support
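The "kubectl + some scripting" cell for getting the status of all jobs can be illustrated with a short Python sketch: feed the JSON produced by `kubectl get jobs -o json` into a small summarizer. The sample document below mimics that output for the test job from the development notes; field access follows the batch/v1 Job schema.

```python
import json


def summarize_jobs(kubectl_json):
    """Return {job name: "succeeded/desired"} from `kubectl get jobs -o json` output."""
    summary = {}
    for item in kubectl_json.get("items", []):
        name = item["metadata"]["name"]
        desired = item["spec"].get("completions", 1)
        done = item.get("status", {}).get("succeeded", 0)
        summary[name] = f"{done}/{desired}"
    return summary


# Sample mimicking kubectl output for the arturo-test-job run shown above.
sample = json.loads("""
{"items": [{"metadata": {"name": "arturo-test-job"},
            "spec": {"completions": 1},
            "status": {"succeeded": 1}}]}
""")
```

In practice the JSON would come from running kubectl (or from the k8s API directly), and this kind of scripting is exactly what TJF would hide behind a single API call.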

See also

Internal documents:

Some upstream kubernetes documentation pointers: