You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Help:Toolforge/Jobs framework: Difference between revisions

imported>Krinkle
imported>BryanDavis
(Fix link to avoid redirect)
* [[News/2020 Kubernetes cluster migration]]
* [[Help:Toolforge/Raw kubernetes jobs | Alternate procedure for managing jobs in Toolforge Kubernetes, using the raw k8s API]], only recommended if you are an advanced user.
* [[Portal:Toolforge/Admin/Kubernetes/Jobs framework]] - Engineering documentation about this system.
* [https://techblog.wikimedia.org/2022/03/18/toolforge-jobs-framework/ Wikimedia Techblog: Toolforge Jobs Framework]

Revision as of 21:07, 24 May 2022

This page contains information on the Toolforge jobs framework.

Every non-trivial task performed in Toolforge (like executing a script or running a bot) should be dispatched to a job scheduling backend (in this case, Kubernetes), which ensures that the job is run in a suitable place with sufficient resources.

The basic principle of running jobs is fairly straightforward:

  • You create a job from a submission server (usually login.toolforge.org)
  • Kubernetes finds a suitable execution node to run the job on, and starts it there once resources are available
  • As it runs, your job will send output and errors to files until the job completes or is aborted.

Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once.

Creating jobs

This section describes how to create jobs with the toolforge-jobs run command.

Creating one-off jobs

One-off jobs (or normal jobs) are workloads that will be scheduled by Toolforge Kubernetes and run until finished. They will run once, and are expected to finish at some point.

Select a runtime and a command in your tool home directory, then use toolforge-jobs run to create the job. For example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-bullseye-std

The --command option accepts arguments for the command if you wrap the whole string in quotes. For example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command "./mycommand.sh --witharguments" --image tf-bullseye-std

With the --wait option, the command line does not return until the job has finished. For example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-bullseye-std --wait

Creating scheduled jobs (cron jobs)

To schedule a recurring job (also known as a cron job), use the --schedule WHEN option when creating it:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run mycronjob --command ./daily.sh --image tf-bullseye-std --schedule "17 13 * * *"

The schedule argument uses cron syntax (see also cron on Wikipedia).

If you need to run a daily or hourly job, please avoid scheduling it at exactly midnight (00:00) or at the top of the hour (:00 minutes) unless your job explicitly requires it. Instead, pick a random time so that system load is spread evenly throughout the day.
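One way to pick such a time is to let the shell choose it for you. The following is a minimal sketch, not part of toolforge-jobs: the function name random_daily_schedule is made up for this example, and it only prints a cron expression for a daily job with a random hour and a minute between 1 and 59.

```shell
# Print a cron expression for a daily job at a random time.
# The minute is drawn from 1-59, so the job never starts at the top of
# the hour (which also rules out exactly midnight).
random_daily_schedule() {
    awk 'BEGIN {
        srand()
        minute = 1 + int(rand() * 59)   # 1..59
        hour   = int(rand() * 24)       # 0..23
        printf "%d %d * * *\n", minute, hour
    }'
}

# Print one candidate schedule.
random_daily_schedule
```

Run it once, then pass the printed value to --schedule (or paste it into your YAML file) so that the job keeps the same time from then on.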

Creating continuous jobs

Continuous jobs are programs that are never meant to end. If they end (for example, because of an error) the Toolforge Kubernetes system will restart them.

To create a continuous job, use the --continuous option:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myalwaysrunningjob --command ./myendlesscommand.sh --image tf-bullseye-std --continuous

About the executable

In all job types (normal, continuous, cronjob) the --command parameter should meet the following conditions:

  • it should refer to an executable file.
  • mind the path: the command's working directory is the tool's home directory, so --command mycommand.sh will likely fail (it is looked up in $PATH), and --command ./mycommand.sh is likely what you mean.
  • arguments are optional, but if present, wrap the whole command in quotes, for example: --command "./mycommand.sh --arg1 x --arg2 y".

Failing to meet any of these conditions will lead to errors either before launching the job, or shortly after the job is processed by the backend.
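The conditions above can be checked locally before creating a job. This is only a sketch under stated assumptions: check_job_command is a hypothetical helper, not a toolforge-jobs feature, and it must be run from the tool home directory. It takes the same string you would pass to --command and checks that the first word contains a path and points to an executable file.

```shell
# Pre-flight check for a --command value, run from the tool home directory.
# Succeeds only if the first word contains a path (for example ./script.sh)
# and points to an executable regular file.
check_job_command() {
    # The first word of the command string is the executable itself.
    executable=${1%% *}
    case "$executable" in
        */*) ;;  # has a path component, so it is not looked up in $PATH
        *) echo "error: '$executable' has no path, try ./$executable" >&2
           return 1 ;;
    esac
    if [ ! -f "$executable" ] || [ ! -x "$executable" ]; then
        echo "error: '$executable' is not an executable file" >&2
        return 1
    fi
    echo "ok: $executable"
}
```

For example, check_job_command "./mycommand.sh --arg1 x" prints ok: ./mycommand.sh when the script exists and has its executable bit set.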

About the job name

The job name is a unique string identifier. The string should meet these criteria:

  • between 1 and 100 characters long.
  • any combination of numbers, lower-case letters and the - (dash) character.
  • no spaces, no special symbols.

Failing to meet any of these conditions will lead to errors either before launching the job, or shortly after the job is processed by the backend.
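The naming criteria above can be expressed as a single pattern. The sketch below uses a hypothetical helper, validate_job_name, which checks only the three rules listed here; the backend may enforce additional rules that this page does not describe.

```shell
# Return success if the name is 1 to 100 characters long and contains only
# digits, lower-case letters and dashes. LC_ALL=C keeps the a-z range
# limited to ASCII lower-case letters.
validate_job_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9-]{1,100}$'
}

# Example: prints "valid" for the first name and "invalid" for the second.
for candidate in "my-cron-job2" "My Job"; do
    if validate_job_name "$candidate"; then
        echo "valid"
    else
        echo "invalid"
    fi
done
```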

Choosing the execution runtime

In Toolforge Kubernetes we offer a pre-defined set of container images that you can use as the execution runtime for your job.

To view which execution runtimes are available, run the toolforge-jobs images command.

Example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs images
Short name                Container image URL
------------------------  ----------------------------------------------------------------------
tf-bullseye-std           docker-registry.tools.wmflabs.org/toolforge-bullseye-standalone:latest
tf-buster-std-DEPRECATED  docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest
tf-golang                 docker-registry.tools.wmflabs.org/toolforge-golang-sssd-base:latest
tf-golang111              docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest
tf-jdk11-DEPRECATED       docker-registry.tools.wmflabs.org/toolforge-jdk11-sssd-base:latest
tf-jdk17                  docker-registry.tools.wmflabs.org/toolforge-jdk17-sssd-base:latest
tf-jdk8-DEPRECATED        docker-registry.tools.wmflabs.org/toolforge-jdk8-sssd-base:latest
tf-node6-DEPRECATED       docker-registry.tools.wmflabs.org/toolforge-node6-sssd-base:latest
tf-node10-DEPRECATED      docker-registry.tools.wmflabs.org/toolforge-node10-sssd-base:latest
tf-node12                 docker-registry.tools.wmflabs.org/toolforge-node12-sssd-base:latest
tf-php5-DEPRECATED        docker-registry.tools.wmflabs.org/toolforge-php5-sssd-base:latest
tf-php72-DEPRECATED       docker-registry.tools.wmflabs.org/toolforge-php72-sssd-base:latest
tf-php73-DEPRECATED       docker-registry.tools.wmflabs.org/toolforge-php73-sssd-base:latest
tf-php74                  docker-registry.tools.wmflabs.org/toolforge-php74-sssd-base:latest
tf-python2-DEPRECATED     docker-registry.tools.wmflabs.org/toolforge-python2-sssd-base:latest
tf-python34-DEPRECATED    docker-registry.tools.wmflabs.org/toolforge-python34-sssd-base:latest
tf-python35-DEPRECATED    docker-registry.tools.wmflabs.org/toolforge-python35-sssd-base:latest
tf-python37-DEPRECATED    docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
tf-python39               docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest
tf-ruby21-DEPRECATED      docker-registry.tools.wmflabs.org/toolforge-ruby21-sssd-base:latest
tf-ruby25-DEPRECATED      docker-registry.tools.wmflabs.org/toolforge-ruby25-sssd-base:latest
tf-ruby27                 docker-registry.tools.wmflabs.org/toolforge-ruby27-sssd-base:latest
tf-tcl86                  docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest

We suggest you move away from images marked with the DEPRECATED keyword, since they are old runtimes.

Introducing additional flexibility for execution runtimes is currently part of the WMCS team roadmap.

NOTE: if your tool uses python, you may want to use a virtualenv, see Help:Toolforge/Python#Kubernetes_python_jobs.

Loading jobs from a YAML file

You can define a list of jobs in a YAML file and load them all at once using the toolforge-jobs load command. For example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs load jobs.yaml

NOTE: loading jobs from a file first deletes (flushes) all previously defined jobs.

Example YAML file:

# https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
---
# a cronjob
- name: everyminute
  command: ./myothercommand.py -v
  image: tf-bullseye-std
  no-filelog: true
  schedule: "* * * * *"
  emails: onfailure
# a continuous job
- image: tf-bullseye-std
  name: endlessjob
  command: ./dumps-daemon.py --endless
  continuous: true
  emails: all
# wait for this normal job before loading the next
- name: myjob
  image: tf-bullseye-std
  command: ./mycommand.sh --argument1
  wait: true
  emails: onfinish
# another normal job after the previous one finished running
- name: anotherjob
  image: tf-bullseye-std
  command: ./mycommand.sh --argument1
  emails: none
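Because loading a file replaces all existing jobs, a rough check of the file beforehand can help. The sketch below is an illustration only: check_jobs_file is a hypothetical helper, it assumes the two-space layout used in the example above, and it assumes every job needs at least name, image and command (as every entry in the example has). It counts keys line by line and is not a YAML parser.

```shell
# Rough sanity check for a jobs file: every list entry should define
# name, image and command.
check_jobs_file() {
    file=$1
    # Each job entry starts with "- " at the beginning of a line.
    entries=$(grep -Ec '^- ' "$file" || true)
    if [ "$entries" -eq 0 ]; then
        echo "error: no job entries found" >&2
        return 1
    fi
    for key in name image command; do
        # A key is either the first one in an entry ("- key:") or indented.
        count=$(grep -Ec "^(- |  )$key:" "$file" || true)
        if [ "$count" -ne "$entries" ]; then
            echo "error: $entries job entries but $count '$key' keys" >&2
            return 1
        fi
    done
    echo "ok: $entries jobs"
}
```

For the example file above, check_jobs_file jobs.yaml prints ok: 4 jobs.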

Listing your existing jobs

You can get information about the jobs created for your tool using toolforge-jobs list. For example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs list
Job name:       Job type:             Status:
--------------  --------------------  ---------------------------
myscheduledjob  schedule: * * * * *   Last schedule time: 2021-06-30T10:26:00Z
alwaysrunning   continuous            Running
myjob           normal                Completed

Listing even more information at once is possible using --long or -l:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs list -l
Job name:       Command:                 Job type:            Image:            File log:  Emails:   Resources:   Status:
--------------  -----------------------  -------------------  ---------------   ---------  -------   ----------   ---------------------------
myscheduledjob  ./read-dumps.sh          schedule: * * * * *  tf-bullseye-std   yes        none      default      Last schedule time: 2021-06-30T10:26:00Z
alwaysrunning   ./myendlesscommand.sh    continuous           tf-bullseye-std   no         all       default      Running
myjob           ./mycommand.sh --debug   normal               tf-bullseye-std   yes        onfinish  default      Completed

NOTE: normal jobs will be deleted from this listing shortly after being completed (even if they finish with some error).

Deleting your jobs

You can delete your jobs in two ways:

  • manually delete each job, identified by name, using the toolforge-jobs delete command.
  • delete all defined jobs at once, using the toolforge-jobs flush command.

Showing information about your job

You can get information about a defined job using the toolforge-jobs show command. For example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs show myscheduledjob
+------------+-----------------------------------------------------------------+
| Job name:  | myscheduledjob                                                  |
+------------+-----------------------------------------------------------------+
| Command:   | ./read-dumps.sh myargument                                      |
+------------+-----------------------------------------------------------------+
| Job type:  | schedule: * * * * *                                             |
+------------+-----------------------------------------------------------------+
| Image:     | tf-bullseye-std                                                 |
+------------+-----------------------------------------------------------------+
| File log:  | yes                                                             |
+------------+-----------------------------------------------------------------+
| Emails:    | none                                                            |
+------------+-----------------------------------------------------------------+
| Resources: | mem: 10Mi, cpu: 100                                             |
+------------+-----------------------------------------------------------------+
| Status:    | Last schedule time: 2021-06-30T10:26:00Z                        |
+------------+-----------------------------------------------------------------+
| Hints:     | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase. State |
|            | 'waiting' for reason 'ContainerCreating'.                       |
+------------+-----------------------------------------------------------------+

The output includes information about the job status and some hints (for example, in case of failure).

Job logs

Jobs log stdout/stderr to files in your tool home directory.

For a job myjob, you will find:

  • a myjob.out file, containing stdout generated by your job.
  • a myjob.err file, containing stderr generated by your job.

Example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-bullseye-std
tools.mytool@tools-sgebastion-11:~$ ls myjob*
myjob.out myjob.err

Subsequent same-name job runs will append to the same files.

NOTE: as of this writing there is no automatic way to prune log files, so tool users must take care of such files growing too large.
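Until automatic pruning exists, one manual approach is to keep only the tail of each log file. The following is a sketch, not an official tool: trim_log is a hypothetical helper that rewrites the file in place, so a job that is still appending keeps writing to the same file. Anything the job writes during the short rewrite window may be lost.

```shell
# Keep only the last max_bytes bytes of a log file.
# Usage: trim_log FILE MAX_BYTES
trim_log() {
    file=$1
    max_bytes=$2
    # Arithmetic expansion strips any padding that wc adds to the number.
    size=$(($(wc -c < "$file")))
    if [ "$size" -le "$max_bytes" ]; then
        return 0  # already small enough
    fi
    tmp=$(mktemp) || return 1
    # Copy the tail aside, then overwrite the original file in place.
    tail -c "$max_bytes" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For example, trim_log myjob.out 1048576 keeps roughly the last mebibyte of myjob.out.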

Log generation can be disabled with the --no-filelog parameter when creating a new job, for example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-bullseye-std --no-filelog

Job quotas

Each tool account has a limited quota available. The same quota is used for jobs and other things potentially running on Kubernetes, like webservices.

To check your quota, run:

tools.mytool@tools-sgebastion-11:~$ kubectl describe resourcequotas
Name:                   tool-mytool
Namespace:              tool-mytool
Resource                Used  Hard
--------                ----  ----
configmaps              2     10
count/cronjobs.batch    0     50    <--
count/deployments.apps  0     3     <--
count/jobs.batch        0     15    <--
limits.cpu              0     2
limits.memory           0     8Gi
persistentvolumeclaims  0     3
pods                    0     10
replicationcontrollers  0     1
requests.cpu            0     2
requests.memory         0     6Gi
secrets                 1     10
services                0     1
services.nodeports      0     0

The quota entries marked with the <-- symbol indicate:

  • maximum number of cronjobs
  • maximum number of continuous jobs
  • maximum number of jobs
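To see how much of these three limits is still free, the quota listing can be filtered with awk. This is a small sketch: remaining_job_quota is a hypothetical helper that reads the listing on standard input and assumes the column layout shown above (resource, used, hard).

```shell
# Read "kubectl describe resourcequotas" output on stdin and print, for each
# count/* row, the resource name and how many more objects can be created
# (hard limit minus current usage).
remaining_job_quota() {
    awk '$1 ~ /^count\// { print $1, $3 - $2 }'
}
```

Usage: kubectl describe resourcequotas | remaining_job_quota. With the listing above, this prints 50 for count/cronjobs.batch, 3 for count/deployments.apps and 15 for count/jobs.batch.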

As of this writing, new jobs get 512Mi memory and 1/2 CPU by default.

You can run jobs with additional CPU and memory using the --mem MEM and --cpu CPU parameters. For example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command "./heavycommand.sh" --image tf-bullseye-std --mem 1Gi --cpu 2

Requesting more memory or CPU will fail if the tool quota is exceeded.

Quota increases

It is possible to request a quota increase if you can demonstrate your tool's need for more resources than the default namespace quota allows. Instructions and a template link for creating a quota request can be found at Toolforge (Quota requests) in Phabricator.

Please read all the instructions there before submitting your request.

Job email notifications

You can choose to receive email notifications about your job's activity by using the --emails EMAILS option when creating a job.

The available choices are:

  • none, don't get any email notification. The default behavior.
  • onfailure, receive email notifications in case of a failure event.
  • onfinish, receive email notifications in case of the job finishing (both successfully and on failure).
  • all, receive all possible notifications.

Example:

tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-bullseye-std --emails onfinish

The email will be sent to tools.mytool@toolforge.org, which is an email alias that by default redirects to all tool maintainers associated with that particular tool account.

Complete example session

Here is a complete example of a work session with the Toolforge jobs framework.

Grid Engine migration

This section contains specific documentation for Grid Engine users who are migrating their jobs to Kubernetes.

In particular, here is a list of common command equivalences between Grid Engine (legacy, with jsub and friends) and Kubernetes (with the new toolforge-jobs).

  • Basic job submission:
tools.mytool@tools-sgebastion-11:~$ jsub ./mycommand.sh
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-bullseye-std
  • Allocating additional memory:
tools.mytool@tools-sgebastion-11:~$ jsub -mem 1000m php i_like_more_ram.php
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./i_like_more_ram.php --image tf-php74 --mem 1Gi --cpu 2
  • Waiting until the job is completed:
tools.mytool@tools-sgebastion-11:~$ jsub -sync y program [args...]
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./myScript.py --image tf-python39 --wait
  • Viewing information about all jobs:
tools.mytool@tools-sgebastion-11:~$ qstat
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs list
  • Deleting a job:
tools.mytool@tools-sgebastion-11:~$ qdel job_number/job_name
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs delete myjob
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs flush

Useful links

The following tools have been built by the Toolforge admin team to help others see job status:

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia Movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support
Receive mail announcements about critical changes
Subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
Track work tasks and report bugs
Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
Learn about major near-term plans
Read the News wiki page
Read news and stories about Wikimedia Cloud Services
Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)

See also