Analytics/Systems/Airflow/Developer guide/Python Job Repos
This page provides a tutorial on designing a Python-based job repository in GitLab that publishes job artifacts which can be scheduled and launched by Airflow. There is also an example GitLab repository that follows all of these recommendations: https://gitlab.wikimedia.org/repos/data-engineering/example-job-project.
Overview
We intentionally want to separate job logic from scheduling logic. A job should be standalone and parameterized so that, given specific inputs, it produces certain outputs. Airflow is a scheduler meant to run the job with input parameters for a particular run, usually based on timestamps or incoming data.
A job repository specifies all of the dependencies and logic needed to run a job. In order for the Airflow scheduler to launch the job, it needs to be able to access the job code and dependencies somewhere.
Data Engineering has implemented reusable GitLab CI pipelines to automate the generation of job 'artifacts', as well as tooling to deploy these artifacts so that Airflow can access them.
As of 2022-04, the CI pipelines focus on Python-based jobs (or anything that uses conda environments), but the artifact deployment can work with any kind of artifact file (zip files, jars, compiled binaries, etc.).
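As an illustrative sketch of this separation (all file paths and names here are hypothetical, not taken from the example project), a job entrypoint can take every run-specific parameter on the command line, leaving Airflow to fill them in for each run:

```python
# Hypothetical standalone, parameterized job entrypoint.
# Airflow's only responsibility is to invoke this with per-run
# arguments (e.g. dates derived from the schedule); the job itself
# knows nothing about scheduling.
import argparse


def run_job(input_path: str, output_path: str, run_date: str) -> str:
    """Pretend ETL step: a real job would read input_path, transform
    the data for run_date, and write results to output_path."""
    return f"processed {input_path} for {run_date} into {output_path}"


def main(argv=None):
    parser = argparse.ArgumentParser(description="Example parameterized job")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--date", required=True, help="e.g. an Airflow logical date")
    args = parser.parse_args(argv)
    print(run_job(args.input, args.output, args.date))


# Simulated invocation, as Airflow (or a human) would run it:
main(["--input", "/tmp/in", "--output", "/tmp/out", "--date", "2022-04-01"])
# → processed /tmp/in for 2022-04-01 into /tmp/out
```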
GitLab Job Repository Setup
Python package setup
You must minimally have the following:
- A conda-environment.yaml file that specifies, at minimum, the Python version:
dependencies:
  - python=3.7
- A pip-installable project setup, e.g. pyproject.toml, setup.cfg, setup.py, etc. I.e. running
pip install .
in your project directory will work.
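For illustration, a minimal setup.cfg satisfying the pip-installable requirement might look like the following (package name and dependencies are placeholders, not from the example project):

```ini
# Hypothetical minimal setup.cfg, paired with a one-line setup.py shim
# (`from setuptools import setup; setup()`) so that `pip install .` works.
[metadata]
name = example_job_project
version = 0.1.0.dev

[options]
packages = find:
install_requires =
    requests
```

Note that the version ends in ".dev", matching the current_version expected by the .bumpversion.cfg example below.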
Optionally, to use automated releases, you should use bump2version to manage your package version. If you use setup.cfg to manage your package version, then you need a .bumpversion.cfg file as follows:
[bumpversion]
current_version = 0.1.0.dev
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z0-9]+))?
serialize =
{major}.{minor}.{patch}.{release}
{major}.{minor}.{patch}
[bumpversion:part:release]
optional_value = unused
values =
dev
unused
[bumpversion:file:setup.cfg]
search = version = {current_version}
replace = version = {new_version}
When you first create this file, make sure that current_version ends in ".dev", and that the version in setup.cfg matches it exactly.
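To see how the optional ".dev" suffix is handled, the parse regex from the .bumpversion.cfg example above can be exercised directly in Python (the regex is copied verbatim from the config; the rest of the snippet is illustrative):

```python
import re

# The `parse` pattern from .bumpversion.cfg: three numeric parts plus an
# optional alphanumeric release part.
PARSE = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z0-9]+))?"
)

# A development version carries the release part:
dev = PARSE.match("0.1.0.dev").groupdict()
# → {'major': '0', 'minor': '1', 'patch': '0', 'release': 'dev'}

# A released version has no release part, which is why `serialize`
# lists both {major}.{minor}.{patch}.{release} and {major}.{minor}.{patch}:
final = PARSE.match("0.1.0").groupdict()
# → {'major': '0', 'minor': '1', 'patch': '0', 'release': None}
```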
GitLab CI setup
If you are choosing to use automated release versioning, then your .gitlab-ci.yml file should contain the following.
# Include conda_artifact_repo.yml to add release and conda env publishing jobs.
include:
- project: 'repos/data-engineering/workflow_utils'
ref: v0.4.0
file: '/gitlab_ci_templates/pipelines/conda_artifact_repo.yml'
Alternatively, if you choose not to use automated release versioning, then you include just the publish_conda_env job directly:
# Include just the publish_conda_env job.
# This does not include automated releasing, so you will need to either manually
# run the publish_conda_env job, or manually push tags to trigger the
# publish_conda_env job.
include:
- project: 'repos/data-engineering/workflow_utils'
ref: v0.4.0
file: '/gitlab_ci_templates/jobs/publish_conda_env.yml'
Automated Release GitLab Project Setup
This is only needed if you choose to use automated releasing. You'll need to configure your GitLab project to allow GitLab CI to push commits. To do this, you need to create an ssh key pair, add it as a GitLab Deploy Key for your project, and then set variables in GitLab that allow use of this key from the GitLab CI Runners.
First, follow the GitLab Create a project deploy key instructions (make sure you create a passwordless ssh key). When saving this Deploy Key in your GitLab repository settings, make sure you check the Grant write permissions to this key box to allow the key to push commits.
Next, we need to allow the GitLab CI Runner access to the private key so it can push commits. The lib/git_ci.yml CI template that is used to configure git in a Runner uses GitLab variables to do this.
Go to Settings -> CI/CD -> Variables in your GitLab project and add the following 3 variables:
- CI_GIT_USER_EMAIL - An email address that will be used for git commits made by GitLab CI.
- CI_GIT_USER_USERNAME - The git username that will be used for git commits made by GitLab CI.
- CI_GIT_USER_SSH_PRIVATE_KEY - The ssh private key that you previously generated. This is the private part of the ssh key pair.
Job Repository Conda Env Artifact Publishing
Assuming you are using automated releases and you've followed all the setup instructions above, to publish a conda env job artifact you'll do the following.
Development
You can publish a .dev version of your conda env from any commit on the main branch. To do so, from a commit Pipeline, manually run the publish_conda_env job. This will publish a conda env to your project's Generic Package Registry.
Releasing
- Make sure the changes you want to deploy are merged into your main branch.
- Go to CI/CD -> Pipelines -> Run Pipeline (blue button in the upper right).
- The only variable here you might want to edit is POST_RELEASE_VERSION_BUMP. This is the part of the semantic version to bump after releasing. Allowed values are major, minor and patch. Default is minor.
- Click Run Pipeline at the bottom. This will launch a new pipeline for the latest commit on your main branch.
- Go to CI/CD -> Pipelines and click on the pipeline you just launched.
- Once any tests have finished, you should be able to manually run the trigger_release job. This job will: remove the .dev version, commit and tag, and then bump the version and make a new commit to main. After this is done, the new .dev version will have been bumped in main, and a tag will have been created and pushed to GitLab.
- The creation of a new tag in GitLab will automatically launch a new pipeline that builds and publishes a conda env artifact to your GitLab Project's Generic Package Registry. Go to CI/CD -> Pipelines and you should see a pipeline running for a tag commit titled something like 'Release version 0.15.0'. This is the pipeline that will make a GitLab release and publish the conda env.
Once the release tag pipeline finishes, you should have a new GitLab Release as well as a conda dist env artifact published in your project's Generic Package Registry.
Deploying your conda env artifact for use by Airflow
Go to Packages & Registries -> Package Registry and you should see a list of all the conda env artifacts. To deploy one so that Airflow can use it, declare the artifact in your airflow-dags instance artifact config file.
Example: I want to use the example-job-project 0.15.0 conda env artifact. At https://gitlab.wikimedia.org/repos/data-engineering/example-job-project/-/packages/113, I can copy the URL for the .tgz artifact file. I then use this URL when I declare the artifact in e.g. analytics/config/artifacts.yaml:
artifacts:
# ...
example-job-project-0.15.0.conda.tgz:
id: https://gitlab.wikimedia.org/repos/data-engineering/example-job-project/-/package_files/487/download
This will then allow me to use the airflow-dags dag_config.artifact to refer to this artifact by name in my DAG code:
# STILL WIP!
from airflow import DAG

from analytics.config import dag_config
from wmf_airflow_common.operators.spark import SparkSubmitOperator

with DAG(
    # ...
) as dag:
    etl = SparkSubmitOperator.for_virtualenv(
        # This will be translated to a cached URL (in HDFS) accessible by Airflow.
        # By default, the alias name of the extracted archive directory will be 'venv'.
        virtualenv_archive=dag_config.artifact('example-job-project-0.15.0.conda.tgz'),
        # This should be a relative path to the pyspark job entrypoint in the archive.
        # Note that this needs to end in .py if it is really a pyspark job!
        application='bin/pyspark_job_file.py',
    )
Spark and Conda
TODO
GitLab CI UI test integration
GitLab CI has the ability to integrate test coverage and reporting in its UI.
For pytest reporting, make sure your pytest job outputs a junitxml format report by adding a flag like --junitxml=junit_pytest_report.xml. Then, add a junit artifact to your test job that generates this file.
For coverage, add a --cov-report=xml flag to your pytest command. Then, add a cobertura artifact to your test job that generates this file.
Full example:
In your setup.cfg [tool:pytest] section, or in your pytest.ini file:
# Coverage and junit XML report formats are output for use with GitLab CI UI.
addopts = -svv --failed-first --cov-report=xml --cov-report=term --cov=example_job_project --junitxml=junit_pytest_report.xml tests example_job_project
Then, in your .gitlab-ci.yml file in your test job:
test:
stage: test
script:
# - pytest, tox, whatever you prefer.
# - ...
# Match coverage total from job log output.
# See: https://docs.gitlab.com/ee/ci/yaml/index.html#coverage
# This is what allows for use of the GitLab coverage badge.
coverage: '/^TOTAL.+?(\d+\%)$/'
# Add these artifacts to integrate with MR and Pipeline UIs.
artifacts:
when: always
reports:
# This shows test reports in the Pipeline test tab
junit: junit_pytest_report.xml
# This shows coverage information in Merge Request diffs.
cobertura: coverage.xml
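As a sanity check of the coverage: regex above, here is how it behaves against a typical pytest-cov terminal report (the sample log text is made up for illustration; re.MULTILINE emulates GitLab scanning the job log line by line):

```python
import re

# The regex from the `coverage:` key in .gitlab-ci.yml above.
COVERAGE_RE = re.compile(r"^TOTAL.+?(\d+\%)$", re.MULTILINE)

# Illustrative pytest-cov terminal output, as it would appear in a job log.
job_log = """
Name                     Stmts   Miss  Cover
--------------------------------------------
example_job_project.py     120     12    90%
--------------------------------------------
TOTAL                      120     12    90%
"""

match = COVERAGE_RE.search(job_log)
print(match.group(1))  # → 90%  (this is what the coverage badge displays)
```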