Analytics/Systems/Airflow

WIP documentation page.

Apache Airflow is a workflow job scheduler. Developers declare job workflows as DAGs (directed acyclic graphs) using Airflow's Python API.

This page documents the Data Engineering managed Airflow instances in the Analytics Cluster.

If you wish to develop DAGs with Airflow, you can find more information on the Airflow Developer guide page.
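For orientation, below is a minimal, hypothetical sketch of what a DAG declaration looks like with Airflow's stock Python API. The DAG id, schedule and task are made up; real DAGs in data-engineering/airflow-dags follow the conventions and shared helpers described in the Airflow Developer guide.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal, hypothetical DAG: a single daily task.
with DAG(
    dag_id="example_daily_job",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'hello from airflow'",
    )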

Airflow setup and conventions

The Data Engineering team maintains several Airflow instances. Usually, these instances are team specific. Teams have full control over their airflow instance. Data Engineering manages the tooling needed to deploy and run these instances.

As of 2021-11, these instances all live within the Analytics Cluster VLAN, and have access to Hadoop and other Analytics Cluster related tools. It is expected that the Airflow instances themselves do not perform real computation tasks; instead they should submit jobs to the Hadoop cluster. Airflow is used for the pipelining and scheduling of these jobs.
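To illustrate the point, here is a hypothetical sketch (not the exact operators used in airflow-dags, and it assumes the apache-spark Airflow provider package is installed) in which a task only submits a Spark application and waits for it, while the computation itself runs on the Hadoop cluster:

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="example_spark_submit",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The Airflow host only runs spark-submit; the heavy lifting happens on
    # the cluster. Application path and resources are illustrative.
    compute_metrics = SparkSubmitOperator(
        task_id="compute_metrics",
        application="hdfs:///wmf/artifacts/example/compute_metrics.py",
        conn_id="spark_default",
        executor_cores=2,
        executor_memory="4g",
    )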

airflow-dags repository

To develop best practices around Airflow, we use a single shared git repository for Airflow DAGs for all instances: data-engineering/airflow-dags. Airflow instance (and team) specific DAGs live in subdirectories of this repository, e.g. in <instance_name>/dags.

Deployment of airflow-dags

Each Airflow instance has its own scap deployment of data-engineering/airflow-dags. See Scap#Other_software_deployments for instructions on how to use scap to deploy.

Your airflow instance's airflow-dags scap deployment directory is located at /srv/deployment/airflow-dags/<instance_name> on the deployment server as well as on your airflow host. To deploy:

ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/<instance_name>
git pull # or checkout, do whatever you need to make this git clone ready for deployment
scap deploy

Airflow Instances

analytics

Airflow instance owned by the Data / Analytics engineering team.

Host: an-launcher1002.eqiad.wmnet
Service user: analytics
Web UI Port: 8600
Web UI Access: ssh -t -N -L8600:127.0.0.1:8600 an-launcher1002.eqiad.wmnet then browse to http://localhost:8600
Dags: airflow-dags/analytics/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics
git fetch && git rebase
scap deploy

analytics_test

Airflow test instance owned by the Data / Analytics engineering team.

Host: an-test-client1001.eqiad.wmnet
Service user: analytics
Web UI Port: 8600
Web UI Access: ssh -t -N -L8600:127.0.0.1:8600 an-test-client1001.eqiad.wmnet then browse to http://localhost:8600
Dags: airflow-dags/analytics_test/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics_test
git fetch && git rebase
scap deploy

search

TODO

research

Airflow instance owned by the Research team.

Host: an-airflow1002.eqiad.wmnet
Service user: analytics-research
Web UI Port: 8600
Web UI Access: ssh -t -N -L8600:127.0.0.1:8600 an-airflow1002.eqiad.wmnet then browse to http://localhost:8600
Dags: airflow-dags/research/dags
Dags deployment:
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/research
git fetch && git rebase
scap deploy

platform_eng

Airflow instance owned by the Platform Engineering team.

Host: an-airflow1003.eqiad.wmnet
Service user: analytics-platform-eng
Web UI Port: 8600
Web UI Access: ssh -t -N -L8600:127.0.0.1:8600 an-airflow1003.eqiad.wmnet then browse to http://localhost:8600
Dags: /srv/airflow-platform_eng/dags

Custom test instance

More at Analytics/Systems/Airflow/Airflow_testing_instance_tutorial

Administration

Overview of Data Engineering's Airflow deployments

Data Engineering maintains a Debian package for Airflow at operations/debs/airflow/. This package bundles a premade conda environment with all of the dependencies needed to run Airflow, and installs it to /usr/lib/airflow.

The airflow::instance Puppet define is used to set up and run Airflow instances. This define can be used multiple times on the same host to declare multiple airflow instances. The instance specific configs are installed in /srv/airflow-<instance_name>, and templated systemd units are set up for services airflow-scheduler@<instance_name> and airflow-webserver@<instance_name>.

The profile::airflow Puppet class uses the profile::airflow::instances hiera variable to declare airflow::instances. This allows each airflow::instance to be fully specified via hiera. profile::airflow applies Data Engineering conventions as defaults for each airflow::instance.

These defaults include setting up instance specific scap::targets of the data-engineering/airflow-dags repository. (There is still some manual setup needed for this, see the instructions below on how to configure this for new instances.) The Airflow instance's dags_folder will be automatically set to one of the instance specific subdirectories in the airflow-dags repository. (You can override this in hiera if you need.)

Creating a new Airflow Instance

In this example, we'll be creating a new Airflow instance named 'test'.

Prepare airflow-dags for deployment to the new instance

Create the instance specific dags folder

By convention, all Airflow team instances use the same DAGs repository: data-engineering/airflow-dags. Instance specific DAGs are located in the <instance-name>/dags directory. Unless you override defaults in puppet/hiera, this will be used as airflow's dags_folder.

Create this directory and commit the changes before proceeding. In our example, this directory would be test/dags, since 'test' is our instance name.

Create the instance specific scap repository

Scap requires that configuration be declared for each of its deployments. Because we use the same source DAGs repository for all airflow instances, we can't just add a scap.cfg file to the main airflow-dags repository. Instead, we use separately managed 'scap repositories' in which the deployment configuration for each instance is declared.

Create a new repository in gitlab with the name data-engineering/airflow-dags-scap-<instance_name>. For our example, we'll be creating data-engineering/airflow-dags-scap-test.

You'll need to create two files in this repository:

Create scap/scap.cfg with the following content:

[global]
git_repo: data-engineering/airflow-dags
# ssh_user must exist on the airflow host, and it must be in deploy_airflow.trusted_groups (see below).
ssh_user: test_user
dsh_targets: targets

And create a scap/targets file with the list of hostnames that will be deployed to. Likely this will be only your airflow host:

hostname1001.eqiad.wmnet

Create a scap deployment source

Scap is used to deploy the data-engineering/airflow-dags repository to airflow instances. Declaration of the scap::target will be taken care of for you by profile::airflow, but you will need to declare the scap::source for the deployment server.

Edit hieradata/role/common/deployment_server.yaml and add a new entry to scap::sources:

scap::sources:
  airflow-dags/test:
    repository: data-engineering/airflow-dags
    # This is the name of the scap repository we created in the previous step.
    scap_repository: data-engineering/airflow-dags-scap-test
    origin: gitlab

You'll also need to make sure that the (human) users who will deploy are able to do so: they must be in a POSIX group that has access to the deployment server, as well as in a group listed in this hiera config:

  # Shared deploy ssh key for Data Engineering maintained
  # Airflow instances. For now, all admins of Airflow instances
  # can deploy any Airflow instance.
  deploy_airflow:
    trusted_groups:
      - analytics-deployers
      # ...

Merge any changes and run puppet on the deployment server.

Create the Airflow MySQL Database

You'll need a running MariaDB instance somewhere. On it, create the database and user:

CREATE DATABASE airflow_test;
CREATE USER 'airflow_test' IDENTIFIED BY 'password_here';
GRANT ALL PRIVILEGES ON airflow_test.* TO 'airflow_test';

Make sure your MariaDB config sets explicit_defaults_for_timestamp = on. See: https://airflow.apache.org/docs/apache-airflow/2.1.0/howto/set-up-database.html#setting-up-a-mysql-database

Configure the Airflow instance in Puppet

Add the profile::airflow class to your node's role in Puppet and configure the Airflow instance(s) in your role's hiera.

Let's assume we're adding this instance in a role class role::airflow::test.

class role::airflow::test {
    include ::profile::airflow
    # profile::kerberos::keytabs is needed if your Airflow
    # instance needs to authenticate with Kerberos.
    # You'll need to create and configure the keytab for the Airflow instance's
    # $service_user we'll set below.
    include ::profile::kerberos::keytabs
}


Then, in hieradata/role/common/airflow/test.yaml:

# Set up airflow instances.
profile::airflow::instances:
  # airflow@test instance.
  test:
    # Since we set security: kerberos, a keytab must be deployed for the service_user.
    service_user: test_user
    service_group: test_group
    # Set this to true if you want to enable alerting for your airflow instance.
    monitoring_enabled: false
    # Configuration for /srv/airflow-test/airflow.cfg
    # Any airflow::instance configs can go here. See:
    # https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
    # NOTE: unless your airflow instance does special things, the defaults
    # set in profile::airflow should be sufficient for setting up a
    # WMF Data Engineering managed airflow::instance.
    #airflow_config:
    #  core:

# Make sure the keytab for test_user is deployed via profile::kerberos::keytabs
profile::kerberos::keytabs::keytabs_metadata:
  - role: 'test_user'
    owner: 'test_user'
    group: 'test_group'
    filename: 'test_user.keytab'

See Create_a_keytab_for_a_service for instructions on creating keytabs.

Note that we didn't set db_user or db_password. These are secrets and should be set in the operations puppet private repository in the hiera variable profile::airflow::instances_secrets. So, in puppet private in the hieradata/role/common/airflow/test.yaml file:

# Set up airflow instances.
profile::airflow::instances_secrets:
  # airflow@test instance.
  test:
    db_user: airflow_test
    db_password: password_here

profile::airflow::instances_secrets will be merged with profile::airflow::instances by the profile::airflow class, and the parameters to airflow::instance will be available for use in the sql_alchemy_conn setting, which is rendered as an ERB template.

Once this is merged and applied, the node with the role::airflow::test will run the systemd services airflow-scheduler@test, airflow-webserver@test, airflow-kerberos@test, as well as some 'control' systemd services airflow@test and airflow that can be used to manage the Airflow test instance.

Create the airflow tables by running:

 sudo -u test_user airflow-test db upgrade

The airflow services were probably already started by the earlier puppet run. Restart them now that the airflow tables are created properly.

 sudo systemctl restart airflow@test.service