WIP documentation page.

[https://airflow.apache.org/ Apache Airflow] is a workflow job scheduler.  Developers declare job workflows using a custom DAG python API.


This page documents the Data Engineering managed Airflow instances in the Analytics Cluster.

If you wish to develop DAGs with Airflow, you can find more information on the [[Analytics/Systems/Airflow/Developer guide|Airflow Developer guide]] page.
= Airflow setup and conventions =
The Data Engineering team maintains several Airflow instances.  Usually, these instances are team specific.  Teams have full control over their Airflow instance.  Data Engineering manages the tooling needed to deploy and run these instances.

As of 2021-11, these instances all live within the Analytics Cluster VLAN, and have access to Hadoop and other Analytics Cluster related tools.  It is expected that the Airflow instances themselves do not perform real computation tasks; instead they should submit jobs to the Hadoop cluster.  Airflow is used for the pipelining and scheduling of these jobs.
== airflow-dags repository ==
To develop best practices around Airflow, we use a single shared git repository for Airflow DAGs for all instances: [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags data-engineering/airflow-dags].  Airflow instance (and team) specific DAGs live in subdirectories of this repository, e.g. in <tt><instance_name>/dags</tt>.
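For example, after cloning the repository you can see the per-instance layout (a quick sketch; only two of the instance directories are shown):
<syntaxhighlight lang=bash>
# Clone the shared DAGs repository and list a couple of instance specific dags folders.
git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
ls airflow-dags/analytics/dags airflow-dags/analytics_test/dags
</syntaxhighlight>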
=== Deployment of airflow-dags ===
Each Airflow instance has its own scap deployment of data-engineering/airflow-dags.  See [[Scap#Other_software_deployments]] for instructions on how to use scap to deploy.

Your airflow instance's airflow-dags scap deployment directory is located at <tt>/srv/deployment/airflow-dags/<instance_name></tt> on the deployment server as well as on your airflow host.  To deploy:
<syntaxhighlight lang=bash>
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/<instance_name>
git pull # or checkout, do whatever you need to make this git clone ready for deployment
scap deploy
</syntaxhighlight>
= See also =
* [https://docs.google.com/document/d/1hp6JYVy3SLRgTx1BYfnNOCPk5VFJeZ4jMpxD8WJKVB0/edit Shared Airflow - Design Document]
* [[phab:T272973]]
* [[Analytics/Systems/Cluster/Workflow_management_tools_study]]
* [[phab:tag/airflow/|Phabricator project]]


= Airflow Instances =

== analytics ==
Airflow instance owned by the Data / Analytics engineering team.

{| class="wikitable"
|-
| Host || an-launcher1002.eqiad.wmnet
|-
| Service user || analytics
|-
| Web UI Port || 8600
|-
| Web UI Access || <code>ssh -t -N -L8600:127.0.0.1:8600 an-launcher1002.eqiad.wmnet</code> - http://localhost:8600
|-
| Dags || [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/analytics/dags airflow-dags/analytics/dags]
|-
| Dags deployment ||
<pre>
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics
git fetch && git rebase
scap deploy
</pre>
|}

== analytics_test ==
Airflow test instance owned by the Data / Analytics engineering team.

{| class="wikitable"
|-
| Host || an-test-client1001.eqiad.wmnet
|-
| Service user || analytics
|-
| Web UI Port || 8600
|-
| Web UI Access || <code>ssh -t -N -L8600:127.0.0.1:8600 an-test-client1001.eqiad.wmnet</code> - http://localhost:8600
|-
| Dags || [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/analytics_test/dags airflow-dags/analytics_test/dags]
|-
| Dags deployment ||
<pre>
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics_test
git fetch && git rebase
scap deploy
</pre>
|}


== search ==
TODO
== research ==
Airflow instance owned by the Research team.
{| class="wikitable"
|-
| Host || an-airflow1002.eqiad.wmnet
|-
| Service user || analytics-research
|-
| Web UI Port || 8600
|-
| Web UI Access || <code>ssh -t -N -L8600:127.0.0.1:8600 an-airflow1002.eqiad.wmnet</code> - http://localhost:8600
|-
| Dags || [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/research/dags airflow-dags/research/dags]
|-
| Dags deployment ||
<pre>
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/research
git fetch && git rebase
scap deploy
</pre>
|}
== platform_eng ==
Airflow instance owned by the Platform Engineering team.
{| class="wikitable"
|-
| Host || an-airflow1003.eqiad.wmnet
|-
| Service user || analytics-platform-eng
|-
| Web UI Port || 8600
|-
| Web UI Access || <code>ssh -t -N -L8600:127.0.0.1:8600 an-airflow1003.eqiad.wmnet</code> - http://localhost:8600
|-
| Dags || <tt>/srv/airflow-platform_eng/dags</tt>
|}
== Custom test instance ==
More at [[Analytics/Systems/Airflow/Airflow_testing_instance_tutorial]]


= Administration =
== Overview of Data Engineering's Airflow deployments ==
Data Engineering maintains a debian package for Airflow at [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/airflow/ operations/debs/airflow/].  This debian packaging installs a premade [https://docs.conda.io/en/latest/ conda] environment with all dependencies needed to run Airflow.  The debian package installs this conda environment to <tt>/usr/lib/airflow</tt>.

The <code>airflow::instance</code> Puppet define is used to set up and run Airflow instances.  This define can be used multiple times on the same host to declare multiple airflow instances.  The instance specific configs are installed in <tt>/srv/airflow-<instance_name></tt>, and templated systemd units are set up for the services <tt>airflow-scheduler@<instance_name></tt> and <tt>airflow-webserver@<instance_name></tt>.

The <code>profile::airflow</code> Puppet class uses the <code>profile::airflow::instances</code> hiera variable to declare <code>airflow::instance</code>s.  This allows each <code>airflow::instance</code> to be fully specified via hiera.  <code>profile::airflow</code> by default will use Data Engineering conventions as defaults for an <code>airflow::instance</code>.

These defaults include setting up instance specific <code>scap::target</code>s of the [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags data-engineering/airflow-dags] repository.  (There is still some manual setup needed for this; see the instructions below on how to configure this for new instances.)  The Airflow instance's <code>dags_folder</code> will be automatically set to one of the instance specific subdirectories in the airflow-dags repository.  (You can override this in hiera if you need to.)
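On an Airflow host, the pieces described above land in predictable places. A quick way to look around (shown here for the 'analytics' instance; substitute your own instance name):
<syntaxhighlight lang=bash>
# Instance specific configuration rendered by airflow::instance.
ls /srv/airflow-analytics

# The scap-deployed airflow-dags checkout (the instance's dags_folder points inside it).
ls /srv/deployment/airflow-dags/analytics

# The templated systemd units for this instance.
systemctl list-units 'airflow-*@analytics.service'
</syntaxhighlight>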


== Creating a new Airflow Instance ==
In this example, we'll be creating a new Airflow instance named 'test'.
=== Prepare airflow-dags for deployment to the new instance ===
==== Create the instance specific dags folder ====
By convention, all Airflow team instances use the same DAGs repository: [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags data-engineering/airflow-dags].  Instance specific DAGs are located in the <tt><instance_name>/dags</tt> directory.  Unless you override defaults in puppet/hiera, this will be used as airflow's <code>dags_folder</code>.

Create this directory and commit the changes before proceeding.  In our example, this directory would be <tt>test/dags</tt>, since 'test' is our instance name.
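For example, using the 'test' instance name (a sketch; the <code>.gitkeep</code> file is only a convention to keep the otherwise empty directory in git):
<syntaxhighlight lang=bash>
git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
cd airflow-dags
mkdir -p test/dags
touch test/dags/.gitkeep
git add test/dags
git commit -m 'Add dags directory for the test airflow instance'
# Push this change and get it merged before continuing.
</syntaxhighlight>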
==== Create the instance specific scap repository ====
[https://doc.wikimedia.org/mw-tools-scap/scap3/repo%20config.html Scap requires that configuration] is declared for each of its deployments.  Because we use the same source DAGs repository for all airflow instances, we can't just add the scap.cfg file to the main airflow-dags repository.  Instead, we use separately managed 'scap repositories' in which the deployment configuration is declared.

Create a new repository in gitlab with the name <tt>data-engineering/airflow-dags-scap-<instance_name></tt>.  For our example, we'll be creating <tt>data-engineering/airflow-dags-scap-test</tt>.

You'll need to create two files in this repository.

Create <tt>scap/scap.cfg</tt> with the following content:
<syntaxhighlight lang=text>
[global]
git_repo: data-engineering/airflow-dags
ssh_user: test_user # this user must exist on the airflow host and must be in deploy_airflow.trusted_groups (see below)
dsh_targets: targets
</syntaxhighlight>
And create a <tt>scap/targets</tt> file with the list of hostnames that will be deployed to.  Likely this will be only your airflow host.
<syntaxhighlight lang=text>
hostname1001.eqiad.wmnet
</syntaxhighlight>
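Putting the two files together, populating the new scap repository might look like this (a sketch; the clone URL assumes the repository was created under data-engineering in GitLab as described above):
<syntaxhighlight lang=bash>
git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-test.git
cd airflow-dags-scap-test
mkdir scap
# Add the scap.cfg content shown above, then list your airflow host(s) in targets.
$EDITOR scap/scap.cfg
echo 'hostname1001.eqiad.wmnet' > scap/targets
git add scap
git commit -m 'Add scap config for the test airflow instance'
git push
</syntaxhighlight>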
==== Create a scap deployment source ====
[[Scap]] is used to deploy the [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags data-engineering/airflow-dags] repository to airflow instances.  Declaration of <code>scap::target</code> will be taken care of for you by <code>profile::airflow</code>, but you will need to declare the <code>scap::source</code> for the deployment server.

Edit <tt>hieradata/role/common/deployment_server.yaml</tt> and add a new entry to <code>scap::sources</code>:
<syntaxhighlight lang="yaml">
scap::sources:
  airflow-dags/test:
    repository: data-engineering/airflow-dags
    # This is the name of the scap repository we created in the previous step.
    scap_repository: data-engineering/airflow-dags-scap-test
    origin: gitlab
</syntaxhighlight>
You'll also need to make sure that real users will be able to deploy.  They must be in a posix group that has access to the deployment server, as well as in a group listed in this hiera config:
<syntaxhighlight lang="yaml">
  # Shared deploy ssh key for Data Engineering maintained
  # Airflow instances. For now, all admins of Airflow instances
  # can deploy any Airflow instance.
  deploy_airflow:
    trusted_groups:
      - analytics-deployers
      # ...
</syntaxhighlight>
Merge any changes and run puppet on the deployment server.
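For example (a sketch; <code>run-puppet-agent</code> is the usual wrapper for running the Puppet agent on WMF hosts, and 'test' is the example instance name):
<syntaxhighlight lang=bash>
# On deployment.eqiad.wmnet, after the scap::sources change is merged:
sudo run-puppet-agent

# Puppet should have created the deployment directory for the new instance:
ls /srv/deployment/airflow-dags/test
</syntaxhighlight>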


=== Create the Airflow MySQL Database ===
You'll need a running MariaDB instance somewhere.
<syntaxhighlight lang=sql>
CREATE DATABASE airflow_test;
CREATE USER 'airflow_test' IDENTIFIED BY 'password_here';
GRANT ALL PRIVILEGES ON airflow_test.* TO 'airflow_test';
</syntaxhighlight>
Make sure your MariaDB config sets <code>explicit_defaults_for_timestamp = on</code>.  See: https://airflow.apache.org/docs/apache-airflow/2.1.0/howto/set-up-database.html#setting-up-a-mysql-database
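To sanity check the grants before pointing Airflow at the database, you can connect as the new user (a sketch; <tt>my-db-host.eqiad.wmnet</tt> is a placeholder for wherever your MariaDB instance actually runs):
<syntaxhighlight lang=bash>
# Connect as the new airflow_test user and confirm the database is visible.
mysql -h my-db-host.eqiad.wmnet -u airflow_test -p \
  -e 'SHOW DATABASES LIKE "airflow_test";'
</syntaxhighlight>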

=== Configure the Airflow instance in Puppet ===
Add the <code>profile::airflow</code> class to your node's role in Puppet and configure the Airflow instance(s) in your role's hiera.

Let's assume we're adding this instance in a role class <code>role::airflow::test</code>.
<syntaxhighlight lang=puppet>
class role::airflow::test {
    include ::profile::airflow
    # profile::kerberos::keytabs is needed if your Airflow
    # instance needs to authenticate with Kerberos.
    # You'll need to create and configure the keytab for the Airflow instance's
    # $service_user we'll set below.
    include ::profile::kerberos::keytabs
}
</syntaxhighlight>

Then, in <code>hieradata/role/common/airflow/test.yaml</code>:
<syntaxhighlight lang="yaml">
# Set up airflow instances.
profile::airflow::instances:
  # airflow@test instance.
  test:
    service_user: test_user
    service_group: test_group
    # Set this to true if you want to enable alerting for your airflow instance.
    monitoring_enabled: false
    # Configuration for /srv/airflow-test/airflow.cfg
    # Any airflow::instance configs can go here. See:
    # https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
    # NOTE: unless your airflow instance does special things, the defaults
    # set in profile::airflow should be sufficient for setting up a
    # WMF Data Engineering managed airflow::instance.
    #airflow_config:
    #  core:

# Make sure the keytab for test_user is deployed via profile::kerberos::keytabs
profile::kerberos::keytabs::keytabs_metadata:
  - role: 'test_user'
    owner: 'test_user'
    group: 'test_group'
    filename: 'test_user.keytab'
</syntaxhighlight>
See Create_a_keytab_for_a_service for instructions on creating keytabs.

Note that we didn't set <code>db_user</code> or <code>db_password</code>.  These are secrets and should be set in the operations puppet private repository in the hiera variable <code>profile::airflow::instances_secrets</code>.  So, in puppet private in the <code>hieradata/role/common/airflow/test.yaml</code> file:
<syntaxhighlight lang="yaml">
# Set up airflow instances.
profile::airflow::instances_secrets:
  # airflow@test instance.
  test:
    db_user: airflow_test
    db_password: password_here
</syntaxhighlight>
<code>profile::airflow::instances_secrets</code> will be merged with <code>profile::airflow::instances</code> by the <code>profile::airflow</code> class, and the parameters to <code>airflow::instance</code> will be available for use in the <code>sql_alchemy_conn</code> as an ERb template.

Once this is merged and applied, the node with the <code>role::airflow::test</code> role will run the systemd services <code>airflow-scheduler@test</code>, <code>airflow-webserver@test</code> and <code>airflow-kerberos@test</code>, as well as some 'control' systemd services <code>airflow@test</code> and <code>airflow</code> that can be used to manage the Airflow test instance.

Create the airflow tables by running
  sudo -u test_user airflow-test db upgrade

The airflow services were probably already started by the earlier puppet run.  Restart them now that the airflow tables are created properly.
  sudo systemctl restart airflow@test.service
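After the restart, a quick sanity check (a sketch following the 'test' example above; this assumes the <tt>airflow-test</tt> wrapper forwards arbitrary Airflow CLI subcommands, as it does for <tt>db upgrade</tt>):
<syntaxhighlight lang=bash>
# Check the per-instance systemd units declared by airflow::instance.
systemctl status airflow-scheduler@test airflow-webserver@test

# Listing DAGs exercises the connection to the instance's metadata database;
# an empty list (rather than a connection error) means the DB config works.
sudo -u test_user airflow-test dags list
</syntaxhighlight>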
