[https://airflow.apache.org/ Apache Airflow] is a workflow job scheduler.  Developers declare job workflows as DAGs (directed acyclic graphs) using Airflow's Python API.
 
This page documents the Data Engineering managed Airflow instances in the Analytics Cluster.
 
= Airflow setup and conventions =
The Data Engineering team maintains several Airflow instances.  Usually, these instances are team specific, and teams have full control over their Airflow instance.  Data Engineering manages the tooling needed to deploy and run these instances.
 
As of 2021-11, these instances all live within the Analytics Cluster VLAN, and have access to Hadoop and other Analytics Cluster related tools.  It is expected that the Airflow instances themselves do not perform real computation tasks; instead they should submit jobs to the Hadoop cluster.  Airflow is used for the pipelining and scheduling of these jobs.
 
== airflow-dags repository ==
To develop best practices around Airflow, we use a single shared git repository for Airflow DAGs for all instances: [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags data-engineering/airflow-dags].  Airflow instance (and team) specific DAGs live in subdirectories of this repository, e.g. in <tt><instance_name>/dags</tt>.
 
=== Deployment of airflow-dags ===
Each Airflow instance has its own scap deployment of data-engineering/airflow-dags.  See [[Scap#Other_software_deployments]] for instructions on how to use scap to deploy.
 
Your airflow instance's airflow-dags scap deployment directory is located at <tt>/srv/deployment/airflow-dags/<instance_name></tt> on the deployment server as well as on your airflow host.  To deploy:
 
<syntaxhighlight lang=bash>
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/<instance_name>
git pull # or checkout, do whatever you need to make this git clone ready for deployment
scap deploy
</syntaxhighlight>
 
= See also =
* [https://docs.google.com/document/d/1hp6JYVy3SLRgTx1BYfnNOCPk5VFJeZ4jMpxD8WJKVB0/edit Shared Airflow - Design Document]
* [[phab:T272973]]
* [[Analytics/Systems/Cluster/Workflow_management_tools_study]]
* [[phab:tag/airflow/|Phabricator project]]
 
= Airflow Instances =
 
== analytics ==
Airflow instance owned by the Data / Analytics engineering team.
 
{| class="wikitable"
|-
| Host || an-launcher1002.eqiad.wmnet
|-
| Service user || analytics
|-
| Web UI Port || 8600
|-
| Web UI Access || <code>ssh -t -N -L8600:127.0.0.1:8600 an-launcher1002.eqiad.wmnet</code> - http://localhost:8600
|-
| Dags || [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/analytics/dags airflow-dags/analytics/dags]
|-
| Dags deployment ||
 
<pre>
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics
git fetch && git rebase
scap deploy
</pre>
|}
 
== analytics_test ==
Airflow test instance owned by the Data / Analytics engineering team.
 
{| class="wikitable"
|-
| Host || an-test-client1001.eqiad.wmnet
|-
| Service user || analytics
|-
| Web UI Port || 8600
|-
| Web UI Access || <code>ssh -t -N -L8600:127.0.0.1:8600 an-test-client1001.eqiad.wmnet</code> - http://localhost:8600
|-
| Dags || [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/analytics_test/dags airflow-dags/analytics_test/dags]
|-
| Dags deployment ||
 
<pre>
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/analytics_test
git fetch && git rebase
scap deploy
</pre>
|}
 
== search ==
TODO
 
== research ==
Airflow instance owned by the Research team.
 
{| class="wikitable"
|-
| Host || an-airflow1002.eqiad.wmnet
|-
| Service user || analytics-research
|-
| Web UI Port || 8600
|-
| Web UI Access || <code>ssh -t -N -L8600:127.0.0.1:8600 an-airflow1002.eqiad.wmnet</code> - http://localhost:8600
|-
| Dags || [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/research/dags airflow-dags/research/dags]
|-
| Dags deployment ||
 
<pre>
ssh deployment.eqiad.wmnet
cd /srv/deployment/airflow-dags/research
git fetch && git rebase
scap deploy
</pre>
|}
 
== platform_eng ==
Airflow instance owned by the Platform Engineering team.
 
{| class="wikitable"
|-
| Host || an-airflow1003.eqiad.wmnet
|-
| Service user || analytics-platform-eng
|-
| Web UI Port || 8600
|-
| Web UI Access || <code>ssh -t -N -L8600:127.0.0.1:8600 an-airflow1003.eqiad.wmnet</code> - http://localhost:8600
|-
| Dags || <tt>/srv/airflow-platform_eng/dags</tt>
|}
 
= Administration =
 
== Overview of Data Engineering's Airflow deployments ==
 
Data Engineering maintains a Debian package for Airflow at [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/airflow/ operations/debs/airflow/].  This Debian packaging installs a premade [https://docs.conda.io/en/latest/ conda] environment with all dependencies needed to run Airflow.  The package installs this conda environment to <tt>/usr/lib/airflow</tt>.
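
Assuming the standard conda environment layout (an executable <tt>bin/</tt> directory inside the environment), the Airflow CLI shipped by the package can be invoked directly from that path.  A minimal sketch:

<syntaxhighlight lang=bash>
# The conda environment installed by the Debian package lives at /usr/lib/airflow.
# Assuming the usual conda env layout, the packaged Airflow CLI is under bin/:
/usr/lib/airflow/bin/airflow version
</syntaxhighlight>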
 
The <code>airflow::instance</code> Puppet define is used to set up and run Airflow instances.  This define can be used multiple times on the same host to declare multiple airflow instances.  The instance specific configs are installed in <tt>/srv/airflow-<instance_name></tt>, and templated systemd units are set up for services <tt>airflow-scheduler@<instance_name></tt> and <tt>airflow-webserver@<instance_name></tt>.
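
For example, on a host running an instance named <tt>analytics</tt>, the templated units can be inspected like this (a sketch; the unit names simply follow the pattern above):

<syntaxhighlight lang=bash>
# List the templated Airflow units for the 'analytics' instance
systemctl list-units 'airflow-*@analytics.service'

# Check the scheduler and webserver services for that instance
sudo systemctl status airflow-scheduler@analytics
sudo systemctl status airflow-webserver@analytics
</syntaxhighlight>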
 
The <code>profile::airflow</code> Puppet class uses the <code>profile::airflow::instances</code> hiera variable to declare <code>airflow::instance</code>s.  This allows each <code>airflow::instance</code> to be fully specified via hiera.  <code>profile::airflow</code> by default will use Data Engineering conventions as defaults for an <code>airflow::instance</code>.
 
These defaults include setting up instance specific <code>scap::target</code>s of the [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags data-engineering/airflow-dags] repository.  (There is still some manual setup needed for this, see the instructions below on how to configure this for new instances.)  The Airflow instance's <code>dags_folder</code> will be automatically set to one of the instance specific subdirectories in the airflow-dags repository.  (You can override this in hiera if you need.)
 
== Creating a new Airflow Instance ==
In this example, we'll be creating a new Airflow instance named 'test'.
 
=== Prepare airflow-dags for deployment to the new instance ===
 
==== Create the instance specific dags folder ====
By convention, all Airflow team instances use the same DAGs repository: [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags data-engineering/airflow-dags].  Instance specific DAGs are located in the <tt><instance-name>/dags</tt> directory.  Unless you override defaults in puppet/hiera, this will be used as airflow's <code>dags_folder</code>. 
 
Create this directory and commit the changes before proceeding.  In our example, this directory would be <tt>test/dags</tt>, since 'test' is our instance name.
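
A minimal sketch of that step (the <tt>.gitkeep</tt> placeholder file is just a hypothetical way to make git track the otherwise empty directory):

<syntaxhighlight lang=bash>
# In a local clone of data-engineering/airflow-dags:
mkdir -p test/dags
touch test/dags/.gitkeep  # hypothetical placeholder so git tracks the empty directory
git add test/dags
git commit -m "Add dags folder for the test Airflow instance"
git push
</syntaxhighlight>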
 
==== Create the instance specific scap repository ====
[https://doc.wikimedia.org/mw-tools-scap/scap3/repo%20config.html Scap requires configuration] to be declared for each of its deployments.  Because we use the same source DAGs repository for all airflow instances, we can't just add the scap.cfg file to the main airflow-dags repository.  Instead, we use separately managed 'scap repositories' in which the deployment configuration is declared.
 
Create a new repository in gitlab with the name <tt>data-engineering/airflow-dags-scap-<instance_name></tt>.  For our example, we'll be creating <tt>data-engineering/airflow-dags-scap-test</tt>.
 
You'll need to create two files in this repository:
 
Create <tt>scap/scap.cfg</tt> with the following content:
<syntaxhighlight lang=text>
[global]
git_repo: data-engineering/airflow-dags
ssh_user: test_user  # this user must exist on the airflow host, and it must be in deploy_airflow.trusted_groups (see below)
dsh_targets: targets
</syntaxhighlight>
 
And create a <tt>scap/targets</tt> file with the list of hostnames that will be deployed to.  Likely this will be only your airflow host.
<syntaxhighlight lang=text>
hostname1001.eqiad.wmnet
</syntaxhighlight>
 
==== Create a scap deployment source ====
[[Scap]] is used to deploy the [https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags data-engineering/airflow-dags] repository to airflow instances.  Declaration of the <code>scap::target</code> will be taken care of for you by <code>profile::airflow</code>, but you will need to declare the <code>scap::source</code> for the deployment server.
 
Edit <tt>hieradata/role/common/deployment_server.yaml</tt> and add a new entry to <code>scap::sources</code>:
 
<syntaxhighlight lang="yaml">
scap::sources:
    airflow-dags/test:
    repository: data-engineering/airflow-dags
    # This is the name of the scap repository we created in the previous step.
    scap_repository: data-engineering/airflow-dags-scap-test
    origin: gitlab
</syntaxhighlight>
 
You'll also need to make sure that the users who will deploy are able to do so.  They must be in a POSIX group that has access to the deployment server, as well as in a group listed in this hiera config:
<syntaxhighlight lang="yaml">
  # Shared deploy ssh key for Data Engineering maintained
  # Airflow instances. For now, all admins of Airflow instances
  # can deploy any Airflow instance.
  deploy_airflow:
    trusted_groups:
      - analytics-deployers
      # ...
</syntaxhighlight>
 
Merge any changes and run puppet on the deployment server.
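
A sketch of that step (<code>run-puppet-agent</code> is the usual wrapper for running Puppet on WMF hosts; the <code>ls</code> just confirms that the scap deployment directory described earlier was created):

<syntaxhighlight lang=bash>
# On the deployment server, after the hiera change is merged:
sudo run-puppet-agent

# The scap deployment directory for the new instance should now exist:
ls /srv/deployment/airflow-dags/test
</syntaxhighlight>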
 
=== Create the Airflow MySQL Database ===
 
You'll need a running MariaDB instance somewhere.
 
<syntaxhighlight lang="sql">
CREATE DATABASE airflow_test;
CREATE USER 'airflow_test' IDENTIFIED BY 'password_here';
GRANT ALL PRIVILEGES ON airflow_test.* TO 'airflow_test';
</syntaxhighlight>
 
Make sure your MariaDB config sets <code>explicit_defaults_for_timestamp = on</code>.  See [https://airflow.apache.org/docs/apache-airflow/2.1.0/howto/set-up-database.html#setting-up-a-mysql-database Setting up a MySQL Database] in the Airflow documentation.
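
A quick way to check the setting on the MariaDB instance (a sketch):

<syntaxhighlight lang=bash>
# Should report explicit_defaults_for_timestamp = ON
sudo mysql -e "SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp';"
</syntaxhighlight>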
 
=== Configure the Airflow instance in Puppet ===
Add the <code>profile::airflow</code> class to your node's role in Puppet and configure the Airflow instance(s) in your role's hiera.
 
Let's assume we're adding this instance in a role class <code>role::airflow::test</code>.
<syntaxhighlight lang="puppet">
class role::airflow::test {
    include ::profile::airflow
    # profile::kerberos::keytabs is needed if your Airflow
    # instance needs to authenticate with Kerberos.
    # You'll need to create and configure the keytab for the Airflow instance's
    # $service_user we'll set below.
    include ::profile::kerberos::keytabs
}
</syntaxhighlight>
 
 
Then, in <code>hieradata/role/common/airflow/test.yaml</code>:
<syntaxhighlight lang="yaml">
# Set up airflow instances.
profile::airflow::instances:
  # airflow@test instance.
  test:
    # Since we set security: kerberos a keytab must be deployed for the service_user.
    service_user: test_user
    service_group: test_group
    # Set this to true if you want to enable alerting for your airflow instance.
    monitoring_enabled: false
    # Configuration for /srv/airflow-test/airflow.cfg
    # Any airflow::instance configs can go here. See:
    # https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
    # NOTE: unless your airflow instance does special things, the defaults
    # set in profile::airflow should be sufficient for setting up a
    # WMF Data Engineering managed airflow::instance.
    #airflow_config:
    #  core:
 
# Make sure the keytab for test_user is deployed via profile::kerberos::keytabs
profile::kerberos::keytabs::keytabs_metadata:
  - role: 'test_user'
    owner: 'test_user'
    group: 'test_group'
    filename: 'test_user.keytab'
</syntaxhighlight>
 
See [[Analytics/Systems/Kerberos#Create_a_keytab_for_a_service|Create_a_keytab_for_a_service]] for instructions on creating keytabs.
 
Note that we didn't set <code>db_user</code> or <code>db_password</code>.  These are secrets and should be set in the [[Puppet#Private_puppet|operations puppet private repository]] in the hiera variable <code>profile::airflow::instances_secrets</code>.  So, in puppet private in the <code>hieradata/role/common/airflow/test.yaml</code> file:
 
<syntaxhighlight lang="yaml">
# Set up airflow instances.
profile::airflow::instances_secrets:
  # airflow@test instance.
  test:
    db_user: airflow_test
    db_password: password_here
 
</syntaxhighlight>
 
<code>profile::airflow::instances_secrets</code> will be merged with <code>profile::airflow::instances</code> by the <code>profile::airflow</code> class, and the parameters to <code>airflow::instance</code> will be available for use in <code>sql_alchemy_conn</code>, which is rendered as an ERB template.
 
Once this is merged and applied, the node with the <code>role::airflow::test</code> will run the systemd services <code>airflow-scheduler@test</code>, <code>airflow-webserver@test</code>, <code>airflow-kerberos@test</code>, as well as some 'control' systemd services <code>airflow@test</code> and <code>airflow</code> that can be used to manage the Airflow test instance.
 
Create the airflow tables by running
  sudo -u test_user airflow-test db upgrade
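
You can then verify that Airflow can reach its metadata database (a sketch; <code>db check</code> is a standard Airflow 2 CLI subcommand):
  sudo -u test_user airflow-test db check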
 
The airflow services were probably already started by the earlier puppet run.  Restart them now that the airflow tables are created properly.
  sudo systemctl restart airflow@test.service
