WIP documentation page.

See also:
* https://phabricator.wikimedia.org/T272973
* [[Analytics/Systems/Cluster/Workflow_management_tools_study]]

= Airflow Instances =

== analytics ==

Airflow instance owned by the Data / Analytics engineering team.

{| class="wikitable"
|-
| Host || an-launcher1002.eqiad.wmnet
|-
| Web UI Port || 8600
|-
| Dags || [https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/airflow/dags refinery/airflow/dags]
|-
| Service user || analytics
|}

SSH Tunnel to Web UI:

 ssh -t -N -L8600:127.0.0.1:8600 an-launcher1002.eqiad.wmnet

and navigate to http://localhost:8600
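
To confirm the tunnel is up before opening a browser, you can request the UI over the tunnel first (curl is assumed to be available on your workstation):

 # Expect an HTTP response from the Airflow webserver
 curl -I http://localhost:8600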

== analytics-test ==

Airflow test instance owned by the Data / Analytics engineering team.

{| class="wikitable"
|-
| Host || an-test-client1001.eqiad.wmnet
|-
| Web UI Port || 8600
|-
| Dags || <tt>/srv/airflow-analytics-test-dags</tt>
|-
| Service user || analytics
|}

SSH Tunnel to Web UI:

 ssh -t -N -L8600:127.0.0.1:8600 an-test-client1001.eqiad.wmnet

and navigate to http://localhost:8600

== search ==

TODO

== research ==

Airflow instance owned by the Research team.

{| class="wikitable"
|-
| Host || an-airflow1002.eqiad.wmnet
|-
| Web UI Port || 8600
|-
| Dags || <tt>/srv/airflow-research/dags</tt>
|-
| Service user || analytics-research
|}

SSH Tunnel to Web UI:

 ssh -t -N -L8600:127.0.0.1:8600 an-airflow1002.eqiad.wmnet

and navigate to http://localhost:8600

== platform_eng ==

Airflow instance owned by the Platform Engineering team.

{| class="wikitable"
|-
| Host || an-airflow1003.eqiad.wmnet
|-
| Web UI Port || 8600
|-
| Dags || <tt>/srv/airflow-platform_eng/dags</tt>
|-
| Service user || analytics-platform-eng
|}

SSH Tunnel to Web UI:

 ssh -t -N -L8600:127.0.0.1:8600 an-airflow1003.eqiad.wmnet

and navigate to http://localhost:8600

= Administration =

== Creating a new Airflow Instance ==

In this example, we'll be creating a new Airflow instance named 'test'.

=== Create the Airflow MySQL Database ===

You'll need a running MariaDB instance somewhere.

<syntaxhighlight lang="sql">
CREATE DATABASE airflow_test;
CREATE USER 'airflow_test' IDENTIFIED BY 'password_here';
GRANT ALL PRIVILEGES ON airflow_test.* TO 'airflow_test';
</syntaxhighlight>

Make sure your MariaDB config sets <code>explicit_defaults_for_timestamp = on</code>. See: https://airflow.apache.org/docs/apache-airflow/2.1.0/howto/set-up-database.html#setting-up-a-mysql-database
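
A quick way to verify the setting (a minimal check; assumes you can run the mysql client as an administrative user on the database host):

 # Should report: explicit_defaults_for_timestamp | ON
 sudo mysql -e "SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp';"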

=== Configure the Airflow instance in Puppet ===

Add the <code>profile::airflow</code> class to your node's role in Puppet and configure the Airflow instance(s) in your role's hiera.

Let's assume we're adding this instance in a role class <code>role::airflow::test</code>.

<syntaxhighlight lang="puppet">
class role::airflow::test {
    include ::profile::airflow
    # profile::kerberos::keytabs is needed if your Airflow
    # instance needs to authenticate with Kerberos.
    # You'll need to create and configure the keytab for the Airflow instance's
    # $service_user we'll set below.
    include ::profile::kerberos::keytabs
}
</syntaxhighlight>

Then, in <code>hieradata/role/common/airflow/test.yaml</code>:

<syntaxhighlight lang="yaml">
# Set up airflow instances.
profile::airflow::instances:
  # airflow@test instance.
  test:
    # Since we set security: kerberos, a keytab must be deployed for the service_user.
    service_user: test_user
    service_group: test_group
    # Set this to true if you want to enable alerting for your airflow instance.
    monitoring_enabled: false
    # Configuration for /srv/airflow-test/airflow.cfg
    # Any airflow configs can go here. See:
    # https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#webserver
    airflow_config:
      core:
        security: kerberos # you don't need to set this if you don't use Kerberos.
        executor: LocalExecutor
        # This can be an ERB template that will be rendered in airflow::instance.
        # db_user and db_password params should be set in puppet private
        # in profile::airflow::instances_secrets.
        sql_alchemy_conn: mysql://<%= @db_user %>:<%= @db_password %>@my-db-host.eqiad.wmnet/airflow_test?ssl_ca=/etc/ssl/certs/Puppet_Internal_CA.pem

# Make sure the keytab for test_user is deployed via profile::kerberos::keytabs
profile::kerberos::keytabs::keytabs_metadata:
  - role: 'test_user'
    owner: 'test_user'
    group: 'test_group'
    filename: 'test_user.keytab'
</syntaxhighlight>

See Create_a_keytab_for_a_service for instructions on creating keytabs.
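
To confirm the keytab was deployed, you can list its principals (an illustrative check; the deployment path shown here is an assumption and depends on how the keytabs profile lays out files):

 # List principals in the deployed keytab (path is an assumption)
 sudo klist -kt /etc/security/keytabs/test_user/test_user.keytab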

Note that we didn't set <code>db_user</code> or <code>db_password</code>. These are secrets and should be set in the operations puppet private repository in the hiera variable <code>profile::airflow::instances_secrets</code>. So, in puppet private in the <code>hieradata/role/common/airflow/test.yaml</code> file:

<syntaxhighlight lang="yaml">
# Set up airflow instances.
profile::airflow::instances_secrets:
  # airflow@test instance.
  test:
    db_user: airflow_test
    db_password: password_here
</syntaxhighlight>

<code>profile::airflow::instances_secrets</code> will be merged with <code>profile::airflow::instances</code> by the <code>profile::airflow</code> class, and the parameters to <code>airflow::instance</code> will be available for use in <code>sql_alchemy_conn</code> as an ERB template.

Once this is merged and applied, the node with the <code>role::airflow::test</code> role will run the systemd services <code>airflow-scheduler@test</code>, <code>airflow-webserver@test</code>, and <code>airflow-kerberos@test</code>, as well as some 'control' systemd services <code>airflow@test</code> and <code>airflow</code> that can be used to manage the Airflow test instance.
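
You can inspect these units with systemd's standard tooling, for example:

 # List all airflow-related units on the host
 systemctl list-units 'airflow*'
 # Check the scheduler for the test instance
 sudo systemctl status airflow-scheduler@test.service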

Create the airflow tables by running:

 sudo -u test_user airflow-test db upgrade

The airflow services were probably already started by the earlier puppet run. Restart them now that the airflow tables are created properly.

 sudo systemctl restart airflow@test.service
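
After the restart, you can sanity-check the instance. A minimal check, using the same airflow-test wrapper as above (airflow db check and airflow dags list are standard Airflow 2 CLI subcommands):

 # Verify the instance can reach its metadata database
 sudo -u test_user airflow-test db check
 # List the DAGs the scheduler has picked up
 sudo -u test_user airflow-test dags list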