
Analytics/Systems/Airflow/Airflow testing instance tutorial


This page explains how to create your own Airflow instance on a stats machine. You can use it to test the DAGs you are developing before merging them into the code base.

Creating your own Airflow instance

1. All steps in this tutorial assume you are logged in to your preferred stats machine via SSH.

ssh stat1007.eqiad.wmnet

2. Also, make sure your Kerberos authentication ticket stays fresh at all times. You'll only be able to execute tests in Airflow for as long as your ticket is valid, so consider renewing it before long tests.

kinit

Installing Airflow

1. Make sure you have a dedicated directory for Airflow in your home folder. It should contain a subfolder named dags, where you will put your DAG files.

mkdir -p ~/airflow/dags

2. Set the environment variable AIRFLOW_HOME to your Airflow folder. This tells Airflow where to set up its configuration and database files, and where to find your DAG files.

export AIRFLOW_HOME=~/airflow

3. Change directory to your Airflow folder.

cd ~/airflow

4. Create a Python virtual environment. This lets you install all the Python packages Airflow requires without affecting the system Python or other projects.

python3 -m venv venv/

5. Activate the Python virtual environment. This sets some environment variables that control which Python executable and packages are used, and changes your command-line prompt.

source venv/bin/activate

6. Make sure the https_proxy environment variable is set so that pip can download Python packages from the internet (stats machines only reach the internet through the web proxy).

export https_proxy=http://webproxy.eqiad.wmnet:8080

7. Install all required Python packages. Note that Airflow is installed together with its HDFS, Hive and Kerberos extras. The flask-admin version must be pinned to 1.4.0, because newer versions break when spinning up the Airflow web server (as of 2020-04-15).

pip install wheel
pip install hmsclient
pip install apache-airflow[hdfs,hive,kerberos]
pip install flask-admin==1.4.0
pip install pyarrow

8. Execute Airflow's db init command. Airflow will create a SQLite database file, a logs folder and a config file, all under your Airflow directory. The installation is finished at this point.

airflow db init


Configuring Airflow

1. If you just installed Airflow following the previous section, skip this step. Otherwise, make sure your environment is set up correctly.

export AIRFLOW_HOME=~/airflow
cd ~/airflow
source venv/bin/activate

2. Obtain your Kerberos credentials cache path and principal. Run Kerberos' klist command: the path to your credentials cache appears after Ticket cache: FILE:, and your principal after Default principal:.

klist
# copy your credentials cache path and default principal
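If you want to script this step, both values can be extracted from klist's output. The helper below is a small sketch; the sample output it parses is illustrative only (your cache path and principal will differ):

```python
import re

def parse_klist(output):
    """Extract (credentials cache path, principal) from `klist` output."""
    cache = re.search(r"Ticket cache: FILE:(\S+)", output)
    principal = re.search(r"Default principal: (\S+)", output)
    if not cache or not principal:
        raise ValueError("unexpected klist output")
    return cache.group(1), principal.group(1)

# Illustrative klist output -- the path and principal here are made up.
sample = (
    "Ticket cache: FILE:/tmp/krb5cc_20123\n"
    "Default principal: jdoe@WIKIMEDIA\n"
)
print(parse_klist(sample))
```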

3. Edit ~/airflow/airflow.cfg and assign the following configuration values.

# under the [core] section
load_examples = False
security = kerberos
# under the [kerberos] section
ccache = <your credentials cache path>
principal = <your service principal>
reinit_frequency = 3600
kinit_path = kinit 
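With the values from the klist step filled in, the edited sections of airflow.cfg might look like this (the cache path and principal below are placeholders, not real values):

```ini
[core]
load_examples = False
security = kerberos

[kerberos]
ccache = /tmp/krb5cc_20123
principal = jdoe@WIKIMEDIA
reinit_frequency = 3600
kinit_path = kinit
```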

4. The Hive metastore configurations need to be set from the Airflow UI. For that, spin up the Airflow web server. Use another port if necessary.

airflow webserver -p 8080

5. On your local machine, create an SSH tunnel to the stats machine where you are running Airflow. Use the port that you specified when launching the web server.

ssh stat1007.eqiad.wmnet -L 8080:stat1007.eqiad.wmnet:8080

6. Open http://localhost:8080/connection/list/ (change port if needed) on your browser, and click on the edit button for the connection with Conn Id = metastore_default and Conn Type = hive_metastore. Set the following configurations and save changes.

Host = 10.64.21.104
Port = 9083
Extra = {"authMechanism": "GSSAPI"}

7. You can now stop the Airflow web server on your stats machine. Airflow is now configured to access Hive. The steps followed so far don't need to be repeated: whenever you want to test an Airflow DAG, just jump to the next section.


Executing a DAG

1. If you just configured Airflow following the previous section, skip this step. Otherwise, make sure your environment is set up correctly.

export AIRFLOW_HOME=~/airflow
cd ~/airflow
source venv/bin/activate

2. Run the Airflow web server inside a screen/tmux session. This spins up the Airflow UI. Use another port if necessary.

screen -S airflow_webserver
airflow webserver -p 8080

3. Run the Airflow scheduler inside a screen/tmux session. This spins up the service that executes the DAGs.

screen -S airflow_scheduler
airflow scheduler

4. On your local machine, create an SSH tunnel to the stats machine where you are running Airflow. Use the port that you specified when launching the web server. After that, you should see Airflow's UI when you open http://localhost:8080/ (change the port if needed) in your browser.

ssh stat1007.eqiad.wmnet -L 8080:stat1007.eqiad.wmnet:8080

5. To add a new DAG to your Airflow instance, just scp the DAG's Python file into the dags folder on the corresponding stats machine.

scp dagFile.py stat1007.eqiad.wmnet:airflow/dags/dagFile.py

6. After a bit, you should see your new DAG under the DAGs tab in the Airflow UI (refresh the page). By default, new DAGs are turned off in Airflow, so for your DAG to run you need to turn it on using the ON/OFF toggle in the Airflow UI. You can access the DAG execution logs via the Airflow UI as well: open the detail page of your DAG and select the Tree View. You'll see small colored boxes that represent the executions of each of your DAG's tasks; click on one to access its logs.