
Analytics/Systems/Airflow/Instances

Revision as of 14:53, 26 April 2022 by imported>Mforns (Initial edit)

WMF's Airflow system is composed of several Airflow instances. Each instance schedules and orchestrates the jobs belonging to a particular grouping. For example, the analytics instance schedules jobs that generate and process analytics data sets, and the research instance orchestrates jobs that process research-related data sets. Usually, each Airflow instance is managed by a single WMF team: the analytics instance, for example, is managed by the Data Engineering team, which has developed most of its jobs. However, an Airflow instance can also be shared by several teams, and one team can take part in developing jobs in multiple Airflow instances.

Multi-instance vs. single instance

During the development of WMF's Airflow system, we discussed whether to use a single-instance approach or a multi-instance approach. Each approach has advantages and disadvantages. This thread contains most of the arguments we discussed, which include the following:

Single instance
Pros: Single configuration and no custom stacks for teams, and thus easy upgrades and maintenance.
Cons: Airflow does not support Kerberos multitenancy (yet), so a single instance would require all WMF jobs to access Hadoop with the same Kerberos credentials, which would not allow for access control or specific permissions.

Multi-instance
Pros: No single point of failure: if a team deploys code that breaks Airflow services, the other instances continue working. Teams have more independence when deploying.
Cons: When doing maintenance, the Data Engineering team has to wrangle multiple Airflow instances to stop jobs.

We, the Data Engineering team, decided to kick off the project with a multi-instance approach, mainly because of the Kerberos issue, but we do not rule out switching to a single instance in the future. All WMF Airflow instances are set up by the same Puppet configuration, so even though we provide multiple instances, they all have the same stack (see: https://github.com/wikimedia/puppet/tree/production/modules/airflow).

List of instances

analytics

Airflow instance owned by the Data / Analytics engineering team. Contains all production jobs historically developed by the team.

Host: an-launcher1002.eqiad.wmnet
Service user: analytics
Web UI port: 8600
Web UI access: run ssh -t -N -L8600:127.0.0.1:8600 an-launcher1002.eqiad.wmnet, then open http://localhost:8600
DAGs: airflow-dags/analytics/dags
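
Once the SSH tunnel for an instance is running, it can be handy to confirm that the forwarded port accepts connections before opening the browser. The following is a minimal sketch using only the Python standard library; the function name web_ui_reachable is hypothetical and not part of any WMF tooling, and the default port 8600 is the Web UI port listed for the instances on this page. It only checks that something is listening on the local end of the tunnel, not that the Airflow web server behind it is healthy.

```python
import socket


def web_ui_reachable(host="127.0.0.1", port=8600, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it assumes the SSH tunnel has already been opened
    and forwards local port 8600 to the instance's Web UI.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if web_ui_reachable():
        print("Tunnel is up: open http://localhost:8600")
    else:
        print("Nothing is listening on localhost:8600; is the SSH tunnel running?")
```

The same check applies to the other instances listed on this page, since they all expose the Web UI on port 8600.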

analytics_test

Airflow instance owned by the Data / Analytics engineering team. Contains some jobs analogous to the ones in the analytics instance, which exist to create data flows in the Data Engineering test cluster.

Host: an-test-client1001.eqiad.wmnet
Service user: analytics
Web UI port: 8600
Web UI access: run ssh -t -N -L8600:127.0.0.1:8600 an-test-client1001.eqiad.wmnet, then open http://localhost:8600
DAGs: airflow-dags/analytics_test/dags

research

Airflow instance owned by the Research team.

Host: an-airflow1002.eqiad.wmnet
Service user: analytics-research
Web UI port: 8600
Web UI access: run ssh -t -N -L8600:127.0.0.1:8600 an-airflow1002.eqiad.wmnet, then open http://localhost:8600
DAGs: airflow-dags/research/dags

platform_eng

Airflow instance owned by the Platform Engineering team.

Host: an-airflow1003.eqiad.wmnet
Service user: analytics-platform-eng
Web UI port: 8600
Web UI access: run ssh -t -N -L8600:127.0.0.1:8600 an-airflow1003.eqiad.wmnet, then open http://localhost:8600
DAGs: /srv/airflow-platform_eng/dags
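
The Web UI access commands above all follow the same pattern, so they can be generated from a small lookup table. The sketch below is illustrative only: the hosts, service users and port are copied from the list above, while the INSTANCES dictionary, the function names and the optional local_port parameter are assumptions made for this example and do not correspond to any existing WMF tool.

```python
import shlex
from typing import List, Optional

# Values copied from the instance list on this page. The dictionary layout
# itself is hypothetical.
INSTANCES = {
    "analytics": {
        "host": "an-launcher1002.eqiad.wmnet",
        "service_user": "analytics",
        "port": 8600,
    },
    "analytics_test": {
        "host": "an-test-client1001.eqiad.wmnet",
        "service_user": "analytics",
        "port": 8600,
    },
    "research": {
        "host": "an-airflow1002.eqiad.wmnet",
        "service_user": "analytics-research",
        "port": 8600,
    },
    "platform_eng": {
        "host": "an-airflow1003.eqiad.wmnet",
        "service_user": "analytics-platform-eng",
        "port": 8600,
    },
}


def tunnel_command(instance: str, local_port: Optional[int] = None) -> List[str]:
    """Build the ssh argument list that forwards the instance's Web UI port.

    local_port defaults to the remote port, which reproduces the commands
    listed on this page.
    """
    cfg = INSTANCES[instance]
    remote_port = cfg["port"]
    if local_port is None:
        local_port = remote_port
    return [
        "ssh", "-t", "-N",
        f"-L{local_port}:127.0.0.1:{remote_port}",
        cfg["host"],
    ]


def local_url(instance: str, local_port: Optional[int] = None) -> str:
    """Return the URL to open in a browser once the tunnel is running."""
    port = INSTANCES[instance]["port"] if local_port is None else local_port
    return f"http://localhost:{port}"


if __name__ == "__main__":
    for name in INSTANCES:
        print(f"{name}: {shlex.join(tunnel_command(name))} -> {local_url(name)}")
```

Because every instance serves its Web UI on port 8600, two tunnels cannot both bind local port 8600 at the same time. Passing a different local_port, for example tunnel_command("research", local_port=8601), forwards a second instance to another local port.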