This document describes the Data Engineering team's decision process when choosing a replacement for our workflow management system.
== Overview ==
=== The problem ===
We, the Data Engineering team, identified the need to update our main scheduler/task manager system: [https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie Oozie]. Oozie is quite simple and robust, and has worked OK so far. However, it is an old piece of software that is no longer improving and has a diminishing community; it has significant limitations, like the lack of dynamic execution flows (e.g. looping over all files in a directory) that we need for refining dynamic databases; also, the developer needs to write lots of boilerplate XML code for even a simple job, which represents an entry wall for external teams; and we are not satisfied with its UI (Hue) either.


In parallel, we observed that we are maintaining other scheduling (or data-presence sensor) systems: [https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine Refine], [https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater reportupdater] and systemd timers. Refine implements the dynamic data-presence sensors that Oozie does not support. Reportupdater tries to reduce all the boilerplate code from Oozie and its learning curve, so that users who are not super technical can still create jobs easily. And in some cases, for simplicity, we're using systemd timers for jobs (deletion scripts) that could be better handled by a higher-level scheduling system that can e.g. be monitored in a UI.


=== The objective ===
We want to find a new scheduler / workflow management system that is able to replace all our current Oozie jobs and, if possible, unify all Refine jobs, all reportupdater jobs, and all systemd timers that would make sense to be handled by a higher-level system. We'd like a more modern piece of software with a growing community and good docs, that can be used by different types of users (with varying technical levels). The new system should ideally match Oozie in robustness and ease of maintenance. With such a scheduler, we would reduce the maintenance burden of our jobs, improve our control over them and make it easier for collaborators to create their own jobs. Also, a more modern system would give our job pipeline a longer life and more state-of-the-art capabilities (e.g. Kubernetes support).
 
=== Hard requirements ===
In our search for better workflow manager software, most candidates were discarded because of hard requirements, without a deeper evaluation. Here are the reasons:
* We're looking for fully open-source software. Open source is one of the pillars of the WMF, and we'll only choose a system that is open source and that we can therefore contribute back to.
* We want a full-fledged workflow orchestrator that is not focused only on e.g. automation ([https://github.com/rundeck/rundeck RunDeck], [https://github.com/StackStorm/st2 StackStorm]), data science/ML ([https://github.com/pachyderm/pachyderm Pachyderm], [https://github.com/mlflow/mlflow MLFlow]), or data integration/ETL ([https://github.com/apache/beam Beam], [https://github.com/apache/camel Camel]). The chosen solution should have the power and flexibility to handle all our use cases.
* It's crucial that the chosen software is mature enough, has an extensive community (is popular enough) and thus has good support and documentation. We'll discard projects that are not actively maintained ([https://github.com/pinterest/pinball Pinball]).
* Also, it must be able to support our current set of data pipelines (time schedules, data dependency management, alerts, etc.) and infrastructure (Hadoop + Yarn).


== Candidates ==
These are the candidates that we evaluated.
 
=== Prefect [discarded] ===
[https://www.prefect.io/ Prefect] is a "dataflow automation" system built by the company of the same name. The software seems to meet all our needs (even though we didn't do a deep study). In its early days, Prefect was a closed-source project, and in 2020 they open-sourced it. However, parts of the system (the server and the UI) are licensed under the [https://www.prefect.io/legal/prefect-community-license/ Prefect Community License], which is open source except for a couple of limitations, including using the software for PaaS and SaaS offerings. After discussing the issue with our CTO, the position of the WMF is that, for now, we should not use software with such a license. Thus, we discarded this option without further study.
 
=== Argo [discarded] ===
[https://argoproj.github.io/argo-workflows/ Argo Workflows] is a workflow orchestrator with an interesting approach that we think would also meet all our needs. However, it is strictly Kubernetes-based. If we wanted to choose Argo to replace Oozie, we'd have to move our Hadoop cluster to Kubernetes first. We don't rule out moving to Kubernetes in the future, but we don't want to make moving to Kubernetes a dependency of the Oozie replacement. We think the new workflow manager should be able to handle both Hadoop + Yarn and Kubernetes; this way we decouple both projects. Thus, we also discarded Argo without further study. Note: this reasoning also applies to Chronos, another workflow manager, based on Apache Mesos, which we also discarded.
 
=== Azkaban [discarded] ===
[https://azkaban.github.io/ Azkaban] is a batch workflow job scheduler built by LinkedIn. It mostly fulfills all our requirements, except that it only supports time-based scheduling of jobs ([https://github.com/azkaban/azkaban/issues/1526 bug]). In our data pipelines there are many jobs that depend on the presence of data, so we need jobs that are triggered by e.g. the existence of files in a given HDFS path, by the fact that an HDFS path saw updates in the last hour, or by a Hive partition being added to a table. Such data probes are really important for us, and are one of the reasons we're moving away from Oozie. So, we also discarded this option without further study.
 
=== Luigi [discarded] ===
[https://github.com/spotify/luigi Luigi] is a workflow engine created by Spotify. It's a widely used, mature project and covers most of the use cases that we need. But it lacks job scheduling; instead, it depends on Cron (or any other scheduler) for that ([https://luigi.readthedocs.io/en/stable/execution_model.html#triggering-tasks docs]). One of our main objectives in moving to another workflow manager is to unify the different schedulers that we're maintaining, including systemd timers (our Cron). Given this significant disadvantage, we did a quick check of Luigi's other properties, to see if they would compensate for it. But we found that Luigi also does not handle distributed execution of tasks ([https://luigi.readthedocs.io/en/stable/execution_model.html#workers-and-task-execution docs]). We probably wouldn't use this feature ourselves, because we'll run jobs using Yarn (at least in the near future), but other teams might want to have this option. Plus, we found that Luigi's UI is really minimal, to the point that users cannot interact with running jobs. Given all this, we decided to discard this option without further study.


=== Apache Airflow ===
[https://airflow.apache.org/ Apache Airflow] is a workflow management platform created by Airbnb that became an Apache Software Foundation project in January 2019. It is written in Python, and workflows are created via Python scripts (configuration as code). Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. DAGs can be run either on a defined schedule (e.g. hourly or daily) or based on external event triggers (e.g. a file appearing in Hive). [summarized from https://en.wikipedia.org/wiki/Apache_Airflow]
Airflow seems to meet all our needs with no critical drawbacks.


== Airflow P.O.C. ==
We did a POC to test Airflow's capability of implementing dynamic workflows, and to get a grasp of what it is like to develop and manage jobs with Airflow. The descriptions below and the comparison chart reflect what we learned during the POC. You can find the code and read more technical details at https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/597623/.


=== Python ===
In Airflow all jobs are written in Python, a programming language, as opposed to a markup language like Oozie's XML. This gives Airflow lots of power and flexibility. An example of that is dynamic workflows: e.g. one can iterate over the files in a directory at DAG definition time. Also, Python is an easy language to learn, and the majority of data analysts and researchers are already familiar with it. And finally, Python code is easier to abstract and compartmentalize, thus making DAG code way shorter than e.g. Oozie's XML, as shown in the sketch below.
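
For illustration, here is a minimal sketch of such a dynamic DAG (the DAG id, dataset names and commands are hypothetical; import paths are for Airflow 1.10, the version we used in the POC): one task is generated per dataset at DAG definition time, something Oozie's static XML workflows cannot express.

<syntaxhighlight lang="python">
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# In a real job this list could come from e.g. an HDFS listing.
DATASETS = ["webrequest", "pageview_hourly", "event_sanitized"]

dag = DAG(
    dag_id="refine_datasets_example",
    start_date=datetime(2020, 6, 1),
    schedule_interval="@hourly",
)

for dataset in DATASETS:
    # One task per dataset, created at DAG definition time.
    BashOperator(
        task_id="refine_{}".format(dataset),
        bash_command="echo refining {}".format(dataset),
        dag=dag,
    )
</syntaxhighlight>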


=== Support and extensibility ===
Airflow is probably the workflow management solution with the most extensive out-of-the-box support for related tools and technologies, e.g. Hadoop, Hive, Presto, Spark, Cassandra, Kubernetes, Druid, Sqoop, Docker, Slack, Jenkins, etc. There are many operators and sensors within the standard Airflow install (e.g. HiveToDruid), plus there are many other connectors developed by the community. Also, you can build your own custom plugins. For the POC, we built a Refine plugin, and it was pretty straightforward.
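
The actual Refine plugin lives in the Gerrit patch linked above; the following is only a hypothetical sketch of what a custom sensor looks like (the class name and behavior are made up; import paths are for Airflow 1.10), to show how little code a custom extension needs.

<syntaxhighlight lang="python">
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class SuccessFileSensor(BaseSensorOperator):
    """Waits until a _SUCCESS flag file exists under the given HDFS path."""

    @apply_defaults
    def __init__(self, hdfs_path, *args, **kwargs):
        super(SuccessFileSensor, self).__init__(*args, **kwargs)
        self.hdfs_path = hdfs_path

    def poke(self, context):
        # Called repeatedly by Airflow until it returns True.
        # A real sensor would query HDFS here; this stub just logs and succeeds.
        self.log.info("Checking for %s/_SUCCESS", self.hdfs_path)
        return True
</syntaxhighlight>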


=== Architecture ===
Airflow has 3 main components: the meta-store (a JDBC database), where it stores information and state about DAGs and tasks; the scheduler, a process that periodically reads DAGs and schedules tasks; and the web server, which serves the Airflow UI and also handles the DAG and task events that come from it. Additionally, you can configure Airflow to use Celery as a task execution engine (a 4th component). Some Airflow community members say this system of several moving pieces can be more costly to set up and monitor, especially if you're using Celery.
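
As a rough illustration (example values only, not a proposed configuration), the meta-store connection and the executor choice are the main knobs in <code>airflow.cfg</code> that determine how many of these moving pieces a deployment runs:

<syntaxhighlight lang="ini">
[core]
# Meta-store: where Airflow keeps DAG and task state (a SQLAlchemy URL).
sql_alchemy_conn = postgresql+psycopg2://airflow:XXXX@db-host.example:5432/airflow
# LocalExecutor runs tasks on the scheduler host; CeleryExecutor adds a
# message broker and a worker fleet as extra components to operate.
executor = LocalExecutor
</syntaxhighlight>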


=== DAGs: interpretation in 2 steps ===
Something to take into account is that DAG coding has its gotchas, like understanding that the code is interpreted in 2 steps: DAG definition time and task execution time. A DAG file is re-interpreted every N seconds by Airflow, so that dynamic DAGs are updated accordingly, and changes to DAG code are picked up as well; but no tasks are run at that moment. When a task within a DAG fulfills its execution conditions and is triggered, then some parts of the DAG spec code are used to execute that task. Note that both the DAG spec and the task spec belong to the same code file. So there's still a learning curve for people who have never written an Airflow DAG.
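
A small sketch of the two steps (the DAG id and task are hypothetical; import paths are for Airflow 1.10): the top-level code runs every time the scheduler re-parses the file, while the Python callable runs only when a task instance is actually triggered.

<syntaxhighlight lang="python">
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# DAG definition time: this top-level code runs every time the scheduler
# re-parses the file (every N seconds), so it should stay cheap.
print("parsed at DAG definition time")

dag = DAG(
    dag_id="two_step_example",
    start_date=datetime(2020, 6, 1),
    schedule_interval="@daily",
)


def do_work(**context):
    # Task execution time: this body runs only when a task instance triggers.
    print("running for execution date {}".format(context["ds"]))


PythonOperator(
    task_id="do_work",
    python_callable=do_work,
    provide_context=True,  # Airflow 1.10: pass the templated context as kwargs
    dag=dag,
)
</syntaxhighlight>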


=== Documentation and community ===
Apache Airflow has decent documentation. It is not super detailed though (as of 2020-06-23), so sometimes you can find yourself checking several sources to confirm what you want to do, or even browsing Airflow code to be sure what a parameter does. Airflow seems to have the biggest community among open-source workflow management options. There are lots of questions answered online, case studies in blogs, etc. During the POC we found that some of our questions were only partially answered. Maybe it's still a young Apache project, and it will improve in the near future.


=== UI ===
Airflow's UI has a lot of diagrams, icons and buttons, and it can be intimidating at first. But it gives you decent control over your config, DAGs and tasks. You can do things like: switching DAGs on/off, killing/re-running/clearing/marking-success-of tasks or task sequences, looking at (driver) logs, getting the state of each task in a live execution diagram, checking the rendered final parameters of a task, etc. The main improvement versus Hue is probably usability, visual aids and the level of detail displayed about DAGs and tasks (the actual things you can do are mostly equivalent). Finally, a big con of the Airflow UI: it does not auto-refresh, and it's a bit clunky when manually refreshing (e.g. it takes a while until a running task displays as running in the UI).


=== Kerberos ===
Airflow comes with Kerberos support out of the box. But so far it seems it only supports one user. So if we wanted to use Airflow with both the analytics user and the hdfs user (or the product-analytics user), that could potentially be a problem. Or if you want to test a new job with your own user to make sure you don't overwrite production files, that would also be an issue (unless we set up a separate Airflow testing instance). This limitation is significant and should be addressed if we choose to use Airflow.
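
The single-user nature shows in the configuration: Airflow's Kerberos support is driven by one principal/keytab pair in <code>airflow.cfg</code>, kept fresh by the <code>airflow kerberos</code> ticket renewer. An illustrative excerpt (example values, not our actual configuration):

<syntaxhighlight lang="ini">
[kerberos]
# One principal and keytab serve the whole Airflow instance; there is no
# per-DAG or per-user principal, hence the single-user limitation above.
principal = analytics/airflow-host.example@WIKIMEDIA
keytab = /etc/security/keytabs/analytics.keytab
ccache = /tmp/airflow_krb5_ccache
reinit_frequency = 3600
kinit_path = kinit
</syntaxhighlight>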


=== PyArrow concurrency issues ===
During the POC we had problems with dynamic DAGs using PyArrow to access HDFS. Some tasks would get stuck in the running state right after calling an HDFS ls operation using the PyArrow library. It seems there's a mutex deadlock when combining parallel tasks with a PyArrow HDFS client. After lots of troubleshooting we decided to move on and write this issue down here as a note to be considered. There are still ways to work around this issue, like trying the Celery executor, or using the command line to query HDFS (a poor option). But it's definitely something else to solve if we decide to use Airflow.
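
As a sketch of the command-line workaround mentioned above (a poor but deadlock-free option, with a hypothetical path), an HDFS listing can be obtained by shelling out to <code>hdfs dfs -ls</code> instead of using a PyArrow HDFS client inside parallel tasks:

<syntaxhighlight lang="python">
import subprocess


def hdfs_ls(path):
    """Return the entries under `path`, using the hdfs CLI instead of PyArrow."""
    output = subprocess.check_output(["hdfs", "dfs", "-ls", path])
    entries = []
    for line in output.decode("utf-8").splitlines():
        fields = line.split()
        # `hdfs dfs -ls` prints 8 columns per entry; the last one is the path.
        if len(fields) == 8:
            entries.append(fields[-1])
    return entries


if __name__ == "__main__":
    print(hdfs_ls("/wmf/data/example"))
</syntaxhighlight>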


=== UPDATE: Airflow 2.0 ===
Since we did our Airflow POC, Airflow has released its 2.0 version, which includes significant improvements on various fronts, including scheduler performance and a revamp of the UI; moreover, the PyArrow concurrency issues described above are resolved in the new version.
 
== Oozie vs Airflow ==
{| class="wikitable"
{| class="wikitable"
|-
|-
! Aspect !! Oozie !! Airflow
! Aspect !! Oozie !! Airflow
|-
|-
| Can execute scheduled jobs
| Can schedule jobs timely
| {{check mark}}
| {{check mark}}
| {{check mark}}
| {{check mark}}
|-
|-
| Can execute data sensor jobs
| Can execute on data dependencies
| {{check mark}}
| {{check mark}}
| {{check mark}}
|-
| Can execute on custom data probes
| {{cross|15}}
| {{check mark}}
| {{check mark}}
|-
|-
| Can execute dynamic jobs (Refine)
| Can execute dynamic jobs (Refine)
| {{cross|15}}
| [?] Oozie 5.1 suports Java dynamic workflows
| [?]{{check mark}} The POC works in sequential mode, but has (hopefully solvable) concurrency issues with PyArrow.
| {{check mark}}
|-
|-
| Supports modern tools
| Supports modern tools
Line 68: Line 101:
| code usability/maintainability
| code usability/maintainability
| [?] boilerplate XML.
| [?] boilerplate XML.
| {{check mark}} shorter code, way less boilerplate.
| {{check mark}}[?] shorter code, way less boilerplate, complexity could be a barrier for less technical users
|-
|-
| UI
| UI
| [?] Not satisfied with it
| [?] Not satisfied with it
| {{check mark}} Better, but with some caveats, still not great!
| {{check mark}} 2.0 UI is better than former
|-
|-
| Jobs testing and deployment
| Jobs testing and deployment
Line 80: Line 113:
| Security
| Security
| {{check mark}} Flexible Kerberos support.
| {{check mark}} Flexible Kerberos support.
| [?] Kerberos support, but for single user?
| [?] Kerberos support, but multitenancy not supported yet
|-
|-
| Community and documentation
| Community and documentation
| {{cross|15}} Very small community and limited docs.
| {{cross|15}} Small and diminishing community.
| {{check mark}} Biggest community and OK docs.
| {{check mark}} Biggest community and OK docs.
|-
|-
| Future of the project
| Future of the project
| {{cross|15}} In disuse.
| {{cross|15}} In disuse.
| {{check mark}} Recently Apache project.
| {{check mark}} Recently Apache project, 2.0 version is promising.
|}
|}


== Conclusion ==
It seems to me that Airflow will be able to do all we need it to do (Refine, regular schedules, data-presence jobs, better maintainability, better UI, unifying all our workflows, etc.). And Airflow will probably be maintained for a long time and have a nice community. But it also seems that until we get there, we'll have to solve a set of non-trivial issues (concurrency, Kerberos, proper setup, and potentially more...), in addition to the actual migration of all jobs and Airflow's learning curve. It is a bit discouraging, and I think we should consider other options before we decide. If other options aren't better, then let's use Airflow. [[User:Mforns|Mforns]] ([[User talk:Mforns|talk]]) 13:40, 23 June 2020 (UTC)

=== Update ===
Since the release of Airflow 2.0, some of the identified drawbacks have been reduced: the PyArrow concurrency problem is solved, the scheduler's scalability has been improved, and the UI has been revamped. In addition, it seems the 2.0 release has been well received by the community, which is more active than ever.
