With more and more companies considering themselves "data-driven", and with vast amounts of "big data" in use, data pipelines or workflows have become an integral part of data engineering.
Having outgrown hacked-together cron jobs and bash scripts, teams now have several data pipeline frameworks that make it easier to schedule, debug, and retry jobs. These jobs are typically long-running (several hours to several days) and process several billion rows of data, for example as part of an ETL process, MapReduce jobs, or a data migration.
Why Airflow, Luigi, and Pinball?
A client I consult for is considering building its own data pipeline framework to handle sensor / electric meter data. A first step was to review the data pipeline frameworks currently available, of which there are many.
In this post, I review three of these: Airflow, Luigi, and Pinball.
These were chosen as they most closely matched my client's requirements, namely:
- Python-based, which ruled out Scala- and Java-based frameworks. I would have loved to look at Mario or Suro if this were not a consideration.
- An emphasis on code to configure / run pipelines, which ruled out XML- / configuration-heavy frameworks.
- Reasonably general-purpose, which ruled out frameworks specialising in scientific / statistical analysis.
- A reasonably popular / growing community.
Admittedly I was only able to spend about a day looking at the source code of each and have not used them in a production environment, so please correct me if I have misunderstood anything!
Terminology
It can be mildly confusing to read about these three frameworks, since they use slightly different terminology for similar concepts.
| | Airflow | Luigi | Pinball |
|---|---|---|---|
| Collection of work to be done (I refer to this as the data pipeline) | DAG (Directed Acyclic Graph) | Not really supported; Tasks are grouped together into a DAG to be run, and most of the code treats Tasks as the main unit of work | Workflow |
| Main unit of work; this does not refer to an individual unit of data but rather a batch of these, e.g. TopArtistsJob rather than Artist A, Artist B, etc. | Jobs | Tasks | Tokens |
| Class processing the main unit of work | Operators | Tasks / Workers | Jobs / Workers |
Configuring a pipeline
[Side-by-side snippets showing how a pipeline is configured in Airflow, Luigi, and Pinball appeared here in the original post.]
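To give a flavour of the differences, here are minimal sketches of a two-step extract-and-load pipeline in Airflow and Luigi. These are my own illustrative examples based on each project's documented API at the time of writing; the dag / task / class names, schedule, and file names are all made up.

In Airflow, a pipeline is a DAG of operator instances, and dependencies are wired up explicitly between tasks:

```python
# A minimal Airflow DAG (classic 1.x API; module paths have since moved).
# The dag_id, task_ids, and bash commands are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-team",
    "start_date": datetime(2015, 1, 1),
    "retries": 1,                          # retry a failed task once
    "retry_delay": timedelta(minutes=5),
}

dag = DAG("meter_data_etl", default_args=default_args,
          schedule_interval="@daily")

extract = BashOperator(task_id="extract", bash_command="echo extracting",
                       dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# load runs only once extract has succeeded
load.set_upstream(extract)
```

In Luigi, each Task declares its own upstream dependencies via requires(), and its output() targets double as completion markers: a task whose outputs already exist is considered done and skipped.

```python
# A minimal Luigi pipeline; class and file names are illustrative only.
import luigi


class Extract(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("extract_%s.csv" % self.date)

    def run(self):
        with self.output().open("w") as f:
            f.write("raw meter readings\n")


class Load(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return Extract(date=self.date)  # upstream dependency

    def output(self):
        return luigi.LocalTarget("load_%s.done" % self.date)

    def run(self):
        with self.input().open("r") as f:
            data = f.read()
        with self.output().open("w") as f:
            f.write("loaded %d bytes\n" % len(data))


if __name__ == "__main__":
    # e.g. python pipeline.py Load --date 2015-01-01 --local-scheduler
    luigi.run()
```

Pinball takes a more declarative approach even within Python: going by the tutorial in its README, workflows are declared as a dict mapping workflow names to workflow configs, where each job config names its parent jobs and the workflow carries its own schedule; see Pinball's README for the exact classes, as I am less sure of those from memory.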
User Interface and Metadata
All three frameworks ship with an admin panel of sorts that shows the status of your workflows / tasks (i.e. the metadata).
[A table comparing the three frameworks' UIs (with screenshots) and how each stores metadata / job status appeared here in the original post.]
Scaling
You can read more about using multiprocessing vs multithreading in Python; a framework-independent toy illustration follows the comparison below.
[A table comparing how the three frameworks scale and handle parallel execution appeared here in the original post.]
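On the multiprocessing point: for CPU-bound batch work, CPython's GIL means threads do not give real parallelism, which is why these frameworks lean on multiple processes (e.g. Luigi's --workers flag) or distributed workers (e.g. Airflow's Celery-backed executor). A toy illustration, independent of any framework:

```python
# Toy illustration (not framework code): CPU-bound work scales with
# processes rather than threads because of CPython's GIL.
import time
from multiprocessing import Pool


def process_partition(partition_id):
    # Stand-in for a CPU-heavy batch step, e.g. aggregating meter readings.
    total = sum(i * i for i in range(2000000))
    return partition_id, total


if __name__ == "__main__":
    start = time.time()
    with Pool(processes=4) as pool:  # roughly one process per CPU core
        results = pool.map(process_partition, range(8))
    print("processed %d partitions in %.1fs"
          % (len(results), time.time() - start))
```

Swapping the Pool for a thread pool here would give little or no speed-up on this workload.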
Dependency Management
This refers to running a job / task only once one or more upstream conditions have been met.
[A table comparing the three frameworks' approaches to dependency management appeared here in the original post.]
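As one concrete example, Airflow models "wait until a condition holds" as sensors: operators that poll until their condition becomes true and block downstream tasks until then. A minimal sketch, assuming the classic 1.x module paths (sensors have since been reorganised); the dag / task ids are illustrative and refer back to the earlier example:

```python
# Wait for a task in another DAG to succeed before running our own task.
# Classic Airflow 1.x import paths; ids are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import ExternalTaskSensor

dag = DAG("reporting", start_date=datetime(2015, 1, 1),
          schedule_interval="@daily")

# Poll until the 'load' task of the 'meter_data_etl' DAG has succeeded
# for the same schedule date.
wait_for_load = ExternalTaskSensor(
    task_id="wait_for_load",
    external_dag_id="meter_data_etl",
    external_task_id="load",
    dag=dag,
)

build_report = BashOperator(task_id="build_report",
                            bash_command="echo building report", dag=dag)

build_report.set_upstream(wait_for_load)
```

Luigi expresses the same idea through requires() and the existence of output targets (see the earlier example), while Pinball does so by listing parent jobs in the workflow config.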
Disadvantages
Here are what I see as each framework's disadvantages, measured against my client's requirements:
[A table listing each framework's disadvantages appeared here in the original post.]
Further Reading
If you have read this far, you may also find the following links useful:
- Creator of Luigi compares with other frameworks
- Comparing Luigi, Azkaban, and Oozie
- Comparing workflows with different criteria
- Slideshare on how Luigi developed over time