With more and more companies considering themselves "data-driven" and with vast amounts of "big data" in use, data pipelines or workflows have become an integral part of data engineering.
Now that teams have outgrown hacked-together cron jobs and bash scripts, several data pipeline frameworks make it easier to schedule, debug, and retry jobs. These jobs are typically long-running (several hours to several days) and process several billion rows of data, for example as part of an ETL process, MapReduce jobs, or a data migration.
A client I consult for is considering building its own data pipeline framework to handle sensor / electric meter data. A first step towards this was reviewing the currently available data pipeline frameworks, of which there are many.
In this post, I review 3 of these: Airflow, Luigi, and Pinball.
These were chosen as they most closely matched my client's requirements, namely:
Admittedly, I was only able to spend about a day looking at the source code for these frameworks and have not used any of them in a production environment, so please correct me if I have misunderstood anything!
It can be mildly confusing reading about these 3 frameworks since they use slightly different terminology to refer to similar concepts.
| Concept | Airflow | Luigi | Pinball |
|---|---|---|---|
| Collection of work to be done (I refer to this as the data pipeline) | DAG (Directed Acyclic Graph) | Not really supported; Tasks are grouped together into a DAG to be run. Most of the code treats Tasks as the main unit of work. | Workflow |
| Main unit of work. Does not refer to an individual unit of data, but rather a batch of these, e.g. TopArtistsJob rather than Artist A, Artist B, etc. | Jobs | Tasks | Tokens |
| Class processing the main unit of work | Operators | Tasks / Workers | Jobs / Workers |
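To make the shared concepts a little more concrete, here is a minimal plain-Python sketch of the three roles in the table above. It does not use any of the three frameworks; the `Pipeline` and `Task` classes, the `count_plays` function, and the sample data are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """Main unit of work: a whole batch job (e.g. TopArtistsJob), not one record."""
    name: str
    # Stand-in for the "class processing the main unit of work" row;
    # a plain function is used here to keep the sketch short.
    run: Callable[[List[dict]], int]


@dataclass
class Pipeline:
    """Collection of work to be done (a DAG in Airflow, a Workflow in Pinball)."""
    name: str
    tasks: List[Task] = field(default_factory=list)


def count_plays(rows: List[dict]) -> int:
    """Process the whole batch of rows at once and return the total play count."""
    return sum(row["plays"] for row in rows)


# Hypothetical pipeline with a single batch-level task.
pipeline = Pipeline("daily_music_stats", [Task("TopArtistsJob", count_plays)])

# The task receives the batch (Artist A, Artist B, ...), not one artist at a time.
batch = [{"artist": "A", "plays": 3}, {"artist": "B", "plays": 5}]
results = {task.name: task.run(batch) for task in pipeline.tasks}
```

The point is only that the unit of work is the batch-level job, while the individual rows are just its input.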
| Airflow | Luigi | Pinball |
|---|---|---|
| | | |
All 3 frameworks ship with an admin panel of sorts which shows the status of your workflow / tasks (i.e. the metadata).
| | Airflow | Luigi | Pinball |
|---|---|---|---|
| UI | | | |
| Metadata / Job status | | | |
Read more about using multiprocessing vs multithreading in Python.
| | Airflow | Luigi | Pinball |
|---|---|---|---|
| Scaling | | | |
| Parallel Execution | | | |
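As a rough illustration of what parallel execution of independent batches looks like in plain Python, here is a small sketch using the standard library's `concurrent.futures`. It is not how any of the three frameworks implement their workers; `process_batch`, `run_parallel`, and the sample batches are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def process_batch(batch: List[int]) -> int:
    """Stand-in for one unit of work; here it just sums the batch."""
    return sum(batch)


def run_parallel(batches: List[List[int]], max_workers: int = 3) -> List[int]:
    """Run process_batch over all batches using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input batches.
        return list(pool.map(process_batch, batches))


totals = run_parallel([[1, 2, 3], [4, 5], [6]])
```

Threads suit I/O-bound work such as waiting on a database or an API. For CPU-bound work in CPython, `ProcessPoolExecutor` offers the same `map` interface and sidesteps the global interpreter lock, at the cost of pickling inputs and outputs between processes.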
This refers to not processing a job / task unless one or more conditions have been met first.
| Airflow | Luigi | Pinball |
|---|---|---|
| | | |
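The idea of not running a task until its conditions are met can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions, not the mechanism used by Airflow, Luigi, or Pinball: the task names, the `run_pipeline` helper, and the "skipped" status are hypothetical, and ordering is delegated to the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_pipeline(deps: Dict[str, Set[str]],
                 actions: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run each task only after all of its upstream tasks have succeeded."""
    status: Dict[str, str] = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(deps).static_order():
        upstream = deps.get(name, set())
        if any(status[up] != "success" for up in upstream):
            # Condition not met: an upstream task failed or was skipped.
            status[name] = "skipped"
            continue
        try:
            actions[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def fail() -> None:
    raise RuntimeError("transform blew up")


# Hypothetical three-step pipeline: each task maps to its upstream tasks.
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
actions = {"extract": lambda: None, "transform": fail, "load": lambda: None}

status = run_pipeline(deps, actions)
```

Here `load` never runs because `transform` failed, which is the behaviour you want from dependency management: downstream work is held back rather than run against missing or partial data.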
Here are my perceived disadvantages of each framework compared to my client's requirements:
| Airflow | Luigi | Pinball |
|---|---|---|
| | | |
If you have read so far, you may find the following links useful too: