With more and more companies considering themselves "data-driven", and with the vast amounts of "big data" in use, data pipelines or workflows have become an integral part of data engineering.
Teams have outgrown hacked-together cron jobs and bash scripts, and there are now several data pipeline frameworks that make it easier to schedule, debug, and retry jobs. These jobs are typically long-running (several hours to several days) and process several billion rows of data, for example as part of an ETL process, MapReduce jobs, or a data migration.
A client I consult for is considering building its own data pipeline framework to handle sensor / electric meter data. A first step was to review the currently available data pipeline frameworks, of which there are many.
In this post, I review three of these: Airflow, Luigi, and Pinball.
These were chosen as they most closely matched my client's requirements, namely:
- Python-based, which ruled out Scala- and Java-based frameworks. I would have loved to look at Mario or Suro if this were not a consideration.
- Emphasis on code to configure / run pipelines, which ruled out XML / configuration-heavy frameworks.
- Reasonably general purpose (this ruled out frameworks specialising in scientific / statistical analysis).
- Reasonably popular, with a growing community.
Admittedly, I was only able to spend about a day looking at the source code for each of these, and I have not used them in a production environment, so please correct me if I have misunderstood anything!
It can be mildly confusing reading about these 3 frameworks since they use slightly different terminology to refer to similar concepts.

| Concept | Airflow | Luigi | Pinball |
| --- | --- | --- | --- |
| Collection of work to be done (I refer to this as the data pipeline) | DAG (Directed Acyclic Graph) | Not really supported; Tasks are grouped together into a DAG to be run, and most of the code treats Tasks as the main unit of work | Workflow |
| Main unit of work; refers not to an individual unit of data but to a batch of these, e.g. TopArtistsJob rather than Artist A, Artist B, etc. | Jobs | Tasks | Tokens |
| Class processing the main unit of work | Operators | Tasks / Workers | Jobs / Workers |

As for how a pipeline is defined:

| | Airflow | Luigi | Pinball |
| --- | --- | --- | --- |
| Defining a pipeline | Create a Python class which imports existing Operator classes. Ships with numerous Operators, so a DAG can be constructed more dynamically with existing Operators (see the sketch below). | Requires subclassing one of the small number of Tasks; not as dynamic. | Create a config dictionary with jobs and schedules. |
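To make the Airflow approach concrete, here is a minimal sketch of a pipeline built from shipped Operators. This is my own illustration rather than anything from my client's project: the DAG id, schedule, and commands are made up, and import paths vary between Airflow versions.

```python
# Minimal, illustrative Airflow pipeline definition (not production code).
# The DAG id, schedule, and bash commands below are placeholders, and the
# BashOperator import path differs between Airflow versions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='sensor_data_pipeline',       # hypothetical pipeline name
    start_date=datetime(2016, 1, 1),
    schedule_interval='@daily',
)

extract = BashOperator(
    task_id='extract_readings',
    bash_command='echo "pull meter readings"',     # placeholder command
    dag=dag,
)

load = BashOperator(
    task_id='load_readings',
    bash_command='echo "load into the warehouse"', # placeholder command
    dag=dag,
)

# load_readings runs only after extract_readings has succeeded
load.set_upstream(extract)
```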
All 3 frameworks ship with an admin panel of sorts which shows the status of your workflow / tasks (ie the metadata).

| | Airflow | Luigi | Pinball |
| --- | --- | --- | --- |
| UI | Comprehensive, with multiple screens. Built with Flask. | Detailed, looks like Sidekiq. | Built with Django. |
| Metadata / Job status | Job status is stored in a database. Operators mark jobs as passed / failed. The last_updated field is refreshed frequently with a heartbeat function, and kill_zombies() is called to clear all jobs with older heartbeats. | Task status is stored in a database. Similar to Airflow, but with fewer details. | Workers 'claim' messages from the queue with an ownership timestamp on the message, and this lease claim gets renewed frequently. Messages with older lease claims are requeued, while successfully processed messages are archived to the S3 file system using Secor. Job status is stored in a database. (A sketch of this lease pattern follows below.) |
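I have not dug into Pinball's internals beyond this, but the ownership / lease mechanics described above boil down to something like the following sketch. All of the class, function, and field names here are my own invention, not Pinball's API.

```python
# Illustrative sketch of the ownership / lease pattern described above.
# This is NOT Pinball's real code; every name here is invented.
import time

LEASE_SECONDS = 60

class Message:
    def __init__(self, payload):
        self.payload = payload
        self.owner = None
        self.lease_expires = 0.0

def claim(message, worker_id):
    """A worker claims a message by stamping it with an ownership lease."""
    message.owner = worker_id
    message.lease_expires = time.time() + LEASE_SECONDS

def renew_lease(message):
    """The owning worker renews its claim periodically, like a heartbeat."""
    message.lease_expires = time.time() + LEASE_SECONDS

def requeue_stale(queue, pending):
    """Messages whose lease has expired are assumed orphaned and requeued."""
    for message in list(pending):
        if message.lease_expires < time.time():
            message.owner = None
            pending.remove(message)
            queue.append(message)
```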
Read more about using multiprocessing vs multithreading in Python; a toy comparison of the two approaches is included after the table below.

| | Airflow | Luigi | Pinball |
| --- | --- | --- | --- |
| Scaling | DAGs can be constructed with multiple Operators; scale out by adding Celery workers. | Create multiple Tasks. | Add Workers. |
| Parallel Execution | Subprocess | Subprocess | Threading |
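The subprocess vs. threading distinction matters because of Python's GIL: threads within a single process will not run CPU-bound work in parallel, whereas separate processes will. The toy comparison below is my own illustration (using the multiprocessing module to stand in for "separate processes"), not code from any of the three frameworks.

```python
# Toy comparison of the two parallel-execution styles (not framework code).
# CPU-bound work in threads is serialised by the GIL; separate processes
# side-step it at the cost of extra memory and inter-process communication.
import multiprocessing
import threading

def run_task(task_id):
    # Stand-in for a CPU-bound unit of pipeline work.
    total = sum(i * i for i in range(10**6))
    print(task_id, total)

def run_with_threads(task_ids):
    threads = [threading.Thread(target=run_task, args=(t,)) for t in task_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_with_processes(task_ids):
    procs = [multiprocessing.Process(target=run_task, args=(t,)) for t in task_ids]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    run_with_threads(['thread-a', 'thread-b'])
    run_with_processes(['process-a', 'process-b'])
```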
The next point of comparison is dependencies, i.e. not processing a job / task unless one or more conditions have been met first.
Airflow: Operators can be constructed with a depends_on_past parameter (a rough sketch of this is included at the end of this section).

Luigi: Tasks can be constructed with a requires() method:

    class ArtistToplistToDatabase(luigi.postgres.CopyToTable):
        def requires(self):
            return Top10Artists(self.date_interval, self.use_hadoop)

Pinball: Jobs can require other jobs to finish first before starting, e.g. child_job requires parent_job:

    WORKFLOWS = {
        'example_workflow': WorkflowConfig(
            jobs={
                'parent_job': JobConfig(SomeJob, []),
                'child_job': JobConfig(AnotherJob, ['parent_job'])
            },
            final_job_config=....,
            notify_emails=....
        )
    }

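For comparison with the Luigi and Pinball snippets above, here is roughly what the Airflow side looks like. The task ids and commands are invented; note that depends_on_past only gates a task on its own previous scheduled run, while upstream dependencies are declared separately (e.g. with set_upstream).

```python
# Rough, illustrative Airflow equivalent; task ids and commands are invented.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('dependency_example', start_date=datetime(2016, 1, 1))

parent_job = BashOperator(
    task_id='parent_job',
    bash_command='echo parent',
    dag=dag,
)

child_job = BashOperator(
    task_id='child_job',
    bash_command='echo child',
    depends_on_past=True,   # also wait for this task's own previous run to succeed
    dag=dag,
)

# child_job requires parent_job to finish first
child_job.set_upstream(parent_job)
```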
Here are my perceived disadvantages of each framework compared to my client's requirements:
Airflow:
- No Kafka support; uses Celery (RabbitMQ, Redis).
- Seems more suitable for scheduled batch jobs than for streaming data. In particular, updating the metadata DB for streaming data may have a performance impact.
- Possibly too full-featured / overkill.

Luigi:
- Metadata and UI are not as useful as Airflow's.
- Relatively small number of Tasks; requires writing subclasses for most of our requirements.
- Possibly not suitable for streaming data, with the same performance concern as Airflow.

Pinball:
- Uses threading.
- Possibly not suitable for streaming data, with the same performance concern as Airflow.
If you have read this far, you may find the following links useful too: