Airflow: The Missing Context
There are a number of ways to do data workflows:
- Custom scripts
- Cron, running each discrete task in a workflow based on timing and run-time of each task
- Jenkins, modeling each task as a job, and using job dependencies
- Oozie, Luigi, and other tailor-made workflow management systems born into the Hadoop ecosystem
- Airflow
We're going to take a look at some of these approaches and end up with a discussion of why Airflow is the better choice.
Our use case
Let's build a price tracking platform which tracks a popular ecommerce website, let's say eBay's fashion section, so that we can cross-reference products there with other fashion sites like Zappos and Zara. Of course, we're talking "in theory" here, because crawling websites that don't allow it is against their TOS. Moving on, here are the high-level steps.
Deep crawl over one or more stores. In this step we're crawling one or more websites. We can crawl and dump everything we see, or scrape most of the data we need, dumping semi-structured and semi-clean data to disk.
Cleansing, normalization, and structured data extraction. In this step we'll get rid of blank, null, corrupted, and redundant data points. We'll try our best to normalize URLs, domains, and the like to their canonical form, potentially discovering duplicate records.
Enrichment. Although we're getting data from the source, there's still auxiliary data such as currency rates, inventory, blacklists, and so on that we need for enriching our existing data. This kind of data can be had by running more crawls, fetching from APIs, or joining against a database.
Merge (match enrichment to items). The previous steps left us with data points that are clean and that we can join; perhaps we need to compute a currency conversion on an item based on an exchange rate we've fetched, or we need to fill in a missing field in some items from another, auxiliary sub-crawl that we ran as part of the enrichment step.
Aggregation and trend tracking. In this step we care about deduplicating data and figuring out the identity of items, so that for every item we can cross-reference it with previous runs and see if that item really changed. If it did, we'd output a change record, which allows us to model change tracking and then detect trends in each item individually, based on the rate and properties of that change.
As a workflow, we can execute these steps linearly as listed, but because a good number of the steps are not dependent on one another, they form a graph that lends itself to optimization. For example, we can run the main crawl while running the enrichment crawls, and continue when both have finished.
Workflows with Make
Assuming we want to track fashion items on eBay, here's a clunky workflow with `make`:
```make
crawl-main:
	scrapy crawl ebay_wear

crawl-enrich: crawl-main
	RECENCY_ONLY=1 IN_7=1 scrapy crawl zara
	RECENCY_ONLY=1 IN_30=1 scrapy crawl h_n_m
	RECENCY_ONLY=1 IN_90=1 scrapy crawl zappos

agg-last: crawl-enrich
	python process_results.py
```
This (dangerously) works for the happy path (and sad developer): there's no failure recovery, no operational story at all, and maintainability is horrible. It's run-and-pray, then troubleshoot by hand.
Some tasks are verbatim in the Makefile, such as the `crawl-` tasks, with knobs appearing as environment variables. Other tasks are more cryptic, like `agg-last`, where we shell out to Python because whatever we tried doing is too complex to stay within `make`'s realm.
And we run this by saying:
```sh
JOB_ID=run-11 make agg-last
```
When we run `make` in this way, we synthesize the concept of a unit of work, so that all spawned processes are aware of it and build their artifacts under it. By saying `JOB_ID=run-11`, each of the processes must now implicitly be aware of it and generate output under an agreed-upon `run-11` folder:
```
/out
  /run-10
    ebay_wear.csv
    zara.csv
    h_n_m.csv
    zappos.csv
    merged.csv
    agged.csv
  /run-11
    ebay_wear.csv
    zara.csv
    h_n_m.csv
    zappos.csv
    merged.csv
    agged.csv
```
But what was the core idea in going for `make` in the first place? `make` does know how to execute a model of a tree of dependencies, and it's already proven for highly complex builds with tons of dependencies. Heck, we can also play with making sentinel `crawl-main`, `crawl-enrich`, and `agg-last` files to have `make` skip tasks that we have already run successfully.
Bottom line: the intuition might be there, but the impedance mismatch between a build system and a data workflow management system has to result in a clunky solution sporting bad ergonomics, one that is hard to maintain and troubleshoot.
Workflows with Code
Here's how some of the workflow would look in pure code. The main takeaways are that we manually generate a job ID and, around it, a working directory for that job. We then have no choice but to build our own operators, such as `crawl`, so that the workflow is somewhat readable. There's still plenty that is repetitive and subpar.
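A rough sketch of such a script, assuming the same spiders and output layout as the Makefile above (the helper names here are mine, not a fixed API):

```python
import os
import subprocess
import uuid


def crawl(spider, job_id, **env):
    """Run a spider; the spider is expected to write its output under out/<job_id>/."""
    cmd_env = dict(os.environ, JOB_ID=job_id,
                   **{k: str(v) for k, v in env.items()})
    subprocess.check_call(['scrapy', 'crawl', spider], env=cmd_env)


def main():
    # synthesize a unit of work: a job ID and a working directory for it
    job_id = 'run-{}'.format(uuid.uuid4().hex[:6])
    os.makedirs(os.path.join('out', job_id))

    crawl('ebay_wear', job_id)
    crawl('zara', job_id, RECENCY_ONLY=1, IN_7=1)
    crawl('h_n_m', job_id, RECENCY_ONLY=1, IN_30=1)
    crawl('zappos', job_id, RECENCY_ONLY=1, IN_90=1)

    # aggregation step, aware of the same job ID via the environment
    subprocess.check_call(['python', 'process_results.py'],
                          env=dict(os.environ, JOB_ID=job_id))


if __name__ == '__main__':
    main()
```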
This marks a move away from our declarative attempt with `make` and into a programmatic attempt with a Python script. Nothing special about Python here, aside from the fact that both the crawling and data wrangling frameworks used later are in Python as well, so it makes sense to stick with it.
Workflows with Jenkins
Building workflows with Jenkins can be great, and for a while I ignored Luigi, Oozie, and Azkaban in favor of using Jenkins. It can offer everything that Airflow provides and more, in the form of Jenkins plugins and its ecosystem, and before you're done modeling data jobs, you can integrate your existing CI jobs for the code that wrangles your data directly with your data pipeline.
You get dependency modeling between jobs, a job ID and isolation, artifacts, history, Gantt charts, visualizations, and more, plus a distributed model for workers that's battle-proven over Mesos, Kubernetes, or what-have-you, including built-in, robust scheduling and triggering mechanisms.
If your data pipeline work doesn't require dynamic workflows, complex workflows, and extreme flexibility, then by all means, there's no reason why you can't model your data pipeline workflows in Jenkins. It'll be simpler and closer to what you probably already know.
Otherwise, it'd be tedious, and it won't be a platform you can meaningfully evolve while remaining within your data realm. For example, while you could build plugins for customizing your workflow platform (read: Jenkins), it'd be a horrible experience to build, and probably a nightmare getting a grasp of Jenkins' internal APIs, compared to Airflow's small API surface area and its "add a script and import" ergonomics.
Workflows with Airflow
Here's a minimal DAG with Airflow, with some naive configuration to keep this example readable.
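Something along these lines is a reasonable sketch; the `price_tasks` module and its factory functions are placeholders I'm assuming here (the `crawl` one is shown a bit further down), not a fixed recipe:

```python
from datetime import datetime

from airflow import DAG

# price_tasks is a hypothetical module holding the small factory functions
# described below; crawl() is sketched in the next snippet, and the rest
# follow the same pattern.
from price_tasks import crawl, combine, agg, show

default_args = {
    'owner': 'data',
    'start_date': datetime(2017, 9, 1),
    'retries': 1,
}

dag = DAG('price_tracking',
          default_args=default_args,
          schedule_interval='@daily')

crawl_main = crawl('ebay_wear', dag=dag)
crawl_enrich = [crawl(spider, dag=dag) for spider in ('zara', 'h_n_m', 'zappos')]
merge = combine(dag=dag)
aggregate = agg(dag=dag)
report = show(dag=dag)

# the main crawl and the enrichment crawls run in parallel, then converge
for upstream in [crawl_main] + crawl_enrich:
    upstream.set_downstream(merge)
merge.set_downstream(aggregate)
aggregate.set_downstream(report)
```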
I like to abstract operator creation, as it ultimately makes for a more readable code block and allows extra configuration to generate dynamic tasks, so here we have `crawl`, `combine`, `agg`, and `show`, and all can take parameters.
Here's the `crawl` function.
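Roughly like this, as a sketch; the crawler path and command are placeholders, and the important part is the doubled braces that survive Python's `.format()`:

```python
from airflow.operators.bash_operator import BashOperator


def crawl(spider, dag):
    # bash_command is a template: the doubled braces survive .format(), leaving
    # {{ds}} for Airflow to render into the DAG run's date stamp at execution
    # time. /app/crawlers is a placeholder path.
    command = 'cd /app/crawlers && JOB_ID={{{{ds}}}} scrapy crawl {}'.format(spider)
    return BashOperator(task_id='crawl_{}'.format(spider),
                        bash_command=command,
                        dag=dag)
```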
The odd-looking `{{{{ds}}}}` bit is what makes our job ID. This is a special template variable that Airflow injects for us for free: the `bash_command` parameter is actually a string template, passed into Airflow, rendered, and then executed as a Bash command. Because Airflow makes time a first-class citizen, you can look at plenty more of those special parameters here.
Hidden from our view are the metadata store, the scheduler, and the executor. The metadata store is a database that stores records of previous task runs and DAG runs, which keeps tasks from repeating if they already ran, and lets us retry, force-run tasks and DAGs, and more. The executor can be a local executor using local processes, or a distributed one using a job queue or any other scheduler that we can build.
DAG authoring is ignorant of its own operational aspects: state and execution. This means that if we wanted to let data scientists write DAGs, devops maintain the database, and data engineers work on the execution engine and DAG building blocks (operators, sensors), we could.
### The Good Parts
DAGs
Although Airflow uses the term "DAG" everywhere, it doesn't make that much of a difference. The term is making an odd hype comeback, I guess. You build and use DAGs every day when you're sorting out your builds and their dependencies; DAGs are everywhere, regardless of framework or whether you're dealing with data. But if you didn't happen to go over graph theory in university, then it'll probably make a difference in your thinking.
That said, plenty of your workflows will be linear at first, and those that aren't will require plenty of fiddling, which will all happen as you go. Most often a job will start as one task, break into a linear series of tasks, perhaps running on machines of different capacity, and then you'd optimize it to run bits in parallel, converge, and have steps communicate via a distributed file system.
DAG Control
The ability to visualize a DAG in more than one way (graph, tree, Gantt) and, more importantly, to interact with it by selecting a node and forcing a repeat, re-running a failed node, force-running an entire DAG, or looking at a node's progress and logs as part of a DAG run does make a difference, and it stands out from any other system that uses "DAG" to promote itself.
Ergonomics
The API feels programmatic. Although it isn't the most concise one, generating dynamic workflows and building new APIs on top of it almost begs to happen as you start working with Airflow. Operators, Sensors, and Hooks are simple concepts, yet they carry a lot of weight, so that when you subclass them you get a ton of functionality.
Dynamic workflows are the killer feature. It's nice to spin up a for-loop that generates an array of parallel tasks for you and saves you some typing, but also think of Airflow and its malleability as a library you can use in your own projects.
With Airflow, you can have self-assembling workflows, dynamic and parameter-bound, and you can build one of those cool data-shipping startups that hose data from one place to another, effectively building a multi-tenant workflow system and executor as a service, like AWS Data Pipeline. For example, you can generate DAGs based on configuration files or database rows, where each belongs to a different client.
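As a sketch of that idea, assuming a hypothetical per-client configuration dict (this would normally come from a file or a database), each iteration assembles and registers its own DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# hypothetical per-client configuration; in practice this could be read from
# a YAML file or a database table
CLIENTS = {
    'acme': ['zara', 'zappos'],
    'gizmo': ['h_n_m'],
}

for client, spiders in CLIENTS.items():
    dag = DAG('price_tracking_{}'.format(client),
              start_date=datetime(2017, 9, 1),
              schedule_interval='@daily')

    for spider in spiders:
        BashOperator(task_id='crawl_{}'.format(spider),
                     bash_command='JOB_ID={{{{ds}}}} scrapy crawl {}'.format(spider),
                     dag=dag)

    # the scheduler discovers DAGs by scanning module globals, so expose each
    # generated DAG under a unique name
    globals()['dag_{}'.format(client)] = dag
```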
Time
Time is a first-class citizen. But more abstractly, timing is versioning; if you're dealing with software, it's probable that you have some versioning scheme built in, from which you do delivery through periodic iterations, branches, patch and hotfix strategies, and perhaps a feature matrix.
When you're dealing with data ingress and organic traffic, it's most probable that a time window is your de facto version. For example, you probably synthesized something like `dt=20170908` as a folder name, file name, or log tag to model a "version" of the current data iteration.
The impedance mismatch between time representing a temporal thing and version representing a discrete, incremental thing will vary from a not-quite-a-workflow-management-system like Jenkins, to workflow management systems like Luigi, to data workflow management systems like Airflow.
In Airflow, there's a strong understanding of time being the natural cadence of data moving forward. For example, a time stamp should be your "job ID". Once we commit to specializing the versioning concept over data as a time window, we can talk about time-specific operations such as "backfilling" based on start and end dates, time intervals, scheduling, and so on. In other systems, where time plays an equal role to any other concept, you'd have to make that work manually.
Growth
Hooks, Operators, and Sensors are the basic building blocks Airflow relies on expanding for better growth and adoption as a project. I feel there's almost a bet that Airflow will prosper as a project if and only if people build more of these, and they are not too hard to build.
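For a sense of scale, here's a toy operator sketch (the name and behavior are made up); subclassing `BaseOperator` and implementing `execute` is most of the work:

```python
import logging

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class PriceDiffOperator(BaseOperator):
    """Toy operator: diff one run's prices against the previous run's."""

    # fields listed here are rendered as templates, so '{{ds}}' works in run_dir
    template_fields = ('run_dir',)

    @apply_defaults
    def __init__(self, run_dir, *args, **kwargs):
        super(PriceDiffOperator, self).__init__(*args, **kwargs)
        self.run_dir = run_dir

    def execute(self, context):
        logging.info('diffing prices under %s', self.run_dir)
        # the actual diff/trend-detection logic would go here
```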
There's a choice between actually running a task on an Airflow worker's own CPU cycles and having the worker merely trigger a task in a remote, more powerful cluster, and that choice allows for simplicity. More people can start with the first case, move to a hybrid model with inline Spark (pyspark, at least) code in the DAG file, which is easy and thereby attractive, and lastly go with the latter case, where an Airflow task merely triggers a job on a more resourceful machine.
### The Bad Parts
Airflow has some bad parts, but to take some of the edge off, I think whoever uses Airflow is still in the initial circle of early adopters; this is a bleeding-edge product, so expect to bleed.
Logs
Airflow is still clunky in some parts. There's implicitness going on with the automatic loading of DAGs, and its components are separate from each other (web, scheduler, executor, database, and more), which is a good thing. All this means that when something goes wrong, we have plenty of sifting through these components to understand what went wrong and then how to fix it; locating the right task and the right log is time consuming.
The exciting part is that this feels like low-hanging fruit; one idea is to show errors on the DAG visualization itself in the form of a badge with an error count, where clicking through gets you to the exact point in the logs.
Context
Although the documentation feels comprehensive, it's not. There's a lot of context missing (hopefully this article closes some of that gap), and it feels that everything is there, in the form of examples and building blocks, but no one has taken the time yet to do the appropriate write-ups.
It feels that with some more time it'll be more complete; but on the other hand, the Airflow adoption train is moving ahead, and I keep spotting confused souls to whom I've had to give this context.
Bugs
That Airflow makes such a good fit for data workflows, and is a no-brainer decision to adopt, obscures the fact that this is still a young product tackling a complex topic. There are bugs and plenty of hidden knobs to bump into (as of v1.8.1), even when you're just starting up. For the hidden knobs I recommend giving the API reference a thorough read, and for the bugs, make sure to get the latest version and be ready to go to GitHub issues every time you feel something is odd.
Lastly, don't forget that because of the previous couple of bad parts (logs and context) you'd be out of ways to get a bearing on what's wrong when something goes wrong; for that, reading Airflow's code is a good way to ground yourself with some more intuition.
### Conclusion
This wasn't an Airflow tutorial or a setup guide; there are plenty of those already. It's an attempt to hit the gray area where word-of-mouth lives, and give you some good context for what to expect and how to prepare when you decide to adopt Airflow.
Airflow is a good match for data-specific workflows because it understands time, our use case, and how we build workflows, which makes it a great tool for data engineering.