What Should a Data Orchestration Tool Do?
Data orchestration in a DataOps setup
One of the ongoing battles in the data domain is data orchestration and which tool is best for it. You hear that tool X is dead, doomed to be replaced by some new open-source tooling. Or you hear about using CI/CD tooling as a cheap way to orchestrate your data pipelines. But what is a data orchestration tool, and what should it do? That’s what this article is all about!
Even if a green status in a data orchestration tool is music to our ears, data orchestration has nothing to do with music: it is about coordinating jobs.
From A SQL Statement To A Job
Let’s say we are working on a new data pipeline, doing some data analysis, and figuring out what our SQL should look like. We would do something like this:
-- 1. Load the raw files into staging tables
load file1 to tab1;
load file2 to tab2;
load file3 to tab3;
...
-- 2. Transform the staged data (the insert into the target stays commented out while we iterate)
with
cte1 as (...),
cte2 as (...),
cte3 as (...)
-- insert into tgt
select ...
from cte1
join cte2 on ...
join cte3 on ...;
...
After we finalize our solution, it is time to automate it. Quick and dirty, we encapsulate all of the above in one step and schedule it to run somewhere (even on GitHub Actions or Jenkins, because why not?!). If something fails, we clean up the data and start from scratch. While this could work for a low-priority data pipeline, a small team, and little data, as soon as you start building high-priority data pipelines, your team grows, or your data grows vertically or horizontally, this solution becomes hard to maintain because of the little traceability and management it provides.
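To make the “one big step” idea concrete, here is a minimal sketch of such a monolithic script in Python. Everything in it is assumed for illustration: a local SQLite database (`warehouse.db`), three hypothetical CSV files, and a hypothetical `transform.sql` file holding the SQL above; none of these names come from the original pipeline.

```python
import sqlite3
from pathlib import Path

import pandas as pd

DB_PATH = "warehouse.db"           # hypothetical local database
FILES = {"tab1": "file1.csv",      # hypothetical staging tables and source files
         "tab2": "file2.csv",
         "tab3": "file3.csv"}


def run_pipeline() -> None:
    """Run every step of the pipeline as one monolithic job."""
    con = sqlite3.connect(DB_PATH)
    try:
        # Step group 1: load each raw file into its staging table
        for table, csv_file in FILES.items():
            pd.read_csv(csv_file).to_sql(table, con, if_exists="replace", index=False)

        # Step group 2: run the transformation SQL (the CTEs + insert into tgt)
        con.executescript(Path("transform.sql").read_text())
        con.commit()
    except Exception:
        # Quick-and-dirty recovery: wipe everything and start from scratch next run
        con.rollback()
        for table in list(FILES) + ["tgt"]:
            con.execute(f"DROP TABLE IF EXISTS {table}")
        con.commit()
        raise
    finally:
        con.close()


if __name__ == "__main__":
    run_pipeline()
```

Scheduling this with cron or a CI job is easy, but every failure means redoing all of the work, and there is no per-step visibility, which is exactly the maintenance pain described above.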
Jobs = data processing steps
Oftentimes, in data, we think at the lowest level of granularity. By defining the lowest level of detail, we ensure we do not end up with duplicates in the data and that our processes remain simple and easy to manage. A job is the lowest granularity in data processing and contains a single step, which can be (among others) one of the following, as sketched after the list:
- (delete/truncate) insert
- update
- upsert (or merge)
- remove
- copy
- export
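As a rough illustration of “one job = one step”, the monolithic script above could be split into single-step jobs that an orchestration tool can schedule, retry, and trace individually. The sketch below is framework-agnostic and purely hypothetical: the job names, the `Job` class, and the naive runner are made up for illustration, not taken from any specific tool.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Job:
    """One job = one data processing step (insert, upsert, copy, export, ...)."""
    name: str
    run: Callable[[], None]                 # the single step this job performs
    depends_on: list[str] = field(default_factory=list)


# Hypothetical single-step functions, one per job
def load_file1() -> None: ...               # e.g. truncate-insert into tab1
def load_file2() -> None: ...               # e.g. truncate-insert into tab2
def load_file3() -> None: ...               # e.g. truncate-insert into tab3
def build_target() -> None: ...             # e.g. the CTE query inserting into tgt


JOBS = [
    Job("load_tab1", load_file1),
    Job("load_tab2", load_file2),
    Job("load_tab3", load_file3),
    Job("build_tgt", build_target, depends_on=["load_tab1", "load_tab2", "load_tab3"]),
]


def run_all(jobs: list[Job]) -> None:
    """Naive runner: execute each job once all of its dependencies have succeeded."""
    done: set[str] = set()
    while len(done) < len(jobs):
        for job in jobs:
            if job.name not in done and set(job.depends_on) <= done:
                job.run()                    # a real tool adds retries, logging, alerting
                done.add(job.name)


if __name__ == "__main__":
    run_all(JOBS)
```

The payoff of this granularity: if `build_tgt` fails, only that job needs to be rerun; the three loads are not redone, and the orchestration tool can show exactly which step broke.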