DataOps — What Is It And Why Should You Care?

Basic concepts and available tools to achieve the data pipeline you dream of

Ivelina Yordanova
Better Programming



So, I realize I am a bit late on this, since the concept of DataOps has been around for about as long as I have been in tech, but I only came across it recently, and after reading up on it I thought there might be other people like me whose minds will be blown by it. There are lots of good ideas here that can be implemented separately and incrementally, and even if you never get to a one-button-does-it-all solution, there is much to be gained by following the principles of the framework. What's more, the framework literally frames a lot of individual concepts into one bundle, which makes each of them make more sense and gives you an ideal goal to strive for.

What is DataOps then and why does it sound so much like DevOps?

DevOps is a combination of principles for ways of working within an organization, best practices, and tools that help companies develop and deliver software quickly. The aim is to automate as much of the product lifecycle as possible (build, test, deploy, etc.). The keywords here are continuous integration (CI) and continuous delivery (CD) of software by leveraging on-demand IT resources (infrastructure as code), hence the name: "DEVelopment" + "OPerationS"/IT. This is, in a way, the practical side, the implementation that makes the Agile methodology possible.

DataOps is often described, in short, as DevOps applied to data, which is a nice summary but doesn't tell you much. Everyone who has ever worked with a more complex system knows that there are more variables going into delivering value from a data product than from a software one. So, let's dig in a bit deeper. Yes, the goal is the same: delivering value quickly in a predictable and reliable way. However, DataOps aims to solve the challenge of delivering not only your data-ingestion-and-transformation jobs, models, and analytics as versioned code but also the data itself. It tries to balance agility with governance: one speeds up delivery, and the other guarantees the security, quality, and deliverability of your data.

In terms of changing the way one thinks about different data pipeline components, there are a few highlights I spotted while reading:

  • Extract, load, transform (ELT) — data should be loaded raw, without any unnecessary transformations, into your data warehouse or lake. This has the benefit of reducing load times and the complexity of your ingestion jobs. The goal is also not to discard anything that might bring value later, by keeping everything in its original form alongside your transformed and analytics-ready data. In some cases, some transformation is required for legal reasons and cannot be avoided; that's when you'll see the abbreviation EtLT. Along the same line of thinking, no data is ever deleted. At most, it is archived in lower-cost storage solutions (see the sketch after this list).
  • CI/CD — for those who are familiar with DevOps, the new concept here is lifecycling database objects. Modern data platforms such as Snowflake offer advanced features like time travel and zero-copy clones. Additionally, there are solutions for versioning and branching your data lake, such as lakeFS.
  • Code design and maintainability — it's all about small, reusable bits of code in the form of components, modules, libraries, or whatever atomic code entities your frameworks offer. Naturally, each company would build its own repository of these and turn them into a standard for all internal projects. It should go without saying that the code should follow best practices and have a standard style and formatting applied. Good documentation is also crucial.
  • Environment management — the ability to create and efficiently maintain both long-lived and short-lived environments from branches is essential in DataOps.
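To make the ELT idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: a local SQLite file (assuming a build with the JSON1 functions) stands in for the warehouse, events.json for a newline-delimited export from some source system, and the table names are hypothetical. The point is only the shape of the flow: land the data untouched first, transform it later.

```python
import sqlite3

# Hypothetical warehouse: a local SQLite file stands in for Snowflake, BigQuery, etc.
conn = sqlite3.connect("warehouse.db")

# "EL" step: land the records exactly as received, one JSON payload per row,
# plus a little load metadata. Nothing is filtered or reshaped here.
conn.execute("CREATE TABLE IF NOT EXISTS raw_events (loaded_at TEXT, payload TEXT)")
with open("events.json") as source:  # assumed newline-delimited JSON export
    for line in source:
        conn.execute(
            "INSERT INTO raw_events VALUES (datetime('now'), ?)", (line.strip(),)
        )

# "T" step: transformations run downstream, reading from the raw table and
# producing an analytics-ready table. The raw records are never modified,
# so any transformation can be fixed and re-run from the original data.
conn.execute("DROP TABLE IF EXISTS daily_signups")
conn.execute(
    """
    CREATE TABLE daily_signups AS
    SELECT date(json_extract(payload, '$.timestamp')) AS day,
           count(*)                                   AS signups
    FROM raw_events
    WHERE json_extract(payload, '$.type') = 'signup'
    GROUP BY day
    """
)
conn.commit()
```

Because the raw table keeps the original payloads, a bug in the transformation (or an entirely new transformation) only requires re-running the "T" step; nothing has to be re-ingested.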

There are 18 general principles listed in the DataOps manifesto, and they do sound a lot like a mix of Agile and DevOps principles; you can read more about them on the official page. Here is the significant mind shift that needs to happen, though, to apply all of that to a data pipeline: you need to start thinking about your data as a product.

How is “Data Product” different?

Usually, people refer to a new piece of work that will deliver value to the business as a "project". So, in a way, the data pipeline and everything around it supporting data ingestion, consumption, processing, visualization, and analysis is a collection of small projects. Instead, though, those should be framed as "products". The key differences between the two are:

  • a project is managed and developed by a team for its duration and has an end date, when it is delivered. A product, on the other hand, has an owning team supporting it and no end date attached to it
  • a project has a scope and a goal, and it might get released a few times until it reaches that goal, but once that's done the project is considered done. A product, on the other hand, is something people invest in; it evolves, gets updated, gets reviewed, and is constantly improved (in an Agile way)
  • a project’s testing is limited to the defined and signed-off scope. A product, on the other hand, has automated unit, regression, and integration tests as part of the release process

If you look up DataOps, the following diagram (or a variation of it) will come up. It sums up very well the infinite cycle of planning, development, testing, and code delivery. It also hints at the need to collaborate with the business (the product stakeholders) at different stages and to get actionable feedback to use in the next iteration.

source: https://www.dataops.live/

Making sense of the sea of tools, frameworks, and languages

Generally, a data product involves a bigger variety of technologies than an isolated software product. These build up naturally over time as different teams of data analysts, data scientists, data engineers, and software developers explore better options to meet the business needs and follow their own wishes to learn and grow in their careers.

In any case, directly or indirectly, all tools and frameworks generate some code, at least if the principles are followed and you have your configurations versioned and everything set up in the same way as for any other software project.

Additional complexity comes from the system design. Data usually comes from multiple sources and often moves non-linearly through the system; multiple processes may run in parallel, and transformations can be applied at different stages.

DataOps tries to simplify this with the concept of a central repository, which serves as a single source of truth for "anything code and config" in your system. Usually that is referred to as the "data pipeline" (which is a bit ambiguous, since I usually imagine a straight pipe from source to sink; consider the caveat above). If we imagine having that all-knowing repository, with automated orchestration of the data and the processes handling it, then releasing a change is just a click away. What's more, keeping track of who does what and collaborating becomes way easier when every team has visibility of the changes in development and the ones going on at any moment. This reduces potential bugs caused by miscommunication and increases the quality of the resulting data product.
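To give a feel for what "anything code and config" in one repository can look like, here is a sketch of a small orchestration definition that could live there, written against Apache Airflow's 2.x API. The DAG name, schedule, and task functions are invented for the example, and any other orchestrator would express the same idea differently.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables: in a real repository these would import the actual
# extract/load/transform logic, which is versioned alongside this file.
def extract():
    print("pull data from the source system")

def load():
    print("land it, untouched, in the warehouse")

def transform():
    print("build the analytics-ready tables")

with DAG(
    dag_id="orders_pipeline",    # hypothetical pipeline name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # trigger: a simple daily schedule
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The dependency graph is code too, so a change to the flow gets reviewed,
    # versioned, and released like any other change.
    extract_task >> load_task >> transform_task
```

In reality the graph rarely stays this linear, which is exactly why having it declared in one reviewed, versioned place matters.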

Let’s look at the moving parts DataOps tries to put in place with this magical repo.

Pipelines

When talking about pipelines in DataOps, there are 2 possible types we could be referring to:

  • Development and deployment — these are the ones familiar from DevOps. That's the CI/CD pipeline to build, test, and release a data platform's containers, APIs, libraries, etc.
  • Data — that's the pipeline orchestrating all the components (whether they are scheduled jobs, constantly running web services, or something else) that actually move the data from location A to location B and apply all the required processing. It's possible and highly likely that a company would have a few of those.

Jobs

That's the part with the most variables. There are so many options as to the following (see the sketch after the list):

  • what technologies it uses — Snowflake, Airflow, Python, HTTP...
  • what triggers it — is it scheduled, does it wait for a condition to be met, does it run constantly
  • where it runs — on an on-premises server, in a private cloud, in a third-party environment you know little about
  • how it handles errors
  • what defines success
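To make those variables a bit more tangible, here is a hypothetical wrapper (the names, defaults, and callables are invented) showing how the trigger, the error handling, and the definition of success can be made explicit and testable instead of being left implicit in each job:

```python
import time

def run_job(extract, validate, max_retries=3, backoff_seconds=30):
    """Run `extract`, decide success with `validate`, and retry on errors."""
    for attempt in range(1, max_retries + 1):
        try:
            result = extract()        # the actual work: pull/move/transform data
        except Exception as exc:      # error handling: log, back off, retry
            print(f"attempt {attempt} failed: {exc}")
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds * attempt)
            continue
        if validate(result):          # success is an explicit, testable condition
            return result
        raise ValueError("extracted data failed validation")

# Example usage with made-up callables: success means "we got at least one row".
rows = run_job(
    extract=lambda: [{"id": 1}, {"id": 2}],
    validate=lambda data: len(data) > 0,
)
```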

Data Catalogue

No matter the number and types of tools and services orchestrated as part of the pipeline, it's important that they are capable of adding and understanding metadata. Metadata serves as the language in which the software components processing the data communicate with each other and, in the end, is a great source of debugging and business information. It can help you, as a maintainer of the pipeline, identify what data is in your system and how it's moving through it, and trace and diagnose issues. It's also helpful for the business to know what data is available and how to use it.

There are data cataloging tools promoting machine-learning-augmented catalogs that are supposed to be able to discover data, curate it, perform tagging, create semantic relationships, and allow simple keyword searches through the metadata. ML can even recommend transformations and auto-provision data and data pipeline specifications. However, even with ML, any data catalog is only as good as the data and metadata it is working with. A centrally-orchestrated DataOps pipeline "knows" all about where data is coming from, how it's tested, transformed, and processed, where it ends up, and so on.
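As a small illustration of the kind of metadata worth capturing, here is a sketch of a catalog entry that a pipeline step could emit. The dataset, job, and tag names reuse the hypothetical examples from earlier; in practice this would be sent to whatever catalog tool you use rather than printed.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

# A toy "catalog entry": the metadata each step could attach to a dataset so
# that both downstream jobs and humans can discover it and trace its lineage.
@dataclass
class DatasetMetadata:
    name: str
    source: str
    produced_by: str  # the job or pipeline step that wrote it
    schema: dict      # column name -> type
    tags: list = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

entry = DatasetMetadata(
    name="daily_signups",
    source="raw_events",
    produced_by="orders_pipeline.transform",
    schema={"day": "date", "signups": "integer"},
    tags=["analytics-ready", "no-pii"],
)

# In a real setup this would be pushed to the catalog's API; printing it here
# just makes the shape visible.
print(json.dumps(asdict(entry), indent=2))
```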

Having quality metadata, along with the versioning in your orchestrating repository, will also guarantee accountability and better quality of the data itself. It's important to maintain that not only for the sake of the business value it adds but also for the internal cultural element of having your stakeholders trust you and what you do, which creates a better work environment... and happy, calm people work better.

Testing

A key concept in DataOps is testing. It should be automated and run in different stages before any code goes into production. An important distinction from DevOps is that orchestration happens twice in DataOps: once for the tools and software handling your data, and a second time for the data itself, i.e. for the two pipelines described a few paragraphs back.

source datakitchen: https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7

What is missing in the above flow is that the complexity of the test stage in DataOps is also doubled. In order to do proper regression and integration testing, you need a representative data set to be selected (which is not a trivial task in itself) and anonymized. What's more, there are often issues that can really only be caught much further downstream in your pipeline, so for an integration test you need an almost end-to-end setup. This is both a technical and a financial challenge: replicating a full pipeline, even for a brief moment, can be too expensive and hard to achieve. In a world where DataOps is implemented in its purest form this might be possible, but in many cases, smaller companies especially do not really have the resources for it. The goal is rather to use the principles as a guide when you design your system and to implement the bits and pieces that will bring you the most value and save you the most time, effort, and cost... easy, right?
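As a small example of what automated data tests can look like (on top of the usual unit tests for the code itself), here is a pytest-style sketch that checks the hypothetical tables from the earlier ELT example against a small, anonymized sample warehouse. The checks are invented for illustration.

```python
import sqlite3

import pytest

@pytest.fixture
def conn():
    # The small, anonymized sample warehouse built for the test environment.
    return sqlite3.connect("warehouse.db")

def test_no_future_signup_dates(conn):
    bad = conn.execute(
        "SELECT count(*) FROM daily_signups WHERE day > date('now')"
    ).fetchone()[0]
    assert bad == 0

def test_signup_counts_are_positive(conn):
    bad = conn.execute(
        "SELECT count(*) FROM daily_signups WHERE signups <= 0"
    ).fetchone()[0]
    assert bad == 0

def test_raw_and_transformed_are_consistent(conn):
    raw = conn.execute(
        "SELECT count(*) FROM raw_events "
        "WHERE json_extract(payload, '$.type') = 'signup'"
    ).fetchone()[0]
    transformed = conn.execute(
        "SELECT coalesce(sum(signups), 0) FROM daily_signups"
    ).fetchone()[0]
    # Nothing should be lost or duplicated by the transformation step.
    assert raw == transformed
```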

DataOps Tools

That all sounds good in theory, but how do you even begin to implement it? Well, there are tools on the market that claim to work in this space and to follow those principles, each to a different degree. The bigger ones are:

  • Unravel — offers AI-driven monitoring and management. It validates code and offers insights into how to improve testing and deployment. It’s for companies that want to focus on optimizing the performance of their pipeline.
  • DataKitchen — serves more like an add-on on top of your existing pipeline and provides the typical DevOps components like CI/CD, testing, orchestration, and monitoring
  • DataOps.live — in addition to the DevOps components, DataOps.live provides some elements of the data pipeline, like ELT/ETL, modelling, and governance. Customers who use Snowflake can also benefit from its database versioning and branching capabilities
  • Zaloni — contrary to the previous two, this one focuses heavily on the data pipeline and on delivering all of its components within one platform. That's not to say that they don't offer the DevOps components; they do. It's a very well-rounded tool and is perfect for companies with strict governance requirements

