Build Production-Ready ML Workflow With DVC and S3
DVC: Same as Git but for data
In this article, we will introduce Data Version Control (DVC), an open-source tool developed by the Iterative.ai team that makes machine learning (ML) models shareable and reproducible.
We begin with a few words about Git. While Git is excellent at versioning code, it is not ideal for storing data. The large binary files, images, videos, or text documents typically used to train ML models can cause problems when kept in a repo. Data modified across multiple commits can consume a lot of space, so operations like git push become slow.
DVC solves this problem: it supports several types of remote storage (Amazon S3, GCS, Azure, HDFS, etc.) and interacts with them through an intuitive, Git-like interface.
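As a quick sketch of what this looks like in practice, the commands below initialize DVC in an existing Git repository and register an S3 bucket as the default remote (the bucket name and path here are placeholders, not a real resource):

```shell
# Initialize DVC inside an existing Git repository
dvc init
git commit -m "Initialize DVC"

# Register an S3 bucket as the default remote storage
# ("my-ml-bucket/dvc-store" is a placeholder path)
dvc remote add -d storage s3://my-ml-bucket/dvc-store
git commit .dvc/config -m "Configure S3 remote"
```

Note that the remote configuration lands in .dvc/config, which is versioned with Git, so teammates who clone the repo inherit the same storage setup.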

DVC features can be grouped into several components:
- Data and model versioning: DVC tracks datasets stored outside the repo and makes it efficient to share them and to switch back and forth between branches.
- Data and model access: defines how to use tracked artifacts and import data from another DVC project.
- Data pipelines: describes how model artifacts are built and provides a way to reproduce them.
- Metrics, parameters, and plots: lets you evaluate an ML model and track validation metrics across versions.
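To make the first of these features concrete, a minimal data-versioning workflow looks roughly like this (the file path is illustrative):

```shell
# Track a dataset with DVC instead of Git; this creates a small
# .dvc pointer file that Git versions in place of the data itself
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

# Upload the actual data to the configured remote (e.g. S3)
dvc push

# Later, on another machine or after switching branches, restore
# the data version that matches the checked-out .dvc file
dvc pull
```

Switching Git branches and running dvc checkout (or dvc pull, if the data is not in the local cache) swaps the dataset to match that branch, which is what makes moving between experiments cheap.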