Build Production-Ready ML Workflow With DVC and S3
DVC: Same as Git but for data
In this article, we will introduce Data Version Control (DVC), an open-source tool developed by the Iterative.ai team that makes machine learning (ML) models shareable and reproducible.
We begin with a few words about Git. While Git is excellent at versioning code, it is not ideal for storing data. The large binary files, images, videos, or text documents typically used to train ML models can cause problems when kept in a repo. Data modified across multiple commits can consume a lot of space, so operations like git push become slow.
DVC solves this problem: it supports several types of remote storage (Amazon S3, GCS, Azure, HDFS, etc.) and interacts with them through an intuitive, Git-like interface.
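As a quick sketch of what this looks like in practice, the commands below initialize DVC in an existing Git repository and register an S3 bucket as the default remote (the bucket name and path here are placeholders, not a real resource):

```shell
# Initialize DVC inside an existing Git repository
dvc init
git commit -m "Initialize DVC"

# Register an S3 bucket as the default remote storage
# ("my-ml-bucket/dvc-store" is a placeholder path)
dvc remote add -d storage s3://my-ml-bucket/dvc-store
git commit .dvc/config -m "Configure S3 remote"
```

Note that the remote configuration lands in .dvc/config, which is versioned with Git, so teammates who clone the repo inherit the same storage setup.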

DVC features can be grouped into several components:
- Data and model versioning: DVC tracks datasets stored outside the repo and makes it efficient to share them and to switch back and forth between branches.
- Data and model access: defines how to use tracked artifacts and import data from another DVC project.
- Data pipelines: describes how model artifacts are built and provides a way to reproduce them.
- Metrics, parameters, and plots: lets you evaluate an ML model and track validation metrics across versions.
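To make the first of these features concrete, a minimal data-versioning workflow looks roughly like this (the file path is illustrative):

```shell
# Track a dataset with DVC instead of Git; this creates a small
# .dvc pointer file that Git versions in place of the data itself
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

# Upload the actual data to the configured remote (e.g. S3)
dvc push

# Later, on another machine or after switching branches, restore
# the data version that matches the checked-out .dvc file
dvc pull
```

Switching Git branches and running dvc checkout (or dvc pull, if the data is not in the local cache) swaps the dataset to match that branch, which is what makes moving between experiments cheap.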