Published: 14/01/2023

DVC: from zero to hero in an existing project and messy AWS S3 buckets


A quick overview of DVC and how it helped overcome data and model tracking issues.

What is DVC

DVC, or Data Version Control, is an open-source version control system for Machine Learning projects. In this article, I’d like to show you what it is and how we used it in a project for one of our clients at DAC.digital.


  • Data version control — Git for models and data
  • Platform agnostic (Linux, macOS, Windows)
  • Better suited to large files than Git LFS (Large File Storage)
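Under the hood, DVC replaces each large file in Git with a small plain-text pointer file (`.dvc`), while the actual content lives in a cache or remote. A pointer file looks roughly like this (the hash and size values below are illustrative):

```yaml
# some_model_v1.pth.dvc -- committed to Git instead of the large weights file
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # content hash (illustrative)
  size: 2147483648
  path: some_model_v1.pth
```

Because only these few lines go into Git history, the repository stays small no matter how large the tracked models and datasets grow.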

Where can we use DVC?

If you’d like to learn more about DVC use cases, the official documentation (https://dvc.org/doc) is a good starting point.

How DVC helped us manage our models and data more efficiently and elegantly

Problem statement:

Our structure and workflow looked a bit like this:

(Diagram: our original, messy workflow)

As you can see in the diagram above, our workflow was very messy. First of all, we had a model training repository from which our Deep Learning experts trained and tested new models. After training, they would simply upload these models to an S3 bucket under some filename.

In our main repository, Computer Vision experts would use these Deep Learning models in inference mode for various tasks. However, there were constant problems: accessing the right model, knowing which models were currently used on which branches, and having to download all available models from the S3 bucket. To access these models locally, we had a separate bash script that downloaded them from an untracked Google Drive, and in GitHub CI, all of them were downloaded directly from the S3 bucket. I don’t think I need to explain the issues this caused.

Step-by-step de-cluttering of our workflow and DVC setup:

Please note: it is strongly advised to keep separate data and model registry GitHub repositories when models and data are used across several projects and repositories. Files would then be added and tracked directly on a DVC-tracked S3 bucket.

In the step-by-step instructions below, our approach is a little different, as we store all associated DVC files in our main repository. The key reason stems from the limitation that in GitHub CI, the build process by default only has access to a single repository (the current one, where GitHub Actions are enabled).

If we were to pull data tracked in another private repository, i.e. www.github.com/our_organisation/model-registry, GitHub would require a token with a wider scope than GITHUB_TOKEN, such as a personal access token, which is not ideal in our situation: it ties the build process to an individual who may leave, and to a token that can expire. Overall, the downsides of this approach outweighed the advantages in our particular case. However, if your models are used across multiple repositories, I suggest looking into how to provide Git credentials to CI.

If you are fine with using a personal access token, then I would personally recommend https://github.com/marketplace/actions/setup-git-credentials.
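For reference, pulling DVC-tracked files inside GitHub CI can be sketched as a workflow step like the one below. This is only an outline: the secret names are placeholders, and the `iterative/setup-dvc` action is one possible way to install DVC; adapt both to your setup.

```yaml
# Hedged sketch of a GitHub Actions job; secret names are placeholders
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-dvc@v1
      - name: Pull DVC-tracked models
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull
```

Since the pointer files live in the main repository itself, the default GITHUB_TOKEN is enough for the checkout; only the S3 credentials need to be provided as secrets.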

Step-by-step approach:

  1. Move all data in AWS S3 under a single organization for easy access. This prevents sharing private access key IDs to download data from various organizations when a team member isn’t part of both.
  2. Create a new bucket to serve as a remote registry for DVC from the main repository.
  3. Add the remote to the main repository:

dvc remote add -d myremote s3://path/to/remote
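After this command, DVC records the remote in `.dvc/config`, which is committed to Git so every collaborator (and CI) shares the same default remote. The generated file looks like this:

```ini
# .dvc/config (generated by `dvc remote add -d`)
[core]
    remote = myremote
['remote "myremote"']
    url = s3://path/to/remote
```

The `-d` flag is what sets `myremote` as the default under `[core]`, so later `dvc push` and `dvc pull` commands need no extra arguments.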

4. Move all non-.py files in our repo to a separate folder, then track the folder and its files using DVC, i.e.

dvc add some_data/
dvc push

5. Restructure the AWS S3 bucket used for model storage. In our scenario, we simply end up with a better-structured S3 bucket, and we track these files with DVC:

s3://models-our-client/segmentation/some_model_v1.pth
s3://models-our-client/tracking/some_model_v1.pth
s3://models-our-client/classification/some_model_v1.pth
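With the bucket laid out like this, the key for any model is predictable. A tiny hypothetical helper (not part of our actual codebase) illustrates the naming convention; the bucket name and task names come from the listing above:

```python
# Hypothetical helper reflecting the restructured bucket convention:
# s3://models-our-client/<task>/<name>_v<version>.pth
TASKS = {"segmentation", "tracking", "classification"}

def model_uri(task: str, name: str, version: int) -> str:
    """Build the S3 URI for a model under the restructured bucket."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"s3://models-our-client/{task}/{name}_v{version}.pth"
```

For example, `model_uri("segmentation", "some_model", 1)` yields `s3://models-our-client/segmentation/some_model_v1.pth`, matching the first entry above.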

6. “Import” the models currently in use into the main repository

Here we import a model from external storage (not associated with our DVC remote) and start tracking it, along with a copy in our remote storage. For more info, please refer to: https://dvc.org/doc/command-reference/import-url

dvc import-url s3://models-our-client/segmentation/some_model_v1.pth
dvc import-url s3://models-our-client/tracking/some_model_v1.pth
dvc import-url s3://models-our-client/classification/some_model_v1.pth

git add . # stage the .dvc pointer files created by import-url
git commit -m "adding remote models"
dvc push
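Once everything is imported, each model has a matching pointer file committed to the branch, so seeing which models a branch uses is as simple as scanning those pointers. The sketch below does this naively with line matching; real `.dvc` files are YAML, so a YAML parser is the robust choice in production code.

```python
from pathlib import Path

def tracked_models(repo_dir: str) -> dict:
    """Map tracked file path -> md5 by scanning *.dvc pointer files.

    Naive sketch for flat pointer files only; .dvc files are YAML,
    so prefer a YAML library in real code.
    """
    models = {}
    for pointer in Path(repo_dir).glob("*.dvc"):
        md5 = path = None
        for line in pointer.read_text().splitlines():
            stripped = line.strip().removeprefix("- ")
            if stripped.startswith("md5:"):
                md5 = stripped.split(":", 1)[1].strip()
            elif stripped.startswith("path:"):
                path = stripped.split(":", 1)[1].strip()
        if path:
            models[path] = md5
    return models
```

This answers the question that plagued the old workflow ("which models does this branch actually use?") with a plain directory scan, no S3 access required.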

Result:

(Diagram: our restructured DVC workflow)

Step 8 (extra): Use case when a user wants to replace a model with one under a different filename:

dvc remove model_v1.pth.dvc
dvc add model_v1_extra.pth
git add .
git commit -m "replaced model"
dvc push
git push

Step 9 (extra): Use case when a user wants to update a model under the same filename:

dvc add model_v1.pth
git add .
git commit -m "new model version"
dvc push
git push
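Because every model version is pinned by a pointer file in Git history, rolling back is just a Git operation followed by a DVC checkout. A sketch (where `<rev>` stands for whichever commit held the version you want — this placeholder must be filled in):

```shell
# Restore the pointer file from an older commit, then materialize the data
git checkout <rev> -- model_v1.pth.dvc
dvc checkout model_v1.pth
```

No manual hunting through S3 filenames: the pointer file's hash tells DVC exactly which object to restore from the remote.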

Step 10: Use https://studio.iterative.ai to:

  • Register new model versions
  • Track experiments
  • Explore data across various branches

Tips and tricks:

Future work:

  • DVC tracked data-registry repository

Cons of using DVC:


Michał Ostyk  

Computer Vision Engineer, DAC.digital

Michał is a Computer Vision Engineer with experience in agriculture, fast food, sports analytics, and healthcare. He loves researching SOTA methods and turning them into MVPs in PyTorch. Recently, however, he has delved deeper into MLOps.

 

