A quick overview of DVC and how it helped overcome data and model tracking issues.
DVC, or Data Version Control, is an open-source version control system for Machine Learning projects. In this article, I’d like to show you what it is and how we used it in a project for one of our clients at DAC.digital.
What is DVC?
- Data version control: Git for models and data
- Platform agnostic (Linux, macOS, Windows)
- A better fit for ML than Git LFS (Large File Storage)
Where can we use DVC?
- Version models and data
- CI/CD for Machine Learning
- Fast and secure data caching hub
- ML experiment tracking (alternatives: https://neptune.ai/blog/dvc-alternatives-for-experiment-tracking#guild)
- Model and Data Registry
If you’d like to learn more about the above use cases, please refer to:
- My GitHub tutorial: https://github.com/Ostyk/dvc-tutorial
- Slides from a workshop DAC.digital
How DVC helped us manage our models and data in a more efficient and more elegant way
Our structure and workflow looked a bit like this
As you can see in the diagram above, our workflow structure was very messy. First of all, we had a model-training repository in which our Deep Learning experts trained and tested new models. After training, they would simply upload these models to an s3 bucket under some filename.
In our main repository, Computer Vision experts would use these Deep Learning models in inference mode for various tasks. However, there were recurring problems: accessing the right model, knowing which models were currently in use on which branches, and having to download every model in the s3 bucket. To access the models locally, we had a separate bash script that downloaded them from an untracked Google Drive, and in GitHub CI they were all downloaded directly from the s3 bucket. I don’t think I need to explain the issues this caused.
Step-by-step de-cluttering of our workflow and DVC setup:
Please note: it is strongly advised to have separate data and model registry GitHub repositories when models and data are used across several projects and repositories. Files would then be added and tracked directly on a DVC-tracked s3 bucket.
In the step-by-step instructions below, our approach is a little different, as we store all associated DVC files in our main repository. The key reasoning stems from a limitation of GitHub CI: by default, the build process only has access to a single repository (the current one, where GitHub Actions is enabled).
If we were to pull data tracked in another private repository, e.g. www.github.com/our_organisation/model-registry, GitHub would require a broader token than GITHUB_TOKEN, such as a personal access token. That is not ideal in our situation: it ties the build process to an individual who may leave, and the token itself can expire. Overall, the downsides of this approach outweighed the advantages in our particular case. However, if your models are used across multiple repositories, I suggest checking out the credential discussion on:
- GitHub: https://github.com/iterative/dvc/issues/8068#issuecomment-1243783489
- Discord: https://discord.com/channels/485586884165107732/1057317426997514291
If you are fine with using a personal authentication token, then I would personally recommend https://github.com/marketplace/actions/setup-git-credentials.
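For reference, a CI job that pulls DVC-tracked files can be sketched roughly as below. This is a minimal sketch, not our actual workflow: the workflow name, step names, and AWS secret names are assumptions, and the `iterative/setup-dvc` action is one common way to install DVC in GitHub Actions.

```yaml
# .github/workflows/ci.yml (hypothetical names)
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-dvc@v1
      - name: Pull DVC-tracked models and data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull
```

Because everything is tracked from the main repository, GITHUB_TOKEN plus the AWS secrets are all this job needs.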
1. Move all data in AWS S3 to a single organization for easy access. This prevents having to share private key IDs for accessing and downloading data from various organizations when a team member is not part of both.
2. Create a new bucket to serve as a remote registry for DVC from the main repository.
3. Add the remote to the main repository:
dvc remote add -d myremote s3://path/to/remote
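Credentials for the s3 remote can be kept out of Git with the `--local` flag, which writes them to `.dvc/config.local` instead of the committed `.dvc/config`. A sketch (the remote name and key values are placeholders):

```shell
# Add the default remote (recorded in .dvc/config, committed to Git)
dvc remote add -d myremote s3://path/to/remote
# Store credentials locally only, in .dvc/config.local (not committed)
dvc remote modify --local myremote access_key_id 'placeholder-key-id'
dvc remote modify --local myremote secret_access_key 'placeholder-secret'
```

In CI, the same credentials can instead come from the standard AWS environment variables, so nothing secret ever lands in the repository.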
4. Refactor the repo so that all non-.py files are moved to a separate folder, then track that folder and its files with DVC, e.g.
dvc add some_data/
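`dvc add` writes a small pointer file (`some_data.dvc`) and appends the real folder to `.gitignore`, so Git only ever tracks the pointer. A typical follow-up sequence looks like:

```shell
dvc add some_data/              # hashes the folder and moves it to .dvc/cache
git add some_data.dvc .gitignore
git commit -m "track some_data with DVC"
dvc push                        # upload the cached files to the s3 remote
```

Anyone else on the team then runs `dvc pull` after `git pull` to materialize the actual files.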
5. Restructure the AWS S3 bucket used for model storage. In our scenario, we simply end up with a better-structured s3 bucket whose files we track with DVC.
6. “Import” the models currently in use into the main repository.
Here we import a model from external storage (not associated with our remote storage) and start tracking it, along with a copy in our remote storage. For more info, please refer to: https://dvc.org/doc/command-reference/import-url
dvc import-url s3://models-our-client/segmentation/some_model_v1.pth
dvc import-url s3://models-our-client/tracking/some_model_v1.pth
dvc import-url s3://models-our-client/classification/some_model_v1.pth
git add *.dvc .gitignore # import-url created one .dvc pointer file per model
git commit -m "adding remote models"
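A nice property of `dvc import-url` is that each generated `.dvc` file remembers its source URL, so when a model changes upstream it can be refreshed with `dvc update` (the filename below matches the import example above):

```shell
# Re-check the source s3 URL and re-download the file if it changed upstream
dvc update some_model_v1.pth.dvc
git add some_model_v1.pth.dvc
git commit -m "bump model to latest upstream version"
```

This keeps the main repository in sync with the model-training bucket without any manual download scripts.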
Step 7 (extra): use case where a user wants to replace a model with one under a different filename.
dvc remove model_v1.pth.dvc
dvc add model_v1_extra.pth
git add .
git commit -m "replaced model"
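After removing the old pointer, the stale model may still occupy space in the local cache (and on the remote). It can be cleaned up with `dvc gc`; use `--cloud` with care, as it also deletes from the remote:

```shell
# Drop cached files no longer referenced by the current workspace
dvc gc --workspace
# Optionally also remove them from the s3 remote
dvc gc --workspace --cloud
```

Skipping this step is harmless; it only costs storage, never correctness.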
Step 8 (extra): use case where a user wants to update a model under the same filename.
dvc add model_v1.pth
git add .
git commit -m "new model version"
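Teammates then pick up the new version with a plain pull, and rolling back is just a Git operation followed by `dvc checkout` (the revision below is a placeholder):

```shell
git pull && dvc pull            # fetch the updated pointer, then the weights
# Roll back to the previous model version if needed
git checkout HEAD~1 -- model_v1.pth.dvc
dvc checkout model_v1.pth.dvc   # restore the matching weights from the cache
```

This is the core payoff of DVC: model versions move through history exactly like code.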
Step 9: Use https://studio.iterative.ai to:
- Register new model versions
- Track experiments
- Explore data across various branches
Tips and tricks:
- DVC has VS Code and PyCharm extensions
- DVC Studio is a really handy GUI for managing your models and experiments
- A DVC-tracked data-registry repository is worth considering when data is shared across projects
Cons of using DVC:
- DVC Studio has a two-person limit on the free plan
- It is not always the ideal solution for experiment tracking. Please refer to this article for some alternatives: https://neptune.ai/blog/dvc-alternatives-for-experiment-tracking#guild
Workflow sketches were made with https://app.diagrams.net/
Computer Vision Engineer, DAC.digital
Michał is a Computer Vision Engineer with experience in agriculture, fast food, sports analytics, and healthcare. He loves researching SOTA methods and converting them into MVPs in PyTorch. Recently, he has delved deeper into MLOps.