Configuring DVC with Object Storage
Git is arguably the most popular and powerful version control system for storing code, and with Git LFS it can handle files of up to 5 GB (the exact limit depends on the hosting provider).
However, for larger datasets, you might need to turn to third-party version control tools built to handle them.
Data Version Control (DVC) was designed with this use case in mind. It works alongside Git and allows you to store your data in the remote storage of your choice (such as a Scaleway Object Storage bucket) while storing only the metadata in a Git repository.
In this tutorial, you will learn how to use Scaleway Object Storage as remote storage for DVC.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- Owner status or IAM permissions allowing you to perform actions in the intended Organization
- A valid API key
- An Object Storage bucket
- A repository to store your metadata
- Made your first request with the Scaleway API
- Authenticated to the API for the first time
- Installed the AWS CLI
Setting up DVC
- Run the following command to install the DVC Python package:
    pip3 install dvc
- Run the following command to install the Amazon S3 dependencies:
    pip3 install "dvc[s3]"
- Run the following command in the desired repository to initialize DVC:
    dvc init
- Run the following command to commit the initial DVC configuration files:
    git commit -m "Initialize DVC"
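Before running the steps above, it can help to confirm that the required command-line tools are available. The following is a minimal sketch, not part of the original tutorial: `missing_tools` is a hypothetical helper name, and the list of tools passed to it is an assumption based on the commands used in this tutorial.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC or the Scaleway tooling):
# print the name of every command passed as an argument that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        # `command -v` exits with a non-zero status when the command is not found.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call, commented out so the file can be sourced safely:
# missing_tools git pip3 dvc aws
```

If the function prints nothing, every listed tool was found. Any name it prints needs to be installed before continuing.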
Retrieving and tracking data
- Run the following command to copy the example data file to your repository:
    dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
- Run the following command to stage the files you want to store in your bucket:
    dvc add data/data.xml
- Run the following command to track the DVC metadata file and .gitignore file with Git:
    git add data/data.xml.dvc data/.gitignore
  Note: The .gitignore file contains paths to the data files so they are not tracked by Git.
- Run the following command to commit the metadata file:
    git commit -m "Add raw data"
- Run the following command to push your changes to your Git repository:
    git push
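The metadata file that `dvc add` creates (`data/data.xml.dvc` here) is a small YAML file that records a hash of the tracked data, which is how Git can reference the dataset without storing it. The sketch below shows one way to read that hash from the shell. `dvc_md5` is a hypothetical helper name, and the file layout shown in the comments is an assumption about the `.dvc` format that may vary between DVC versions.

```shell
#!/bin/sh
# Assumed layout of data/data.xml.dvc (illustrative, may differ by DVC version):
#   outs:
#   - md5: 22a1a2931c8370d3aeedd7183606fd7f
#     size: 14445097
#     path: data.xml

# Hypothetical helper: print the first md5 value found in a .dvc metadata file.
dvc_md5() {
    # Match "md5: <hash>" whether or not it follows a YAML list dash.
    awk '$1 == "md5:" || ($1 == "-" && $2 == "md5:") { print $NF; exit }' "$1"
}

# Example call, commented out so the file can be sourced safely:
# dvc_md5 data/data.xml.dvc
```

For anything beyond a quick check, a real YAML parser is more robust than matching fields with `awk`.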
Pushing data to Scaleway Object Storage
- Run the following command to add your bucket as remote storage for your data:
    dvc remote add -d myremote s3://my-bucket/path
- Run the following command to set the Object Storage endpoint of your remote storage:
    dvc remote modify myremote endpointurl https://s3.fr-par.scw.cloud
  Note: Edit the endpointurl according to the geographical region of your bucket. It can be one of fr-par (Paris, France), nl-ams (Amsterdam, The Netherlands), or pl-waw (Warsaw, Poland).
- Run the following command to push your data to your bucket:
    dvc push
- Run the following command to list the content of your bucket:
    aws s3api list-objects-v2 --bucket my-bucket
  The content of the bucket is displayed:
    {
        "Contents": [
            {
                "Key": "22/a1a2931c8370d3aeedd7183606fd7f",
                "LastModified": "2023-05-30T14:49:26.000Z",
                "ETag": "\"22a1a2931c8370d3aeedd7183606fd7f\"",
                "Size": 14445097,
                "StorageClass": "STANDARD"
            }
        ]
    }
  Note: File names are generated automatically based on the content of the .dvc metadata files created upon tracking the data.
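Two details from the steps above can be sketched as small shell helpers: the endpoint URL follows the pattern of the fr-par example for each of the three regions, and the object key in the listing is the hash split after its first two characters. Both function names (`scw_s3_endpoint`, `dvc_object_key`) are hypothetical, and the key layout is taken from the listing shown in this tutorial, so other DVC versions may organize keys differently.

```shell
#!/bin/sh
# Hypothetical helper: print the Object Storage endpoint URL for a Scaleway region.
scw_s3_endpoint() {
    case "$1" in
        fr-par|nl-ams|pl-waw) printf 'https://s3.%s.scw.cloud\n' "$1" ;;
        *) printf 'unknown region: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the object key for a content hash,
# that is, its first two characters, a slash, then the remaining characters.
dvc_object_key() {
    digest=$1
    rest=${digest#??}            # digest without its first two characters
    printf '%s/%s\n' "${digest%"$rest"}" "$rest"
}

# Example, commented out so the file can be sourced safely. It checks that
# one object exists, using the bucket name from this tutorial:
# aws s3api head-object --bucket my-bucket \
#     --key "$(dvc_object_key 22a1a2931c8370d3aeedd7183606fd7f)" \
#     --endpoint-url "$(scw_s3_endpoint fr-par)"
```

If the remote was added with a path, as in `s3://my-bucket/path`, the key would also need that path as a prefix.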
Going further
- Refer to the official DVC documentation for more information on configuration and use cases.
- Refer to the official Git documentation for more information and tutorials on version control.