
Configuring DVC with Object Storage

Reviewed on 18 November 2024
Published on 05 June 2023
  • amazon-s3
  • dvc
  • machine-learning
  • data-science

Git is arguably the most popular version control system for source code, and with Git LFS it can also handle large files, up to 5 GB depending on your hosting provider.

However, when it comes to large datasets, you might need to turn to third-party version control tools that are specifically designed to handle them.

Data Version Control (DVC) was specifically designed with this use case in mind. It works alongside Git and allows you to store your data in the remote storage of your choice (such as a Scaleway Object Storage bucket) while storing only the metadata in a Git repository.

In this tutorial, you learn how to use Scaleway Object Storage as remote storage for DVC.

Before you start

To complete the actions presented below, you must have:

  • A Scaleway account logged into the console
  • Owner status or IAM permissions allowing you to perform actions in the intended Organization
  • A valid API key
  • An Object Storage bucket
  • A repository to store your metadata
  • Made your first request with the Scaleway API
  • Authenticated to the API for the first time
  • Installed the AWS CLI
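
Before you continue, you may want to confirm that the command-line tools used in this tutorial are available. The following is a minimal sketch, not part of the official procedure: `require_commands` is a hypothetical helper name, and it only checks that each command can be found on your `PATH`, not that it is configured for Scaleway.

```shell
# Check that each command given as an argument is available on PATH.
# Prints the name of every missing command to stderr and
# returns non-zero if at least one is missing.
# Example call for this tutorial: require_commands git pip3 aws
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling it with `git pip3 aws` covers the tools this page relies on before DVC itself is installed.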

Setting up DVC

  1. Run the following command to install the DVC Python package:

    pip3 install dvc
  2. Run the following command to install the Amazon S3 dependencies:

    pip3 install "dvc[s3]"
  3. Run the following command in the desired repository to initialize DVC:

    dvc init
  4. Run the following command to commit the initial DVC configuration files:

    git commit -m "Initialize DVC"
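
To check that the initialization step worked, you can look for the files it creates. The sketch below assumes that `dvc init` writes a `.dvc/config` file at the root of a Git repository. `is_dvc_repo` is a hypothetical helper name, and the check is only a heuristic.

```shell
# Return 0 if the given directory (default: current directory) looks like
# an initialized DVC project inside a Git repository.
# Assumption: `dvc init` creates .dvc/config next to the .git entry.
is_dvc_repo() {
    dir="${1:-.}"
    [ -e "$dir/.git" ] && [ -f "$dir/.dvc/config" ]
}
```

For example, `is_dvc_repo . && echo "DVC is initialized"` prints the message only when both entries are present.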

Retrieving and tracking data

  1. Run the following command to copy the example data file to your repository:

    dvc get https://github.com/iterative/dataset-registry \
    get-started/data.xml -o data/data.xml
  2. Run the following command to start tracking with DVC the file you want to store in your bucket:

    dvc add data/data.xml
  3. Run the following command to track the DVC metadata file and the .gitignore file with Git:

    git add data/data.xml.dvc data/.gitignore
    Note

    The .gitignore file contains paths to the data files so they are not tracked by Git.

  4. Run the following command to commit the metadata file:

    git commit -m "Add raw data"
  5. Run the following command to push your changes to your Git repository:

    git push
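
The `.dvc` metadata file committed in the steps above is a small text file that describes the tracked data. The following sketch shows one way to read the recorded hash from it. It assumes the file contains a YAML line of the form `md5: <32 hexadecimal characters>`. `dvc_recorded_md5` is a hypothetical helper name, and the exact file layout may vary between DVC versions.

```shell
# Print the MD5 hash recorded in a .dvc metadata file.
# Assumption: the file contains a line such as
#   - md5: 22a1a2931c8370d3aeedd7183606fd7f
# (illustrative layout; the leading "- " is optional here).
dvc_recorded_md5() {
    sed -n 's/^[[:space:]-]*md5:[[:space:]]*\([0-9a-f]\{32\}\).*$/\1/p' "$1" | head -n 1
}
```

Running `dvc_recorded_md5 data/data.xml.dvc` prints the hash if such a line exists, and nothing otherwise.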

Pushing data to Scaleway Object Storage

  1. Run the following command to add your bucket as the default remote storage for your data, replacing my-bucket/path with the name of your bucket and a path inside it:

    dvc remote add -d myremote s3://my-bucket/path
  2. Run the following command to set the Object Storage endpoint of your remote storage:

    dvc remote modify myremote \
    endpointurl https://s3.fr-par.scw.cloud
    Note

    Edit the endpointurl to match the region of your bucket. The region can be fr-par (Paris, France), nl-ams (Amsterdam, The Netherlands), or pl-waw (Warsaw, Poland).

  3. Run the following command to push your data to your bucket:

    dvc push
  4. Run the following command to list the content of your bucket:

    aws s3api list-objects-v2 --bucket my-bucket

    The content of the bucket is displayed:

    {
        "Contents": [
            {
                "Key": "22/a1a2931c8370d3aeedd7183606fd7f",
                "LastModified": "2023-05-30T14:49:26.000Z",
                "ETag": "\"22a1a2931c8370d3aeedd7183606fd7f\"",
                "Size": 14445097,
                "StorageClass": "STANDARD"
            }
        ]
    }
    Note

    Object names are generated automatically from the MD5 hash of each data file's content. The same hash is recorded in the .dvc metadata file created when you track the data.
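
Two details from the steps above can be sketched in shell: how the endpoint URL depends on the region, and how the object key in the listing relates to the file hash. Both helpers are hypothetical names, not part of DVC or the AWS CLI. The key layout (two-character prefix, then the rest of the hash) is taken from the listing above. The sketch leaves out any path configured on the remote and any version-specific prefix, so treat it as an approximation.

```shell
# Build the Object Storage endpoint URL for one of the regions
# named in this tutorial. Returns non-zero for any other value.
scw_endpoint() {
    case "$1" in
        fr-par|nl-ams|pl-waw) echo "https://s3.$1.scw.cloud" ;;
        *) echo "unknown region: $1" >&2; return 1 ;;
    esac
}

# Derive an object key from a 32-character MD5 hash:
# the first two characters become a prefix, the rest the object name.
# Assumption: this mirrors the key shown in the bucket listing above.
md5_to_key() {
    prefix="$(printf '%s' "$1" | cut -c1-2)"
    rest="$(printf '%s' "$1" | cut -c3-)"
    echo "$prefix/$rest"
}
```

For example, `dvc remote modify myremote endpointurl "$(scw_endpoint nl-ams)"` would point the remote at the Amsterdam region, and `md5_to_key` applied to the hash in the ETag field above gives the key shown in the same listing.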

Going further

  • Refer to the official DVC documentation for more information on configuration and use cases.
  • Refer to the official Git documentation for more information and tutorials on version control.
© 2023-2024 – Scaleway