NavigationContentFooter
Jump toSuggest an edit

Distributed Data Lab - Quickstart

Reviewed on 31 July 2024Published on 10 July 2024

Distributed Data Lab is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure.

It is composed of the following:

  • Cluster: An Apache Spark cluster powered by a Kubernetes architecture.

  • Notebook: A JupyterLab service operating on a dedicated node type.

Scaleway provides dedicated node types for both the notebook and the cluster. The cluster nodes are high-end machines built for intensive computations, featuring numerous CPUs and substantial RAM.

The notebook, although capable of performing some local computations, primarily serves as a web interface for interacting with the Apache Spark cluster.

Before you start

To complete the actions presented below, you must have:

  • A Scaleway account logged into the console
  • Owner status or IAM permissions allowing you to perform actions in the intended Organization
  • Signed up to the private beta and received a confirmation email.
  • Optionally, an Object Storage bucket

How to create a Distributed Data Lab cluster

  1. Click Data Lab under Managed Services on the side menu.

  2. Click Create Data Lab cluster. The creation wizard displays.

  3. Complete the following steps in the wizard:

    • Choose an Apache Spark version from the drop-down menu.
    • Select a worker node configuration.
    • Enter the desired number of worker nodes.
      Note

      Provisioning zero worker nodes lets you retain and access you cluster and notebook configurations, but will not allow you to run calculations.

    • Optionally, choose an Object Storage bucket as your source of data and the place to store the output of your operations.
    • Enter a name for your Data Lab.
    • Optionally, add a description and/or tags for your Data Lab.
    • Verify the estimated cost.
  4. Click Create Data Lab cluster to finish. You are directed to the Data Lab cluster overview page.

How to connect to your Data Lab

  1. Click Data Lab under Managed Services on the side menu. The Distributed Data Lab page displays.

  2. Click the name of the Data Lab cluster you want to connect to. The cluster Overview page displays.

  3. Click Open Notebook in the Notebook section. You are directed to the notebook login page.

  4. Enter your API secret key when prompted for a password, then click Log in. You are directed to the notebook home screen.

How to run the demo file

Each Distributed Data Lab comes with a default quickstart.ipynb demo file for testing purposes. This file contains a preconfigured notebook environment that requires no modification to run.

Execute the cells in order to perform pre-determined operations on a dummy data set.

How to set up a new Data Lab environment

  1. From the notebook Launcher tab, select PySpark under Notebook.

  2. In a new cell, copy and paste the code below and replace the placeholders with your API access key, secret key, and the endpoint of your Object Storage Bucket to set up the Apache Spark session:

    %%configure -f
    {
    "name": "My Spark",
    "conf":{
    "spark.hadoop.fs.s3a.access.key": "your-api-access-key",
    "spark.hadoop.fs.s3a.secret.key": "your-api-access-key",
    "spark.hadoop.fs.s3a.endpoint": "your-bucket-endpoint"
    }
    }
    Note

    The Object Storage bucket endpoint is required only if you did not specify a bucket when creating the cluster.

  3. In a new cell below, copy and paste the following command to initialize the Apache Spark session:

    from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType
  4. Execute the two cells you just created.

    Note

    The initialization of your Apache Spark session can take a few minutes.

    Once initialized, the information of the Spark session displays.

You can now execute commands that will run on the resources defined when creating the Distributed Data Lab.

How to delete a Distributed Data Lab

Important

This action is irreversible and will permanently delete this Data Lab cluster and all its associated data.

  1. From the Overview tab of your Data Lab cluster, click the Settings tab, then select Delete cluster.

  2. Enter DELETE in the confirmation pop-up to confirm your action.

  3. Click Delete Data Lab cluster.

API DocsScaleway consoleDedibox consoleScaleway LearningScaleway.comPricingBlogCareers
© 2023-2024 – Scaleway