Distributed Data Lab
General
What workloads is Distributed Data Lab suited for?
Distributed Data Lab supports a range of workloads, including:
- Complex analytics.
- Machine learning tasks.
- High-speed operations on large datasets.
It offers scalable CPU and GPU instances with flexible node limits and robust support for the Apache Spark libraries.
What is Apache Spark?
Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
How does Apache Spark work?
Apache Spark processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like Hadoop MapReduce. It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data.
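The partition-and-parallelize model described above can be sketched with plain Python for intuition. This is not Spark code, just an analogy: data is split into partitions (as in an RDD), each partition is processed in parallel (as a Spark task on a worker node), and the partial results are reduced into a final answer.

```python
# Illustrative sketch of Spark's RDD-style map/reduce parallelism using only
# the Python standard library. NOT actual Spark code.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(data, n):
    """Split data into n roughly equal chunks (analogous to RDD partitions)."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partition(chunk):
    """Work applied independently to each partition (analogous to a Spark task)."""
    return sum(x * x for x in chunk)

data = list(range(1, 101))
partitions = partition(data, 4)

# Each "worker" processes its partition in parallel; results are then merged.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_partition, partitions))

total = reduce(lambda a, b: a + b, partials)
print(total)  # sum of squares of 1..100 -> 338350
```

In real Spark, the equivalent would be something like `sc.parallelize(data, 4).map(lambda x: x * x).reduce(lambda a, b: a + b)`, with partitions distributed across cluster nodes and held in memory rather than in a single process.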
How am I billed for Distributed Data Lab?
Distributed Data Lab is billed based on two factors:
- the main node configuration selected
- the worker node configuration selected, and the number of worker nodes in the cluster
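As a hypothetical illustration of the formula implied by the two factors above, the hourly cost is the main node rate plus the worker node rate multiplied by the worker count. The rates below are made-up placeholders, not actual Scaleway prices.

```python
# Hypothetical billing sketch: total = main node + (worker rate x worker count).
# These rates are placeholders for illustration only, not real Scaleway pricing.
MAIN_NODE_RATE = 0.50    # currency units per hour, hypothetical
WORKER_NODE_RATE = 0.25  # currency units per hour per worker, hypothetical

def hourly_cost(worker_count: int) -> float:
    """Cost per hour for one main node plus `worker_count` worker nodes."""
    return MAIN_NODE_RATE + WORKER_NODE_RATE * worker_count

print(hourly_cost(4))  # 0.50 + 0.25 * 4 = 1.50
```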
Clusters
Can I upscale or downscale a Distributed Data Lab?
Yes, you can upscale a Data Lab cluster to distribute your workloads across more worker nodes for faster processing. You can also scale it down to zero to reduce costs, while retaining your configuration and context.
You can still access the notebook of a Data Lab cluster with zero worker nodes, but you cannot perform any calculations. You can resume the activity of your cluster by provisioning at least one worker node.
Can I run a Distributed Data Lab using GPUs?
Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages Nvidia’s RAPIDS Accelerator For Apache Spark, an open-source suite of software libraries and APIs to execute end-to-end data science and analytics pipelines entirely on GPUs. This technology allows for significant acceleration of data processing tasks compared to CPU-based processing.
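For reference, the RAPIDS Accelerator is typically enabled in Spark through configuration properties like the following (the property names come from the upstream RAPIDS Accelerator documentation; on a managed Data Lab GPU cluster these are likely preconfigured for you):

```
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled=true
```

With the plugin active, supported SQL and DataFrame operations are transparently executed on the GPU, with automatic fallback to the CPU for unsupported operations.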
Storage
What data source options are available?
Data Lab natively integrates with Scaleway Object Storage for reading and writing data, making it easy to process data directly from your buckets. Your buckets are accessible from the Scaleway console or any Amazon S3-compatible CLI tool.
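As a sketch, Spark reaches S3-compatible storage through the standard Hadoop `s3a` connector, configured with properties like the following. The endpoint, credential placeholders, and bucket name below are illustrative assumptions; a managed Data Lab cluster may preconfigure some or all of this for you.

```
# Placeholder values -- substitute your own region, credentials, and bucket.
spark.hadoop.fs.s3a.endpoint=https://s3.fr-par.scw.cloud
spark.hadoop.fs.s3a.access.key=<SCW_ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key=<SCW_SECRET_KEY>
```

Once configured, data in a bucket is addressed with `s3a://` URIs, e.g. `spark.read.parquet("s3a://my-bucket/path/")`.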
Can I connect to S3 buckets from other cloud providers?
Currently, connections are limited to Scaleway’s Object Storage environment.
Notebook
What notebook is included with Distributed Data Lab?
The service provides a JupyterLab notebook running on a dedicated CPU instance, fully integrated with the Apache Spark cluster for seamless data processing and calculations.
Can I connect my local JupyterLab to the Data Lab?
Remote connections to a Data Lab cluster are currently not supported.