
Clusters for Apache Spark™ FAQ

Overview

What is Apache Spark™?

Apache Spark™ is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark™ offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.

How does Apache Spark™ work?

Apache Spark™ processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like Hadoop MapReduce. It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data.

What workloads is Clusters for Apache Spark™ suited for?

Clusters for Apache Spark™ supports a range of workloads, including:

  • Big data processing (large-scale data transformation, cleaning, and analysis)
  • Machine learning (training models, predictive analytics, recommendation systems)
  • Real-time analytics (streaming data, live dashboards, fraud detection)
  • Data integration (ETL pipelines, combining data from multiple sources)
  • Interactive querying (SQL-based exploration of large datasets)

It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark™ library support.

I'm looking for Data Lab for Apache Spark™. Has it been discontinued?

Data Lab for Apache Spark™ has been renamed to Clusters for Apache Spark™. It is the same product, just with a new name. All features and functionality remain unchanged.

Offering and availability

What notebook is available with an Apache Spark™ cluster?

The service provides an optional JupyterLab notebook running on a shared CPU Instance. When enabled, the notebook is fully integrated with the Apache Spark™ cluster for seamless data processing and computation.

Pricing and billing

How am I billed for Clusters for Apache Spark™?

Clusters for Apache Spark™ is billed based on the following factors:

  • The main node configuration selected.
  • The worker node configuration selected, and the number of worker nodes in the cluster.
  • The persistent volume size provisioned.
  • The presence of a notebook.

Compatibility and integration

Can I run an Apache Spark™ cluster using GPUs?

Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages NVIDIA's RAPIDS Accelerator for Apache Spark™, an open-source plugin that executes Spark SQL and DataFrame operations directly on GPUs. This can significantly accelerate data processing tasks compared to CPU-based processing.
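For orientation, enabling the RAPIDS Accelerator in a self-managed Spark deployment typically involves configuration along these lines (the resource amount is a placeholder, and a managed cluster may set this up for you):

```
# spark-defaults.conf — illustrative RAPIDS Accelerator settings
spark.plugins                        com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled             true
spark.executor.resource.gpu.amount   1
```

With the plugin loaded, supported SQL and DataFrame operations are transparently executed on the GPU, while unsupported operations fall back to the CPU.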

Can I connect a separate notebook environment to the Apache Spark™ cluster?

Yes, you can connect a different notebook via Private Networks.

Refer to the dedicated documentation for comprehensive information on how to connect to an Apache Spark™ cluster over Private Networks.
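As a rough sketch only, connecting an external Spark session over a private network generally comes down to pointing the session at the cluster's private endpoint. The address and port below are placeholders; the dedicated documentation gives the actual connection details for your cluster:

```
# spark-defaults.conf in the external notebook environment (values are placeholders)
spark.master    spark://10.0.0.12:7077
```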

Usage and management

Can I upscale or downscale an Apache Spark™ cluster?

Yes, you can upscale an Apache Spark™ cluster to distribute your workloads across more worker nodes for faster processing. You can also scale it down to zero to reduce costs, while retaining your configuration and context.

You can still access the notebook of an Apache Spark™ cluster scaled down to zero worker nodes, but you cannot run any computations. To resume activity, provision at least one worker node.
