Distributed Data Lab - Concepts

Reviewed on 31 July 2024

Data Lab

A Data Lab is a project setup that combines a Notebook and an Apache Spark Cluster for data analysis and experimentation. it comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.

Distributed Data Lab

A Distributed Data Lab is a data lab that is distributed across multiple worker nodes to accelerate the processing of large datasets to save time and gain access to actionable insights faster.

Fixture

A fixture is a set of data forming a request used for testing purposes.

JupyterLab

JupyterLab is a web-based platform for interactive computing, letting you work with notebooks, code, and data all in one place. It builds on the classic Jupyter Notebook by offering a more flexible and integrated user interface, making it easier to handle various file formats and interactive components.

Lighter

Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark cluster. For more details, check out the Lighter repository.

Notebook

A notebook for an Apache Spark cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.

Apache Spark Cluster

An Apache Spark cluster is an orchestrated set of machines over which the distributed/Big data calculus is going to be processed. In the case of this project, the Apache Spark cluster is a Kubernetes cluster, upon which Apache Spark has been installed in every pod deployed. For more details, check out the Apache Spark documentation.

SparkMagic

SparkMagic is a set of tools that allows you to interact with Apache Spark clusters through Jupyter notebooks. It provides magic commands for running Spark jobs, querying data, and managing Spark sessions directly within the notebook interface, facilitating seamless integration and execution of Spark tasks. For more details, check out the SparkMagic repository.

Transaction

An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error.

Was this page helpful?