Fundamentals of Data Management with Databricks
Databricks as a comprehensive solution
What is Databricks and what is it for?
Storage and processing infrastructure in Databricks
Spark as a Big Data processing engine
Quiz: Fundamentals of Data Management with Databricks
Administration and Management of the Databricks Platform
Setting up a processing cluster
Setting up a storage cluster
What are transformations and actions in Spark?
What are RDDs in Apache Spark?
Apache Spark: transformations
Apache Spark: actions
Reading data with Spark
What is the Spark UI?
How do you install a library in Databricks?
Spark locally vs. in the cloud
Quiz: Administration and Management of the Databricks Platform
Apache Spark SQL and UDFs
What are DataFrames in Apache Spark?
Lab - PySpark SQL - Part 1
Lab - PySpark SQL - Part 2
UDFs in Apache Spark
Quiz: Apache Spark SQL and UDFs
Implementing a Delta Lake in Databricks
Data Lake vs. Delta Lake architecture
Features and benefits of Delta Lake
Medallion architecture
Essential DBFS commands
Implementing a Delta Lake on Databricks - Part 1
Implementing a Delta Lake on Databricks - Part 2
A versatile platform
The foundation of Apache Spark is the RDD, or Resilient Distributed Dataset: an immutable, distributed collection of objects that enables parallel processing across a cluster of machines. This brings efficiency and resilience to workloads that handle large volumes of data. Three properties define an RDD (a short sketch follows the list):
Immutability: once an RDD is created, it cannot be modified. Any transformation produces a new RDD, which protects data integrity because the original data cannot be accidentally altered during an analysis.
Distribution: an RDD is partitioned across the nodes of the cluster, so operations on it run in parallel. This yields high processing speed and efficiency when working with large amounts of data.
Resilience: an RDD keeps track of the lineage of transformations used to build it, so lost partitions can be recomputed after a node failure. This protects the workflow from system outages or errors.
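As a minimal illustration of the first two properties, the PySpark sketch below shows that a transformation returns a new RDD instead of modifying the original, and that the data is split into partitions that can be processed in parallel. The application name and partition count are illustrative; in a Databricks notebook the spark session and sc context already exist, so the first lines are only needed when running locally.

```python
from pyspark.sql import SparkSession

# Sketch only: in Databricks `spark` and `sc` are predefined; locally we build them.
spark = SparkSession.builder.appName("rdd-properties").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10), numSlices=4)  # distributed into 4 partitions
doubled = numbers.map(lambda x: x * 2)            # returns a NEW RDD; `numbers` is untouched

print(numbers.getNumPartitions())  # 4 -> the data lives in parallel partitions
print(numbers.collect())           # [0, 1, ..., 9]   original values unchanged
print(doubled.collect())           # [0, 2, ..., 18]  the transformation built a new RDD
```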
There are two main methods for creating an RDD in Apache Spark:
From scratch: you can create an empty RDD, or build one from a local collection with a parallelization function such as parallelize(). This method is more manual and gives you direct control (see the sketch after this list).
From an existing external file or dataset: you can read an external file or dataset directly as an RDD. This route is more efficient when working with large volumes of pre-existing data.
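A minimal sketch of both creation paths in PySpark. The file path is only illustrative and assumes a text file readable from the cluster; swap in your own location.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation").getOrCreate()
sc = spark.sparkContext

# 1) From scratch: an empty RDD, or one parallelized from a local list
empty_rdd = sc.emptyRDD()
words_rdd = sc.parallelize(["spark", "databricks", "delta"], numSlices=2)

# 2) From an existing external file (path shown is illustrative):
#    each line of the text file becomes one element of the RDD
lines_rdd = sc.textFile("/databricks-datasets/README.md")

print(empty_rdd.isEmpty())   # True
print(words_rdd.count())     # 3
print(lines_rdd.take(2))     # first two lines of the file
```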
RDDs in Spark accept two types of operations:
Transformations: these operations create a new RDD from the original one. Transformations are lazy, meaning they are not executed until an action needs the resulting data.
Actions: unlike transformations, actions return a value to the Spark program, triggering the evaluation of all the transformations needed to produce that result (see the example below).
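The following sketch shows the difference: the filter() and map() calls only build an execution plan, and nothing runs until an action such as collect(), count(), or reduce() is invoked. The session setup and sample values are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 11))

# Transformations: lazy, they only describe the computation
evens = rdd.filter(lambda x: x % 2 == 0)   # nothing executes yet
squares = evens.map(lambda x: x * x)       # still nothing executes

# Actions: trigger evaluation of the whole chain of transformations
print(squares.collect())                    # [4, 16, 36, 64, 100]
print(squares.count())                      # 5
print(squares.reduce(lambda a, b: a + b))   # 220
```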
Exploit parallelism: take advantage of the ability of RDDs to distribute tasks among cluster nodes, and consider them for processes that must handle large amounts of data simultaneously (see the sketch after this list).
Pay attention to transformations: remember that they are lazy. If you expect immediate results, make sure an action is called to trigger the pending transformations.
Incorporate external data: when possible, load your existing files or datasets directly as RDDs to maximize processing efficiency.
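One way to act on the first tip is to control the number of partitions explicitly. A minimal sketch follows; the file path and partition counts are illustrative, not prescriptive, and the right values depend on your cluster and data size.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-parallelism").getOrCreate()
sc = spark.sparkContext

# Choose the degree of parallelism explicitly when the default does not fit the workload
data = sc.parallelize(range(1_000_000), numSlices=8)
print(data.getNumPartitions())    # 8 partitions processed in parallel

# External data can be repartitioned after loading (path shown is illustrative)
lines = sc.textFile("/databricks-datasets/README.md").repartition(4)
print(lines.getNumPartitions())   # 4
```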
I recommend you review additional materials to dive deeper into the concepts related to RDDs in Spark. Using these resources will help you better understand and take full advantage of this powerful technology in your projects.