Fundamentals of Data Management with Databricks
Databricks as a comprehensive solution
What is Databricks and what is it for?
Storage and processing infrastructure in Databricks
Spark as a Big Data processing engine
Quiz: Fundamentals of Data Management with Databricks
Administration and Management of the Databricks Platform
Setting up a processing cluster
Setting up a storage cluster
What are transformations and actions in Spark?
What are RDDs in Apache Spark?
Apache Spark: transformations
Apache Spark: actions
Reading data with Spark
What is the Spark UI?
How to install a library in Databricks?
Spark locally vs. in the cloud
Quiz: Administration and Management of the Databricks Platform
Apache Spark SQL and UDFs
What are DataFrames in Apache Spark?
Lab - PySpark SQL - Part 1
Lab - PySpark SQL - Part 2
UDFs in Apache Spark
Quiz: Apache Spark SQL and UDFs
Implementing a Delta Lake in Databricks
Data Lake vs. Delta Lake architecture
Features and benefits of Delta Lake
Medallion architecture
Essential DBFS commands
Implementing a Delta Lake on Databricks - Part 1
Implementing a Delta Lake on Databricks - Part 2
Versatile platform
We will explore how to generate RDDs (Resilient Distributed Datasets) using the Databricks platform to perform distributed data analysis efficiently. RDDs are the fundamental unit within Apache Spark, providing fault tolerance and efficiency in the execution of parallel operations.
To start, make sure your cluster is active in Databricks by selecting the Platzi cluster or one you already have configured. Then, head to the Workspace section in Databricks, where you can import a notebook from your file system or a URL, such as from a GitHub repository.
Choose the File option and use drag-and-drop to load compatible files such as .dbc, .scala, .py, .sql, .r, .ipynb, etc. The example notebook, "What are Apache Spark RDDs Part 2?", will be available in the course resources. Once imported, attach your cluster to start running the code.
Before proceeding to create RDDs, let's understand two essential concepts in Spark:
Spark Session: the entry point for working with the DataFrame and SQL APIs.
Spark Context: the entry point for working with RDDs and the cluster's resources.
In Databricks, these components are already configured and exposed as the spark and sc variables, eliminating the need to initialize them manually and optimizing the environment for the user.
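For context, this is roughly what initializing those objects looks like in a plain local PySpark environment; in a Databricks notebook you can skip it because spark and sc already exist. A minimal sketch (the application name is just an example):

from pyspark.sql import SparkSession

# Outside Databricks you would create the session yourself
spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext  # the SparkContext, used to create RDDs

# In a Databricks notebook, `spark` and `sc` are already defined,
# so you can call sc.parallelize(...) or sc.emptyRDD() directly.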
Creating an RDD can be done in different ways depending on your needs. In the following, we will explore how to create an empty RDD and a partitioned RDD:
# Create an empty RDD
rdd_empty = sc.emptyRDD()

# Create an RDD with three partitions
rdd_with_partitions = sc.parallelize([1, 2, 3, 4, 5], numSlices=3)

# Collect and view partitions
print(rdd_with_partitions.collect())           # Output: [1, 2, 3, 4, 5]
print(rdd_with_partitions.getNumPartitions())  # Output: 3
In the example, parallelize is the function that distributes the data across the partitions, allowing them to be processed in parallel.
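If you want to check how the five elements were distributed across the three partitions, one option is the glom() transformation, which groups the contents of each partition into a list. A minimal sketch (the exact distribution may vary):

# Inspect what each partition holds
print(rdd_with_partitions.glom().collect())  # e.g. [[1], [2, 3], [4, 5]]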
Transformations in Spark are operations applied to RDDs to generate new RDDs. They are lazy, meaning they are not executed immediately; they only run when an action triggers them.
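To see this laziness in practice, note that defining a transformation (such as the map covered next) returns a new RDD object right away, while the data is only computed when an action such as collect() runs. A small sketch:

# Defining the transformation only builds the execution plan; nothing runs yet
rdd_lazy = rdd_with_partitions.map(lambda x: x * 10)
print(rdd_lazy)            # prints an RDD object (e.g. PythonRDD[...]), not data

# The action collect() is what actually triggers the computation
print(rdd_lazy.collect())  # Output: [10, 20, 30, 40, 50]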
Map and Filter:
Map: applies a function to each RDD element.
rdd_mapped = rdd_with_partitions.map(lambda x: x * 2)
print(rdd_mapped.collect())  # Output: [2, 4, 6, 8, 10]
Filter: keeps the elements that meet a specific condition.
rdd_filtering = rdd_with_partitions.filter(lambda x: x % 2 == 0)
print(rdd_filtering.collect())  # Output: [2, 4]
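Since each transformation returns a new RDD, map and filter can also be chained in a single expression; a minimal sketch:

# Double every element, then keep only the results greater than 4
rdd_chained = rdd_with_partitions.map(lambda x: x * 2).filter(lambda x: x > 4)
print(rdd_chained.collect())  # Output: [6, 8, 10]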
Distinct and ReduceByKey:
Distinct: Returns the unique elements of the RDD.
rdd_unique = rdd_with_partitions.distinct()
print(rdd_unique.collect())  # Output: [1, 2, 3, 4, 5]
ReduceByKey: Applies a reduce function to combine the values of each key in an RDD of key-value pairs.
rdd_pairs = sc.parallelize([('A', 3), ('B', 2), ('A', 2), ('B', 4)])
rdd_reduced = rdd_pairs.reduceByKey(lambda a, b: a + b)
print(rdd_reduced.collect())  # Output: [('A', 5), ('B', 6)]
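A classic use of reduceByKey is counting occurrences: map each element to a (key, 1) pair and then sum the values per key. A small sketch with a hypothetical word list (the order of the collected pairs may vary):

# Hypothetical word list, used only for illustration
words = sc.parallelize(["spark", "rdd", "spark", "databricks", "rdd", "spark"])
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(word_counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('databricks', 1)]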
SortByKey: Sorts the elements of a key-value RDD by key.
rdd_ordered = rdd_reduced.sortByKey(ascending=False)
print(rdd_ordered.collect())  # Output: [('B', 6), ('A', 5)]
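For reference, with ascending=True (the default) the same call sorts the keys in ascending order; a quick sketch:

print(rdd_reduced.sortByKey().collect())  # Output: [('A', 5), ('B', 6)]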
Transformations such as map, filter, and distinct, among others, are fundamental for modifying and structuring the data in an RDD. It is important to remember that all these operations generate new RDDs, since the original RDDs are immutable.
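To illustrate that immutability, note that applying a transformation leaves the source RDD intact; a minimal sketch:

# Transformations return new RDDs and never modify the original one
rdd_base = sc.parallelize([1, 2, 3, 4, 5])
rdd_doubled = rdd_base.map(lambda x: x * 2)
print(rdd_base.collect())     # Output: [1, 2, 3, 4, 5]  (unchanged)
print(rdd_doubled.collect())  # Output: [2, 4, 6, 8, 10]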
In summary, learning to use Databricks and Apache Spark effectively lets us take full advantage of their capabilities for analyzing large, complex datasets in a distributed, fault-tolerant way. Keep experimenting and expanding your skills with practice!