Fundamentals of Data Management with Databricks
Databricks as an all-in-one solution
What is Databricks and what is it for?
Storage and processing infrastructure in Databricks
Spark as a Big Data processing engine
Quiz: Fundamentals of Data Management with Databricks
Administration and Management of the Databricks Platform
Setting up a processing cluster
Setting up a storage cluster
What are transformations and actions in Spark?
What are RDDs in Apache Spark?
Apache Spark: transformations
Apache Spark: actions
Reading data with Spark
What is the Spark UI?
How to install a library in Databricks?
Spark locally vs. in the cloud
Quiz: Administration and Management of the Databricks Platform
Apache Spark SQL and UDFs
What are DataFrames in Apache Spark?
Lab - PySpark SQL - Part 1
Lab - PySpark SQL - Part 2
UDFs in Apache Spark
Quiz: Apache Spark SQL and UDFs
Implementing a Delta Lake in Databricks
Data Lake vs. Delta Lake architecture
Features and benefits of Delta Lake
Medallion architecture
Essential DBFS commands
Implementing a Delta Lake on Databricks - Part 1
Implementing a Delta Lake on Databricks - Part 2
A versatile platform
When working with Apache Spark, it is essential to understand the actions we can perform on RDDs (Resilient Distributed Datasets). Actions are operations that execute immediately, triggering the evaluation of all the lazy transformations that precede them. Importantly, unlike transformations, actions do not produce a new RDD: they interact with the existing one to return a result to the driver, such as the dataset's contents or a count.
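To see this laziness concretely, here is a minimal sketch (the variable names are illustrative): the map transformation on its own is only recorded, and the computation runs only when the count action is called.

val nums = sc.parallelize(1 to 10)   // distribute a small dataset across the cluster
val doubled = nums.map(_ * 2)        // transformation: recorded in the lineage, nothing runs yet
val total = doubled.count()          // action: triggers the actual execution, returns 10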
count: Counts the number of elements in a given RDD. It is particularly useful to quickly gauge the volume of data you are handling.
first: Returns the first element of an RDD. It is an effective way to verify which element sits at the beginning of the dataset.
take(n): Returns a user-defined number of elements from the beginning of the RDD. For example, take(3) would return the first three elements.
Example code block in Scala:

val num = sc.parallelize(List(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10))
val count = num.count()          // 11 elements in total
val firstElement = num.first()   // 1
val firstThree = num.take(3)     // Array(1, 2, 3)
collect is another popular action in Spark. It gathers all the elements of an RDD and returns them as a collection to the driver program. Although powerful, it should be used with caution on large datasets, as it can overload the driver's memory.
Example of use in code:

val allElements = num.collect()  // all 11 elements, materialized on the driver
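When the dataset might be too large to collect safely, a common alternative, sketched below, is to bring back only a bounded number of elements, or to iterate over the RDD without materializing it all at once:

val preview = num.take(5)            // bounded: at most five elements reach the driver
val localIter = num.toLocalIterator  // pulls partitions one at a time instead of everything at once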
In addition to the actions mentioned above, Spark offers a considerable variety of options for manipulating RDDs:
countByKey: Similar to count, but works on key-value RDDs and returns the number of elements for each key.
takeSample: Returns a random sample of the RDD's elements.
saveAsTextFile: Writes the RDD's contents as text files under a given directory path.

It is advisable to explore and experiment with these actions, as in the sketch below, to fully understand their uses and advantages in various scenarios.
The best way to learn is by applying, so put these actions into practice on RDDs of your own. Hands-on exercises not only reinforce your understanding of actions but also build valuable experience handling RDDs. Go ahead, experiment, and enjoy the learning process in Databricks or in your preferred environment!