Apache Spark: actions

What are actions in Apache Spark?

When working with Apache Spark, it is essential to understand the actions we can perform on RDDs (Resilient Distributed Datasets). Transformations are lazy: they only describe how to derive a new RDD. An action triggers execution: Spark runs every transformation the result depends on and then returns a value to the driver or writes it to storage. Neither kind of operation modifies an existing RDD, because RDDs are immutable. A transformation returns a new RDD, while an action returns a non-RDD result, such as a count or a list of elements.
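The difference between recording a transformation and running an action can be sketched in plain Python. The `MiniRDD` class below is a hypothetical toy model written for this illustration, not part of Spark; it only mimics the idea that `map` and `filter` are recorded and that nothing runs until an action such as `collect` or `count` is called.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # pending transformations, not yet run

    # Transformations: record the step and return a NEW MiniRDD.
    def map(self, fn):
        return MiniRDD(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Apply every recorded step, in order, to the source data.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: run the recorded steps and return a plain Python value.
    def collect(self):
        return self._run()

    def count(self):
        return len(self._run())


calls = []  # records every time the mapping function is really invoked


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD([1, 2, 3, 4, 5])
squares = numbers.map(square)            # nothing is computed here
calls_before_action = len(calls)
evens = squares.filter(lambda x: x % 2 == 0)
result = evens.collect()                 # the action triggers map + filter
original = numbers.collect()             # the source data is unchanged
```

In this model, `square` is never called until `collect()` runs, and `numbers` still holds its original elements afterwards, which mirrors the behavior described above.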

What are some of the most common actions?

  • count: This action counts the number of items in a given RDD. It is particularly useful to quickly get the volume of data we are handling.

  • first: Returns the first element of an RDD. It is an effective way to verify which element is positioned at the beginning of the dataset.

  • take: Returns a specific number of elements (user-defined) from the beginning of the RDD. For example, take(3) would return the first three elements.

Example code block in Scala:

val num = sc.parallelize(List(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10))
val count = num.count()
val firstElement = num.first()
val firstThree = num.take(3)

How is the collect action used?

collect is another popular action in Spark. This function collects all the elements of an RDD and returns them as a collection in the driver program. Although powerful, it should be used with caution on large data sets, as it can overload the driver's memory.

Example of use in code:

val allElements = num.collect()
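To see why collect deserves more caution than take, here is a rough plain-Python model. The helpers `make_source`, `collect`, and `take` are hypothetical names for this sketch, not Spark functions, and real Spark may compute more than n elements for take because it works partition by partition. The model only shows that collect materializes everything while take stops early.

```python
from itertools import islice


def make_source(n, log):
    """Yield 0..n-1 and log every element that is actually produced."""
    for i in range(n):
        log.append(i)
        yield i


def collect(source):
    # Model of collect: pull every element into one in-memory list.
    return list(source)


def take(source, k):
    # Model of take: stop pulling elements once k have been read.
    return list(islice(source, k))


collect_log, take_log = [], []
everything = collect(make_source(1000, collect_log))
first_three = take(make_source(1000, take_log), 3)

print(len(collect_log), len(take_log))
```

Here the collect model produces all 1000 elements, while the take model produces only the three it returns.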

What other actions does Apache Spark offer?

In addition to the actions mentioned above, Spark offers a considerable variety of options for manipulating RDDs:

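  • countByKey: Similar to count, but for RDDs of (key, value) pairs. It returns to the driver the number of elements found for each key.
  • takeSample: Returns a random sample of a given size from the RDD's elements, with or without replacement.
  • saveAsTextFile: Writes the RDD's contents as text to the given path. The path is a directory that receives one file per partition.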

It is advisable to explore and experiment with these actions to fully understand their utilities and advantages in various scenarios.
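As a rough guide to what two of these actions return, the following plain-Python sketch models their results on a small list. The functions `count_by_key` and `take_sample` are hypothetical stand-ins written for this example; they imitate the shape of the results only and do not use Spark's sampling algorithm, so the sampled values will not match what Spark returns for the same seed.

```python
import random
from collections import Counter


def count_by_key(pairs):
    """Model of countByKey: number of elements seen for each key."""
    return dict(Counter(key for key, _ in pairs))


def take_sample(items, with_replacement, num, seed=None):
    """Model of takeSample: a random sample of `num` elements."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    if with_replacement:
        return [rng.choice(items) for _ in range(num)]
    return rng.sample(items, min(num, len(items)))


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]
key_counts = count_by_key(pairs)

numbers = list(range(1, 11))
sample = take_sample(numbers, False, 3, seed=42)
same_sample = take_sample(numbers, False, 3, seed=42)

print(key_counts)
print(sample)
```

The count is keyed by the first element of each pair, and sampling without replacement never repeats an element. Passing the same seed twice gives the same sample in this model.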

How to put the concepts of actions and transformations into practice?

The best way to learn is by applying. Here are some practical exercises so you can experiment with what you have learned:

  1. RDD creation: Generate an RDD from a list containing the numbers 1 to 10.
  2. Filtering even elements: Filter the even elements of an RDD.
  3. Maximum and minimum values: Find the highest and lowest values in the RDD.
  4. Average calculation: Calculate the average of the numbers in the RDD.
  5. Union of RDDs: Perform the union of two RDDs and observe the result.

These exercises not only reinforce the understanding of actions, but also provide valuable practice in handling RDDs. Go ahead, experiment and enjoy the learning process in Databricks or in your preferred environment!
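If you want reference values to compare your Spark output against, the expected results for the numbers 1 to 10 can be computed in plain Python. This is only a checking aid under the assumption that the second RDD in exercise 5 is the RDD of even numbers; it is not a Spark solution.

```python
from statistics import mean

numbers = list(range(1, 11))                 # exercise 1: the numbers 1 to 10
evens = [n for n in numbers if n % 2 == 0]   # exercise 2: even elements
highest, lowest = max(numbers), min(numbers) # exercise 3: max and min
average = mean(numbers)                      # exercise 4: average
combined = numbers + evens                   # exercise 5: union keeps duplicates

print(evens, highest, lowest, average, len(combined))
```

Note that union in Spark does not remove duplicates, so the combined result here has 15 elements rather than 10.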

Contributions 21

My answer to the homework: ![](https://static.platzi.com/media/user_upload/Captura%20de%20pantalla%202024-03-29%20215251-32a0385f-1b56-419f-8868-d752f7104402.jpg)![](https://static.platzi.com/media/user_upload/Captura%20de%20pantalla%202024-03-29%20215306-963de59b-4edb-4a77-88ae-79624f779193.jpg)
My answer to challenge 1: ![](https://static.platzi.com/media/user_upload/image-afc8e81b-f508-4114-ad8d-fcc3d913d9bb.jpg) ![](https://static.platzi.com/media/user_upload/image-31acbcc7-81df-4464-a116-091188fe05f9.jpg) ![](https://static.platzi.com/media/user_upload/image-0f563f9f-2147-4c39-baa6-fc8395031ab2.jpg) ![](https://static.platzi.com/media/user_upload/image-bf1ec177-2655-42e7-a196-80ba109b3c89.jpg)
My solution: ![](https://static.platzi.com/media/user_upload/Screenshot%20%283%29-311e44df-6ba2-4071-a055-381e1420ff27.jpg)
![](https://static.platzi.com/media/user_upload/image-380ec1f6-8215-4eb5-a39e-427a1af79357.jpg)![](https://static.platzi.com/media/user_upload/image-cb43878d-1a76-4f46-b150-0296e9cedb1e.jpg)![](https://static.platzi.com/media/user_upload/image-b12f8088-0eae-4641-91ef-2932f70e7f91.jpg)![](https://static.platzi.com/media/user_upload/image-b78ebc95-b234-446a-9f87-e9e626a6f0ff.jpg)
My solution: ![](https://static.platzi.com/media/user_upload/image-a78755e1-3f54-4126-bebc-e61095e36e3a.jpg) ![](https://static.platzi.com/media/user_upload/image-ed5887c5-5309-4d48-949f-6df41d6cbff6.jpg) ![](https://static.platzi.com/media/user_upload/image-08708f32-aa7d-4f5b-91f3-5003379c1eca.jpg)
![](https://static.platzi.com/media/user_upload/image-a44de774-52ee-4c76-82b2-2b4832ae9803.jpg)![](https://static.platzi.com/media/user_upload/image-b09f25f5-bac3-4843-bcd4-028cf2c097fa.jpg)
My solution, everyone: ![](https://static.platzi.com/media/user_upload/image-72a4d651-3265-4203-9b26-b1edaf27606e.jpg) ![](https://static.platzi.com/media/user_upload/image-c76fc614-dbb2-42a0-ae9d-8ca012cd2273.jpg) ![](https://static.platzi.com/media/user_upload/image-a7880d26-78eb-48df-a564-8f21c0a1ae47.jpg)
My homework: ![](https://static.platzi.com/media/user_upload/image-7b756fb6-5e9a-4194-b86d-1765c515c877.jpg)![](https://static.platzi.com/media/user_upload/image-86bc22e8-915d-4f1d-aa5b-cf775c2fcbba.jpg) ![](https://static.platzi.com/media/user_upload/image-93ea192e-a326-4b5e-b60c-38963d99e3df.jpg)
Here is how I solved it: ![](https://static.platzi.com/media/user_upload/image-340f94b7-43de-4c3f-9eed-e59e1ec40bcd.jpg) ![](https://static.platzi.com/media/user_upload/image-a712691b-7f43-451b-a27e-1ab2ae402241.jpg)

    # Exercise
    lista_nros = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    rdd_homework = sc.parallelize(lista_nros)
    rdd_homework.collect()
    # print(f'The RDD is made up of: {rdd_homework}')

    # Filter to keep the even numbers
    rdd_par = rdd_homework.filter(lambda x: x % 2 == 0)
    rdd_par.collect()

    # Maximum and minimum
    rdd_max = rdd_homework.max()
    rdd_min = rdd_homework.min()
    print(f'The maximum is {rdd_max} and the minimum is {rdd_min}')

    # (max + min) / 2 matches the mean here only because the values are evenly spaced
    rdd_sumavg = rdd_max + rdd_min
    rdd_avg = rdd_sumavg / 2
    print(f'The average is: {rdd_avg}')

    # Average of the list of numbers computed with mean
    rdd_avg = rdd_homework.mean()
    print(f'The average is: {rdd_avg}')

    # Union of two RDDs
    lista_nrostwo = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
    rdd_homeworktwo = sc.parallelize(lista_nrostwo)
    rdd_homeworktwo.collect()
    rdd_join = rdd_homework.union(rdd_homeworktwo)
    rdd_join.collect()

![](https://static.platzi.com/media/user_upload/image-f601737e-4e06-40ea-b277-074629245f73.jpg) ![](https://static.platzi.com/media/user_upload/image-16eaa869-0f93-4c7b-b371-88482742655c.jpg) Done!
My answers: ![](https://static.platzi.com/media/user_upload/image-851a764a-c145-4e62-ae7f-9a9d4fb7090f.jpg) ![](https://static.platzi.com/media/user_upload/image-250877f2-8fe3-4845-b847-1305b6d18dbb.jpg) ![](https://static.platzi.com/media/user_upload/image-d32f02bf-0ba8-4831-9e06-d63f078c1f61.jpg) ![](https://static.platzi.com/media/user_upload/image-04379eb0-1a88-4c0c-8a1f-f5500163c2b9.jpg) ![](https://static.platzi.com/media/user_upload/image-fadd5d49-f39e-4433-a46c-c969d9c173c5.jpg)
I am a fan of these exercises![](https://static.platzi.com/media/user_upload/image-36490e4c-085d-4d2f-8905-24da2b119107.jpg)![](https://static.platzi.com/media/user_upload/image-1fb1b980-3037-4d9d-a487-53039236259a.jpg)![](https://static.platzi.com/media/user_upload/image-5d3046ff-36e7-4c9d-bd42-8361f90c9568.jpg)
Hi! Sharing my solution to the challenge: ![](https://static.platzi.com/media/user_upload/image-6c7181b8-33c1-4784-8e81-de9d8f1879a8.jpg)![](https://static.platzi.com/media/user_upload/image-b9578ae2-fe37-4a2f-9fb9-6a0ce17010a7.jpg)![](https://static.platzi.com/media/user_upload/image-8eaa4a21-f453-4061-a04d-ca13e8bffcf3.jpg)
```python
# Create an RDD
list_1_10 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
list_rdd = sc.parallelize(list_1_10)
list_rdd.collect()

# Filter even values
pair_values = list_rdd.filter(lambda x: x % 2 == 0)
pair_values.collect()

# Find the max and min of the list
display("The max number of the list is:", list_rdd.max(), "The min number of the list is:", list_rdd.min())

# Find the average of the list
display("The average of the list is:", list_rdd.mean())

# Join RDDs
join_rdd = list_rdd.union(pair_values)
join_rdd.collect()
```
Sharing the code: ![](https://static.platzi.com/media/user_upload/image-c197bd48-5243-40cb-bf58-a511cb4b05bd.jpg) ![](https://static.platzi.com/media/user_upload/image-a682e691-cab7-4020-b8c6-9ce1f5f07d4a.jpg)
Attaching my homework, teacher: ![](https://static.platzi.com/media/user_upload/image-847411ee-f3c1-4bf2-a021-c8f036768f43.jpg)![](https://static.platzi.com/media/user_upload/image-a8fbd165-931e-4c14-bdb7-ebbee31b7997.jpg) ![](https://static.platzi.com/media/user_upload/image-ced767dc-2e6b-4df1-ae07-417f8fce3cb2.jpg) ![](https://static.platzi.com/media/user_upload/image-718aff99-f546-4df9-8f95-ed346f44b537.jpg)
In Apache Spark, **actions** are operations that execute the chain of transformations defined on an RDD, DataFrame, or Dataset, returning a result to the driver or writing the data to external storage. Unlike transformations, which are **lazily evaluated**, actions **trigger the execution** of all preceding transformations in the pipeline.

### **Characteristics of actions**

1. **They start the computation**: They produce a final result or write data to an external system.
2. **They return a value**: The result may be a value returned to the driver, a count, a list of elements, or data written to storage.
3. **They force evaluation**: They run all the transformations accumulated up to that point.
4. **They consume cluster resources**: Actions put load on the nodes while the data is processed.

### **Main actions in Spark**

The most common actions are described below, with examples in Python.

#### 1. `collect()`

* Retrieves all the elements of the RDD or DataFrame and returns them to the driver as a list.
* **Use**: Best suited to small datasets, since all the data must fit in the driver's memory.

**Example**:

    rdd = sc.parallelize([1, 2, 3, 4, 5])
    resultado = rdd.collect()
    print(resultado)  # Output: [1, 2, 3, 4, 5]

#### 2. `count()`

* Counts the total number of elements in an RDD or DataFrame.

**Example**:

    rdd = sc.parallelize([1, 2, 3, 4, 5])
    total = rdd.count()
    print(total)  # Output: 5

#### 3. `take(n)`

* Retrieves the first `n` elements of the RDD or DataFrame.

**Example**:

    rdd = sc.parallelize([10, 20, 30, 40, 50])
    primeros = rdd.take(3)
    print(primeros)  # Output: [10, 20, 30]

#### 4. `reduce(function)`

* Applies a reduction function to the elements of the RDD and returns a single value. The function should be commutative and associative so that the result does not depend on how the data is partitioned.

**Example**:

    rdd = sc.parallelize([1, 2, 3, 4])
    suma = rdd.reduce(lambda x, y: x + y)
    print(suma)  # Output: 10

#### 5. `first()`

* Returns the first element of the RDD or DataFrame.

**Example**:

    rdd = sc.parallelize([100, 200, 300])
    primero = rdd.first()
    print(primero)  # Output: 100

#### 6. `saveAsTextFile(path)`

* Saves the RDD as text at the given path. The path is created as a directory, even when it ends in `.txt`, and each partition is stored as a separate file inside it.

**Example**:

    rdd = sc.parallelize(["Hola", "Mundo", "Spark"])
    rdd.saveAsTextFile("ruta/salida.txt")

#### 7. `saveAsSequenceFile(path)`

* Saves the data as a sequence file (used for key-value pairs).

**Example**:

    rdd = sc.parallelize([(1, "A"), (2, "B"), (3, "C")])
    rdd.saveAsSequenceFile("ruta/salida-secuencia")

#### 8. `countByValue()`

* Returns a map with the frequency of each element in the RDD.

**Example**:

    rdd = sc.parallelize([1, 2, 2, 3, 3, 3])
    frecuencias = rdd.countByValue()
    print(dict(frecuencias))  # Output (key order may vary): {1: 1, 2: 2, 3: 3}

#### 9. `foreach(function)`

* Applies a function to each element of the RDD without returning a result to the driver.

**Example**:

    rdd = sc.parallelize([1, 2, 3])
    rdd.foreach(lambda x: print(x))  # Prints each element (on a cluster, the output goes to the executor logs)

#### 10. `takeSample(withReplacement, num, seed=None)`

* Returns a sample of elements from the RDD.
* **Parameters**:
  * `withReplacement`: Whether elements may be repeated in the sample.
  * `num`: Size of the sample.
  * `seed`: Optional seed for reproducibility.

**Example**:

    rdd = sc.parallelize([1, 2, 3, 4, 5])
    muestra = rdd.takeSample(False, 3, seed=42)
    print(muestra)  # Output: a list of 3 elements, for example [1, 4, 5]

#### 11. `saveAsObjectFile(path)`

* Serializes the RDD and saves it as a binary file at the given path. This method belongs to the Scala and Java API; the PySpark counterpart is `saveAsPickleFile(path)`, used in the example below.

**Example**:

    rdd = sc.parallelize(["a", "b", "c"])
    rdd.saveAsPickleFile("ruta/archivo_objeto")

### **Important considerations**

* Using `collect()`, or `take()` with a large `n`, on large data can exhaust the driver's memory. In those cases it is better to use actions that write the data to external storage.
* Actions trigger all pending transformations, so it is essential to design efficient pipelines.

### **Complete pipeline example**

    from pyspark import SparkContext

    # Create a Spark context
    sc = SparkContext("local", "AccionesEjemplo")

    # Create an RDD
    rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

    # Transformations
    rdd_pares = rdd.filter(lambda x: x % 2 == 0)
    rdd_cuadrados = rdd_pares.map(lambda x: x**2)

    # Action
    resultado = rdd_cuadrados.collect()
    print(resultado)  # Output: [4, 16, 36]

    # Stop the context
    sc.stop()

In this pipeline:

1. Transformations (`filter` and `map`) are defined to keep the even numbers and square them.
2. The `collect()` action is executed to obtain the final result.
![](https://static.platzi.com/media/user_upload/image-ae3687b7-1caa-450e-86bd-904f3b21653e.jpg) ![](https://static.platzi.com/media/user_upload/image-d700a97f-9e6e-4120-a696-3c2562c16d08.jpg) I only missed the AVG ☺️
I saw in the comments that the "union" function is used to combine RDDs, but trying it the following way also worked. Is this correct? ![](https://static.platzi.com/media/user_upload/image-d509877d-92ce-467a-b136-a418987031ca.jpg)
Sharing my solution to the class challenge: ![](https://static.platzi.com/media/user_upload/image-5f10b627-8aba-4a78-962d-2b01803c6e2d.jpg) ![](https://static.platzi.com/media/user_upload/image-ae6564b8-3785-4ac6-846d-06348628c682.jpg) ![](https://static.platzi.com/media/user_upload/image-105b0942-9d56-4a91-9185-ad34b98cabe7.jpg)