Getting to Know Apache Spark
Everything you will learn about Spark for Big Data
Introduction to Apache Spark
Introduction to RDDs and DataFrames
Setup
Installing the working environment
Jupyter vs CLI: running Spark from the command line
Jupyter vs CLI: running Spark in Jupyter Notebook
RDD Operations
RDDs and DataFrames
Transformations and actions
Modification actions on RDDs
Count actions on RDDs
Solution: athletes challenge
Numeric operations
DataFrames and SQL
Creating DataFrames
Data type inference
Operations on DataFrames
Groupings and join operations on DataFrames
Solution: joins challenge
Aggregation functions
SQL
What is a UDF?
UDF
Persistence and Partitioning
Partitioning
Understanding persistence and partitioning
Partitioning data
Conclusions
Conclusions
Learning how to handle data partitioning and persistence in Spark is crucial to optimize process performance and efficiency. Without proper handling of these aspects, systems can unnecessarily recompute data, affecting performance and increasing costs. In this class, you will learn how to use and configure different persistence techniques in Spark, allowing your data to be available efficiently throughout processing.
When working with large data sets, constant recomputation can be inefficient and costly. Tools such as cache() and persist() allow you to keep data in memory or on disk, avoiding reprocessing and improving response times. Both methods keep data in serialized form, making it easy to access and manage within PySpark.
Spark offers several levels of persistence, each with its own benefits and limitations. Which level to use depends on the business requirements and the performance you want to achieve. Each level is defined by a handful of parameters: whether to use disk, whether to use memory, whether to use off-heap storage, whether to keep the data deserialized, and how many replicas to keep.
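To make those parameters concrete without requiring a Spark installation, here is a minimal stand-in that mirrors the flag layout of PySpark's StorageLevel constructor, StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication), and spells out a few of the built-in levels as flag combinations:

```python
from collections import namedtuple

# Hypothetical stand-in mirroring pyspark.StorageLevel's constructor order:
# (useDisk, useMemory, useOffHeap, deserialized, replication)
Level = namedtuple("Level", "use_disk use_memory use_off_heap deserialized replication")

BUILTIN_LEVELS = {
    "MEMORY_ONLY":       Level(False, True,  False, False, 1),
    "MEMORY_AND_DISK":   Level(True,  True,  False, False, 1),
    "DISK_ONLY":         Level(True,  False, False, False, 1),
    "MEMORY_AND_DISK_2": Level(True,  True,  False, False, 2),  # two replicas
}

for name, lvl in BUILTIN_LEVELS.items():
    print(f"{name}: {lvl}")
```

Reading the flags this way makes it clear that the `_2` suffix on a level name simply means the replication parameter is 2.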
In a Jupyter Notebook environment, it is possible to apply persistence using specific functions and adjusting parameters as needed. Here is a brief example of how to do it:
from pyspark import StorageLevel

# Imagine you already have a DataFrame named medallist_by_year

# Check whether it is cached (is_cached is a property, not a method)
is_cached = medallist_by_year.is_cached  # False if it is not cached

# Cache the DataFrame
medallist_by_year.cache()

# Check the current persistence level
current_persistence = medallist_by_year.rdd.getStorageLevel()

# To change the persistence level, first remove the existing one
medallist_by_year.unpersist()

# Apply the new persistence level
medallist_by_year.rdd.persist(StorageLevel.MEMORY_AND_DISK)
Spark also allows you to create customized persistence levels that fit particular business needs. This flexibility makes it possible to replicate data several times when high availability is required, configuring the parameters as needed. For example, to create a persistence level that replicates the data three times:
# Create a custom StorageLevel instance
# (useDisk, useMemory, useOffHeap, deserialized, replication)
custom_storage_level = StorageLevel(True, True, False, False, 3)

# Apply the new custom level
medallist_by_year.rdd.persist(custom_storage_level)
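It can help to translate a flag tuple like the one above back into words. The helper below is an illustration only (it is not part of PySpark's API); it renders a StorageLevel-style flag tuple, in the (useDisk, useMemory, useOffHeap, deserialized, replication) order, as readable text:

```python
def describe_level(use_disk, use_memory, use_off_heap, deserialized, replication):
    """Render StorageLevel-style flags as readable text (illustration only)."""
    places = [name for flag, name in
              [(use_memory, "memory"), (use_disk, "disk"), (use_off_heap, "off-heap")]
              if flag]
    form = "deserialized" if deserialized else "serialized"
    return f"{' + '.join(places)}, {form}, {replication}x replicated"

# The custom level from the example above: memory + disk, three replicas
print(describe_level(True, True, False, False, 3))
# memory + disk, serialized, 3x replicated
```

So the custom level stores each partition in memory, spills to disk when memory is full, keeps the data serialized, and maintains three replicas across the cluster.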
As we progress through the class, you will learn more about how to partition your data and how these partitions can influence the performance of your operations, leading you to a more complete mastery of Spark. So stay motivated and keep learning with enthusiasm!
from pyspark.storagelevel import StorageLevel

# Is the DataFrame cached?
# If not, every call that uses its values fetches the data from the source again
medallista.is_cached

# Cache it
medallista.rdd.cache()

# Check how the data is persisted
medallista.rdd.getStorageLevel()

# Remove a persistence level
medallista.rdd.unpersist()

# Apply a persistence level that replicates the whole RDD twice
medallista.rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

# Define a custom persistence level with three replicas
# (useDisk, useMemory, useOffHeap, deserialized, replication)
StorageLevel.MEMORY_AND_DISK_3 = StorageLevel(True, True, False, False, 3)

# Apply the persistence level just created
medallista.rdd.persist(StorageLevel.MEMORY_AND_DISK_3)