
Understanding persistence and partitioning


How does data persistence work in Spark?

Learning how to handle data partitioning and persistence in Spark is crucial to optimize process performance and efficiency. Without proper handling of these aspects, systems can unnecessarily recompute data, affecting performance and increasing costs. In this class, you will learn how to use and configure different persistence techniques in Spark, allowing your data to be available efficiently throughout processing.

Why is it important to keep data in memory?

When working with large data sets, constant recomputing can be inefficient and costly. Using tools such as cache and persist allows you to keep data in memory or on disk, avoiding reprocessing and improving response times. Here are some common methods:

  • Cache: Stores data in memory for quick access.
  • Persist: Offers the option of storing data both in memory and on disk, as required.

Both methods help to keep data in serialized form, making it easy to access and manage within PySpark.

What types of persistence does Spark offer?

Spark offers several levels of persistence, each with its own benefits and limitations. The choice of which level to use depends on the business requirement and the performance you want to achieve. Some of the persistence parameters are:

  • Disk: Indicates whether data should be stored on disk.
  • Memory: Determines whether the data will be in memory.
  • Serialization: Decides if the data is handled in serialized format.
  • Replication: Number of times data is replicated to increase availability and reliability. A minimum replication of three is recommended to minimize the risk of data loss.

How to apply persistence in Jupyter Notebook?

In a Jupyter Notebook environment, it is possible to apply persistence using specific functions and adjusting parameters as needed. Here is a brief example of how to do it:

from pyspark import StorageLevel

# Imagine you already have a DataFrame named medalist_by_year

# Check first whether it is cached (is_cached is a property, not a method)
is_cached = medalist_by_year.is_cached  # False if it is not cached

# Cache the DataFrame
medalist_by_year.cache()

# Verify the current persistence level
current_persistence = medalist_by_year.rdd.getStorageLevel()

# To modify the persistence level, first remove the existing one
medalist_by_year.unpersist()

# Apply the new persistence level
medalist_by_year.rdd.persist(StorageLevel.MEMORY_AND_DISK)

Is it possible to create custom persistence levels?

Yes. Spark allows you to create custom persistence levels that fit particular business needs. This flexibility makes it possible to replicate data several times when high availability is required, configuring the parameters as needed. For example, to create a persistence level that replicates the data three times:

# Create a custom StorageLevel instance:
# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
custom_storage_level = StorageLevel(True, True, False, False, 3)

# Apply the new custom level
medalist_by_year.rdd.persist(custom_storage_level)

As we progress through the class, you will learn more about how to partition your data and how these partitions can influence the performance of your operations, leading you to a more complete mastery of Spark. So stay motivated and keep learning with enthusiasm!

Contributions 7

Interesting class

interesting

Great class!!

from pyspark.storagelevel import StorageLevel

# Is it cached? (if not, every access re-reads the data from the source)
medallista.is_cached


# Cache it
medallista.rdd.cache()

# See how the data is being persisted
medallista.rdd.getStorageLevel()

# Remove a persistence level
medallista.rdd.unpersist()


# Apply a persistence level
# (keeps a full replica of the RDD)
medallista.rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

# Create a custom storage level
StorageLevel.MEMORY_AND_DISK_3 = StorageLevel(True, True, False, False, 3)

# Apply the newly created persistence level
medallista.rdd.persist(StorageLevel.MEMORY_AND_DISK_3)