Parameters in ETL


How do you orchestrate an ETL flow efficiently?

Orchestrating an extract, transform, and load (ETL) flow is essential to ensure the process runs in the right order and without redundant work. Let's explore how to create a job that organizes the data transformations and guarantees efficient execution. If you want to learn about ETL in more depth, I recommend the ETL with Python and Pentaho course available on Platzi.

How do you start creating a job?

To get started, it is crucial to arrange the existing transformations in the right order. Follow these steps:

  1. Create a new job: this lets you call each transformation in the specific order you need.
  2. Load order: load the dimensions first (e.g., customers and territories), because the sales load depends on those dimensions being fully available to cross-reference their respective IDs (see the sketch after this list).
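For intuition, here is a minimal Python sketch of the same ordering idea outside Pentaho; the run_transformation helper and the transformation names are hypothetical placeholders, not part of the course material.

    # Sketch of the job's ordering logic; in Pentaho each call below would be
    # a transformation entry in the job, executed in sequence.
    def run_transformation(name: str) -> None:
        # Hypothetical stand-in: a real runner would launch the Pentaho
        # transformation instead of printing.
        print(f"Running transformation: {name}")

    def run_job() -> None:
        # Dimensions first, so their IDs exist before the fact load.
        for dimension in ("dim_customers", "dim_territories"):
            run_transformation(dimension)
        # The sales load runs last; it cross-references the dimension IDs.
        run_transformation("fact_sales")

    run_job()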

How do you calculate the maximum ID of a table?

To calculate the maximum ID, we create a transformation that works dynamically for any table and field. At its core is a parameterized query:

-- ${table} and ${consecutive} are variables resolved at run time, so the
-- same transformation works for any table and key field.
SELECT MAX(${consecutive}) AS consecutive FROM ${table};
  • Parameters: variables define the table name and the consecutive (key) field, so this single transformation serves any table.
  • Null validation: if the result is null, replace it with zero, since the following steps need an integer.
  • Add one: add a constant "1" to the maximum ID obtained, ensuring that new inserts get unique keys (see the sketch after this list).
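Outside Pentaho, the same null-to-zero and add-one logic can be sketched in a few lines of Python. The sqlite3 module is just a convenient stand-in for your database, and the table and field names are hypothetical.

    import sqlite3

    def next_id(conn: sqlite3.Connection, table: str, field: str) -> int:
        # MAX() returns NULL on an empty table; COALESCE replaces that with 0.
        # Interpolating the names mirrors the parameterized transformation;
        # only do this with trusted table/field names.
        row = conn.execute(
            f"SELECT COALESCE(MAX({field}), 0) FROM {table}"
        ).fetchone()
        # Add 1 so the next inserted record gets a unique key.
        return row[0] + 1

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_customers (customer_id INTEGER)")
    print(next_id(conn, "dim_customers", "customer_id"))  # empty table -> 1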

How do you calculate the maximum date for new loads?

The strategy for calculating the maximum date also relies on parameterized variables:

  1. Query the maximum date: as with the maximum ID, retrieve the latest date recorded in the target table.
  2. Handle null values: if no date is available, default to a value far in the past, e.g. "01-01-1900", so the first load captures all historical records.
  3. Use the date in transformations: in each transformation, filter the source query to include only records newer than the last recorded date, which keeps incremental loads small (see the sketch after this list).
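The date logic looks almost identical in a Python sketch, reusing the sqlite3 stand-in from above; the sales table and its sale_date column are hypothetical names.

    import sqlite3

    def last_load_date(conn: sqlite3.Connection, table: str, column: str) -> str:
        # With no rows yet, fall back to a date far in the past so the first
        # run captures every historical record. ISO format (YYYY-MM-DD) keeps
        # string comparison consistent with date order.
        row = conn.execute(
            f"SELECT COALESCE(MAX({column}), '1900-01-01') FROM {table}"
        ).fetchone()
        return row[0]

    def extract_new_rows(conn: sqlite3.Connection, since: str) -> list:
        # Incremental extract: only rows newer than the last recorded date.
        return conn.execute(
            "SELECT * FROM sales WHERE sale_date > ?", (since,)
        ).fetchall()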

How do you ensure the job executes correctly?

Once the variables for the maximum record ID and the maximum date have been defined, set them as environment variables so they are accessible throughout the ETL process. This is achieved by configuring:

  • Consistent variable names: make sure all transformations read the variables by exactly the same names.
  • Filters on incoming queries: filter the database queries with the configured variables so that only new or updated records are retrieved (see the sketch after this list).
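In Pentaho this is done with job variables; a rough Python analogue using process environment variables (the names MAX_SALES_ID and MAX_SALES_DATE are hypothetical) could look like this.

    import os
    import sqlite3

    # A setup step computes the watermarks once and publishes them under
    # names that every transformation agrees on.
    os.environ["MAX_SALES_ID"] = "0"
    os.environ["MAX_SALES_DATE"] = "1900-01-01"

    def extract_incremental(conn: sqlite3.Connection) -> list:
        # Read the shared variables back by the exact same names.
        max_id = int(os.environ["MAX_SALES_ID"])
        max_date = os.environ["MAX_SALES_DATE"]
        # Filter the incoming query so only new or updated rows come back:
        # new rows have a higher ID, updated rows a newer date.
        return conn.execute(
            "SELECT * FROM sales WHERE sales_id > ? OR sale_date > ?",
            (max_id, max_date),
        ).fetchall()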

I invite you to experiment with setting these variables in your own transformations and to watch how they improve the efficiency of your ETL loads. Every step toward optimizing ETL flows is an investment in your Data Engineering skills. Keep learning!
