Advanced Tokenization with Transformers and Hugging Face


Tokenization with pre-trained models represents a significant advance in natural language processing, enabling deeper and more contextual analysis of text. Unlike traditional methods, these new techniques better capture the nuances of actual language, including elements such as emojis and special characters. Let's see how to implement these powerful tools in our projects.

What are the advantages of pre-trained tokenization models?

Traditional tokenization methods such as Bag of Words or TF-IDF required several cleaning and conditioning steps before the text could be properly processed. While these methods have been useful for a long time, they have significant limitations when it comes to understanding full context or handling the special features of modern language.

Pre-trained tokenization models, such as those available in Hugging Face, offer significant advantages:

  • Context understanding: They better capture word relationships and contextual meaning.
  • Handling of special characters and emojis: They correctly process elements that traditional tokenizers ignored or eliminated.
  • Word subdivision: They can decompose words into meaningful subword units.
  • Handling large vocabularies: They work well with infrequent words.
  • Morphology capture: They understand the internal structure of words.
  • Compatibility with pre-trained models: They integrate seamlessly with advanced architectures such as BERT or GPT.

These features bring modern tokenization much closer to the way we actually communicate today.

How to implement tokenization with Transformers and Hugging Face?

To implement this advanced tokenization, we need to follow some specific steps. It is important to note that these models require significant computational resources, so it is recommended to use a GPU.

Preparing the environment

The first thing is to make sure we have access to a GPU:

# Check the notebook configuration to confirm GPU connection
# This can be done from the interface: Edit > Notebook Configuration
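
If you prefer to verify this from code, here is a minimal sketch assuming PyTorch is installed in the environment (it is the backend most Transformers setups use; adjust if you work with another framework):

import torch

# Returns True when a CUDA-capable GPU is visible to the runtime
print("GPU available:", torch.cuda.is_available())

# Optionally show which device was detected
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))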

Next, we load our raw dataset:

import pandas as pd

# We load the original raw file
df_complete = pd.read_csv('review_data_original.csv')

# We visualize to confirm we have the raw data
df_complete.head()

Tokenizer implementation

To use Hugging Face tokenizers, we need to import the corresponding library:

from transformers import AutoTokenizer

The choice of the appropriate tokenizer depends on several factors:

  • The language of the text (in this case, Spanish).
  • The specific area or focus (finance, marketing, etc.).
  • The type of pre-trained model to be used subsequently.

For our example with Spanish text, we will use a pre-trained model developed by the University of Chile based on BERT:

# We download the pre-trained tokenizer for Spanish
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

Tokenization of an example text

Once the tokenizer is loaded, we can test it with a sample text:

# Sample text
sample_text = "No good, my screen went away in less than eight months"

# We apply the tokenizer
tokens = tokenizer(sample_text)

# We display the results
print("Original text:", sample_text)
print("Tokens:", tokens)
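
The tokenizer returns numeric IDs rather than strings, so to see the actual subword pieces we can map the IDs back. The following is a small sketch using standard Hugging Face tokenizer methods; the exact splits depend on the model's learned vocabulary:

# Map the numeric IDs back to their subword strings
print("Subword pieces:", tokenizer.convert_ids_to_tokens(tokens["input_ids"]))

# Reconstruct the text (special tokens included) from the IDs
print("Decoded:", tokenizer.decode(tokens["input_ids"]))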

What tokenization techniques are used in these models?

There are mainly two tokenization techniques used in modern models:

BPE (Byte Pair Encoding)

This technique starts with single characters and iteratively merges the most frequent pairs to form new tokens. It is particularly effective for languages with complex morphology and allows handling unknown words by splitting them into subunits.
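
As a quick illustration, this sketch loads the publicly available gpt2 tokenizer, which uses byte-level BPE (the choice of checkpoint is an assumption for the example; any BPE-based model behaves similarly):

from transformers import AutoTokenizer

# GPT-2's tokenizer is a byte-level BPE tokenizer
bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Infrequent or invented words are split into smaller, reusable pieces
print(bpe_tokenizer.tokenize("untranslatability"))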

WordPiece

Similar to BPE, but uses a slightly different approach to token merging. This method is used by models such as BERT and is especially good for capturing common prefixes and suffixes in words.
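
With the Spanish BERT tokenizer we already loaded, which relies on a WordPiece-style vocabulary, continuation pieces are marked with the ## prefix. A minimal sketch; the exact splits depend on the vocabulary the model was trained with:

# WordPiece marks pieces that continue a word with the ## prefix
print(tokenizer.tokenize("tokenización avanzada"))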

Both techniques allow:

  • Segment words into meaningful subunits.
  • Handle large vocabularies with infrequent words.
  • Capture word morphology.
  • Generate tokens that can be recombined to form whole words.

In the example we saw, the tokenizer decomposed some words into subunits, demonstrating how these models parse text in a more granular way than traditional methods.

Tokenization with pre-trained models represents a significant advance in natural language processing, allowing us to analyze text with greater precision and contextual awareness. These techniques are fundamental to leveraging the full potential of modern language models. Have you experimented with different tokenizers? What differences have you noticed in your projects? Share your experience in the comments.

Contributions


Tokenization is the process of splitting text into smaller units known as tokens. These tokens can be words, subwords, or characters. In the context of natural language processing (NLP) models, and specifically with Transformers, tokenization allows models to better understand the context and morphology of language. Advanced tokenization techniques, such as those offered by Hugging Face, make it possible to handle large vocabularies and capture language-specific features, which improves effectiveness in tasks such as sentiment analysis and information extraction.