Introduction and Fundamentals of NLP
Natural Language Processing
Environment Setup and Data Exploration
Initial Preprocessing
Quiz: Introduction and Fundamentals of NLP
Traditional NLP Techniques for Business Documents
Tokenization, Stemming, and Lemmatization
Visualization and Word Cloud Generation
Vector Representation: Bag-of-Words and TF-IDF
Key Term Extraction and Topic Modeling
Traditional Classification for Sentiment Analysis and Categories
Quiz: Traditional NLP Techniques for Business Documents
Introduction and Deep Dive into Transformers for Business Applications
Transformer Fundamentals and Their Relevance in NLP
Advanced Tokenization with Transformers and Hugging Face
Using Pre-trained Transformer Models for Classification
Named Entity Recognition (NER) in Corporate Documents with Transformers
Fine-Tuning Transformers for Business Data
Quiz: Introduction and Deep Dive into Transformers for Business Applications
Final Project and B2B Commercial Strategy
Development and Prototyping of the Business Application, Part 1
Development and Prototyping of the Business Application, Part 2
Deploying the Project on Hugging Face
Tokenization with pre-trained models represents a significant advance in natural language processing, enabling deeper and more contextual analysis of text. Unlike traditional methods, these new techniques better capture the nuances of actual language, including elements such as emojis and special characters. Let's see how to implement these powerful tools in our projects.
Traditional tokenization methods such as Bag of Words or TF-IDF required several steps of text cleaning and conditioning before the text could be properly processed. While these methods have been useful for a long time, they have significant limitations when it comes to understanding the full context or handling the special features of modern language.
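As a reference point, a typical traditional pipeline looks something like the following minimal sketch with scikit-learn. The sample sentences and the cleaning rules are illustrative, not taken from the course dataset: the text has to be lowercased and stripped of punctuation, emojis and stop words before it can be vectorized at all.

import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative reviews; in the course these would come from the reviews file
reviews = pd.Series([
    "No good, my screen went away in less than eight months!!",
    "Great product, totally recommended :)",
])

# Manual cleaning: lowercase and drop everything that is not a letter or a space,
# which also throws away emojis and other special characters
cleaned = reviews.str.lower().apply(lambda t: re.sub(r"[^a-záéíóúñü ]", " ", t))

# Only after cleaning can we build the TF-IDF representation
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())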
Pre-trained tokenization models, such as those available in Hugging Face, offer significant advantages: they need little or no prior text cleaning, they handle emojis and special characters natively, and they split unknown words into subword units instead of discarding them.
These features make modern tokenization much closer to reality and to the format in which we communicate today.
To implement this advanced tokenization, we need to follow some specific steps. It is important to note that these models require significant computational resources, so it is recommended to use a GPU.
The first thing is to make sure we have access to a GPU:
# Check the notebook configuration to confirm GPU connection
# This can be done from the interface: Edit > Notebook Configuration
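As a quick sanity check from code, assuming PyTorch is installed (as it is in a standard Colab runtime), you can also confirm that a GPU is actually visible:

import torch

# True if a CUDA-capable GPU is visible to the runtime
print(torch.cuda.is_available())

# Name of the assigned device, when one is available
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))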
Next, we load our raw dataset:
import pandas as pd

# We load the original raw file
df_complete = pd.read_csv('review_data_original.csv')

# We visualize to confirm we have the raw data
df_complete.head()
To use Hugging Face tokenizers, we need to import the corresponding library:
from transformers import AutoTokenizer
The choice of the appropriate tokenizer depends on several factors, above all the language of the text and the model we plan to use afterwards.
For our example with Spanish text, we will use a pre-trained model developed by the University of Chile based on BERT:
# We download the pre-trained tokenizer for Spanish
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
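As a point of comparison, and not part of the course notebook, a corpus that mixed several languages could be handled by loading a multilingual checkpoint such as bert-base-multilingual-cased with exactly the same call:

# Hypothetical alternative: a multilingual tokenizer loaded the same way
multilingual_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")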
Once the tokenizer is loaded, we can test it with a sample text:
# Sample text
sample_text = "No good, my screen went away in less than eight months"

# We apply the tokenizer
tokens = tokenizer(sample_text)

# We display the results
print("Original text:", sample_text)
print("Tokens:", tokens)
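Note that calling the tokenizer directly returns the encoded input (input_ids, attention_mask, and so on). To see the actual subword strings the text was split into, which is what we comment on below, you can add the following small sketch, assuming the same tokenizer object as above:

# Subword tokens as strings, before they are mapped to ids
print(tokenizer.tokenize(sample_text))

# Equivalent view: map the ids back to tokens, including the [CLS] and [SEP] markers
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"]))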
There are mainly two tokenization techniques used in modern models:
Byte Pair Encoding (BPE): This technique starts with single characters and iteratively merges the most frequent pairs to form new tokens. It is particularly effective for languages with complex morphology and can handle unknown words by splitting them into subunits.
WordPiece: Similar to BPE, but it uses a slightly different approach to merging tokens. This method is used by models such as BERT and is especially good at capturing common prefixes and suffixes in words.
Both techniques make it possible to handle words that are not in the vocabulary by splitting them into known subunits, while keeping the vocabulary at a manageable size.
In the example we saw, the tokenizer decomposed some words into subunits, demonstrating how these models parse text in a more granular way than traditional methods.
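To see this splitting behavior explicitly, here is a minimal sketch assuming the dccuchile tokenizer loaded earlier; the word list is illustrative and the exact splits depend on that model's vocabulary. Less common words are broken into pieces, and in WordPiece vocabularies a "##" prefix marks a piece that continues the previous subword.

# Each word is split only if it is not in the vocabulary as a whole;
# continuation pieces are prefixed with "##" in WordPiece vocabularies
for word in ["pantalla", "electrodoméstico", "smartwatch"]:
    print(word, "->", tokenizer.tokenize(word))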
Tokenization with pre-trained models represents a significant advance in natural language processing, allowing us to analyze text with greater precision and richer context. These techniques are fundamental to taking full advantage of modern language models. Have you experimented with different tokenizers? What differences have you noticed in your projects? Share your experience in the comments.