Tokenization, Stemming, and Lemmatization


Text processing is a fundamental skill in the world of artificial intelligence and data analysis. Tokenization, stopword removal, stemming and lemmatization are essential techniques that allow us to transform unstructured text into actionable data for algorithms. These tools help us extract the essential meaning from texts, reducing complexity and improving the efficiency of our models.

What is tokenization and why is it important?

Tokenization is the process of breaking text into smaller units called tokens, which can be groups of words, individual words or even single characters. This process is critical because it allows language models to process and understand text.

Each language model has its own tokenizer, which means that the way they divide text can vary significantly. For example, if we analyze the phrase "hello, how are you?" in different OpenAI models:

  • GPT-3 uses 8 tokens for this sentence.
  • GPT-3.5 splits it into 6 tokens.
  • GPT-4 processes it using only 5 tokens.

This variation is important because the number of tokens directly affects the cost and efficiency of text processing in AI applications.
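
As a quick, hands-on way to check this (not part of the class material), OpenAI's tiktoken library exposes the encodings these models use. The sketch below compares an older GPT-3-era encoding with the one used by GPT-3.5 and GPT-4:

# Minimal sketch with tiktoken (pip install tiktoken)
import tiktoken

text = "hello, how are you?"

# r50k_base is a GPT-3-era encoding; cl100k_base is used by GPT-3.5/GPT-4
for name in ["r50k_base", "cl100k_base"]:
    encoding = tiktoken.get_encoding(name)
    token_ids = encoding.encode(text)
    print(f"{name}: {len(token_ids)} tokens -> {token_ids}")

Fewer tokens for the same text generally means lower cost and more room left in the model's context window.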

Popular libraries for tokenization

There are two popular libraries for vectorizing and manipulating text in Python:

  1. NLTK (Natural Language Toolkit)
  2. spaCy

Both offer powerful tools for natural language processing, including multilingual capabilities. To work with Spanish text, it is necessary to specify the language or download specific models for Spanish.

# Example with spaCy
import spacy

# Download the Spanish model first:
# python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

# Spanish example sentence ("Horrible, we had to buy another one because
# neither we know English nor an IT guy could install it.")
sample_text = "Horrible, tuvimos que comprar otro porque ni nosotros sabemos inglés ni un informático lo pudo instalar."
doc = nlp(sample_text)

# Get tokens
tokens = [token.text for token in doc]

# Get sentences
sentences = [sent.text for sent in doc.sents]

print(tokens)

How to refine text preprocessing to improve results?

To better capture the main idea of a text and reduce the number of tokens, we can apply several refinement techniques. These techniques allow us to better manage resources and improve the quality of our analysis.

Stopword removal

Stopwords are common words that do not contribute relevant information to the analysis, such as articles, prepositions and conjunctions. Eliminating them reduces the noise in our data.

Example: In the Spanish sample sentence above, words like "que" ("that"), "porque" ("because"), "ni" ("nor"), "nosotros" ("we") and "un" ("a") would be considered stopwords.
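
As a side note (not shown in the class), spaCy flags stopwords directly on each token, so they can be filtered without maintaining a separate word list. A minimal sketch using the same Spanish model as above:

import spacy

nlp = spacy.load("es_core_news_sm")
doc = nlp("Horrible, tuvimos que comprar otro porque ni nosotros sabemos inglés ni un informático lo pudo instalar.")

# token.is_stop marks stopwords; token.is_punct marks punctuation
content_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(content_tokens)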

Stemming vs. Lemmatization

These techniques reduce words to their base forms, but with different approaches:

  1. Stemming: trims the word to its root, often by removing suffixes.

    • Example: "comprando" (buying) → "compr".
    • Advantage: fast and simple.
    • Disadvantage: can produce stems that are not real words and may lose semantic meaning.
  2. Lemmatization: converts the word to its canonical dictionary form, or lemma.

    • Example: "comprando" (buying) → "comprar" (to buy).
    • Advantage: preserves semantic meaning better.
    • Disadvantage: more computationally complex.
# Implementing refinement techniques
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer

# Download needed resources
nltk.download('punkt')       # tokenizer models required by word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')

# Configure for Spanish
stop_words = set(stopwords.words('spanish'))
stemmer = SnowballStemmer('spanish')

# Same Spanish example sentence as before
sample_text = "Horrible, tuvimos que comprar otro porque ni nosotros sabemos inglés ni un informático lo pudo instalar."
tokens = nltk.word_tokenize(sample_text, language='spanish')

# Remove stopwords
tokens_without_stopwords = [word for word in tokens if word.lower() not in stop_words]

# Apply stemming
stemmed_tokens = [stemmer.stem(word) for word in tokens_without_stopwords]

# Apply lemmatization (note: WordNetLemmatizer only covers English;
# for Spanish lemmas, spaCy's token.lemma_ is the usual choice)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_without_stopwords]

print("Original tokens:", tokens)
print("No stopwords:", tokens_without_stopwords)
print("With stemming:", stemmed_tokens)
print("With lemmatization:", lemmatized_tokens)

Processing results

By applying these techniques to the Spanish example sentence, we can see how the words are transformed:

  • Original tokens: every word and punctuation mark in the sentence
  • Without stopwords: "Horrible comprar sabemos inglés informático pudo instalar" ("Horrible buy we know English IT guy could install")
  • With stemming: "Horribl compr sab inglés informát pud instal"
  • With lemmatization: "Horrible comprar saber inglés informático poder instalar" ("Horrible buy know English IT guy can install")

Lemmatization better preserves meaning while reducing variability, making it ideal for many natural language processing applications.
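
Since NLTK's WordNetLemmatizer only covers English, Spanish lemmas like those above are typically produced with spaCy instead. A minimal sketch, assuming the es_core_news_sm model used earlier is installed:

import spacy

nlp = spacy.load("es_core_news_sm")
doc = nlp("Horrible, tuvimos que comprar otro porque ni nosotros sabemos inglés ni un informático lo pudo instalar.")

# token.lemma_ returns the canonical form assigned by the Spanish pipeline
lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
print(lemmas)

The exact lemmas can vary slightly between model versions, but the idea is the same: each inflected form is mapped back to its dictionary entry.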

How to prepare data for text visualizations?

Once we have refined and processed our dataset, we are ready to generate visualizations such as word clouds. These graphical representations allow us to quickly identify the most relevant terms in a text corpus.

To create an effective word cloud (a short code sketch follows the list), it is recommended to:

  1. Eliminate stopwords to focus on meaningful words
  2. Apply lemmatization to group variants of the same word
  3. Normalize the text (convert to lowercase, remove punctuation)
  4. Assign weights to words according to their frequency or importance
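
A minimal sketch of these steps, using the third-party wordcloud and matplotlib packages (not part of the class code) and the lemmatized tokens from the example above:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Lowercased, stopword-free lemmas from the earlier example
processed_tokens = ["horrible", "comprar", "saber", "inglés", "informático", "poder", "instalar"]

# WordCloud weights each word by how often it appears in the input string
cloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(processed_tokens))

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()

With a real corpus, the more frequent a lemma is, the larger it appears in the cloud.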

Word clouds are especially useful for:

  • Exploratory analysis of textual data
  • Quick identification of main themes
  • Visual communication of results to non-technical audiences

Text processing is a fascinating field that combines linguistics and computer science to extract meaning from unstructured data. Mastering these techniques will allow you to develop more sophisticated text analysis and artificial intelligence applications. What text processing techniques have you used in your projects? Share your experience in the comments.

Contributions

In more advanced text-analysis settings, it is essential to adapt preprocessing techniques to the specific goal of the project. For example, in sentiment analysis or semantic classification tasks, indiscriminately removing stopwords can cause the loss of critical information, especially when those words play a key grammatical role (such as negations or adversative connectors). Likewise, in corpora with high linguistic variability, such as user-generated text or multilingual content, it may be necessary to train custom tokenizers or to bring in contextual models such as those based on Transformers (e.g., BERT or RoBERTa), which preserve context and disambiguate meanings. Choosing techniques appropriately not only improves model accuracy but also optimizes computational cost in large-scale processing pipelines.
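
For reference, the subword tokenizers used by Transformer models can be inspected with Hugging Face's transformers library. A minimal sketch using the multilingual BERT checkpoint, chosen here purely as an illustration:

from transformers import AutoTokenizer

# WordPiece tokenizer shared across ~100 languages, including Spanish
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

print(tokenizer.tokenize("Horrible, tuvimos que comprar otro porque ni nosotros sabemos inglés."))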
Excellent, I had always wanted to know how tokenization is done.