Quick tagging in Python: Spanish and English

How do I tag English words using NLTK?

Tagging English words with NLTK is relatively straightforward thanks to the pre-trained models available in the library. To get started, we import NLTK and download the resources it needs: the pre-trained tagger and the data used by the word tokenizer. Once those are in place, we tokenize the text and pass the tokens to NLTK's pos_tag function.

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# English text to tag
text = "The quick brown fox jumps over the lazy dog"

# Tokenization of text
tokens = word_tokenize(text)

# Tagging of words
tags = nltk.pos_tag(tokens)
print(tags)

This code splits the text into words and assigns each one a part-of-speech tag such as noun (NN) or verb (VB). The tags follow the Penn Treebank tagset, a widely used standard of grammatical categories.
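Since pos_tag returns a list of (token, tag) tuples, ordinary Python is enough to work with the result. The sketch below groups tokens by tag. The sample list is hand-written for illustration and is not actual NLTK output, and `group_by_tag` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written sample in the (token, tag) shape that pos_tag returns.
# Illustrative only: this is not output produced by NLTK.
sample_tags = [
    ("The", "DT"), ("dog", "NN"), ("chased", "VBD"), ("a", "DT"), ("cat", "NN"),
]

def group_by_tag(tagged_tokens):
    """Collect the tokens that share each tag, preserving their order."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)

groups = group_by_tag(sample_tags)
print(groups["NN"])  # ['dog', 'cat']
```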

How can we understand the meaning of the tags?

NLTK provides tagset documentation that we can download to learn what these tags mean, as follows:

nltk.download('tagsets')

# List of tags that we want to explore
tags = ['CC', 'RB', 'PRP']

# Show the meaning of each tag
# (upenn_tagset prints the description itself and returns None)
for tag in tags:
    nltk.help.upenn_tagset(tag)

With the tagset downloaded, we can see which grammatical category each tag represents, along with examples of its usage.

How to tag Spanish words using NLTK?

Tagging Spanish words with NLTK requires additional steps, since there is no pre-trained tagger for this language. The general procedure is:

  1. Obtain a Spanish corpus, such as cess_esp.
  2. Train a model with a subset of the corpus.
  3. Evaluate the model.

How does tagging using unigrams work?

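The unigram tagger looks only at the word itself: it assigns each word the tag that word most frequently carried in the training data, without considering the surrounding words.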
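To make the idea concrete, here is a minimal from-scratch sketch of a unigram tagger. It is not NLTK's implementation: the function names, the tiny training set, and the simplified tags (DET, NOUN, VERB instead of the cess_esp tags) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_unigram(model, tokens):
    """Tag each token independently; words never seen in training get None."""
    return [(token, model.get(token)) for token in tokens]

# Tiny hand-made training set (hypothetical, simplified tags)
train = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("come", "NOUN")],  # deliberate minority tag for "come"
]

model = train_unigram(train)
result = tag_unigram(model, ["el", "gato", "come", "trigo"])
print(result)
# [('el', 'DET'), ('gato', 'NOUN'), ('come', 'VERB'), ('trigo', None)]
```

"come" receives VERB because that tag wins two to one in the training data, and "trigo" receives None because it never appeared there.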

import nltk
nltk.download('cess_esp')
from nltk.corpus import cess_esp as ces
from nltk.tag import UnigramTagger

# Tagged sentences from the cess_esp corpus (training needs the tags)
phrases = ces.tagged_sents()

# Define the fraction of the dataset used for training
fraction = int(len(phrases) * 0.9)

# Train a unigram tagger
unigram_tagger = UnigramTagger(phrases[:fraction])

# Evaluate the tagger (recent NLTK versions deprecate evaluate() in favor of accuracy())
precision = unigram_tagger.accuracy(phrases[fraction:])
print(f"Unigram tagger precision: {precision}")

This process trains the tagger on 90% of the dataset and evaluates its accuracy on the remaining 10%. We can also tag new Spanish sentences by passing a list of tokens to unigram_tagger.tag().
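The 90/10 split and the accuracy figure can be sketched in plain Python. This is an illustration under stated assumptions, not NLTK's code: `split_corpus`, `token_accuracy`, and `lookup_tagger` are hypothetical names, the corpus is made up, and accuracy is assumed to be measured per token.

```python
def split_corpus(sents, train_ratio=0.9):
    """Split a list of sentences into a training part and a held-out part."""
    cut = int(len(sents) * train_ratio)
    return sents[:cut], sents[cut:]

def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    return correct / total if total else 0.0

# Hypothetical ten-sentence corpus with simplified tags
corpus = [[("el", "DET"), ("perro", "NOUN")] for _ in range(9)]
corpus.append([("el", "DET"), ("trigal", "NOUN")])

train_sents, test_sents = split_corpus(corpus)

# Stand-in tagger: a plain word-to-tag lookup built from the training part
lookup = {word: tag for sent in train_sents for word, tag in sent}

def lookup_tagger(words):
    return [(word, lookup.get(word)) for word in words]

score = token_accuracy(lookup_tagger, test_sents)
print(len(train_sents), len(test_sents), score)  # 9 1 0.5
```

Slicing with `sents[:cut]` and `sents[cut:]` covers every sentence exactly once, so the held-out part starts right where the training part ends.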

Is tagging using bigrams better?

One might expect a bigram tagger, which tags each word using the tag of the preceding word as context, to be more effective, but in practice this is not always the case. The implementation is similar, yet on this corpus it generally yields lower precision, as shown below:

from nltk.tag import BigramTagger

# Train a bigram tagger
tagger_bigram = BigramTagger(phrases[:fraction])

# Evaluate the bigram tagger (recent NLTK versions deprecate evaluate() in favor of accuracy())
precision_bigram = tagger_bigram.accuracy(phrases[fraction:])
print(f"Bigram tagger precision: {precision_bigram}")

It is worth testing both methods and choosing the one that best suits the use case. In this Spanish example, the unigram tagger gives the better results.
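One plausible reason a bigram tagger scores lower on its own is data sparsity: it can only tag a word when it has seen that word together with the same preceding tag. The from-scratch sketch below mirrors that behavior in simplified form. It is not NLTK's implementation, and the names, the `<s>` start marker, and the one-sentence training set are assumptions for illustration.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for the start of a sentence

def train_bigram(tagged_sents):
    """Learn the most frequent tag for each (previous tag, word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: tag_counts.most_common(1)[0][0]
            for context, tag_counts in counts.items()}

def tag_bigram(model, tokens):
    """Tag left to right; an unseen context yields None."""
    tagged = []
    prev_tag = START
    for token in tokens:
        tag = model.get((prev_tag, token))
        tagged.append((token, tag))
        prev_tag = tag  # a None here makes every later context unseen too
    return tagged

# One hand-made training sentence with simplified tags
train = [[("el", "DET"), ("perro", "NOUN"), ("come", "VERB"), ("trigo", "NOUN")]]

model = train_bigram(train)
seen = tag_bigram(model, ["el", "perro", "come", "trigo"])
unseen = tag_bigram(model, ["el", "gato", "come", "trigo"])
print(seen)
print(unseen)
# [('el', 'DET'), ('gato', None), ('come', None), ('trigo', None)]
```

A single unknown word ("gato") leaves every following word untagged, because no training context starts with a None tag. NLTK's taggers accept a `backoff` argument, for example `BigramTagger(train, backoff=unigram_tagger)`, which lets the bigram tagger fall back to another tagger in such cases.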

Keep exploring the fascinating world of natural language processing! Each tool you master brings you closer to becoming an expert in this constantly evolving area.

Contributions 15

What I love is that the instructor already has everything planned and structured (you can tell from the previous roadmap). I understand that watching someone write code live makes things look more dynamic, but given the complexity of the topic, choosing to spend the time explaining how the tools work seems like the best decision to me ❤️

To avoid copying the tags by hand:

nltk.download('tagsets')
for _, tag in nltk.pos_tag(tokens):
    nltk.help.upenn_tagset(tag)

As a curious note: as of October 7, 2023, the evaluate method emits a warning. To avoid it, use the accuracy method like this:

```python
cess_sents = cess.tagged_sents()
fraction = int(len(cess_sents)*90/100)
uni_tagger = ut(cess_sents[:fraction])
#uni_tagger.evaluate(cess_sents[fraction+1:])  # evaluate() is deprecated
uni_tagger.accuracy(cess_sents[fraction+1:])
```

It also works for bigrams. I hope this helps. Great course.

This link covers the topic in more detail; it is from the NLTK website: Push Me

I was saying that there is an error at the start of the Spanish demonstration.
The correct order is:
cess_sents = cess.tagged_sents()
fraction = int(len(cess_sents)*90/100)
uni_tagger = ut(cess_sents[:fraction])
uni_tagger.evaluate(cess_sents[fraction+1:])

If you are using the course's Colab notebook…
to add more notes below " # @title …"
you can use " # @markdown " followed by the notes you want to add.

Why the “+1”?

bi_tagger = bt(cess_sents[:fraction])
bi_tagger.evaluate(cess_sents[fraction+1:])
nltk.download('tagsets')

for tag in ['CC', 'RB', 'PRP']:
  nltk.help.upenn_tagset(tag)

CC: conjunction, coordinating

RB: adverb

PRP: pronoun, personal

Excellent, what a great example.

I waited a long time for this course and it is really great.

Hi, I am getting this error. I have already restarted the kernel, uninstalled and reinstalled NLTK, and even deleted and recreated the folder, but nothing works. I also tried JupyterLab and I get the same error. What can I do? ![](https://static.platzi.com/media/user_upload/image-8492a0cd-9052-4694-9823-c2db19313c97.jpg)

### Tools for quick tagging in Python:

1. **NLTK (Natural Language Toolkit):**
   * Supports POS tagging in English.
   * For Spanish, you can use additional corpora and models trained with this library.
2. **spaCy:**
   * A fast, efficient library that supports POS tagging in both English and Spanish with pre-trained models.
   * Very popular for its performance and ease of use.

You should now use accuracy() instead of evaluate() to evaluate the tagger model.

uni_tagger = ut(cess_sents[:fraction])

uni_tagger.accuracy(cess_sents[fraction+1:])

Results with the unigram tagger:
[('Tres', 'dn0cp0'),
 ('tristes', 'aq0cp0'),
 ('tigres', None),
 ('tragaron', None),
 ('trigo', 'ncms000'),
 ('en', 'sps00'),
 ('un', 'di0ms0'),
 ('trigal', None)]

Results with the bigram tagger:
[('Tres', 'dn0cp0'),
 ('tristes', None),
 ('tigres', None),
 ('tragaron', None),
 ('trigo', None),
 ('en', None),
 ('un', None),
 ('trigal', None)]

Does anyone know how I can improve the tagging so that its accuracy is above 90%?