
Training an HMM in Python


How do you get started with training a hidden Markov model?

Getting started with training a hidden Markov model may seem complicated at first, but with the right steps the task becomes considerably simpler. We use Google Colab to write the code that runs this training. Initially, we focus on computing the necessary counts, setting up a system in which each dictionary specifically represents the frequency of tags, emissions, and transitions.

How to structure the initial dictionaries?

  • Tags: this dictionary counts how many times each tag appears in the corpus.
  • Emissions: this one records how many times a specific tag corresponds to a specific word.
  • Transitions: this one indicates how many times a previous tag is followed by another tag in the next position (see the sketch after this list).
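A rough sketch of what these dictionaries might contain after counting (the keys and values shown here are hypothetical; the actual tags depend on the tag set of the corpus):

# Hypothetical contents after counting (illustrative values only)
tag_counts = {"NOUN": 5240, "VERB": 3180, "DET": 2950}        # C(tag)
emission_counts = {"casa|NOUN": 112, "corre|VERB": 37}        # C(word, tag)
transition_counts = {"DET|NOUN": 2410, "NOUN|VERB": 1530}     # C(tag | previous tag)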

How to read the corpus and count tags?

Reading the corpus is an essential step, and using the universal part-of-speech tag convention is suggested. We use a double for loop, since each element of the corpus is a list of tokens, processing them one by one.

# Initialization of the tag dictionary
tag_counts = {}

# Read the corpus
for token_list in corpus:
    for token in token_list:
        # Get the current tag
        tag = token.tag_type  # Assuming that the token has this attribute

        # Count tags
        if tag in tag_counts:
            tag_counts[tag] += 1
        else:
            tag_counts[tag] = 1

How to calculate the emission and transition probabilities?

It is essential to establish the probabilities of two types: emission and transition. Both play a crucial role in the correct functioning of the model.
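For reference, these counts feed the standard maximum-likelihood estimates used to train an HMM, written here in the same C(·) count notation the lesson uses:

P(word|tag) = C(word, tag) / C(tag)
P(tag|prev_tag) = C(prev_tag, tag) / C(prev_tag)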

How to calculate emission probabilities?

Calculating emission probabilities requires creating a string that combines the lowercase word with its tag. As with tag counting, membership checks keep the counts accurate.

# Initialization of emission dictionary
emission_counts = {}

for token_list in corpus:
    for token in token_list:
        word = token.word.lower()
        tag = token.tag_type

        # Create word-tag pair as key
        pair = f"{word}|{tag}"

        if pair in emission_counts:
            emission_counts[pair] += 1
        else:
            emission_counts[pair] = 1

How to handle transition probabilities?

The computation of transitions adds another layer of complexity by involving the previous tag. As we traverse the corpus, we keep updating the previous tag so the counts correctly reflect each transition.

# Transition dictionary initialization
transition_counts = {}

for token_list in corpus:
    previous_tag = None
    for token in token_list:
        tag = token.tag_type

        if previous_tag is not None:
            prev_current_pair = f"{previous_tag}|{tag}"

            if prev_current_pair in transition_counts:
                transition_counts[prev_current_pair] += 1
            else:
                transition_counts[prev_current_pair] = 1

        previous_tag = tag

How are the calculated probabilities saved and used?

After getting all the counts, we transform them into transition and emission probabilities. Then we store these in usable files, which is essential for the prediction procedures that follow.
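A minimal sketch of that conversion, assuming the count dictionaries built above and their key formats "word|tag" and "previous_tag|tag":

# Convert raw counts into probabilities (sketch; assumes the dictionaries above)
transition_probabilities = {}
for pair, count in transition_counts.items():
    previous_tag = pair.split("|")[0]    # key format: "previous_tag|tag"
    transition_probabilities[pair] = count / tag_counts[previous_tag]

emission_probabilities = {}
for pair, count in emission_counts.items():
    tag = pair.split("|")[1]             # key format: "word|tag"
    emission_probabilities[pair] = count / tag_counts[tag]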

How to store the probabilities?

We use the NumPy library to store each set of probabilities in a separate file, making storage and later access efficient. This step is key to moving on to concrete model predictions.

import numpy as np

np.save("transition_probabilities.npy", transition_probabilities)
np.save("emission_probabilities.npy", emission_probabilities)
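One detail worth noting: np.save stores a Python dictionary as a 0-d object array, so loading it back requires allow_pickle=True plus a call to .item() to recover the dict:

import numpy as np

# Load the stored dictionaries back into plain Python dicts
transition_probabilities = np.load("transition_probabilities.npy", allow_pickle=True).item()
emission_probabilities = np.load("emission_probabilities.npy", allow_pickle=True).item()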

This process culminates in a complete, stored hidden Markov model, ready to be applied to word sequences to obtain the most likely corresponding tags. Go ahead and find out how to apply this powerful technique in real-world applications!
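As a naive preview of prediction (a sketch only, not the course's decoding method), one could look up the most probable tag for a single word from the emission dictionary; proper sequence decoding, such as Viterbi, additionally uses the transition probabilities:

def naive_tag(word, emission_probabilities):
    # Illustrative only: picks the tag with the highest emission probability
    # for the word in isolation, ignoring context.
    candidates = {pair.split("|")[1]: prob
                  for pair, prob in emission_probabilities.items()
                  if pair.split("|")[0] == word.lower()}
    return max(candidates, key=candidates.get) if candidates else None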

Contributions

For anyone whose transitionDict comes out empty, there is a small correction to make:

The instructor defines prevtag = None after the two for statements.
As a result, on every iteration the value of prevtag goes back to None, and the first iteration of the transitionDict logic is always repeated.

To fix it, define prevtag = None before the two for loops.

Regards.

To sort these dictionaries by the values of the probabilities found:

# transition sorted
sorted_transitionprobs = sorted(transitionProbDict.items(), key = lambda x: x[1], reverse=True)
sorted_transitionprobs


# Emission sorted
sorted_emissionprobs = sorted(emissionProbDict.items(), key = lambda x: x[1], reverse=True)
sorted_emissionprobs

I think that in the calculation of emissionProbDict the if condition should have been:

if tagCountDict[tag] > 0:

The Emission Probabilities equation has a small error: the numerator of the equation should be

C(word, tag)

instead of

C(word|tag)

What a great class!

![](https://static.platzi.com/media/user_upload/image-ed9dd285-c96e-40bc-9229-8a142ecbba9a.jpg)

Each line contains the following columns:
1. Index of the word in the sentence.
2. The original word.
3. The root form or lemma of the word.
4. The part of speech (POS).
5. The detailed POS tag.
6. Other syntactic details.
This whole topic is wild; I'm fascinated by computing. Honestly, I'm not the most skilled at math and algorithms, but what I've managed to understand has blown my mind. By the way, what a great teacher!

Seeing the quality of this class, the importance of the topic, and how few comments there are, you understand why this level of depth isn't kept up; it seems these generally aren't the most popular classes 😦.

Really, really good class

Corrected code

from conllu import parse_incr

tag_count_dict = {}
emission_dict = {}
transition_dict = {}

tagtype = 'upos'
data_file = open('UD_Spanish-AnCora/es_ancora-ud-dev.conllu', 'r', encoding='utf-8')

prevtag = None
for tokenlist in parse_incr(data_file):
    for token in tokenlist:

        # C(tag)
        tag = token[tagtype]
        if tag in tag_count_dict.keys():
            tag_count_dict[tag] += 1
        else:
            tag_count_dict[tag] = 1

        # C(word|tag) -> emission probabilities
        word_tag = token['form'].lower() + '|' + token[tagtype]
        if word_tag in emission_dict.keys():
            emission_dict[word_tag] += 1
        else:
            emission_dict[word_tag] = 1

        # C(tag|previous tag) -> transition probabilities
        if prevtag is None:
            prevtag = tag
            continue
        transitiontags = tag + '|' + prevtag
        if transitiontags in transition_dict.keys():
            transition_dict[transitiontags] += 1
        else:
            transition_dict[transitiontags] = 1
        prevtag = tag

For those who don't understand the concept of "the corpus": it is a sufficiently large collection of real text samples from a given language.

Nice!

Haha, and that's how a model boils down to two dictionaries 😃 very good