Classification models in Python: documents


In the class exercise, the "document_features" function has a small error, which is why the accuracy came out so low: every comparison returns False. After fixing it, the accuracy goes up to 96%. You only need to unpack the (word, frequency) tuple in the first for loop:

def document_features(document):
    document_words = set(document)
    features = {}
    # Unpack (word, frequency) so the membership test uses the word string
    for word, i in top_words:
        features['contains({})'.format(word)] = (word in document_words)
    return features
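
A quick way to sanity-check the corrected function (a minimal sketch; the sample email below is made up, and top_words is assumed to be the (word, frequency) list already built in the class notebook):

from nltk import word_tokenize

# With the tuple unpacked, frequent words are detected again
sample = word_tokenize("Congratulations, you have won a free prize, click now")
feats = document_features(sample)
print(sum(feats.values()))  # number of top words present in this email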

Using Corpus 1 and adding bigrams without much analysis did not improve the accuracy much (around 90% on average), but it did help me balance the number of ham and spam emails.

# Required imports for this solution
import nltk
from nltk import word_tokenize
nltk.download('punkt')

# Unzip the corpus
import zipfile
fantasy_zip = zipfile.ZipFile('/content/datasets/email/plaintext/corpus1.zip')
fantasy_zip.extractall('/content/datasets/email/plaintext')
fantasy_zip.close()

# Build lists of the ham/spam files inside corpus1
from os import listdir

path_ham = "/content/datasets/email/plaintext/corpus1/ham/"
filepaths_ham = [path_ham+f for f in listdir(path_ham) if f.endswith('.txt')]

path_spam = "/content/datasets/email/plaintext/corpus1/spam/"
filepaths_spam = [path_spam+f for f in listdir(path_spam) if f.endswith('.txt')]

# Function to read and tokenize each file

def abrir(texto):
  with open(texto, 'r', errors='ignore') as f2:
    data = f2.read()
    data = word_tokenize(data)
  return data

# Tokenized list of ham emails
list_ham = list(map(abrir, filepaths_ham))
# Tokenized list of spam emails
list_spam = list(map(abrir, filepaths_spam))

nltk.download('stopwords')

# Most common words across both classes
all_words = nltk.FreqDist([w for tokenlist in list_ham+list_spam for w in tokenlist])
top_words = all_words.most_common(250)

# Add bigrams
bigram_text = nltk.Text([w for token in list_ham+list_spam for w in token])
bigrams = list(nltk.bigrams(bigram_text))
top_bigrams = (nltk.FreqDist(bigrams)).most_common(250)


def document_features(document):
    document_words = set(document)
    # Bigrams present in this document
    document_bigrams = set(nltk.bigrams(document))
    features = {}
    for word, j in top_words:
        features['contains({})'.format(word)] = (word in document_words)

    for bigram, i in top_bigrams:
        features['contains_bigram({})'.format(bigram)] = (bigram in document_bigrams)

    return features

# Build the labeled feature sets (ham = 0, spam = 1)
import random
fset_ham = [(document_features(texto), 0) for texto in list_ham]
fset_spam = [(document_features(texto), 1) for texto in list_spam]
fset = fset_spam + fset_ham[:1500]
random.shuffle(fset)

# Split into train and test sets
from sklearn.model_selection import train_test_split
fset_train, fset_test = train_test_split(fset, test_size=0.20, random_state=45)

# Train the classifier
classifier = nltk.NaiveBayesClassifier.train(fset_train)

# Classify one example and evaluate accuracy
classifier.classify(document_features(list_ham[34]))
print(nltk.classify.accuracy(classifier, fset_test))
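
As a quick follow-up (assuming the classifier above trained without errors), NLTK can show which features carry the most weight, which helps check whether the bigram features add anything:

# Inspect the strongest spam/ham indicators learned by Naive Bayes
classifier.show_most_informative_features(20)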

In the document attribute function, the contains check is done against the whole tuple returned by top_words -> (word, frequency), which is why every contains output is False. You have to select only the first element of the tuple, word[0]:

def document_attrs(document, top_words=top_words):
    document_words = set(document)
    attr = {}
    for word in top_words:
        attr[f'contains ({word[0]})'] = (word[0] in document_words)
    
    return attr

With this, the performance improves a lot.

Greetings!

My solution to the challenge, using only the spam-apache.csv file.

Click here

I managed to get 0.96 accuracy.

This spam/not-spam problem is one of the cases where the concept of imbalanced data matters. Although this dataset has 250 emails (125 spam, 125 not spam), in practice it is common to have thousands of emails that are not spam and only a few dozen that are, so one class (spam) is clearly underrepresented. If we then classify a new email, the model will most likely label it as "not spam". I recommend looking into the techniques used for this kind of situation, such as undersampling, oversampling, etc.
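
For reference, a minimal sketch of random undersampling with scikit-learn; it assumes fset_ham and fset_spam are the (features, label) lists built in the solutions above, and libraries such as imbalanced-learn offer more complete tools:

from sklearn.utils import resample

# Downsample the majority class (ham) to the size of the minority class (spam)
fset_ham_down = resample(fset_ham, replace=False,
                         n_samples=len(fset_spam), random_state=42)
fset_balanced = fset_spam + fset_ham_down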

I got 92% using bigrams and ShuffleSplit cross-validation.
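
A minimal sketch of how ShuffleSplit can be combined with the NLTK classifier, assuming fset is the list of (features, label) pairs built earlier in the thread:

import numpy as np
import nltk
from sklearn.model_selection import ShuffleSplit

# Average accuracy over several random train/test splits
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = []
for train_idx, test_idx in ss.split(fset):
    train = [fset[i] for i in train_idx]
    test = [fset[i] for i in test_idx]
    clf = nltk.NaiveBayesClassifier.train(train)
    scores.append(nltk.classify.accuracy(clf, test))
print(np.mean(scores))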

My solution to the challenge, with 98% accuracy.
I used the 3 plain-text corpora, collocations, and the most frequent words to build attributes.

import os
import nltk
import random
from nltk import word_tokenize
from nltk.collocations import *
import pandas as pd
nltk.download("punkt")

!git clone https://github.com/pachocamacho1990/datasets
! unzip datasets/email/plaintext/corpus1.zip
! unzip datasets/email/plaintext/corpus2.zip
! unzip datasets/email/plaintext/corpus3.zip

Functions to load the datasets

# Get Text and labels from folders with plain text files
def get_text_labels_from_folders(folderBase, folderLabels):
  data = []
  labels = []

  for folderLabel in folderLabels:
    for file in os.listdir('{}/{}'.format(folderBase, folderLabel)):
      with open('{}/{}/{}'.format(folderBase, folderLabel, file), encoding='latin-1') as f:
        data.append(f.read())
        labels.append(folderLabel)

  return data, labels

def set_label_num(label_str):
  if label_str == "spam":
      return 1
  else:
      return 0
 
dataCorpus1, labelsCorpus1 = get_text_labels_from_folders('corpus1', ["spam", "ham"])
dataCorpus2, labelsCorpus2 = get_text_labels_from_folders('corpus2', ["spam", "ham"])
dataCorpus3, labelsCorpus3 = get_text_labels_from_folders('corpus3', ["spam", "ham"])
data = dataCorpus1 + dataCorpus2 + dataCorpus3
labels = labelsCorpus1 + labelsCorpus2 + labelsCorpus3

dataframe = pd.DataFrame({'text': data, 'labels': labels})
dataframe = dataframe.sample(frac = 1) 
dataframe['tokens'] = dataframe['text'].apply(lambda x: word_tokenize(x))
dataframe['labels_num'] = dataframe['labels'].apply(lambda x: set_label_num(x))

Functions to filter words and get n-gram collocations

def filter_words_by_threshold(text_tokenized, threshold=3):
  words = [word for word in text_tokenized if len(word) > threshold]
  return words

def get_n_grams_collocations_from_words(words, freq_filter=10, n_best=10,
                                        n_gram_measure=nltk.collocations.BigramAssocMeasures()):
  finder = BigramCollocationFinder.from_words(words)
  finder.apply_freq_filter(freq_filter)
  email_spam_collocations = finder.nbest(n_gram_measure.pmi, n_best)
  return email_spam_collocations

Get the collocations and most common words in the spam dataset

spamCorpus1, _ = get_text_labels_from_folders('corpus1', ["spam"])
spamCorpus2, _ = get_text_labels_from_folders('corpus2', ["spam"])
spamCorpus3, _ = get_text_labels_from_folders('corpus3', ["spam"])
spamCorpuses = spamCorpus1 + spamCorpus2 + spamCorpus3

filtered_words = []
for text in spamCorpuses:
  filtered_words += filter_words_by_threshold(word_tokenize(text))
filtered_words

email_spam_collocations = get_n_grams_collocations_from_words(filtered_words, 120, 40)
all_spam_words = nltk.FreqDist([w for w in filtered_words])
top_spam_words = all_spam_words.most_common(200)

Get the collocations and most common words in the ham dataset

hamCorpus1, _ = get_text_labels_from_folders('corpus1', ["ham"])
hamCorpus2, _ = get_text_labels_from_folders('corpus2', ["ham"])
hamCorpus3, _ = get_text_labels_from_folders('corpus3', ["ham"])
hamCorpuses = hamCorpus1 + hamCorpus2 + hamCorpus3

filtered_words = []
for text in hamCorpuses:
  filtered_words += filter_words_by_threshold(word_tokenize(text))
filtered_words

email_ham_collocations = get_n_grams_collocations_from_words(filtered_words, 120, 40)
all_ham_words = nltk.FreqDist([w for w in filtered_words])
top_ham_words = all_ham_words.most_common(200)

Filter out words that appear in both the spam and ham top-word lists.

# Iterate over a copy so removing items does not skip elements.
# Note: entries are (word, count) tuples, so only exact matches are removed.
for word in list(top_ham_words):
  if word in top_spam_words:
    top_ham_words.remove(word)
    top_spam_words.remove(word)
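
An alternative sketch that avoids mutating the lists while iterating over them: compare only the word strings with set operations (assumes top_spam_words and top_ham_words are the (word, count) lists built above):

# Keep only words that are frequent in one class but not in the other
spam_set = {w for w, _ in top_spam_words}
ham_set = {w for w, _ in top_ham_words}
spam_only_words = spam_set - ham_set
ham_only_words = ham_set - spam_set

spam_only_words and ham_only_words could then feed the contains_spam_word / contains_ham_word features directly.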

Get the most common words across the whole dataset

filtered_words = []
for text in data:
  filtered_words += filter_words_by_threshold(word_tokenize(text))
filtered_words
all_words = nltk.FreqDist([w for w in filtered_words])
top_words = all_words.most_common(200)
top_words

Get the attributes

def document_attributes(document):
  document_words = set(document)
  atrib = {}
  # top_words / top_spam_words / top_ham_words hold (word, count) tuples,
  # so unpack them before the membership test
  for word, _ in top_words:
    atrib['contains({})'.format(word)] = (word in document_words)

  for word, _ in top_spam_words:
    atrib['contains_spam_word({})'.format(word)] = (word in document_words)

  for word, _ in top_ham_words:
    atrib['contains_ham_word({})'.format(word)] = (word in document_words)

  # Flag words that take part in a spam or ham collocation
  for word in document_words:
    has_spam_word = False
    has_ham_word = False

    for bigram_position_0, bigram_position_1 in email_spam_collocations:
      if word == bigram_position_0 or word == bigram_position_1:
        has_spam_word = True
        break

    for bigram_position_0, bigram_position_1 in email_ham_collocations:
      if word == bigram_position_0 or word == bigram_position_1:
        has_ham_word = True
        break

    atrib['spam_word({})'.format(word)] = has_spam_word
    atrib['ham_word({})'.format(word)] = has_ham_word

  # Bigram collocations found within this document
  filtered_words = filter_words_by_threshold(document)
  bigrams = get_n_grams_collocations_from_words(filtered_words, n_best=10, freq_filter=5)
  for i in range(len(bigrams)):
    atrib['bigram_collocation({})'.format(i)] = bigrams[i]

  return atrib

Split into training and test datasets

fset = [(document_attributes(text), labels) for text, labels in zip(dataframe['tokens'], dataframe['labels_num'].values)]
random.shuffle(fset)
print(len(fset))
train, test = fset[:13078], fset[13078:]

Train and compute accuracy

classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))

That's all for my report, Joaquín.

To the exercise proposed by the teacher, I added this attribute:

atrib['count({})'.format(word[0])] = document.count(word[0])

And the classifier improved; we are at 0.84.
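
For context, this is roughly how the extra count feature fits into the attribute function (a sketch, assuming top_words is the (word, frequency) list from the class notebook):

def documento_atributos(document):
    document_words = set(document)
    atrib = {}
    for word in top_words:
        # Presence of the word plus how many times it appears
        atrib['contains({})'.format(word[0])] = (word[0] in document_words)
        atrib['count({})'.format(word[0])] = document.count(word[0])
    return atrib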

It really disappoints me that Platzi lets such huge (and obvious) errors slip through (once again), like the one in the document_features function. With an error like that, the whole exercise is wrong. Everyone can make mistakes, but you have to know how to correct them. We are paying for this!

Since the sample is so small, depending on random.shuffle the results can vary by 2-6% each time the model is trained on the same dataset.

As several classmates point out, just changing the attribute function improves the accuracy enormously:

def documento_atributos(document):
  """Return contains(word) features using only the word string from each tuple."""

  document_words = set(document)
  atrib = {}

  for word in top_words:
    atrib['contains({})'.format(word[0])] = (word[0] in document_words)

  return atrib

This is how I imported all the texts into a DataFrame to continue with the process. It takes a few minutes (it could surely be optimized), but it is a first step. (I am not convinced by the tokenizer "[\w+.]+".)

from os import listdir
from zipfile import ZipFile
import nltk
import pandas as pd
from nltk.tokenize import RegexpTokenizer
nltk.download('stopwords')
from nltk.corpus import stopwords
stopword = stopwords.words('english')

corp_path = '/content/datasets/email/plaintext/'
files_path = [corp_path + f for f in listdir(corp_path)]

tokenizer = RegexpTokenizer(r"[\w+.]+")

# Collect the rows in a list and build the DataFrame at the end
# (DataFrame.append is deprecated in recent pandas versions)
rows = []
for folder in files_path:
  zf = ZipFile(folder)
  files = [f for f in zf.namelist() if f.endswith('.txt')]
  for file_name in files:
    spam_ham = -1 if file_name.endswith('spam.txt') else 1
    read = zf.open(file_name).read().decode("ISO-8859-1").lower()
    tokens = tokenizer.tokenize(read)
    token_free = [word for word in tokens if word not in stopword]
    rows.append({'clase': spam_ham, 'token': token_free})

df = pd.DataFrame(rows, columns=["clase", "token"])
print(df)
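
As a possible next step (a sketch only, reusing the corrected documento_atributos-style feature function from the earlier posts and rebuilding top_words from the DataFrame), the tokenized DataFrame can be turned into NLTK feature sets and used to train the same Naive Bayes classifier:

import random
import nltk

# Rebuild the frequent-word list from the tokenized DataFrame
all_words = nltk.FreqDist(w for tokens in df['token'] for w in tokens)
top_words = all_words.most_common(250)

# (features, label) pairs, where 'clase' is -1 for spam and 1 for ham
fset = [(documento_atributos(tokens), clase)
        for tokens, clase in zip(df['token'], df['clase'])]
random.shuffle(fset)

cut = int(len(fset) * 0.8)
train, test = fset[:cut], fset[cut:]
classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))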