
Naive Bayes in Python: Data Preparation


How do we prepare data for Naive Bayes modeling in Python?

Text modeling requires meticulous data preparation, and today we will dive into how to perform it in order to implement a Naive Bayes algorithm in Python. We will take advantage of the Google Colab environment to write and run the necessary code, starting with the setup and the extraction of the data from ZIP files. This process is essential for anyone who wants to build an email classification model that distinguishes spam from non-spam (ham). Dive into the fascinating world of natural language processing!
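If you are working in Colab, the ZIP first has to reach the session's file system. Here is a minimal sketch, assuming you upload it interactively; the file name corpus.zip is a hypothetical placeholder, not something stated in the class:

# Runs only inside Google Colab; opens a file-picker in the notebook
from google.colab import files

uploaded = files.upload()       # returns a dict: {filename: file bytes}
print(list(uploaded.keys()))    # e.g. ['corpus.zip'] (hypothetical name)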

What libraries do we need to get started?

To develop the project correctly, we need certain libraries that are crucial for computing probabilities and manipulating files. Here are the main ones:

  • math: the standard-library module for mathematical functions such as logarithms, which are essential in the Naive Bayes method (a short sketch follows this list).
  • os: lets us manage the file system, listing and reading each email stored in the directory.
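
Why logarithms? Naive Bayes multiplies many small word probabilities, and that product underflows toward zero for long messages; summing logarithms is numerically stable. A minimal sketch with made-up numbers:

import math

# Toy per-word likelihoods for one message under the "spam" class (made-up values)
word_probs = [0.02, 0.001, 0.05, 0.003]

# The raw product shrinks fast; thousands of words would underflow to 0.0 ...
product = 1.0
for p in word_probs:
    product *= p

# ... so implementations sum logs instead; a larger (less negative) score wins
log_score = sum(math.log(p) for p in word_probs)
print(product, log_score)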

How do we organize and read the data?

To train the model effectively, it is vital to handle the data correctly. We will use a corpus of emails, each stored in a separate file. Our data handling can be summarized in the following steps:

  • File extraction: from a ZIP archive into individual folders ("spam" and "ham"), where each file represents one email (see the sketch after this list).
  • List creation: to store both the texts and the classes (labels) of the emails.
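
A minimal extraction sketch using the standard-library zipfile module, assuming the archive is named corpus.zip and already contains the spam and ham folders (both names are assumptions, taken from the paths used in the reading code below):

import zipfile

# Unpack the hypothetical corpus.zip so that corpus/spam and corpus/ham exist
with zipfile.ZipFile("corpus.zip") as z:
    z.extractall("corpus")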

The reading and labeling step is detailed in the following code:

import os

# We initialize lists for the raw email texts and their class labels
data = []
classes = []

# Reading spam files
for file in os.listdir("corpus/spam"):
    with open(f"corpus/spam/{file}", encoding="latin1") as f:
        data.append(f.read())
        classes.append("spam")

# Reading ham files
for file in os.listdir("corpus/ham"):
    with open(f"corpus/ham/{file}", encoding="latin1") as f:
        data.append(f.read())
        classes.append("ham")

With the above code, we load and label more than five thousand emails for our corpus.
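
As a quick sanity check, both lists grow in lockstep, so their lengths must match, and for this corpus each should exceed five thousand:

# Both counts must be equal; each entry in data has a label in classes
print(len(data), len(classes))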

How do we introduce the spaCy library and what are its benefits?

Innovation in natural language processing (NLP) is key, and spaCy is an exceptional tool for the task. Integrating this library brings robustness and speed to your NLP pipeline.

Some of the benefits of spaCy include:

  • Advanced tokenization: splits text into smaller units (tokens), greatly improving analysis (see the sketch after this list).
  • Scalable NLP models: make it easy to move projects from development to production efficiently.
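
A minimal tokenization sketch, using a blank English pipeline so no pretrained model download is needed (the class may load a full pipeline instead):

import spacy

# Blank pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

doc = nlp("Congratulations! You won a FREE prize, claim it now.")
print([token.text for token in doc])
# → ['Congratulations', '!', 'You', 'won', 'a', 'FREE', 'prize', ...]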

Coming up in our next class, we will further explore the spaCy tokenizer and how to effectively integrate it into our NLP pipeline. Keep discovering, don't stop!

Contributions

Libraries for NLP

NLTK: This is the library everyone starts with; it is very useful for preprocessing, tokenizing, stemming, POS tagging, etc.
https://www.nltk.org/

TextBlob: built on top of NLTK and easy to use. It includes some additional functionality such as sentiment analysis and spell checking.
https://textblob.readthedocs.io/en/dev/

Gensim: built specifically for topic modeling and includes multiple techniques (LDA and LSI). It also computes document similarity.
https://radimrehurek.com/gensim/

SpaCy: it can do many of the same things as NLTK, but it is considerably faster.
https://spacy.io/

I adapted the code from the spam class to distinguish design texts from technical texts and achieved 88% accuracy; I wonder whether I will do better with the code from the upcoming classes.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Example data
documents = ['Este es un documento de ejemplo',
             'El clima es agradable hoy',
             'Spam correo electrónico está llegando',
             'Este es un mensaje importante, no es spam']
labels = ['ham', 'ham', 'spam', 'ham']

# Turn the documents into a bag-of-words count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
y = labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy: {accuracy:.2f}')

Excellent!