Vector Representation: Bag-of-Words and TF-IDF


Sentiment classification in comments is a fundamental task in textual data analysis. Using natural language processing and machine learning techniques, we can transform written opinions into valuable information for decision making. In this content, we will explore how to convert clean text into vector representations that allow us to train effective classification models.

How to prepare our data for model training?

Before starting the vectorization and training process, it is crucial to ensure that our data is properly prepared. To optimize performance, we will use a T4 GPU in Google Colab, which will significantly speed up processing times.

To load our previously cleaned dataset, we follow these steps:

  1. Access the "Files" section in Google Colab.
  2. Select "Upload" and locate the "review_clean_advance" file.
  3. Working with compressed files (.zip or .rar) is recommended to speed up the upload.

Once the file is loaded, we proceed to unzip it and verify its content:

# Import pandas for data handling
import pandas as pd

# Unzip the file
!unrar x file.rar

# Load the dataset
filter_data = pd.read_csv('path_to_file')

# Visualize the first rows
filter_data.head(3)

# Verify null values
filter_data.isnull().sum()

A crucial aspect is the verification of null values. In our case, we detected a single row with a null value, which we decided to remove due to the abundance of available data:

# Remove rows with null values
filter_data = filter_data.dropna()
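
To confirm the cleanup, a quick sanity check (a minimal sketch; after dropna(), every column should report zero nulls) can follow:

# Re-check nulls after dropping: every column should report 0
print(filter_data.isnull().sum())

# Confirm how many rows remain
print(f"Rows remaining: {filter_data.shape[0]}")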

What methods exist to convert text to vector representations?

Transforming text to numeric vectors is essential for machine learning algorithms to process textual information. We will analyze two main methods:

Bag of Words (BoW)

This method converts text into numerical representations by counting word frequencies:

  • How it works: it counts how many times each word appears in a document (row of our dataset).
  • Main features:
    • Ignores word order and grammar.
    • Each document is represented as a vector where each dimension corresponds to a vocabulary word.
    • The value in each dimension is the frequency of that word in the document, as in the sketch below.
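
To make this concrete, here is a minimal sketch of Bag of Words on a two-sentence toy corpus (the sentences are invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Two invented example documents
toy_corpus = ["the food was great", "the service was slow"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(toy_corpus)

# Vocabulary: ['food' 'great' 'service' 'slow' 'the' 'was']
print(vectorizer.get_feature_names_out())

# Each row counts word occurrences in one document:
# [[1 1 0 0 1 1]
#  [0 0 1 1 1 1]]
print(matrix.toarray())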

Term Frequency-Inverse Document Frequency (TF-IDF)

This method improves the Bag of Words representation by weighting the frequencies:

  • How it works: it combines two metrics:
    • TF (Term Frequency): frequency of a word in a document.
    • IDF (Inverse Document Frequency): inverse of how many documents contain that word.
  • Advantages:
    • Reduces the influence of common words (such as "the", "a", "and").
    • Gives more importance to more informative and distinctive terms.
    • Improves the quality of the vector representation for classification tasks (see the sketch after this list).
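
Running TF-IDF on the same invented toy corpus (again, a minimal sketch) shows this weighting in action: "the" and "was" appear in both documents, so they receive lower weights than distinctive words such as "great" or "slow":

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the food was great", "the service was slow"]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(toy_corpus)

# With default settings, the shared words ('the', 'was') score about 0.41,
# while the distinctive words ('food', 'great', 'service', 'slow') score about 0.58
print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))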

How to implement these vectorization techniques in practice?

To implement these methods we will use the scikit-learn library, which provides efficient tools for text processing:

# Import the necessary tools
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Define the corpus (set of documents)
corpus = filter_data['text_column'].tolist()

# Implement Bag of Words
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(corpus)

# View dimensions and features
print(f"Number of documents: {bow_matrix.shape[0]}")
print(f"Vocabulary size: {len(bow_vectorizer.get_feature_names_out())}")
print(f"First 10 words of the vocabulary: {bow_vectorizer.get_feature_names_out()[:10]}")

# Implement TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# View dimensions and features
print(f"Number of documents: {tfidf_matrix.shape[0]}")
print(f"Vocabulary size: {len(tfidf_vectorizer.get_feature_names_out())}")

Both methods generate matrices with the same number of documents (rows of our dataset) and the same vocabulary size (unique words found across the whole corpus). The difference lies in how the values within these matrices are calculated.
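
To see that difference directly, we can inspect the same document in both matrices (a minimal sketch, reusing the vectorizers and matrices from the block above; both share the same vocabulary because they were fit on the same corpus with default settings):

# Compare the first document under both representations
doc_index = 0
bow_row = bow_matrix[doc_index].toarray()[0]
tfidf_row = tfidf_matrix[doc_index].toarray()[0]

# BoW holds raw counts; TF-IDF holds weighted, normalized values
for word, count, weight in zip(bow_vectorizer.get_feature_names_out(), bow_row, tfidf_row):
    if count > 0:
        print(f"{word}: count={count}, tfidf={weight:.3f}")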

These vector representations will be the basis for training classification models that can determine whether a comment expresses positive or negative sentiment. In the next steps, we will also be able to use these matrices to identify relevant themes that appear across all reviews.
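
As a preview of that training step, here is a minimal sketch of fitting a classifier on the TF-IDF matrix; it assumes a hypothetical 'sentiment' label column in filter_data (the column name and model choice are illustrative, not prescribed by this lesson):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hypothetical label column; adjust to your dataset's actual name
labels = filter_data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    tfidf_matrix, labels, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")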

Text vectorization is a fundamental step in natural language processing that allows us to transform qualitative information into quantitative data that can be processed by machine learning algorithms. Have you used any of these techniques in your projects? Share your experience in the comments.
