Environment Setup and Data Exploration

Data visualization is a powerful tool for understanding customer perception of our products. Word clouds, in particular, provide an immediate visual representation of the most frequent terms in reviews, allowing us to quickly identify trends and sentiment. In this article, we will explore how to build a word cloud from Spanish Amazon reviews, using Python data analysis tools.

How to create a word cloud from customer reviews?

To start our analysis of Amazon reviews, we will use Google Colab, a platform that allows us to run Python code in the cloud. In this first part a CPU runtime is enough, although later phases may require a GPU to speed up training.

The process starts with loading and exploring the dataset. We will follow these steps:

  1. Upload the compressed archive (.rar) containing our data.
  2. Extract the archive using the !unrar shell command.
  3. Read the resulting CSV file using pandas.
  4. Explore the main features of the dataset.

# Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Extract the archive (shell command run from the Colab notebook)
!unrar e file.rar

# Load the dataset
df = pd.read_csv('review_dataframe_complete.csv')

# Display the first rows
df.head(3)
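
Step 4 of the list above, exploring the main features of the dataset, usually goes a bit further than `head()`. The following is a minimal sketch on a small made-up DataFrame, not the lesson's actual file; the column names `stars`, `body` and `category` are assumptions chosen to resemble the dataset described in this article.

```python
import pandas as pd

# Toy stand-in for the reviews file; rows and column names are illustrative assumptions
df = pd.DataFrame({
    "stars": [1, 5, 3, 2],
    "body": ["no sirve", "me encantaron los auriculares", "normal", None],
    "category": ["electronics", "wireless", "toy", "home"],
})

n_rows, n_cols = df.shape   # number of rows and columns
dtypes = df.dtypes          # data type of each column
missing = df.isna().sum()   # count of missing values per column

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(missing)
```

On the real file, the same three calls give a quick view of the dataset's size, column types and missing values before any plotting.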

What information does our review dataset contain?

The dataset we are using contains Amazon product reviews in Spanish, with the following columns:

  • ID: Unique identifier of the user who made the review.
  • Product: Identifier of the product reviewed.
  • Reviewer: Name of the reviewer.
  • Stars: Rating assigned (from 1 to 5 stars).
  • Body: Full text of the review.
  • Title: Title of the review.
  • Language: Language of the review (Spanish in this case).
  • Category: Type of product (appliances, toys, etc.).

We can scan both the first and last rows to get an idea of the content:

# First 3 rows
df.head(3)

# Last 3 rows
df.tail(3)

In the examples we see, there are TV reviews with comments like "no good, the screen is gone" or "horrible, we had to buy another one. Money down the drain", as well as products from other categories such as "toys" or "wireless devices" with more positive comments such as "I loved the headset".

How to analyze the distribution of scores and categories?

To better understand our data, it is important to visualize the distribution of ratings and product categories. This will give us an overview of customer satisfaction and the types of products most reviewed.

Distribution of scores

We can create a bar chart to visualize the distribution of stars:

plt.figure(figsize=(8, 4))
sns.countplot(x='stars', data=df)
plt.title('Distribution of scores')
plt.xlabel('Stars')
plt.ylabel('Quantity')
plt.show()

The result shows approximately 40,000 reviews with one star, another 40,000 with two stars, and so on for the remaining ratings. This indicates that our dataset is balanced across star ratings, which is ideal for analysis because there is no bias towards high or low scores.
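
The balance seen in the chart can also be checked numerically. Below is a minimal sketch on made-up ratings, assuming the rating column is named `stars` as in the plotting code; `imbalance_ratio` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd

# Toy ratings, two reviews per star value (illustrative data only)
stars = pd.Series([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], name="stars")

# Absolute and relative frequency of each rating, ordered from 1 to 5
counts = stars.value_counts().sort_index()
shares = stars.value_counts(normalize=True).sort_index()

# Ratio between the most and least frequent rating; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(imbalance_ratio)
```

With the real data, replacing `stars` with `df['stars']` would show whether each rating holds close to one fifth of the reviews.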

Distribution of categories

We can also analyze the distribution of product categories:

# Count categories
category_counts = df['category'].value_counts()

# Take the 9 most frequent categories and group the rest as "Others"
top_categories = category_counts.iloc[:9].index
df['category_grouped'] = df['category'].apply(lambda x: x if x in top_categories else 'Others')

# Visualize (color= sets a single bar color; 'skyblue' is a color name, not a palette name)
plt.figure(figsize=(10, 6))
sns.countplot(x='category_grouped', data=df, color='skyblue')
plt.title('Product distribution: Top 9 plus Others')
plt.xlabel('Categories')
plt.ylabel('Quantity')
plt.xticks(rotation=45)
plt.show()

In this visualization, we can see that the most frequent categories include 'home', 'wireless' and 'toys', while the 'Others' group contains approximately 80,000 reviews, indicating a great diversity of product categories in our dataset.
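
The "top categories plus Others" grouping can also be written without a row-by-row `apply`. The sketch below is one possible vectorized variant on made-up data, using `Series.isin` and `Series.where`; the category values and the choice of `top_n = 2` are assumptions made to keep the example small.

```python
import pandas as pd

# Toy category column (illustrative values only)
categories = pd.Series(
    ["home", "home", "home", "wireless", "wireless", "toy", "book"],
    name="category",
)

top_n = 2  # the lesson uses 9; 2 keeps this toy example readable

# Labels of the top_n most frequent categories
top_categories = categories.value_counts().nlargest(top_n).index

# Keep values that are in the top list, replace everything else with 'Others'
grouped = categories.where(categories.isin(top_categories), "Others")
grouped_counts = grouped.value_counts()

print(grouped_counts)
```

On large tables this form avoids calling a Python function once per row, which tends to be faster, though both approaches produce the same grouping.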

What's next in our analysis?

So far, we have managed to load and explore our dataset of Amazon reviews. We know the structure of the data, the distribution of scores and the main product categories. This is the first fundamental step in building our word cloud.

In the next phase, we will dig deeper into the content of the reviews, analyzing the text to identify patterns, sentiment and keywords that will help us better understand customers' perception of the products.
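
As a preview of that text analysis, a word cloud is essentially a picture of word frequencies. The sketch below counts words in a few made-up Spanish reviews using only the standard library; the sample texts, the tiny stopword set and the helper name `word_frequencies` are all illustrative assumptions, not the lesson's actual code.

```python
import re
from collections import Counter

# Made-up sample reviews (illustrative only)
reviews = [
    "No sirve, la pantalla se fue",
    "Me encantaron los auriculares",
    "La pantalla es horrible",
]

# Tiny hand-picked stopword set; a real analysis would use a fuller list
stopwords = {"la", "los", "se", "me", "es", "no"}


def word_frequencies(texts, stopwords):
    """Count lowercase word occurrences across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including accented ones
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


freqs = word_frequencies(reviews, stopwords)
print(freqs.most_common(3))
```

A frequency mapping like `freqs` is the kind of input a rendering library such as `wordcloud` can turn into an image, for example through its `WordCloud.generate_from_frequencies` method.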

Data exploration is just the beginning of our analytical journey. With these basics in place, we will be ready to apply more advanced natural language processing techniques that will allow us to extract valuable insights from customer opinions.

Have you used word clouds to analyze customer feedback? Share your experiences and results in the comments section.

Contributions 3

If you want to add each bar's count on top of it, here is the code:

```python
plt.figure(figsize=(8, 4))
ax = sns.countplot(x='stars', data=original_data)
# Write each bar's height (the count) centered just above the bar
for patch in ax.patches:
    height = patch.get_height()
    ax.annotate(f'{int(height)}',
                (patch.get_x() + patch.get_width() / 2, height),
                ha='center', va='bottom')
plt.title('Distribución de Puntuaciones')
plt.show()
```

A very visual way to see the most relevant words!
I never thought creating a word cloud would be so simple.