Initial Preprocessing


Text processing is a fundamental skill in data analysis, especially when working with user reviews. Proper text cleaning not only improves the quality of our analysis, but also optimizes computational resources by focusing only on relevant information. Regular expressions (Regex) are powerful tools that allow us to manipulate text efficiently, identifying specific patterns that we want to preserve or eliminate.

How to prepare a review dataset for analysis?

Before performing any content analysis on reviews, it is crucial to perform proper preprocessing. This initial step helps us highlight relevant content and reduce computational cost by eliminating unnecessary information.

To begin, we must select the columns that we really need. In this case, we will focus on:

  • The "review_body" column (the review text).
  • The "start" column (the score).

It is recommended to work with a copy of the original dataframe to keep the original data intact:

filter_data = df.copy()

Then, we can check the first rows to understand the structure of our data:

filter_data[['review_body', 'start']].head(3)

Checking for null values

A fundamental step is to check for null values in our dataset:

filter_data.isnull().sum()

If we do not find null values, we can proceed directly to text cleaning. Otherwise, we should implement strategies to handle these missing values.
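If nulls do appear, two common strategies are dropping the affected rows or filling the missing text with an empty string so the cleaning function still works. A minimal sketch on an invented mini-dataset (the column names match the lesson, the data is ours):

```python
import pandas as pd

# Hypothetical mini-dataset with one missing review
df = pd.DataFrame({
    "review_body": ["Great product", None, "Not worth it"],
    "start": [5, 4, 2],
})

# Option 1: drop rows where the review text is missing
dropped = df.dropna(subset=["review_body"])

# Option 2: fill missing reviews with an empty string
filled = df.fillna({"review_body": ""})

print(len(dropped))                      # 2 rows remain
print(filled["review_body"].tolist())    # ['Great product', '', 'Not worth it']
```

Which option is better depends on the analysis: dropping loses the score information in that row, while filling keeps it at the cost of an empty review.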

How to use Regex to clean up review texts?

Regular expressions (Regex) are sequences of characters that define search patterns. With them, we can describe, identify, and manipulate text strings efficiently. They are especially useful for finding patterns such as @ mentions, URLs, or HTML tags.
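As a quick illustration (the sample string here is invented, not from the dataset), `re.sub` replaces every match of a pattern with a substitute, here the empty string:

```python
import re

sample = "Loved it! Details at https://example.com <br> @user123"

# Strip the URL: https?://\S+ matches "http(s)://" followed by non-space chars
print(re.sub(r"https?://\S+", "", sample))

# Strip the HTML tag: <.*?> matches the shortest <...> span
print(re.sub(r"<.*?>", "", sample))
```

This pattern-then-substitute idiom is exactly what the cleaning function below chains together, one transformation per pattern.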

To implement cleaning, we need to import the appropriate libraries:

import re
import string

Then, we create a function that applies all the necessary transformations:

def clean(text):
    # Convert to lowercase
    text = text.lower()
    # Remove text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove punctuation marks
    text = re.sub(r'[{}]'.format(re.escape(string.punctuation)), '', text)
    # Remove line breaks
    text = re.sub(r'\n', '', text)
    # Remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)
    # Remove emojis and other non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Remove leading and trailing whitespace
    text = text.strip()
    return text

Applying the cleanup function

Once our function is defined, we apply it to the column containing the reviews:

filter_data['clean_review'] = filter_data['review_body'].apply(clean)

This process may take a few seconds depending on the size of the dataset. In the example, it took approximately 8 seconds.

What specific transformations do we perform on the text?

Our clean function performs several important transformations:

  1. Lowercase conversion: Standardizes all text for easier analysis.
  2. Removal of bracketed text: Removes content that is usually metadata or clarifications.
  3. Removal of URLs: Web addresses rarely add to the sentiment or theme of the review.
  4. HTML tag removal: Cleans up markup text that may be present in reviews copied from websites.
  5. Punctuation mark removal: Simplifies text for further analysis.
  6. Line break removal: Normalizes text formatting.
  7. Removal of words with numbers: Removes references to users or codes that do not contribute to the semantic content.
  8. Removal of emojis and special characters: We focus on pure textual content.
  9. Whitespace trimming: Removes leading and trailing spaces for more consistent text.
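The steps above can be checked end to end by running the cleaning function on an invented review (the sample text is ours, not from the dataset):

```python
import re
import string

def clean(text):
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r'\[.*?\]', '', text)                   # 2. bracketed text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)     # 3. URLs
    text = re.sub(r'<.*?>', '', text)                     # 4. HTML tags
    text = re.sub(r'[{}]'.format(re.escape(string.punctuation)), '', text)  # 5. punctuation
    text = re.sub(r'\n', '', text)                        # 6. line breaks
    text = re.sub(r'\w*\d\w*', '', text)                  # 7. words with numbers
    text = re.sub(r'[^\x00-\x7F]+', '', text)             # 8. emojis / non-ASCII
    return text.strip()                                   # 9. surrounding whitespace

raw = "Amazing! [edited]\nSee www.example.com <b>5stars</b> 😀"
print(clean(raw))  # → "amazing see"
```

Note that the URL, the HTML tags, the bracketed note, the token "5stars", and the emoji are all gone, leaving only the plain lowercase words.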

When comparing the original review with the cleaned one, we can observe significant differences such as the conversion to lowercase and the elimination of punctuation marks, which highlights the most relevant part of each comment.

Text preprocessing is a fundamental step that should not be underestimated in any textual data analysis project. With these techniques, you will be prepared to extract valuable information from your review datasets and get more accurate and relevant insights.

Have you used regular expressions in your data analysis projects? Share your experiences and doubts in the comments section.

Contributions

In the clean method, what I did was compile the regular expressions outside the function, to avoid recompiling them on every call.

```python
BRACKETS_RE = re.compile(r'\[.*?\]')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_TAG_RE = re.compile(r'<.*?>+')
PUNCTUATION_RE = re.compile('[%s]' % re.escape(string.punctuation))
NEWLINE_RE = re.compile(r'\n')
DIGIT_WORDS_RE = re.compile(r'\w*\d\w*')
NON_ASCII_RE = re.compile(r'[^\x00-\x7F]+')

def clean(text: str) -> str:
    text = str(text).lower()
    text = BRACKETS_RE.sub('', text)
    text = URL_RE.sub('', text)
    text = HTML_TAG_RE.sub('', text)
    text = PUNCTUATION_RE.sub('', text)
    text = NEWLINE_RE.sub(' ', text)
    text = DIGIT_WORDS_RE.sub('', text)
    text = NON_ASCII_RE.sub('', text)
    return text.strip()
```
The spelling **horrors** we Spanish speakers commit when writing are harder to detect 😅
Why is it necessary to remove the numbers? In the `head` output you can see the first comment had the number 8, and it had a different context; I think it could be useful for the analysis.
It would be useful if you have a ton of info after doing web scraping or something like that.