Introduction and Fundamentals of NLP
Natural Language Processing
Environment Setup and Data Exploration
Initial Preprocessing
Quiz: Introduction and Fundamentals of NLP
Traditional NLP Techniques for Business Documents
Tokenization, Stemming, and Lemmatization
Visualization and Word Cloud Generation
Vector Representation: Bag-of-Words and TF-IDF
Key Term Extraction and Topic Modeling
Traditional Classification for Sentiment and Category Analysis
Quiz: Traditional NLP Techniques for Business Documents
Introduction and Deep Dive into Transformers for Business Applications
Transformer Fundamentals and Their Relevance in NLP
Advanced Tokenization with Transformers and Hugging Face
Using Pretrained Transformer Models for Classification
Named Entity Recognition (NER) in Corporate Documents with Transformers
Fine-Tuning Transformers for Business Data
Quiz: Introduction and Deep Dive into Transformers for Business Applications
Final Project and B2B Commercial Strategy
Development and Prototyping of the Business Application, Part 1
Development and Prototyping of the Business Application, Part 2
Deploying the Project on Hugging Face
Text processing is a fundamental skill in data analysis, especially when working with user reviews. Proper text cleaning not only improves the quality of our analysis, but also optimizes computational resources by focusing only on relevant information. Regular expressions (Regex) are powerful tools that allow us to manipulate text efficiently, identifying specific patterns that we want to preserve or eliminate.
Before performing any content analysis on reviews, it is crucial to apply proper preprocessing. This initial step highlights the relevant content and reduces computational cost by removing unnecessary information.
To begin, we select only the columns we really need: in this case, the review text and its rating.
It is recommended to work with a copy of the original dataframe to keep the original data intact:
filter_data = df.copy()
Then, we can check the first rows to understand the structure of our data:
filter_data[['review_body', 'start']].head(3)
A fundamental step is to check for null values in our dataset:
filter_data.isnull().sum()
If we do not find null values, we can proceed directly to text cleaning. Otherwise, we should implement strategies to handle these missing values.
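If nulls do appear, two common strategies are dropping the affected rows or filling them with a placeholder. A minimal sketch, using a made-up toy dataframe with the same column names as in this lesson:

```python
import pandas as pd

# Toy dataframe standing in for the real review dataset (values are invented)
df = pd.DataFrame({
    "review_body": ["Great product!", None, "Arrived late"],
    "start": [5, 3, 2],
})
filter_data = df.copy()

# Strategy 1: drop rows whose review text is missing
filter_data = filter_data.dropna(subset=["review_body"])

# Strategy 2 (alternative): fill missing text with an empty string
# filter_data["review_body"] = filter_data["review_body"].fillna("")

print(filter_data.isnull().sum())
print(len(filter_data))
```

Dropping is usually safe when only a small fraction of reviews are affected; filling keeps the row count stable for downstream joins.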
Regular expressions (Regex) are sequences of characters that define search patterns. With them, we can describe, identify, and manipulate text strings efficiently. They are especially useful for finding patterns such as @-mentions, URLs, or HTML tags.
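As a quick illustration of the idea (the sample string below is invented), `re.sub` replaces every match of a pattern with the text we choose, here an empty string:

```python
import re

text = "Contact us at https://example.com or <b>visit</b> our site!"

# Strip URLs: 'https?' matches http or https, '\S+' eats the rest of the link
text = re.sub(r'https?://\S+', '', text)

# Strip HTML tags: '<.*?>' matches the shortest run between angle brackets
text = re.sub(r'<.*?>', '', text)

print(text)
```

The `?` after `https` makes the `s` optional, and the non-greedy `.*?` keeps `<b>...</b>` from swallowing everything between the first `<` and the last `>`.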
To implement cleaning, we need to import the appropriate libraries:
import re
import string
Then, we create a function that applies all the necessary transformations:
def clean(text):
    # Convert to lowercase
    text = text.lower()
    # Remove text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove punctuation marks
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove line breaks
    text = re.sub(r'\n', '', text)
    # Remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)
    # Remove emojis and other non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Remove leading and trailing whitespace
    text = text.strip()
    return text
Once our function is defined, we apply it to the column containing the reviews:
filter_data['clean_review'] = filter_data['review_body'].apply(clean)
This process may take a few seconds depending on the size of the dataset. In the example, it took approximately 8 seconds.
Our clean function performs several important transformations: lowercasing, plus removal of bracketed text, URLs, HTML tags, punctuation, line breaks, words containing numbers, and non-ASCII characters such as emojis.
When comparing the original review with the cleaned one, we can observe significant differences such as the conversion to lowercase and the elimination of punctuation marks, which highlights the most relevant part of each comment.
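As a quick sanity check on a single string (the sample review below is made up, and the function is the same cleaning pipeline described above, condensed):

```python
import re
import string

def clean(text):
    # Lowercase, then strip URLs, HTML tags, punctuation,
    # number-bearing words, and non-ASCII characters
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    return text.strip()

raw = "I LOVED it!! 10/10, see https://example.com 😍"
out = clean(raw)
print(out)
```

The cleaned string is fully lowercase, ASCII-only, and free of the URL, the rating digits, and the punctuation, which is exactly the kind of before/after contrast described above.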
Text preprocessing is a fundamental step that should not be underestimated in any textual data analysis project. With these techniques, you will be prepared to extract valuable information from your review datasets and get more accurate and relevant insights.
Have you used regular expressions in your data analysis projects? Share your experiences and doubts in the comments section.