Why is it important to perform an exploratory data analysis?
Exploratory data analysis is crucial for identifying relevant patterns and possible correlations between the variables in a dataset. It not only improves our understanding of the data, but also helps the performance of predictive models by letting us identify and remove variables that could introduce noise or collinearity.
How do we analyze the correlation between variables?
In this lesson, a correlation analysis was performed by visualizing a heatmap of the correlations between attributes in the dataset. In this context, correlations can vary between -1 and 1:
- 1 or close to 1: strong positive correlation.
- 0 or close to 0: no linear correlation.
- -1 or close to -1: strong inverse (negative) correlation.
The objective is to discover highly correlated variables that could affect the model and decide whether to eliminate them.
Correlation analysis code example:
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of pairwise correlations between attributes
plt.figure(figsize=(15, 10))
sns.heatmap(dtf.corr(), annot=True)
plt.show()
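As a complementary check (not part of the lesson's original code), we can also list the attribute pairs whose absolute correlation exceeds a threshold instead of reading them off the heatmap. The sketch below assumes the same dtf DataFrame; the helper name and the 0.9 threshold are illustrative choices.

import pandas as pd

# Hypothetical helper: report attribute pairs whose |correlation| exceeds a threshold
def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

# Example usage with the lesson's DataFrame
print(highly_correlated_pairs(dtf))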
Which variables did we eliminate and why?
From the analysis, it was decided to eliminate the variables convex_area and equidiameter due to their high correlation with other variables such as area, perimeter, length, and width, which could lead to overfitting of the model.
Example code for dropping variables:
# Drop the highly correlated columns from the feature set
xOver.drop(['convex_area', 'equidiameter'], axis=1, inplace=True)
How do we visualize the distribution of our variables and classes?
Visualization is a powerful tool in exploratory analysis. By creating scatter plots and Kernel Density Estimation (KDE) plots, we can assess whether the classes in the data are linearly separable. This makes it easier to understand the structure of the data and to choose a classification method.
Example code for visualization:
sns.pairplot(df, hue="class").
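To focus on a single attribute, a KDE per class shows how well the classes separate along that dimension. This is a minimal sketch, assuming the same df DataFrame; "area" is used only as an illustrative column name from the dataset.

import seaborn as sns
import matplotlib.pyplot as plt

# KDE of one attribute, one curve per class, to inspect separability
sns.kdeplot(data=df, x="area", hue="class", fill=True)
plt.show()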
Why perform scaling and splitting of the dataset?
Splitting the data into training and test sets and then scaling it are fundamental steps: they standardize the features, keep results reproducible, and let us check that the model generalizes correctly to data it has not seen. Note that the scaler is fit only on the training set and then applied to the test set, so no information from the test data leaks into training.
Example code for scaling and splitting:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split features and labels into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(XOver, YOver, test_size=0.2, random_state=42, shuffle=True)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
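As a quick sanity check (an optional addition, not part of the lesson's code), we can verify that the scaled training features now have mean close to 0 and standard deviation close to 1:

# After StandardScaler, each training feature should have mean ~0 and std ~1
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))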
Practical conclusions
By applying these steps, we not only improve the quality of the dataset, but also strengthen our knowledge of the business and of the data on which the model is based. This knowledge lets you fine-tune decisions throughout the modeling process for more accurate and efficient predictions. Ready to keep learning? Let's move on to the next module to continue improving our data science skills!