Fixing Singular Matrices With np.linalg.pinv
Summary
When your regression model crashes because of a singular matrix, you don't throw it away. Linear algebra hands you a rescue tool: the pseudoinverse, also known as the Moore-Penrose inverse, available in NumPy as np.linalg.pinv. This technique lets you train a model with multicollinear data and still get usable predictions, which is gold when you work with messy real-world datasets.
What is the pseudoinverse and why does it matter in machine learning?
The pseudoinverse is a generalization of the matrix inverse. Unlike the traditional inverse, it works on any matrix: square, rectangular, or singular. That flexibility is exactly what you need when your features are linearly dependent and the standard inverse simply refuses to cooperate.
In machine learning, you use it to find the best approximate solution when no unique perfect answer exists. For a regression problem with multicollinearity, the pseudoinverse returns the model weights that minimize squared error, the classic least squares approach.
What is the Moore-Penrose pseudoinverse? It is a generalized inverse that exists for any matrix. In regression, it returns the weight vector that minimizes squared error and, when infinitely many solutions exist, picks the one with the smallest norm.
And here is the elegant part: when infinitely many solutions exist (typical in perfect multicollinearity), the pseudoinverse picks the unique solution whose weight vector theta has the smallest possible norm. It is the simplest, most stable answer you can get [00:54].
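To make the minimum norm idea concrete, here is a minimal sketch on a toy system. The matrix, the target, and the explicit rcond cutoff are assumptions made up for illustration, not values from the lesson. It shows two weight vectors that fit the data equally well, with the pseudoinverse returning the shorter one.

```python
import numpy as np

# Toy system (made up for illustration): the second column duplicates the first.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# Explicit cutoff (an assumption of this sketch) so that rounding-level
# singular values are treated as exactly zero.
RCOND = 1e-10

theta_pinv = np.linalg.pinv(X, rcond=RCOND) @ y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=RCOND)

# Adding any multiple of a null-space vector leaves the predictions unchanged.
null_vec = np.array([1.0, -1.0])  # X @ null_vec is the zero vector
theta_other = theta_pinv + 0.5 * null_vec

print("pinv weights: ", theta_pinv, "norm:", np.linalg.norm(theta_pinv))
print("lstsq weights:", theta_lstsq)
print("other weights:", theta_other, "norm:", np.linalg.norm(theta_other))
print("same predictions:", np.allclose(X @ theta_pinv, X @ theta_other))
```

np.linalg.lstsq is included only as a cross-check: for a rank-deficient system it also returns the minimum norm least squares solution.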
As a fun fact, the pseudoinverse is a direct application of Singular Value Decomposition, better known as SVD: decompose the matrix, invert only the nonzero singular values, and reassemble the pieces [01:18].
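The SVD connection can be sketched in a few lines. The helper name pinv_via_svd, its rcond default, and the small test matrix are hypothetical choices for this example. The idea is the standard one: factor the matrix, take reciprocals of the singular values that are not negligible, and multiply the factors back in reverse order.

```python
import numpy as np

def pinv_via_svd(M, rcond=1e-10):
    """Pseudoinverse built from the SVD (illustrative helper, not a NumPy API).

    rcond is this sketch's own cutoff: singular values at or below
    rcond * largest_singular_value are treated as zero.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s > rcond * s.max()
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]          # invert only the non-negligible values
    return Vt.T @ (s_inv[:, None] * U.T)  # V @ diag(s_inv) @ U.T

# Made-up matrix: the third column is exactly 10 times the second.
M = np.array([[1.0, 50.0, 500.0],
              [1.0, 80.0, 800.0],
              [1.0, 120.0, 1200.0]])

M_pinv = pinv_via_svd(M)
print("matches np.linalg.pinv:", np.allclose(M_pinv, np.linalg.pinv(M, rcond=1e-10)))
print("M @ M_pinv @ M == M:   ", np.allclose(M @ M_pinv @ M, M))
```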
How do you use np.linalg.pinv to solve a singular matrix problem?
When np.linalg.inv or np.linalg.solve raises a Singular matrix error, switch to np.linalg.pinv. Here is the flow used in the project, starting from X_sick_bias, a feature matrix with a duplicated column (square meters and square feet) [01:38].
- Build the normal equation pieces: A_sick = X_sick_bias.T @ X_sick_bias and B = X_sick_bias.T @ y.
- Confirm the failure: np.linalg.inv(A_sick) @ B returns a Singular matrix error [02:14].
- Compute the rescue: A_pinv = np.linalg.pinv(A_sick).
- Solve for weights: theta = A_pinv @ B [02:46].
This time the code runs without errors and you get a valid weight vector.
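The steps above can be sketched end to end as follows. The house data here is hypothetical (it is not the course dataset), and the column order bias, square meters, rooms, square feet is an assumption. One more assumption: depending on floating-point rounding, np.linalg.inv may return a numerically meaningless result instead of raising, so the sketch handles both cases, and it passes an explicit rcond to np.linalg.pinv so rounding noise is reliably treated as zero.

```python
import numpy as np

# Hypothetical data, NOT the course dataset.
m2 = np.array([50.0, 70.0, 80.0, 100.0, 120.0, 160.0])     # square meters
rooms = np.array([2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
y = np.array([150.0, 200.0, 240.0, 290.0, 350.0, 450.0])   # prices
ft2 = m2 * 10.7639                                          # redundant copy in square feet

# Assumed column order: bias, square meters, rooms, square feet.
X_sick_bias = np.c_[np.ones(len(m2)), m2, rooms, ft2]

# Normal equation pieces.
A_sick = X_sick_bias.T @ X_sick_bias
B = X_sick_bias.T @ y

print("rank:", np.linalg.matrix_rank(X_sick_bias), "of", X_sick_bias.shape[1], "columns")

# Confirm the failure. Rounding can hide exact singularity, so cover both outcomes.
try:
    theta_inv = np.linalg.inv(A_sick) @ B
    print("inv did not raise; cond(A_sick) =", np.linalg.cond(A_sick))
except np.linalg.LinAlgError as err:
    print("inv failed:", err)

# The rescue. rcond=1e-12 is this sketch's choice, not a value from the lesson.
A_pinv = np.linalg.pinv(A_sick, rcond=1e-12)
theta = A_pinv @ B
predictions = X_sick_bias @ theta

print("theta:", theta)
print("predictions:", np.round(predictions, 2))
```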
Why are the weights of redundant columns so small?
When you print theta, the values for the linearly dependent columns (square meters and square feet) come out very small. That is a fingerprint of multicollinearity handled by the minimum norm solution: the pseudoinverse spreads the influence across the redundant features instead of dumping all the weight into one [03:13].
Why does the pseudoinverse return small weights for collinear features? Because it picks the solution with the shortest theta vector. Redundant columns share the load, so each one ends up with a small coefficient.
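A small comparison can make the load sharing visible. The data below is hypothetical, and applying np.linalg.pinv directly to the feature matrix (instead of to the normal equation matrix) is a choice of this sketch; in exact arithmetic both routes give the same weights. The alternative solution puts the whole effect on square meters and zero on square feet, which fits equally well but has a longer weight vector.

```python
import numpy as np

FT2_PER_M2 = 10.7639  # conversion factor used to build the redundant column

# Hypothetical data, NOT the course dataset.
m2 = np.array([50.0, 70.0, 80.0, 100.0, 120.0, 160.0])
rooms = np.array([2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
y = np.array([150.0, 200.0, 240.0, 290.0, 350.0, 450.0])
X = np.c_[np.ones(len(m2)), m2, rooms, m2 * FT2_PER_M2]

# Minimum norm least squares weights (explicit rcond is this sketch's choice).
theta_min = np.linalg.pinv(X, rcond=1e-10) @ y

# Alternative fit: drop square feet, solve, then pad its weight with zero.
theta_no_ft2, *_ = np.linalg.lstsq(X[:, :3], y, rcond=None)
theta_one_col = np.append(theta_no_ft2, 0.0)

print("min norm weights:  ", theta_min, "norm:", np.linalg.norm(theta_min))
print("one-column weights:", theta_one_col, "norm:", np.linalg.norm(theta_one_col))
print("same predictions:", np.allclose(X @ theta_min, X @ theta_one_col))

# The shared load follows the column scale: the square-feet weight is
# FT2_PER_M2 times the square-meters weight.
print("weight ratio ft2/m2:", theta_min[3] / theta_min[1])
```

The ratio in the last line follows from the minimum norm solution being orthogonal to the null space of the feature matrix.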
How accurate are the predictions after applying the pseudoinverse?
With the new weights, you compute predictions = X_sick_bias @ theta and compare against the real prices. The numbers speak for themselves [03:55]:
- Real 310, predicted 308.07.
- Real 390, predicted 407.37.
- Real 325, predicted 317.28.
- Real 530, predicted 522.
For a dataset that was mathematically broken, those approximations are remarkably close. The model is not perfect, but it is functional, and that is the whole point of the rescue.
How do you visualize the regression line with matplotlib?
To plot the fit, you create a small grid of new houses and pass them through the same pipeline [04:42]:
- Build X_line = np.array([[50],[160]]) for square meters.
- Concatenate bias and the other columns into X_line_bias with np.c_[np.ones((2,1)), X_line, np.array([[2, 538],[4, 1722]])].
- Predict with y_line_pred = X_line_bias @ theta.
- Draw the scatter of real data and overlay the regression line using plt.scatter and plt.plot.
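A possible version of this plotting step is sketched below. The training data is hypothetical, the X_line rows are copied from the steps above, and the off-screen Agg backend plus the axis labels are assumptions of this sketch.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; remove for an interactive window
import matplotlib.pyplot as plt

# Hypothetical training data, NOT the course dataset.
m2 = np.array([50.0, 70.0, 80.0, 100.0, 120.0, 160.0])
rooms = np.array([2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
y = np.array([150.0, 200.0, 240.0, 290.0, 350.0, 450.0])
X_sick_bias = np.c_[np.ones(len(m2)), m2, rooms, m2 * 10.7639]
theta = np.linalg.pinv(X_sick_bias, rcond=1e-10) @ y  # rcond is this sketch's choice

# Two new houses: columns are bias, square meters, rooms, square feet.
X_line = np.array([[50], [160]])
X_line_bias = np.c_[np.ones((2, 1)), X_line, np.array([[2, 538], [4, 1722]])]
y_line_pred = X_line_bias @ theta

plt.figure()
plt.scatter(m2, y, label="real data")
plt.plot(X_line.ravel(), y_line_pred, color="red", label="pseudoinverse fit")
plt.xlabel("square meters")
plt.ylabel("price")
plt.legend()
# Call plt.show() or plt.savefig(...) here to view the figure.
```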
The resulting red line is the best fit your sick dataset can produce. Geometrically, the fitted values are the orthogonal projection of the price vector y onto the column space of the feature matrix, the closest possible approximation to the original problem [05:54].
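The projection claim can be checked numerically. In this sketch, which again uses hypothetical data and an explicit rcond, the residual should be orthogonal to every column of the feature matrix, and the matrix X @ pinv(X) should behave like a projector: symmetric and unchanged when applied twice.

```python
import numpy as np

# Hypothetical data, NOT the course dataset.
m2 = np.array([50.0, 70.0, 80.0, 100.0, 120.0, 160.0])
rooms = np.array([2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
y = np.array([150.0, 200.0, 240.0, 290.0, 350.0, 450.0])
X = np.c_[np.ones(len(m2)), m2, rooms, m2 * 10.7639]

X_pinv = np.linalg.pinv(X, rcond=1e-10)  # rcond is this sketch's choice
y_hat = X @ (X_pinv @ y)                 # fitted values
residual = y - y_hat

# Tolerance scaled to the size of the data, since the columns are large.
tol = 1e-8 * np.linalg.norm(X) * np.linalg.norm(y)
print("residual orthogonal to columns:", np.allclose(X.T @ residual, 0.0, atol=tol))

P = X @ X_pinv  # projector onto the column space of X
print("P symmetric: ", np.allclose(P, P.T))
print("P idempotent:", np.allclose(P @ P, P))
```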
When should you rely on the pseudoinverse instead of fixing the data?
Knowing how to rescue a mathematically broken model is a crucial skill for any data scientist working with real-world data, which is almost never clean. Still, the pseudoinverse is a safety net, not a permanent fix. Redundant features hurt interpretability and stability, so whenever you can, address the multicollinearity at the source by removing or combining columns.
Use np.linalg.pinv when you cannot reshape your data, when you need a quick diagnostic, or when you want the minimum norm solution as a baseline.
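As a sketch of the diagnose-then-fix route, the snippet below checks the rank of the feature matrix and, if a column is redundant, drops it so the ordinary np.linalg.solve works again. The data is hypothetical, and the choice to drop the square-feet column (index 3 in the assumed layout) is only one possible fix.

```python
import numpy as np

# Hypothetical data, NOT the course dataset.
m2 = np.array([50.0, 70.0, 80.0, 100.0, 120.0, 160.0])
rooms = np.array([2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
y = np.array([150.0, 200.0, 240.0, 290.0, 350.0, 450.0])
X_sick_bias = np.c_[np.ones(len(m2)), m2, rooms, m2 * 10.7639]

# Quick diagnostic: a rank below the column count signals redundant features.
n_cols = X_sick_bias.shape[1]
rank = np.linalg.matrix_rank(X_sick_bias)
print(f"rank {rank} of {n_cols} columns, cond = {np.linalg.cond(X_sick_bias):.2e}")

if rank < n_cols:
    # Fix at the source: keep square meters, drop the square-feet copy (column 3).
    X_clean = np.delete(X_sick_bias, 3, axis=1)
else:
    X_clean = X_sick_bias

# With full column rank, the plain normal equation solve succeeds.
theta_clean = np.linalg.solve(X_clean.T @ X_clean, X_clean.T @ y)
print("clean weights (bias, m2, rooms):", theta_clean)
```

The cleaned model produces the same fitted prices as the pseudoinverse route, but each remaining weight now has a single, readable meaning.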
Quick challenge on minimum norm
With multicollinearity there are infinitely many valid weight vectors. Why does the pseudoinverse return the one with the minimum norm? Hint: among all combinations of theta that solve the system, it picks the shortest vector. Investigate the connection with SVD and share your findings in the comments.