Multicollinearity: Why Redundant Features Break Models

Summary

When two features in your dataset measure the same thing, the normal equation breaks before your model learns anything. That problem has a name: multicollinearity, and it is one of the most common causes of singular matrices in machine learning. Understanding it will save you from training models that look fine on paper but collapse the moment you trust their weights.

Why does multicollinearity make a matrix singular?

Multicollinearity happens when you have redundant features. Think of one column in square meters and another in square centimeters, or duplicated room counts. Both measure the same thing, just on a different scale, so they add zero new information [0:35].

When this happens, the columns of your data matrix X become linearly dependent. That dependency carries over into the Gram matrix X^T X (the matrix you invert in the normal equation) and makes it singular, which means its determinant is zero and it has no inverse. No inverse, no unique solution.
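Why the dependency carries over can be written out in two steps. This is only a sketch: the coefficient vector $c$ is notation introduced here, and the column order (bias, square meters, a second feature, square feet) is an assumption based on how the lesson builds its matrix later on.

```latex
% Assumption: columns of X are (bias, m^2, second feature, ft^2) with ft^2 = 10.764 * m^2,
% so the nonzero vector c below combines the columns of X into the zero vector.
c = \begin{pmatrix} 0 \\ -10.764 \\ 0 \\ 1 \end{pmatrix},
\qquad
X c = 0
\;\Longrightarrow\;
(X^{T}X)\,c = X^{T}(X c) = 0
\;\Longrightarrow\;
\det(X^{T}X) = 0 .
```

A nonzero vector that X^T X sends to zero is exactly what rules out an inverse, whatever the remaining columns contain.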

What is a singular matrix in machine learning? It is a matrix whose determinant equals zero, so it cannot be inverted. In training, this means the normal equation has no unique solution for the model weights.

What problems does multicollinearity cause in your model?

Redundant features do not just slow things down. They actively destroy three things you need from any reliable model [1:15].

Solution ambiguity: with redundant features, infinitely many combinations of weights produce the same predictions, so the normal equation cannot be solved with an inverse.
Model instability: when the determinant is close to zero, a tiny change in your input data triggers drastic, erratic shifts in the learned weights.
Loss of interpretability: unstable weights cannot tell you which feature actually matters, so you lose the ability to explain your model's decisions.

In other words, you stop trusting the math and you stop trusting the story the model tells.
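The first of those problems, solution ambiguity, is easy to see on a toy example. The sketch below uses made-up data (not the lesson's dataset) in which the second column is exactly twice the first, and three hypothetical weight vectors that all give the same predictions.

```python
import numpy as np

# Hypothetical toy data: the second column is exactly 2x the first (redundant).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])

# Three very different weight vectors...
w_a = np.array([1.0, 0.0])    # all the weight on the first feature
w_b = np.array([0.0, 0.5])    # all the weight on the redundant copy
w_c = np.array([-3.0, 2.0])   # even a negative weight works: -3 + 2*2 = 1

# ...produce identical predictions, so the data cannot choose between them.
pred_a, pred_b, pred_c = X @ w_a, X @ w_b, X @ w_c
same_predictions = np.allclose(pred_a, pred_b) and np.allclose(pred_a, pred_c)
print(same_predictions)
```

Because any of these vectors fits the data equally well, reading meaning into the sign or size of a single weight is not justified here.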

How do you diagnose singularity in Python with NumPy?

The fastest way to see this in action is to deliberately break a healthy dataset. In Google Colab, you can take the original X and add a redundant column [2:30].

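    # Square feet: the square-meter column rescaled, so it carries no new information
    pies2 = X_original[:, 0] * 10.764
    # Append the redundant column, then prepend the bias column of ones
    X_enfermo = np.c_[X_original, pies2]
    X_enfermo_bias = np.c_[np.ones((4, 1)), X_enfermo]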

The new column pies2 is just the square meters column converted to square feet by multiplying by 10.764. Same information, different scale. Pure redundancy.

Next, build the pieces of the normal equation:

    # Left- and right-hand sides of the normal equation A @ theta = B
    A_enfermo = X_enfermo_bias.T @ X_enfermo_bias
    B = X_enfermo_bias.T @ Y
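The lesson's snippets rely on X_original and Y already existing in the notebook. For anyone who wants to run the steps standalone, here is a self-contained sketch. The square-meter values are back-calculated from the pies2 numbers a student posted in the comments; the second column (room counts) and the prices in Y are made up for illustration and are not the lesson's data.

```python
import numpy as np

# ASSUMED data, for illustration only: 4 houses, columns = [square meters, rooms].
X_original = np.array([[ 80.0, 2.0],
                       [120.0, 3.0],
                       [100.0, 3.0],
                       [150.0, 4.0]])
Y = np.array([200.0, 310.0, 270.0, 400.0])  # assumed prices, arbitrary units

# Redundant feature: square meters converted to square feet.
pies2 = X_original[:, 0] * 10.764
X_enfermo = np.c_[X_original, pies2]

# Bias column sized from the data instead of a hard-coded 4.
n_rows = X_enfermo.shape[0]
X_enfermo_bias = np.c_[np.ones((n_rows, 1)), X_enfermo]

# Pieces of the normal equation A @ theta = B.
A_enfermo = X_enfermo_bias.T @ X_enfermo_bias
B = X_enfermo_bias.T @ Y

print(A_enfermo.shape, B.shape)
```

Reading the row count from `X_enfermo.shape[0]` is a small departure from the lesson's `np.ones((4, 1))`; it keeps the snippet working if you add or remove rows.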

Now run the two diagnostics you already know.

What does a determinant of zero tell you?

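When you compute np.linalg.det(A_enfermo), the result is zero [4:10] (in floating-point arithmetic it may print as a value that is tiny relative to the matrix entries rather than an exact 0.0). That is your first red flag. The matrix is sick.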

What does it mean when a determinant equals zero? It means the matrix is singular and has no inverse, so the system of equations has either no solution or infinitely many solutions, never a unique one.

What does the rank reveal about redundancy?

The second check is the rank with np.linalg.matrix_rank(A_enfermo). The result is 3, even though the matrix is 4x4. That mismatch confirms that one column is a linear combination of the others.
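Both diagnostics can be bundled into one small helper. The sketch below is illustrative: `diagnose` is a hypothetical function name, and the toy matrix uses an exact integer scale factor of 2 (instead of 10.764) so that floating-point rounding does not blur the result.

```python
import numpy as np

def diagnose(A):
    """Return (determinant, rank, number of columns) for a square matrix A."""
    A = np.asarray(A, dtype=float)
    return np.linalg.det(A), np.linalg.matrix_rank(A), A.shape[1]

# Hypothetical design matrix: bias, feature a, feature b, and a copy of
# feature a scaled by exactly 2 (the redundant column).
X = np.array([[1.0, 1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0, 4.0],
              [1.0, 3.0, 1.0, 6.0],
              [1.0, 4.0, 0.0, 8.0]])
A = X.T @ X  # Gram matrix, 4x4

det, rank, n = diagnose(A)
is_rank_deficient = rank < n
print(f"det={det:.3g}, rank={rank}, columns={n}, rank deficient: {is_rank_deficient}")
```

Comparing the rank against the column count tends to be the more dependable of the two checks, since the determinant's magnitude also depends on how the features are scaled.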

Why does NumPy refuse to solve a singular system?

Knowing the theory is one thing. Watching NumPy throw the error is another. If you try to compute the weights directly:

    theta = np.linalg.inv(A_enfermo) @ B

You get a Singular matrix error [5:20]. Same story if you try the dedicated solver:

    theta = np.linalg.solve(A_enfermo, B)

Same error. NumPy is not being dramatic. It is telling you that no inverse exists, so there is no unique theta to return.
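If you would rather catch the failure than let the notebook stop, the exception can be handled explicitly. This is a minimal sketch on a hypothetical 2x2 system that is exactly singular; `try_solve` is an illustrative helper, not part of NumPy.

```python
import numpy as np

# Hypothetical system: the second row is exactly twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
b = np.array([1.0, 2.0])

def try_solve(A, b):
    """Return (theta, None) on success, or (None, message) if A is singular."""
    try:
        return np.linalg.solve(A, b), None
    except np.linalg.LinAlgError as err:
        return None, str(err)

theta, error = try_solve(A, b)
print(theta, error)
```

One caveat worth keeping in mind: NumPy's check fires when the factorization hits an exactly zero pivot. With floating-point columns, such as a 10.764 rescale, rounding can leave a tiny nonzero pivot, in which case you may get back enormous weights and no exception at all. Pairing the solve with the rank check is a reasonable safeguard.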

And here is the lesson: a redundant feature, something as innocent as adding square feet next to square meters, was enough to take your training pipeline from working to completely broken.

How can you experiment with near-singular matrices?

Take the X_enfermo matrix and tweak one value in the last column so it is no longer exactly the square meters times 10.764. Pick any number you want and rerun the code.

Then answer in the comments:

Is the determinant now exactly zero or just a very small number?
Does the inverse work, or does np.linalg.solve still fail?
If weights appear, do they look stable or chaotic?

This little experiment shows the difference between perfect singularity and near singularity, and why even almost redundant features can wreck your model's reliability.
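One way to put a number on "near singularity" is the condition number, which is not covered in this lesson and is used here only as an assumed gauge via np.linalg.cond. The sketch below uses made-up house data and nudges one square-feet entry by 5, the same tweak a student tried in the comments.

```python
import numpy as np

# ASSUMED toy data (not the lesson's dataset): bias, square meters, rooms.
X_healthy = np.array([[1.0,  80.0, 2.0],
                      [1.0, 120.0, 3.0],
                      [1.0, 100.0, 3.0],
                      [1.0, 150.0, 4.0]])

# Square-feet column with one entry nudged, so it is *almost* 10.764 * m2.
pies2 = X_healthy[:, 1] * 10.764
pies2[0] += 5.0
X_near = np.c_[X_healthy, pies2]

A_healthy = X_healthy.T @ X_healthy
A_near = X_near.T @ X_near

# Larger condition number = weights more sensitive to small data changes.
cond_healthy = np.linalg.cond(A_healthy)
cond_near = np.linalg.cond(A_near)
rank_near = np.linalg.matrix_rank(X_near)

print(f"rank of X_near: {rank_near}")
print(f"condition number without pies2:     {cond_healthy:.3e}")
print(f"condition number with nudged pies2: {cond_near:.3e}")
```

The nudge restores full rank, so the solver no longer refuses, yet the condition number grows once the almost-redundant column is added. That gap is the numerical face of the instability described above.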

The good news is that linear algebra has a tool built precisely for these cases: the pseudoinverse. With it, you can find the best possible solution even when a unique one does not exist. Share what you got in your experiment and tell me which value broke your matrix the most.
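As a small preview, here is a hedged sketch of what the pseudoinverse does on a redundant toy dataset, using np.linalg.pinv. The data is hypothetical, and the explicit `rcond` cutoff is an assumption made here so that singular values produced by rounding are treated as zero.

```python
import numpy as np

# Hypothetical toy data: the second column is exactly 2x the first.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

# Among the infinitely many weight vectors that fit y, the pseudoinverse
# returns the one with the smallest Euclidean norm.
theta = np.linalg.pinv(X, rcond=1e-10) @ y
predictions = X @ theta

print(theta, predictions)
```

Here every solution satisfies w1 + 2*w2 = 1, and the shortest such vector is (0.2, 0.4), which spreads the weight across both copies of the feature instead of failing.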

Comments

Nery Alberto Cano Ortigoza

Student

What happens if features are nearly identical?

If the variables are not exact copies but are highly correlated (such as a house's price and the taxes paid on it), the determinant will not be exactly zero, but it will be an extremely small number. That creates a silent but lethal problem: extreme instability. Think of it as trying to balance a pyramid on its tip. Any microscopic noise in the input data, or a slight change in a single row, will make the model's weights (theta) jump erratically and disproportionately. One day a variable looks like the most important in the world and, after retraining with one new data point, its weight could turn negative. This completely destroys the interpretability of your model and keeps you from explaining why the AI makes certain decisions.

Nery Alberto Cano Ortigoza

Student

Can I use this to clean my data?

Absolutely. Understanding linear dependence is your best filter for feature selection. Instead of blindly throwing hundreds of variables at your model and hoping for magic, you can use the rank and the determinant as a quality scanner. If you detect that your matrix is singular, you know a deep cleanup is needed. That lets you make strategic decisions: you can manually drop the redundant variable (for example, if you have both year of birth and age, delete one), or you can apply dimensionality reduction techniques such as PCA (Principal Component Analysis). PCA compresses that correlated information into new, uncorrelated variables, rescuing the training, reducing computational cost, and ensuring that each column contributes unique value.

Alberto Ezequiel Marin Chacon

Student

The rank stays at 4, since one of the rows no longer contains redundant information.

X_enfermo_bias[1,3] = 55

Nery Alberto Cano Ortigoza

Student

Why does redundant information break models?

Imagine trying to follow directions from two GPS units at once, where one gives the distance in kilometers and the other in miles. Instead of helping you arrive faster, the duplicated information confuses you because you don't know which one to prioritize. In machine learning, this translates into mathematical ambiguity. When you give the model two variables that say exactly the same thing, the algorithm faces an endless set of possibilities when trying to assign them a weight or importance. Since both are equally valid, the normal equation collapses because there is no single correct solution, only infinitely many possible combinations. This destroys the stability of the matrix, driving its determinant to zero and making it impossible to invert. In practice, your model becomes unable to learn real patterns and is useless for making reliable predictions.

José Eder Guzmán Mendoza

Student

Multicollinearity occurs when two or more variables in a model contain the same information (for example, duplicating a variable or using it on a different scale). This makes the matrix $X^TX$ singular, that is, its determinant equals zero, which prevents computing its inverse and therefore breaks training through the normal equation.

The consequences are critical: there is no unique solution for the weights, the model becomes unstable (small changes produce large variations), and interpreting the coefficients stops making sense. With tools like NumPy, this can be diagnosed using np.linalg.det (determinant) and np.linalg.matrix_rank (rank), which let you identify redundancies before methods such as np.linalg.inv or np.linalg.solve fail.

When this problem occurs, a practical alternative is to use the pseudoinverse, which yields an approximate solution even in the presence of multicollinearity. In short, detecting and correcting redundancy in the variables is key to ensuring stable, reliable machine learning models.

Alex Xiomar Rubio Lopez

Student

Challenge: change one value in the matrix X to remove the redundancy, then check its determinant and rank.

Darlinson Felipe Polania Camacho

Student

Where you do the conversion pies2 = X[:, 0] * 10.764, I modify the column by adding a number so that it is no longer an exact multiple: pies2[0] += 5. That makes the matrix invertible again, and the determinant is no longer zero.

Gabriel Obregón

Student

🧩📘 Multicollinearity and singularity in linear models

🎯 CORE IDEA

🧠 Multicollinearity = redundant features in X
⬇️
🔗 Linear dependence
⬇️
🚫 Singular (non-invertible) matrix
⬇️
❌ The normal equation has no unique solution

⚠️ WHAT IS SINGULARITY?

A matrix is singular when:

✔️ Determinant = 0
❌ It has no inverse
❌ There is no unique solution

📌 Multicollinearity is a direct cause of singularity.

🔁 HOW DOES LINEAR DEPENDENCE APPEAR?

It occurs when a column contributes no new information.

Typical examples

🔹 Same variable, different scale → m² ↔ ft² (factor 10.764)
🔹 Duplicated variable → "rooms" + a copy of "rooms"

➡️ Same information, different form.

🧮 WHAT DOES IT LOOK LIKE IN THE MATRICES?

📐 Mathematically:

· X transpose times X becomes singular
· Determinant = 0
· Rank < number of columns
· The normal equation cannot be solved

🚨 PROBLEMS IN THE MODEL

❓ Ambiguity
· Infinitely many weight vectors explain the data equally well

⚡ Numerical instability
· Determinant close to 0
· Small changes → large variations in the weights

🔍 Poor interpretability
· The weights are no longer reliable

Alejandro Senger

Student

# Let's slightly modify one value of pies2 and repeat the process
pies2 = np.array([ 861, 1291.68, 1076.4, 1614.6 ])
X_enfermo = np.c_[X, pies2]
X_enfermo_bias = np.c_[np.ones((4, 1)), X_enfermo]
A_enfermo = X_enfermo_bias.T @ X_enfermo_bias
B_enfermo = X_enfermo_bias.T @ y
determinante = np.linalg.det(A_enfermo)
print(determinante)

And the result I get is 1.44...
