Unsupervised learning algorithms


What is K-means in unsupervised learning?

Unsupervised learning is a branch of artificial intelligence focused on finding hidden structures in data without the need for prior labeling. A prominent example within this approach is the K-means algorithm, widely used in clustering tasks because it assigns each data point to a group, or "cluster", which makes patterns in the data easier to analyze.

How does K-means work?

The heart of K-means lies in the concept of a "centroid", which acts as a leader or representative of a particular cluster. These centroids may initially be placed randomly in the data space, but the process then adjusts them to best represent the data.

The general steps of the algorithm are:

  1. Random initialization: Random positions are chosen for the centroids.
  2. Membership assignment: Each data point is assigned to the nearest centroid, thus forming a cluster.
  3. Centroid update: The centroids are relocated by calculating the average of the points within each cluster.
  4. Repeating steps: Steps 2 and 3 are repeated until the centroids stabilize their position or the cluster assignments no longer change.
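The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the parameters `max_iter`, `tol` and `seed`, the choice to initialize centroids on randomly picked data points, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: random initialization, here by picking k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[new_labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop when assignments or centroids no longer change.
        shift = np.linalg.norm(new_centroids - centroids)
        unchanged = labels is not None and np.array_equal(new_labels, labels)
        centroids, labels = new_centroids, new_labels
        if unchanged or shift < tol:
            break
    return centroids, labels

# Two well-separated toy blobs (made-up data for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

On data this cleanly separated, each blob should end up in its own cluster; on real data the result can depend on the random initialization.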

What are the key parameters in K-means?

The most critical parameter in K-means is the "K" value, which represents the number of desired clusters. By varying "K", you can obtain clusters with different shapes and structures, which makes it essential to choose an appropriate value.

Practical example and visualization

Imagine running K-means with different values of "K" on the same data set. As "K" increases from 2 to 4, the groupings change in both shape and number of points per cluster. Performance metrics then help determine whether a given value of "K" suits the data.

What is the cost function in K-means?

The main objective of K-means is to optimize the position of centroids so that data points are as close as possible to their assigned centroids. In other words, it minimizes the sum of the squared distances from each point to its corresponding centroid.

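Minimizing this sum makes the resulting clusters as compact as possible. Note that the algorithm is only guaranteed to reach a local minimum, so the result can depend on where the centroids start.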
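Written as a formula, the cost described above looks as follows. The notation is an assumption introduced here for illustration: $x_i$ is a data point, $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is that cluster's centroid.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

Each inner sum measures how spread out one cluster is around its centroid; the outer sum adds that spread over all $K$ clusters.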

How are the centroids updated?

Update rule: Each centroid is recalculated as the average of the points in its cluster, which moves it to a position that better represents that cluster. The recalculation cycle continues until:

  • The position of the centroids changes insignificantly,
  • Or there is no change in the assignments of the points to the clusters.
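A small worked example may make the update rule and the two stopping conditions concrete. The four points, the starting centroids and the threshold `tol` are made-up values chosen so the arithmetic is easy to follow; they are not taken from the lesson.

```python
import numpy as np

points = np.array([[1.0, 1.0], [3.0, 1.0], [8.0, 8.0], [10.0, 8.0]])
labels = np.array([0, 0, 1, 1])                      # current cluster assignments
old_centroids = np.array([[0.0, 0.0], [9.0, 9.0]])   # centroids before the update

# Update rule: each centroid becomes the mean of the points in its cluster.
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
print(new_centroids)  # rows: [2, 1] and [9, 8]

# Stopping check 1: did any centroid move more than a small threshold?
tol = 1e-4  # illustrative threshold, not a prescribed value
shift = np.linalg.norm(new_centroids - old_centroids, axis=1).max()
centroids_stable = bool(shift < tol)

# Stopping check 2: would re-assigning points to the new centroids change anything?
distances = np.linalg.norm(points[:, None, :] - new_centroids[None, :, :], axis=2)
assignments_unchanged = np.array_equal(distances.argmin(axis=1), labels)

print(centroids_stable, assignments_unchanged)
```

Here the centroids still moved noticeably, but the assignments did not change, so the second condition alone would already end the loop.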

How to determine the right value of "K"?

Selecting the correct "K" is challenging but crucial to a successful model. Some common techniques include:

  • Inertia: The sum of squared distances from each point to its centroid; lower values mean tighter clusters. Because inertia keeps decreasing as "K" grows, it cannot be used on its own to pick "K".
  • Silhouette score: Compares how close each point is to its own cluster with how close it is to the nearest other cluster; a value close to one indicates compact, well-separated clusters.
  • Elbow plot: Plots inertia as a function of "K". The "elbow" of the curve suggests a suitable "K": the point beyond which adding more clusters no longer reduces inertia significantly.
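The metrics above can be computed with scikit-learn roughly as follows. This is a sketch under stated assumptions: the synthetic three-blob data from `make_blobs`, the range of K values, and the settings `n_init=10` and `random_state=0` are illustrative choices, not part of the lesson.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_                      # sum of squared distances to centroids
    silhouettes[k] = silhouette_score(X, model.labels_)
    print(f"K={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# Pick the K with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("Best K by silhouette:", best_k)
```

Plotting `inertias` against K (for example with matplotlib) gives the elbow plot; on data like this the curve should flatten after the true number of groups.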

Exploration using the Iris dataset

A practical and entertaining way to assimilate these concepts is to test K-means on the Iris dataset, famous in the machine learning world. Its several flower measurements make it possible both to group the samples effectively and to experiment with different configurations of the algorithm.

The Iris data is normally used to classify flowers by species based on features such as the width and length of sepals and petals.
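As a starting point for that experiment, here is a minimal sketch that clusters Iris with scikit-learn and then compares the clusters with the known species. Choosing `n_clusters=3` (one per species), `n_init=10` and `random_state=0` are assumptions for the example; the species labels are used only to evaluate the result, never to fit the model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X = iris.data  # sepal length/width and petal length/width, in cm

# Fit K-means without ever showing it the species labels.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Compare cluster assignments with the true species.
# The adjusted Rand index is 1.0 for a perfect match and near 0 for random labels.
ari = adjusted_rand_score(iris.target, model.labels_)
print(f"Adjusted Rand index vs. species: {ari:.3f}")
print("Centroids (one row per cluster):")
print(model.cluster_centers_)
```

Changing `n_clusters` or restricting `X` to a subset of the columns is an easy way to see how the configuration affects the grouping.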


These features make K-means a powerful tool for structuring and understanding unlabeled data. Its application in various areas of data analysis makes it essential for data scientists and analysts. If you are so inclined, I invite you to experiment with your own datasets and explore the visual and dynamic world of K-means.

Contributions 15


K-Means performance

We can use the elbow method to find the optimal K.

SUMMARY

The lesson describes the unsupervised learning approach called K-means, which looks for structure in the data by assigning data points to specific groups known as clusters. To do this, K-means uses centroids that represent each cluster and are initially placed at random. The goal is to update the position of the centroids by taking the mean of the data points that belong to each cluster, which in turn updates the assignment of data points to clusters. The most important parameter in K-means is the number of clusters, known as "K". The text also covers the cost function that K-means optimizes as it runs and the update rule used to move the centroids. Finally, it describes a visual example showing how the assignment of data points to clusters changes over several iterations.

Very interesting. This way I can determine an optimal route for delivering many packages.

The centroids in k-means are positions in the feature space: each centroid has one coordinate per input feature and stands in for the points of its cluster.

I think this is a good way to understand what each algorithm does, so that our approach to problems, as data scientists, is rational and explainable.

Three ingredients of an algorithm:

  • Decision process: the way the model makes a prediction, generally by using parameters.

  • Cost function: the way the model's prediction is compared with the target output or goal.

  • Update rule: the way the model updates and changes its parameters to improve its predictions.

Copy it into Colab and try k-means:

from sklearn.cluster import KMeans
from sklearn import datasets
import pandas as pd

import matplotlib.pyplot as plt
from sklearn import metrics

DATOS = datasets.load_wine()

# True wine classes, used only to evaluate the clusters
DataTarget = DATOS.target

col_list = DATOS.feature_names
DataToTrain = pd.DataFrame(DATOS.data, columns=col_list)
print(col_list)

K_optimo = None
Mejor = -1.0  # the adjusted Rand index can be slightly negative

for K in range(2, 8):
    # Vary K (random_state is fixed so results are reproducible)
    model = KMeans(n_clusters=K, max_iter=1000, n_init=10, random_state=0)

    # Fit the model
    model.fit(DataToTrain)

    # Cluster assigned to each sample
    y_kmeans = model.predict(DataToTrain)

    # Agreement with the true classes, measured by the adjusted Rand index
    # (1.0 = perfect match; it is a score, not a percentage)
    ari = metrics.adjusted_rand_score(DataTarget, y_kmeans)
    print(K, ' ', ari)

    if ari > Mejor:
        K_optimo = K
        Mejor = ari

print('')
print(f'Best adjusted Rand index: {round(Mejor, 3)}\nUsing K: {K_optimo}')

# Refit with the best K found above
K = K_optimo

model = KMeans(n_clusters=K, max_iter=1000, n_init=10, random_state=0)
model.fit(DataToTrain)

# Cluster assigned to each sample
y_means = model.predict(DataToTrain)

# Plot 2 features to see whether the clusters relate to them
plt.scatter(DataToTrain['alcohol'], DataToTrain['hue'], c=y_means, s=30)
plt.xlabel('Alcohol', fontsize=10)
plt.ylabel('Hue', fontsize=10)
plt.show()

This video gives a good explanation of k-means.


An interesting thread to read about using the elbow plot:

![](https://static.platzi.com/media/user_upload/image-52eb7f8a-9e8a-44df-9c93-918d15d8f592.jpg)

I am impressed by what K-means can do. It is something to keep in mind. Also, working through how it is built helps you learn it better.

[https://www.youtube.com/watch?v=4b5d3muPQmA](https://www.youtube.com/watch?v=4b5d3muPQmA) I think this guy explains the topic in a very accessible way.

I have used this dataset in several exercises and only knew that it covered different species of the same genus of plants. Here is the story of where the dataset came from and what it was originally intended for. It is good practice to understand where our data comes from. https://towardsdatascience.com/the-iris-dataset-a-little-bit-of-history-and-biology-fb4812f5a7b5

Unsupervised learning algorithms are a type of machine learning algorithm used when the input data has no labels or target values. Unlike supervised learning algorithms, which are used for classification and regression problems, unsupervised learning algorithms focus on finding patterns, hidden structures, or groupings in the data without any prior guidance.

Clustering algorithms can be used to determine whether an article is sensationalist or not (fake news).