Instalación de Olama y configuración de clúster local con ExoLabs

Curso de Fundamentos de LLMs

Contenido del curso

Fundamentos de los LLMs

Componentes Avanzandos de los LLMs

Personalización y Optimización

Evaluación de Modelos

20
Benchmarks para evaluar y comparar modelos LLM
12:53 min

Tomar examen

Instalación de Olama y configuración de clúster local con ExoLabs

Resumen

¿Te interesa aprender a ejecutar modelos LLM (Large Language Models) de forma local en tu equipo? A continuación, aprenderás cómo descargar Olama, configurar modelos como DeepSig y utilizar ExoLabs para conectar múltiples computadoras potenciando tu capacidad de procesamiento local para LLMs.

¿Cómo se instala y ejecuta Olama de manera local?

Olama es una herramienta sencilla para ejecutar modelos LLM de forma local:

Descarga Olama: Ve a Google, busca Olama, y descarga la versión adecuada para tu sistema operativo (macOS, Windows o Linux).
Instalación: Una vez descargado el archivo binario, ejecútalo haciendo doble clic y siguiendo los pasos que indica tu sistema operativo (por ejemplo, en macOS selecciona mover a Aplicaciones).
Uso desde terminal: Abre la terminal y ejecuta Olama para ver los comandos disponibles. Copia desde la web de Olama el comando específico del modelo que quieres ejecutar (por ejemplo, DeepSig r1-7b).

¿Cómo interactúa Olama con los modelos LLM?

Al ejecutar un modelo en Olama, este descargará los archivos necesarios (puede tomar varios minutos):

Interacción básica: Realiza preguntas o comandos como "¿cuánto es dos más dos?" para empezar a trabajar.
Consumo de recursos: La descarga y uso del modelo consumen recursos de CPU y memoria, por ejemplo, durante una consulta puede subir el consumo a más de dos gigabytes temporalmente.

¿Qué es ExoLabs y cómo potencia el uso local de LLMs?

ExoLabs permite conectar múltiples computadoras creando un clúster local de potencia computacional compartida:

Concepto principal: Usa redes peer-to-peer (P2P) como la tecnología detrás de plataformas como Torrent para compartir datos y tareas entre máquinas conectadas.
Ventajas: Reduce costos al utilizar varias computadoras económicas, mejora la privacidad (importante en sectores bancarios o jurídicos) y permite controlar específicamente las configuraciones y costos operacionales.

¿Cómo montar un clúster local con ExoLabs?

Estos pasos específicos son clave para conectar varias máquinas con ExoLabs:

Equipamiento: necesitas diversas computadoras conectadas mediante switch y router, pudiendo combinar conexiones Ethernet y WiFi.
Configuración y ejecución: instala ExoLabs en cada máquina; al ejecutar, crea automáticamente una red entre ellas mostrando información como teraflops o memoria disponible.
Funcionamiento: Cada computadora aporta sus capacidades (CPU, GPU, memoria) a la red, haciendo posible ejecutar modelos de gran tamaño que de otro modo requerirían hardware costoso.

¿Qué modelos LLM puedes ejecutar localmente con ExoLabs?

Puedes ejecutar modelos de distintos tamaños según la capacidad de memoria acumulada en tu red:

Modelos como DeepSig r1-70b requieren 70 gigabytes de memoria, completamente alcanzables al conectar varios equipos que sumen dicha capacidad.
Esta forma de trabajo localiza la informática, especialmente útil si tienes requisitos estrictos sobre información privada.

¿Cuáles son las ventajas concretas de ejecutar LLM en local?

Privacidad total: Los datos sensibles permanecen en tu entorno, clave para industrias sensibles al manejo de información.
Control absoluto: Define exactamente qué quieres ejecutar, cuánto gastar, y cómo ajustar o optimizar tu modelo específicamente para tu caso de uso con técnicas como el fine tuning.

Ahora que conoces cómo instalar Olama, manejar modelos locales e integrar ExoLabs para incrementar capacidad local, puedes aprovechar al máximo el potencial de los modelos LLM de manera segura y controlada.

Bryan Castano

Estudiante

•

Hola Chicos, Yo le pregunte a DeepSeek r-1 como funciona el SGD para entrenar NN "Obvio ya l ose" y esto fue su respuesta.

me gusto mucho ver su razonamiento , es fantastico ver como piensa el model oantes de darte una respuesta, auqne esta funcionalidad ya la tienene GPT-o4 y Gemini deep thinking. me gusto mucho verlo en localhost:

<user-question>

>>> hi, explain to me the Stochastic Gradient Descent Algorithm SGD to train a neuronal network, how does it work?

</user-question>

Okay, so I need to understand how Stochastic Gradient Descent (SGD) works

for training a neural network. From what I remember, machine learning

models use some kind of optimization algorithm to adjust their parameters

and minimize the cost function. But I'm not exactly sure how SGD fits into

this.

First, let me think about what gradient descent is. I know it's an

optimization algorithm that helps find the minimum value of a function by

moving in the direction opposite to the gradient (which points towards

increasing values). So, for training a neural network, we need to minimize

the cost or loss function with respect to the model parameters.

But wait, what's the difference between batch gradient descent and

stochastic gradient descent? I think in batch GD, you use all the training

examples to compute the gradient each time. That makes sense because it

gives a precise direction but is slow if you have a large dataset. On the

other hand, SGD must be different. Maybe instead of using all the data

points, it uses just one example per iteration? So, each update to the

parameters is based on a single sample from the training set.

Yes, that sounds right. But then how accurate are those updates? Since

you're only looking at one sample, wouldn't the gradient be noisy or not

representative of the whole dataset? I guess it might cause more

fluctuations in the learning process, but perhaps it's faster and can help

escape local minima because the frequent updates provide a kind of

randomness.

So, if SGD uses one example per iteration, how do we train the entire

model then? Do we go through the training data multiple times, each time

picking different examples (like shuffling them)? That would make sense.

It's called epoch in machine learning terms—each pass through all examples

is an epoch. But with SGD, instead of using the entire dataset per epoch,

you might process just one example at a time or a small batch.

Wait, sometimes I've heard about something called mini-batch GD as well.

So maybe there are different variants: batch (all), stochastic (one), and

mini-batch (a few). Mini-batches balance between the noise of SGD and the

speed of processing large batches. It's probably more efficient than batch

because you can process several samples at once, which is faster on modern

hardware like GPUs.

Now, how does the learning rate come into play with SGD? I think the

learning rate determines how big the steps are when moving towards the

minimum. If it's too high, we might overshoot and diverge; if it's too

low, convergence will be very slow. So finding a good learning rate is

crucial.

Also, how often do we update the weights in SGD? Since each example

triggers an update, that means more frequent updates but also maybe slower

convergence per update. But over time, perhaps with decreasing learning

rates, the model can converge despite the noise.

I'm trying to wrap my head around why we use mini-batches instead of just

single examples. Processing one sample at a time might be computationally

intensive if the dataset is large because each weight update would require

handling all layers and propagating gradients through them. Using

mini-batches allows us to vectorize operations, making computations more

efficient on GPUs.

Let me think about the steps involved in one iteration of SGD:

1. Pick a training example (or a small batch).

2. Compute the loss for this example.

3. Calculate the gradient of the loss with respect to each parameter using

backpropagation.

4. Update each parameter by subtracting the learning rate multiplied by

the computed gradient.

Yes, that seems about right. And we repeat this process multiple times

over the entire dataset (or subsets), adjusting the parameters iteratively

until the model's performance stabilizes or reaches a minimum.

Another thing I'm thinking about is how the order of examples affects the

training. If you shuffle the data every epoch, does it help with

generalization? Maybe by preventing the algorithm from learning the order

too much rather than the features themselves.

I also recall that sometimes people use decayed learning rates or adaptive

optimizers like Adam to combine the benefits of different methods. But

maybe that's beyond the basic SGD explanation.

Putting it all together: SGD is an optimization algorithm where we update

model parameters iteratively by computing gradients on one sample (or a

small batch) at a time, using backpropagation to calculate those gradients

efficiently. The learning rate controls how big each step is towards

minimizing the loss function, and through multiple epochs or passes over

the data, the model's weights are adjusted until they find a good minimum.

I think I have a basic understanding now, but there might be nuances I'm

missing, like handling different types of loss functions or more complex

models. Also, edge cases where the learning rate is too high or low could

cause issues like divergence or slow convergence.

...done thinking.

</model-thinking>

<local-model-r1-answer>

Stochastic Gradient Descent (SGD) is an optimization algorithm used to

train neural networks by minimizing a cost function through iterative

parameter updates. Here's a step-by-step explanation of how it works:

1. **Initialization**: Start with initial values for the model parameters,

often set randomly.

2. **Shuffle Training Data**: Optionally, shuffle the training dataset to

ensure varied examples are used in each iteration.

3. **Iterate Through Examples**:

- For each example (or small batch of examples):

- Compute the loss using the current parameters.

- Calculate the gradient of this loss with respect to each parameter

using backpropagation.

- Update each parameter by subtracting the product of the learning

rate and the computed gradient.

4. **Adjust Learning Rate**: The learning rate controls the step size

towards the minimum. It's crucial for convergence speed and avoiding local

minima; often, it may be decayed or adjusted dynamically.

5. **Repeat Until Convergence**: Continue iterating through the dataset

(epochs) until the loss function stabilizes or meets a predefined

threshold.

**Key Points**:

- **Efficiency**: SGD is computationally efficient, especially with large

datasets, as it processes only one sample (or a small batch) per

iteration.

- **Noise and Fluctuations**: The use of single samples

introduces noise but can help escape local minima due to frequent updates.

- **Mini-batch Processing**: Balances the benefits of batch and stochastic

approaches by processing small batches, enhancing computational efficiency

on GPUs.

In summary, SGD efficiently optimizes neural network parameters by

iteratively adjusting them based on gradients computed from individual or

mini-batches of training examples, using a learning rate to control step size.

</local-model-r1-answer>

\n genial, debo decir que elevo todos mis cores al mscximo de mi CPU.

PD: Ya me habia visto la demostracion del modelo en local cluster en McBooks de platzi en un PlatziLive de un Jueves hacia muchos meses atras, fue genial ver esto al inicico del 2025 cuando llego la fiebre de DeepSeek r1 que impacto la Industria AI WorldWide. fue genial verlo aquella vez y cool verlo de nuevo hoy.

Osvaldo Frias

Ignacio Robles

Daniel Hernandez

Gabriel Obregón

Santiago Pineda Botero

Kevin Fiorentino

Juan Diego

Daniel Andres Rojas Paredes

Mitchell Phillip Bermin Suarez

Andres David Martinez Torres

Luis Rebollo

camilo plata

Gerardo Miguel Pérez Solis

Cynthia Mercedes Gorozabel Villavicencio

Iván Pineda

Jose Daniel Barría Reyes

MARIA TERESA PANIAGUA RIVERA

Frank Stephano Alayza Herrera

Boris Turcios

Diego Ortiz

Henry Stivens Adarme Muñoz

Daniel Alberto Vega Bejarano

Instalación de Olama y configuración de clúster local con ExoLabs

Fundamentos de los LLMs

Funcionamiento interno de los grandes modelos de lenguaje

Qué construirás al terminar el curso de LLMs

Historia de la inteligencia artificial desde Turing hasta GPT-4

Tokenización y embeddings en LLMs

Qué es y cómo aprende un MLP

Cómo funciona la atención en GPT-2

Fundamentos de PyTorch para modelos de machine learning

Componentes Avanzandos de los LLMs

Construye GPT-2 desde cero con PyTorch

Qué es RoPE y cómo mejora GPT

Integración de Rope en GPT-2 con PyTorch

Leyes de escalado y modelos multimodales en inteligencia artificial

Cómo se entrena un LLM paso a paso

Mixture of Experts: cómo funciona MoE

RAM y VRAM para ejecutar LLMs en local