New llama.cpp Server Feature: Dynamic Model Management

New in llama.cpp: Model Management

The llama.cpp server now supports “router mode,” allowing dynamic loading, unloading, and switching between multiple models without requiring server restarts. This addresses a frequent user request for Ollama-style model management within the llama.cpp ecosystem.

This feature improves resource utilization and uptime, as traditional model switching necessitates a full server restart, disrupting active requests and potentially causing downtime. In production environments, even brief interruptions can translate to significant financial losses or degraded user experience.

Key Insights

Ollama-style management: Inspired by the popular Ollama framework, offering a familiar workflow.
Multi-process architecture: Each model runs in its own process, isolating failures and improving stability.
LRU eviction: Least Recently Used models are automatically unloaded when the maximum number of loaded models (--models-max) is reached, freeing up VRAM.

Working Example

# Start the server in router mode (no model specified)
llama-server

# List available models
curl http://localhost:8080/models

# Manually load a model
curl -X POST http://localhost:8080/models/load \
-H "Content-Type: application/json" \
-d '{"model": "my-model.gguf"}'

Practical Applications

A/B Testing: Run different model versions concurrently to compare performance on real-world data.
Multi-tenant deployments: Serve multiple users or applications with different model requirements on a single server.

References:

https://huggingface.co/blog/ggml-org/model-management-in-llamacpp

On This Page

New in llama.cpp: Model Management

Key Insights

Working Example

Practical Applications

Continue reading

Related Content

Nemotron 3 Nano - A new Standard for Efficient, Open, and Intelligent Agentic Models

Generalist AI Introduces GEN-θ: A New Era of Embodied Foundation Models for Robotics

Introducing AnyLanguageModel: One API for Local and Remote LLMs on Apple Platforms