Ollama Integration

Use Ollama instead of llama.cpp for LLM inference while maintaining the same Redis queue-based architecture and API compatibility.

Quick Start

1. Configure Environment

Add to your .env file:

# Enable Ollama mode
RUN_OLLAMA=true
OLLAMA_MODEL=gemma3:4b
OLLAMA_SERVER_URL=http://ollama:11434

# Disable llama.cpp
RUN_LLAMACPP=false

# Enable Redis workers
API=true

2. Pull Model

docker exec velesio-ollama ollama pull gemma3:4b
docker exec velesio-ollama ollama list

3. Restart GPU Worker

docker-compose restart gpu
docker logs velesio-gpu -f

Expected output:

πŸ¦™ RUN_OLLAMA MODE ENABLED
πŸ¦™ Using external Ollama server at http://ollama:11434
πŸ”Œ Starting Ollama LLM worker connected to Redis...

Architecture

The Ollama integration uses a drop-in replacement pattern:

API β†’ Redis Queue β†’ GPU Worker (Ollama Mode)
                    β”œβ”€β”€ ollama_llm.py β†’ Ollama Server
                    └── sd.py         β†’ SD WebUI (shared)

Key Points:

  • ollama_llm.py replaces llm.py when RUN_OLLAMA=true
  • sd.py is shared between the Ollama and llama.cpp modes
  • Same Redis queues and same API endpoints, so the switch is transparent to clients (see the sketch below)
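
As a rough illustration of the pattern, the worker pops a job from the Redis queue, forwards the prompt to Ollama's /api/generate endpoint, and writes back a llama.cpp-shaped result. This is a minimal sketch only; the queue name, result key, and job fields are assumptions, and the real logic lives in ollama_llm.py:

# Minimal sketch of the drop-in worker pattern (not the actual ollama_llm.py).
import json
import os

import redis
import requests

OLLAMA_URL = os.environ.get("OLLAMA_SERVER_URL", "http://ollama:11434")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "gemma2:2b")

r = redis.Redis(host=os.environ.get("REDIS_HOST", "redis"),
                password=os.environ.get("REDIS_PASS"))

while True:
    # Block until a job arrives (illustrative queue name).
    _, raw = r.blpop("llama_queue")
    job = json.loads(raw)

    # Forward the prompt to the Ollama server.
    resp = requests.post(f"{OLLAMA_URL}/api/generate", json={
        "model": OLLAMA_MODEL,
        "prompt": job["prompt"],
        "stream": False,
    })
    answer = resp.json().get("response", "")

    # Return a llama.cpp-shaped result so clients see no difference.
    r.set(f"result:{job['id']}", json.dumps({
        "content": answer,
        "multimodal": False,
        "slot_id": 0,
        "stop": True,
    }))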

Configuration

Environment Variables

Variable            Required  Default              Description
RUN_OLLAMA          Yes       false                Enable Ollama mode
OLLAMA_SERVER_URL   No        http://ollama:11434  Ollama API endpoint
OLLAMA_MODEL        No        gemma2:2b            Default model name
RUN_LLAMACPP        No        true                 Set to false when using Ollama
RUN_SD              No        false                Enable Stable Diffusion
API                 Yes       -                    Enable Redis workers

Docker Compose

The GPU service includes Ollama configuration:

gpu:
  environment:
    - RUN_OLLAMA=${RUN_OLLAMA:-false}
    - OLLAMA_SERVER_URL=${OLLAMA_SERVER_URL:-http://ollama:11434}
    - OLLAMA_MODEL=${OLLAMA_MODEL:-gemma2:2b}
    - RUN_LLAMACPP=${RUN_LLAMACPP}
    - API=${API}

Usage Modes

Ollama LLM Only

RUN_OLLAMA=true
RUN_LLAMACPP=false
RUN_SD=false
API=true

Runs only Ollama LLM inference.

Ollama + Stable Diffusion

RUN_OLLAMA=true
RUN_LLAMACPP=false
RUN_SD=true
API=true

Runs Ollama for text and SD WebUI for images.

Traditional (llama.cpp)

RUN_OLLAMA=false
RUN_LLAMACPP=true
RUN_SD=true
API=true

Default mode using llama.cpp.

API Compatibility

Ollama workers maintain full compatibility with existing Unity and API clients.

Request Format

{
  "prompt": "Hello, how are you?",
  "temperature": 0.7,
  "top_k": 40,
  "top_p": 0.9,
  "n_predict": 128,
  "stop": ["</s>", "\n\n"]
}

Response Format

{
  "content": "I'm doing well, thank you!",
  "multimodal": false,
  "slot_id": 0,
  "stop": true
}

Parameter Mapping

The worker automatically converts between formats:

Unity/llama.cpp   Ollama           Description
prompt            prompt           Input text
temperature       temperature      Randomness (0-1)
top_k             top_k            Top-k sampling
top_p             top_p            Nucleus sampling
n_predict         num_predict      Max tokens to generate
repeat_penalty    repeat_penalty   Repetition penalty
seed              seed             Random seed
stop              stop             Stop sequences
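
As a sketch of that mapping (the helper name is hypothetical; the nesting of sampling settings under an options object reflects Ollama's /api/generate payload):

# Hypothetical helper: map a llama.cpp-style request onto an Ollama payload.
def to_ollama(req: dict, model: str) -> dict:
    options = {
        "temperature": req.get("temperature", 0.7),
        "top_k": req.get("top_k", 40),
        "top_p": req.get("top_p", 0.9),
        "num_predict": req.get("n_predict", 128),  # n_predict -> num_predict
        "repeat_penalty": req.get("repeat_penalty", 1.1),
        "stop": req.get("stop", []),
    }
    if "seed" in req:
        options["seed"] = req["seed"]
    return {
        "model": model,
        "prompt": req["prompt"],
        "stream": False,
        "options": options,
    }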

Available Models

Model        Size    Use Case           GPU RAM
gemma2:2b    ~1.5GB  Fast, lightweight  ~2GB
gemma3:4b    ~3GB    Balanced           ~4GB
llama2:7b    ~4GB    Good quality       ~6GB
llama2:13b   ~7GB    High quality       ~12GB
mistral:7b   ~4GB    General purpose    ~6GB

Pull Models

# Pull any model
docker exec velesio-ollama ollama pull <model-name>

# List available models
docker exec velesio-ollama ollama list

# Update .env
OLLAMA_MODEL=<model-name>

# Restart worker
docker-compose restart gpu

Testing

Check Logs

docker logs velesio-gpu | grep OLLAMA

Test Ollama Connection

curl http://localhost:11434/api/tags

Send API Request

curl -X POST http://localhost:8000/completion \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 50}'

Troubleshooting

Ollama Server Connection Failed

Check that the Ollama container is running:

docker ps | grep ollama
curl http://localhost:11434/api/tags

Check network connectivity:

docker exec velesio-gpu curl http://ollama:11434/api/tags

Worker Not Starting

Check environment variables:

docker exec velesio-gpu env | grep OLLAMA
docker exec velesio-gpu env | grep API

Model Not Found

Pull model first:

docker exec velesio-ollama ollama pull gemma3:4b
docker exec velesio-ollama ollama list

Redis Connection Failed

Check Redis credentials:

docker ps | grep redis
docker exec velesio-gpu env | grep REDIS

Migration

From llama.cpp to Ollama

  1. Update .env:
    
    RUN_OLLAMA=true
    OLLAMA_MODEL=gemma3:4b
    RUN_LLAMACPP=false
    
  2. Pull model:
    
    docker exec velesio-ollama ollama pull gemma3:4b
    
  3. Restart:
    
    docker-compose restart gpu
    docker logs velesio-gpu | grep "OLLAMA MODE"
    

From Ollama to llama.cpp

  1. Update .env:
    
    RUN_OLLAMA=false
    RUN_LLAMACPP=true
    
  2. Restart:
    
    docker-compose restart gpu
    

Feature Comparison

Feature             llama.cpp  Ollama
Model Management    Manual     Automatic
Setup Complexity    High       Low
Custom Binaries     ✅         ❌
Mac Support         Good       Better
GPU Acceleration    ✅         ✅
Context Slots       ✅         ❌
API Compatibility   Native     Converted

When to Use Ollama

  • βœ… Easy model management
  • βœ… Mac/Apple Silicon support
  • βœ… Quick model switching
  • βœ… Simple setup

When to Use llama.cpp

  • βœ… Custom server builds
  • βœ… Slot-based caching
  • βœ… Maximum performance tuning
  • βœ… Advanced features

Advanced Configuration

Multiple Ollama Instances

To scale out, give each GPU worker its own Ollama server and model:

# Worker 1
OLLAMA_SERVER_URL=http://ollama1:11434
OLLAMA_MODEL=gemma2:2b

# Worker 2
OLLAMA_SERVER_URL=http://ollama2:11434
OLLAMA_MODEL=llama2:13b

Custom Model Parameters

Adjust via API request:

{
  "prompt": "Your prompt",
  "temperature": 0.1,
  "top_k": 20,
  "top_p": 0.95,
  "repeat_penalty": 1.2
}

Monitoring

# Worker status
docker logs velesio-gpu -f

# Ollama status
docker exec velesio-ollama ollama ps

# Redis queue depth
docker exec redis redis-cli -a $REDIS_PASS LLEN llama_queue

# GPU usage
nvidia-smi
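
For scripted checks, a small Python equivalent of the commands above (assuming the Redis and Ollama ports are reachable from where it runs):

import os

import redis
import requests

# Queue depth, mirroring the redis-cli LLEN command above.
r = redis.Redis(host="localhost", port=6379, password=os.environ.get("REDIS_PASS"))
print("llama_queue depth:", r.llen("llama_queue"))

# Models currently available on the Ollama server (/api/tags).
tags = requests.get("http://localhost:11434/api/tags").json()
print("ollama models:", [m["name"] for m in tags.get("models", [])])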

Implementation Details

Components

  • ollama_llm.py: Redis worker that wraps Ollama API
  • sd.py: Shared SD worker (unchanged)
  • entrypoint.sh: Detects RUN_OLLAMA and starts appropriate workers

Dependencies

No additional dependencies are required; the worker reuses packages already in the stack:

  • redis>=4.5.0
  • rq==1.13.0
  • requests

Worker Behavior

ollama_llm.py:

  • Listens to Redis gpu_tasks channel
  • Converts Unity/LLaMA.cpp format to Ollama format
  • Supports streaming and non-streaming responses (see the streaming sketch below)
  • Handles: completion, template, tokenize, slots

sd.py (shared):

  • Works with both Ollama and llama.cpp modes
  • No modifications needed
  • Handles: txt2img, img2img
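
On the streaming side, Ollama's /api/generate returns newline-delimited JSON chunks when stream is true. A minimal sketch of relaying them (the function is illustrative, not the worker's actual code):

import json

import requests

def stream_generate(prompt: str, model: str, base_url: str = "http://ollama:11434"):
    """Yield partial text chunks from Ollama's streaming /api/generate response."""
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(f"{base_url}/api/generate", json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)          # one JSON object per line
            yield chunk.get("response", "")   # partial text
            if chunk.get("done"):
                break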

Important Notes

  1. Exclusive Modes: Ollama and llama.cpp cannot run simultaneously in the same worker
  2. Model Storage: Ollama stores models separately from llama.cpp
  3. API Transparency: Clients don’t need to know which backend is used
  4. No Slot Support: Ollama doesn’t support llama.cpp’s context slot caching

Need help? Check the troubleshooting guide or review container logs.