Troubleshooting Guide

Common issues and their solutions when running Velesio AI Server.

Quick Diagnostics

Start with these commands to check system status:

# Check all services
docker-compose ps

# Check logs for errors
docker-compose logs --tail=50

# Check GPU availability
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi

# Test API health
curl http://localhost:8000/health

Installation Issues

Docker GPU Runtime Not Found

Error: could not select device driver "" with capabilities: [[gpu]]

Solution:

# Install NVIDIA Docker runtime
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Test GPU access
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi

Model Download Failures

Error: Failed to download model from URL

Symptoms:

  • Container logs show download errors
  • Models directory is empty
  • Workers fail to start

Solutions:

  1. Check Internet Connection:
    
    # Test from container
    docker run --rm alpine ping -c 4 huggingface.co
    
  2. Manual Model Download:
    
    # Download models manually
    cd gpu/data/models/text
    wget https://huggingface.co/your-model/resolve/main/model.gguf
       
    cd ../image/models/Stable-diffusion
    wget https://huggingface.co/your-sd-model/resolve/main/model.safetensors
    
  3. Check Disk Space:
    
    df -h
    # Ensure sufficient space (models can be 4-20GB)
    
  4. Verify Model URLs:
    
    # Test URL accessibility
    curl -I $MODEL_URL
    
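If downloads keep failing intermittently, a retry loop with backoff often gets past transient network errors. The sketch below is illustrative only: `fetch_with_retry` is a hypothetical helper that wraps any download command, and the `flaky_download` stub stands in for a real `wget -c "$MODEL_URL"` call.

```shell
#!/bin/sh
# Sketch: retry a download command with exponential backoff.
MAX_TRIES=4

fetch_with_retry() {
    delay=1
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        tries=$((tries + 1))
        if "$@"; then
            echo "download succeeded after $tries attempt(s)"
            return 0
        fi
        echo "attempt $tries failed; retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
    done
    echo "download failed after $MAX_TRIES attempts" >&2
    return 1
}

# Demo with a stub that fails twice, then succeeds
# (replace 'flaky_download' with your real wget/curl command):
attempt_file=$(mktemp)
flaky_download() {
    n=$(cat "$attempt_file" 2>/dev/null || echo 0)
    n=$((n + 1))
    echo "$n" > "$attempt_file"
    [ "$n" -ge 3 ]
}

fetch_with_retry flaky_download   # → download succeeded after 3 attempt(s)
rm -f "$attempt_file"
```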

Permission Issues

Error: Permission denied or Operation not permitted

Solution:

# Fix ownership of data directory
sudo chown -R $(id -u):$(id -g) gpu/data/

# Fix permissions
chmod -R 755 gpu/data/

# For SELinux systems
sudo setsebool -P container_manage_cgroup true

Runtime Issues

API Returns 401 Unauthorized

Symptoms:

  • All API calls return 401
  • Authentication header is provided

Solutions:

  1. Check API Token Configuration:
    
    # Verify environment variable
    docker-compose exec api env | grep API_TOKENS
       
    # Confirm the token you send is listed in API_TOKENS
    grep '^API_TOKENS' .env
    
  2. Verify Bearer Token Format:
    
    # Correct format
    curl -H "Authorization: Bearer your-token-here" http://localhost:8000/health
       
    # NOT: "Authorization: your-token-here"
    
  3. Check Token in Environment File:
    
    # In .env file
    API_TOKENS=token1,token2,token3
    # No spaces around commas
    
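The formatting checks above can be folded into a quick validator. This is just a sketch (the `check_tokens` helper is not part of the server) that catches the most common cause of mystery 401s, stray spaces in the token list:

```shell
# Sketch: flag common API_TOKENS formatting mistakes before digging deeper.
check_tokens() {
    case "$1" in
        "")    echo "invalid: API_TOKENS is empty"; return 1 ;;
        *" "*) echo "invalid: remove spaces around commas"; return 1 ;;
        *)     echo "ok" ;;
    esac
}

check_tokens "token1,token2,token3"   # → ok
check_tokens "token1, token2"         # → invalid: remove spaces around commas
```

In practice you would feed it the value from your env file, e.g. `check_tokens "$(grep '^API_TOKENS=' .env | cut -d= -f2-)"`.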

Workers Not Processing Jobs

Symptoms:

  • API accepts requests but returns timeouts
  • Queue depth increases continuously
  • No worker activity in logs

Diagnostics:

# Check Redis connection
docker-compose exec redis redis-cli ping

# Check queue status
docker-compose exec redis redis-cli LLEN llama_queue

# Check worker logs
docker-compose logs Velesio-gpu

Solutions:

  1. Restart Workers:
    
    docker-compose restart Velesio-gpu
    
  2. Check Worker Configuration:
    
    # Verify worker environment
    docker-compose exec Velesio-gpu env | grep REDIS
    
  3. Clear Stuck Jobs:
    
    # Clear the Redis queue (warning: FLUSHDB deletes ALL keys in the current database)
    docker-compose exec redis redis-cli FLUSHDB
    
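Tying the diagnostics together, the decision logic can be sketched as a tiny triage helper. This is hypothetical (not shipped with the server); the two command outputs are passed in as arguments so you can see the branching:

```shell
# Sketch: triage worker health from two values you already collected:
#   ping_result: output of `docker-compose exec redis redis-cli ping`
#   depth:       output of `docker-compose exec redis redis-cli LLEN llama_queue`
worker_triage() {
    ping_result=$1
    depth=$2
    if [ "$ping_result" != "PONG" ]; then
        echo "redis unreachable: fix Redis before touching the workers"
    elif [ "$depth" -gt 0 ]; then
        echo "jobs queued (${depth}): restart or inspect Velesio-gpu"
    else
        echo "queue empty: problem is likely upstream in the API"
    fi
}

worker_triage PONG 12   # → jobs queued (12): restart or inspect Velesio-gpu
```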

GPU Out of Memory

Error: CUDA out of memory or RuntimeError: CUDA error: out of memory

Solutions:

  1. Reduce GPU Layers:
    
    # In .env file
    GPU_LAYERS=20  # Reduce from default 35
    
  2. Use Smaller Model:
    
    # Switch to quantized model
    MODEL_URL=https://huggingface.co/model-q4_k_m.gguf
    
  3. Reduce Batch Size:
    
    # For Stable Diffusion
    SD_BATCH_SIZE=1
    
  4. Check GPU Memory:
    
    # Monitor GPU usage
    watch -n 1 nvidia-smi
    
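To decide between the options above, it helps to know how close the card actually is to its limit. A small sketch that converts one line of the `nvidia-smi` CSV query (format: `used, total` in MiB) into a utilization percentage; shown here with a sample line rather than a live query:

```shell
# Sketch: compute GPU memory utilization from one line of
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
gpu_mem_percent() {
    used=${1%%,*}      # text before the comma
    total=${1#*, }     # text after ", "
    echo $(( used * 100 / total ))
}

gpu_mem_percent "7821, 8192"   # → 95
```

Anything consistently above ~90% is a strong signal to lower `GPU_LAYERS` or switch to a smaller quantization.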

Slow Inference Speed

Symptoms:

  • Text generation takes >30 seconds
  • Image generation takes >5 minutes

Solutions:

  1. Optimize GPU Layers:
    
    # Increase GPU layers if memory allows
    GPU_LAYERS=40
    
  2. Check CPU Usage:
    
    # If GPU_LAYERS is low, the CPU becomes the bottleneck
    htop
    
  3. Use Flash Attention:
    
    # For Stable Diffusion
    SD_FLASH_ATTENTION=true
    
  4. Model Optimization:
    
    # Use optimized model formats
    # GGUF with Q4_K_M quantization for LLM
    # SafeTensors for Stable Diffusion
    

Service-Specific Issues

Redis Connection Issues

Error: ConnectionError: Error connecting to Redis

Solutions:

  1. Check Redis Service:
    
    docker-compose ps redis
    docker-compose logs redis
    
  2. Test Redis Connectivity:
    
    # From within network
    docker-compose exec api ping redis
       
    # Test Redis directly
    docker-compose exec redis redis-cli ping
    
  3. Check Port Binding:
    
    # Verify Redis port
    netstat -tlnp | grep 6379
    

FastAPI Service Issues

Error: 502 Bad Gateway or API not responding

Solutions:

  1. Check API Service Health:
    
    docker-compose logs api
    curl http://localhost:8000/health
    
  2. Verify Port Binding:
    
    docker-compose ps api
    netstat -tlnp | grep 8000
    
  3. Check Resource Usage:
    
    docker stats
    

Stable Diffusion Issues

Error: Stable Diffusion worker fails to start

Solutions:

  1. Check SD Dependencies:
    
    # Verify CUDA version compatibility
    docker-compose exec Velesio-gpu nvidia-smi
    
  2. Disable SD if Not Needed:
    
    # In .env file
    RUN_SD=false
    
  3. Check SD Model Loading:
    
    # SD worker logs
    docker-compose logs Velesio-gpu | grep -i "stable"
    

Network Issues

Cannot Access API from External Host

Solutions:

  1. Check Firewall:
    
    # Allow API port
    sudo ufw allow 8000
       
    # Check iptables
    sudo iptables -L
    
  2. Verify Docker Port Binding:
    
    # Should show 0.0.0.0:8000
    docker port Velesio-api
    
  3. Test from Different Network:
    
    # From external host
    curl http://your-server-ip:8000/health
    

SSL/TLS Issues

Error: Certificate verification failed

Solutions:

  1. Check Certificate:
    
    # Verify certificate chain
    openssl s_client -connect your-domain.com:443 -servername your-domain.com
    
  2. Update Nginx Configuration:
    
    # In nginx.conf
    ssl_certificate /etc/nginx/ssl/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/privkey.pem;
    

Performance Issues

High Memory Usage

Solutions:

  1. Monitor Memory Usage:
    
    # Check container memory
    docker stats
       
    # Check host memory
    free -h
    
  2. Reduce Model Context:
    
    # Limit context length
    MAX_CONTEXT_LENGTH=2048
    
  3. Implement Memory Cleanup:
    
    # Kill the LLM server process so it restarts and releases cached memory
    docker-compose exec Velesio-gpu pkill -f undreamai_server
    

Queue Backup

Symptoms:

  • Requests pile up in queue
  • Response times increase

Solutions:

  1. Scale Workers:
    
    # Add more worker containers
    docker-compose up -d --scale Velesio-gpu=3
    
  2. Implement Rate Limiting:
    
    # In nginx.conf
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/m;
    
  3. Monitor Queue Depth:
    
    # Check queue status
    curl http://localhost:8000/queue/status
    

Monitoring and Debugging

Enable Debug Logging

# In .env file
LOG_LEVEL=DEBUG

# Restart services
docker-compose restart

Health Check Script

Create scripts/health-check.sh:

#!/bin/bash

echo "=== Velesio AI Server Health Check ==="

# Check Docker
if ! docker info >/dev/null 2>&1; then
    echo "❌ Docker not installed or not running"
    exit 1
fi

# Check GPU
if ! docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi >/dev/null 2>&1; then
    echo "❌ GPU not accessible from Docker"
    exit 1
fi

# Check services
echo "📋 Service Status:"
docker-compose ps

# Check API health
echo "🔍 API Health:"
curl -s http://localhost:8000/health | jq . || echo "❌ API not responding"

# Check Redis
echo "🔍 Redis Status:"
docker-compose exec -T redis redis-cli ping || echo "❌ Redis not responding"

# Check GPU memory
echo "🎮 GPU Status:"
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits

echo "✅ Health check complete"

Log Analysis

# Find errors in logs
docker-compose logs --since="1h" | grep -i error

# Monitor real-time logs
docker-compose logs -f | grep -iE "(error|exception|failed)"

# Analyze API response times
docker-compose logs api | grep "completion_request" | tail -100
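To see which service produces the most errors, the same logs can be summarized per service. A sketch, using printf-ed sample lines in place of real `docker-compose logs` output (compose prefixes each line with the service name and a `|` separator):

```shell
# Sketch: count error lines per service from compose-style log output.
count_errors() {
    grep -i error | cut -d'|' -f1 | sort | uniq -c | sort -rn
}

printf '%s\n' \
  'api         | ERROR bad request'         \
  'Velesio-gpu | error: CUDA out of memory' \
  'api         | INFO request ok'           \
  'api         | Error: upstream timeout'   | count_errors
```

In production, pipe `docker-compose logs --since="1h"` straight into `count_errors`.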

Getting Help

Debug Information to Collect

When seeking help, provide:

  1. System Information:
    
    # OS and version
    cat /etc/os-release
       
    # Docker version
    docker --version
    docker-compose --version
       
    # GPU information
    nvidia-smi
    
  2. Service Status:
    
    docker-compose ps
    docker-compose logs --tail=100
    
  3. Configuration:
    
    # Environment (remove sensitive data)
    cat .env | sed 's/API_TOKENS=.*/API_TOKENS=***REDACTED***/'
    
  4. Error Messages: Full error messages and stack traces

Community Support

  • GitHub Issues: https://github.com/Velesio/Velesio-aiserver/issues
  • Documentation: This documentation site
  • Discord: Join our community Discord server

Enterprise Support

For production deployments and enterprise support:

  • Email: support@Velesio.com
  • Priority support available for enterprise customers

Preventive Measures

Regular Maintenance

#!/bin/bash
# Weekly maintenance script

# Clean up old containers
docker system prune -f

# Update images
docker-compose pull

# Restart services
docker-compose down && docker-compose up -d

# Check disk space
df -h

# Verify GPU health
nvidia-smi

Monitoring Setup

Set up alerts for:

  • High GPU memory usage (>90%)
  • Queue depth (>10 jobs)
  • API response time (>30 seconds)
  • Disk space (>80% full)
  • Service downtime
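
The thresholds above can be encoded in a small check. This sketch takes the sampled values as arguments; in practice they would come from `nvidia-smi`, the queue status endpoint, and `df`:

```shell
# Sketch: compare sampled metrics against the alert thresholds above.
check_alerts() {
    gpu_pct=$1; queue_depth=$2; disk_pct=$3
    [ "$gpu_pct" -gt 90 ]     && echo "ALERT: GPU memory at ${gpu_pct}%"
    [ "$queue_depth" -gt 10 ] && echo "ALERT: queue depth ${queue_depth}"
    [ "$disk_pct" -gt 80 ]    && echo "ALERT: disk ${disk_pct}% full"
    true   # exit status should not depend on the last threshold test
}

check_alerts 95 3 85
# → ALERT: GPU memory at 95%
# → ALERT: disk 85% full
```

Run it from cron and pipe any output to your notification channel of choice.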

Backup Strategy

#!/bin/bash
# Daily backup script

# Backup configuration
cp .env /backups/env-$(date +%Y%m%d).backup

# Backup Redis data
docker-compose exec redis redis-cli BGSAVE
docker cp $(docker-compose ps -q redis):/data/dump.rdb /backups/redis-$(date +%Y%m%d).rdb

# Backup models (if custom)
tar -czf /backups/models-$(date +%Y%m%d).tar.gz gpu/data/models/
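
Backups also need rotation so they do not fill the disk themselves. A sketch using `find -mtime`; the 14-day window and the `/backups` layout are assumptions, and the demo ages files in a temp directory with GNU `touch -d` instead of touching real backups:

```shell
# Sketch: prune backup files older than 14 days.
prune_backups() {
    find "$1" -name '*.backup' -mtime +14 -print -delete
}

# Demo against a throwaway directory:
demo=$(mktemp -d)
touch -d '30 days ago' "$demo/env-20240101.backup"
touch "$demo/env-today.backup"
prune_backups "$demo"   # prints the pruned old file
ls "$demo"              # → env-today.backup
rm -rf "$demo"
```

Extend the `-name` pattern (e.g. to `*.rdb`) if the Redis dumps should rotate on the same schedule.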

Still having issues? Check our GitHub Issues or contact support.