Troubleshooting Guide
Common issues and their solutions when running Velesio AI Server.
Quick Diagnostics
Start with these commands to check system status:
# Check all services
docker-compose ps
# Check logs for errors
docker-compose logs --tail=50
# Check GPU availability
docker run --rm --gpus all nvidia/cuda:11.8-base-ubuntu20.04 nvidia-smi
# Test API health
curl http://localhost:8000/health
Installation Issues
Docker GPU Runtime Not Found
Error: could not select device driver "" with capabilities: [[gpu]]
Solution:
# Install NVIDIA Docker runtime
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# Test GPU access
docker run --rm --gpus all nvidia/cuda:11.8-base-ubuntu20.04 nvidia-smi
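If nvidia-smi still fails inside containers after installing the toolkit, confirm that Docker actually registered the NVIDIA runtime; the nvidia-ctk step is only needed when the runtime does not show up:
# Check whether Docker knows about the NVIDIA runtime
docker info | grep -i nvidia
# If nothing shows up, register the runtime and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker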
Model Download Failures
Error: Failed to download model from URL
Symptoms:
- Container logs show download errors
- Models directory is empty
- Workers fail to start
Solutions:
- Check Internet Connection:
# Test from container
docker run --rm alpine ping -c 4 huggingface.co
- Manual Model Download:
# Download models manually
cd gpu/data/models/text
wget https://huggingface.co/your-model/resolve/main/model.gguf
cd ../image/models/Stable-diffusion
wget https://huggingface.co/your-sd-model/resolve/main/model.safetensors
- Check Disk Space:
df -h
# Ensure sufficient space (models can be 4-20GB)
- Verify Model URLs:
# Test URL accessibility
curl -I $MODEL_URL
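If a large download keeps failing partway through, wget can resume the partial file instead of starting over (the model URL below is a placeholder, as in the examples above):
# Resume an interrupted model download
cd gpu/data/models/text
wget -c https://huggingface.co/your-model/resolve/main/model.gguf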
Permission Issues
Error: Permission denied or Operation not permitted
Solution:
# Fix ownership of data directory
sudo chown -R $(id -u):$(id -g) gpu/data/
# Fix permissions
chmod -R 755 gpu/data/
# For SELinux systems
sudo setsebool -P container_manage_cgroup true
Runtime Issues
API Returns 401 Unauthorized
Symptoms:
- All API calls return 401
- Authentication header is provided
Solutions:
- Check API Token Configuration:
# Verify environment variable
docker-compose exec api env | grep API_TOKENS
# Check if token matches
echo "your-token-here" | base64
- Verify Bearer Token Format:
# Correct format
curl -H "Authorization: Bearer your-token-here" http://localhost:8000/health
# NOT: "Authorization: your-token-here"
- Check Token in Environment File:
# In .env file
API_TOKENS=token1,token2,token3
# No spaces around commas
Workers Not Processing Jobs
Symptoms:
- API accepts requests but returns timeouts
- Queue depth increases continuously
- No worker activity in logs
Diagnostics:
# Check Redis connection
docker-compose exec redis redis-cli ping
# Check queue status
docker-compose exec redis redis-cli LLEN llama_queue
# Check worker logs
docker-compose logs Velesio-gpu
Solutions:
- Restart Workers:
docker-compose restart Velesio-gpu
- Check Worker Configuration:
# Verify worker environment
docker-compose exec Velesio-gpu env | grep REDIS
- Clear Stuck Jobs:
# Clear Redis queue
docker-compose exec redis redis-cli FLUSHDB
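Note that FLUSHDB removes every key in the current Redis database, not just stuck jobs, so it is worth peeking at what is actually queued first, using the llama_queue key from the diagnostics above:
# Show the first five queued jobs without removing them
docker-compose exec redis redis-cli LRANGE llama_queue 0 4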
GPU Out of Memory
Error: CUDA out of memory or RuntimeError: CUDA error: out of memory
Solutions:
- Reduce GPU Layers:
# In .env file
GPU_LAYERS=20  # Reduce from default 35
- Use Smaller Model:
# Switch to quantized model
MODEL_URL=https://huggingface.co/model-q4_k_m.gguf
- Reduce Batch Size:
# For Stable Diffusion
SD_BATCH_SIZE=1
- Check GPU Memory:
# Monitor GPU usage
watch -n 1 nvidia-smi
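To see which process is actually holding VRAM (the LLM worker, the Stable Diffusion worker, or something unrelated), nvidia-smi can break usage down per process:
# Per-process GPU memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv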
Slow Inference Speed
Symptoms:
- Text generation takes >30 seconds
- Image generation takes >5 minutes
Solutions:
- Optimize GPU Layers:
# Increase GPU layers if memory allows
GPU_LAYERS=40
- Check CPU Usage:
# If GPU_LAYERS is low, the CPU becomes the bottleneck
htop
- Use Flash Attention:
# For Stable Diffusion
SD_FLASH_ATTENTION=true
- Model Optimization:
# Use optimized model formats
# GGUF with Q4_K_M quantization for LLM
# SafeTensors for Stable Diffusion
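A rough way to quantify "slow" is to time a small request end to end before and after changing settings. A minimal sketch; the endpoint path and payload here are assumptions, so adjust them to whatever your API actually exposes:
# Time a small text request (endpoint path and payload are assumptions)
time curl -s -X POST http://localhost:8000/completion \
  -H "Authorization: Bearer your-token-here" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 32}'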
Service-Specific Issues
Redis Connection Issues
Error: ConnectionError: Error connecting to Redis
Solutions:
- Check Redis Service:
docker-compose ps redis
docker-compose logs redis
- Test Redis Connectivity:
# From within network
docker-compose exec api ping redis
# Test Redis directly
docker-compose exec redis redis-cli ping
- Check Port Binding:
# Verify Redis port
netstat -tlnp | grep 6379
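It is also worth confirming the API container points at the redis service rather than localhost; the exact variable names depend on your .env, so treat this as a quick sanity check:
# Inspect Redis-related settings inside the API container
docker-compose exec api env | grep -i redis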
FastAPI Service Issues
Error: 502 Bad Gateway or API not responding
Solutions:
- Check API Service Health:
docker-compose logs api
curl http://localhost:8000/health
- Verify Port Binding:
docker-compose ps api
netstat -tlnp | grep 8000
- Check Resource Usage:
docker stats
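If the container is up but unresponsive, restarting only the API service and following its startup logs usually surfaces the underlying exception:
# Restart just the API service and watch it come back up
docker-compose restart api
docker-compose logs -f --tail=50 api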
Stable Diffusion Issues
Error: Stable Diffusion worker fails to start
Solutions:
- Check SD Dependencies:
# Verify CUDA version compatibility
docker-compose exec Velesio-gpu nvidia-smi
- Disable SD if Not Needed:
# In .env file
RUN_SD=false
- Check SD Model Loading:
# SD worker logs
docker-compose logs Velesio-gpu | grep -i "stable"
Network Issues
Cannot Access API from External Host
Solutions:
- Check Firewall:
# Allow API port
sudo ufw allow 8000
# Check iptables
sudo iptables -L
- Verify Docker Port Binding:
# Should show 0.0.0.0:8000
docker port Velesio-api
- Test from Different Network:
# From external host
curl http://your-server-ip:8000/health
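If the request from an external host fails, confirm the service still answers locally; when it does, the problem is almost always firewall rules or port binding rather than the API itself:
# Works locally but not externally => firewall or binding issue
curl -v http://localhost:8000/health
ss -tlnp | grep 8000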
SSL/TLS Issues
Error: Certificate verification failed
Solutions:
- Check Certificate:
# Verify certificate chain
openssl s_client -connect your-domain.com:443 -servername your-domain.com
- Update Nginx Configuration:
# In nginx.conf
ssl_certificate /etc/nginx/ssl/fullchain.pem;
ssl_certificate_key /etc/nginx/ssl/privkey.pem;
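After changing certificate paths, validate the configuration and reload the proxy before retesting. This assumes nginx runs as a Compose service named nginx; adjust if yours is named differently:
# Validate and reload nginx (service name is an assumption)
docker-compose exec nginx nginx -t
docker-compose exec nginx nginx -s reload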
Performance Issues
High Memory Usage
Solutions:
- Monitor Memory Usage:
# Check container memory
docker stats
# Check host memory
free -h
- Reduce Model Context:
# Limit context length
MAX_CONTEXT_LENGTH=2048
- Implement Memory Cleanup:
# Clear model cache periodically
docker-compose exec Velesio-gpu pkill -f undreamai_server
Queue Backup
Symptoms:
- Requests pile up in queue
- Response times increase
Solutions:
- Scale Workers:
# Add more worker containers
docker-compose up -d --scale Velesio-gpu=3
- Implement Rate Limiting:
# In nginx.conf
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/m;
- Monitor Queue Depth:
# Check queue status
curl http://localhost:8000/queue/status
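To watch the backlog drain (or grow) in real time, you can also poll the queue length directly in Redis:
# Poll the queue length every 5 seconds
watch -n 5 'docker-compose exec -T redis redis-cli LLEN llama_queue'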
Monitoring and Debugging
Enable Debug Logging
# In .env file
LOG_LEVEL=DEBUG
# Restart services
docker-compose restart
Health Check Script
Create scripts/health-check.sh:
#!/bin/bash
echo "=== Velesio AI Server Health Check ==="
# Check Docker
if ! docker --version >/dev/null 2>&1; then
echo "❌ Docker not installed or not running"
exit 1
fi
# Check GPU
if ! docker run --rm --gpus all nvidia/cuda:11.8-base-ubuntu20.04 nvidia-smi >/dev/null 2>&1; then
echo "❌ GPU not accessible from Docker"
exit 1
fi
# Check services
echo "📋 Service Status:"
docker-compose ps
# Check API health
echo "🔍 API Health:"
curl -s http://localhost:8000/health | jq . || echo "❌ API not responding"
# Check Redis
echo "🔍 Redis Status:"
docker-compose exec -T redis redis-cli ping || echo "❌ Redis not responding"
# Check GPU memory
echo "🎮 GPU Status:"
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
echo "✅ Health check complete"
Log Analysis
# Find errors in logs
docker-compose logs --since="1h" | grep -i error
# Monitor real-time logs
docker-compose logs -f | grep -E "(error|exception|failed)"
# Analyze API response times
docker-compose logs api | grep "completion_request" | tail -100
Getting Help
Debug Information to Collect
When seeking help, provide:
- System Information:
# OS and version
cat /etc/os-release
# Docker version
docker --version
docker-compose --version
# GPU information
nvidia-smi
- Service Status:
docker-compose ps
docker-compose logs --tail=100
- Configuration:
# Environment (remove sensitive data)
cat .env | sed 's/API_TOKENS=.*/API_TOKENS=***REDACTED***/'
- Error Messages: Full error messages and stack traces
Community Support
- GitHub Issues: https://github.com/Velesio/Velesio-aiserver/issues
- Documentation: This documentation site
- Discord: Join our community Discord server
Enterprise Support
For production deployments and enterprise support:
- Email: support@Velesio.com
- Priority support available for enterprise customers
Preventive Measures
Regular Maintenance
#!/bin/bash
# Weekly maintenance script
# Clean up old containers
docker system prune -f
# Update images
docker-compose pull
# Restart services
docker-compose down && docker-compose up -d
# Check disk space
df -h
# Verify GPU health
nvidia-smi
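To run the maintenance script automatically, save it somewhere stable and add a cron entry; the path below is an assumption, so point it at wherever you keep the script:
# Example crontab line: run every Sunday at 03:00 (script path is an assumption)
0 3 * * 0 /opt/velesio/scripts/weekly-maintenance.sh >> /var/log/velesio-maintenance.log 2>&1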
Monitoring Setup
Set up alerts for:
- High GPU memory usage (>90%; a starter check is sketched after this list)
- Queue depth (>10 jobs)
- API response time (>30 seconds)
- Disk space (>80% full)
- Service downtime
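As a starting point for the GPU memory alert, here is a minimal sketch that checks the 90% threshold from the list above; wire the warning into whatever alerting channel you use:
#!/bin/bash
# Alert if GPU memory usage exceeds 90% (threshold from the list above)
used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -1)
total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1)
if [ $(( used * 100 / total )) -gt 90 ]; then
  echo "WARNING: GPU memory above 90% (${used} MiB of ${total} MiB used)"
fi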
Backup Strategy
#!/bin/bash
# Daily backup script
# Backup configuration
cp .env /backups/env-$(date +%Y%m%d).backup
# Backup Redis data
docker-compose exec redis redis-cli BGSAVE
# BGSAVE runs in the background; give it a moment to finish writing dump.rdb before copying
sleep 5
docker cp $(docker-compose ps -q redis):/data/dump.rdb /backups/redis-$(date +%Y%m%d).rdb
# Backup models (if custom)
tar -czf /backups/models-$(date +%Y%m%d).tar.gz gpu/data/models/
Still having issues? Check our GitHub Issues or contact support.