Velesio AI Server
A high-performance, microservice-based AI inference server designed for scalable LLM and Stable Diffusion workloads.
Overview
Velesio AI Server is a production-ready AI inference platform that provides:
- LLM Text Generation via llama.cpp or Ollama
- Stable Diffusion Image Generation with WebUI support
- Redis Queue Architecture for scalable job processing
- Docker-based Deployment with GPU acceleration
- Built-in Monitoring with Grafana and Prometheus
- Unity Integration ready endpoints
Architecture
┌──────────────┐      ┌───────────┐      ┌──────────────┐
│     API      │─────▶│   Redis   │─────▶│ GPU Workers  │
│  (FastAPI)   │      │   Queue   │      │  (LLM + SD)  │
└──────────────┘      └───────────┘      └──────────────┘
        │                                        │
        │             ┌──────────────┐           │
        └────────────▶│  Monitoring  │◀──────────┘
                      │(Grafana+Prom)│
                      └──────────────┘
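The request path above can be sketched in a few lines of Python with the redis-py client. This is only an illustration of the queue pattern, not the server's actual code: the queue key `inference_jobs`, the job payload shape, and the `result:` key prefix are assumptions invented for this example.

```python
import json
import uuid

import redis  # redis-py

r = redis.Redis(host="localhost", port=6379)

# API side: wrap the request in a job and push it onto the queue
# (queue key and payload shape are hypothetical).
job = {"id": str(uuid.uuid4()), "type": "llm", "prompt": "Hello there"}
r.lpush("inference_jobs", json.dumps(job))

# GPU worker side: block until a job arrives, run inference, store the result.
_queue, raw = r.brpop("inference_jobs")
job = json.loads(raw)
result = {"id": job["id"], "text": "(model output would go here)"}
r.set(f"result:{job['id']}", json.dumps(result))
```

Decoupling producer and consumer this way is what allows more GPU workers to be added without touching the API layer.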
Key Features
High Performance
- Standard llama.cpp server or Ollama for flexible LLM deployment
- GPU acceleration with CUDA support
- Asynchronous job processing via Redis Queue
Easy Setup
- Docker Compose deployment
- Automatic model downloading
- Ollama for simplified model management (see the example after this list)
- Pre-configured monitoring stack
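As a concrete illustration of the Ollama bullet above, a model can be pulled and queried through Ollama's standard HTTP API on port 11434 once the worker is running with RUN_OLLAMA=true. The model name below is just a placeholder; substitute whichever model you actually deploy.

```python
import requests

OLLAMA = "http://localhost:11434"

# Pull a model by name (placeholder; pick the model you actually want).
requests.post(f"{OLLAMA}/api/pull", json={"model": "llama3"}, timeout=600)

# One-off, non-streaming generation against the pulled model.
resp = requests.post(
    f"{OLLAMA}/api/generate",
    json={"model": "llama3", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```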
Unity Ready
- Compatible with the "LLM for Unity" asset
- Base64 image encoding for seamless integration (sketched after this list)
- Standardized API endpoints
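Base64 keeps generated images transportable as plain strings inside JSON, which is what makes the Unity handoff simple. The sketch below only illustrates the round trip; the `image_base64` field name and file handling are made up for the example and are not the server's actual response schema.

```python
import base64

# Server side: pack a generated PNG into a JSON-friendly string
# (the "image_base64" field name is illustrative, not the real schema).
with open("generated.png", "rb") as f:
    payload = {"image_base64": base64.b64encode(f.read()).decode("ascii")}

# Client side: turn the string back into raw PNG bytes.
png_bytes = base64.b64decode(payload["image_base64"])
with open("decoded.png", "wb") as f:
    f.write(png_bytes)
```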
Production Monitoring
- Real-time metrics with Prometheus
- Visual dashboards in Grafana
- Redis queue monitoring (sketched after this list)
- GPU utilization tracking
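Queue depth is a typical signal behind those dashboards. As a hedged sketch of how such a gauge could be exported for Prometheus to scrape (the metric name, queue key, and scrape port 8001 are assumptions, not the stack's shipped configuration):

```python
import time

import redis
from prometheus_client import Gauge, start_http_server

# Hypothetical metric and queue key; the shipped stack may name these differently.
queue_depth = Gauge("velesio_queue_depth", "Pending jobs in the Redis queue")
r = redis.Redis(host="localhost", port=6379)

start_http_server(8001)  # expose /metrics for Prometheus to scrape
while True:
    queue_depth.set(r.llen("inference_jobs"))
    time.sleep(5)
```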
Services
| Service | Port | Description |
|---|---|---|
| API | 8000 | FastAPI web server |
| Redis | 6379 | Message queue |
| LLM Worker | 1337 | Direct LLM access (when API=false) |
| Ollama | 11434 | Ollama API server (when RUN_OLLAMA=true) |
| Stable Diffusion | 7860 | WebUI interface (when RUN_SD=true) |
| Grafana | 3000 | Monitoring dashboard |
| Prometheus | 9090 | Metrics collection |
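In practice a client only ever talks to the API service on port 8000 and lets the queue route the work to a GPU worker. The endpoint path and request body below are hypothetical placeholders used for illustration; see the API documentation for the real routes.

```python
import requests

# Hypothetical route and payload, shown only to illustrate the call shape.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Write a haiku about GPUs.", "max_tokens": 64},
    timeout=60,
)
print(resp.status_code, resp.json())
```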
Next Steps
- Quickstart Cloud Infra - Cloud deployment guide
- Quickstart Self hosted - Self-hosted setup
- Ollama Integration - Use Ollama for LLM inference
- Architecture - System design deep dive
- Deployment Guide - Production deployment strategies
- Model Templates - Model configurations
Need help? Check our troubleshooting guide or open an issue on GitHub.