Architecture
Architecture Overview
Velesio AI Server follows a microservice architecture designed for scalability, reliability, and maintainability.
System Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                      Client Applications                      │
│     (Unity Games, Web Apps, Mobile Apps, CLI Tools, etc.)     │
└───────────────────────────────┬───────────────────────────────┘
                                │ HTTP/REST API
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                          API Gateway                          │
│  ┌─────────────┐  ┌─────────────┐  ┌───────────────────────┐  │
│  │   FastAPI   │  │    Auth     │  │  Request Validation   │  │
│  │   Server    │  │  Middleware │  │      & Routing        │  │
│  └─────────────┘  └─────────────┘  └───────────────────────┘  │
└───────────────────────────────┬───────────────────────────────┘
                                │ Job Queue
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                        Message Broker                         │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                       Redis Queue                       │  │
│  │  • Job queuing & distribution                           │  │
│  │  • Result storage & retrieval                           │  │
│  │  • Worker coordination                                  │  │
│  └─────────────────────────────────────────────────────────┘  │
└──────────────┬───────────────────────────────┬────────────────┘
               │ Pull Jobs                     │ Pull Jobs
               ▼                               ▼
┌──────────────────────────┐    ┌───────────────────────────────┐
│        LLM Worker        │    │    Stable Diffusion Worker    │
│  ┌────────────────────┐  │    │  ┌─────────────────────────┐  │
│  │  undreamai_server  │  │    │  │      Automatic1111      │  │
│  │  (llama.cpp fork)  │  │    │  │          WebUI          │  │
│  └────────────────────┘  │    │  └─────────────────────────┘  │
│  ┌────────────────────┐  │    │  ┌─────────────────────────┐  │
│  │    GGUF Models     │  │    │  │    Stable Diffusion     │  │
│  │      (Text)        │  │    │  │         Models          │  │
│  └────────────────────┘  │    │  └─────────────────────────┘  │
└────────────┬─────────────┘    └──────────────┬────────────────┘
             │                                 │
             └────────────────┬────────────────┘
                              │ Results
                              ▼
┌───────────────────────────────────────────────────────────────┐
│                       Monitoring Stack                        │
│  ┌─────────────┐  ┌─────────────┐  ┌───────────────────────┐  │
│  │ Prometheus  │  │   Grafana   │  │   Log Aggregation     │  │
│  │   Metrics   │  │  Dashboards │  │        (Loki)         │  │
│  └─────────────┘  └─────────────┘  └───────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
```
Independent Components
Velesio AI Server consists of three independent, deployable components that can run together or separately:
| Component | Description | Documentation |
|---|---|---|
| API Service | FastAPI gateway handling HTTP requests and job queuing | api/ |
| GPU Workers | AI inference engines for LLM and Stable Diffusion | gpu/ |
| Monitoring Stack | Observability with Grafana dashboards and metrics | monitoring/ |
Each component can be:
- Deployed together for a complete, integrated solution
- Deployed separately across multiple servers for scalability
- Mixed and matched based on your infrastructure requirements
Core Components
1. API Gateway (FastAPI)
Location: api/
The API Gateway serves as the entry point for all client requests:
- Authentication: Bearer token validation
- Request Validation: Pydantic models ensure data integrity
- Job Queuing: Converts HTTP requests to Redis jobs
- Response Handling: Polls Redis for job results and returns to clients
- Documentation: Auto-generated OpenAPI docs at `/docs`
Key Features:
- Async/await for high concurrency
- Structured logging
- Health checks
- Rate limiting (configurable)
- CORS support for web clients
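Putting those features together, the request-to-queue handoff looks roughly like the sketch below. This is a minimal illustration, not the actual api/ code: the `worker.run_llm_job` function path, the comma-separated `API_TOKENS` format, and the returned `job_id` shape are assumptions (the real gateway also polls Redis for the finished result, as described under Data Flow).

```python
# Minimal sketch of the gateway pattern; illustrative, not the real api/ code.
import os
import uuid

from fastapi import FastAPI, Header, HTTPException
from redis import Redis
from rq import Queue

app = FastAPI()
redis_conn = Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379)
llama_queue = Queue("llama_queue", connection=redis_conn)

# Assumption: API_TOKENS holds comma-separated bearer tokens.
API_TOKENS = set(filter(None, os.getenv("API_TOKENS", "").split(",")))

@app.post("/completion")
async def completion(payload: dict, authorization: str = Header(...)):
    token = authorization.removeprefix("Bearer ").strip()
    if token not in API_TOKENS:
        raise HTTPException(status_code=401, detail="Invalid token")
    # Convert the HTTP request into a Redis job; keep the result for one hour.
    job = llama_queue.enqueue(
        "worker.run_llm_job",  # hypothetical worker function path
        payload,
        job_id=str(uuid.uuid4()),
        result_ttl=3600,
    )
    return {"job_id": job.id}
```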
2. Message Broker (Redis)
Technology: Redis with RQ (Redis Queue)
Redis serves multiple roles:
- Job Queue: Stores pending inference jobs
- Result Storage: Temporary storage for completed results
- Worker Coordination: Manages worker pool and load balancing
- Session Management: Optional session state for stateful conversations
Queue Structure:
```
llama_queue:
├── pending_jobs      # Jobs waiting for workers
├── active_jobs       # Currently processing jobs
├── completed_jobs    # Finished jobs (TTL: 1 hour)
└── failed_jobs       # Failed jobs for debugging
```
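The pending/active/completed/failed split maps loosely onto RQ's queue and job registries. The exact key names the project uses are not shown here, so treat the mapping in this sketch as an assumption:

```python
# Inspecting queue depth and job states via RQ registries (assumed mapping).
from redis import Redis
from rq import Queue
from rq.registry import FailedJobRegistry, FinishedJobRegistry, StartedJobRegistry

conn = Redis(host="localhost", port=6379)
q = Queue("llama_queue", connection=conn)

print("pending:  ", q.count)                             # jobs waiting for workers
print("active:   ", StartedJobRegistry(queue=q).count)   # currently processing
print("completed:", FinishedJobRegistry(queue=q).count)  # kept until TTL expires
print("failed:   ", FailedJobRegistry(queue=q).count)    # failed jobs for debugging
```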
3. GPU Workers
LLM Worker (`gpu/llm.py`)
Core Technology: `undreamai_server` (custom llama.cpp fork)
- Model Format: GGUF (GPT-Generated Unified Format)
- Inference Engine: Custom llama.cpp binary optimized for undream.ai
- GPU Acceleration: CUDA with configurable layer offloading
- Context Management: Configurable context windows (2K-32K tokens)
Features:
- Streaming text generation
- Configurable sampling parameters
- Multi-turn conversation support
- Memory-efficient attention mechanisms
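To make the worker side concrete, here is a hypothetical job function. It assumes the `undreamai_server` binary is already running and exposes a llama.cpp-style `/completion` HTTP interface on port 1337 (the port listed under Deployment Modes); none of this is the actual `gpu/llm.py` code.

```python
# Hypothetical worker job function; not the actual gpu/llm.py implementation.
import requests

LLM_URL = "http://localhost:1337/completion"  # assumed local undreamai_server

def run_llm_job(payload: dict) -> dict:
    body = {
        "prompt": payload["prompt"],
        "n_predict": payload.get("max_tokens", 150),  # llama.cpp's parameter name
        "temperature": payload.get("temperature", 0.7),
    }
    resp = requests.post(LLM_URL, json=body, timeout=300)
    resp.raise_for_status()
    return resp.json()  # RQ stores the return value in Redis with the result TTL
```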
Stable Diffusion Worker (`gpu/sd.py`)
Core Technology: Automatic1111 WebUI
- Model Format: Safetensors, CKPT
- Inference Engine: Diffusers + PyTorch
- Acceleration: xFormers, Flash Attention
- Extensions: LoRA, ControlNet, VAE support
Features:
- Text-to-image generation
- Image-to-image transformation
- Inpainting and outpainting
- Batch processing
- Custom model loading
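The SD worker is essentially a proxy between the queue and the WebUI's HTTP API. A sketch, assuming the WebUI listens on port 7860 (per Deployment Modes) with its standard `/sdapi/v1/txt2img` route; this is not the actual `gpu/sd.py` code:

```python
# Sketch of forwarding a queued job to the Automatic1111 WebUI; not gpu/sd.py.
import requests

SD_URL = "http://localhost:7860/sdapi/v1/txt2img"  # assumed local WebUI

def run_sd_job(payload: dict) -> dict:
    resp = requests.post(SD_URL, json=payload, timeout=600)
    resp.raise_for_status()
    result = resp.json()
    # The WebUI returns base64-encoded images plus generation metadata.
    return {"images": result["images"], "info": result.get("info")}
```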
4. Monitoring Stack (Optional Component)
Location: monitoring/
Deployment: Independent, can run separately or alongside main application
The monitoring stack is a standalone component that provides comprehensive observability across all system components. It can be deployed:
- Integrated: Alongside the main application for unified monitoring
- Centralized: On a dedicated monitoring server for multiple Velesio instances
- Development: Locally for debugging and performance analysis
Prometheus
- Metrics Collection: System, Redis, GPU, and custom application metrics
- Alerting: Configurable alerts for system health and performance
- Data Retention: Configurable retention (default: 200h of metric history)
- Remote Monitoring: Can monitor multiple Velesio deployments from a central location
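The custom application metrics mentioned above would typically be exposed from the API process itself; a minimal sketch using prometheus_client, with made-up metric names:

```python
# Illustrative custom application metrics; the metric names are hypothetical.
from prometheus_client import Counter, Histogram, start_http_server

JOBS_QUEUED = Counter("velesio_jobs_queued_total", "Jobs placed on the Redis queue")
JOB_SECONDS = Histogram("velesio_job_seconds", "End-to-end job latency in seconds")

start_http_server(9000)  # Prometheus scrapes http://<host>:9000/metrics
```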
Grafana
- Pre-configured Dashboards: Four auto-provisioned dashboards
  - Redis performance and queue metrics
  - System resources (CPU, memory, disk, network)
  - NVIDIA GPU metrics (utilization, memory, temperature, power)
  - Centralized log viewing and analysis
- Multi-Instance Support: Monitor multiple API/GPU worker deployments
- Alerting: Visual alerts with multiple notification channels
Exporters & Collectors (Auto-deployed)
- Node Exporter: System metrics collection
- Redis Exporter: Queue and performance metrics
- NVIDIA GPU Exporter: GPU utilization and health metrics
- Promtail: Log aggregation from all services
Deployment Flexibility: The monitoring stack can be deployed independently from the main application, making it suitable for:
- Multi-environment monitoring (dev, staging, production)
- Cross-datacenter observability
- Centralized monitoring of distributed Velesio deployments
For detailed setup and deployment options, see the Monitoring Stack Component.
Data Flow
Text Generation Request Flow
1. Client Request

   ```
   POST /completion
   Authorization: Bearer <token>
   {
     "prompt": "Write a story about...",
     "max_tokens": 150,
     "temperature": 0.7
   }
   ```

2. API Processing
   - Validate bearer token against `API_TOKENS`
   - Validate request schema with Pydantic
   - Generate unique job ID
   - Enqueue job to Redis `llama_queue`
3. Worker Processing
   - LLM worker polls the Redis queue
   - Loads request parameters
   - Calls the `undreamai_server` binary
   - Streams or batches the response
   - Stores the result in Redis with a TTL
4. Response
   - API polls Redis for job completion
   - Returns a structured, OpenAI-compatible response
   - Cleans up job data
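From the client's perspective the whole flow is a single blocking HTTP call; an illustrative example, where the gateway address and token are placeholders:

```python
# Illustrative client call; the gateway address and token are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/completion",  # assumed gateway address
    headers={"Authorization": "Bearer <token>"},
    json={"prompt": "Write a story about...", "max_tokens": 150, "temperature": 0.7},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # OpenAI-compatible response
```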
Image Generation Request Flow
1. Client Request

   ```
   POST /sdapi/v1/txt2img
   Authorization: Bearer <token>
   {
     "prompt": "A futuristic cityscape",
     "width": 512,
     "height": 512,
     "steps": 20
   }
   ```

2. Processing: similar to text generation, but:
   - Job routed to the SD worker
   - Image generated by the Automatic1111 WebUI
   - Result encoded as base64
   - Metadata included (seed, model, parameters)
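Since the result comes back base64-encoded, a client typically decodes and saves it; again illustrative, with a placeholder gateway address:

```python
# Illustrative client call that saves the first returned image to disk.
import base64
import requests

resp = requests.post(
    "http://localhost:8000/sdapi/v1/txt2img",  # assumed gateway address
    headers={"Authorization": "Bearer <token>"},
    json={"prompt": "A futuristic cityscape", "width": 512, "height": 512, "steps": 20},
    timeout=300,
)
resp.raise_for_status()
with open("cityscape.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))
```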
Deployment Modes
Standalone Mode (`REMOTE=false`)
- All services run on single machine
- Redis exposed on localhost:6379
- Direct worker access on ports 1337/7860
- Ideal for development and small deployments
Distributed Mode (`REMOTE=true`)
- Workers connect to remote Redis instance
- Horizontal scaling of GPU workers
- Load balancing across multiple machines
- Production-ready architecture
Security Architecture
Authentication
- Bearer Token: Simple, stateless authentication
- Multi-tenant: Support for multiple API keys
- Rate Limiting: Per-token request limits (optional)
Network Security
- Internal Communication: Services communicate via Docker network
- Firewall: Only necessary ports exposed
- TLS: HTTPS termination at reverse proxy (nginx)
Data Security
- No Persistent Storage: Requests/responses not logged by default
- Memory Isolation: Workers run in separate containers
- Model Security: Models stored in protected volumes
Performance Characteristics
Throughput
- LLM: 10-50 tokens/second (depending on model size)
- Stable Diffusion: 1-5 images/minute (depending on resolution)
- Concurrent Requests: Limited by GPU memory and worker count
Latency
- API Response: <100ms for job queuing
- LLM First Token: 200-500ms (cold start: 2-5s)
- SD Image: 10-60 seconds depending on parameters
Scalability
- Horizontal: Add more GPU workers
- Vertical: Increase GPU memory/compute
- Queue Depth: Redis handles thousands of queued jobs
Deployment Patterns
Velesio AI Server is designed with component independence in mind. Each component can be deployed separately or together based on your infrastructure needs:
Monolithic Deployment
Deploy all components on a single server:

```bash
docker-compose up -d                   # API + GPU Workers + Redis
cd monitoring && docker-compose up -d  # Optional monitoring
```
Distributed Deployment
Deploy components across multiple servers:
Server 1 (API + Redis):

```yaml
# api-only docker-compose.yml
services:
  api:
    # ... API configuration
  redis:
    # ... Redis configuration
```
Server 2-N (GPU Workers):
```yaml
# gpu-worker docker-compose.yml
services:
  gpu_worker:
    environment:
      - REMOTE=true
      - REDIS_HOST=server1:6379
```
Monitoring Server (Optional):
```bash
# Central monitoring for all deployments
cd monitoring
# Configure prometheus.yml with remote targets
docker-compose up -d
```
Hybrid Deployment
Mix local and remote components:
- API + Redis locally
- GPU workers on dedicated GPU servers
- Centralized monitoring for multiple environments
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| API | FastAPI + Uvicorn | High-performance async web framework |
| Queue | Redis + RQ | Message broker and job queue |
| LLM Engine | undreamai_server | Custom llama.cpp fork |
| SD Engine | Automatic1111 WebUI | Feature-rich SD implementation |
| Monitoring | Prometheus + Grafana | Metrics and visualization |
| Orchestration | Docker Compose | Service coordination |
| Proxy | Nginx (optional) | Load balancing and TLS |
Next Steps
- API Reference - Detailed endpoint documentation
- Deployment Guide - Production deployment strategies
- Component Documentation - Deep dive into each service