Architecture

Architecture Overview

Velesio AI Server follows a microservice architecture designed for scalability, reliability, and maintainability.

System Architecture

┌──────────────────────────────────────────────────────────────┐
│                     Client Applications                      │
│  (Unity Games, Web Apps, Mobile Apps, CLI Tools, etc.)       │
└───────────────────────┬──────────────────────────────────────┘
                        │ HTTP/REST API
                        │
┌───────────────────────▼──────────────────────────────────────┐
│                         API Gateway                          │
│  ┌─────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│  │   FastAPI   │ │     Auth     │ │   Request Validation   │ │
│  │   Server    │ │  Middleware  │ │       & Routing        │ │
│  └─────────────┘ └──────────────┘ └────────────────────────┘ │
└───────────────────────┬──────────────────────────────────────┘
                        │ Job Queue
                        │
┌───────────────────────▼──────────────────────────────────────┐
│                        Message Broker                        │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │                       Redis Queue                        │ │
│ │  • Job queuing & distribution                            │ │
│ │  • Result storage & retrieval                            │ │
│ │  • Worker coordination                                   │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────┬───────────────────────────────────┬───────────────┘
           │                                   │
           │ Pull Jobs                         │ Pull Jobs
           │                                   │
┌──────────▼────────────┐       ┌──────────────▼───────────────┐
│      LLM Worker       │       │   Stable Diffusion Worker    │
│  ┌─────────────────┐  │       │  ┌────────────────────────┐  │
│  │ undreamai_server│  │       │  │     Automatic1111      │  │
│  │ (llama.cpp fork)│  │       │  │         WebUI          │  │
│  └─────────────────┘  │       │  └────────────────────────┘  │
│  ┌─────────────────┐  │       │  ┌────────────────────────┐  │
│  │   GGUF Models   │  │       │  │    Stable Diffusion    │  │
│  │     (Text)      │  │       │  │         Models         │  │
│  └─────────────────┘  │       │  └────────────────────────┘  │
└──────────┬────────────┘       └──────────────┬───────────────┘
           │                                   │
           └───────────────┬───────────────────┘
                           │ Results
                           │
┌──────────────────────────▼───────────────────────────────────┐
│                       Monitoring Stack                       │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│  │  Prometheus │ │   Grafana   │ │     Log Aggregation     │ │
│  │   Metrics   │ │ Dashboards  │ │         (Loki)          │ │
│  └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Independent Components

Velesio AI Server consists of three independent, deployable components that can run together or separately:

Component            Description                                              Documentation
🌐 API Service       FastAPI gateway handling HTTP requests and job queuing   api/
🎮 GPU Workers       AI inference engines for LLM and Stable Diffusion        gpu/
📊 Monitoring Stack  Observability with Grafana dashboards and metrics        monitoring/

Each component can be:

  • Deployed together for a complete, integrated solution
  • Deployed separately across multiple servers for scalability
  • Mixed and matched based on your infrastructure requirements

Core Components

1. API Gateway (FastAPI)

Location: api/

The API Gateway serves as the entry point for all client requests:

  • Authentication: Bearer token validation
  • Request Validation: Pydantic models ensure data integrity
  • Job Queuing: Converts HTTP requests to Redis jobs
  • Response Handling: Polls Redis for job results and returns to clients
  • Documentation: Auto-generated OpenAPI docs at /docs

Key Features:

  • Async/await for high concurrency
  • Structured logging
  • Health checks
  • Rate limiting (configurable)
  • CORS support for web clients
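
The snippet below is a minimal, hypothetical sketch of this request path: check the bearer token against API_TOKENS, validate the body with a Pydantic model, enqueue the job to Redis, then poll for the result. The key and queue names follow this page; the actual code in api/ may differ.

import asyncio
import json
import os
import uuid

import redis
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379)
API_TOKENS = set(os.getenv("API_TOKENS", "").split(","))  # comma-separated keys

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 150
    temperature: float = 0.7

@app.post("/completion")
async def completion(req: CompletionRequest, authorization: str = Header("")):
    token = authorization.removeprefix("Bearer ").strip()
    if token not in API_TOKENS:
        raise HTTPException(status_code=401, detail="Invalid token")
    job_id = str(uuid.uuid4())
    r.rpush("llama_queue:pending_jobs", json.dumps({"id": job_id, **req.dict()}))
    for _ in range(600):                        # poll Redis for ~60 s
        result = r.get(f"result:{job_id}")
        if result is not None:
            return json.loads(result)           # worker's stored response
        await asyncio.sleep(0.1)
    raise HTTPException(status_code=504, detail="Job timed out")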

2. Message Broker (Redis)

Technology: Redis with RQ (Redis Queue)

Redis serves multiple roles:

  • Job Queue: Stores pending inference jobs
  • Result Storage: Temporary storage for completed results
  • Worker Coordination: Manages worker pool and load balancing
  • Session Management: Optional session state for stateful conversations

Queue Structure:

llama_queue:
├── pending_jobs    # Jobs waiting for workers
├── active_jobs     # Currently processing jobs
├── completed_jobs  # Finished jobs (TTL: 1 hour)
└── failed_jobs     # Failed jobs for debugging
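
A sketch of the job lifecycle against this layout (key names are illustrative): a worker atomically moves a job from pending to active, and the outcome is written back with the one-hour TTL noted above.

import json

import redis

r = redis.Redis(host="localhost", port=6379)

# Atomically move one job from pending to active (blocks up to 5 s).
raw = r.blmove("llama_queue:pending_jobs", "llama_queue:active_jobs", timeout=5)
if raw:
    job = json.loads(raw)
    # ... run inference ...
    r.lrem("llama_queue:active_jobs", 1, raw)                  # processing done
    r.setex(f"result:{job['id']}", 3600, json.dumps({"text": "..."}))  # TTL: 1 h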

3. GPU Workers

LLM Worker (gpu/llm.py)

Core Technology: undreamai_server (custom llama.cpp fork)

  • Model Format: GGUF (GPT-Generated Unified Format)
  • Inference Engine: Custom llama.cpp binary optimized for undream.ai
  • GPU Acceleration: CUDA with configurable layer offloading
  • Context Management: Configurable context windows (2K-32K tokens)

Features:

  • Streaming text generation
  • Configurable sampling parameters
  • Multi-turn conversation support
  • Memory-efficient attention mechanisms
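
The loop below sketches how gpu/llm.py might bridge the queue and the inference engine. The /completion endpoint and n_predict parameter follow upstream llama.cpp server conventions; that undreamai_server exposes the same interface, here on port 1337 (the direct-access port listed under Deployment Modes), is an assumption.

import json

import redis
import requests

r = redis.Redis(host="localhost", port=6379)

while True:
    _, raw = r.blpop("llama_queue:pending_jobs")    # block until a job arrives
    job = json.loads(raw)
    # Forward to the local inference server (endpoint and parameter names
    # assumed to match upstream llama.cpp's HTTP server).
    resp = requests.post(
        "http://localhost:1337/completion",
        json={
            "prompt": job["prompt"],
            "n_predict": job.get("max_tokens", 150),
            "temperature": job.get("temperature", 0.7),
        },
    )
    r.setex(f"result:{job['id']}", 3600, json.dumps(resp.json()))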

Stable Diffusion Worker (gpu/sd.py)

Core Technology: Automatic1111 WebUI

  • Model Format: Safetensors, CKPT
  • Inference Engine: Diffusers + PyTorch
  • Acceleration: xFormers, Flash Attention
  • Extensions: LoRA, ControlNet, VAE support

Features:

  • Text-to-image generation
  • Image-to-image transformation
  • Inpainting and outpainting
  • Batch processing
  • Custom model loading
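
In the same spirit, a hedged sketch of how gpu/sd.py might hand a queued job to the WebUI. The /sdapi/v1/txt2img route and its fields are the standard Automatic1111 API (listening on port 7860 in standalone mode); the queue name and surrounding plumbing are assumptions.

import json

import redis
import requests

r = redis.Redis(host="localhost", port=6379)

_, raw = r.blpop("sd_queue:pending_jobs")        # queue name is illustrative
job = json.loads(raw)
resp = requests.post(
    "http://localhost:7860/sdapi/v1/txt2img",    # Automatic1111 WebUI
    json={
        "prompt": job["prompt"],
        "width": job.get("width", 512),
        "height": job.get("height", 512),
        "steps": job.get("steps", 20),
    },
)
# The WebUI returns base64-encoded images plus generation metadata.
r.setex(f"result:{job['id']}", 3600, json.dumps(resp.json()))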

4. Monitoring Stack (Optional Component)

Location: monitoring/
Deployment: Independent, can run separately or alongside main application

The monitoring stack is a standalone component that provides comprehensive observability across all system components. It can be deployed:

  • Integrated: Alongside the main application for unified monitoring
  • Centralized: On a dedicated monitoring server for multiple Velesio instances
  • Development: Locally for debugging and performance analysis

Prometheus

  • Metrics Collection: System, Redis, GPU, and custom application metrics (see the sketch after this list)
  • Alerting: Configurable alerts for system health and performance
  • Data Retention: Configurable retention (default: 200h of metric history)
  • Remote Monitoring: Can monitor multiple Velesio deployments from a central location
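
As an illustration of what a custom application metric can look like, a tiny exporter publishing queue depth might be written as follows (metric and key names are made up for the example, not the project's actual exporter):

import time

import redis
from prometheus_client import Gauge, start_http_server

r = redis.Redis(host="localhost", port=6379)
queue_depth = Gauge("velesio_pending_jobs", "Jobs waiting in llama_queue")

start_http_server(9200)      # Prometheus scrapes http://<host>:9200/metrics
while True:
    queue_depth.set(r.llen("llama_queue:pending_jobs"))
    time.sleep(5)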

Grafana

  • Pre-configured Dashboards: Four auto-provisioned dashboards
    • Redis performance and queue metrics
    • System resources (CPU, memory, disk, network)
    • NVIDIA GPU metrics (utilization, memory, temperature, power)
    • Centralized log viewing and analysis
  • Multi-Instance Support: Monitor multiple API/GPU worker deployments
  • Alerting: Visual alerts with multiple notification channels

Exporters & Collectors (Auto-deployed)

  • Node Exporter: System metrics collection
  • Redis Exporter: Queue and performance metrics
  • NVIDIA GPU Exporter: GPU utilization and health metrics
  • Promtail: Log aggregation from all services

Deployment Flexibility: The monitoring stack can be deployed independently from the main application, making it suitable for:

  • Multi-environment monitoring (dev, staging, production)
  • Cross-datacenter observability
  • Centralized monitoring of distributed Velesio deployments

For detailed setup and deployment options, see the Monitoring Stack Component.

Data Flow

Text Generation Request Flow

  1. Client Request
    POST /completion
    Authorization: Bearer <token>
    {
      "prompt": "Write a story about...",
      "max_tokens": 150,
      "temperature": 0.7
    }
    
  2. API Processing
    • Validate bearer token against API_TOKENS
    • Validate request schema with Pydantic
    • Generate unique job ID
    • Enqueue job to Redis llama_queue
  3. Worker Processing
    • LLM worker polls Redis queue
    • Loads request parameters
    • Calls undreamai_server binary
    • Streams or batches response
    • Stores result in Redis with TTL
  4. Response
    • API polls Redis for job completion
    • Returns structured OpenAI-compatible response
    • Cleans up job data
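
Putting the four steps together, a client call looks like this (the gateway port is assumed to be 8000; substitute your deployment's URL and token):

import requests

resp = requests.post(
    "http://localhost:8000/completion",              # assumed API port
    headers={"Authorization": "Bearer <token>"},
    json={
        "prompt": "Write a story about...",
        "max_tokens": 150,
        "temperature": 0.7,
    },
)
resp.raise_for_status()
print(resp.json())    # OpenAI-compatible completion payload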

Image Generation Request Flow

  1. Client Request
    POST /sdapi/v1/txt2img
    Authorization: Bearer <token>
    {
      "prompt": "A futuristic cityscape",
      "width": 512,
      "height": 512,
      "steps": 20
    }
    
  2. Processing: follows the same path as the text flow, except:
    • Job routed to SD worker
    • Image generated by Automatic1111 WebUI
    • Result encoded as base64
    • Metadata included (seed, model, parameters)
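
A client-side sketch of this flow, decoding the first returned image (the "images" field follows the Automatic1111 response shape; URL, port, and token are placeholders):

import base64

import requests

resp = requests.post(
    "http://localhost:8000/sdapi/v1/txt2img",        # assumed API port
    headers={"Authorization": "Bearer <token>"},
    json={"prompt": "A futuristic cityscape", "width": 512,
          "height": 512, "steps": 20},
)
resp.raise_for_status()
png = base64.b64decode(resp.json()["images"][0])     # base64 -> raw PNG bytes
with open("cityscape.png", "wb") as f:
    f.write(png)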

Deployment Modes

Standalone Mode (REMOTE=false)

  • All services run on single machine
  • Redis exposed on localhost:6379
  • Direct worker access on ports 1337/7860
  • Ideal for development and small deployments

Distributed Mode (REMOTE=true)

  • Workers connect to remote Redis instance
  • Horizontal scaling of GPU workers
  • Load balancing across multiple machines
  • Production-ready architecture
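
A worker might select its Redis endpoint from the REMOTE and REDIS_HOST variables used in the gpu-worker compose file under Deployment Patterns; the parsing below is a sketch, not necessarily what gpu/ actually does:

import os

import redis

if os.getenv("REMOTE", "false").lower() == "true":
    # Distributed mode: REDIS_HOST may carry a port, e.g. "server1:6379".
    host, _, port = os.environ["REDIS_HOST"].partition(":")
    r = redis.Redis(host=host, port=int(port or 6379))
else:
    r = redis.Redis(host="localhost", port=6379)     # standalone mode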

Security Architecture

Authentication

  • Bearer Token: Simple, stateless authentication
  • Multi-tenant: Support for multiple API keys
  • Rate Limiting: Per-token request limits (optional)
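
A minimal sketch of the token check as a FastAPI dependency, assuming API_TOKENS is a comma-separated list of keys (the real middleware in api/ may be structured differently):

import os

from fastapi import Header, HTTPException

API_TOKENS = set(os.getenv("API_TOKENS", "").split(","))

def require_token(authorization: str = Header("")) -> str:
    token = authorization.removeprefix("Bearer ").strip()
    if token not in API_TOKENS:
        raise HTTPException(status_code=401, detail="Invalid or missing token")
    return token    # the token identifies the tenant for per-token rate limits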

Network Security

  • Internal Communication: Services communicate via Docker network
  • Firewall: Only necessary ports exposed
  • TLS: HTTPS termination at reverse proxy (nginx)

Data Security

  • No Persistent Storage: Requests/responses not logged by default
  • Memory Isolation: Workers run in separate containers
  • Model Security: Models stored in protected volumes

Performance Characteristics

Throughput

  • LLM: 10-50 tokens/second (depending on model size)
  • Stable Diffusion: 1-5 images/minute (depending on resolution)
  • Concurrent Requests: Limited by GPU memory and worker count

Latency

  • API Response: <100ms for job queuing
  • LLM First Token: 200-500ms (cold start: 2-5s)
  • SD Image: 10-60 seconds depending on parameters

Scalability

  • Horizontal: Add more GPU workers
  • Vertical: Increase GPU memory/compute
  • Queue Depth: Redis handles thousands of queued jobs

Deployment Patterns

Velesio AI Server is designed with component independence in mind. Each component can be deployed separately or together based on your infrastructure needs:

🔗 Monolithic Deployment

Deploy all components on a single server:

docker-compose up -d  # API + GPU Workers + Redis
cd monitoring && docker-compose up -d  # Optional monitoring

🔲 Distributed Deployment

Deploy components across multiple servers:

Server 1 (API + Redis):

# api-only docker-compose.yml
services:
  api:
    # ... API configuration
  redis:
    # ... Redis configuration

Server 2-N (GPU Workers):

# gpu-worker docker-compose.yml
services:
  gpu_worker:
    environment:
      - REMOTE=true
      - REDIS_HOST=server1:6379

Monitoring Server (Optional):

# Central monitoring for all deployments
cd monitoring
# Configure prometheus.yml with remote targets
docker-compose up -d

🎯 Hybrid Deployment

Mix local and remote components:

  • API + Redis locally
  • GPU workers on dedicated GPU servers
  • Centralized monitoring for multiple environments

Technology Stack

Component      Technology            Purpose
API            FastAPI + Uvicorn     High-performance async web framework
Queue          Redis + RQ            Message broker and job queue
LLM Engine     undreamai_server      Custom llama.cpp fork
SD Engine      Automatic1111 WebUI   Feature-rich SD implementation
Monitoring     Prometheus + Grafana  Metrics and visualization
Orchestration  Docker Compose        Service coordination
Proxy          Nginx (optional)      Load balancing and TLS

Next Steps