Velesio AI Server
A high-performance, microservice-based AI inference server designed for scalable LLM and Stable Diffusion workloads.
Overview
Velesio AI Server is a production-ready AI inference platform that provides:
- LLM Text Generation via llama.cpp or Ollama
- Stable Diffusion Image Generation with WebUI support
- Redis Queue Architecture for scalable job processing
- Docker-based Deployment with GPU acceleration
- Built-in Monitoring with Grafana and Prometheus
- Unity Integration ready endpoints
Architecture
┌──────────────┐      ┌───────────┐      ┌──────────────┐
│     API      │─────▶│   Redis   │─────▶│ GPU Workers  │
│  (FastAPI)   │      │   Queue   │      │  (LLM + SD)  │
└──────────────┘      └───────────┘      └──────────────┘
        │                                        │
        │             ┌──────────────┐           │
        └────────────▶│  Monitoring  │◀──────────┘
                      │(Grafana+Prom)│
                      └──────────────┘
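The request path above can be sketched in a few lines of Python with the redis-py client. This is only an illustration of the queue pattern, not the server's actual code: the queue key `inference_jobs`, the job payload shape, and the `result:` key prefix are assumptions invented for this example.

```python
import json
import uuid

import redis  # redis-py

r = redis.Redis(host="localhost", port=6379)

# API side: wrap the request in a job and push it onto the queue
# (queue key and payload shape are hypothetical).
job = {"id": str(uuid.uuid4()), "type": "llm", "prompt": "Hello there"}
r.lpush("inference_jobs", json.dumps(job))

# GPU worker side: block until a job arrives, run inference, store the result.
_queue, raw = r.brpop("inference_jobs")
job = json.loads(raw)
result = {"id": job["id"], "text": "(model output would go here)"}
r.set(f"result:{job['id']}", json.dumps(result))
```

Decoupling producer and consumer this way is what allows more GPU workers to be added without touching the API layer.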
Key Features
High Performance
- Standard llama.cpp server or Ollama for flexible LLM deployment
- GPU acceleration with CUDA support
- Asynchronous job processing via Redis Queue
Easy Setup
- Docker Compose deployment
- Automatic model downloading
- Ollama for simplified model management (see the example after this list)
- Pre-configured monitoring stack
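As a concrete illustration of the Ollama bullet above, a model can be pulled and queried through Ollama's standard HTTP API on port 11434 once the worker is running with RUN_OLLAMA=true. The model name below is just a placeholder; substitute whichever model you actually deploy.

```python
import requests

OLLAMA = "http://localhost:11434"

# Pull a model by name (placeholder; pick the model you actually want).
requests.post(f"{OLLAMA}/api/pull", json={"model": "llama3"}, timeout=600)

# One-off, non-streaming generation against the pulled model.
resp = requests.post(
    f"{OLLAMA}/api/generate",
    json={"model": "llama3", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```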
Unity Ready
- Compatible with the "LLM for Unity" asset
- Base64 image encoding for seamless integration (sketched after this list)
- Standardized API endpoints
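Base64 keeps generated images transportable as plain strings inside JSON, which is what makes the Unity handoff simple. The sketch below only illustrates the round trip; the `image_base64` field name and file handling are made up for the example and are not the server's actual response schema.

```python
import base64

# Server side: pack a generated PNG into a JSON-friendly string
# (the "image_base64" field name is illustrative, not the real schema).
with open("generated.png", "rb") as f:
    payload = {"image_base64": base64.b64encode(f.read()).decode("ascii")}

# Client side: turn the string back into raw PNG bytes.
png_bytes = base64.b64decode(payload["image_base64"])
with open("decoded.png", "wb") as f:
    f.write(png_bytes)
```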
Production Monitoring
- Real-time metrics with Prometheus
- Visual dashboards in Grafana
- Redis queue monitoring (sketched after this list)
- GPU utilization tracking
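Queue depth is a typical signal behind those dashboards. As a hedged sketch of how such a gauge could be exported for Prometheus to scrape (the metric name, queue key, and scrape port 8001 are assumptions, not the stack's shipped configuration):

```python
import time

import redis
from prometheus_client import Gauge, start_http_server

# Hypothetical metric and queue key; the shipped stack may name these differently.
queue_depth = Gauge("velesio_queue_depth", "Pending jobs in the Redis queue")
r = redis.Redis(host="localhost", port=6379)

start_http_server(8001)  # expose /metrics for Prometheus to scrape
while True:
    queue_depth.set(r.llen("inference_jobs"))
    time.sleep(5)
```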
Services
| Service | Port | Description |
|---|---|---|
| API | 8000 | FastAPI web server |
| Redis | 6379 | Message queue |
| LLM Worker | 1337 | Direct LLM access (when API=false) |
| Ollama | 11434 | Ollama API server (when RUN_OLLAMA=true) |
| Stable Diffusion | 7860 | WebUI interface (when RUN_SD=true) |
| Grafana | 3000 | Monitoring dashboard |
| Prometheus | 9090 | Metrics collection |
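In practice a client only ever talks to the API service on port 8000 and lets the queue route the work to a GPU worker. The endpoint path and request body below are hypothetical placeholders used for illustration; see the API documentation for the real routes.

```python
import requests

# Hypothetical route and payload, shown only to illustrate the call shape.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Write a haiku about GPUs.", "max_tokens": 64},
    timeout=60,
)
print(resp.status_code, resp.json())
```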
Next Steps
- Quickstart Cloud Infra - Cloud deployment guide
- Quickstart Self hosted - Self-hosted setup
- Ollama Integration - Use Ollama for LLM inference
- Architecture - System design deep dive
- Deployment Guide - Production deployment strategies
- Model Templates - Model configurations
Need help? Check our troubleshooting guide or open an issue on GitHub.