Monitoring Stack

A comprehensive observability solution that provides real-time monitoring, metrics collection, and alerting for all Velesio AI Server components.

Overview

Location: monitoring/
Technology: Grafana + Prometheus + Exporters
Container: monitoring-stack
Ports: 3000 (Grafana), 9090 (Prometheus)

The monitoring stack is a standalone, optional component that can be deployed:

  • Alongside the main application for integrated monitoring
  • Separately for external monitoring of multiple Velesio deployments
  • Independently for development and testing environments

Deployment Flexibility

🔗 Integrated Deployment

Deploy with the main application for seamless monitoring:

# Start main application
docker-compose up -d

# Start monitoring stack
cd monitoring
docker-compose up -d

🔲 Standalone Deployment

Deploy monitoring independently on a dedicated monitoring server:

# On monitoring server
git clone https://github.com/Velesio/Velesio-aiserver.git
cd Velesio-aiserver/monitoring

# Configure remote targets in prometheus.yml
docker-compose up -d

🎯 Selective Monitoring

Monitor only specific components by configuring Prometheus targets:

# Monitor only API service
- job_name: 'Velesio-api'
  static_configs:
    - targets: ['remote-host:8000']

What It Monitors

The monitoring stack provides real-time insights into:

  • System Performance: CPU, memory, disk, and network utilization
  • GPU Metrics: NVIDIA GPU utilization, memory, temperature, and power consumption
  • Redis Performance: Queue depth, memory usage, connections, and command statistics
  • Application Logs: Centralized log aggregation and analysis
  • Container Metrics: Docker container resource usage and health

Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Exporters  │────│ Prometheus  │────│   Grafana   │
│ (Metrics)   │    │ (Storage)   │    │ (Dashboard) │
└─────────────┘    └─────────────┘    └─────────────┘
       │                                      │
       │           ┌─────────────┐            │
       └───────────│    Loki     │────────────┘
                   │(Log Storage)│
                   └─────────────┘
                          │
                   ┌─────────────┐
                   │  Promtail   │
                   │(Log Collect)│
                   └─────────────┘

Components

Core Services

Service       Port   Purpose
Grafana       3000   Visualization dashboards and alerts
Prometheus    9090   Metrics collection and time-series storage
Loki          3100   Log aggregation and storage
Promtail      -      Log collection agent

Exporters

Exporter              Port   Metrics
Node Exporter         9100   System metrics (CPU, memory, disk, network)
Redis Exporter        9121   Redis performance and queue metrics
NVIDIA GPU Exporter   9835   GPU utilization, memory, temperature, power
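
Prometheus reaches these exporters through its scrape configuration. As a point of reference, a minimal scrape section in prometheus.yml wiring up the three exporters could look like the sketch below; the job names and target hostnames are illustrative and should be adjusted to match the actual Compose service names:

scrape_configs:
  # System metrics from Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Redis performance and queue metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  # GPU utilization, memory, temperature, and power
  - job_name: 'gpu'
    static_configs:
      - targets: ['nvidia-gpu-exporter:9835']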

Pre-configured Dashboards

The monitoring stack includes four auto-provisioned dashboards:

🖥️ Node Exporter Full

  • CPU utilization and load averages
  • Memory usage and swap statistics
  • Disk I/O and filesystem usage
  • Network traffic and interface statistics
  • System uptime and process counts

📊 Redis Dashboard

  • Memory usage and keyspace statistics
  • Command execution rates and latency
  • Client connections and blocked clients
  • Replication status and lag
  • Queue depth monitoring for job processing

🎮 NVIDIA GPU Metrics

  • GPU utilization percentage
  • Memory usage and allocation
  • Temperature monitoring
  • Power consumption and limits
  • Fan speed and clock frequencies

📝 Velesio Logs

  • Centralized log viewing from all services
  • Log level filtering (INFO, WARNING, ERROR)
  • Search and filtering capabilities
  • Real-time log streaming

Quick Start

Prerequisites

  • Docker and Docker Compose installed
  • NVIDIA drivers (for GPU monitoring)
  • Running Velesio AI Server instance

Setup

  1. Configure Redis connection (if using external Redis):
    cd monitoring
    cp .env.example .env
    # Edit .env with your Redis connection details
    
  2. Start monitoring stack:
    cd monitoring
    docker-compose up -d
    
  3. Access Grafana:
    • URL: http://localhost:3000
    • Username: admin
    • Password: admin

For GPU Monitoring

Enable GPU monitoring by starting with the GPU profile:

docker-compose --profile gpu up -d
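
The gpu profile gates the NVIDIA exporter so it only starts on hosts that actually have a GPU. A minimal sketch of how that can be expressed in monitoring/docker-compose.yml follows; the exporter image, service name, and device reservation are assumptions and may differ from the repository's actual file:

services:
  nvidia-gpu-exporter:
    image: utkuozdemir/nvidia_gpu_exporter:latest  # illustrative image choice
    profiles: ["gpu"]                               # started only with --profile gpu
    ports:
      - "9835:9835"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]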

Configuration

Environment Variables

# Redis connection (optional, defaults to localhost)
REDIS_HOST=localhost:6379
REDIS_PASS=your_redis_password
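
These values are consumed by the Redis exporter service. A rough sketch of the wiring in monitoring/docker-compose.yml is shown below; the image choice is an assumption, and REDIS_ADDR / REDIS_PASSWORD are the environment variables understood by the common oliver006/redis_exporter image:

services:
  redis-exporter:
    image: oliver006/redis_exporter:latest  # assumed exporter image
    environment:
      - REDIS_ADDR=redis://${REDIS_HOST}
      - REDIS_PASSWORD=${REDIS_PASS}
    ports:
      - "9121:9121"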

Dashboard Customization

Dashboards are automatically loaded from grafana/dashboards/. To add custom dashboards:

  1. Export dashboard JSON from Grafana
  2. Place JSON file in monitoring/grafana/dashboards/
  3. Restart Grafana: docker-compose restart grafana
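
For reference, the auto-loading is driven by a small Grafana provisioning file along the lines of the sketch below; the exact path and provider name used in this repository may differ:

# grafana/provisioning/dashboards/default.yml (sketch)
apiVersion: 1
providers:
  - name: 'velesio'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards  # where the dashboard JSON files are mounted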

Retention Settings

Data retention is controlled by a Prometheus startup flag, set under the Prometheus service's command in docker-compose.yml:

command:
  - '--storage.tsdb.retention.time=200h'  # Adjust as needed

Usage

Monitoring Velesio AI Server

  1. API Performance: Monitor request rates and response times
  2. Queue Health: Check Redis queue depth and processing rates
  3. GPU Utilization: Track inference workload and memory usage
  4. System Resources: Ensure adequate CPU, memory, and disk space

Alert Setup

Configure alerts in Grafana for conditions such as the following; a Prometheus rule-file sketch appears after the list:

  • High GPU memory usage (>90%)
  • Redis queue backlog (>1000 jobs)
  • High system load (>80%)
  • Disk space usage (>85%)
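
If you prefer to keep thresholds in version control alongside prometheus.yml, the same conditions can be expressed as Prometheus alerting rules in a rule file referenced via rule_files. The sketch below covers the two node-level thresholds using metric names that Node Exporter exposes; GPU-memory and queue-depth rules follow the same pattern, but verify the exact metric names your GPU and Redis exporter versions emit before relying on them:

groups:
  - name: velesio-system
    rules:
      # Fire when any real filesystem has been more than 85% full for 10 minutes
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem over 85% full on {{ $labels.instance }}"

      # Fire when the 5-minute load average exceeds 80% of the CPU count for 15 minutes
      - alert: HighSystemLoad
        expr: node_load5 > 0.8 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Sustained high load on {{ $labels.instance }}"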

Log Analysis

Use the Velesio Logs dashboard to:

  • Debug API request failures
  • Monitor worker job processing
  • Track model loading times
  • Investigate performance issues
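
Logs reach this dashboard through the Promtail to Loki pipeline shown in the architecture diagram. The stack ships its own Promtail configuration, but a generic sketch of Docker log collection feeding Loki looks roughly like this (paths, labels, and the config filename are assumptions):

# promtail-config.yml (generic sketch)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push  # Loki port from the services table
scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log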

Troubleshooting

Common Issues

Grafana not loading dashboards:

# Check dashboard provisioning
docker logs grafana
# Restart with clean data
docker-compose down -v && docker-compose up -d

GPU metrics not appearing:

# Verify NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# Check exporter logs
docker logs nvidia-gpu-exporter

Redis connection failed:

# Verify Redis accessibility from a container that has redis-cli available
docker run --rm redis:alpine redis-cli -h your-redis-host -p 6379 ping
# Check network connectivity
docker-compose logs redis-exporter

Performance Tuning

For high-volume environments (a configuration sketch follows this list):

  1. Increase scrape intervals in prometheus.yml
  2. Adjust retention periods based on storage capacity
  3. Configure log rotation for Loki
  4. Set resource limits in Docker Compose
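
A rough illustration of the first and last points; the values are assumptions to be tuned for your environment:

# prometheus.yml: raise the global scrape and evaluation intervals (sketch)
global:
  scrape_interval: 30s       # e.g. up from a 15s default to reduce ingest load
  evaluation_interval: 30s

# docker-compose.yml: cap Prometheus resources (sketch; requires a recent Compose version)
services:
  prometheus:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 2g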

Integration

With Main Application

The monitoring stack is designed to work alongside the main Velesio AI Server:

# Start main application
docker-compose up -d

# Start monitoring in separate terminal
cd monitoring
docker-compose up -d

Custom Metrics

Add application-specific metrics by:

  1. Exposing a metrics endpoint in your service
  2. Adding a scrape config to prometheus.yml (see the sketch below)
  3. Creating a custom Grafana dashboard
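
For step 2, the addition to prometheus.yml might look like the sketch below; the job name, metrics path, and target are assumptions, and the service must actually expose metrics in the Prometheus text format at that path:

scrape_configs:
  - job_name: 'velesio-api'
    metrics_path: /metrics          # assumed endpoint on your service
    scrape_interval: 15s
    static_configs:
      - targets: ['api:8000']       # Compose service name and port are assumptions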

Security

Production Deployment

For production use:

  1. Change default credentials:

     environment:
       - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}

  2. Enable HTTPS with reverse proxy
  3. Restrict network access to monitoring ports
  4. Configure authentication (LDAP, OAuth, etc.)

Maintenance

Backup

Important data to backup:

  • Grafana dashboards: grafana_data:/var/lib/grafana
  • Prometheus data: prometheus_data:/prometheus
  • Configuration files: monitoring/

Updates

Update to latest versions:

docker-compose pull
docker-compose up -d

Resources