Monitoring Stack
Monitoring Stack
A comprehensive observability solution that provides real-time monitoring, metrics collection, and alerting for all Velesio AI Server components.
Overview
Location: monitoring/
Technology: Grafana + Prometheus + Exporters
Container: monitoring-stack
Ports: 3000 (Grafana), 9090 (Prometheus)
The monitoring stack is a standalone, optional component that can be deployed:
- Alongside the main application for integrated monitoring
- Separately for external monitoring of multiple Velesio deployments
- Independently for development and testing environments
Deployment Flexibility
🔗 Integrated Deployment
Deploy with the main application for seamless monitoring:
1
2
3
4
5
6
# Start main application
docker-compose up -d
# Start monitoring stack
cd monitoring
docker-compose up -d
🔲 Standalone Deployment
Deploy monitoring independently on a dedicated monitoring server:
1
2
3
4
5
6
# On monitoring server
git clone https://github.com/Velesio/Velesio-aiserver.git
cd Velesio-aiserver/monitoring
# Configure remote targets in prometheus.yml
docker-compose up -d
🎯 Selective Monitoring
Monitor only specific components by configuring Prometheus targets:
1
2
3
4
# Monitor only API service
- job_name: 'Velesio-api'
static_configs:
- targets: ['remote-host:8000']
Overview
The monitoring stack provides real-time insights into:
- System Performance: CPU, memory, disk, and network utilization
- GPU Metrics: NVIDIA GPU utilization, memory, temperature, and power consumption
- Redis Performance: Queue depth, memory usage, connections, and command statistics
- Application Logs: Centralized log aggregation and analysis
- Container Metrics: Docker container resource usage and health
Architecture
1
2
3
4
5
6
7
8
9
10
11
12
13
14
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Exporters │────│ Prometheus │────│ Grafana │
│ (Metrics) │ │ (Storage) │ │ (Dashboard) │
└─────────────┘ └─────────────┘ └─────────────┘
│ │
│ ┌─────────────┐ │
└───────────│ Loki │────────────┘
│(Log Storage)│
└─────────────┘
│
┌─────────────┐
│ Promtail │
│(Log Collect)│
└─────────────┘
Components
Core Services
Service | Port | Purpose |
---|---|---|
Grafana | 3000 | Visualization dashboards and alerts |
Prometheus | 9090 | Metrics collection and time-series storage |
Loki | 3100 | Log aggregation and storage |
Promtail | - | Log collection agent |
Exporters
Exporter | Port | Metrics |
---|---|---|
Node Exporter | 9100 | System metrics (CPU, memory, disk, network) |
Redis Exporter | 9121 | Redis performance and queue metrics |
NVIDIA GPU Exporter | 9835 | GPU utilization, memory, temperature, power |
Pre-configured Dashboards
The monitoring stack includes four auto-provisioned dashboards:
🖥️ Node Exporter Full
- CPU utilization and load averages
- Memory usage and swap statistics
- Disk I/O and filesystem usage
- Network traffic and interface statistics
- System uptime and process counts
📊 Redis Dashboard
- Memory usage and keyspace statistics
- Command execution rates and latency
- Client connections and blocked clients
- Replication status and lag
- Queue depth monitoring for job processing
🎮 NVIDIA GPU Metrics
- GPU utilization percentage
- Memory usage and allocation
- Temperature monitoring
- Power consumption and limits
- Fan speed and clock frequencies
📝 Velesio Logs
- Centralized log viewing from all services
- Log level filtering (INFO, WARNING, ERROR)
- Search and filtering capabilities
- Real-time log streaming
Quick Start
Prerequisites
- Docker and Docker Compose installed
- NVIDIA drivers (for GPU monitoring)
- Running Velesio AI Server instance
Setup
- Configure Redis connection (if using external Redis):
1 2 3
cd monitoring cp .env.example .env # Edit .env with your Redis connection details
- Start monitoring stack:
1 2
cd monitoring docker-compose up -d
- Access Grafana:
- URL: http://localhost:3000
- Username:
admin
- Password:
admin
For GPU Monitoring
Enable GPU monitoring by starting with the GPU profile:
1
docker-compose --profile gpu up -d
Configuration
Environment Variables
1
2
3
# Redis connection (optional, defaults to localhost)
REDIS_HOST=localhost:6379
REDIS_PASS=your_redis_password
Dashboard Customization
Dashboards are automatically loaded from grafana/dashboards/
. To add custom dashboards:
- Export dashboard JSON from Grafana
- Place JSON file in
monitoring/grafana/dashboards/
- Restart Grafana:
docker-compose restart grafana
Retention Settings
Data retention can be configured in prometheus.yml
:
1
2
command:
- '--storage.tsdb.retention.time=200h' # Adjust as needed
Usage
Monitoring Velesio AI Server
- API Performance: Monitor request rates and response times
- Queue Health: Check Redis queue depth and processing rates
- GPU Utilization: Track inference workload and memory usage
- System Resources: Ensure adequate CPU, memory, and disk space
Alert Setup
Configure alerts in Grafana for:
- High GPU memory usage (>90%)
- Redis queue backlog (>1000 jobs)
- High system load (>80%)
- Disk space usage (>85%)
Log Analysis
Use the Velesio Logs dashboard to:
- Debug API request failures
- Monitor worker job processing
- Track model loading times
- Investigate performance issues
Troubleshooting
Common Issues
Grafana not loading dashboards:
1
2
3
4
# Check dashboard provisioning
docker logs grafana
# Restart with clean data
docker-compose down -v && docker-compose up -d
GPU metrics not appearing:
1
2
3
4
# Verify NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# Check exporter logs
docker logs nvidia-gpu-exporter
Redis connection failed:
1
2
3
4
# Verify Redis accessibility
docker exec redis-exporter redis-cli -h $REDIS_HOST ping
# Check network connectivity
docker-compose logs redis-exporter
Performance Tuning
For high-volume environments:
- Increase scrape intervals in
prometheus.yml
- Adjust retention periods based on storage capacity
- Configure log rotation for Loki
- Set resource limits in Docker Compose
Integration
With Main Application
The monitoring stack is designed to work alongside the main Velesio AI Server:
1
2
3
4
5
6
# Start main application
docker-compose up -d
# Start monitoring in separate terminal
cd monitoring
docker-compose up -d
Custom Metrics
Add application-specific metrics by:
- Exposing metrics endpoint in your service
- Adding scrape config to
prometheus.yml
- Creating custom Grafana dashboard
Security
Production Deployment
For production use:
- Change default credentials:
```yaml
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD} ```
- Enable HTTPS with reverse proxy
- Restrict network access to monitoring ports
- Configure authentication (LDAP, OAuth, etc.)
Maintenance
Backup
Important data to backup:
- Grafana dashboards:
grafana_data:/var/lib/grafana
- Prometheus data:
prometheus_data:/prometheus
- Configuration files:
monitoring/
Updates
Update to latest versions:
1
2
docker-compose pull
docker-compose up -d