RAM Monitoring
Monitor RAM PostgreSQL clusters with Prometheus metrics, Grafana dashboards, health check endpoints, and alerting rules.
Prometheus Integration
Metrics Endpoint
# RAM metrics endpoint
curl http://localhost:9090/metrics
# Example metrics output
# HELP ram_cluster_nodes_total Total number of nodes in cluster
# TYPE ram_cluster_nodes_total gauge
ram_cluster_nodes_total{cluster="production"} 3
# HELP ram_cluster_leader_term Current leader term
# TYPE ram_cluster_leader_term gauge
ram_cluster_leader_term{cluster="production"} 42
# HELP ram_failover_events_total Total number of failover events
# TYPE ram_failover_events_total counter
ram_failover_events_total{cluster="production"} 5
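The output above is in the Prometheus text exposition format, so it can be read by a script as well as by Prometheus. The following is a minimal sketch of a parser, assuming only simple samples like the ones shown (label values without escaped quotes). The helper name `parse_metrics` and the `SAMPLE_TEXT` constant are illustrative and not part of RAM.

```python
import re

# Matches sample lines such as: ram_cluster_nodes_total{cluster="production"} 3
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical input copied from the example output above.
SAMPLE_TEXT = """\
# HELP ram_cluster_nodes_total Total number of nodes in cluster
# TYPE ram_cluster_nodes_total gauge
ram_cluster_nodes_total{cluster="production"} 3
ram_failover_events_total{cluster="production"} 5
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

For production use, an established client library is a safer choice than a hand-written parser; the sketch only shows how the lines are structured.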
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'ram-cluster'
    static_configs:
      - targets: ['node1:9090', 'node2:9090', 'node3:9090']
    metrics_path: '/metrics'
    scrape_interval: 5s
  - job_name: 'postgresql'
    static_configs:
      - targets: ['node1:9187', 'node2:9187', 'node3:9187']
Key Metrics
Cluster Metrics
Metric | Type | Description |
---|---|---|
ram_cluster_nodes_total | Gauge | Total number of nodes in cluster |
ram_cluster_leader_term | Gauge | Current leader term |
ram_cluster_commit_index | Gauge | Current commit index |
ram_failover_events_total | Counter | Total number of failover events |
ram_election_duration_seconds | Histogram | Leader election duration |
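A Prometheus histogram such as ram_election_duration_seconds is exposed as cumulative buckets, and percentiles are estimated from those buckets rather than read directly. The sketch below illustrates the linear interpolation that Prometheus documents for `histogram_quantile()`, in simplified form. The bucket boundaries and counts are made-up sample data, not values taken from RAM.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    The lowest bucket is assumed to start at 0.
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Target lies in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical election durations: 10 elections, bucket bounds in seconds.
EXAMPLE_BUCKETS = [(0.1, 2), (0.5, 6), (1.0, 9), (math.inf, 10)]

print("p50 estimate:", histogram_quantile(0.5, EXAMPLE_BUCKETS))
print("p95 estimate:", histogram_quantile(0.95, EXAMPLE_BUCKETS))
```

With this sample data the median falls in the 0.1 to 0.5 second bucket, while the 95th percentile falls in the +Inf bucket and is therefore capped at the highest finite bound. The accuracy of any estimate depends on how the bucket boundaries are chosen.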
Node Metrics
Metric | Type | Description |
---|---|---|
ram_node_state | Gauge | Node state (0=follower, 1=candidate, 2=leader) |
ram_node_log_size | Gauge | Size of node's log |
ram_node_uptime_seconds | Counter | Node uptime in seconds |
ram_node_requests_total | Counter | Total requests processed by node |
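Because ram_node_state encodes the role as a number, a script that consumes it usually needs to map the value back to a name and count the leaders. The following is a minimal sketch of that check. The state codes come from the table above; the function name `summarize_cluster` and the status strings are hypothetical.

```python
from collections import Counter

# State codes as listed for ram_node_state in the table above.
NODE_STATES = {0: "follower", 1: "candidate", 2: "leader"}


def summarize_cluster(node_states):
    """Classify a cluster from a mapping of node name -> ram_node_state value."""
    counts = Counter(
        NODE_STATES.get(int(value), "unknown") for value in node_states.values()
    )
    leaders = counts["leader"]
    if leaders == 1:
        status = "ok"
    elif leaders == 0:
        status = "no-leader"
    else:
        # More than one node reporting itself as leader at the same time.
        status = "possible-split-brain"
    return status, dict(counts)


# Hypothetical readings for a three-node cluster.
print(summarize_cluster({"node1": 2, "node2": 0, "node3": 0}))
print(summarize_cluster({"node1": 2, "node2": 2, "node3": 0}))
```

Two nodes briefly reporting the leader state can also occur during a normal leadership change, so a real alert would likely require the condition to persist for some time.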
Grafana Dashboards
Cluster Overview Dashboard
Key Panels
- Cluster health status
- Current leader and term
- Node count and status
- Failover events timeline
- Request rate and latency
Alerts
- Cluster split-brain detection
- High failover frequency
- Node unreachable
- High replication lag
Performance Dashboard
Metrics
- Election duration histogram
- Log replication latency
- Commit index progression
- Network throughput
Trends
- Request rate over time
- Error rate trends
- Resource utilization
- Performance degradation
Health Monitoring
Health Check Endpoints
# Cluster health check
curl http://localhost:8080/api/v1/health
# Node health check
curl http://localhost:8080/api/v1/health/node
# PostgreSQL health check
curl http://localhost:8080/api/v1/health/postgres
# Comprehensive health status
curl http://localhost:8080/api/v1/health/full
Health Check Configuration
[monitoring]
health_check_enabled = true
health_check_interval = 5s
health_check_timeout = 3s
health_check_retries = 3
# Health check endpoints
health_check_endpoints = [
    "http://localhost:8080/api/v1/health",
    "http://localhost:8080/api/v1/health/postgres"
]
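The timeout and retry settings above can be pictured as a probe loop. The sketch below shows one way an external script might apply the same values to the listed endpoints. It is not RAM's implementation, and it rests on two assumptions: an HTTP 200 response means "healthy", and `health_check_retries` is the total number of attempts. The function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP error.
        return False


def check_with_retries(url, retries=3, timeout=3.0, probe=http_ok, pause=0.0):
    """Probe an endpoint up to `retries` times; healthy if any attempt succeeds."""
    for attempt in range(retries):
        if probe(url, timeout):
            return True
        if attempt < retries - 1:
            time.sleep(pause)  # optional delay between attempts
    return False


# Endpoints copied from the configuration above.
ENDPOINTS = [
    "http://localhost:8080/api/v1/health",
    "http://localhost:8080/api/v1/health/postgres",
]
```

Calling `check_with_retries(url)` for each entry in `ENDPOINTS` every five seconds would mirror the interval, timeout, and retry values in the configuration. The `probe` parameter exists so the loop can be exercised without a running cluster.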
Alerting Rules
Prometheus Alert Rules
groups:
  - name: ram-cluster
    rules:
      - alert: RAMClusterDown
        expr: up{job="ram-cluster"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RAM cluster node is down"
      - alert: RAMHighFailoverRate
        expr: rate(ram_failover_events_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High failover rate detected"
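PromQL's `rate()` returns a per-second average, so the 0.1 threshold in RAMHighFailoverRate is in failovers per second, not failovers per window. The arithmetic below is a rough sketch of what that means over the 5-minute range. It ignores counter resets and the extrapolation Prometheus applies at window edges, and the helper name is illustrative.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Average per-second increase of a counter across a window (no reset handling)."""
    return (last_value - first_value) / window_seconds


WINDOW_SECONDS = 5 * 60   # the [5m] range in the rule
THRESHOLD = 0.1           # failovers per second, from the rule expression

# Number of failovers inside one window needed to reach the threshold.
events_to_reach_threshold = THRESHOLD * WINDOW_SECONDS
print("failovers per window at threshold:", round(events_to_reach_threshold))

# A single failover in five minutes stays far below the threshold.
single_failover_rate = per_second_rate(5, 6, WINDOW_SECONDS)
print("rate after one failover:", single_failover_rate)
print("would fire:", single_failover_rate > THRESHOLD)
```

Under these assumptions the rule fires only after roughly 30 failovers within five minutes. If the intent is to warn on a handful of failovers, a lower threshold or an expression based on `increase()` over the same range may be closer to what is wanted.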
Common Alerts
Critical Alerts
- Cluster node down
- No leader elected
- Split-brain detected
- PostgreSQL connection lost
Warning Alerts
- High failover rate
- Replication lag high
- High election duration
- Node resource usage high
Monitoring Commands
Real-time Monitoring
# Live cluster monitoring
ramctrl monitor --cluster production-cluster
# Monitor with auto-refresh
ramctrl monitor --cluster production-cluster --refresh 5s
# Print cluster metrics in Prometheus format
ramctrl metrics --cluster production-cluster --format prometheus
# Health check all nodes
ramctrl health --cluster production-cluster --all
Historical Analysis
# Export metrics for analysis
ramctrl metrics --cluster production-cluster --output metrics.json
# Historical failover events
ramctrl failover history --cluster production-cluster
# Performance trends over the last hour
ramctrl metrics --cluster production-cluster --since 1h