RAM Monitoring

Comprehensive monitoring solution for RAM PostgreSQL clustering with Prometheus metrics and Grafana dashboards.

Prometheus Integration

Metrics Endpoint

# RAM metrics endpoint
curl http://localhost:9090/metrics

# Example metrics output
# HELP ram_cluster_nodes_total Total number of nodes in cluster
# TYPE ram_cluster_nodes_total gauge
ram_cluster_nodes_total{cluster="production"} 3

# HELP ram_cluster_leader_term Current leader term
# TYPE ram_cluster_leader_term gauge
ram_cluster_leader_term{cluster="production"} 42

# HELP ram_failover_events_total Total number of failover events
# TYPE ram_failover_events_total counter
ram_failover_events_total{cluster="production"} 5

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'ram-cluster'
    static_configs:
      - targets: ['node1:9090', 'node2:9090', 'node3:9090']
    metrics_path: '/metrics'
    scrape_interval: 5s

  - job_name: 'postgresql'
    static_configs:
      - targets: ['node1:9187', 'node2:9187', 'node3:9187']

Key Metrics

Cluster Metrics

Metric	Type	Description
ram_cluster_nodes_total	Gauge	Total number of nodes in cluster
ram_cluster_leader_term	Gauge	Current leader term
ram_cluster_commit_index	Gauge	Current commit index
ram_failover_events_total	Counter	Total number of failover events
ram_election_duration_seconds	Histogram	Leader election duration

Node Metrics

Metric	Type	Description
ram_node_state	Gauge	Node state (0=follower, 1=candidate, 2=leader)
ram_node_log_size	Gauge	Size of node's log
ram_node_uptime_seconds	Counter	Node uptime in seconds
ram_node_requests_total	Counter	Total requests processed by node

Grafana Dashboards

Cluster Overview Dashboard

Key Panels

• Cluster health status
• Current leader and term
• Node count and status
• Failover events timeline
• Request rate and latency

Alerts

• Cluster split-brain detection
• High failover frequency
• Node unreachable
• High replication lag

Performance Dashboard

Metrics

• Election duration histogram
• Log replication latency
• Commit index progression
• Network throughput

Trends

• Request rate over time
• Error rate trends
• Resource utilization
• Performance degradation

Health Monitoring

Health Check Endpoints

# Cluster health check
curl http://localhost:8080/api/v1/health

# Node health check
curl http://localhost:8080/api/v1/health/node

# PostgreSQL health check
curl http://localhost:8080/api/v1/health/postgres

# Comprehensive health status
curl http://localhost:8080/api/v1/health/full

Health Check Configuration

[monitoring]
health_check_enabled = true
health_check_interval = 5s
health_check_timeout = 3s
health_check_retries = 3

# Health check endpoints
health_check_endpoints = [
  "http://localhost:8080/api/v1/health",
  "http://localhost:8080/api/v1/health/postgres"
]

Alerting Rules

Prometheus Alert Rules

groups:
- name: ram-cluster
  rules:
  - alert: RAMClusterDown
    expr: up{job="ram-cluster"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "RAM cluster node is down"

  - alert: RAMHighFailoverRate
    expr: rate(ram_failover_events_total[5m]) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High failover rate detected"

Common Alerts

Critical Alerts

• Cluster node down
• No leader elected
• Split-brain detected
• PostgreSQL connection lost

Warning Alerts

• High failover rate
• Replication lag high
• High election duration
• Node resource usage high

Monitoring Commands

Real-time Monitoring

# Live cluster monitoring
ramctrl monitor --cluster production-cluster

# Monitor with auto-refresh
ramctrl monitor --cluster production-cluster --refresh 5s

# Monitor specific metrics
ramctrl metrics --cluster production-cluster --format prometheus

# Health check all nodes
ramctrl health --cluster production-cluster --all

Historical Analysis

# Export metrics for analysis
ramctrl metrics --cluster production-cluster --output metrics.json

# Historical failover events
ramctrl failover history --cluster production-cluster

# Performance trends
ramctrl metrics --cluster production-cluster --since 1h