RAM Monitoring

RAM provides monitoring for RAM-managed PostgreSQL clusters through a Prometheus metrics endpoint, Grafana dashboards, health check endpoints, and the ramctrl command-line tool.

Prometheus Integration

Metrics Endpoint

# RAM metrics endpoint
curl http://localhost:9090/metrics

# Example metrics output
# HELP ram_cluster_nodes_total Total number of nodes in cluster
# TYPE ram_cluster_nodes_total gauge
ram_cluster_nodes_total{cluster="production"} 3

# HELP ram_cluster_leader_term Current leader term
# TYPE ram_cluster_leader_term gauge
ram_cluster_leader_term{cluster="production"} 42

# HELP ram_failover_events_total Total number of failover events
# TYPE ram_failover_events_total counter
ram_failover_events_total{cluster="production"} 5
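
The output above is in the Prometheus text exposition format, so it can also be read by a script for ad-hoc checks. The following is a minimal sketch, not part of RAM: the function name parse_metrics is hypothetical, and the parser handles only simple sample lines (name, optional labels, value, optional timestamp), which is enough for the examples shown here.

```python
import re

# Matches sample lines such as: ram_cluster_nodes_total{cluster="production"} 3
# An optional trailing integer timestamp is accepted and ignored.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return {(name, sorted label tuple): float value} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples[(name, tuple(sorted(labels.items())))] = float(value)
    return samples


if __name__ == "__main__":
    example = (
        "# HELP ram_cluster_nodes_total Total number of nodes in cluster\n"
        "# TYPE ram_cluster_nodes_total gauge\n"
        'ram_cluster_nodes_total{cluster="production"} 3\n'
    )
    for (name, labels), value in parse_metrics(example).items():
        print(name, dict(labels), value)
```

For anything beyond quick checks, a maintained parser such as the one in the official Prometheus Python client is the safer choice.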

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'ram-cluster'
    static_configs:
      - targets: ['node1:9090', 'node2:9090', 'node3:9090']
    metrics_path: '/metrics'
    scrape_interval: 5s

  - job_name: 'postgresql'
    static_configs:
      - targets: ['node1:9187', 'node2:9187', 'node3:9187']

Key Metrics

Cluster Metrics

Metric                          Type        Description
ram_cluster_nodes_total         Gauge       Total number of nodes in cluster
ram_cluster_leader_term         Gauge       Current leader term
ram_cluster_commit_index        Gauge       Current commit index
ram_failover_events_total       Counter     Total number of failover events
ram_election_duration_seconds   Histogram   Leader election duration
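
A Prometheus histogram such as ram_election_duration_seconds is exposed as cumulative bucket counters, and quantiles are estimated from those buckets by linear interpolation. The sketch below illustrates that estimate in simplified form; it follows the general approach of PromQL's histogram_quantile() but is not a faithful reimplementation, and the bucket boundaries in the example are hypothetical because the source does not list RAM's actual buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite bound instead of infinity.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical buckets: 2 elections <= 0.1s, 8 <= 0.5s, 10 <= 1.0s.
    example = [(0.1, 2), (0.5, 8), (1.0, 10), (math.inf, 10)]
    print("median election duration (s):", histogram_quantile(0.5, example))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket typical election durations.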

Node Metrics

Metric                    Type      Description
ram_node_state            Gauge     Node state (0=follower, 1=candidate, 2=leader)
ram_node_log_size         Gauge     Size of node's log
ram_node_uptime_seconds   Counter   Node uptime in seconds
ram_node_requests_total   Counter   Total requests processed by node
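
Because ram_node_state encodes the node role as a number, a script can decode it and count leaders across nodes. The sketch below assumes the state values have already been collected per node; the function name summarize_states and the status strings are hypothetical. More than one reported leader may indicate split-brain, but it can also be a brief artifact of nodes being scraped at slightly different times, so treat the result as a hint rather than a verdict.

```python
# State encoding taken from the ram_node_state description in the table above.
NODE_STATES = {0: "follower", 1: "candidate", 2: "leader"}


def state_name(value):
    """Translate a ram_node_state value into a readable role name."""
    return NODE_STATES.get(int(value), "unknown")


def summarize_states(states_by_node):
    """Return (status, leaders) for a mapping of node name -> ram_node_state.

    status is "ok" with exactly one leader, "no-leader" with none,
    and "split-brain" with more than one.
    """
    leaders = sorted(
        node for node, value in states_by_node.items()
        if state_name(value) == "leader"
    )
    if len(leaders) == 1:
        status = "ok"
    elif not leaders:
        status = "no-leader"
    else:
        status = "split-brain"
    return status, leaders


if __name__ == "__main__":
    print(summarize_states({"node1": 2, "node2": 0, "node3": 0}))
```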

Grafana Dashboards

Cluster Overview Dashboard

Key Panels

  • Cluster health status
  • Current leader and term
  • Node count and status
  • Failover events timeline
  • Request rate and latency

Alerts

  • Cluster split-brain detection
  • High failover frequency
  • Node unreachable
  • High replication lag

Performance Dashboard

Metrics

  • Election duration histogram
  • Log replication latency
  • Commit index progression
  • Network throughput

Trends

  • Request rate over time
  • Error rate trends
  • Resource utilization
  • Performance degradation

Health Monitoring

Health Check Endpoints

# Cluster health check
curl http://localhost:8080/api/v1/health

# Node health check
curl http://localhost:8080/api/v1/health/node

# PostgreSQL health check
curl http://localhost:8080/api/v1/health/postgres

# Comprehensive health status
curl http://localhost:8080/api/v1/health/full

Health Check Configuration

[monitoring]
health_check_enabled = true
health_check_interval = "5s"
health_check_timeout = "3s"
health_check_retries = 3

# Health check endpoints
health_check_endpoints = [
  "http://localhost:8080/api/v1/health",
  "http://localhost:8080/api/v1/health/postgres"
]
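
The interaction of interval, timeout, and retries can be illustrated with a small probe loop. This is a sketch under stated assumptions, not RAM's implementation: it assumes health_check_retries is the total number of attempts, that health_check_interval is the pause between attempts, and that any 2xx response means healthy. The names http_probe and check_endpoint are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def check_endpoint(url, timeout=3.0, retries=3, interval=5.0,
                   probe=http_probe, sleep=time.sleep):
    """Probe url up to `retries` times, waiting `interval` seconds between tries.

    Returns True on the first successful probe, False if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        if probe(url, timeout):
            return True
        if attempt < retries:
            sleep(interval)  # no pause after the final failed attempt
    return False


if __name__ == "__main__":
    endpoints = [
        "http://localhost:8080/api/v1/health",
        "http://localhost:8080/api/v1/health/postgres",
    ]
    for endpoint in endpoints:
        print(endpoint, "healthy" if check_endpoint(endpoint) else "unhealthy")
```

With the values shown in the configuration, an endpoint that never responds would be declared unhealthy after at most three 3-second timeouts plus two 5-second pauses, roughly 19 seconds under these assumptions.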

Alerting Rules

Prometheus Alert Rules

groups:
  - name: ram-cluster
    rules:
      - alert: RAMClusterDown
        expr: up{job="ram-cluster"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RAM cluster node is down"

      - alert: RAMHighFailoverRate
        expr: rate(ram_failover_events_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High failover rate detected"
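
When tuning the RAMHighFailoverRate threshold, keep in mind that PromQL's rate() returns a per-second average, so a value above 0.1 over a 5-minute window corresponds to roughly 30 or more failovers in that window. The sketch below shows the arithmetic with a simplified rate calculation; it handles counter resets but omits the extrapolation that Prometheus applies at the window edges, and the function name simple_rate is hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the sampled window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so the process restarted; count from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    one_failover = [(0, 5), (300, 6)]       # 1 failover in 5 minutes
    many_failovers = [(0, 5), (300, 36)]    # 31 failovers in 5 minutes
    print(simple_rate(one_failover) > 0.1)   # False: about 0.0033 per second
    print(simple_rate(many_failovers) > 0.1)  # True: about 0.103 per second
```

If the intent is to alert on a small absolute number of failovers, an expression based on increase() over the same window may be easier to reason about than a per-second rate.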

Common Alerts

Critical Alerts

  • Cluster node down
  • No leader elected
  • Split-brain detected
  • PostgreSQL connection lost

Warning Alerts

  • High failover rate
  • Replication lag high
  • High election duration
  • Node resource usage high

Monitoring Commands

Real-time Monitoring

# Live cluster monitoring
ramctrl monitor --cluster production-cluster

# Monitor with auto-refresh
ramctrl monitor --cluster production-cluster --refresh 5s

# Monitor specific metrics
ramctrl metrics --cluster production-cluster --format prometheus

# Health check all nodes
ramctrl health --cluster production-cluster --all

Historical Analysis

# Export metrics for analysis
ramctrl metrics --cluster production-cluster --output metrics.json

# Historical failover events
ramctrl failover history --cluster production-cluster

# Performance trends
ramctrl metrics --cluster production-cluster --since 1h