pgBalancer Documentation

Monitoring & Metrics

Configure Prometheus Scraping

pgBalancer exposes Prometheus metrics on the /metrics endpoint:

Prometheus Configuration

# prometheus.yml configuration
global:
  scrape_interval: 15s      # Scrape every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds

scrape_configs:
  # pgBalancer metrics
  - job_name: 'pgbalancer'
    static_configs:
      - targets:
          - 'pgbalancer1.internal:8080'
          - 'pgbalancer2.internal:8080'
          - 'pgbalancer3.internal:8080'
    metrics_path: '/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s

# Load alert rules
rule_files:
  - 'pgbalancer-alerts.yml'
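
Before pointing Prometheus at this file, you can check it with promtool (shipped with Prometheus); adjust the path to wherever your configuration lives:

# Validate prometheus.yml and the referenced rule files
promtool check config prometheus.yml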

Verify Metrics Endpoint

# Test metrics endpoint
curl -s http://localhost:8080/metrics

# Reload Prometheus configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
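
A healthy endpoint returns Prometheus exposition text. An abbreviated example of what to expect (hostnames and exact HELP text are placeholders; real output will differ):

# HELP pgbalancer_up Server status (1=up)
# TYPE pgbalancer_up gauge
pgbalancer_up 1
# HELP pgbalancer_backend_up Backend status by node_id
# TYPE pgbalancer_backend_up gauge
pgbalancer_backend_up{node_id="0",hostname="pg1.internal",port="5432"} 1
pgbalancer_backend_up{node_id="1",hostname="pg2.internal",port="5432"} 1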

Monitor Key Metrics

Track critical pgBalancer metrics:

Backend Health Metrics

# Number of backends currently up
sum(pgbalancer_backend_up)

# Backend uptime percentage (last 24 hours)
avg_over_time(pgbalancer_backend_up[24h]) * 100

# Backends currently down (for alerting)
pgbalancer_backend_up == 0
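
Combining the two gauges gives the share of configured backends that are currently healthy (an illustrative query; 1.0 means all backends are up):

# Fraction of backends reporting up
sum(pgbalancer_backend_up) / count(pgbalancer_backend_up)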

Load Distribution Metrics

# Queries per second by backend
rate(pgbalancer_backend_queries_total[5m])

# Total cluster queries per second
sum(rate(pgbalancer_backend_queries_total[5m]))
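
To spot uneven load distribution, compare the busiest and least busy backends; this query is illustrative rather than taken from the upstream docs:

# QPS gap between the busiest and least busy backend
max(rate(pgbalancer_backend_queries_total[5m])) - min(rate(pgbalancer_backend_queries_total[5m]))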

Configure Alert Rules

Set up critical alerts for pgBalancer monitoring:

pgbalancer-alerts.yml

groups:
  - name: pgbalancer_alerts
    interval: 30s
    rules:
      # Critical: pgBalancer server down
      - alert: PgbalancerDown
        expr: pgbalancer_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "pgBalancer server is down"
          description: "pgBalancer instance {{ $labels.instance }} is down"

      # Warning: Backend node down
      - alert: PgbalancerBackendDown
        expr: pgbalancer_backend_up == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Backend node {{ $labels.node_id }} is down"
          description: "Backend {{ $labels.hostname }}:{{ $labels.port }} (node {{ $labels.node_id }}) has been down for 2 minutes"

Grafana Dashboard

Import the pre-built pgBalancer Grafana dashboard:

Import Dashboard

# Download dashboard JSON
wget https://raw.githubusercontent.com/pgElephant/pgbalancer/main/monitoring/grafana/pgbalancer-dashboard.json

# Import via Grafana UI
# 1. Go to: http://localhost:3000/dashboard/import
# 2. Upload pgbalancer-dashboard.json
# 3. Select Prometheus data source
# 4. Click "Import"
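
If you provision Grafana from files rather than the UI, a dashboard provider entry along these lines should pick up the downloaded JSON (both paths are assumptions for your environment):

# /etc/grafana/provisioning/dashboards/pgbalancer.yml (path is an assumption)
apiVersion: 1
providers:
  - name: 'pgbalancer'
    type: file
    options:
      path: /var/lib/grafana/dashboards   # copy pgbalancer-dashboard.json here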

Alertmanager Notifications

Configure alert notifications via Slack, email, or PagerDuty:

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  receiver: 'pgbalancer-alerts'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
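
The route above references a receiver that must also be defined in the same file; a minimal Slack receiver could look like this (the channel name is a placeholder):

receivers:
  - name: 'pgbalancer-alerts'
    slack_configs:
      - channel: '#database-alerts'   # placeholder channel
        send_resolved: true

You can then check the file for syntax errors with amtool (adjust the path to your installation):

# Validate the Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml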

Monitoring Best Practices

✓ DO

  • Scrape metrics every 15-30 seconds (balance freshness against load)
  • Retain metrics for 30+ days for trend analysis
  • Use recording rules for expensive dashboard queries (see the sketch after this list)
  • Configure Alertmanager for critical alerts
  • Test failover and alert pipelines regularly
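
A minimal recording-rule sketch for the cluster-wide QPS expression shown earlier; the rule and file names are assumptions, and the file must be listed under rule_files in prometheus.yml:

# pgbalancer-recording.yml (file name is an assumption)
groups:
  - name: pgbalancer_recording
    interval: 30s
    rules:
      - record: pgbalancer:backend_queries:rate5m
        expr: sum(rate(pgbalancer_backend_queries_total[5m]))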

✗ DON'T

  • Don't scrape more often than every 10 seconds (it adds unnecessary load)
  • Don't set the alert 'for:' duration too low (it causes flapping)
  • Don't create high-cardinality metrics (e.g. per-connection labels)
  • Don't ignore warning alerts for extended periods
  • Don't rely on metrics alone; monitor logs too

Metrics Reference

Metric                                 Type      Description
pgbalancer_up                          Gauge     Server status (1=up)
pgbalancer_backend_up                  Gauge     Backend status by node_id
pgbalancer_backend_queries_total       Counter   Total queries per backend
pgbalancer_pool_utilization_percent    Gauge     Pool utilization (0-100)
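
To see which of these metrics your running instance actually exposes:

# List pgBalancer metrics from the local instance
curl -s http://localhost:8080/metrics | grep '^pgbalancer_'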