
Restore pgRaft Cluster Health

Fast Triage Checklist

Before making changes, follow these steps:

  • Run the cluster health SQL below on the leader
  • Confirm only one node reports state = leader in pgraft_cluster_state
  • Capture journalctl -u postgresql output from the leader and any lagging followers
  • Back up pgraft.data_dir contents prior to aggressive snapshot cleanup

Perform remediation in a staging or maintenance window whenever possible. Revert temporary settings after the incident.

Cluster Health Baseline

Collect cluster state, replication backlog, and heartbeat metrics before applying fixes.

Gather health snapshot

Baseline diagnostics

-- Verify cluster health from any node
SELECT * FROM pgraft_get_cluster_status();
SELECT * FROM pgraft_log_get_replication_status();
SELECT * FROM pgraft_get_nodes();

Look for exactly one leader, low lag_entries, and matching current_term across nodes.
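
As a quick single-pass check, the query below counts leaders and distinct terms; it assumes pgraft_get_cluster_status() reports state and current_term for every node, so adjust column names to your pgraft version.

-- Sanity check: expect leader_count = 1 and distinct_terms = 1 on a healthy cluster
SELECT count(*) FILTER (WHERE state = 'leader') AS leader_count,
       count(DISTINCT current_term)             AS distinct_terms
  FROM pgraft_get_cluster_status();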

Interpretation cues

  • If lag_entries exceeds 100, start recovery on the slow follower.
  • Quorum must equal the expected cluster size; fewer nodes indicate connectivity or identity drift (see the membership check after this list).
  • Large gaps between messages_processed on leader vs followers highlight stalled workers.
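
A minimal membership check against pgraft_get_nodes() is sketched below; the expected size of 3 is only an example, substitute your own cluster size.

-- Quorum membership check (3-node cluster assumed)
SELECT count(*) AS reported_nodes,
       count(*) = 3 AS matches_expected_size
  FROM pgraft_get_nodes();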

Connectivity & Identity

Resolve node identity mismatches and network blocks that prevent Raft replication.

Validate node identity

Catalog review

SELECT node_id,
       cluster_id,
       address,
       port,
       data_dir
  FROM pgraft_nodes_catalog
 ORDER BY node_id;

Network reachability

Connectivity commands

# Verify Raft ports
nc -vz 10.0.0.12 7002

# Confirm pg_hba allows replication connections
psql -c "SELECT * FROM pg_hba_file_rules WHERE user_name = 'pgraft_cluster';"

Remediation steps

  • Ensure pgraft.cluster_id matches on every node; mismatched IDs form separate quorums (the drift check after this list flags this).
  • Assign unique pgraft.node_id values and restart nodes after updates.
  • Reload PostgreSQL when altering pg_hba.conf or replication credentials.
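
The catalog reviewed above can also flag identity drift directly; this sketch assumes pgraft_nodes_catalog holds one row per node.

-- Expect one distinct cluster_id and zero duplicate node_ids
SELECT count(DISTINCT cluster_id)         AS distinct_cluster_ids,
       count(*) - count(DISTINCT node_id) AS duplicate_node_ids
  FROM pgraft_nodes_catalog;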

Replication Lag & Follower Recovery

Bring slow followers back into quorum and alert on backlogs before they become critical.

Inspect lagging followers

Lag diagnostics

SELECT node_id,
       state,
       lag_entries,
       replication_lag_bytes,
       last_apply_lsn
  FROM pgraft_log_get_replication_status()
 ORDER BY lag_entries DESC;

Force follower resync

Follower catch-up

-- Run on lagging follower once connectivity is restored
SELECT pgraft_log_sync_with_leader();
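
Once the call returns, re-running the lag query verifies the catch-up; an empty result means every follower has drained its backlog.

-- Post-resync verification
SELECT node_id, lag_entries
  FROM pgraft_log_get_replication_status()
 WHERE lag_entries > 0;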

Alert on backlog

Lag alert script

#!/usr/bin/env bash
LAG=$(psql -At -c "SELECT COALESCE(MAX(lag_entries), 0) FROM pgraft_log_get_replication_status();")
if [[ "$LAG" -gt 1000 ]]; then
  echo "$(date --iso-8601=seconds) CRITICAL replication lag: $LAG entries" >> /var/log/pgraft-alerts.log
fi

Leadership Stability

Reduce election churn and heartbeat noise that introduce latency spikes.

Detect election drift

Election analysis

SELECT node_id,
       current_term,
       elections_triggered,
       elections_triggered::float / GREATEST(current_term, 1) AS elections_per_term
  FROM pgraft_get_cluster_status()
 WHERE elections_triggered::float / GREATEST(current_term, 1) > 2.0;

Tune timing parameters

Adjust timers

psql -c "SELECT pgraft_set_config('election_timeout', '1200ms');"
psql -c "SELECT pgraft_set_config('heartbeat_interval', '60ms');"
psql -c "SELECT pgraft_save_config();"

Stabilization tips

  • Increase election_timeout during heavy write bursts.
  • Ensure leaders have sufficient CPU headroom for heartbeats.
  • Temporarily disable aggressive failover automation during maintenance.

Snapshots & Storage

Clear snapshot backlogs and monitor disk usage for Raft metadata.

Check snapshot backlog

Snapshot metrics

SELECT total_entries,
       pending_snapshots,
       last_snapshot_term,
       last_snapshot_index
  FROM pgraft_log_get_stats();
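
The same stats function can drive a simple alert condition; the sketch below only flags queued snapshots and does not estimate log growth.

-- Snapshot backlog flag
SELECT pending_snapshots,
       pending_snapshots > 0 AS snapshot_backlog_present
  FROM pgraft_log_get_stats();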

Inspect snapshot directory

Disk usage

du -sh /var/lib/postgresql/pgraft/snapshots
ls -lh /var/lib/postgresql/pgraft/snapshots | tail -10

Remediation tips

  • Lower pgraft.snapshot_threshold to shorten log retention during churn (applied in the example after this list).
  • Move snapshot storage to SSD tiers for faster compaction.
  • Archive old snapshots after confirming recent copies exist elsewhere.
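
If snapshot_threshold can be changed through the same configuration helpers used for the election timers (an assumption; it may instead require a postgresql.conf edit and reload), a lower value can be applied as shown below, with 5000 as an arbitrary example.

-- Assumed parameter name and example value; verify against your pgraft release
SELECT pgraft_set_config('snapshot_threshold', '5000');
SELECT pgraft_save_config();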

KV Store Integrity

Validate replicated KV operations and watch for oversized payloads.

Roundtrip test

Health probe

DO $$
DECLARE
  k TEXT := 'troubleshoot_' || extract(epoch FROM now());
  v JSONB := jsonb_build_object('status', 'probe');
  roundtrip JSONB;
BEGIN
  PERFORM pgraft_kv_put(k, v);
  SELECT pgraft_kv_get(k) INTO roundtrip;
  IF roundtrip IS DISTINCT FROM v THEN
    RAISE EXCEPTION 'KV roundtrip failed: %', roundtrip;
  END IF;
  PERFORM pgraft_kv_delete(k);
END;
$$;

Detect skewed entries

Large value audit

SELECT key,
       pg_column_size(value) AS value_bytes,
       updated_at
  FROM pgraft.kv
 ORDER BY value_bytes DESC
 LIMIT 20;
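
To turn the audit into an alert, count payloads above a cutoff; the 64 kB threshold is an assumed example, tune it to your workload.

-- Oversized payload count (64 kB cutoff assumed)
SELECT count(*) AS oversized_values
  FROM pgraft.kv
 WHERE pg_column_size(value) > 64 * 1024;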

Support Bundle

Collect logs and catalog snapshots before contacting pgElephant support.

Generate support bundle

#!/usr/bin/env bash
DEST=/tmp/pgraft-support-$(date +%s)
mkdir -p "$DEST"

psql -f - <<SQL
\o ${DEST}/cluster_status.txt
SELECT * FROM pgraft_get_cluster_status();
SELECT * FROM pgraft_log_get_replication_status();
SELECT * FROM pgraft_get_nodes();
SELECT * FROM pgraft_log_get_stats();
SQL

cp /var/log/postgresql/postgresql-*-main.log "$DEST"/
tar -C /tmp -czf pgraft-support.tar.gz "$(basename "$DEST")"