Restore pgRaft Cluster Health
Fast Triage Checklist
Before making changes, follow these steps:
- Run the cluster health SQL below on the leader before making changes
- Confirm only one node reports `state = leader` in `pgraft_cluster_state`
- Capture `journalctl -u postgresql` output from leaders and lagging followers
- Back up `pgraft.data_dir` contents prior to aggressive snapshot cleanup (backup sketch below)
Perform remediation in a staging or maintenance window whenever possible. Revert temporary settings after the incident.
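If aggressive snapshot cleanup is on the table, take the `pgraft.data_dir` backup first. A minimal sketch follows; the data directory and backup paths are assumptions, so substitute the value your nodes actually report for `pgraft.data_dir`.

```bash
#!/usr/bin/env bash
# Back up pgraft Raft metadata before any snapshot cleanup.
set -euo pipefail

DATA_DIR=/var/lib/postgresql/pgraft            # assumed; use your pgraft.data_dir value
BACKUP=/var/backups/pgraft-$(date +%Y%m%dT%H%M%S).tar.gz

tar -czf "$BACKUP" -C "$(dirname "$DATA_DIR")" "$(basename "$DATA_DIR")"
echo "pgraft data dir archived to $BACKUP"
```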
Cluster Health Baseline
Collect cluster state, replication backlog, and heartbeat metrics before applying fixes.
Gather health snapshot
Baseline diagnostics
```sql
-- Verify cluster health from any node
SELECT * FROM pgraft_get_cluster_status();
SELECT * FROM pgraft_log_get_replication_status();
SELECT * FROM pgraft_get_nodes();
```

Look for exactly one leader, low `lag_entries`, and matching `current_term` across nodes.
Interpretation cues
- If `lag_entries` exceeds 100, start recovery on the slow follower.
- Quorum must equal the expected cluster size; fewer nodes indicate connectivity or identity drift (a scripted check follows this list).
- Large gaps between `messages_processed` on leader vs followers highlight stalled workers.
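A scripted version of these cues can run from cron or a shell. The sketch below assumes `pgraft_cluster_state` exposes the `state` column and `pgraft_get_cluster_status()` the per-node `current_term` used elsewhere in this guide; `EXPECTED_NODES` is a placeholder for your cluster size.

```bash
#!/usr/bin/env bash
# Scripted version of the cues above: exactly one leader, matching terms, full membership.
set -euo pipefail

EXPECTED_NODES=3   # assumed cluster size

LEADERS=$(psql -At -c "SELECT count(*) FROM pgraft_cluster_state WHERE state = 'leader';")
TERMS=$(psql -At -c "SELECT count(DISTINCT current_term) FROM pgraft_get_cluster_status();")
NODES=$(psql -At -c "SELECT count(*) FROM pgraft_get_nodes();")

[[ "$LEADERS" -eq 1 ]]             || echo "WARNING: expected 1 leader, found $LEADERS"
[[ "$TERMS" -eq 1 ]]               || echo "WARNING: nodes disagree on current_term"
[[ "$NODES" -eq $EXPECTED_NODES ]] || echo "WARNING: only $NODES of $EXPECTED_NODES nodes visible"
```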
Connectivity & Identity
Resolve node identity mismatches and network blocks that prevent Raft replication.
Validate node identity
Catalog review
```sql
SELECT node_id,
       cluster_id,
       address,
       port,
       data_dir
FROM pgraft_nodes_catalog
ORDER BY node_id;
```

Network reachability
Connectivity commands
```bash
# Verify Raft ports
nc -vz 10.0.0.12 7002

# Confirm pg_hba allows replication connections (user_name is an array column)
psql -c "SELECT * FROM pg_hba_file_rules WHERE 'pgraft_cluster' = ANY(user_name);"
```

Remediation steps
- Ensure `pgraft.cluster_id` matches on every node; mismatched IDs form separate quorums (a cross-node check is sketched below).
- Assign unique `pgraft.node_id` values and restart nodes after updates.
- Reload PostgreSQL when altering `pg_hba.conf` or replication credentials.
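A sketch of the cross-node identity check referenced in the first step. The host list is a placeholder, and it assumes `pgraft.cluster_id` and `pgraft.node_id` are readable with the standard `SHOW` command.

```bash
#!/usr/bin/env bash
# Compare pgraft.cluster_id / pgraft.node_id across nodes.
set -euo pipefail

HOSTS="10.0.0.11 10.0.0.12 10.0.0.13"   # assumed node addresses

for h in $HOSTS; do
  cid=$(psql -h "$h" -At -c "SHOW pgraft.cluster_id;")
  nid=$(psql -h "$h" -At -c "SHOW pgraft.node_id;")
  echo "$h cluster_id=$cid node_id=$nid"
done
# All cluster_id values must match; every node_id must be unique.
```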
Replication Lag & Follower Recovery
Bring slow followers back into quorum and alert on backlogs before they become critical.
Inspect lagging followers
Lag diagnostics
```sql
SELECT node_id,
       state,
       lag_entries,
       replication_lag_bytes,
       last_apply_lsn
FROM pgraft_log_get_replication_status()
ORDER BY lag_entries DESC;
```

Force follower resync
Follower catch-up
```sql
-- Run on the lagging follower once connectivity is restored
SELECT pgraft_log_sync_with_leader();
```

Alert on backlog
Lag alert script
```bash
#!/usr/bin/env bash
LAG=$(psql -t -c "SELECT COALESCE(MAX(lag_entries), 0) FROM pgraft_log_get_replication_status();")
if [[ "$LAG" -gt 1000 ]]; then
  echo "$(date --iso-8601=seconds) CRITICAL replication lag: $LAG entries" >> /var/log/pgraft-alerts.log
fi
```
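To keep the alert running continuously, the script can be scheduled from cron. The install path below is an assumption; adjust it to wherever you place the script.

```bash
# Schedule the lag alert script every minute (install path is an assumption).
cat > /etc/cron.d/pgraft-lag-check <<'CRON'
* * * * * postgres /usr/local/bin/pgraft-lag-check.sh
CRON
```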
Leadership Stability

Reduce election churn and heartbeat noise that introduce latency spikes.
Detect election drift
Election analysis
```sql
SELECT node_id,
       current_term,
       elections_triggered,
       elections_triggered::float / GREATEST(current_term, 1) AS elections_per_term
FROM pgraft_get_cluster_status()
WHERE elections_triggered::float / GREATEST(current_term, 1) > 2.0;
```

Tune timing parameters
Adjust timers
psql -c "SELECT pgraft_set_config('election_timeout', '1200ms');"
psql -c "SELECT pgraft_set_config('heartbeat_interval', '60ms');"
psql -c "SELECT pgraft_save_config();"Stabilization tips
- Increase `election_timeout` during heavy write bursts.
- Ensure leaders have sufficient CPU headroom for heartbeats.
- Temporarily disable aggressive failover automation during maintenance.
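To confirm the new timers actually calmed things down, poll `elections_triggered` for a few minutes and check that the counter stays flat. This sketch reuses the per-node columns from `pgraft_get_cluster_status()` shown above.

```bash
#!/usr/bin/env bash
# Poll election counters every 30s; elections_triggered should stop climbing.
set -euo pipefail

for i in $(seq 1 10); do
  psql -At -c "SELECT now()::text || ' node=' || node_id
                      || ' term=' || current_term
                      || ' elections=' || elections_triggered
               FROM pgraft_get_cluster_status();"
  sleep 30
done
```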
Snapshots & Storage
Clear snapshot backlogs and monitor disk usage for Raft metadata.
Check snapshot backlog
Snapshot metrics
```sql
SELECT total_entries,
       pending_snapshots,
       last_snapshot_term,
       last_snapshot_index
FROM pgraft_log_get_stats();
```

Inspect snapshot directory
Disk usage
```bash
du -sh /var/lib/postgresql/pgraft/snapshots
ls -lh /var/lib/postgresql/pgraft/snapshots | tail -10
```

Remediation tips
- Lower `pgraft.snapshot_threshold` to shorten log retention during churn.
- Move snapshot storage to SSD tiers for faster compaction.
- Archive old snapshots after confirming recent copies exist elsewhere (see the sketch after this list).
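A sketch of the archive step above. The snapshot directory matches the disk-usage commands earlier in this section; the archive destination and retention count are assumptions.

```bash
#!/usr/bin/env bash
# Archive all but the newest snapshots; verify off-node copies exist first.
set -euo pipefail

SNAP_DIR=/var/lib/postgresql/pgraft/snapshots
ARCHIVE_DIR=/var/backups/pgraft-snapshots     # assumed destination
KEEP=3                                        # assumed retention

mkdir -p "$ARCHIVE_DIR"
# List snapshots newest-first and move everything except the newest $KEEP.
ls -1t "$SNAP_DIR" | tail -n +$((KEEP + 1)) | while read -r snap; do
  mv "$SNAP_DIR/$snap" "$ARCHIVE_DIR/"
done
```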
KV Store Integrity
Validate replicated KV operations and watch for oversized payloads.
Roundtrip test
Health probe
```sql
DO $$
DECLARE
  k TEXT := 'troubleshoot_' || extract(epoch FROM now());
  v JSONB := jsonb_build_object('status', 'probe');
  roundtrip JSONB;
BEGIN
  PERFORM pgraft_kv_put(k, v);
  SELECT pgraft_kv_get(k) INTO roundtrip;
  IF roundtrip IS DISTINCT FROM v THEN
    RAISE EXCEPTION 'KV roundtrip failed: %', roundtrip;
  END IF;
  PERFORM pgraft_kv_delete(k);
END;
$$;
```
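The probe above exercises a single node. A hedged cross-node variant writes the key once and reads it back from each follower; it assumes `pgraft_kv_put` should be issued on the leader, that `pgraft_kv_get` can run on any node, and the host list is a placeholder.

```bash
#!/usr/bin/env bash
# Write a probe key via the leader, then confirm each follower can read it back.
set -euo pipefail

LEADER=10.0.0.11                        # assumed leader address
FOLLOWERS="10.0.0.12 10.0.0.13"         # assumed follower addresses
KEY="troubleshoot_xnode_$(date +%s)"

psql -h "$LEADER" -c "SELECT pgraft_kv_put('$KEY', '{\"status\": \"probe\"}'::jsonb);"
sleep 2   # allow a moment for replication
for h in $FOLLOWERS; do
  echo -n "$h: "
  psql -h "$h" -At -c "SELECT pgraft_kv_get('$KEY');"
done
psql -h "$LEADER" -c "SELECT pgraft_kv_delete('$KEY');"
```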
Detect skewed entries

Large value audit
```sql
SELECT key,
       pg_column_size(value) AS value_bytes,
       updated_at
FROM pgraft.kv
ORDER BY updated_at DESC
LIMIT 20;
```
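To turn the audit into a recurring check, the same query can be filtered on a size threshold. The 1 MB cutoff is an assumption; the table and columns are the ones shown above, and the alert log path reuses the one from the lag alert script.

```bash
#!/usr/bin/env bash
# Flag KV entries larger than an assumed 1 MB threshold.
set -euo pipefail

BIG=$(psql -At -c "SELECT count(*) FROM pgraft.kv WHERE pg_column_size(value) > 1048576;")
if [[ "$BIG" -gt 0 ]]; then
  echo "$(date --iso-8601=seconds) WARNING: $BIG oversized pgraft.kv entries" >> /var/log/pgraft-alerts.log
fi
```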
Support Bundle

Collect logs and catalog snapshots before contacting pgElephant support.
Generate support bundle
```bash
#!/usr/bin/env bash
DEST=/tmp/pgraft-support-$(date +%s)
mkdir -p "$DEST"

# Unquoted heredoc so ${DEST} expands inside the \o command.
psql -f - <<SQL
\o ${DEST}/cluster_status.txt
SELECT * FROM pgraft_get_cluster_status();
SELECT * FROM pgraft_log_get_replication_status();
SELECT * FROM pgraft_get_nodes();
SELECT * FROM pgraft_log_get_stats();
SQL

cp /var/log/postgresql/postgresql-*-main.log "$DEST"/
tar -C /tmp -czf pgraft-support.tar.gz "$(basename "$DEST")"
```