Disaster Recovery & High Availability

Systems Will Fail

Hard drives die. Network cables get cut. Data centers flood. Cloud providers have outages. Software has bugs. Humans make mistakes. The question is not whether your system will fail, but how it fails and how quickly it recovers.

High availability (HA) is about minimizing downtime. Disaster recovery (DR) is about recovering from catastrophic failures. Together, they ensure your system can survive everything from a single server crash to an entire region going offline.

This article covers the fundamental concepts, practical strategies, and testing approaches for building resilient systems.

RTO and RPO

Every disaster recovery plan starts with two numbers.

RPO (Recovery Point Objective):
  How much data can you afford to lose?
  "We can lose at most 5 minutes of data"

  ──────────────────┬──────────────────┬─────────
  Last backup       │ Data written     │ Disaster
  (safe)            │ (lost)           │
                    ◀─── RPO ────────▶│

RTO (Recovery Time Objective):
  How long can you be down?
  "We must be back online within 1 hour"

  ──────────────────┬──────────────────┬─────────
  Disaster          │ Recovery work    │ Back online
  occurs            │                  │
                    ◀─── RTO ────────▶│

Setting RTO and RPO

Tier	RPO	RTO	Example	Cost
Tier 1	Near-zero (seconds)	Minutes	Payment processing, trading	Very high
Tier 2	Minutes	< 1 hour	E-commerce checkout, SaaS	High
Tier 3	Hours	< 4 hours	Internal tools, CMS	Moderate
Tier 4	24 hours	< 24 hours	Analytics, batch processing	Low

Key insight: Cost rises steeply as RTO and RPO shrink. A system with zero RPO and a 5-minute RTO requires synchronous replication, automatic failover, and redundant infrastructure in multiple regions. A system with a 24-hour RPO needs only daily backups.
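To make the link between backup frequency and RPO concrete, here is a minimal sketch. It assumes a simplified model in which a backup captures the state at the moment it starts and only becomes usable once it finishes. The function names are illustrative and do not come from any library.

```typescript
// Worst-case data loss for periodic backups, in minutes.
// Assumption: if disaster strikes just before the next backup finishes,
// the newest usable backup is one full interval plus one backup duration old.
function worstCaseDataLossMinutes(
  intervalMinutes: number,
  backupDurationMinutes: number,
): number {
  return intervalMinutes + backupDurationMinutes;
}

// True when the backup schedule stays within the RPO target.
function meetsRpo(
  intervalMinutes: number,
  backupDurationMinutes: number,
  rpoMinutes: number,
): boolean {
  return worstCaseDataLossMinutes(intervalMinutes, backupDurationMinutes) <= rpoMinutes;
}

// Hourly backups that take 10 minutes, checked against a 4-hour RPO
console.log(meetsRpo(60, 10, 4 * 60)); // true
// Backups every 6 hours that take 20 minutes, checked against the same RPO
console.log(meetsRpo(6 * 60, 20, 4 * 60)); // false
```

The same arithmetic works in reverse: pick the RPO first, then derive the longest backup interval that still fits.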

Backup Strategies

Backups are the foundation of disaster recovery. Without them, any data loss is permanent.

Full Backups

A complete copy of all data. Simple but expensive in time and storage.

Day 1: Full backup (100 GB)          ████████████████████
Day 2: Full backup (102 GB)          ████████████████████
Day 3: Full backup (105 GB)          ████████████████████
Day 4: Full backup (108 GB)          ████████████████████

Total storage: 415 GB
Restore time: Fast (single restore)
Backup time: Slow (copies everything)

Incremental Backups

Backs up only the data that changed since the last backup (full or incremental).

Day 1: Full backup (100 GB)          ████████████████████
Day 2: Incremental (2 GB)            ██
Day 3: Incremental (3 GB)            ███
Day 4: Incremental (1 GB)            █

Total storage: 106 GB
Restore time: Slower (must apply full + all incrementals in order)
Backup time: Fast (only changes)

Differential Backups

Backs up only the data that changed since the last full backup.

Day 1: Full backup (100 GB)          ████████████████████
Day 2: Differential (2 GB)           ██
Day 3: Differential (5 GB)           █████  (all changes since Day 1)
Day 4: Differential (6 GB)           ██████ (all changes since Day 1)

Total storage: 113 GB
Restore time: Moderate (full + latest differential only)
Backup time: Moderate (growing diffs)

Comparison

Strategy	Storage Usage	Backup Speed	Restore Speed	Restore Complexity
Full	Highest	Slowest	Fastest	Simplest (one file)
Incremental	Lowest	Fastest	Slowest	Most complex (chain)
Differential	Moderate	Moderate	Moderate	Moderate (full + latest diff)
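The storage totals in the three examples above can be reproduced with a short sketch. It assumes that daily changes never overlap, so each differential is the running sum of the daily changes. The type and function names are illustrative.

```typescript
type Strategy = "full" | "incremental" | "differential";

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

// fullSizes: size of the whole data set on each day (GB), day 1 first.
// dailyChanges: data changed on day 2, day 3, ... relative to the previous day (GB).
function totalStorageGb(
  strategy: Strategy,
  fullSizes: number[],
  dailyChanges: number[],
): number {
  const base = fullSizes[0] ?? 0;
  switch (strategy) {
    case "full":
      return sum(fullSizes);
    case "incremental":
      return base + sum(dailyChanges);
    case "differential": {
      // Each differential holds every change since the full backup
      let sinceFull = 0;
      let total = base;
      for (const change of dailyChanges) {
        sinceFull += change;
        total += sinceFull;
      }
      return total;
    }
  }
}

// Number of backup files needed to restore to a given day (day 1 is the full backup).
function filesNeededForRestore(strategy: Strategy, day: number): number {
  if (strategy === "full" || day === 1) return 1;
  return strategy === "incremental" ? day : 2;
}

const fullSizes = [100, 102, 105, 108];
const dailyChanges = [2, 3, 1];
console.log(totalStorageGb("full", fullSizes, dailyChanges)); // 415
console.log(totalStorageGb("incremental", fullSizes, dailyChanges)); // 106
console.log(totalStorageGb("differential", fullSizes, dailyChanges)); // 113
console.log(filesNeededForRestore("incremental", 4)); // 4
```

The restore count shows the trade-off in the table: incremental chains grow by one file per day, while a differential restore always needs two.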

Practical Backup Implementation

#!/bin/bash
# PostgreSQL backup script with retention policy
set -euo pipefail  # Abort on errors, unset variables, and failed pipes

DB_HOST="${DB_HOST:-localhost}"
DB_NAME="${DB_NAME:-myapp}"
BACKUP_DIR="/backups"
S3_BUCKET="s3://myapp-backups"
RETENTION_DAYS=30

DATE=$(date +%Y-%m-%d_%H-%M-%S)
# Custom-format dumps are compressed internally, so name them .dump rather than .sql.gz
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${DATE}.dump"

# Create compressed backup (password is read from PGPASSWORD or ~/.pgpass)
pg_dump -h "$DB_HOST" -U postgres -d "$DB_NAME" \
  --format=custom \
  --compress=9 \
  --file="$BACKUP_FILE"

# Verify the archive is readable (lists its table of contents; not a full restore test)
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
  echo "ERROR: Backup verification failed!" >&2
  exit 1
fi

# Upload to S3 with server-side encryption
aws s3 cp "$BACKUP_FILE" "${S3_BUCKET}/daily/${DATE}.dump" \
  --sse AES256

# Clean up local backups older than 7 days
find "$BACKUP_DIR" -name "*.dump" -mtime +7 -delete

# Clean up S3 backups older than retention period (date -d requires GNU date)
# Each listing line has the form: <date> <time> <size> <name>
aws s3 ls "${S3_BUCKET}/daily/" | while read -r FILE_DATE _ _ FILE_NAME; do
  [ -n "$FILE_NAME" ] || continue  # Skip prefix lines and empty names
  AGE=$(( ($(date +%s) - $(date -d "$FILE_DATE" +%s)) / 86400 ))
  if [ "$AGE" -gt "$RETENTION_DAYS" ]; then
    aws s3 rm "${S3_BUCKET}/daily/${FILE_NAME}"
  fi
done

echo "Backup completed: $BACKUP_FILE"
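The retention rule in the script (delete anything whose age in whole days exceeds the retention period) is easy to get wrong at the boundary, so it can help to isolate it as a pure function and test it. The sketch below mirrors that rule. The `BackupObject` shape and the function name are assumptions made for illustration.

```typescript
interface BackupObject {
  key: string;      // object name in the bucket
  createdAt: Date;  // upload time
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the keys of backups whose age in whole days exceeds retentionDays.
function selectExpired(
  backups: BackupObject[],
  now: Date,
  retentionDays: number,
): string[] {
  return backups
    .filter((b) => {
      const ageDays = Math.floor((now.getTime() - b.createdAt.getTime()) / MS_PER_DAY);
      return ageDays > retentionDays;
    })
    .map((b) => b.key);
}

const now = new Date("2024-03-31T02:00:00Z");
const backups: BackupObject[] = [
  { key: "2024-02-28.dump", createdAt: new Date("2024-02-28T02:00:00Z") }, // 32 days old
  { key: "2024-03-01.dump", createdAt: new Date("2024-03-01T02:00:00Z") }, // exactly 30 days old
  { key: "2024-03-30.dump", createdAt: new Date("2024-03-30T02:00:00Z") }, // 1 day old
];
console.log(selectExpired(backups, now, 30)); // [ '2024-02-28.dump' ]
```

A backup that is exactly 30 days old is kept, which matches the strict greater-than comparison in the shell script.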

# Kubernetes CronJob for automated backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:16  # backup.sh also calls the AWS CLI, which this stock image does not include
              command: ["/bin/bash", "/scripts/backup.sh"]
              envFrom:
                - secretRef:
                    name: database-credentials
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
                - name: backup-storage
                  mountPath: /backups
          restartPolicy: OnFailure
          volumes:
            - name: scripts
              configMap:
                name: backup-scripts
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc

The 3-2-1 Backup Rule

3 copies of your data
2 different storage media (local disk + cloud)
1 offsite copy (different region or provider)

Example:
  Copy 1: Primary database (local SSD)
  Copy 2: Local backup server (NAS/SAN)
  Copy 3: Cloud storage in different region (S3 cross-region replication)
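The rule is simple enough to check mechanically against an inventory of where your data lives. Here is a minimal sketch, where the `DataCopy` shape and the medium labels are assumptions made for illustration.

```typescript
interface DataCopy {
  name: string;
  medium: string;   // e.g. "local-ssd", "nas", "object-storage"
  offsite: boolean; // true if stored in a different region or provider
}

// Checks the 3-2-1 rule: 3 copies, 2 distinct media, at least 1 offsite.
function satisfies321(copies: DataCopy[]): boolean {
  const media = new Set(copies.map((c) => c.medium));
  return copies.length >= 3 && media.size >= 2 && copies.some((c) => c.offsite);
}

const inventory: DataCopy[] = [
  { name: "Primary database", medium: "local-ssd", offsite: false },
  { name: "Local backup server", medium: "nas", offsite: false },
  { name: "Cross-region cloud storage", medium: "object-storage", offsite: true },
];
console.log(satisfies321(inventory)); // true
console.log(satisfies321(inventory.slice(0, 2))); // false: only 2 copies, none offsite
```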

Database Replication

Replication keeps multiple copies of your database synchronized. It serves dual purposes: high availability (if one copy fails, another takes over) and read scaling (distribute reads across replicas).

Master-Slave (Primary-Replica) Replication

One primary handles all writes. Changes replicate to one or more read replicas.

          Writes
            │
     ┌──────▼──────┐
     │   Primary    │
     │   (master)   │
     └──────┬───────┘
            │ Replication stream
     ┌──────┼──────────┐
     │      │          │
     ▼      ▼          ▼
┌────────┐┌────────┐┌────────┐
│Replica ││Replica ││Replica │
│  1     ││  2     ││  3     │
└────────┘└────────┘└────────┘
     ▲         ▲         ▲
     └─────────┴─────────┘
            Reads

# PostgreSQL streaming replication setup

# On primary: postgresql.conf
wal_level = replica
max_wal_senders = 5
wal_keep_size = '1GB'

# On primary: pg_hba.conf (allow replication connections)
host replication replicator 10.0.0.0/8 md5

# On replica: create base backup from primary
# -R creates standby.signal so the server starts as a standby (PostgreSQL 12+);
# without it the restored copy would start as an independent primary
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/data -P -R --wal-method=stream

# On replica: postgresql.conf
primary_conninfo = 'host=primary-host port=5432 user=replicator password=secret'
hot_standby = on

Synchronous vs asynchronous replication:

Asynchronous (default):
  Primary commits ──▶ Returns to client ──▶ Replicates (eventually)
  Faster writes, but replica may be behind (data loss risk on primary failure)

Synchronous:
  Primary commits ──▶ Waits for replica to confirm the commit ──▶ Returns to client
  No loss of acknowledged writes, but higher write latency (must wait for replica)

Semi-synchronous:
  Primary commits ──▶ Waits for at least 1 replica to acknowledge receipt ──▶ Returns to client
  Balance between safety and performance
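The latency cost of each mode can be illustrated with a deliberately simplified model. It assumes synchronous mode waits for every replica, semi-synchronous mode waits for the fastest one, and asynchronous mode waits for none. Real databases offer more knobs than this, and the function name and numbers are illustrative.

```typescript
type ReplicationMode = "async" | "sync" | "semi-sync";

// Client-visible commit latency in milliseconds under the simplified model.
// replicaAckMs: time for each replica to acknowledge the write.
function commitLatencyMs(
  mode: ReplicationMode,
  localCommitMs: number,
  replicaAckMs: number[],
): number {
  if (mode === "async" || replicaAckMs.length === 0) return localCommitMs;
  const wait =
    mode === "sync" ? Math.max(...replicaAckMs) : Math.min(...replicaAckMs);
  return localCommitMs + wait;
}

// One replica in the same region (5 ms) and one in another region (40 ms)
const acks = [5, 40];
console.log(commitLatencyMs("async", 2, acks)); // 2
console.log(commitLatencyMs("semi-sync", 2, acks)); // 7
console.log(commitLatencyMs("sync", 2, acks)); // 42
```

The gap between 7 ms and 42 ms is why cross-region synchronous replication is usually reserved for data that cannot tolerate any loss.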

Master-Master (Multi-Primary) Replication

Multiple nodes accept writes. Changes replicate bidirectionally. Significantly more complex due to conflict resolution.

     Writes              Writes
       │                   │
┌──────▼──────┐     ┌──────▼──────┐
│  Primary A  │◀───▶│  Primary B  │
│  (US-East)  │     │  (EU-West)  │
└─────────────┘     └─────────────┘
  Bidirectional replication

Conflict: What if both update the same row simultaneously?
  Primary A: UPDATE users SET name = 'Alice' WHERE id = 1;
  Primary B: UPDATE users SET name = 'Alicia' WHERE id = 1;

Resolution strategies:
  - Last write wins (timestamp-based)
  - Application-level conflict resolution
  - CRDTs (Conflict-free Replicated Data Types)
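As an illustration of the first strategy, here is a minimal last-write-wins sketch. The `VersionedValue` shape is an assumption. Real systems attach this metadata in different ways, and the approach depends on reasonably synchronized clocks.

```typescript
interface VersionedValue<T> {
  value: T;
  timestampMs: number; // wall-clock time of the write
  nodeId: string;      // which primary accepted the write
}

// Picks the later write. Ties are broken by node ID so that every node
// resolves the same conflict to the same winner.
function lastWriteWins<T>(a: VersionedValue<T>, b: VersionedValue<T>): VersionedValue<T> {
  if (a.timestampMs !== b.timestampMs) {
    return a.timestampMs > b.timestampMs ? a : b;
  }
  return a.nodeId > b.nodeId ? a : b;
}

const fromA = { value: "Alice", timestampMs: 1000, nodeId: "us-east" };
const fromB = { value: "Alicia", timestampMs: 1005, nodeId: "eu-west" };
console.log(lastWriteWins(fromA, fromB).value); // Alicia
```

Note that the losing write is silently discarded, which is why last write wins is a poor fit for data such as account balances.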

When to use multi-primary:

  - Multi-region deployments where local writes must be fast
  - Systems that can tolerate eventual consistency
  - Workloads where conflicts are rare (each region writes to different data)

Avoid multi-primary when:

  - Strong consistency is required (financial transactions)
  - Conflict resolution is complex for your data model
  - You do not have expertise to manage it

Failover Mechanisms

Failover is the process of switching to a backup system when the primary fails.

Automatic Failover

Normal operation:
  Client ──▶ Primary DB (reads + writes)
              │
              └──▶ Replica (standby)

Primary fails:
  Client ──▶ Primary DB ✗ (down!)
              │
              └──▶ Replica (promoted to primary)

After failover:
  Client ──▶ New Primary (was replica)
              │
              └──▶ New Replica (provisioned)
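Automatic failover starts with deciding that the primary is really down rather than briefly unreachable. A common approach is to require several consecutive failed health probes before acting. The class below is a minimal sketch of that idea, not how any particular tool implements it.

```typescript
class FailureDetector {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Records one probe result and returns true once failover should trigger.
  // A single successful probe resets the counter.
  record(probeSucceeded: boolean): boolean {
    this.consecutiveFailures = probeSucceeded ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.threshold;
  }
}

const detector = new FailureDetector(3);
console.log(detector.record(false)); // false (1 failure)
console.log(detector.record(false)); // false (2 failures)
console.log(detector.record(true));  // false (counter reset)
console.log(detector.record(false)); // false (1 failure)
console.log(detector.record(false)); // false (2 failures)
console.log(detector.record(false)); // true  (3 consecutive failures)
```

A higher threshold reduces false failovers caused by network blips, at the price of slower detection and therefore a longer RTO.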

# PostgreSQL automatic failover with Patroni
# patroni.yml
scope: my-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008

etcd:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB max replication lag
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: secret
    replication:
      username: replicator
      password: secret

Application-Level Failover

// Database connection with application-level failover.
// Caveat: a streaming replica is read-only, so writes sent to it fail until
// it is promoted. This fallback keeps reads available during the outage.
import { Pool } from 'pg';

class FailoverPool {
  private primary: Pool;
  private replica: Pool;
  private usingReplica = false;
  private healthCheck?: ReturnType<typeof setInterval>;

  constructor() {
    this.primary = new Pool({
      host: process.env.DB_PRIMARY_HOST,
      database: 'myapp',
      max: 20,
    });

    this.replica = new Pool({
      host: process.env.DB_REPLICA_HOST,
      database: 'myapp',
      max: 20,
    });
  }

  async query(sql: string, params?: unknown[]) {
    if (this.usingReplica) {
      // Stay on the replica; the health check switches back once the primary recovers
      return this.replica.query(sql, params);
    }
    try {
      return await this.primary.query(sql, params);
    } catch (error) {
      if (this.isConnectionError(error)) {
        console.error('Primary database failed, switching to replica');
        this.usingReplica = true;
        this.startPrimaryHealthCheck();
        return this.replica.query(sql, params);
      }
      throw error;
    }
  }

  private isConnectionError(error: unknown): boolean {
    const code = (error as { code?: string } | null)?.code;
    return code === 'ECONNREFUSED' || code === 'ENOTFOUND' || code === 'ETIMEDOUT';
  }

  private startPrimaryHealthCheck() {
    if (this.healthCheck) return; // Avoid stacking timers on repeated failures
    this.healthCheck = setInterval(async () => {
      try {
        await this.primary.query('SELECT 1');
        this.usingReplica = false;
        console.log('Primary database recovered, switching back');
        clearInterval(this.healthCheck);
        this.healthCheck = undefined;
      } catch {
        // Primary still down
      }
    }, 10000); // Check every 10 seconds
  }
}

Failover Types

Cold failover:
  Standby is off. On failure, start it up, restore data, switch traffic.
  RTO: Hours
  Cost: Lowest

Warm failover:
  Standby is running but not serving traffic. Data replicates async.
  On failure, promote standby, switch traffic.
  RTO: Minutes
  Cost: Moderate

Hot failover:
  Standby is running and data replicates synchronously.
  On failure, traffic switches instantly.
  RTO: Seconds
  Cost: Highest

┌────────────┬─────────────┬──────────────┬──────────────┐
│ Type       │ Standby     │ Data Sync    │ Switch Time  │
├────────────┼─────────────┼──────────────┼──────────────┤
│ Cold       │ Off         │ Restore from │ Hours        │
│            │             │ backup       │              │
│ Warm       │ Running     │ Async repli- │ Minutes      │
│            │             │ cation       │              │
│ Hot        │ Running +   │ Sync repli-  │ Seconds      │
│            │ ready       │ cation       │              │
└────────────┴─────────────┴──────────────┴──────────────┘

Testing Disaster Recovery

An untested disaster recovery plan is not a plan — it is a hope. Testing reveals gaps, timing issues, and broken assumptions before a real disaster does.

Types of DR Tests

Tabletop exercise: Walk through the DR plan on paper. "If the primary database fails at 3 AM, what happens? Who gets paged? What is the first command they run?" Low cost, reveals process gaps.

Component test: Test individual components — restore a backup, promote a replica, failover a load balancer. Validates that each piece works independently.

Full simulation: Simulate a complete disaster scenario in a staging environment. Time the recovery. Measure data loss. Identify bottlenecks.

Chaos engineering (production): Intentionally cause failures in production to verify resilience. This is the gold standard.

Chaos Engineering

# Pod-delete chaos experiment for Kubernetes (LitmusChaos)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-test
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"        # Kill pods for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10"        # Every 10 seconds
            - name: FORCE
              value: "false"     # Graceful termination

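// Simple chaos testing script
import { strict as assert } from 'node:assert';
import { setTimeout as sleep } from 'node:timers/promises';

// Environment-specific helpers, assumed to be implemented elsewhere
declare function checkSystemHealth(): Promise<{ healthy: boolean }>;
declare function simulateDatabaseFailure(target: string): Promise<void>;
declare function verifyDataIntegrity(): Promise<{ passed: boolean; missingRecords: number }>;
declare function restoreDatabase(target: string): Promise<void>;

const RTO_TARGET_SECONDS = 300; // 5 minutes

async function chaosTest() {
  console.log('Starting chaos test: database failover');

  // Step 1: Verify system is healthy
  const healthBefore = await checkSystemHealth();
  assert(healthBefore.healthy, 'System must be healthy before chaos test');

  // Step 2: Simulate primary database failure
  console.log('Simulating primary database failure...');
  await simulateDatabaseFailure('primary');
  const failureTime = Date.now(); // Start the RTO clock at the failure, not at test start

  try {
    // Step 3: Measure recovery time (give up once the RTO target has passed)
    let recovered = false;
    while (!recovered && Date.now() - failureTime < RTO_TARGET_SECONDS * 1000) {
      await sleep(5000);
      recovered = (await checkSystemHealth()).healthy;
    }
    const recoveryTime = (Date.now() - failureTime) / 1000;

    // Step 4: Verify data integrity
    const dataCheck = await verifyDataIntegrity();

    // Step 5: Report results
    console.log({
      test: 'database-failover',
      recovered,
      recoveryTimeSeconds: recoveryTime,
      rtoTarget: RTO_TARGET_SECONDS,
      rtoMet: recovered && recoveryTime <= RTO_TARGET_SECONDS,
      dataIntegrity: dataCheck.passed,
      dataLoss: dataCheck.missingRecords,
      rpoTarget: 0,
      rpoMet: dataCheck.missingRecords === 0,
    });
  } finally {
    // Step 6: Restore original configuration, even if a step above throws
    await restoreDatabase('primary');
  }
}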

DR Test Checklist

Preparation:
  [ ] Notify stakeholders of planned test
  [ ] Ensure backups are current
  [ ] Have rollback plan ready
  [ ] Set monitoring/alerting to test mode

During test:
  [ ] Time the detection (how long until failure is noticed?)
  [ ] Time the response (how long until recovery starts?)
  [ ] Time the recovery (how long until service is restored?)
  [ ] Measure data loss (how much data was lost?)
  [ ] Test client reconnection (do clients recover automatically?)
  [ ] Verify data integrity after recovery

After test:
  [ ] Compare actual RTO vs target RTO
  [ ] Compare actual RPO vs target RPO
  [ ] Document what went wrong
  [ ] Update runbooks with lessons learned
  [ ] Schedule next test
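The timings collected during the test can be turned into a pass/fail result automatically. The sketch below assumes that actual RTO is the sum of detection, response, and recovery time, and that data loss is measured in seconds of lost writes. The field names are illustrative.

```typescript
interface DrTestMeasurement {
  detectionSeconds: number; // failure occurs -> failure noticed
  responseSeconds: number;  // failure noticed -> recovery starts
  recoverySeconds: number;  // recovery starts -> service restored
  dataLossSeconds: number;  // span of writes that were lost
}

interface DrTargets {
  rtoSeconds: number;
  rpoSeconds: number;
}

function evaluateDrTest(m: DrTestMeasurement, t: DrTargets) {
  const actualRtoSeconds = m.detectionSeconds + m.responseSeconds + m.recoverySeconds;
  return {
    actualRtoSeconds,
    rtoMet: actualRtoSeconds <= t.rtoSeconds,
    actualRpoSeconds: m.dataLossSeconds,
    rpoMet: m.dataLossSeconds <= t.rpoSeconds,
  };
}

const result = evaluateDrTest(
  { detectionSeconds: 30, responseSeconds: 60, recoverySeconds: 180, dataLossSeconds: 5 },
  { rtoSeconds: 300, rpoSeconds: 0 },
);
console.log(result);
// { actualRtoSeconds: 270, rtoMet: true, actualRpoSeconds: 5, rpoMet: false }
```

Breaking RTO into its three parts shows where to invest: in this hypothetical run, recovery work dominates, not detection.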

Multi-Region Strategies

For the highest level of availability, deploy across multiple geographic regions. This protects against region-wide failures (natural disasters, fiber cuts, cloud provider outages).

Active-Passive (one region serves traffic):
┌──────────────────┐     ┌──────────────────┐
│    US-East       │     │    EU-West       │
│    (active)      │     │    (passive)     │
│                  │     │                  │
│  ┌────────────┐  │     │  ┌────────────┐  │
│  │ App Servers│  │     │  │ App Servers│  │
│  │ (serving)  │  │     │  │ (standby)  │  │
│  └─────┬──────┘  │     │  └─────┬──────┘  │
│        │         │     │        │         │
│  ┌─────▼──────┐  │────▶│  ┌─────▼──────┐  │
│  │  Database  │  │repli│  │  Database  │  │
│  │  (primary) │  │cate │  │  (replica) │  │
│  └────────────┘  │     │  └────────────┘  │
└──────────────────┘     └──────────────────┘

If US-East fails: DNS switches to EU-West
RPO: Depends on replication lag
RTO: Failure detection plus DNS propagation (minutes with a low TTL; longer if resolvers cache beyond the TTL)


Active-Active (both regions serve traffic):
┌──────────────────┐     ┌──────────────────┐
│    US-East       │     │    EU-West       │
│    (active)      │     │    (active)      │
│                  │     │                  │
│  ┌────────────┐  │     │  ┌────────────┐  │
│  │ App Servers│  │     │  │ App Servers│  │
│  │ (serving)  │  │     │  │ (serving)  │  │
│  └─────┬──────┘  │     │  └─────┬──────┘  │
│        │         │     │        │         │
│  ┌─────▼──────┐  │◀──▶│  ┌─────▼──────┐  │
│  │  Database  │  │bidi │  │  Database  │  │
│  │  (primary) │  │repli│  │  (primary) │  │
│  └────────────┘  │cate │  └────────────┘  │
└──────────────────┘     └──────────────────┘

Both regions serve traffic. Users routed to nearest region.
If US-East fails: EU-West handles all traffic.
Complexity: Conflict resolution, data consistency.

Global Load Balancing

┌──────────────────────────────────────────────────────┐
│              Global DNS Load Balancer                  │
│              (Route53, Cloudflare)                     │
│                                                      │
│  US users    ──▶  US-East  (latency-based routing)   │
│  EU users    ──▶  EU-West  (latency-based routing)   │
│  AP users    ──▶  AP-Southeast (latency-based)       │
│                                                      │
│  Health checks: If US-East fails, route US to EU-West│
└──────────────────────────────────────────────────────┘
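The routing decision in the diagram can be sketched as a small function: send each user to the lowest-latency region that passes its health check. This is an illustration of the idea only, not how Route53 or Cloudflare implement it, and the `Region` shape and latency figures are assumptions.

```typescript
interface Region {
  name: string;
  healthy: boolean;
  latencyMs: number; // measured latency from the user's location
}

// Returns the healthy region with the lowest latency, or undefined if all are down.
function pickRegion(regions: Region[]): Region | undefined {
  const healthy = regions.filter((r) => r.healthy);
  if (healthy.length === 0) return undefined;
  return healthy.reduce((best, r) => (r.latencyMs < best.latencyMs ? r : best));
}

// Latencies as seen by a hypothetical US user
const regions: Region[] = [
  { name: "US-East", healthy: true, latencyMs: 20 },
  { name: "EU-West", healthy: true, latencyMs: 90 },
  { name: "AP-Southeast", healthy: true, latencyMs: 200 },
];
console.log(pickRegion(regions)?.name); // US-East

// US-East fails its health check, so the user is routed to EU-West
const afterFailure = regions.map((r) =>
  r.name === "US-East" ? { ...r, healthy: false } : r,
);
console.log(pickRegion(afterFailure)?.name); // EU-West
```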

# AWS Route53 health check (reference it from a failover routing record to enable DNS failover)
aws route53 create-health-check --caller-reference "$(date +%s)" \
  --health-check-config '{
    "IPAddress": "203.0.113.1",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "EnableSNI": true,
    "Regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]
  }'
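The health check settings feed directly into RTO. As a rough estimate, detection takes about the request interval multiplied by the failure threshold, after which clients may keep using cached DNS answers for up to one TTL. The sketch below ignores database promotion time and resolvers that disregard TTLs, and the 60-second TTL is a hypothetical value.

```typescript
// Rough estimate of DNS failover time in seconds.
function estimatedFailoverSeconds(
  requestIntervalSeconds: number,
  failureThreshold: number,
  dnsTtlSeconds: number,
): number {
  const detectionSeconds = requestIntervalSeconds * failureThreshold;
  return detectionSeconds + dnsTtlSeconds;
}

// RequestInterval 10 and FailureThreshold 3 from the health check above,
// combined with a 60-second TTL on the DNS record
console.log(estimatedFailoverSeconds(10, 3, 60)); // 90

// The same health check with a 1-hour TTL
console.log(estimatedFailoverSeconds(10, 3, 3600)); // 3630
```

The second result shows why a low TTL matters more than an aggressive health check interval once the TTL dominates the total.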

Multi-Region Decision Matrix

Strategy	Complexity	Cost	RTO	RPO	Best For
Single region	Low	Low	Hours	Hours	Non-critical apps
Active-passive	Moderate	Moderate	Minutes	Minutes	Most production apps
Active-active	High	High	Seconds	Near-zero	Global user base, Tier 1

Putting It Together: DR Plan Template

1. CLASSIFICATION
   Service: Payment Processing API
   Tier: 1 (business critical)
   RTO: 5 minutes
   RPO: 0 (zero data loss)

2. BACKUP STRATEGY
   - Synchronous replication to standby in secondary region
   - Hourly snapshots to S3 (cross-region replication)
   - Daily full backup to cold storage (30-day retention)

3. FAILOVER PROCEDURE
   a. Automatic health check detects primary failure
   b. DNS failover to secondary region (TTL: 60s)
   c. Replica promoted to primary (Patroni automatic)
   d. Application reconnects to new primary
   e. Alert: on-call engineer notified

4. RECOVERY PROCEDURE
   a. Investigate root cause of original failure
   b. Provision new replica from promoted primary
   c. Verify replication is current
   d. Optionally fail back to original region
   e. Post-incident review within 48 hours

5. TESTING SCHEDULE
   - Backup restore test: Monthly
   - Failover test: Quarterly
   - Full DR simulation: Twice a year
   - Chaos engineering: Ongoing (automated)

6. CONTACTS
   - Primary on-call: [PagerDuty rotation]
   - Database team: [Slack channel]
   - Infrastructure: [Slack channel]
   - Management escalation: [after 30 min unresolved]

Key Takeaways

Define RTO and RPO before building your DR strategy. These numbers drive every architecture decision.
Follow the 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite. Test restores monthly. A backup you have never restored is not a backup.
Start with primary-replica replication and automatic failover. This covers most failure scenarios at reasonable cost.
Multi-region active-active is the gold standard for availability but introduces significant complexity (data conflicts, network latency, operational overhead). Most applications do fine with active-passive.
Test your disaster recovery plan regularly. An untested plan will fail when you need it most. Chaos engineering in production validates resilience under real conditions.
Failover is not just a database concern. Every stateful component needs a failover strategy: caches, queues, file storage, and third-party service integrations.
Document everything: runbooks, escalation paths, recovery procedures. At 3 AM during an outage, nobody remembers the correct sequence of commands.
The cost of downtime almost always exceeds the cost of redundancy. Calculate the business impact of an hour of downtime and use that to justify your DR investment.
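The last point can be worked through with simple arithmetic: convert an availability target into expected downtime per year, then price the difference. All figures below are hypothetical placeholders, to be replaced with your own numbers.

```typescript
const HOURS_PER_YEAR = 365 * 24; // 8760

// Expected downtime per year for a given availability percentage.
function annualDowntimeHours(availabilityPercent: number): number {
  return HOURS_PER_YEAR * (1 - availabilityPercent / 100);
}

// Yearly downtime cost avoided by moving from one availability level to another.
function annualDowntimeCostSaved(
  availabilityBefore: number,
  availabilityAfter: number,
  costPerHourOfDowntime: number,
): number {
  const hoursSaved =
    annualDowntimeHours(availabilityBefore) - annualDowntimeHours(availabilityAfter);
  return hoursSaved * costPerHourOfDowntime;
}

console.log(annualDowntimeHours(99.9).toFixed(2)); // 8.76
console.log(annualDowntimeHours(99.99).toFixed(3)); // 0.876

// Hypothetical: downtime costs 10,000 per hour
const saved = annualDowntimeCostSaved(99.9, 99.99, 10_000);
console.log(Math.round(saved)); // 78840
```

If the added redundancy costs less per year than the downtime it avoids, the investment pays for itself on these assumptions alone, before counting reputational damage.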