Systems Will Fail
Hard drives die. Network cables get cut. Data centers flood. Cloud providers have outages. Software has bugs. Humans make mistakes. The question is not whether your system will fail, but how it fails and how quickly it recovers.
High availability (HA) is about minimizing downtime. Disaster recovery (DR) is about recovering from catastrophic failures. Together, they ensure your system can survive everything from a single server crash to an entire region going offline.
This article covers the fundamental concepts, practical strategies, and testing approaches for building resilient systems.
RTO and RPO
Every disaster recovery plan starts with two numbers.
RPO (Recovery Point Objective):
How much data can you afford to lose?
"We can lose at most 5 minutes of data"
──────────────────┬──────────────────┬────────
   Last backup    │   Data written   │ Disaster
     (safe)       │      (lost)      │
                  │◀───── RPO ──────▶│
RTO (Recovery Time Objective):
How long can you be down?
"We must be back online within 1 hour"
──────────────────┬──────────────────┬────────
    Disaster      │  Recovery work   │ Back online
     occurs       │                  │
                  │◀───── RTO ──────▶│
Setting RTO and RPO
| Tier | RPO | RTO | Example | Cost |
|---|---|---|---|---|
| Tier 1 | Near-zero (seconds) | Minutes | Payment processing, trading | Very high |
| Tier 2 | Minutes | < 1 hour | E-commerce checkout, SaaS | High |
| Tier 3 | Hours | < 4 hours | Internal tools, CMS | Moderate |
| Tier 4 | 24 hours | < 24 hours | Analytics, batch processing | Low |
Key insight: Cost rises steeply as RTO and RPO targets shrink. A system with zero RPO and a 5-minute RTO requires synchronous replication, automatic failover, and redundant infrastructure in multiple regions. A system with a 24-hour RPO needs only daily backups.
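To make the two numbers concrete, here is a minimal sketch that checks a backup schedule against a pair of targets. The names `RecoveryObjectives` and `meetsObjectives` and the sample values are illustrative assumptions, not part of any library. It assumes backups are the only protection, so the worst-case data loss equals one full backup interval.

```typescript
// Hypothetical helper types and names, for illustration only.
interface RecoveryObjectives {
  rpoMinutes: number; // maximum tolerable data loss
  rtoMinutes: number; // maximum tolerable downtime
}

// With periodic backups only, the worst case is a failure just before the
// next backup runs, so up to one full interval of writes is lost.
function worstCaseDataLossMinutes(backupIntervalMinutes: number): number {
  return backupIntervalMinutes;
}

function meetsObjectives(
  objectives: RecoveryObjectives,
  backupIntervalMinutes: number,
  estimatedRestoreMinutes: number,
): { rpoMet: boolean; rtoMet: boolean } {
  return {
    rpoMet: worstCaseDataLossMinutes(backupIntervalMinutes) <= objectives.rpoMinutes,
    rtoMet: estimatedRestoreMinutes <= objectives.rtoMinutes,
  };
}

// Daily backups with a 3-hour restore, checked against two sample targets.
const tier4: RecoveryObjectives = { rpoMinutes: 24 * 60, rtoMinutes: 24 * 60 };
const tier2: RecoveryObjectives = { rpoMinutes: 5, rtoMinutes: 60 };
console.log(meetsObjectives(tier4, 24 * 60, 180)); // { rpoMet: true, rtoMet: true }
console.log(meetsObjectives(tier2, 24 * 60, 180)); // { rpoMet: false, rtoMet: false }
```

Daily backups satisfy the Tier 4 example but miss both Tier 2 targets, which is why tighter tiers need replication rather than backups alone.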
Backup Strategies
Backups are the foundation of disaster recovery. Without them, any data loss is permanent.
Full Backups
A complete copy of all data. Simple but expensive in time and storage.
Day 1: Full backup (100 GB) ████████████████████
Day 2: Full backup (102 GB) ████████████████████
Day 3: Full backup (105 GB) ████████████████████
Day 4: Full backup (108 GB) ████████████████████
Total storage: 415 GB
Restore time: Fast (single restore)
Backup time: Slow (copies everything)
Incremental Backups
Only backs up data that changed since the last backup (full or incremental).
Day 1: Full backup (100 GB) ████████████████████
Day 2: Incremental (2 GB)   ██
Day 3: Incremental (3 GB)   ███
Day 4: Incremental (1 GB)   █
Total storage: 106 GB
Restore time: Slower (must apply full + all incrementals in order)
Backup time: Fast (only changes)
Differential Backups
Only backs up data that changed since the last full backup.
Day 1: Full backup (100 GB) ████████████████████
Day 2: Differential (2 GB)  ██
Day 3: Differential (5 GB)  █████ (all changes since Day 1)
Day 4: Differential (6 GB)  ██████ (all changes since Day 1)
Total storage: 113 GB
Restore time: Moderate (full + latest differential only)
Backup time: Moderate (growing diffs)
Comparison
| Strategy | Storage Usage | Backup Speed | Restore Speed | Restore Complexity |
|---|---|---|---|---|
| Full | Highest | Slowest | Fastest | Simplest (one file) |
| Incremental | Lowest | Fastest | Slowest | Most complex (chain) |
| Differential | Moderate | Moderate | Moderate | Moderate (full + latest diff) |
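The difference in restore complexity can be sketched as a function that works out which backups must be applied, and in what order. This is a simplified illustration under the definitions above. The `Backup` type and `restoreChain` function are hypothetical names, not the interface of any backup tool.

```typescript
type BackupKind = "full" | "incremental" | "differential";

interface Backup {
  day: number;
  kind: BackupKind;
}

// Returns the backups to apply, in order, to reach the state of the latest
// backup in `history`. `history` must be sorted by day, ascending.
function restoreChain(history: Backup[]): Backup[] {
  // Locate the most recent full backup; everything before it is irrelevant.
  let lastFull = -1;
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i].kind === "full") {
      lastFull = i;
      break;
    }
  }
  if (lastFull === -1) throw new Error("no full backup available");

  const chain: Backup[] = [history[lastFull]];
  for (const backup of history.slice(lastFull + 1)) {
    if (backup.kind === "differential") {
      // A differential contains every change since the full backup,
      // so it replaces anything queued after the full.
      chain.splice(1);
    }
    chain.push(backup);
  }
  return chain;
}

const incrementalHistory: Backup[] = [
  { day: 1, kind: "full" },
  { day: 2, kind: "incremental" },
  { day: 3, kind: "incremental" },
  { day: 4, kind: "incremental" },
];
const differentialHistory: Backup[] = [
  { day: 1, kind: "full" },
  { day: 2, kind: "differential" },
  { day: 3, kind: "differential" },
  { day: 4, kind: "differential" },
];
console.log(restoreChain(incrementalHistory).map((b) => b.day)); // [ 1, 2, 3, 4 ]
console.log(restoreChain(differentialHistory).map((b) => b.day)); // [ 1, 4 ]
```

The incremental chain needs every file since the full backup, so one corrupt incremental breaks the restore. The differential chain never needs more than two files.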
Practical Backup Implementation
#!/bin/bash
# PostgreSQL backup script with retention policy
set -euo pipefail  # abort on errors, unset variables, and failed pipeline stages

DB_HOST="${DB_HOST:-localhost}"
DB_NAME="${DB_NAME:-myapp}"
BACKUP_DIR="/backups"
S3_BUCKET="s3://myapp-backups"
RETENTION_DAYS=30
DATE=$(date +%Y-%m-%d_%H-%M-%S)
# pg_dump's custom format is a compressed archive, not gzipped SQL,
# so the file is named .dump rather than .sql.gz
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${DATE}.dump"

# Create compressed backup (custom format, restorable with pg_restore)
pg_dump -h "$DB_HOST" -U postgres -d "$DB_NAME" \
  --format=custom \
  --compress=9 \
  --file="$BACKUP_FILE"

# Verify that the archive's table of contents is readable
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
  echo "ERROR: Backup verification failed!" >&2
  exit 1
fi

# Upload to S3 with server-side encryption
aws s3 cp "$BACKUP_FILE" "${S3_BUCKET}/daily/${DATE}.dump" \
  --sse AES256

# Clean up local backups older than 7 days
find "$BACKUP_DIR" -name "*.dump" -mtime +7 -delete

# Clean up S3 backups older than the retention period (date -d requires GNU date)
aws s3 ls "${S3_BUCKET}/daily/" | while read -r line; do
  FILE_DATE=$(echo "$line" | awk '{print $1}')
  FILE_NAME=$(echo "$line" | awk '{print $4}')
  [ -n "$FILE_NAME" ] || continue  # skip prefix lines, which have no file name
  AGE=$(( ($(date +%s) - $(date -d "$FILE_DATE" +%s)) / 86400 ))
  if [ "$AGE" -gt "$RETENTION_DAYS" ]; then
    aws s3 rm "${S3_BUCKET}/daily/${FILE_NAME}"
  fi
done
echo "Backup completed: $BACKUP_FILE"
# Kubernetes CronJob for automated backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              # The stock postgres image ships pg_dump but not the AWS CLI that
              # backup.sh calls; use a derived image that includes both.
              image: postgres:16
              command: ["/bin/bash", "/scripts/backup.sh"]
              envFrom:
                - secretRef:
                    name: database-credentials
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
                - name: backup-storage
                  mountPath: /backups
          restartPolicy: OnFailure
          volumes:
            - name: scripts
              configMap:
                name: backup-scripts
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-pvc
The 3-2-1 Backup Rule
3 copies of your data
2 different storage media (local disk + cloud)
1 offsite copy (different region or provider)
Example:
Copy 1: Primary database (local SSD)
Copy 2: Local backup server (NAS/SAN)
Copy 3: Cloud storage in different region (S3 cross-region replication)
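The rule is simple enough to check mechanically. The sketch below validates an inventory of copies against it. The `DataCopy` shape and `check321` name are assumptions made for this example, and the medium labels are free-form strings.

```typescript
interface DataCopy {
  name: string;
  medium: string;   // e.g. "local-ssd", "nas", "object-storage"
  offsite: boolean; // true if stored in a different region or provider
}

// Reports which parts of the 3-2-1 rule an inventory fails, if any.
function check321(copies: DataCopy[]): { ok: boolean; problems: string[] } {
  const problems: string[] = [];
  if (copies.length < 3) {
    problems.push(`need 3 copies, found ${copies.length}`);
  }
  const media = new Set(copies.map((c) => c.medium));
  if (media.size < 2) {
    problems.push(`need 2 media types, found ${media.size}`);
  }
  if (!copies.some((c) => c.offsite)) {
    problems.push("need at least 1 offsite copy");
  }
  return { ok: problems.length === 0, problems };
}

// The example inventory from the text.
const inventory: DataCopy[] = [
  { name: "Primary database", medium: "local-ssd", offsite: false },
  { name: "Local backup server", medium: "nas", offsite: false },
  { name: "Cross-region S3", medium: "object-storage", offsite: true },
];
console.log(check321(inventory).ok); // true
```

Running a check like this in CI or a scheduled job is one way to notice when a decommissioned backup target quietly breaks the rule.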
Database Replication
Replication keeps multiple copies of your database synchronized. It serves dual purposes: high availability (if one copy fails, another takes over) and read scaling (distribute reads across replicas).
Master-Slave (Primary-Replica) Replication
One primary handles all writes. Changes replicate to one or more read replicas.
           Writes
              │
       ┌──────▼──────┐
       │   Primary   │
       │  (master)   │
       └──────┬──────┘
              │ Replication stream
    ┌─────────┼─────────┐
    │         │         │
    ▼         ▼         ▼
┌────────┐┌────────┐┌────────┐
│Replica ││Replica ││Replica │
│   1    ││   2    ││   3    │
└────────┘└────────┘└────────┘
    ▲         ▲         ▲
    └─────────┴─────────┘
            Reads
# PostgreSQL streaming replication setup
# On primary: postgresql.conf
wal_level = replica
max_wal_senders = 5
wal_keep_size = '1GB'
# On primary: pg_hba.conf (allow replication connections)
host replication replicator 10.0.0.0/8 md5
# On replica: create base backup from primary.
# -R writes standby.signal and the connection settings (PostgreSQL 12+), so the
# server starts as a standby instead of as an independent primary.
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/data -P -R --wal-method=stream
# On replica: postgresql.conf (primary_conninfo is also written by -R above)
primary_conninfo = 'host=primary-host port=5432 user=replicator password=secret'
hot_standby = on
Synchronous vs asynchronous replication:
Asynchronous (default):
  Primary commits ──▶ Returns to client ──▶ Replicates (eventually)
  Faster writes, but the replica may lag behind (risk of data loss if the primary fails)
Synchronous:
  Primary commits ──▶ Waits for replica commit ──▶ Returns to client
  No loss of acknowledged writes, but higher write latency (must wait for the replica)
Semi-synchronous:
  Primary commits ──▶ Waits for at least 1 replica to acknowledge ──▶ Returns to client
  Balance between safety and performance
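The latency trade-off between the three modes can be illustrated with a toy model. This is a simplification, not how any particular database computes it. It assumes synchronous mode waits for every replica and semi-synchronous mode waits only for the fastest one. The function name and millisecond figures are made up for the example.

```typescript
type ReplicationMode = "async" | "semi-sync" | "sync";

// Time until the client receives its commit acknowledgement, in milliseconds.
// localCommitMs: time for the primary to commit locally.
// replicaAckMs: time for each replica to confirm the change.
function commitLatencyMs(
  mode: ReplicationMode,
  localCommitMs: number,
  replicaAckMs: number[],
): number {
  // Async never waits; with no replicas there is nothing to wait for.
  if (mode === "async" || replicaAckMs.length === 0) return localCommitMs;
  const wait =
    mode === "sync"
      ? Math.max(...replicaAckMs) // assumed: wait for the slowest replica
      : Math.min(...replicaAckMs); // assumed: wait for the fastest replica
  return localCommitMs + wait;
}

// One same-zone replica (1 ms) and two remote replicas (40 ms, 80 ms).
const acks = [1, 40, 80];
console.log(commitLatencyMs("async", 2, acks));     // 2
console.log(commitLatencyMs("semi-sync", 2, acks)); // 3
console.log(commitLatencyMs("sync", 2, acks));      // 82
```

Under these assumptions, a single nearby replica gives semi-synchronous mode most of the durability benefit at a small fraction of the latency cost.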
Master-Master (Multi-Primary) Replication
Multiple nodes accept writes. Changes replicate bidirectionally. Significantly more complex due to conflict resolution.
     Writes               Writes
       │                    │
┌──────▼──────┐      ┌──────▼──────┐
│  Primary A  │◀────▶│  Primary B  │
│  (US-East)  │      │  (EU-West)  │
└─────────────┘      └─────────────┘
      Bidirectional replication
Conflict: What if both update the same row simultaneously?
Primary A: UPDATE users SET name = 'Alice' WHERE id = 1;
Primary B: UPDATE users SET name = 'Alicia' WHERE id = 1;
Resolution strategies:
- Last write wins (timestamp-based)
- Application-level conflict resolution
- CRDTs (Conflict-free Replicated Data Types)
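As an illustration of the first strategy, here is a minimal last-write-wins merge. The `Versioned` type and the node-ID tie-break are assumptions for this sketch. Real systems differ in how they order timestamps and break ties.

```typescript
interface Versioned<T> {
  value: T;
  timestampMs: number; // wall-clock time of the write
  nodeId: string;      // tie-breaker when timestamps are equal
}

// Picks the surviving version. Every node must apply the same rule so that
// all replicas converge on the same value.
function lastWriteWins<T>(a: Versioned<T>, b: Versioned<T>): Versioned<T> {
  if (a.timestampMs !== b.timestampMs) {
    return a.timestampMs > b.timestampMs ? a : b;
  }
  return a.nodeId > b.nodeId ? a : b; // deterministic tie-break
}

// The conflicting updates from the example above, 5 ms apart.
const fromA: Versioned<string> = { value: "Alice", timestampMs: 1000, nodeId: "us-east" };
const fromB: Versioned<string> = { value: "Alicia", timestampMs: 1005, nodeId: "eu-west" };
console.log(lastWriteWins(fromA, fromB).value); // Alicia
```

The losing write is silently discarded, and the outcome depends on clock synchronization between nodes. That is acceptable for a display name but not for an account balance.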
When to use multi-primary:
- Multi-region deployments where local writes must be fast
- Systems that can tolerate eventual consistency
- Workloads where conflicts are rare (each region writes to different data)
Avoid multi-primary when:
- Strong consistency is required (financial transactions)
- Conflict resolution is complex for your data model
- You do not have expertise to manage it
Failover Mechanisms
Failover is the process of switching to a backup system when the primary fails.
Automatic Failover
Normal operation:
  Client ──▶ Primary DB (reads + writes)
               │
               └──▶ Replica (standby)
Primary fails:
  Client ──▶ Primary DB ✗ (down!)
               │
               └──▶ Replica (promoted to primary)
After failover:
  Client ──▶ New Primary (was replica)
               │
               └──▶ New Replica (provisioned)
# PostgreSQL automatic failover with Patroni
# patroni.yml
scope: my-cluster
name: node1
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008
etcd:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB max replication lag
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: secret
    replication:
      username: replicator
      password: secret
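The role of `maximum_lag_on_failover` can be pictured as a filter on promotion candidates: a replica that has fallen too far behind is not eligible. The sketch below shows that idea only. It is not Patroni's implementation, and `ReplicaStatus` and `pickFailoverCandidate` are hypothetical names.

```typescript
interface ReplicaStatus {
  name: string;
  healthy: boolean;
  lagBytes: number; // how far the replica trails the primary's WAL position
}

// Chooses the healthy replica with the least lag, ignoring any replica that
// exceeds the lag limit. Returns undefined if no replica qualifies.
function pickFailoverCandidate(
  replicas: ReplicaStatus[],
  maxLagBytes: number,
): ReplicaStatus | undefined {
  const eligible = replicas.filter((r) => r.healthy && r.lagBytes <= maxLagBytes);
  eligible.sort((a, b) => a.lagBytes - b.lagBytes);
  return eligible[0];
}

const candidates: ReplicaStatus[] = [
  { name: "node2", healthy: true, lagBytes: 2_000_000 }, // over the 1 MB limit
  { name: "node3", healthy: true, lagBytes: 4_096 },
  { name: "node4", healthy: false, lagBytes: 0 },        // unreachable
];
console.log(pickFailoverCandidate(candidates, 1_048_576)?.name); // node3
```

When no replica qualifies, the function returns `undefined`. In that situation a stricter lag limit trades availability for a tighter bound on data loss.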
Application-Level Failover
// Database connection with automatic failover
import { Pool } from 'pg';

class FailoverPool {
  private primary: Pool;
  private replica: Pool;
  private usingReplica = false;
  private healthCheck?: ReturnType<typeof setInterval>;

  constructor() {
    this.primary = new Pool({
      host: process.env.DB_PRIMARY_HOST,
      database: 'myapp',
      max: 20,
    });
    this.replica = new Pool({
      host: process.env.DB_REPLICA_HOST,
      database: 'myapp',
      max: 20,
    });
  }

  async query(sql: string, params?: unknown[]) {
    try {
      if (this.usingReplica) {
        // Primary is marked down: still try it first, then fall back to the replica
        return await this.tryPrimary(sql, params);
      }
      return await this.primary.query(sql, params);
    } catch (error) {
      if (this.isConnectionError(error)) {
        console.error('Primary database failed, switching to replica');
        this.usingReplica = true;
        this.startPrimaryHealthCheck();
        // A streaming replica is read-only: writes fail here until it is promoted
        return this.replica.query(sql, params);
      }
      throw error;
    }
  }

  private async tryPrimary(sql: string, params?: unknown[]) {
    try {
      const result = await this.primary.query(sql, params);
      this.usingReplica = false;
      console.log('Primary database recovered, switching back');
      return result;
    } catch (error) {
      // Only connection failures justify a fallback; surface query errors as-is
      if (!this.isConnectionError(error)) throw error;
      return this.replica.query(sql, params);
    }
  }

  private isConnectionError(error: unknown): boolean {
    const code = (error as { code?: string } | undefined)?.code;
    return code === 'ECONNREFUSED' || code === 'ENOTFOUND' || code === 'ETIMEDOUT';
  }

  private startPrimaryHealthCheck() {
    if (this.healthCheck) return; // a check is already running
    this.healthCheck = setInterval(async () => {
      try {
        await this.primary.query('SELECT 1');
        this.usingReplica = false;
        console.log('Primary database recovered');
        clearInterval(this.healthCheck);
        this.healthCheck = undefined;
      } catch {
        // Primary still down
      }
    }, 10000); // Check every 10 seconds
  }
}
Failover Types
Cold failover:
Standby is off. On failure, start it up, restore data, switch traffic.
RTO: Hours
Cost: Lowest
Warm failover:
Standby is running but not serving traffic. Data replicates async.
On failure, promote standby, switch traffic.
RTO: Minutes
Cost: Moderate
Hot failover:
Standby is running and data replicates synchronously.
On failure, traffic switches instantly.
RTO: Seconds
Cost: Highest
| Type | Standby | Data Sync | Switch Time |
|---|---|---|---|
| Cold | Off | Restore from backup | Hours |
| Warm | Running | Async replication | Minutes |
| Hot | Running + ready | Sync replication | Seconds |
Testing Disaster Recovery
An untested disaster recovery plan is not a plan; it is a hope. Testing reveals gaps, timing issues, and broken assumptions before a real disaster does.
Types of DR Tests
Tabletop exercise: Walk through the DR plan on paper. "If the primary database fails at 3 AM, what happens? Who gets paged? What is the first command they run?" Low cost, reveals process gaps.
Component test: Test individual components: restore a backup, promote a replica, fail over a load balancer. Validates that each piece works independently.
Full simulation: Simulate a complete disaster scenario in a staging environment. Time the recovery. Measure data loss. Identify bottlenecks.
Chaos engineering (production): Intentionally cause failures in production to verify resilience. This is the gold standard.
Chaos Engineering
# Pod-delete chaos experiment for Kubernetes (LitmusChaos)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-test
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"     # Run the experiment for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10"     # Delete pods every 10 seconds
            - name: FORCE
              value: "false"  # Graceful termination
// Simple chaos testing script
// checkSystemHealth, simulateDatabaseFailure, verifyDataIntegrity, and
// restoreDatabase are environment-specific helpers you must supply.
import assert from 'node:assert';
import { setTimeout as sleep } from 'node:timers/promises';

async function chaosTest() {
  console.log('Starting chaos test: database failover');

  // Step 1: Verify system is healthy
  const healthBefore = await checkSystemHealth();
  assert(healthBefore.healthy, 'System must be healthy before chaos test');

  // Step 2: Simulate primary database failure
  console.log('Simulating primary database failure...');
  await simulateDatabaseFailure('primary');
  const startTime = Date.now(); // start the clock at the moment of failure

  // Step 3: Measure recovery time
  let recovered = false;
  while (!recovered && Date.now() - startTime < 300000) { // 5 min timeout
    await sleep(5000);
    const health = await checkSystemHealth();
    if (health.healthy) {
      recovered = true;
    }
  }
  const recoveryTime = (Date.now() - startTime) / 1000;

  // Step 4: Verify data integrity
  const dataCheck = await verifyDataIntegrity();

  // Step 5: Report results
  console.log({
    test: 'database-failover',
    recovered,
    recoveryTimeSeconds: recoveryTime,
    rtoTarget: 300, // 5 minutes
    rtoMet: recovered && recoveryTime <= 300,
    dataIntegrity: dataCheck.passed,
    dataLoss: dataCheck.missingRecords,
    rpoTarget: 0,
    rpoMet: dataCheck.missingRecords === 0,
  });

  // Step 6: Restore original configuration
  await restoreDatabase('primary');
}
DR Test Checklist
Preparation:
[ ] Notify stakeholders of planned test
[ ] Ensure backups are current
[ ] Have rollback plan ready
[ ] Set monitoring/alerting to test mode
During test:
[ ] Time the detection (how long until failure is noticed?)
[ ] Time the response (how long until recovery starts?)
[ ] Time the recovery (how long until service is restored?)
[ ] Measure data loss (how much data was lost?)
[ ] Test client reconnection (do clients recover automatically?)
[ ] Verify data integrity after recovery
After test:
[ ] Compare actual RTO vs target RTO
[ ] Compare actual RPO vs target RPO
[ ] Document what went wrong
[ ] Update runbooks with lessons learned
[ ] Schedule next test
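The timing items on the checklist reduce to a few subtractions once the timestamps are recorded. The sketch below shows one way to compute them. The `DrTestTimeline` fields and `summarizeDrTest` are illustrative names, and the sample timestamps are invented.

```typescript
interface DrTestTimeline {
  failureAtMs: number;
  detectedAtMs: number;
  recoveryStartedAtMs: number;
  serviceRestoredAtMs: number;
  lastSurvivingWriteAtMs: number; // newest write still present after recovery
}

function summarizeDrTest(t: DrTestTimeline, rtoTargetSeconds: number, rpoTargetSeconds: number) {
  const seconds = (fromMs: number, toMs: number) => (toMs - fromMs) / 1000;
  // Actual RTO spans the whole outage, from failure to restored service.
  const actualRtoSeconds = seconds(t.failureAtMs, t.serviceRestoredAtMs);
  // Actual RPO is the window of writes made before the failure that were lost.
  const actualRpoSeconds = Math.max(0, seconds(t.lastSurvivingWriteAtMs, t.failureAtMs));
  return {
    detectionSeconds: seconds(t.failureAtMs, t.detectedAtMs),
    responseSeconds: seconds(t.detectedAtMs, t.recoveryStartedAtMs),
    recoverySeconds: seconds(t.recoveryStartedAtMs, t.serviceRestoredAtMs),
    actualRtoSeconds,
    actualRpoSeconds,
    rtoMet: actualRtoSeconds <= rtoTargetSeconds,
    rpoMet: actualRpoSeconds <= rpoTargetSeconds,
  };
}

const summary = summarizeDrTest(
  {
    failureAtMs: 0,
    detectedAtMs: 45_000,
    recoveryStartedAtMs: 120_000,
    serviceRestoredAtMs: 270_000,
    lastSurvivingWriteAtMs: -2_000, // 2 seconds of writes were lost
  },
  300, // RTO target: 5 minutes
  0,   // RPO target: zero data loss
);
console.log(summary.actualRtoSeconds, summary.rtoMet); // 270 true
console.log(summary.actualRpoSeconds, summary.rpoMet); // 2 false
```

Breaking the total into detection, response, and recovery shows where the time went. In this invented run, 75 of the 270 seconds passed between detection and the start of recovery.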
Multi-Region Strategies
For the highest level of availability, deploy across multiple geographic regions. This protects against region-wide failures (natural disasters, fiber cuts, cloud provider outages).
Active-Passive (one region serves traffic):
┌──────────────────┐     ┌──────────────────┐
│     US-East      │     │     EU-West      │
│     (active)     │     │    (passive)     │
│                  │     │                  │
│  ┌────────────┐  │     │  ┌────────────┐  │
│  │ App Servers│  │     │  │ App Servers│  │
│  │ (serving)  │  │     │  │ (standby)  │  │
│  └─────┬──────┘  │     │  └─────┬──────┘  │
│        │         │     │        │         │
│  ┌─────▼──────┐  │────▶│  ┌─────▼──────┐  │
│  │ Database   │  │repli│  │ Database   │  │
│  │ (primary)  │  │cate │  │ (replica)  │  │
│  └────────────┘  │     │  └────────────┘  │
└──────────────────┘     └──────────────────┘
If US-East fails: DNS switches to EU-West
RPO: Depends on replication lag
RTO: Failure detection plus DNS propagation time (minutes with a low TTL; longer if resolvers cache past it)
Active-Active (both regions serve traffic):
┌──────────────────┐     ┌──────────────────┐
│     US-East      │     │     EU-West      │
│     (active)     │     │     (active)     │
│                  │     │                  │
│  ┌────────────┐  │     │  ┌────────────┐  │
│  │ App Servers│  │     │  │ App Servers│  │
│  │ (serving)  │  │     │  │ (serving)  │  │
│  └─────┬──────┘  │     │  └─────┬──────┘  │
│        │         │     │        │         │
│  ┌─────▼──────┐  │◀───▶│  ┌─────▼──────┐  │
│  │ Database   │  │bidi │  │ Database   │  │
│  │ (primary)  │  │repli│  │ (primary)  │  │
│  └────────────┘  │cate │  └────────────┘  │
└──────────────────┘     └──────────────────┘
Both regions serve traffic. Users routed to nearest region.
If US-East fails: EU-West handles all traffic.
Complexity: Conflict resolution, data consistency.
Global Load Balancing
┌──────────────────────────────────────────────────────┐
│              Global DNS Load Balancer                │
│              (Route53, Cloudflare)                   │
│                                                      │
│  US users ──▶ US-East (latency-based routing)        │
│  EU users ──▶ EU-West (latency-based routing)        │
│  AP users ──▶ AP-Southeast (latency-based)           │
│                                                      │
│  Health checks: If US-East fails, route US to EU-West│
└──────────────────────────────────────────────────────┘
# AWS Route53 health check (failover routing records then reference it)
aws route53 create-health-check --caller-reference "$(date +%s)" \
  --health-check-config '{
    "IPAddress": "203.0.113.1",
    "Port": 443,
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "EnableSNI": true,
    "Regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]
  }'
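The routing behaviour in the diagram above (send each user to the lowest-latency region, skipping regions that fail health checks) can be sketched as a small function. This is a toy model of the decision, not how Route53 or Cloudflare implement it. The `Region` type and latency figures are assumptions.

```typescript
interface Region {
  name: string;
  healthy: boolean;  // result of the most recent health checks
  latencyMs: number; // measured latency from the user to this region
}

// Returns the name of the healthy region with the lowest latency,
// or undefined if every region is down.
function routeRequest(regions: Region[]): string | undefined {
  const healthy = regions.filter((r) => r.healthy);
  if (healthy.length === 0) return undefined;
  return healthy.reduce((best, r) => (r.latencyMs < best.latencyMs ? r : best)).name;
}

// A US user: US-East is closest, EU-West is the fallback.
const normal: Region[] = [
  { name: "us-east", healthy: true, latencyMs: 20 },
  { name: "eu-west", healthy: true, latencyMs: 90 },
  { name: "ap-southeast", healthy: true, latencyMs: 210 },
];
const usEastDown = normal.map((r) => (r.name === "us-east" ? { ...r, healthy: false } : r));
console.log(routeRequest(normal));     // us-east
console.log(routeRequest(usEastDown)); // eu-west
```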
Multi-Region Decision Matrix
| Strategy | Complexity | Cost | RTO | RPO | Best For |
|---|---|---|---|---|---|
| Single region | Low | Low | Hours | Hours | Non-critical apps |
| Active-passive | Moderate | Moderate | Minutes | Minutes | Most production apps |
| Active-active | High | High | Seconds | Near-zero | Global user base, Tier 1 |
Putting It Together: DR Plan Template
1. CLASSIFICATION
Service: Payment Processing API
Tier: 1 (business critical)
RTO: 5 minutes
RPO: 0 (zero data loss)
2. BACKUP STRATEGY
- Synchronous replication to standby in secondary region
- Hourly snapshots to S3 (cross-region replication)
- Daily full backup to cold storage (30-day retention)
3. FAILOVER PROCEDURE
a. Automatic health check detects primary failure
b. DNS failover to secondary region (TTL: 60s)
c. Replica promoted to primary (Patroni automatic)
d. Application reconnects to new primary
e. Alert: on-call engineer notified
4. RECOVERY PROCEDURE
a. Investigate root cause of original failure
b. Provision new replica from promoted primary
c. Verify replication is current
d. Optionally fail back to original region
e. Post-incident review within 48 hours
5. TESTING SCHEDULE
- Backup restore test: Monthly
- Failover test: Quarterly
- Full DR simulation: Twice a year
- Chaos engineering: Ongoing (automated)
6. CONTACTS
- Primary on-call: [PagerDuty rotation]
- Database team: [Slack channel]
- Infrastructure: [Slack channel]
- Management escalation: [after 30 min unresolved]
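A plan like this is only credible if its failover steps fit inside the stated RTO. The sketch below adds up a worst-case time per step and compares the total with the 5-minute target. The 30-second detection figure follows from a 10-second check interval with 3 failures, and the 60-second TTL comes from the template. The promotion and reconnect figures are assumptions for illustration and should be replaced with measured values.

```typescript
interface FailoverStep {
  name: string;
  worstCaseSeconds: number;
}

// Sums the steps as if they ran strictly one after another. Steps that
// overlap in practice make this a conservative (pessimistic) estimate.
function rtoBudget(steps: FailoverStep[], rtoTargetSeconds: number) {
  const totalSeconds = steps.reduce((sum, s) => sum + s.worstCaseSeconds, 0);
  return {
    totalSeconds,
    headroomSeconds: rtoTargetSeconds - totalSeconds,
    withinTarget: totalSeconds <= rtoTargetSeconds,
  };
}

const steps: FailoverStep[] = [
  { name: "Health check detects failure", worstCaseSeconds: 30 }, // 10 s x 3 failures
  { name: "DNS TTL expires", worstCaseSeconds: 60 },              // TTL from the plan
  { name: "Replica promoted", worstCaseSeconds: 30 },             // assumed
  { name: "Application reconnects", worstCaseSeconds: 10 },       // assumed
];
console.log(rtoBudget(steps, 300));
// { totalSeconds: 130, headroomSeconds: 170, withinTarget: true }
```

If a failover test later shows a step taking longer than budgeted, the headroom figure shows at once whether the RTO is still achievable.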
Key Takeaways
- Define RTO and RPO before building your DR strategy. These numbers drive every architecture decision.
- Follow the 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite. Test restores monthly. A backup you have never restored is not a backup.
- Start with primary-replica replication and automatic failover. This covers most failure scenarios at reasonable cost.
- Multi-region active-active is the gold standard for availability but introduces significant complexity (data conflicts, network latency, operational overhead). Most applications do fine with active-passive.
- Test your disaster recovery plan regularly. An untested plan will fail when you need it most. Chaos engineering in production validates resilience under real conditions.
- Failover is not just a database concern. Every stateful component needs a failover strategy: caches, queues, file storage, and third-party service integrations.
- Document everything: runbooks, escalation paths, recovery procedures. At 3 AM during an outage, nobody remembers the correct sequence of commands.
- The cost of downtime almost always exceeds the cost of redundancy. Calculate the business impact of an hour of downtime and use that to justify your DR investment.
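The last takeaway is a piece of arithmetic, sketched here with invented figures. `DowntimeModel` and `annualNetBenefit` are hypothetical names, and every number in the example is a placeholder for your own estimates.

```typescript
interface DowntimeModel {
  costPerHour: number;          // revenue and productivity lost per hour down
  outageHoursWithoutDr: number; // expected hours of downtime per year today
  outageHoursWithDr: number;    // expected hours per year after the investment
  drCostPerYear: number;        // infrastructure plus operational cost of DR
}

// Positive result: the redundancy pays for itself in avoided downtime.
function annualNetBenefit(m: DowntimeModel): number {
  const avoidedHours = m.outageHoursWithoutDr - m.outageHoursWithDr;
  return avoidedHours * m.costPerHour - m.drCostPerYear;
}

const estimate: DowntimeModel = {
  costPerHour: 20_000,
  outageHoursWithoutDr: 8,
  outageHoursWithDr: 1,
  drCostPerYear: 100_000,
};
console.log(annualNetBenefit(estimate)); // 40000
```

With these placeholder numbers, avoiding 7 hours of downtime is worth 140,000 per year against a cost of 100,000. The model ignores reputational damage and contractual penalties, which usually push the result further in favor of redundancy.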