Scaling Fundamentals
Every successful application eventually faces the same problem: more users, more data, more traffic than a single server can handle. Scaling is how you solve that problem. Load balancing is how you distribute work across the scaled infrastructure.
This article covers the full picture β from the decision between vertical and horizontal scaling, through load balancer types and algorithms, to database scaling patterns and auto-scaling policies that keep your infrastructure right-sized.
Vertical vs Horizontal Scaling
Vertical Scaling (Scale Up):
ββββββββββββ ββββββββββββββββββββ
β 4 CPU β β 32 CPU β
β 8 GB RAM β βββΆ β 128 GB RAM β
β 100 GB β β 2 TB SSD β
ββββββββββββ ββββββββββββββββββββ
Small VM Bigger VM
Horizontal Scaling (Scale Out):
ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ
β 4 CPU β β 4 CPU β β 4 CPU β β 4 CPU β
β 8 GB RAM β βββΆ β 8 GB RAM β β 8 GB RAM β β 8 GB RAM β
β 100 GB β β 100 GB β β 100 GB β β 100 GB β
ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ
1 server 3 identical servers
Comparison
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Approach | Bigger machine | More machines |
| Downtime | Usually requires restart | Zero downtime (add servers) |
| Cost curve | Exponential (premium hardware) | Linear (commodity hardware) |
| Ceiling | Hardware limits (largest available machine) | Practically unlimited |
| Complexity | Simple (no code changes) | Complex (distributed system concerns) |
| Failure impact | Single point of failure | One node fails, others continue |
| Data consistency | Trivial (single machine) | Challenging (distributed state) |
| Best for | Databases, quick fixes | Stateless application servers |
When to Use Each
Start vertical when:
- You have a single database that is not yet at capacity
- Your application is not designed for horizontal scaling
- You need a quick fix while redesigning for horizontal scale
- The cost of vertical scaling is still reasonable
Go horizontal when:
- You hit the ceiling of the largest available machine
- You need high availability (no single point of failure)
- Your traffic patterns require elastic scaling
- Your application is stateless or can be made stateless
Load Balancers
A load balancer sits between clients and backend servers, distributing incoming requests across multiple servers.
βββββββββββββββββββ
β Load Balancer β
β β
Clients ββββββββΆβ Health checks β
β Routing rules β
β SSL terminationβ
ββββββ¬βββ¬βββ¬βββββββ
β β β
ββββββββββββ β ββββββββββββ
βΌ βΌ βΌ
ββββββββββββ ββββββββββββ ββββββββββββ
β Server 1 β β Server 2 β β Server 3 β
β (healthy)β β (healthy)β β (healthy)β
ββββββββββββ ββββββββββββ ββββββββββββ
Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different layers of the OSI model, and this distinction matters significantly.
OSI Model (relevant layers):
βββββββββββββββββββββββββββββββββββββββ
β Layer 7: Application (HTTP, gRPC) β βββ L7 load balancer operates here
βββββββββββββββββββββββββββββββββββββββ€
β Layer 6: Presentation (SSL/TLS) β
βββββββββββββββββββββββββββββββββββββββ€
β Layer 5: Session β
βββββββββββββββββββββββββββββββββββββββ€
β Layer 4: Transport (TCP, UDP) β βββ L4 load balancer operates here
βββββββββββββββββββββββββββββββββββββββ€
β Layer 3: Network (IP) β
βββββββββββββββββββββββββββββββββββββββ€
β Layer 2: Data Link β
βββββββββββββββββββββββββββββββββββββββ€
β Layer 1: Physical β
βββββββββββββββββββββββββββββββββββββββ
Layer 4 (Transport) load balancers route based on IP address and TCP/UDP port. They do not inspect the content of the request. Fast, simple, and efficient.
L4 Load Balancer:
Client βββΆ [src: 1.2.3.4:54321, dst: LB:443]
β
β NAT/DNAT β changes destination
βΌ
[src: 1.2.3.4:54321, dst: Server2:8080]
β
βΌ
Server 2 processes request
# HAProxy L4 configuration
frontend tcp_front
bind *:3306
mode tcp
default_backend mysql_servers
backend mysql_servers
mode tcp
balance roundrobin
server db1 10.0.1.1:3306 check
server db2 10.0.1.2:3306 check
server db3 10.0.1.3:3306 check
Layer 7 (Application) load balancers inspect the HTTP request (URL path, headers, cookies, body) and make routing decisions based on content. More flexible, slightly slower.
L7 Load Balancer:
Client βββΆ GET /api/users HTTP/1.1
Host: app.example.com
Cookie: session=abc123
β
β Inspects URL path, headers, cookies
βΌ
/api/* βββΆ API servers
/static/* βββΆ CDN / static servers
/ws/* βββΆ WebSocket servers
# NGINX L7 configuration
upstream api_servers {
server 10.0.1.1:3000;
server 10.0.1.2:3000;
server 10.0.1.3:3000;
}
upstream static_servers {
server 10.0.2.1:80;
server 10.0.2.2:80;
}
upstream websocket_servers {
server 10.0.3.1:8080;
server 10.0.3.2:8080;
}
server {
listen 443 ssl;
server_name app.example.com;
# Route by URL path
location /api/ {
proxy_pass http://api_servers;
}
location /static/ {
proxy_pass http://static_servers;
proxy_cache_valid 200 1d;
}
location /ws/ {
proxy_pass http://websocket_servers;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}
L4 vs L7 Comparison
| Aspect | L4 (Transport) | L7 (Application) |
|---|---|---|
| Routing criteria | IP, port | URL, headers, cookies, body |
| Performance | Faster (no content inspection) | Slightly slower |
| SSL termination | Pass-through possible | Yes (decrypts, then re-encrypts or plain) |
| Content-based routing | No | Yes |
| Protocol support | Any TCP/UDP protocol | HTTP, HTTPS, gRPC, WebSocket |
| Use case | Database, generic TCP | Web apps, API gateway, microservices |
| Examples | AWS NLB, HAProxy (TCP mode) | AWS ALB, NGINX, Envoy |
Load Balancing Algorithms
Round Robin
Requests are distributed sequentially across servers. Server 1, then 2, then 3, then back to 1.
Request 1 βββΆ Server A
Request 2 βββΆ Server B
Request 3 βββΆ Server C
Request 4 βββΆ Server A (cycles back)
Request 5 βββΆ Server B
upstream backend {
server 10.0.1.1:3000; # Gets ~33% of traffic
server 10.0.1.2:3000; # Gets ~33% of traffic
server 10.0.1.3:3000; # Gets ~33% of traffic
}
Pros: Simple, even distribution. Cons: Does not account for server capacity or current load. A slow server gets the same traffic as a fast one.
Weighted Round Robin
Assign weights based on server capacity. A server with weight 3 gets 3x the requests of a server with weight 1.
Weights: A=3, B=2, C=1
Request 1 βββΆ Server A
Request 2 βββΆ Server A
Request 3 βββΆ Server A
Request 4 βββΆ Server B
Request 5 βββΆ Server B
Request 6 βββΆ Server C
Request 7 βββΆ Server A (cycles back)
upstream backend {
server 10.0.1.1:3000 weight=3; # 50% of traffic (powerful server)
server 10.0.1.2:3000 weight=2; # 33% of traffic
server 10.0.1.3:3000 weight=1; # 17% of traffic (smaller instance)
}
Least Connections
Route to the server with the fewest active connections. Better for requests with varying processing times.
Active connections: A=5, B=2, C=8
New request βββΆ Server B (fewest connections)
upstream backend {
least_conn;
server 10.0.1.1:3000;
server 10.0.1.2:3000;
server 10.0.1.3:3000;
}
Best for: Applications where request processing time varies significantly (some requests take 50ms, others take 5 seconds).
IP Hash
Route requests from the same client IP to the same server. Provides session affinity without cookies.
Client 1.2.3.4 βββΆ hash(1.2.3.4) % 3 = 1 βββΆ Server B
Client 5.6.7.8 βββΆ hash(5.6.7.8) % 3 = 0 βββΆ Server A
Client 1.2.3.4 βββΆ hash(1.2.3.4) % 3 = 1 βββΆ Server B (same server)
upstream backend {
ip_hash;
server 10.0.1.1:3000;
server 10.0.1.2:3000;
server 10.0.1.3:3000;
}
Pros: Built-in session affinity. Cons: Uneven distribution if traffic is concentrated from a few IPs (corporate NATs, proxies).
Algorithm Selection Guide
Stateless app, similar servers? βββΆ Round Robin
Stateless app, different capacities? βββΆ Weighted Round Robin
Variable request durations? βββΆ Least Connections
Need session affinity? βββΆ IP Hash (or cookie-based sticky)
WebSocket connections? βββΆ Least Connections + sticky
Session Persistence (Sticky Sessions)
Some applications store session data in memory on the server. If a user's next request goes to a different server, their session is lost.
Problem without sticky sessions:
Request 1 βββΆ Server A (session created in Server A memory)
Request 2 βββΆ Server B (no session found β user logged out!)
Solution 1: Sticky sessions
Request 1 βββΆ Server A (session created, cookie set)
Request 2 βββΆ Server A (LB routes based on cookie)
Request 3 βββΆ Server A (always Server A for this session)
Solution 2: Externalized sessions (better)
Request 1 βββΆ Server A βββΆ Redis (session stored)
Request 2 βββΆ Server B βββΆ Redis (session retrieved)
Request 3 βββΆ Server C βββΆ Redis (session retrieved)
Sticky sessions via NGINX:
upstream backend {
server 10.0.1.1:3000;
server 10.0.1.2:3000;
server 10.0.1.3:3000;
# Cookie-based sticky sessions
sticky cookie srv_id expires=1h path=/;
}
Externalized sessions (recommended):
import session from 'express-session';
import RedisStore from 'connect-redis';
import Redis from 'ioredis';
const redisClient = new Redis({
host: process.env.REDIS_HOST,
port: 6379,
});
app.use(session({
store: new RedisStore({ client: redisClient }),
secret: process.env.SESSION_SECRET,
resave: false,
saveUninitialized: false,
cookie: {
secure: true,
httpOnly: true,
maxAge: 86400000, // 24 hours
},
}));
Externalized sessions are superior because they allow any server to handle any request, making horizontal scaling and rolling deployments trivial.
Health Checks
Load balancers need to know which servers are healthy. Unhealthy servers are removed from the pool until they recover.
βββββββββββββββββββ
β Load Balancer β
β β
β Every 10s: β
β GET /health ββββββββΆ Server 1: 200 OK β
β GET /health ββββββββΆ Server 2: 200 OK β
β GET /health ββββββββΆ Server 3: 503 β (removed from pool)
β β
β Traffic only β
β goes to 1 & 2 β
βββββββββββββββββββ
// Health check endpoint
app.get('/health', async (req, res) => {
const checks = {
database: false,
redis: false,
diskSpace: false,
};
try {
// Check database connectivity
await db.raw('SELECT 1');
checks.database = true;
} catch (e) {
// database unreachable
}
try {
// Check Redis connectivity
await redis.ping();
checks.redis = true;
} catch (e) {
// redis unreachable
}
// Check disk space
const freeSpace = await checkDiskSpace('/');
checks.diskSpace = freeSpace.free > 1_000_000_000; // > 1GB free
const healthy = Object.values(checks).every(Boolean);
res.status(healthy ? 200 : 503).json({
status: healthy ? 'healthy' : 'unhealthy',
checks,
uptime: process.uptime(),
timestamp: new Date().toISOString(),
});
});
# NGINX health check configuration
upstream backend {
server 10.0.1.1:3000 max_fails=3 fail_timeout=30s;
server 10.0.1.2:3000 max_fails=3 fail_timeout=30s;
server 10.0.1.3:3000 max_fails=3 fail_timeout=30s;
}
# Active health checks (NGINX Plus / OpenResty)
# Passively, NGINX checks on real traffic failures
Database Scaling
Databases are the hardest component to scale because they are stateful. Here are the primary strategies.
Read Replicas
Route read queries to replicas, write queries to the primary. Works well when reads outnumber writes (which is typical β most applications are 90%+ reads).
ββββββββββββββββ
Writes ββββββΆβ Primary β
β (master) β
ββββββββ¬ββββββββ
β Replication
ββββββββββΌβββββββββ
βΌ βΌ βΌ
ββββββββββββββββββββββββββββββββββββ
Reads β Replica 1ββ Replica 2ββ Replica 3β
βββββββΆβ (slave) ββ (slave) ββ (slave) β
ββββββββββββββββββββββββββββββββββββ
// Application-level read/write splitting
import { Pool } from 'pg';
const primaryPool = new Pool({
host: process.env.DB_PRIMARY_HOST,
database: 'myapp',
});
const replicaPool = new Pool({
host: process.env.DB_REPLICA_HOST,
database: 'myapp',
});
async function query(sql: string, params: unknown[], isWrite: boolean = false) {
const pool = isWrite ? primaryPool : replicaPool;
return pool.query(sql, params);
}
// Usage
const users = await query('SELECT * FROM users WHERE active = true', [], false);
const result = await query('INSERT INTO users (name) VALUES ($1)', ['Alice'], true);
Database Sharding
Split data across multiple databases based on a shard key. Each shard holds a subset of the data.
Shard by user_id:
User IDs 1-1000 βββΆ Shard 1 (DB server A)
User IDs 1001-2000 βββΆ Shard 2 (DB server B)
User IDs 2001-3000 βββΆ Shard 3 (DB server C)
OR hash-based:
user_id % 3 = 0 βββΆ Shard 1
user_id % 3 = 1 βββΆ Shard 2
user_id % 3 = 2 βββΆ Shard 3
// Simple hash-based sharding
class ShardRouter {
private shards: Pool[];
constructor(shardConfigs: DatabaseConfig[]) {
this.shards = shardConfigs.map(config => new Pool(config));
}
getShardForUser(userId: number): Pool {
const shardIndex = userId % this.shards.length;
return this.shards[shardIndex];
}
async getUserOrders(userId: number) {
const shard = this.getShardForUser(userId);
return shard.query(
'SELECT * FROM orders WHERE user_id = $1',
[userId]
);
}
// Cross-shard queries are expensive β avoid them
async getAllOrdersAbove(amount: number) {
// Must query ALL shards and merge results
const results = await Promise.all(
this.shards.map(shard =>
shard.query('SELECT * FROM orders WHERE total > $1', [amount])
)
);
return results.flatMap(r => r.rows);
}
}
Sharding trade-offs:
| Benefit | Cost |
|---|---|
| Linear horizontal scaling | Cross-shard queries are slow |
| Each shard is smaller and faster | Rebalancing shards is complex |
| Fault isolation per shard | JOINs across shards not possible |
| Independent backups | Schema changes must apply to all shards |
Auto-Scaling
Auto-scaling automatically adjusts the number of running instances based on demand.
Traffic pattern:
β² Instances
10 β βββββββ
8 β ββββββ ββββββ
6 β ββββββ ββββββ
4 ββββββ ββββββ
2 β βββββ
ββββββββββββββββββββββββββββββββββββββββββββΆ Time
6am 9am 12pm 3pm 6pm 9pm
Auto-scaling adds instances at 9am, removes them at 6pm
AWS Auto Scaling Configuration
# AWS CloudFormation auto-scaling policy
AutoScalingGroup:
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
MinSize: 2
MaxSize: 20
DesiredCapacity: 4
HealthCheckType: ELB
HealthCheckGracePeriod: 300
LaunchTemplate:
LaunchTemplateId: !Ref AppLaunchTemplate
Version: !GetAtt AppLaunchTemplate.LatestVersionNumber
TargetGroupARNs:
- !Ref AppTargetGroup
# Scale based on CPU utilization
ScaleUpPolicy:
Type: AWS::AutoScaling::ScalingPolicy
Properties:
AutoScalingGroupName: !Ref AutoScalingGroup
PolicyType: TargetTrackingScaling
TargetTrackingConfiguration:
PredefinedMetricSpecification:
PredefinedMetricType: ASGAverageCPUUtilization
TargetValue: 70 # Add instances when CPU > 70%
Kubernetes Horizontal Pod Autoscaler
# kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 50
metrics:
# Scale on CPU
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Scale on memory
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# Scale on custom metric (requests per second)
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: 1000
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50 # Scale up by max 50% at a time
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 25 # Scale down by max 25% at a time
periodSeconds: 120
Auto-Scaling Best Practices
Choose the right metric. CPU is a good default, but application-specific metrics (request queue depth, response latency) are often better signals.
Set appropriate cooldown periods. Without cooldowns, the autoscaler thrashes β scaling up, then immediately scaling down, then up again.
Test scaling boundaries. Run load tests that trigger your scaling policies. Verify that new instances receive traffic and old instances drain gracefully.
Use predictive scaling when possible. If your traffic is predictable (daily peaks, weekly patterns), pre-scale before the spike instead of reacting to it.
Reactive scaling: Traffic spikes βββΆ Detect βββΆ Scale βββΆ Ready (2-5 min lag)
Predictive scaling: Predict spike βββΆ Scale ahead βββΆ Traffic arrives (ready)
Complete Architecture Example
ββββββββββββββββ
β CDN β
β (CloudFront) β
ββββββββ¬ββββββββ
β
ββββββββΌββββββββ
β WAF β
β (Firewall) β
ββββββββ¬ββββββββ
β
βββββββββββββΌββββββββββββ
β L7 Load Balancer β
β (ALB / NGINX) β
βββββ¬βββββ¬βββββ¬ββββββββββ
β β β
βββββββββββ β βββββββββββ
βΌ βΌ βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββ
β App Server ββ App Server ββ App Server β
β (Auto-scale)ββ (Auto-scale)ββ (Auto-scale)β
ββββββββ¬ββββββββββββββββ¬ββββββββββββββββ¬ββββββββ
β β β
βββββββββ¬ββββββββββββββββββββββββ
β
ββββββββββΌββββββββββ
β Redis Cluster β
β (Sessions + β
β Caching) β
ββββββββββ¬ββββββββββ
β
βββββββββββββββΌβββββββββββββββ
βΌ βΌ βΌ
ββββββββββββ ββββββββββββ ββββββββββββ
β Primary β β Replica β β Replica β
β DB β β DB 1 β β DB 2 β
β (writes) β β (reads) β β (reads) β
ββββββββββββ ββββββββββββ ββββββββββββ
Key Takeaways
- Start with vertical scaling. It is simpler, requires no code changes, and works until it does not.
- Use horizontal scaling for stateless application servers. Externalize state to a database or cache.
- L7 load balancers give you content-based routing, SSL termination, and request inspection. Use them for HTTP traffic. Use L4 for raw TCP (databases, message brokers).
- Least connections is better than round robin when request processing time varies.
- Externalize sessions to Redis instead of using sticky sessions. This decouples your scaling decisions from your session management.
- Scale databases with read replicas first. Sharding is a last resort β it introduces significant complexity.
- Auto-scaling needs proper cooldown periods and the right metrics. CPU is a starting point, not the final answer.
- Load testing is not optional. Test your scaling policies before production traffic tests them for you.