Load Balancing & Scalability

Scaling Fundamentals

Every successful application eventually faces the same problem: more users, more data, more traffic than a single server can handle. Scaling is how you solve that problem. Load balancing is how you distribute work across the scaled infrastructure.

This article covers the full picture — from the decision between vertical and horizontal scaling, through load balancer types and algorithms, to database scaling patterns and auto-scaling policies that keep your infrastructure right-sized.

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up):
┌──────────┐         ┌──────────────────┐
│ 4 CPU    │         │ 32 CPU           │
│ 8 GB RAM │  ──▶    │ 128 GB RAM       │
│ 100 GB   │         │ 2 TB SSD         │
└──────────┘         └──────────────────┘
  Small VM              Bigger VM

Horizontal Scaling (Scale Out):
┌──────────┐         ┌──────────┐ ┌──────────┐ ┌──────────┐
│ 4 CPU    │         │ 4 CPU    │ │ 4 CPU    │ │ 4 CPU    │
│ 8 GB RAM │  ──▶    │ 8 GB RAM │ │ 8 GB RAM │ │ 8 GB RAM │
│ 100 GB   │         │ 100 GB   │ │ 100 GB   │ │ 100 GB   │
└──────────┘         └──────────┘ └──────────┘ └──────────┘
  1 server               3 identical servers

Comparison

Aspect	Vertical Scaling	Horizontal Scaling
Approach	Bigger machine	More machines
Downtime	Usually requires restart	Zero downtime (add servers)
Cost curve	Exponential (premium hardware)	Linear (commodity hardware)
Ceiling	Hardware limits (largest available machine)	Practically unlimited
Complexity	Simple (no code changes)	Complex (distributed system concerns)
Failure impact	Single point of failure	One node fails, others continue
Data consistency	Trivial (single machine)	Challenging (distributed state)
Best for	Databases, quick fixes	Stateless application servers

When to Use Each

Start vertical when:

You have a single database that is not yet at capacity
Your application is not designed for horizontal scaling
You need a quick fix while redesigning for horizontal scale
The cost of vertical scaling is still reasonable

Go horizontal when:

You hit the ceiling of the largest available machine
You need high availability (no single point of failure)
Your traffic patterns require elastic scaling
Your application is stateless or can be made stateless

Load Balancers

A load balancer sits between clients and backend servers, distributing incoming requests across multiple servers.

                    ┌─────────────────┐
                    │  Load Balancer  │
                    │                 │
    Clients ───────▶│  Health checks  │
                    │  Routing rules  │
                    │  SSL termination│
                    └────┬──┬──┬──────┘
                         │  │  │
              ┌──────────┘  │  └──────────┐
              ▼             ▼             ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │ Server 1 │ │ Server 2 │ │ Server 3 │
        │ (healthy)│ │ (healthy)│ │ (healthy)│
        └──────────┘ └──────────┘ └──────────┘

Layer 4 vs Layer 7 Load Balancing

Load balancers operate at different layers of the OSI model, and this distinction matters significantly.

OSI Model (relevant layers):
┌─────────────────────────────────────┐
│  Layer 7: Application (HTTP, gRPC)  │  ◀── L7 load balancer operates here
├─────────────────────────────────────┤
│  Layer 6: Presentation (SSL/TLS)    │
├─────────────────────────────────────┤
│  Layer 5: Session                   │
├─────────────────────────────────────┤
│  Layer 4: Transport (TCP, UDP)      │  ◀── L4 load balancer operates here
├─────────────────────────────────────┤
│  Layer 3: Network (IP)              │
├─────────────────────────────────────┤
│  Layer 2: Data Link                 │
├─────────────────────────────────────┤
│  Layer 1: Physical                  │
└─────────────────────────────────────┘

Layer 4 (Transport) load balancers route based on IP address and TCP/UDP port. They do not inspect the content of the request. Fast, simple, and efficient.

L4 Load Balancer:
Client ──▶ [src: 1.2.3.4:54321, dst: LB:443]
              │
              │ NAT/DNAT — changes destination
              ▼
         [src: 1.2.3.4:54321, dst: Server2:8080]
              │
              ▼
         Server 2 processes request

# HAProxy L4 configuration
frontend tcp_front
    bind *:3306
    mode tcp
    default_backend mysql_servers

backend mysql_servers
    mode tcp
    balance roundrobin
    server db1 10.0.1.1:3306 check
    server db2 10.0.1.2:3306 check
    server db3 10.0.1.3:3306 check

Layer 7 (Application) load balancers inspect the HTTP request (URL path, headers, cookies, body) and make routing decisions based on content. More flexible, slightly slower.

L7 Load Balancer:
Client ──▶ GET /api/users HTTP/1.1
           Host: app.example.com
           Cookie: session=abc123
              │
              │ Inspects URL path, headers, cookies
              ▼
         /api/*      ──▶ API servers
         /static/*   ──▶ CDN / static servers
         /ws/*       ──▶ WebSocket servers

# NGINX L7 configuration
upstream api_servers {
    server 10.0.1.1:3000;
    server 10.0.1.2:3000;
    server 10.0.1.3:3000;
}

upstream static_servers {
    server 10.0.2.1:80;
    server 10.0.2.2:80;
}

upstream websocket_servers {
    server 10.0.3.1:8080;
    server 10.0.3.2:8080;
}

server {
    listen 443 ssl;
    server_name app.example.com;

    # Route by URL path
    location /api/ {
        proxy_pass http://api_servers;
    }

    location /static/ {
        proxy_pass http://static_servers;
        proxy_cache_valid 200 1d;
    }

    location /ws/ {
        proxy_pass http://websocket_servers;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

L4 vs L7 Comparison

Aspect	L4 (Transport)	L7 (Application)
Routing criteria	IP, port	URL, headers, cookies, body
Performance	Faster (no content inspection)	Slightly slower
SSL termination	Pass-through possible	Yes (decrypts, then re-encrypts or plain)
Content-based routing	No	Yes
Protocol support	Any TCP/UDP protocol	HTTP, HTTPS, gRPC, WebSocket
Use case	Database, generic TCP	Web apps, API gateway, microservices
Examples	AWS NLB, HAProxy (TCP mode)	AWS ALB, NGINX, Envoy

Load Balancing Algorithms

Round Robin

Requests are distributed sequentially across servers. Server 1, then 2, then 3, then back to 1.

Request 1 ──▶ Server A
Request 2 ──▶ Server B
Request 3 ──▶ Server C
Request 4 ──▶ Server A  (cycles back)
Request 5 ──▶ Server B

upstream backend {
    server 10.0.1.1:3000;  # Gets ~33% of traffic
    server 10.0.1.2:3000;  # Gets ~33% of traffic
    server 10.0.1.3:3000;  # Gets ~33% of traffic
}

Pros: Simple, even distribution. Cons: Does not account for server capacity or current load. A slow server gets the same traffic as a fast one.

Weighted Round Robin

Assign weights based on server capacity. A server with weight 3 gets 3x the requests of a server with weight 1.

Weights: A=3, B=2, C=1

Request 1 ──▶ Server A
Request 2 ──▶ Server A
Request 3 ──▶ Server A
Request 4 ──▶ Server B
Request 5 ──▶ Server B
Request 6 ──▶ Server C
Request 7 ──▶ Server A  (cycles back)

upstream backend {
    server 10.0.1.1:3000 weight=3;  # 50% of traffic (powerful server)
    server 10.0.1.2:3000 weight=2;  # 33% of traffic
    server 10.0.1.3:3000 weight=1;  # 17% of traffic (smaller instance)
}

Least Connections

Route to the server with the fewest active connections. Better for requests with varying processing times.

Active connections: A=5, B=2, C=8

New request ──▶ Server B  (fewest connections)

upstream backend {
    least_conn;
    server 10.0.1.1:3000;
    server 10.0.1.2:3000;
    server 10.0.1.3:3000;
}

Best for: Applications where request processing time varies significantly (some requests take 50ms, others take 5 seconds).

IP Hash

Route requests from the same client IP to the same server. Provides session affinity without cookies.

Client 1.2.3.4 ──▶ hash(1.2.3.4) % 3 = 1 ──▶ Server B
Client 5.6.7.8 ──▶ hash(5.6.7.8) % 3 = 0 ──▶ Server A
Client 1.2.3.4 ──▶ hash(1.2.3.4) % 3 = 1 ──▶ Server B  (same server)

upstream backend {
    ip_hash;
    server 10.0.1.1:3000;
    server 10.0.1.2:3000;
    server 10.0.1.3:3000;
}

Pros: Built-in session affinity. Cons: Uneven distribution if traffic is concentrated from a few IPs (corporate NATs, proxies).

Algorithm Selection Guide

Stateless app, similar servers?     ──▶ Round Robin
Stateless app, different capacities? ──▶ Weighted Round Robin
Variable request durations?          ──▶ Least Connections
Need session affinity?               ──▶ IP Hash (or cookie-based sticky)
WebSocket connections?               ──▶ Least Connections + sticky

Session Persistence (Sticky Sessions)

Some applications store session data in memory on the server. If a user's next request goes to a different server, their session is lost.

Problem without sticky sessions:
Request 1 ──▶ Server A  (session created in Server A memory)
Request 2 ──▶ Server B  (no session found — user logged out!)

Solution 1: Sticky sessions
Request 1 ──▶ Server A  (session created, cookie set)
Request 2 ──▶ Server A  (LB routes based on cookie)
Request 3 ──▶ Server A  (always Server A for this session)

Solution 2: Externalized sessions (better)
Request 1 ──▶ Server A ──▶ Redis (session stored)
Request 2 ──▶ Server B ──▶ Redis (session retrieved)
Request 3 ──▶ Server C ──▶ Redis (session retrieved)

Sticky sessions via NGINX:

upstream backend {
    server 10.0.1.1:3000;
    server 10.0.1.2:3000;
    server 10.0.1.3:3000;

    # Cookie-based sticky sessions
    sticky cookie srv_id expires=1h path=/;
}

Externalized sessions (recommended):

import session from 'express-session';
import RedisStore from 'connect-redis';
import Redis from 'ioredis';

const redisClient = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
});

app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
  cookie: {
    secure: true,
    httpOnly: true,
    maxAge: 86400000, // 24 hours
  },
}));

Externalized sessions are superior because they allow any server to handle any request, making horizontal scaling and rolling deployments trivial.

Health Checks

Load balancers need to know which servers are healthy. Unhealthy servers are removed from the pool until they recover.

┌─────────────────┐
│  Load Balancer  │
│                 │
│  Every 10s:     │
│  GET /health    │──────▶ Server 1: 200 OK ✓
│  GET /health    │──────▶ Server 2: 200 OK ✓
│  GET /health    │──────▶ Server 3: 503    ✗ (removed from pool)
│                 │
│  Traffic only   │
│  goes to 1 & 2  │
└─────────────────┘

// Health check endpoint
app.get('/health', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
    diskSpace: false,
  };

  try {
    // Check database connectivity
    await db.raw('SELECT 1');
    checks.database = true;
  } catch (e) {
    // database unreachable
  }

  try {
    // Check Redis connectivity
    await redis.ping();
    checks.redis = true;
  } catch (e) {
    // redis unreachable
  }

  // Check disk space
  const freeSpace = await checkDiskSpace('/');
  checks.diskSpace = freeSpace.free > 1_000_000_000; // > 1GB free

  const healthy = Object.values(checks).every(Boolean);
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    checks,
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
  });
});

# NGINX health check configuration
upstream backend {
    server 10.0.1.1:3000 max_fails=3 fail_timeout=30s;
    server 10.0.1.2:3000 max_fails=3 fail_timeout=30s;
    server 10.0.1.3:3000 max_fails=3 fail_timeout=30s;
}

# Active health checks (NGINX Plus / OpenResty)
# Passively, NGINX checks on real traffic failures

Database Scaling

Databases are the hardest component to scale because they are stateful. Here are the primary strategies.

Read Replicas

Route read queries to replicas, write queries to the primary. Works well when reads outnumber writes (which is typical — most applications are 90%+ reads).

                ┌──────────────┐
   Writes ─────▶│   Primary    │
                │   (master)   │
                └──────┬───────┘
                       │ Replication
              ┌────────┼────────┐
              ▼        ▼        ▼
        ┌──────────┐┌──────────┐┌──────────┐
 Reads  │ Replica 1││ Replica 2││ Replica 3│
 ──────▶│ (slave)  ││ (slave)  ││ (slave)  │
        └──────────┘└──────────┘└──────────┘

// Application-level read/write splitting
import { Pool } from 'pg';

const primaryPool = new Pool({
  host: process.env.DB_PRIMARY_HOST,
  database: 'myapp',
});

const replicaPool = new Pool({
  host: process.env.DB_REPLICA_HOST,
  database: 'myapp',
});

async function query(sql: string, params: unknown[], isWrite: boolean = false) {
  const pool = isWrite ? primaryPool : replicaPool;
  return pool.query(sql, params);
}

// Usage
const users = await query('SELECT * FROM users WHERE active = true', [], false);
const result = await query('INSERT INTO users (name) VALUES ($1)', ['Alice'], true);

Database Sharding

Split data across multiple databases based on a shard key. Each shard holds a subset of the data.

Shard by user_id:

User IDs 1-1000      ──▶ Shard 1 (DB server A)
User IDs 1001-2000   ──▶ Shard 2 (DB server B)
User IDs 2001-3000   ──▶ Shard 3 (DB server C)

OR hash-based:
user_id % 3 = 0      ──▶ Shard 1
user_id % 3 = 1      ──▶ Shard 2
user_id % 3 = 2      ──▶ Shard 3

// Simple hash-based sharding
class ShardRouter {
  private shards: Pool[];

  constructor(shardConfigs: DatabaseConfig[]) {
    this.shards = shardConfigs.map(config => new Pool(config));
  }

  getShardForUser(userId: number): Pool {
    const shardIndex = userId % this.shards.length;
    return this.shards[shardIndex];
  }

  async getUserOrders(userId: number) {
    const shard = this.getShardForUser(userId);
    return shard.query(
      'SELECT * FROM orders WHERE user_id = $1',
      [userId]
    );
  }

  // Cross-shard queries are expensive — avoid them
  async getAllOrdersAbove(amount: number) {
    // Must query ALL shards and merge results
    const results = await Promise.all(
      this.shards.map(shard =>
        shard.query('SELECT * FROM orders WHERE total > $1', [amount])
      )
    );
    return results.flatMap(r => r.rows);
  }
}

Sharding trade-offs:

Benefit	Cost
Linear horizontal scaling	Cross-shard queries are slow
Each shard is smaller and faster	Rebalancing shards is complex
Fault isolation per shard	JOINs across shards not possible
Independent backups	Schema changes must apply to all shards

Auto-Scaling

Auto-scaling automatically adjusts the number of running instances based on demand.

Traffic pattern:

     ▲ Instances
  10 │              ┌─────┐
   8 │         ┌────┘     └────┐
   6 │    ┌────┘               └────┐
   4 │────┘                         └────┐
   2 │                                    └────
     └──────────────────────────────────────────▶ Time
       6am     9am    12pm    3pm    6pm    9pm

Auto-scaling adds instances at 9am, removes them at 6pm

AWS Auto Scaling Configuration

# AWS CloudFormation auto-scaling policy
AutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2
    MaxSize: 20
    DesiredCapacity: 4
    HealthCheckType: ELB
    HealthCheckGracePeriod: 300
    LaunchTemplate:
      LaunchTemplateId: !Ref AppLaunchTemplate
      Version: !GetAtt AppLaunchTemplate.LatestVersionNumber
    TargetGroupARNs:
      - !Ref AppTargetGroup

# Scale based on CPU utilization
ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref AutoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 70  # Add instances when CPU > 70%

Kubernetes Horizontal Pod Autoscaler

# kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    # Scale on CPU
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

    # Scale on memory
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

    # Scale on custom metric (requests per second)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 1000

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50         # Scale up by max 50% at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25         # Scale down by max 25% at a time
          periodSeconds: 120

Auto-Scaling Best Practices

Choose the right metric. CPU is a good default, but application-specific metrics (request queue depth, response latency) are often better signals.

Set appropriate cooldown periods. Without cooldowns, the autoscaler thrashes — scaling up, then immediately scaling down, then up again.

Test scaling boundaries. Run load tests that trigger your scaling policies. Verify that new instances receive traffic and old instances drain gracefully.

Use predictive scaling when possible. If your traffic is predictable (daily peaks, weekly patterns), pre-scale before the spike instead of reacting to it.

Reactive scaling:       Traffic spikes ──▶ Detect ──▶ Scale ──▶ Ready (2-5 min lag)
Predictive scaling:     Predict spike ──▶ Scale ahead ──▶ Traffic arrives (ready)

Complete Architecture Example

                         ┌──────────────┐
                         │   CDN        │
                         │ (CloudFront) │
                         └──────┬───────┘
                                │
                         ┌──────▼───────┐
                         │   WAF        │
                         │ (Firewall)   │
                         └──────┬───────┘
                                │
                    ┌───────────▼───────────┐
                    │    L7 Load Balancer   │
                    │    (ALB / NGINX)      │
                    └───┬────┬────┬─────────┘
                        │    │    │
              ┌─────────┘    │    └─────────┐
              ▼              ▼              ▼
     ┌──────────────┐┌──────────────┐┌──────────────┐
     │  App Server  ││  App Server  ││  App Server  │
     │  (Auto-scale)││  (Auto-scale)││  (Auto-scale)│
     └──────┬───────┘└──────┬───────┘└──────┬───────┘
            │               │               │
            └───────┬───────┘───────────────┘
                    │
           ┌────────▼─────────┐
           │   Redis Cluster  │
           │   (Sessions +    │
           │    Caching)      │
           └────────┬─────────┘
                    │
      ┌─────────────┼──────────────┐
      ▼             ▼              ▼
┌──────────┐ ┌──────────┐  ┌──────────┐
│  Primary │ │ Replica  │  │ Replica  │
│    DB    │ │   DB 1   │  │   DB 2   │
│ (writes) │ │ (reads)  │  │ (reads)  │
└──────────┘ └──────────┘  └──────────┘

Key Takeaways

Start with vertical scaling. It is simpler, requires no code changes, and works until it does not.
Use horizontal scaling for stateless application servers. Externalize state to a database or cache.
L7 load balancers give you content-based routing, SSL termination, and request inspection. Use them for HTTP traffic. Use L4 for raw TCP (databases, message brokers).
Least connections is better than round robin when request processing time varies.
Externalize sessions to Redis instead of using sticky sessions. This decouples your scaling decisions from your session management.
Scale databases with read replicas first. Sharding is a last resort — it introduces significant complexity.
Auto-scaling needs proper cooldown periods and the right metrics. CPU is a starting point, not the final answer.
Load testing is not optional. Test your scaling policies before production traffic tests them for you.