Monitoring, Logging & Observability

Monitoring vs Observability

Monitoring tells you when something is wrong. Observability tells you why.

Monitoring is about predefined checks — is CPU above 90%? Is the error rate above 1%? Is the health check passing? These are questions you know to ask in advance.

Observability is the ability to understand your system's internal state by examining its outputs — logs, metrics, and traces. It lets you debug problems you did not anticipate, answer questions you did not think to ask, and understand failures you have never seen before.

Monitoring:                          Observability:
"Is the API healthy?"                "Why did request abc123 take 12 seconds?"
"Is error rate below 1%?"            "What changed at 3:47 PM that caused errors?"
"Is CPU below 80%?"                  "Which downstream service is causing timeouts?"

Dashboard ──▶ Green/Red              Traces + Logs + Metrics ──▶ Root cause

The three pillars of observability are logs, metrics, and traces. Each provides a different lens into your system. Together, they give you the complete picture.

┌─────────────────────────────────────────────────────┐
│                  Observability                       │
│                                                     │
│  ┌───────────┐  ┌───────────┐  ┌─────────────────┐  │
│  │   Logs    │  │  Metrics  │  │  Traces          │  │
│  │           │  │           │  │                  │  │
│  │ What      │  │ How much/ │  │ How requests     │  │
│  │ happened  │  │ how fast  │  │ flow through     │  │
│  │           │  │           │  │ services         │  │
│  │ Discrete  │  │ Aggregated│  │ Per-request      │  │
│  │ events    │  │ numbers   │  │ journey          │  │
│  └───────────┘  └───────────┘  └─────────────────┘  │
│                                                     │
└─────────────────────────────────────────────────────┘

Structured Logging

Unstructured logs are human-readable but machine-hostile. When you have 10 servers generating 50,000 log lines per minute, searching with grep stops working.

Unstructured (bad):
[2026-04-30 14:23:45] ERROR - Failed to process order 12345 for user 789

Structured (good):
{
  "timestamp": "2026-04-30T14:23:45.123Z",
  "level": "error",
  "message": "Failed to process order",
  "service": "order-service",
  "orderId": "12345",
  "userId": "789",
  "error": "PaymentDeclined",
  "duration_ms": 234,
  "request_id": "req-abc-123",
  "trace_id": "trace-xyz-456"
}
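
To make the difference concrete, here is a minimal sketch of querying structured logs by field rather than by text pattern. The sample lines, the `LogEvent` shape, and the `parseLogs` and `findByField` helpers are assumptions made up for illustration; they are not part of any logging library.

```typescript
// Hypothetical event shape, based on the fields in the example above.
interface LogEvent {
  level: string;
  message: string;
  orderId?: string;
  error?: string;
  duration_ms?: number;
}

// Hypothetical sample: newline-delimited JSON, one event per line.
const rawLogs = [
  '{"level":"info","message":"Order created","orderId":"12344","duration_ms":87}',
  '{"level":"error","message":"Failed to process order","orderId":"12345","error":"PaymentDeclined","duration_ms":234}',
  '{"level":"error","message":"Failed to process order","orderId":"12346","error":"Timeout","duration_ms":5012}',
  'this line is not JSON',
].join('\n');

// Parse each line into an object and skip lines that are not valid JSON.
function parseLogs(text: string): LogEvent[] {
  const events: LogEvent[] = [];
  for (const line of text.split('\n')) {
    try {
      events.push(JSON.parse(line) as LogEvent);
    } catch {
      // Ignore malformed lines rather than failing the whole query.
    }
  }
  return events;
}

// Exact-match filter on a single field.
function findByField<K extends keyof LogEvent>(
  events: LogEvent[],
  field: K,
  value: LogEvent[K],
): LogEvent[] {
  return events.filter((event) => event[field] === value);
}

const events = parseLogs(rawLogs);
const declined = findByField(events, 'error', 'PaymentDeclined');
const slowErrors = events.filter((e) => e.level === 'error' && (e.duration_ms ?? 0) > 1000);

console.log(declined.map((e) => e.orderId));   // [ '12345' ]
console.log(slowErrors.map((e) => e.orderId)); // [ '12346' ]
```

A centralized log store runs the same kind of field query for you, which is why consistent field names matter more than the exact message text.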

Implementation

import pino from 'pino';

// Create structured logger
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  base: {
    service: process.env.SERVICE_NAME || 'api-server',
    environment: process.env.NODE_ENV || 'development',
    version: process.env.APP_VERSION || 'unknown',
  },
});

// Usage — always log with context
logger.info({ orderId: '12345', userId: '789' }, 'Order created successfully');

logger.error({
  orderId: '12345',
  userId: '789',
  error: err.message,
  stack: err.stack,
  duration_ms: Date.now() - startTime,
}, 'Failed to process order');

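// Child logger with persistent context; falls back to a random UUID
// when the caller sends no x-request-id header
import { randomUUID } from 'node:crypto';
import type { Request } from 'express';

function createRequestLogger(req: Request) {
  return logger.child({
    requestId: req.headers['x-request-id'] || randomUUID(),
    method: req.method,
    path: req.path,
    ip: req.ip,
    userAgent: req.headers['user-agent'],
  });
}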

// Middleware to attach logger to request
app.use((req, res, next) => {
  req.log = createRequestLogger(req);
  const start = Date.now();

  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration_ms: Date.now() - start,
      contentLength: res.getHeader('content-length'),
    }, 'Request completed');
  });

  next();
});

// In route handlers
router.post('/orders', async (req, res) => {
  req.log.info({ items: req.body.items.length }, 'Creating order');

  try {
    const order = await createOrder(req.body);
    req.log.info({ orderId: order.id }, 'Order created');
    res.status(201).json({ data: order });
  } catch (error) {
    // Pass the error under `err` so pino's default serializer records type, message, and stack
    req.log.error({ err: error }, 'Order creation failed');
    res.status(500).json({ error: { message: 'Internal error' } });
  }
});

Log Levels

FATAL   — Application is about to crash. Wake someone up immediately.
ERROR   — Operation failed. Needs attention but application continues.
WARN    — Something unexpected happened but was handled. Watch for patterns.
INFO    — Normal operations. Request completed, job finished, config loaded.
DEBUG   — Detailed diagnostic information. Not in production by default.
TRACE   — Very detailed. Function entry/exit, variable values.

Production: INFO and above
Staging: DEBUG and above
Debugging a production issue: Temporarily enable DEBUG for specific service
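
The threshold rule above can be sketched as a numeric comparison. This is a simplified illustration, not how any particular logger is implemented: the numeric values mirror pino's default levels, while the `thresholds` map and the `thresholdFor` helper are hypothetical names.

```typescript
// Numeric severities; the values mirror pino's defaults (trace=10 up to fatal=60).
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type Level = keyof typeof LEVELS;

// A message is emitted only when its severity is at or above the threshold.
function shouldLog(messageLevel: Level, threshold: Level): boolean {
  return LEVELS[messageLevel] >= LEVELS[threshold];
}

// Hypothetical per-environment defaults matching the list above.
const thresholds: Record<string, Level> = { production: 'info', staging: 'debug' };

function isLevel(value: string): value is Level {
  return Object.prototype.hasOwnProperty.call(LEVELS, value);
}

// An explicit override (for example from LOG_LEVEL) wins when it names a known level.
function thresholdFor(env: string, override?: string): Level {
  if (override !== undefined && isLevel(override)) return override;
  return thresholds[env] ?? 'info';
}

console.log(shouldLog('debug', thresholdFor('production')));          // false
console.log(shouldLog('debug', thresholdFor('production', 'debug'))); // true
console.log(shouldLog('error', thresholdFor('staging')));             // true
```

Temporarily enabling DEBUG for one service then amounts to changing a single override value, with no code change.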

Log Aggregation

Logs that live only on individual servers are impractical to search at scale. Aggregate them into a centralized system.

┌──────────┐     ┌──────────────┐     ┌───────────────┐
│ Server 1 │────▶│              │     │               │
├──────────┤     │  Log Shipper │────▶│  Centralized  │
│ Server 2 │────▶│  (Filebeat/  │     │  Log Store    │
├──────────┤     │   Fluentd)   │     │  (Elastic/    │
│ Server 3 │────▶│              │     │   Loki)       │
└──────────┘     └──────────────┘     └───────┬───────┘
                                              │
                                      ┌───────▼───────┐
                                      │   Dashboard   │
                                      │   (Kibana/    │
                                      │    Grafana)   │
                                      └───────────────┘

# Filebeat configuration — ships logs to Elasticsearch
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - decode_json_fields:
          fields: ["message"]
          target: ""
      - add_kubernetes_metadata: ~

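output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "app-logs-%{+yyyy.MM.dd}"

# A custom index name also requires matching template settings
setup.template.name: "app-logs"
setup.template.pattern: "app-logs-*"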

Metrics

Metrics are numerical measurements collected over time. They answer quantitative questions: How many? How fast? How much?

Metric Types

Counter — monotonically increasing value
  http_requests_total: 1, 2, 3, 4, 5, 6...
  errors_total: 0, 0, 1, 1, 1, 2, 3...

Gauge — value that goes up and down
  cpu_usage_percent: 45, 72, 38, 91, 55...
  active_connections: 120, 135, 110, 142...
  queue_depth: 0, 5, 12, 3, 0...

Histogram — distribution of values in buckets
  request_duration_seconds:
    bucket{le="0.01"}: 100   (100 requests under 10ms)
    bucket{le="0.05"}: 450   (450 requests under 50ms)
    bucket{le="0.1"}:  890   (890 requests under 100ms)
    bucket{le="0.5"}:  980   (980 requests under 500ms)
    bucket{le="1.0"}:  995   (995 requests under 1s)
    bucket{le="+Inf"}: 1000  (1000 total requests)

Summary — similar to histogram but with precomputed quantiles
  request_duration_seconds:
    {quantile="0.5"}:  0.042   (p50: 42ms)
    {quantile="0.9"}:  0.128   (p90: 128ms)
    {quantile="0.99"}: 0.954   (p99: 954ms)
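
Histogram buckets are cumulative, so a percentile has to be estimated from them. The sketch below interpolates linearly inside the bucket that contains the target rank, which is the general idea behind Prometheus's `histogram_quantile`. It is a simplified approximation, and the `Bucket` type and `estimateQuantile` function are hypothetical names.

```typescript
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count of observations at or below `le`
}

// Bucket values copied from the histogram example above.
const buckets: Bucket[] = [
  { le: 0.01, count: 100 },
  { le: 0.05, count: 450 },
  { le: 0.1, count: 890 },
  { le: 0.5, count: 980 },
  { le: 1.0, count: 995 },
  { le: Infinity, count: 1000 },
];

// Estimate quantile q (0..1). Assumes buckets are sorted by `le` and end with +Inf.
function estimateQuantile(q: number, sorted: Bucket[]): number {
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  if (i === -1) return NaN; // only reachable when q is greater than 1
  if (sorted[i].le === Infinity) {
    // The quantile lies beyond the last finite bound, so report that bound.
    return i > 0 ? sorted[i - 1].le : NaN;
  }

  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const lowerCount = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - lowerCount;
  if (inBucket === 0) return lowerBound;
  return lowerBound + (sorted[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

console.log(estimateQuantile(0.5, buckets).toFixed(3));  // "0.056"
console.log(estimateQuantile(0.99, buckets).toFixed(3)); // "0.833"
```

The estimate can only be as precise as the bucket boundaries allow, so place boundaries close to the latencies you care about, such as your SLO threshold.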

Prometheus Metrics in Node.js

import client from 'prom-client';

// Default metrics (CPU, memory, event loop, etc.)
client.collectDefaultMetrics({ prefix: 'app_' });

// Custom metrics
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status_code'],
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

const dbQueryDuration = new client.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['query_type', 'table'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});

// Middleware to track HTTP metrics
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer();

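  res.on('finish', () => {
    const labels = {
      method: req.method,
      // Use the route template; raw paths such as /orders/12345 would create
      // one time series per ID. 'unmatched' is an assumed placeholder label.
      path: req.route?.path ?? 'unmatched',
      status_code: res.statusCode.toString(),
    };

    httpRequestsTotal.inc(labels);
    end(labels);
  });

  // 'close' also fires when the client disconnects before the response
  // finishes, so the gauge cannot drift upward on aborted requests
  res.on('close', () => activeConnections.dec());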

  next();
});

// Expose metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

// Track database queries
async function queryWithMetrics<T>(queryType: string, table: string, fn: () => Promise<T>): Promise<T> {
  const end = dbQueryDuration.startTimer({ query_type: queryType, table });
  try {
    return await fn();
  } finally {
    end();
  }
}

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-servers'
    static_configs:
      - targets:
          - 'api-server-1:3000'
          - 'api-server-2:3000'
          - 'api-server-3:3000'

  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'

rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
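
Prometheus stores the raw counter value from each scrape and derives rates at query time. The sketch below shows the core idea: sum the increases between samples and treat any drop as a counter reset. It is a simplification under stated assumptions: the real `rate()` function also extrapolates to the edges of the query window, and the `Sample` type and sample values are made up.

```typescript
interface Sample {
  timestampSec: number; // scrape time in seconds
  value: number;        // counter value at that scrape
}

// Total increase across the samples. A drop means the counter was reset
// (for example by a process restart) and counting resumed from zero.
function counterIncrease(samples: Sample[]): number {
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].value;
    const curr = samples[i].value;
    increase += curr >= prev ? curr - prev : curr;
  }
  return increase;
}

// Average per-second rate over the sampled window.
function perSecondRate(samples: Sample[]): number {
  if (samples.length < 2) return 0;
  const elapsed = samples[samples.length - 1].timestampSec - samples[0].timestampSec;
  return elapsed > 0 ? counterIncrease(samples) / elapsed : 0;
}

// Hypothetical scrapes 15 seconds apart, with a restart before the fourth one.
const scrapes: Sample[] = [
  { timestampSec: 0, value: 1000 },
  { timestampSec: 15, value: 1150 },
  { timestampSec: 30, value: 1300 },
  { timestampSec: 45, value: 120 },
  { timestampSec: 60, value: 270 },
];

console.log(counterIncrease(scrapes)); // 570
console.log(perSecondRate(scrapes));   // 9.5
```

This is why counters should be queried through `rate()` or `increase()` rather than graphed raw: the raw value is meaningless after a restart, while the rate stays continuous.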

Key Metrics to Track

RED Method (for request-driven services):
  Rate     — Requests per second
  Errors   — Error rate (percentage of failed requests)
  Duration — Latency distribution (p50, p90, p99)

USE Method (for resources: CPU, memory, disk, network):
  Utilization — Percentage of resource in use
  Saturation  — Amount of work queued/waiting
  Errors      — Error events related to the resource

Business metrics (application-specific):
  Orders per minute
  Cart abandonment rate
  Payment success rate
  User signup rate
  Active users (DAU, WAU, MAU)
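
As a rough illustration of the RED method, the sketch below computes all three signals from raw request samples in one time window. In practice Prometheus derives them from the counter and histogram defined earlier; the `RequestSample` type, the `redSummary` function, and the sample data here are hypothetical.

```typescript
interface RequestSample {
  statusCode: number;
  durationMs: number;
}

// Nearest-rank percentile over raw samples (p in the range 0..100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function redSummary(samples: RequestSample[], windowSeconds: number) {
  const failed = samples.filter((s) => s.statusCode >= 500).length;
  const durations = samples.map((s) => s.durationMs);
  return {
    ratePerSecond: samples.length / windowSeconds,                  // Rate
    errorRatio: samples.length === 0 ? 0 : failed / samples.length, // Errors
    p50Ms: percentile(durations, 50),                               // Duration
    p99Ms: percentile(durations, 99),
  };
}

// Hypothetical window: 10 requests observed over 5 seconds, one of them failing slowly.
const windowSamples: RequestSample[] = [
  { statusCode: 200, durationMs: 20 },
  { statusCode: 200, durationMs: 25 },
  { statusCode: 201, durationMs: 30 },
  { statusCode: 200, durationMs: 35 },
  { statusCode: 200, durationMs: 40 },
  { statusCode: 200, durationMs: 45 },
  { statusCode: 404, durationMs: 50 },
  { statusCode: 200, durationMs: 60 },
  { statusCode: 200, durationMs: 80 },
  { statusCode: 500, durationMs: 900 },
];

const red = redSummary(windowSamples, 5);
console.log(red); // { ratePerSecond: 2, errorRatio: 0.1, p50Ms: 40, p99Ms: 900 }
```

Note how the median looks healthy while p99 exposes the slow failure, which is why Duration is tracked as a distribution and not as an average.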

Distributed Tracing

In a microservices architecture, a single user request might touch 5-10 services. Distributed tracing follows that request across all services, showing you exactly where time is spent.

Without tracing:
  "The API is slow" — Which service? Which call? No idea.

With tracing:
  Request abc-123:
  ├── API Gateway          2ms
  ├── Auth Service        15ms
  ├── Order Service       45ms
  │   ├── DB Query        12ms
  │   ├── Inventory Check 28ms  ◀── Bottleneck found!
  │   └── Cache Lookup     3ms
  └── Notification Svc     8ms
  Total: 70ms
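
A trace is a tree of spans, and the bottleneck is usually the span with the most self time: its own duration minus the time spent in its children. The sketch below applies that idea to the trace above. The `Span` type and the span ids are made up, and it assumes child spans run one after another.

```typescript
interface Span {
  id: string;
  parentId: string | null;
  name: string;
  durationMs: number;
}

// Spans copied from the trace above; the ids are invented for this example.
const spans: Span[] = [
  { id: 'root', parentId: null, name: 'Request abc-123', durationMs: 70 },
  { id: 'gw', parentId: 'root', name: 'API Gateway', durationMs: 2 },
  { id: 'auth', parentId: 'root', name: 'Auth Service', durationMs: 15 },
  { id: 'order', parentId: 'root', name: 'Order Service', durationMs: 45 },
  { id: 'db', parentId: 'order', name: 'DB Query', durationMs: 12 },
  { id: 'inv', parentId: 'order', name: 'Inventory Check', durationMs: 28 },
  { id: 'cache', parentId: 'order', name: 'Cache Lookup', durationMs: 3 },
  { id: 'notify', parentId: 'root', name: 'Notification Svc', durationMs: 8 },
];

// Self time per span id: duration minus the durations of direct children.
function selfTimes(all: Span[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const span of all) result.set(span.id, span.durationMs);
  for (const span of all) {
    if (span.parentId !== null && result.has(span.parentId)) {
      result.set(span.parentId, (result.get(span.parentId) ?? 0) - span.durationMs);
    }
  }
  return result;
}

// The span with the largest self time. Expects at least one span.
function slowestBySelfTime(all: Span[]): Span {
  const self = selfTimes(all);
  return all.reduce((worst, span) =>
    (self.get(span.id) ?? 0) > (self.get(worst.id) ?? 0) ? span : worst,
  );
}

console.log(slowestBySelfTime(spans).name); // Inventory Check
console.log(selfTimes(spans).get('order')); // 2
```

Order Service looks slow at 45ms in total, but only 2ms of that is its own work; the rest belongs to its children.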

OpenTelemetry Implementation

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

// Initialize OpenTelemetry SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    [ATTR_SERVICE_VERSION]: '1.2.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_ENDPOINT || 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instrument HTTP, Express, database clients, etc.
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

// Manual span creation for custom operations
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);

    try {
      // Nested span for inventory check
      const inventory = await tracer.startActiveSpan('checkInventory', async (childSpan) => {
        try {
          childSpan.setAttribute('order.id', orderId);
          const result = await inventoryService.check(orderId);
          childSpan.setAttribute('inventory.available', result.available);
          return result;
        } finally {
          // End the span even if the call throws, so it is still exported
          childSpan.end();
        }
      });

      // Nested span for payment
      const payment = await tracer.startActiveSpan('processPayment', async (childSpan) => {
        try {
          childSpan.setAttribute('order.id', orderId);
          childSpan.setAttribute('payment.amount', inventory.total);
          const result = await paymentService.charge(orderId, inventory.total);
          childSpan.setAttribute('payment.status', result.status);
          return result;
        } finally {
          childSpan.end();
        }
      });

      span.setAttribute('order.status', 'completed');
      span.setStatus({ code: SpanStatusCode.OK });
      return { orderId, payment };
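    } catch (error) {
      // Catch variables are `unknown` under strict TypeScript, so normalize first
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message,
      });
      span.recordException(err);
      throw error;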
    } finally {
      span.end();
    }
  });
}

Trace Context Propagation

For traces to work across services, each service must propagate the trace context to downstream calls.

// Context propagation happens automatically with OpenTelemetry
// instrumentation for HTTP clients (axios, fetch, etc.)

// If manual propagation is needed:
import { propagation, context } from '@opentelemetry/api';

// Inject context into outgoing request headers
function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
  return headers;
}

// Extract context from incoming request headers
function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}
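
By default, OpenTelemetry carries this context in the W3C Trace Context `traceparent` header. The sketch below parses and formats the version `00` form of that header, to show what actually travels between services. The `TraceParent` type and both function names are hypothetical; in real code, leave this work to `propagation.inject` and `propagation.extract`.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters, shared by every span in the trace
  parentId: string; // 16 lowercase hex characters, the caller's span id
  sampled: boolean; // lowest bit of the trace-flags byte
}

// version-traceid-parentid-flags, all lowercase hex
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;
  // The spec treats version ff and all-zero ids as invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceParent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Example header value taken from the W3C Trace Context specification.
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsed = parseTraceParent(incoming);
console.log(parsed?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(parsed?.sampled); // true

// A downstream call keeps the trace id but sends its own span id as the parent.
console.log(formatTraceParent('4bf92f3577b34da6a3ce929d0e0e4736', 'b7ad6b7169203331', true));
// 00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01
```

The trace id stays constant across every hop, which is what lets the tracing backend stitch spans from different services into one tree.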

Alerting and On-Call

Metrics and logs are useless if nobody looks at them. Alerting bridges the gap between data collection and human action.

Alert Rules (Prometheus)

# alerts.yml
groups:
  - name: api-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          description: "More than 5% of requests have failed over the last 5 minutes"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2 seconds"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes
          > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

      # Database connection pool exhaustion
      - alert: DBConnectionPoolExhausted
        expr: |
          app_db_pool_active_connections
          / app_db_pool_max_connections
          > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool is 90%+ utilized"

Alerting Best Practices

Good alerts:
- Actionable — someone can do something about it
- Urgent — it needs attention now, not tomorrow
- Based on symptoms — "users are seeing errors" not "CPU is high"
- Have runbooks — link to documentation on how to fix

Bad alerts:
- Non-actionable — "disk is 70% full" at 3 AM (no immediate action needed)
- Flapping — triggers and resolves repeatedly
- Too many — alert fatigue makes engineers ignore everything
- Based on causes — CPU can be high without affecting users
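
The `for:` clause in the alert rules is the main defense against flapping: a rule stays pending until its condition has held continuously for the whole duration. The sketch below simulates that behavior over a series of rule evaluations. It is a simplified model, and the `evaluateAlert` function and the sample series are hypothetical.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Walk through successive rule evaluations. The alert is pending while the
// condition holds and fires once it has held continuously for `forSeconds`.
function evaluateAlert(
  conditionResults: boolean[],
  evaluationIntervalSeconds: number,
  forSeconds: number,
): AlertState[] {
  const states: AlertState[] = [];
  let activeSince: number | null = null; // time at which the condition last became true

  for (let index = 0; index < conditionResults.length; index++) {
    const now = index * evaluationIntervalSeconds;
    if (!conditionResults[index]) {
      activeSince = null; // any false evaluation resets the timer
      states.push('inactive');
      continue;
    }
    if (activeSince === null) activeSince = now;
    states.push(now - activeSince >= forSeconds ? 'firing' : 'pending');
  }
  return states;
}

// Evaluated every 15s with `for: 60s`. The brief recovery at the third
// evaluation restarts the timer, so nothing fires until the last evaluation.
const series = [true, true, false, true, true, true, true, true];
console.log(evaluateAlert(series, 15, 60));
// [ 'pending', 'pending', 'inactive', 'pending', 'pending', 'pending', 'pending', 'firing' ]
```

A longer `for:` value suppresses more noise but delays the page by the same amount, so choose it per alert based on how quickly someone needs to react.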

SLO / SLI / SLA

SLI (Service Level Indicator):
  A quantitative measure of a specific aspect of service quality.
  Example: "99.2% of requests completed in under 200ms last month"

SLO (Service Level Objective):
  A target value for an SLI.
  Example: "99.5% of requests must complete in under 200ms"

SLA (Service Level Agreement):
  A contract between provider and customer with consequences.
  Example: "99.9% uptime. If violated, customer gets 10% credit."

Relationship:
  SLI (measurement) ──▶ SLO (target) ──▶ SLA (contract)

// SLI tracking implementation
const sliRequestsTotal = new client.Counter({
  name: 'sli_requests_total',
  help: 'Total requests for SLI tracking',
  labelNames: ['sli_met'],
});

app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - start;
    const isSuccess = res.statusCode < 500;
    const isFast = duration < 200;

    // SLI: successful requests under 200ms
    const sliMet = isSuccess && isFast;
    sliRequestsTotal.inc({ sli_met: sliMet.toString() });
  });

  next();
});

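// SLO calculation in Prometheus (sum() collapses the sli_met label so both
// sides of the division match; without it the ratio is always 1):
// sum(rate(sli_requests_total{sli_met="true"}[30d]))
// /
// sum(rate(sli_requests_total[30d]))
// Target: > 0.995 (99.5%)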

Error Budget

SLO: 99.5% availability per month

Error budget = 100% - 99.5% = 0.5% of requests can fail

In a month with 10,000,000 requests:
  Error budget = 50,000 failed requests allowed

If you have used 40,000 of your error budget:
  Remaining budget = 10,000 requests
  Slow down deployments, focus on reliability

If you have used 5,000 of your error budget:
  Remaining budget = 45,000 requests
  Ship features confidently
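
The arithmetic above can be sketched as a small helper. The `ErrorBudget` type and the `errorBudget` function are hypothetical names; the inputs are the numbers from the example.

```typescript
interface ErrorBudget {
  allowedFailures: number;  // failures the SLO tolerates in this period
  remaining: number;        // budget left; negative means the SLO is violated
  consumedFraction: number; // share of the budget already used
}

function errorBudget(sloTarget: number, totalRequests: number, failedRequests: number): ErrorBudget {
  // Round to whole requests to avoid floating-point noise in (1 - sloTarget).
  const allowedFailures = Math.round((1 - sloTarget) * totalRequests);
  return {
    allowedFailures,
    remaining: allowedFailures - failedRequests,
    consumedFraction: allowedFailures === 0 ? 1 : failedRequests / allowedFailures,
  };
}

const nearlySpent = errorBudget(0.995, 10_000_000, 40_000);
const healthy = errorBudget(0.995, 10_000_000, 5_000);

console.log(nearlySpent); // { allowedFailures: 50000, remaining: 10000, consumedFraction: 0.8 }
console.log(healthy);     // { allowedFailures: 50000, remaining: 45000, consumedFraction: 0.1 }
```

Teams often attach a policy to `consumedFraction`, for example freezing risky deployments once most of the budget is gone; the exact threshold is a team decision.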

Observability Tool Stack

┌─────────────────────────────────────────────────────────┐
│                   Grafana (Dashboards)                   │
│   Visualizes metrics, logs, and traces in one place      │
└─────────┬──────────────┬──────────────────┬─────────────┘
          │              │                  │
  ┌───────▼──────┐ ┌─────▼──────┐  ┌───────▼──────┐
  │  Prometheus  │ │    Loki    │  │   Jaeger /   │
  │  (metrics)   │ │   (logs)   │  │   Tempo      │
  │              │ │            │  │  (traces)    │
  └──────────────┘ └────────────┘  └──────────────┘

Alternative stacks:
  ELK:    Elasticsearch + Logstash + Kibana
  Datadog: Metrics + Logs + Traces + APM (SaaS)
  New Relic: Full-stack observability (SaaS)

Docker Compose for Local Observability Stack

# docker-compose.observability.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4318:4318"    # OTLP HTTP

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  grafana-data:

Key Takeaways

Use structured JSON logging from day one. Retrofitting structured logging into an existing codebase is painful.
Track the RED metrics (Rate, Errors, Duration) for every service. These three signals answer most first-line production debugging questions.
Distributed tracing is non-negotiable in a microservices architecture. Without it, debugging cross-service latency issues is guesswork.
Alert on symptoms, not causes. Users do not care about CPU — they care about whether the page loads.
Define SLOs before you have an outage. Error budgets give you a rational framework for balancing reliability and feature velocity.
Start with a simple stack (Prometheus + Grafana + structured logs). Add distributed tracing when you move to microservices.
Every alert must be actionable and have a linked runbook. If nobody can act on it, it should not page someone at 3 AM.
Observability is an investment that pays off during incidents. Building it almost always costs less than debugging blind.