Monitoring vs Observability
Monitoring tells you when something is wrong. Observability tells you why.
Monitoring is about predefined checks: is CPU above 90%? Is the error rate above 1%? Is the health check passing? These are questions you know to ask in advance.
Observability is the ability to understand your system's internal state by examining its outputs: logs, metrics, and traces. It lets you debug problems you did not anticipate, answer questions you did not think to ask, and understand failures you have never seen before.
Monitoring:                     Observability:
"Is the API healthy?"           "Why did request abc123 take 12 seconds?"
"Is error rate below 1%?"       "What changed at 3:47 PM that caused errors?"
"Is CPU below 80%?"             "Which downstream service is causing timeouts?"
Dashboard ──▶ Green/Red         Traces + Logs + Metrics ──▶ Root cause
The three pillars of observability are logs, metrics, and traces. Each provides a different lens into your system. Together, they give you the complete picture.
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
│                                                     │
│  ┌───────────┐  ┌───────────┐  ┌─────────────────┐  │
│  │   Logs    │  │  Metrics  │  │     Traces      │  │
│  │           │  │           │  │                 │  │
│  │ What      │  │ How much/ │  │ How requests    │  │
│  │ happened  │  │ how fast  │  │ flow through    │  │
│  │           │  │           │  │ services        │  │
│  │ Discrete  │  │ Aggregated│  │ Per-request     │  │
│  │ events    │  │ numbers   │  │ journey         │  │
│  └───────────┘  └───────────┘  └─────────────────┘  │
│                                                     │
└─────────────────────────────────────────────────────┘
Structured Logging
Unstructured logs are human-readable but machine-hostile. When you have 10 servers generating 50,000 log lines per minute, searching with grep stops working.
Unstructured (bad):
[2026-04-30 14:23:45] ERROR - Failed to process order 12345 for user 789
Structured (good):
{
  "timestamp": "2026-04-30T14:23:45.123Z",
  "level": "error",
  "message": "Failed to process order",
  "service": "order-service",
  "orderId": "12345",
  "userId": "789",
  "error": "PaymentDeclined",
  "duration_ms": 234,
  "request_id": "req-abc-123",
  "trace_id": "trace-xyz-456"
}
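To make the "machine-friendly" point concrete, here is a minimal sketch of querying newline-delimited JSON logs by field, something grep cannot do reliably on free-form text. The `LogEntry` shape and the helper names `parseLogLines` and `filterLogs` are assumptions made for illustration, not part of any logging library.

```typescript
// Hypothetical shape of one structured log entry: `level` and `message`
// are required, every other field is free-form context.
interface LogEntry {
  level: string;
  message: string;
  [field: string]: unknown;
}

// Parse newline-delimited JSON, skipping lines that are not JSON objects.
function parseLogLines(raw: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of raw.split('\n')) {
    if (line.trim() === '') continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
        const candidate = parsed as Record<string, unknown>;
        if (typeof candidate['level'] === 'string' && typeof candidate['message'] === 'string') {
          entries.push(candidate as LogEntry);
        }
      }
    } catch {
      // Not JSON (for example a legacy plain-text line): ignore it.
    }
  }
  return entries;
}

// Keep only entries whose fields equal every key/value pair in `criteria`.
function filterLogs(entries: LogEntry[], criteria: Record<string, unknown>): LogEntry[] {
  return entries.filter((entry) =>
    Object.entries(criteria).every(([key, value]) => entry[key] === value),
  );
}

const sampleLogs = [
  '{"level":"error","message":"Failed to process order","service":"order-service","orderId":"12345","userId":"789"}',
  '{"level":"info","message":"Order created","service":"order-service","orderId":"12346","userId":"789"}',
  '[2026-04-30 14:23:45] ERROR - Failed to process order 12345 for user 789',
  '{"level":"error","message":"Card declined","service":"payment-service","orderId":"12345"}',
].join('\n');

// Every error, from any service, that mentions order 12345.
const failedOrderLogs = filterLogs(parseLogLines(sampleLogs), {
  level: 'error',
  orderId: '12345',
});
```

The unstructured line is silently dropped because it cannot be parsed, which is exactly the information a centralized log store would lose or have to recover with fragile regular expressions.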
Implementation
import { randomUUID } from 'node:crypto';
import type { Request } from 'express';
import pino from 'pino';

// Create structured logger
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  base: {
    service: process.env.SERVICE_NAME || 'api-server',
    environment: process.env.NODE_ENV || 'development',
    version: process.env.APP_VERSION || 'unknown',
  },
});

// Usage: always log with context
logger.info({ orderId: '12345', userId: '789' }, 'Order created successfully');

// `err` and `startTime` come from the surrounding operation
logger.error({
  orderId: '12345',
  userId: '789',
  error: err.message,
  stack: err.stack,
  duration_ms: Date.now() - startTime,
}, 'Failed to process order');

// Child logger with persistent context
function createRequestLogger(req: Request) {
  return logger.child({
    requestId: req.headers['x-request-id'] || randomUUID(),
    method: req.method,
    path: req.path,
    ip: req.ip,
    userAgent: req.headers['user-agent'],
  });
}

// Middleware to attach logger to request
// (`req.log` assumes the Express Request type has been augmented with a `log` property)
app.use((req, res, next) => {
  req.log = createRequestLogger(req);
  const start = Date.now();
  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      duration_ms: Date.now() - start,
      contentLength: res.getHeader('content-length'),
    }, 'Request completed');
  });
  next();
});

// In route handlers
router.post('/orders', async (req, res) => {
  try {
    // Inside the try block so a malformed body is handled like any other failure
    req.log.info({ items: req.body.items.length }, 'Creating order');
    const order = await createOrder(req.body);
    req.log.info({ orderId: order.id }, 'Order created');
    res.status(201).json({ data: order });
  } catch (error) {
    // A caught value is not guaranteed to be an Error instance
    const message = error instanceof Error ? error.message : String(error);
    req.log.error({ error: message }, 'Order creation failed');
    res.status(500).json({ error: { message: 'Internal error' } });
  }
});
Log Levels
FATAL: Application is about to crash. Wake someone up immediately.
ERROR: Operation failed. Needs attention but application continues.
WARN: Something unexpected happened but was handled. Watch for patterns.
INFO: Normal operations. Request completed, job finished, config loaded.
DEBUG: Detailed diagnostic information. Not in production by default.
TRACE: Very detailed. Function entry/exit, variable values.
Production: INFO and above
Staging: DEBUG and above
Debugging a production issue: Temporarily enable DEBUG for specific service
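The "INFO and above" rule is a threshold comparison over an ordered list of levels. The sketch below illustrates it under stated assumptions: the numeric severities are arbitrary (only their order matters), and `thresholdByEnvironment` and `thresholdFor` are hypothetical names rather than any logger's API.

```typescript
// Severity order taken from the list above; a higher number means more severe.
const LEVEL_SEVERITY = {
  trace: 10,
  debug: 20,
  info: 30,
  warn: 40,
  error: 50,
  fatal: 60,
} as const;

type LogLevel = keyof typeof LEVEL_SEVERITY;

// A message is emitted only if it is at least as severe as the threshold.
function shouldLog(messageLevel: LogLevel, threshold: LogLevel): boolean {
  return LEVEL_SEVERITY[messageLevel] >= LEVEL_SEVERITY[threshold];
}

// Hypothetical per-environment thresholds mirroring the guidance above.
const thresholdByEnvironment: Record<string, LogLevel> = {
  production: 'info',
  staging: 'debug',
};

function thresholdFor(environment: string): LogLevel {
  // Falling back to 'info' for unknown environments is an assumption of this sketch.
  return thresholdByEnvironment[environment] ?? 'info';
}

const debugInProduction = shouldLog('debug', thresholdFor('production')); // false
const debugInStaging = shouldLog('debug', thresholdFor('staging')); // true
```

Temporarily enabling DEBUG for one service then amounts to changing that service's threshold, not redeploying code.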
Log Aggregation
Individual server logs are useless at scale. Aggregate them into a centralized system.
┌──────────┐      ┌──────────────┐      ┌───────────────┐
│ Server 1 │─────▶│              │      │               │
├──────────┤      │ Log Shipper  │─────▶│  Centralized  │
│ Server 2 │─────▶│  (Filebeat/  │      │   Log Store   │
├──────────┤      │   Fluentd)   │      │   (Elastic/   │
│ Server 3 │─────▶│              │      │     Loki)     │
└──────────┘      └──────────────┘      └───────┬───────┘
                                                │
                                        ┌───────▼───────┐
                                        │   Dashboard   │
                                        │   (Kibana/    │
                                        │    Grafana)   │
                                        └───────────────┘
# Filebeat configuration: ships logs to Elasticsearch
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log

processors:
  - decode_json_fields:
      fields: ["message"]
      target: ""
  - add_kubernetes_metadata: ~

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  # A custom index name typically also requires setup.template.name
  # and setup.template.pattern to be set.
  index: "app-logs-%{+yyyy.MM.dd}"
Metrics
Metrics are numerical measurements collected over time. They answer quantitative questions: How many? How fast? How much?
Metric Types
Counter: monotonically increasing value (resets to zero only on restart)
  http_requests_total: 1, 2, 3, 4, 5, 6...
  errors_total: 0, 0, 1, 1, 1, 2, 3...
Gauge: value that goes up and down
  cpu_usage_percent: 45, 72, 38, 91, 55...
  active_connections: 120, 135, 110, 142...
  queue_depth: 0, 5, 12, 3, 0...
Histogram: distribution of values in cumulative buckets
  request_duration_seconds:
    bucket{le="0.01"}: 100 (100 requests took 10ms or less)
    bucket{le="0.05"}: 450 (450 requests took 50ms or less)
    bucket{le="0.1"}: 890 (890 requests took 100ms or less)
    bucket{le="0.5"}: 980 (980 requests took 500ms or less)
    bucket{le="1.0"}: 995 (995 requests took 1s or less)
    bucket{le="+Inf"}: 1000 (1000 total requests)
Summary: similar to a histogram but with quantiles precomputed by the client
  request_duration_seconds:
    {quantile="0.5"}: 0.042 (p50: 42ms)
    {quantile="0.9"}: 0.128 (p90: 128ms)
    {quantile="0.99"}: 0.954 (p99: 954ms)
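Because histogram buckets are cumulative, a percentile can only be estimated from them, not read off directly. The sketch below shows one common approach, linear interpolation inside the bucket that contains the target rank, applied to the bucket counts listed above. It is a simplified illustration in the spirit of what a query engine does, not a reproduction of any particular implementation; `Bucket` and `estimateQuantile` are hypothetical names.

```typescript
// One cumulative bucket: `count` observations were <= `upperBound` seconds.
interface Bucket {
  upperBound: number; // use Infinity for the +Inf bucket
  count: number;
}

// Estimate a quantile (0 < q < 1) by linear interpolation inside the bucket
// that contains the target rank. Buckets must be sorted by upperBound.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const last = buckets[buckets.length - 1];
  if (last === undefined || last.count === 0) return NaN;
  const rank = q * last.count;

  let lowerBound = 0;
  let lowerCount = 0;
  for (const bucket of buckets) {
    if (bucket.count >= rank) {
      // The +Inf bucket has no finite upper edge, so report the last finite bound.
      if (!Number.isFinite(bucket.upperBound)) return lowerBound;
      const inBucket = bucket.count - lowerCount;
      if (inBucket === 0) return bucket.upperBound;
      const fraction = (rank - lowerCount) / inBucket;
      return lowerBound + (bucket.upperBound - lowerBound) * fraction;
    }
    lowerBound = bucket.upperBound;
    lowerCount = bucket.count;
  }
  return lowerBound;
}

// Bucket counts from the histogram example above.
const requestDurationBuckets: Bucket[] = [
  { upperBound: 0.01, count: 100 },
  { upperBound: 0.05, count: 450 },
  { upperBound: 0.1, count: 890 },
  { upperBound: 0.5, count: 980 },
  { upperBound: 1.0, count: 995 },
  { upperBound: Infinity, count: 1000 },
];

const estimatedP50 = estimateQuantile(0.5, requestDurationBuckets); // about 0.0557 seconds
const estimatedP99 = estimateQuantile(0.99, requestDurationBuckets); // about 0.833 seconds
```

The p99 estimate lands somewhere between 0.5s and 1s purely because of where the bucket boundaries sit, which is why bucket boundaries should be chosen around the latencies you care about.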
Prometheus Metrics in Node.js
import client from 'prom-client';

// Default metrics (CPU, memory, event loop, etc.)
client.collectDefaultMetrics({ prefix: 'app_' });

// Custom metrics
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status_code'],
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

const dbQueryDuration = new client.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['query_type', 'table'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});

// Middleware to track HTTP metrics
app.use((req, res, next) => {
  activeConnections.inc();
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      // Use the route template, never the raw URL, so label cardinality stays bounded
      path: req.route?.path || 'unmatched',
      status_code: res.statusCode.toString(),
    };
    httpRequestsTotal.inc(labels);
    end(labels);
  });
  // 'close' also fires when the client disconnects early, so the gauge cannot leak
  res.on('close', () => activeConnections.dec());
  next();
});

// Expose metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

// Track database queries (generic so the caller keeps the query's result type)
async function queryWithMetrics<T>(queryType: string, table: string, fn: () => Promise<T>): Promise<T> {
  const end = dbQueryDuration.startTimer({ query_type: queryType, table });
  try {
    return await fn();
  } finally {
    end();
  }
}
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-servers'
    static_configs:
      - targets:
          - 'api-server-1:3000'
          - 'api-server-2:3000'
          - 'api-server-3:3000'
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'

rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
Key Metrics to Track
RED Method (for request-driven services):
  Rate: Requests per second
  Errors: Error rate (percentage of failed requests)
  Duration: Latency distribution (p50, p90, p99)
USE Method (for resources: CPU, memory, disk, network):
  Utilization: Percentage of resource in use
  Saturation: Amount of work queued/waiting
  Errors: Error events related to the resource
Business metrics (application-specific):
  Orders per minute
  Cart abandonment rate
  Payment success rate
  User signup rate
  Active users (DAU, WAU, MAU)
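As an illustration of what the RED numbers actually are, the sketch below derives them from a list of raw request records over a fixed window. In practice a metrics backend computes these from counters and histograms; the `RequestSample` shape, the nearest-rank percentile method, and the "status 500 and above counts as an error" rule are assumptions of this example.

```typescript
// Hypothetical record of one completed request.
interface RequestSample {
  statusCode: number;
  durationMs: number;
}

interface RedSummary {
  ratePerSecond: number;
  errorRatio: number;
  p50Ms: number;
  p90Ms: number;
  p99Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedValues: number[], p: number): number {
  if (sortedValues.length === 0) return NaN;
  const index = Math.max(0, Math.ceil((p / 100) * sortedValues.length) - 1);
  return sortedValues[index] ?? NaN;
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const failures = samples.filter((s) => s.statusCode >= 500).length;
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: samples.length === 0 ? 0 : failures / samples.length,
    p50Ms: percentile(durations, 50),
    p90Ms: percentile(durations, 90),
    p99Ms: percentile(durations, 99),
  };
}

// Ten requests observed in a 5-second window: durations 10ms..100ms, one 503.
const windowSamples: RequestSample[] = Array.from({ length: 10 }, (_, i) => ({
  statusCode: i === 3 ? 503 : 200,
  durationMs: (i + 1) * 10,
}));

const redSummary = summarizeRed(windowSamples, 5);
// ratePerSecond: 2, errorRatio: 0.1, p50Ms: 50, p90Ms: 90, p99Ms: 100
```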
Distributed Tracing
In a microservices architecture, a single user request might touch 5-10 services. Distributed tracing follows that request across all services, showing you exactly where time is spent.
Without tracing:
  "The API is slow" → Which service? Which call? No idea.
With tracing:
  Request abc-123:
  ├── API Gateway           2ms
  ├── Auth Service         15ms
  ├── Order Service        45ms
  │   ├── DB Query         12ms
  │   ├── Inventory Check  28ms  ←── Bottleneck found!
  │   └── Cache Lookup      3ms
  └── Notification Svc      8ms
  Total: 70ms
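A trace is a tree of timed spans, so "finding the bottleneck" can be phrased as finding the span with the largest self time: its own duration minus the time spent in its direct children. The sketch below applies that idea to the trace above. `SpanNode` and `findBottleneck` are hypothetical names, and the calculation assumes children run sequentially; parallel children would need interval merging.

```typescript
interface SpanNode {
  label: string;
  durationMs: number;
  children: SpanNode[];
}

// Self time = a span's duration minus the time covered by its direct children.
function selfTimeMs(span: SpanNode): number {
  const childTotal = span.children.reduce((sum, child) => sum + child.durationMs, 0);
  return Math.max(0, span.durationMs - childTotal);
}

// Walk the whole tree and return the span with the largest self time.
function findBottleneck(root: SpanNode): { label: string; selfMs: number } {
  let worst = { label: root.label, selfMs: selfTimeMs(root) };
  const pending: SpanNode[] = [...root.children];
  while (pending.length > 0) {
    const span = pending.pop();
    if (span === undefined) break;
    const selfMs = selfTimeMs(span);
    if (selfMs > worst.selfMs) worst = { label: span.label, selfMs };
    pending.push(...span.children);
  }
  return worst;
}

// The trace for request abc-123 shown above.
const requestTrace: SpanNode = {
  label: 'Request abc-123',
  durationMs: 70,
  children: [
    { label: 'API Gateway', durationMs: 2, children: [] },
    { label: 'Auth Service', durationMs: 15, children: [] },
    {
      label: 'Order Service',
      durationMs: 45,
      children: [
        { label: 'DB Query', durationMs: 12, children: [] },
        { label: 'Inventory Check', durationMs: 28, children: [] },
        { label: 'Cache Lookup', durationMs: 3, children: [] },
      ],
    },
    { label: 'Notification Svc', durationMs: 8, children: [] },
  ],
};

const slowestSpan = findBottleneck(requestTrace);
// { label: 'Inventory Check', selfMs: 28 }
```

Note that Order Service has the longest total duration (45ms) but only 2ms of self time, which is why looking at total duration alone would point at the wrong service.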
OpenTelemetry Implementation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

// Initialize OpenTelemetry SDK
// Note: `new Resource(...)` matches the 1.x line of @opentelemetry/resources;
// 2.x replaces it with `resourceFromAttributes(...)`.
const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    [ATTR_SERVICE_VERSION]: '1.2.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_ENDPOINT || 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instrument HTTP, Express, database clients, etc.
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
// Manual span creation for custom operations
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      // Nested span for inventory check
      const inventory = await tracer.startActiveSpan('checkInventory', async (childSpan) => {
        try {
          childSpan.setAttribute('order.id', orderId);
          const result = await inventoryService.check(orderId);
          childSpan.setAttribute('inventory.available', result.available);
          return result;
        } finally {
          // End the span even if the call throws, so it is still exported
          childSpan.end();
        }
      });

      // Nested span for payment
      const payment = await tracer.startActiveSpan('processPayment', async (childSpan) => {
        try {
          childSpan.setAttribute('order.id', orderId);
          childSpan.setAttribute('payment.amount', inventory.total);
          const result = await paymentService.charge(orderId, inventory.total);
          childSpan.setAttribute('payment.status', result.status);
          return result;
        } finally {
          childSpan.end();
        }
      });

      span.setAttribute('order.status', 'completed');
      span.setStatus({ code: SpanStatusCode.OK });
      return { orderId, payment };
    } catch (error) {
      // A caught value is not guaranteed to be an Error instance
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw error;
    } finally {
      span.end();
    }
  });
}
Trace Context Propagation
For traces to work across services, each service must propagate the trace context to downstream calls.
// Context propagation happens automatically with OpenTelemetry
// instrumentation for HTTP clients (axios, fetch, etc.)
// If manual propagation is needed:
import { propagation, context } from '@opentelemetry/api';

// Inject context into outgoing request headers
function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
  return headers;
}

// Extract context from incoming request headers
function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}
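To see what actually travels between services, it helps to look at the header itself. The sketch below assumes the W3C Trace Context format, which OpenTelemetry uses by default: a `traceparent` header of the form `version-traceid-spanid-flags`. The `TraceContext` type and both helper functions are hypothetical, and the parser is deliberately simplified to handle version `00` only.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// version "00", then trace id, parent span id, and flags, separated by dashes.
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a valid version-00 traceparent value.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const [, traceId, spanId, flags] = match;
  if (traceId === undefined || spanId === undefined || flags === undefined) return null;
  // All-zero identifiers are defined as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

const incomingHeader = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsedContext = parseTraceparent(incomingHeader);

// A downstream call keeps the trace id but carries the caller's own span id.
const outgoingHeader =
  parsedContext === null
    ? null
    : formatTraceparent({ ...parsedContext, spanId: 'b7ad6b7169203331' });
// '00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01'
```

The trace id stays constant across every hop, which is what lets a tracing backend stitch spans from different services into one tree.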
Alerting and On-Call
Metrics and logs are useless if nobody looks at them. Alerting bridges the gap between data collection and human action.
Alert Rules (evaluated by Prometheus, routed by Alertmanager)
# alerts.yml
groups:
  - name: api-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          description: "More than 5% of requests have been failing for the last 5 minutes"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2 seconds"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes
          > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

      # Database connection pool exhaustion
      - alert: DBConnectionPoolExhausted
        expr: |
          app_db_pool_active_connections
          / app_db_pool_max_connections
          > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool is 90%+ utilized"
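Each rule above carries a `for:` duration, which is what keeps a brief spike from paging anyone: the condition has to hold continuously for that long before the alert fires. The sketch below is a simplified model of that behavior, not Prometheus's actual implementation. It replays a series of rule evaluations and reports the alert state after each one; `Evaluation` and `replayAlert` are hypothetical names.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

interface Evaluation {
  atSeconds: number; // evaluation timestamp
  conditionTrue: boolean; // did the rule expression hold at this evaluation?
}

// Replay evaluations in time order and return the state after each one.
function replayAlert(evaluations: Evaluation[], forSeconds: number): AlertState[] {
  const states: AlertState[] = [];
  let activeSince: number | null = null;
  for (const evaluation of evaluations) {
    if (!evaluation.conditionTrue) {
      // Any false evaluation resets the timer.
      activeSince = null;
      states.push('inactive');
      continue;
    }
    if (activeSince === null) activeSince = evaluation.atSeconds;
    const heldFor = evaluation.atSeconds - activeSince;
    states.push(heldFor >= forSeconds ? 'firing' : 'pending');
  }
  return states;
}

// One evaluation per minute for ten minutes; the condition drops once at t=180s.
const evaluationSeries: Evaluation[] = Array.from({ length: 11 }, (_, i) => ({
  atSeconds: i * 60,
  conditionTrue: i !== 3,
}));

// Equivalent of `for: 5m`.
const alertStates = replayAlert(evaluationSeries, 300);
// pending x3, inactive, pending x5, firing x2
```

The single dip at t=180s restarts the clock, so the alert only fires at t=540s. A longer `for:` value trades detection speed for fewer flapping alerts.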
Alerting Best Practices
Good alerts:
- Actionable: someone can do something about it
- Urgent: it needs attention now, not tomorrow
- Based on symptoms: "users are seeing errors" not "CPU is high"
- Have runbooks: link to documentation on how to fix
Bad alerts:
- Non-actionable: "disk is 70% full" at 3 AM (no immediate action needed)
- Flapping: triggers and resolves repeatedly
- Too many: alert fatigue makes engineers ignore everything
- Based on causes: CPU can be high without affecting users
SLO / SLI / SLA
SLI (Service Level Indicator):
  A quantitative measure of a specific aspect of service quality.
  Example: "99.2% of requests completed in under 200ms last month"
SLO (Service Level Objective):
  A target value for an SLI.
  Example: "99.5% of requests must complete in under 200ms"
SLA (Service Level Agreement):
  A contract between provider and customer with consequences.
  Example: "99.9% uptime. If violated, customer gets 10% credit."
Relationship:
  SLI (measurement) ──▶ SLO (target) ──▶ SLA (contract)
// SLI tracking implementation
const sliRequestsTotal = new client.Counter({
  name: 'sli_requests_total',
  help: 'Total requests for SLI tracking',
  labelNames: ['sli_met'],
});

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    const isSuccess = res.statusCode < 500;
    const isFast = duration < 200;
    // SLI: successful requests under 200ms
    const sliMet = isSuccess && isFast;
    sliRequestsTotal.inc({ sli_met: sliMet.toString() });
  });
  next();
});

// SLO calculation in Prometheus. Both sides are wrapped in sum() so their
// label sets match; without it the division only pairs the sli_met="true"
// series with itself and always yields 1.
//   sum(rate(sli_requests_total{sli_met="true"}[30d]))
//   /
//   sum(rate(sli_requests_total[30d]))
// Target: > 0.995 (99.5%)
Error Budget
SLO: 99.5% availability per month
Error budget = 100% - 99.5% = 0.5% of requests can fail

In a month with 10,000,000 requests:
  Error budget = 50,000 failed requests allowed

If you have used 40,000 of your error budget:
  Remaining budget = 10,000 requests
  Slow down deployments, focus on reliability

If you have used 5,000 of your error budget:
  Remaining budget = 45,000 requests
  Ship features confidently
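The arithmetic above is simple enough to automate. The following sketch reproduces both scenarios; the `ErrorBudget` shape and the `errorBudget` function are hypothetical names, and any threshold for "slow down" versus "ship confidently" is left to the team rather than encoded here.

```typescript
interface ErrorBudget {
  allowedFailures: number;
  remainingFailures: number;
  consumedRatio: number; // 0 = untouched, 1 = fully spent
}

function errorBudget(sloTarget: number, totalRequests: number, failedRequests: number): ErrorBudget {
  // Round to whole requests so floating-point noise in (1 - sloTarget) does not leak out.
  const allowedFailures = Math.round(totalRequests * (1 - sloTarget));
  const consumedRatio =
    allowedFailures === 0 ? (failedRequests > 0 ? 1 : 0) : failedRequests / allowedFailures;
  return {
    allowedFailures,
    remainingFailures: allowedFailures - failedRequests,
    consumedRatio,
  };
}

// The two scenarios from the text: 99.5% SLO, 10,000,000 requests in the month.
const tightMonth = errorBudget(0.995, 10000000, 40000);
// { allowedFailures: 50000, remainingFailures: 10000, consumedRatio: 0.8 }

const healthyMonth = errorBudget(0.995, 10000000, 5000);
// { allowedFailures: 50000, remainingFailures: 45000, consumedRatio: 0.1 }
```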
Observability Tool Stack
┌─────────────────────────────────────────────────────────┐
│                  Grafana (Dashboards)                   │
│   Visualizes metrics, logs, and traces in one place    │
└─────────┬──────────────┬──────────────────┬─────────────┘
          │              │                  │
  ┌───────▼──────┐ ┌─────▼──────┐   ┌───────▼──────┐
  │  Prometheus  │ │    Loki    │   │   Jaeger /   │
  │  (metrics)   │ │   (logs)   │   │    Tempo     │
  │              │ │            │   │   (traces)   │
  └──────────────┘ └────────────┘   └──────────────┘
Alternative stacks:
ELK: Elasticsearch + Logstash + Kibana
Datadog: Metrics + Logs + Traces + APM (SaaS)
New Relic: Full-stack observability (SaaS)
Docker Compose for Local Observability Stack
# docker-compose.observability.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "4318:4318" # OTLP HTTP

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  grafana-data:
Key Takeaways
- Use structured JSON logging from day one. Retrofitting structured logging into an existing codebase is painful.
- Track the RED metrics (Rate, Errors, Duration) for every service. These three cover the large majority of production debugging needs.
- Distributed tracing is non-negotiable in a microservices architecture. Without it, debugging cross-service latency issues is guesswork.
- Alert on symptoms, not causes. Users do not care about CPU; they care about whether the page loads.
- Define SLOs before you have an outage. Error budgets give you a rational framework for balancing reliability and feature velocity.
- Start with a simple stack (Prometheus + Grafana + structured logs). Add distributed tracing when you move to microservices.
- Every alert must be actionable and have a linked runbook. If nobody can act on it, it should not page someone at 3 AM.
- Observability is an investment that pays off during incidents. The cost of building it is always less than the cost of debugging blind.