Observability is not monitoring. It's the ability to understand your system's internal state from external outputs. This guide shows how to build a complete stack using open-source tools.
The Three Pillars
Modern observability rests on three pillars:
- Metrics: Time-series data (CPU, requests/sec, latency percentiles)
- Logs: Discrete events with context
- Traces: Request flow through distributed systems
Metrics with Prometheus
Prometheus is the de facto standard for metrics collection. In a Node.js service, the prom-client library defines the metrics that Prometheus scrapes:
const promClient = require('prom-client');

// Use a dedicated registry instead of the library's global default.
const register = new promClient.Registry();

// Histogram of request durations, labeled by method, route, and status.
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 1, 1.5, 2, 3, 5],
  registers: [register] // register here only, not in the global registry as well
});
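The bucket list is easier to choose once it is clear what a bucket counts. The following is a minimal sketch of cumulative histogram buckets in plain Node.js, not prom-client's actual implementation; `createHistogram` and `observe` are hypothetical helper names used only for illustration.

```javascript
// Upper bounds in seconds, copied from the histogram definition above.
const BUCKETS = [0.1, 0.3, 0.5, 1, 1.5, 2, 3, 5];

// Hypothetical helper: builds an empty histogram state.
function createHistogram(bounds) {
  return {
    bounds,
    // One counter per upper bound, plus a final one for +Inf.
    counts: new Array(bounds.length + 1).fill(0),
    sum: 0,
  };
}

// Records one observation. Buckets are cumulative: a value is counted in
// every bucket whose upper bound is greater than or equal to it.
function observe(hist, seconds) {
  hist.bounds.forEach((bound, i) => {
    if (seconds <= bound) hist.counts[i] += 1;
  });
  hist.counts[hist.bounds.length] += 1; // the +Inf bucket counts everything
  hist.sum += seconds;
}

const hist = createHistogram(BUCKETS);
[0.05, 0.2, 0.2, 0.8, 4].forEach((s) => observe(hist, s));
console.log(hist.counts); // [ 1, 3, 3, 4, 4, 4, 4, 5, 5 ]
```

Because the smallest bound here is 0.1 seconds, every request faster than 100 ms lands in the same bucket. A service that usually answers in a few milliseconds would probably want finer bounds at the low end.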
Key Metrics to Track
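- RED (for request-driven services): Rate, Errors, Duration
- USE (for resources such as CPU, memory, and disks): Utilization, Saturation, Errors
- Custom business metrics (for example sign-ups or completed orders)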
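As an illustration of what the RED numbers mean, here is a small sketch that computes them in-process for one window of requests. In a Prometheus setup these values would normally come from queries over the stored series instead; the record shape `{ status, durationMs }` and the function name `redMetrics` are assumptions made for this example.

```javascript
// Computes RED metrics (Rate, Errors, Duration) for one window of requests.
// Each record has a hypothetical shape: { status: number, durationMs: number }.
function redMetrics(requests, windowSeconds) {
  const total = requests.length;
  const errors = requests.filter((r) => r.status >= 500).length;
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);

  // Nearest-rank percentile: the smallest sample with at least p% of
  // samples at or below it.
  const percentile = (p) => {
    if (durations.length === 0) return null;
    const rank = Math.ceil((p / 100) * durations.length);
    return durations[Math.max(rank, 1) - 1];
  };

  return {
    ratePerSecond: total / windowSeconds,
    errorRatio: total === 0 ? 0 : errors / total,
    p50Ms: percentile(50),
    p95Ms: percentile(95),
  };
}

// Ten requests over a 5-second window, two of them server errors.
const sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100].map((ms, i) => ({
  status: i < 2 ? 500 : 200,
  durationMs: ms,
}));
console.log(redMetrics(sample, 5));
```

Only 5xx responses count as errors here; whether 4xx responses should count is a judgment call that depends on the service.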
Centralized Logging
Structured logging enables powerful querying:
logger.info({
  event: 'user_login',
  user_id: user.id,
  ip: req.ip,
  duration_ms: 145,
  success: true
});
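The `logger` object above is not defined in this guide. As a rough sketch of what a structured logger does, the following writes one JSON object per line using only the Node.js standard library; `createLogger` is a hypothetical name, and a production service would more likely use an established logging library.

```javascript
// Minimal structured logger: one JSON object per line.
// `write` is injectable so output can go to stdout, a file, or a test buffer.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const log = (level) => (fields) => {
    // Level and timestamp come first; caller fields follow.
    const entry = { level, time: new Date().toISOString(), ...fields };
    write(JSON.stringify(entry));
    return entry;
  };
  return { info: log('info'), warn: log('warn'), error: log('error') };
}

const logger = createLogger();
logger.info({ event: 'user_login', user_id: 42, duration_ms: 145, success: true });
```

Because every entry is a flat JSON object, a log backend can filter on fields such as `event` or `success` instead of matching free text.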
Ship logs to a central system using the ELK stack or Grafana Loki:
- Elasticsearch: Full-text indexing makes queries powerful, but it is resource-heavy
- Loki: Lightweight; indexes only labels (in the style of Prometheus) rather than the full log text
Distributed Tracing
Understand request flow across microservices using OpenTelemetry:
// SpanStatusCode must be imported alongside trace.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('myapp');

app.get('/api/users', async (req, res, next) => {
  const span = tracer.startSpan('fetch_users');
  try {
    const users = await db.query('SELECT * FROM users');
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(users);
  } catch (error) {
    // Record the failure on the span, then hand the error to Express.
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    next(error);
  } finally {
    span.end();
  }
});
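A trace can only follow a request across services if each hop passes its trace ID along. OpenTelemetry's propagators normally handle this automatically; the sketch below shows the idea by hand using the W3C `traceparent` header format (`version-traceId-spanId-flags`). The function names are hypothetical, and the parsing is simplified compared with the full specification.

```javascript
const crypto = require('node:crypto');

// version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parses a traceparent header value; returns null if it is malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return {
    version,
    traceId,
    parentSpanId,
    sampled: (parseInt(flags, 16) & 1) === 1, // lowest bit is the sampled flag
  };
}

// Builds the header for an outgoing call: same trace ID, fresh span ID.
function childTraceparent(parent) {
  const spanId = crypto.randomBytes(8).toString('hex');
  const flags = parent.sampled ? '01' : '00';
  return `00-${parent.traceId}-${spanId}-${flags}`;
}

const incoming = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
console.log(childTraceparent(incoming));
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets a tracing backend reassemble the full request path.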
Tying It Together
The real power comes from correlation:
Trace ID in Logs
Include the trace ID in every log entry:
logger.info({
  // spanContext() replaces the older context() accessor in the current OpenTelemetry API.
  trace_id: span.spanContext().traceId,
  message: 'Processing request'
});
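Passing the span to every function that logs gets tedious quickly. One possible approach, sketched here with Node's built-in `AsyncLocalStorage`, binds the trace ID to the current asynchronous context so the logger can pick it up on its own. The helper names `withTrace`, `logEntry`, and `logInfo` are assumptions made for this example.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

const traceStore = new AsyncLocalStorage();

// Runs `fn` with a trace ID bound to the current async context.
function withTrace(traceId, fn) {
  return traceStore.run({ traceId }, fn);
}

// Builds a log entry, adding trace_id when one is bound.
function logEntry(fields) {
  const ctx = traceStore.getStore();
  return ctx ? { trace_id: ctx.traceId, ...fields } : { ...fields };
}

// Writes the entry as a single JSON line.
function logInfo(fields) {
  console.log(JSON.stringify({ level: 'info', ...logEntry(fields) }));
}

// Everything called inside the callback sees the same trace ID.
withTrace('4bf92f3577b34da6a3ce929d0e0e4736', () => {
  logInfo({ message: 'Processing request' });
});
```

In a web service, `withTrace` would typically wrap each request handler, for example in a middleware that reads the trace ID from the active span.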
Metrics to Traces
When a metric spikes, drill down to individual traces to find the root cause.
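Tracing backends usually offer this drill-down as a query. To make the idea concrete, here is a sketch that picks the slowest traces from a time window, assuming root-span records have already been fetched into memory; the record shape `{ traceId, startMs, durationMs }` and the function `slowestTraces` are hypothetical.

```javascript
// Returns the slowest root spans that started inside [fromMs, toMs).
// Span records have a hypothetical shape: { traceId, startMs, durationMs }.
function slowestTraces(spans, fromMs, toMs, limit = 5) {
  return spans
    .filter((s) => s.startMs >= fromMs && s.startMs < toMs) // copy, input is not mutated
    .sort((a, b) => b.durationMs - a.durationMs) // slowest first
    .slice(0, limit)
    .map((s) => ({ traceId: s.traceId, durationMs: s.durationMs }));
}

const spans = [
  { traceId: 'a1', startMs: 1000, durationMs: 120 },
  { traceId: 'b2', startMs: 1500, durationMs: 2300 },
  { traceId: 'c3', startMs: 1800, durationMs: 950 },
  { traceId: 'd4', startMs: 5000, durationMs: 4000 }, // outside the window
];

// Latency spiked between t=1000 and t=2000: list the two slowest traces.
console.log(slowestTraces(spans, 1000, 2000, 2));
```

The returned trace IDs are the ones to open in the tracing UI, and the same IDs can be used to search the logs.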
Dashboard Example
A useful dashboard shows:
- Request rate and latency (metrics)
- Error rate with links to error logs
- Slowest traces in the last hour
- Resource utilization trends
Summary
Start with metrics for an overview, use logs for context, and add tracing once requests cross service boundaries. Instrument early and correlate the three signals. The investment in observability pays off when production issues arise.