Building a Modern Observability Stack

A grounded walkthrough for building a useful observability stack with logs, metrics, traces, and sane dashboards.

Baikal Signal
From blind spots to useful visibility across services and incidents.

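Observability is not the same as monitoring. Monitoring tells you that a known failure mode has occurred; observability is the ability to understand your system's internal state from its external outputs, including failures you did not anticipate. This guide shows how to build a complete stack using open-source tools.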

The Three Pillars

Modern observability rests on three pillars:

  • Metrics: Time-series data (CPU, requests/sec, latency percentiles)
  • Logs: Discrete events with context
  • Traces: Request flow through distributed systems

Metrics with Prometheus

Prometheus is the de facto standard for metrics collection. In a Node.js service, the prom-client library defines the metrics that Prometheus scrapes:

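const promClient = require('prom-client');

// Custom registry that will expose this service's metrics.
const register = new promClient.Registry();

// Request latency histogram, labeled so it can be sliced by method, route, and status.
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 1, 1.5, 2, 3, 5] // bucket upper bounds, in seconds
});

// Add the histogram to the custom registry.
register.registerMetric(httpRequestDuration);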
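To make the `buckets` option concrete: a Prometheus histogram counts, for each upper bound, how many observations were less than or equal to it, so the counts are cumulative. The sketch below mimics that bookkeeping in plain JavaScript. `createHistogram` and `observe` are hypothetical names used for illustration; this is not how prom-client is implemented internally.

```javascript
// Illustrative only: a hand-rolled cumulative histogram, not prom-client internals.
function createHistogram(upperBounds) {
  const bounds = [...upperBounds].sort((a, b) => a - b);
  return {
    bounds,
    // counts[i] = number of observations <= bounds[i]; the last slot is +Inf.
    counts: new Array(bounds.length + 1).fill(0),
    sum: 0,
  };
}

function observe(histogram, value) {
  histogram.sum += value;
  histogram.bounds.forEach((bound, i) => {
    if (value <= bound) histogram.counts[i] += 1;
  });
  // The +Inf bucket counts every observation.
  histogram.counts[histogram.bounds.length] += 1;
}

const histogram = createHistogram([0.1, 0.3, 0.5, 1, 1.5, 2, 3, 5]);
[0.05, 0.2, 0.2, 0.7, 4, 9].forEach((seconds) => observe(histogram, seconds));

console.log(histogram.counts); // [ 1, 3, 3, 4, 4, 4, 4, 5, 6 ]
```

Note that the 9-second request only shows up in the +Inf slot. Anything slower than the largest bound loses its resolution, so it helps to choose bounds that bracket your latency targets.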

Key Metrics to Track

  • RED (for request-driven services): Rate, Errors, Duration
  • USE (for resources such as CPU, memory, and disk): Utilization, Saturation, Errors
  • Custom business metrics
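As a rough illustration of what the RED numbers mean, the sketch below derives them from a small window of request records. `redSummary` is a hypothetical helper, treating 5xx responses as errors is an assumption, and in a real stack these values would normally come from queries against Prometheus rather than from in-process arrays.

```javascript
// Illustrative only: derive RED numbers from a window of request records.
function redSummary(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Assumption: only 5xx responses count as errors.
  const errors = requests.filter((r) => r.status >= 500).length;
  // Nearest-rank 95th percentile.
  const rank = Math.max(1, Math.ceil(0.95 * durations.length));
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorRatio: requests.length === 0 ? 0 : errors / requests.length,
    p95DurationMs: durations.length === 0 ? null : durations[rank - 1],
  };
}

const sample = [
  { status: 200, durationMs: 120 },
  { status: 200, durationMs: 80 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 150 },
];
const summary = redSummary(sample, 2);

console.log(summary); // { ratePerSecond: 2, errorRatio: 0.25, p95DurationMs: 900 }
```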

Centralized Logging

Structured logging enables powerful querying:

logger.info({
  event: 'user_login',
  user_id: user.id,
  ip: req.ip,
  duration_ms: 145,
  success: true
});
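The `logger` in the snippet above is not defined in this article; any logger that writes one JSON object per line would fit. To show why that format makes querying easy, here is a minimal sketch with a hypothetical `logJson` helper that collects lines in memory and then filters them by field. A real setup would write to stdout and let the log backend run the query.

```javascript
// Illustrative only: a tiny JSON-lines logger plus a field-based query.
const lines = [];

function logJson(level, fields) {
  const entry = { level, time: new Date().toISOString(), ...fields };
  lines.push(JSON.stringify(entry));
}

logJson('info', { event: 'user_login', user_id: 42, duration_ms: 145, success: true });
logJson('info', { event: 'user_login', user_id: 7, duration_ms: 980, success: false });
logJson('info', { event: 'page_view', user_id: 42, duration_ms: 12 });

// Because every line is JSON, a query is just a filter on parsed fields.
const failedLogins = lines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.event === 'user_login' && entry.success === false);

console.log(failedLogins.map((entry) => entry.user_id)); // [ 7 ]
```

With free-text messages the same question would need a fragile regular expression; with structured fields it is an exact match.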

Ship logs to a central system using the ELK stack or Grafana Loki:

  • ELK (Elasticsearch, Logstash, Kibana): Powerful full-text search, but resource-heavy
  • Loki: Lightweight; indexes only labels, using the same label model as Prometheus
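For a feel of what shipping a log line to Loki involves, the sketch below builds a JSON body in the shape Loki's HTTP push endpoint (`/loki/api/v1/push`) accepts: a set of labels plus timestamped lines. `toLokiPayload` and the label names are hypothetical, and most deployments use an agent to ship logs instead of pushing from application code, so treat this as an illustration of the data model only.

```javascript
// Illustrative only: build a request body for Loki's HTTP push endpoint.
function toLokiPayload(labels, entries) {
  return {
    streams: [
      {
        stream: labels,
        // Each value is [timestamp in nanoseconds as a string, log line].
        values: entries.map(({ timeMs, line }) => [
          (BigInt(timeMs) * 1000000n).toString(),
          line,
        ]),
      },
    ],
  };
}

const payload = toLokiPayload(
  { app: 'myapp', level: 'info' }, // hypothetical labels
  [{ timeMs: 1700000000000, line: JSON.stringify({ event: 'user_login', success: true }) }]
);

console.log(JSON.stringify(payload, null, 2));
```

Only the labels are indexed, so keep them low-cardinality (service, environment, level) and leave details such as user IDs inside the log line itself.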

Distributed Tracing

Understand request flow across microservices using OpenTelemetry:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('myapp');

app.get('/api/users', async (req, res, next) => {
  const span = tracer.startSpan('fetch_users');

  try {
    const users = await db.query('SELECT * FROM users');
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(users);
  } catch (error) {
    // Attach the exception to the span and mark the span as failed.
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    // Pass the error to Express's error-handling middleware.
    next(error);
  } finally {
    span.end();
  }
});
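What ties spans from different services into one trace is context propagation: each outgoing request carries the trace ID in a header. OpenTelemetry instrumentation normally handles this for you using the W3C Trace Context `traceparent` header. The sketch below does it by hand to show the format; the function names are hypothetical and this is not a substitute for the SDK's propagators.

```javascript
const { randomBytes } = require('node:crypto');

// Illustrative only: W3C Trace Context "traceparent" handling by hand.
function newTraceparent() {
  const traceId = randomBytes(16).toString('hex'); // 32 hex characters
  const spanId = randomBytes(8).toString('hex'); // 16 hex characters
  return `00-${traceId}-${spanId}-01`; // version 00, "sampled" flag set
}

function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A downstream service keeps the trace ID but starts its own span.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  if (!parent) return newTraceparent(); // no valid context: start a new trace
  const spanId = randomBytes(8).toString('hex');
  return `00-${parent.traceId}-${spanId}-${parent.sampled ? '01' : '00'}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const outgoing = childTraceparent(incoming);

console.log(parseTraceparent(outgoing).traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

The trace ID stays constant across every hop, while each service contributes a fresh span ID. That shared ID is what lets a tracing backend stitch the spans back together.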

Tying It Together

The real power comes from correlating the three signals.

Trace ID in Logs

Include the trace ID in every log entry:

logger.info({
  // spanContext() returns the span's immutable identifiers, including the trace ID.
  trace_id: span.spanContext().traceId,
  message: 'Processing request'
});
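Passing the span to every function that logs gets tedious quickly. One possible approach, sketched below with Node's built-in `AsyncLocalStorage`, is to store the trace ID once per request and have the logger read it automatically. The `log` and `handleRequest` helpers are hypothetical, and the hard-coded IDs stand in for values that would come from the active span in a real service.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

// Illustrative only: carry a trace ID through async calls and stamp it on logs.
const requestContext = new AsyncLocalStorage();
const output = [];

function log(message, fields = {}) {
  const store = requestContext.getStore();
  const traceId = store ? store.traceId : undefined;
  output.push(JSON.stringify({ trace_id: traceId, message, ...fields }));
}

async function handleRequest(traceId) {
  // Everything awaited inside run() sees the same store.
  await requestContext.run({ traceId }, async () => {
    log('Processing request');
    await new Promise((resolve) => setTimeout(resolve, 5));
    log('Request finished');
  });
}

// Two overlapping requests; each log line still carries its own trace ID.
const done = Promise.all([handleRequest('trace-a'), handleRequest('trace-b')]).then(() => {
  output.forEach((line) => console.log(line));
  return output;
});
```

Even though the two requests interleave, no log line picks up the other request's ID, which is the property you need before logs can be joined to traces reliably.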

Metrics to Traces

When a metric spikes, drill down to individual traces to find the root cause.
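Drilling down requires some link from an aggregated metric back to a concrete request. Prometheus calls such links exemplars. The sketch below shows the idea in plain JavaScript by remembering one trace ID per latency bucket. The bucket bounds, `record` helper, and trace IDs are all hypothetical, the counts are per-bucket and not cumulative, and this does not reproduce the actual exemplar format.

```javascript
// Illustrative only: remember one example trace ID per latency bucket.
const bounds = [0.1, 0.5, 1, 5]; // upper bounds in seconds
const buckets = bounds.map((le) => ({ le, count: 0, exampleTraceId: null }));

function record(durationSeconds, traceId) {
  // Find the smallest bucket that contains this observation.
  const bucket = buckets.find((b) => durationSeconds <= b.le);
  if (!bucket) return; // slower than the largest bound; ignored in this sketch
  bucket.count += 1;
  bucket.exampleTraceId = traceId; // keep the most recent trace as the example
}

record(0.08, 'trace-001');
record(0.3, 'trace-002');
record(3.2, 'trace-003');
record(0.2, 'trace-004');

// "Latency spiked above 1s: which trace should I open?"
const slowExamples = buckets
  .filter((b) => b.le > 1 && b.exampleTraceId !== null)
  .map((b) => b.exampleTraceId);

console.log(slowExamples); // [ 'trace-003' ]
```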

Dashboard Example

A useful dashboard shows:

  • Request rate and latency (metrics)
  • Error rate with links to error logs
  • Slowest traces in the last hour
  • Resource utilization trends

Summary

Start with metrics for overview, use logs for context, and add tracing for distributed systems. Instrument early and correlate signals. The investment in observability pays off when production issues arise.