Building a Modern Observability Stack

Baikal Signal

From blind spots to useful visibility across services and incidents.

The Three Pillars
Metrics with Prometheus
Centralized Logging
Distributed Tracing
Tying It Together

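Observability is not the same as monitoring. Monitoring checks for failures you anticipated; observability is the ability to understand your system's internal state from its external outputs, including failures you did not anticipate. This guide shows how to build a complete stack using open-source tools.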

The Three Pillars

Modern observability rests on three pillars:

Metrics: Time-series data (CPU, requests/sec, latency percentiles)
Logs: Discrete events with context
Traces: Request flow through distributed systems

Metrics with Prometheus

Prometheus is the de facto standard for metrics collection:

const promClient = require('prom-client');

// Dedicated registry that holds the metrics this service exposes.
const register = new promClient.Registry();

// Histogram of request durations; bucket bounds are in seconds.
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.3, 0.5, 1, 1.5, 2, 3, 5]
});

register.registerMetric(httpRequestDuration);
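To make the bucket list concrete, here is a dependency-free sketch of how a histogram with those bounds counts observations into cumulative buckets. It is an illustration of the idea only, not prom-client's implementation; `countIntoBuckets` and the sample durations are made up for this example.

```javascript
// Illustrative only: mimics how a Prometheus-style histogram counts
// observations into cumulative "le" (less-than-or-equal) buckets.
const buckets = [0.1, 0.3, 0.5, 1, 1.5, 2, 3, 5]; // same bounds as above

function countIntoBuckets(bounds, observations) {
  // One counter per upper bound, plus the implicit +Inf bucket.
  const counts = new Map(bounds.map((b) => [String(b), 0]));
  counts.set('+Inf', 0);
  let sum = 0;

  for (const value of observations) {
    sum += value;
    for (const bound of bounds) {
      // Cumulative: a value increments every bucket whose bound is >= value.
      if (value <= bound) {
        counts.set(String(bound), counts.get(String(bound)) + 1);
      }
    }
    counts.set('+Inf', counts.get('+Inf') + 1);
  }
  return { counts, sum, count: observations.length };
}

// Hypothetical request durations in seconds.
const result = countIntoBuckets(buckets, [0.05, 0.2, 0.45, 1.2, 7]);
console.log([...result.counts.entries()]);
```

The 7-second request lands only in the `+Inf` bucket. If many real requests do that, the largest bound is probably too low for the service being measured.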

Key Metrics to Track

RED: Rate, Errors, Duration
USE: Utilization, Saturation, Errors
Custom business metrics
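As a rough illustration of what RED means in numbers, the sketch below derives rate, error ratio, and duration percentiles from a small in-memory list of requests. The records and the `redSummary` and `percentile` helpers are hypothetical; in a Prometheus setup these values would normally come from PromQL functions such as `rate()` and `histogram_quantile()` rather than application code.

```javascript
// Hypothetical request records for one service over a 60-second window.
const windowSeconds = 60;
const requests = [
  { status: 200, durationMs: 120 },
  { status: 200, durationMs: 95 },
  { status: 500, durationMs: 840 },
  { status: 200, durationMs: 210 },
  { status: 404, durationMs: 35 },
  { status: 200, durationMs: 150 },
];

// Nearest-rank percentile: the smallest sample with at least p% of samples
// at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function redSummary(records, seconds) {
  // Only 5xx responses count as errors here; 4xx are treated as client mistakes.
  const errors = records.filter((r) => r.status >= 500).length;
  const durations = records.map((r) => r.durationMs);
  return {
    ratePerSecond: records.length / seconds,
    errorRatio: records.length ? errors / records.length : 0,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
  };
}

const red = redSummary(requests, windowSeconds);
console.log(red);
```

With only six samples the p95 is simply the slowest request, which is a reminder that percentiles need a reasonable sample size before they say much.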

Centralized Logging

Structured logging enables powerful querying:

logger.info({
  event: 'user_login',
  user_id: user.id,
  ip: req.ip,
  duration_ms: 145,
  success: true
});
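The snippet above assumes a `logger` that accepts an object. A minimal sketch of such a logger, and of why structured entries are easy to query, might look like the following. `createLogger` and its `write` option are hypothetical names invented for this example, not the API of any particular logging library.

```javascript
// Minimal JSON-lines logger sketch: one JSON object per line.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const log = (level) => (fields) =>
    write(JSON.stringify({ time: new Date().toISOString(), level, ...fields }));
  return { info: log('info'), warn: log('warn'), error: log('error') };
}

// Capture lines in memory so they can be queried the way a log store would.
const lines = [];
const logger = createLogger((line) => lines.push(line));

logger.info({ event: 'user_login', user_id: 42, duration_ms: 145, success: true });
logger.info({ event: 'user_login', user_id: 7, duration_ms: 2310, success: false });
logger.warn({ event: 'cache_miss', key: 'profile:42' });

// Because every entry is an object, "failed logins" is a field filter,
// not a fragile regular expression over free text.
const failedLogins = lines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.event === 'user_login' && entry.success === false);

console.log(failedLogins.map((entry) => entry.user_id));
```

A central log system applies the same kind of field filter, just at a much larger scale.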

Ship logs to a central system using the ELK stack or Grafana Loki:

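ELK (Elasticsearch, Logstash, Kibana): Powerful full-text search, but resource-heavy
Loki: Lightweight because it indexes only labels rather than full log content; uses Prometheus-style labels and is queried from Grafana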

Distributed Tracing

Understand request flow across microservices using OpenTelemetry:

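const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('myapp');

// `app` (Express) and `db` are assumed to be set up elsewhere.
app.get('/api/users', async (req, res, next) => {
  const span = tracer.startSpan('fetch_users');

  try {
    const users = await db.query('SELECT * FROM users');
    span.setStatus({ code: SpanStatusCode.OK });
    res.json(users);
  } catch (error) {
    // Record the failure on the span, then hand the error to Express.
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    next(error);
  } finally {
    span.end();
  }
});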
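For a request to show up as one trace across services, the trace ID has to travel with each outgoing call. OpenTelemetry's instrumentation normally handles this for you, by default using the W3C Trace Context `traceparent` header. The sketch below parses and builds that header by hand purely to show what is being carried; `parseTraceparent` and `childTraceparent` are hypothetical helpers, not OpenTelemetry APIs.

```javascript
const { randomBytes } = require('node:crypto');

// W3C Trace Context header: version-traceId-parentId-flags, lowercase hex.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Continue the caller's trace if it sent a valid header, else start a new one.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  const traceId = parent ? parent.traceId : randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex'); // this service's own span
  const flags = parent && !parent.sampled ? '00' : '01';
  return `00-${traceId}-${spanId}-${flags}`;
}

// Example header value in the format defined by the specification.
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const outgoing = childTraceparent(incoming);
console.log(outgoing);
```

The trace ID stays the same on every hop while the span ID changes, which is what lets a tracing backend reassemble the request path.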

Tying It Together

The real power comes from correlation:

Trace ID in Logs

Include trace ID in every log entry:

logger.info({
  // spanContext() is the current OpenTelemetry API accessor for span identifiers.
  trace_id: span.spanContext().traceId,
  message: 'Processing request'
});
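Adding the trace ID by hand at every call site is easy to forget. One possible approach, sketched here with Node's built-in `AsyncLocalStorage`, is to store the ID once per request and have the logger attach it automatically. The logger wrapper, `handleRequest`, and the trace ID values are hypothetical; in a real service the ID would come from the active span.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

const requestContext = new AsyncLocalStorage();
const entries = []; // stand-in for a real log sink

// Hypothetical logger wrapper: merges the current trace_id into every entry.
const logger = {
  info(fields) {
    const store = requestContext.getStore();
    entries.push({ trace_id: store ? store.traceId : undefined, ...fields });
  },
};

async function loadUser(id) {
  await new Promise((resolve) => setTimeout(resolve, 5)); // simulate I/O
  logger.info({ message: 'Loaded user', user_id: id });
}

// Stand-in for a request handler. Everything called inside run(), including
// code after an await, sees the same store.
function handleRequest(traceId, userId) {
  return requestContext.run({ traceId }, async () => {
    logger.info({ message: 'Processing request' });
    await loadUser(userId);
  });
}

// Two overlapping requests keep their own trace IDs.
const done = Promise.all([
  handleRequest('aaaa0000aaaa0000aaaa0000aaaa0000', 1),
  handleRequest('bbbb1111bbbb1111bbbb1111bbbb1111', 2),
]).then(() => entries);

done.then((all) => console.log(all));
```

Note that `loadUser` never receives the trace ID as an argument, yet its log entry still carries the right one.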

Metrics to Traces

When a metric spikes, drill down to individual traces to find the root cause.
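The drill-down step can be sketched as a filter over trace summaries: take the label and time window of the spiking series, then pull the slowest matching traces. The trace records and the `slowestTraces` helper below are hypothetical; a real tracing backend would expose this through its own query interface.

```javascript
// Hypothetical trace summaries, as a tracing backend might return them.
const traces = [
  { traceId: 'a1', route: '/api/users', startMs: 1000, durationMs: 120 },
  { traceId: 'b2', route: '/api/users', startMs: 61000, durationMs: 2400 },
  { traceId: 'c3', route: '/api/orders', startMs: 62000, durationMs: 3100 },
  { traceId: 'd4', route: '/api/users', startMs: 65000, durationMs: 1800 },
  { traceId: 'e5', route: '/api/users', startMs: 130000, durationMs: 90 },
];

// Given the route and time window of a spiking metric, return the slowest
// matching traces, slowest first.
function slowestTraces(all, { route, fromMs, toMs, limit = 3 }) {
  return all
    .filter((t) => t.route === route && t.startMs >= fromMs && t.startMs < toMs)
    .sort((a, b) => b.durationMs - a.durationMs)
    .slice(0, limit);
}

// Suppose latency for /api/users spiked during the second minute.
const suspects = slowestTraces(traces, {
  route: '/api/users',
  fromMs: 60000,
  toMs: 120000,
});
console.log(suspects.map((t) => t.traceId));
```

This only works if metrics and traces share labels such as the route name, which is a good reason to keep naming consistent across all three signals.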

Dashboard Example

A useful dashboard shows:

Request rate and latency (metrics)
Error rate with links to error logs
Slowest traces in the last hour
Resource utilization trends

Summary

Start with metrics for an overview, use logs for context, and add tracing once requests cross service boundaries. Instrument early and correlate the three signals. The investment in observability pays off when production issues arise.
