In distributed systems, you can't fix what you can't see. Observability is the discipline of building systems whose internal state can be inferred from external signals. SRE (Site Reliability Engineering, the practice that came out of Google) is the operational philosophy that turns observability into reliability.
This lesson covers the working vocabulary: the three pillars, SLIs/SLOs/error budgets, on-call practices, and what good looks like.
The Three Pillars of Observability
Metrics
Aggregated numerical data. Request rate, error rate, latency percentiles, queue depth. Cheap to store, fast to query, but lossy: you can't go back and ask new questions.
Logs
Discrete events with context. "User X attempted action Y at time Z." Rich detail, expensive at scale. Best for after-the-fact investigation.
Traces
A single request's journey across services. Distributed tracing shows where time is spent. Critical for debugging in microservices.
Metrics
The Google SRE book defines four golden signals every service should expose:
Latency
How long requests take. Track percentiles (p50, p95, p99), not averages; averages hide tail latency.
Traffic
How many requests are coming in. Per second, per minute. Different units for different services.
Errors
Rate of failed requests. Subdivide by error class: 5xx (server bug) vs. 4xx (client issue) vs. circuit-breaker rejection.
Saturation
How "full" your service is. CPU, memory, queue depth, thread pool utilization. Saturation predicts errors before they happen.
Stack: Prometheus / Grafana / OpenTelemetry
Prometheus: metrics collection and storage. Pull-based scraping, a time-series database, PromQL for queries. The open-source default.
Grafana: visualization. Dashboards, alerts. Standard companion to Prometheus.
Datadog / New Relic / Honeycomb: commercial alternatives with broader feature sets.
OpenTelemetry: vendor-neutral instrumentation. The future-proof choice.
Logs
Structured logging is non-negotiable. Plain-text logs are unsearchable at scale.
Plain text
Unsearchable, unparsable
`2024-01-15 14:23:01 INFO User logged in: alice@example.com from 192.168.1.1`. Hard to filter, aggregate, or extract fields.
Structured (JSON)
Queryable, aggregatable
`{"timestamp": "...", "level": "INFO", "msg": "user.login", "user_email": "alice@...", "ip": "192.168.1.1"}`. Filter by any field, aggregate by any dimension.
What to log
Request boundaries
Every request entering and leaving a service. With request ID, user ID, tenant ID for correlation.
Errors
Stack traces, error context, what was being attempted.
Significant business events
Order placed, payment processed. Useful for debugging business-logic issues.
External calls
Outgoing API calls: who, what, latency. Critical for debugging "is it us or them?"
What not to log: PII (without explicit consent and proper redaction), passwords, full credit card numbers, secret tokens. Logs leak.
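Redaction is safer as code than as a convention every call site must remember. A hypothetical sketch; the helper name and field list are assumptions to adapt:

```python
# Hypothetical redaction helper: scrub sensitive fields before emitting a log.
# The SENSITIVE set is an illustrative assumption; maintain yours deliberately.
SENSITIVE = {"password", "credit_card", "ssn", "authorization", "api_key"}

def redact(fields: dict) -> dict:
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE else value
        for key, value in fields.items()
    }

print(redact({"user_email": "alice@example.com", "password": "hunter2"}))
# {'user_email': 'alice@example.com', 'password': '[REDACTED]'}
```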
Stack: ELK / EFK / Loki
Elasticsearch + Logstash + Kibana (ELK): the classic.
Elasticsearch + Fluentd + Kibana (EFK): swaps Logstash for Fluentd, a lighter, CNCF-hosted collector.
Grafana Loki: log aggregation with Prometheus-style label-based indexing. Cheaper to operate than Elasticsearch at scale.
Distributed Tracing
A trace follows a single request through every service it touches.
```
[ User request ]
      │
      ▼
[ API Gateway ]       Span: 5ms
      │
      ▼
[ Auth Service ]      Span: 20ms
      │
      ▼
[ Order Service ]     Span: 80ms
      ├──► [ Inventory Service ]   Span: 30ms
      └──► [ Payment Service ]     Span: 200ms
                └──► [ Stripe ]    Span: 180ms
```
Each box is a span. The whole tree is a trace. Critical for understanding where time is actually spent.
OpenTelemetry
Vendor-neutral standard. Instrument once, export to any backend. Replaces Jaeger/Zipkin client libraries.
Honeycomb
Premium tracing platform built around wide, high-cardinality events. Excellent for exploratory debugging.
Cloud providers' native tracing services: easy onboarding, but tied to that cloud.
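A minimal OpenTelemetry sketch in Python showing how nested spans form a trace; the span names echo the diagram above, and the console exporter stands in for whatever backend you actually run.

```python
# Minimal OpenTelemetry tracing sketch (pip install opentelemetry-sdk).
# Span names are illustrative; ConsoleSpanExporter stands in for a real backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

# Nested `with` blocks produce parent/child spans -- one trace, many spans.
with tracer.start_as_current_span("order.place") as span:
    span.set_attribute("order.id", "ord-123")
    with tracer.start_as_current_span("inventory.reserve"):
        pass  # call inventory service
    with tracer.start_as_current_span("payment.charge"):
        pass  # call payment service
```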
Correlation IDs
Every request gets a request ID that propagates through every service it touches. Logs include it; traces use it; metrics tag with it. Then "find everything related to this request" is a single search.
Generate at the edge (API gateway), inject into request headers, propagate through every service. Every log line, every span, every metric should carry it.
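A sketch of the propagation pattern using Python contextvars; `X-Request-ID` is a common header convention, and the helper names here are assumptions.

```python
# Correlation ID pattern: generate at the edge, carry it everywhere.
# Helper names are illustrative; X-Request-ID is a common header convention.
import uuid
from contextvars import ContextVar

request_id: ContextVar[str] = ContextVar("request_id", default="-")

def handle_incoming(headers: dict) -> None:
    # Reuse the caller's ID if present; otherwise this is the edge: generate one.
    request_id.set(headers.get("X-Request-ID") or str(uuid.uuid4()))

def outgoing_headers() -> dict:
    # Inject the ID into every downstream call.
    return {"X-Request-ID": request_id.get()}

def log(msg: str) -> None:
    # Every log line carries the ID, so one search finds the whole request.
    print(f'{{"request_id": "{request_id.get()}", "msg": "{msg}"}}')

handle_incoming({})        # edge: no inbound ID, so generate one
log("order.received")
print(outgoing_headers())  # pass to the next service
```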
SLI, SLO, Error Budget
The SRE framework that turns reliability into a math problem.
SLI (Service Level Indicator)
A specific, measurable metric. "Percentage of API requests completing in <500ms." A single number.
SLO (Service Level Objective)
A target for the SLI over a window. "99.9% of requests complete in <500ms over a rolling 30-day window."
Error budget
The amount of unreliability you can spend. 99.9% SLO = 0.1% error budget = ~43 minutes per month.
The error budget is the magic. It turns "should we ship faster or be more reliable?" into a quantitative question: budget remaining, keep shipping; budget exhausted, slow down and spend engineering time on reliability.
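The arithmetic is simple enough to check by hand; a quick sketch:

```python
# Error budget arithmetic for an availability SLO over a 30-day window.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} min/month of budget")
# 99.00% SLO -> 432.0 min/month of budget
# 99.90% SLO -> 43.2 min/month of budget
# 99.99% SLO -> 4.3 min/month of budget
```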
Setting SLOs
The hardest part. Some heuristics:
User-facing reality
The SLO should match what users actually experience. "99.9% of homepage loads complete in 2s", measured at the user, not at the load balancer.
Cost vs. value
Each "9" costs disproportionately more. 99.9% might be 100x the cost of 99%. Pick what business actually needs.
Start with what you have
Look at the last 90 days. If you've been at 99.5%, target 99.5%. Don't commit to 99.99% if you've never measured it.
Review and adjust
SLOs aren't set once. Review them quarterly. Tighten where the business demands it; loosen where the cost outweighs the value.
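The "start with what you have" heuristic is just a ratio over data you already collect; a sketch with made-up numbers:

```python
# "Start with what you have": measure achieved availability before committing.
# The request counts are made-up illustrative numbers.
good_requests = 99_542_113
total_requests = 100_000_000

achieved = good_requests / total_requests
print(f"Achieved over the last 90 days: {achieved:.3%}")  # 99.542%
# Target what you've demonstrated (99.5%), not an aspiration (99.99%).
```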
On-Call
The cultural and operational practice that makes reliability sustainable.
Rotation
Engineers rotate through on-call duty. Typically 1 week at a time. Primary + secondary backup.
Runbooks
Step-by-step procedures for common alerts. "If X is firing, check Y, then Z." Living documents โ update when reality changes.
Alert quality
Alert only on user-impacting issues. False alerts erode trust and cause burnout. Aim for <1 false alert per shift.
Postmortems
After every incident, document what happened, what was learned, and what changes follow. Blameless culture: focus on systems, not individuals.
Practice the Failures
Production reliability comes from treating failure as expected.
Chaos engineering
Inject failures intentionally to verify resilience. Netflix's Chaos Monkey kills random instances. Larger experiments: latency injection, network partitions.
Game days
Scheduled exercises. "What happens if the database fails over?" Run it during business hours; observe how the system and team respond.
Postmortem-driven improvements
Each postmortem identifies action items: automation, runbook updates, architectural changes. Track them; close the loop.
Synthetic monitoring
Continuous fake user transactions. Catch problems before real users do.
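A minimal synthetic probe; the URL and latency budget are placeholder assumptions, and a real setup would run this on a schedule from multiple regions.

```python
# Minimal synthetic monitor: fake a user transaction, record the outcome.
# URL and latency budget are placeholder assumptions.
import time
import urllib.request

URL = "https://example.com/health"
LATENCY_BUDGET_S = 2.0

def probe() -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    elapsed = time.monotonic() - start
    if not ok or elapsed > LATENCY_BUDGET_S:
        print(f"ALERT: probe failed (ok={ok}, latency={elapsed:.2f}s)")
    else:
        print(f"probe ok in {elapsed:.2f}s")

probe()  # real setups run this every minute from several regions
```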
What Good Observability Looks Like
For a healthy SaaS service:
Single dashboard
One dashboard you check first thing: golden signals plus SLO compliance. If green, all's well. If red, you know where to look.
Tracing in 1 click
From any error log, you can jump to the trace, see all related logs, see all related metrics. Tooling that connects the three pillars.
Symptom-based alerts
Alerts on "users are seeing errors," not "CPU is high." CPU might be high benignly. User-facing symptoms always need attention.
Runbook per alert
Every alert has a runbook. Steps to investigate, steps to mitigate. New on-call engineers can handle their first shift.
Recap
The three pillars: metrics (aggregated), logs (events), traces (request flow). Use all three.
Golden signals: latency, traffic, errors, saturation. Track these for every service.
Structured logging is mandatory. Correlation IDs propagate everywhere.
Distributed tracing reveals where time is spent across services.
SLI / SLO / Error Budget: the framework that turns reliability into math.