In distributed systems, you can't fix what you can't see. Observability is the discipline of building systems whose internal state can be inferred from external signals. SRE (Site Reliability Engineering, the practice that came out of Google) is the operational philosophy that turns observability into reliability.
This lesson covers the working vocabulary: the three pillars, SLIs/SLOs/error budgets, on-call practices, and what good looks like.
The Three Pillars of Observability
Metrics
Aggregated numerical data. Request rate, error rate, latency percentiles, queue depth. Cheap to store, fast to query, but lossy: you can't go back and ask new questions.
Logs
Discrete events with context. "User X attempted action Y at time Z." Rich detail, expensive at scale. Best for after-the-fact investigation.
Traces
A single request's journey across services. Distributed tracing shows where time is spent. Critical for debugging in microservices.
Metrics
The Google SRE book defines four golden signals every service should expose:
Latency
How long requests take. Track percentiles (p50, p95, p99), not averages; averages hide tail latency.
Traffic
How many requests are coming in. Per second, per minute. Different units for different services.
Errors
Rate of failed requests. Subdivide by error class: 5xx (server bug) vs. 4xx (client issue) vs. circuit-breaker rejection.
Saturation
How "full" your service is. CPU, memory, queue depth, thread pool utilization. Saturation predicts errors before they happen.
Stack: Prometheus / Grafana / OpenTelemetry
Prometheus: metrics collection and storage. Pull-based scraping, a time-series database, PromQL for queries. The open-source default.
Grafana: visualization. Dashboards, alerts. Standard companion to Prometheus.
Datadog / New Relic / Honeycomb: commercial alternatives with broader feature sets.
OpenTelemetry: vendor-neutral instrumentation. The future-proof choice.
Logs
Structured logging is non-negotiable. Plain-text logs are unsearchable at scale.
Plain text
Unsearchable, unparsable
`2024-01-15 14:23:01 INFO User logged in: alice@example.com from 192.168.1.1`. Hard to filter, aggregate, or extract fields.
Structured (JSON)
Queryable, aggregatable
`{"timestamp": "...", "level": "INFO", "msg": "user.login", "user_email": "alice@...", "ip": "192.168.1.1"}`. Filter by any field, aggregate by any dimension.
What to log
Request boundaries
Every request entering and leaving a service. With request ID, user ID, tenant ID for correlation.
Errors
Stack traces, error context, what was being attempted.
Significant business events
Order placed, payment processed. Useful for debugging business-logic issues.
External calls
Outgoing API calls: who, what, latency. Critical for debugging "is it us or them?"
What not to log: PII (without explicit consent and proper redaction), passwords, full credit card numbers, secret tokens. Logs leak.
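Redaction is safer as code than as a convention every call site must remember. A hypothetical sketch; the helper name and field list are assumptions to adapt:

```python
# Hypothetical redaction helper: scrub sensitive fields before emitting a log.
# The SENSITIVE set is an illustrative assumption; maintain yours deliberately.
SENSITIVE = {"password", "credit_card", "ssn", "authorization", "api_key"}

def redact(fields: dict) -> dict:
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE else value
        for key, value in fields.items()
    }

print(redact({"user_email": "alice@example.com", "password": "hunter2"}))
# {'user_email': 'alice@example.com', 'password': '[REDACTED]'}
```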
Stack: ELK / EFK / Loki
Elasticsearch + Logstash + Kibana (ELK): the classic.
Elasticsearch + Fluentd + Kibana (EFK): swaps Logstash for Fluentd, a lighter, CNCF-hosted collector.
Grafana Loki: log aggregation with Prometheus-style label-based indexing. Cheaper to operate than Elasticsearch at scale.
Distributed Tracing
A trace follows a single request through every service it touches.
```
[ User request ]
      │
      ▼
[ API Gateway ]       Span: 5ms
      │
      ▼
[ Auth Service ]      Span: 20ms
      │
      ▼
[ Order Service ]     Span: 80ms
      ├──► [ Inventory Service ]   Span: 30ms
      └──► [ Payment Service ]     Span: 200ms
                └──► [ Stripe ]    Span: 180ms
```
Each box is a span. The whole tree is a trace. Critical for understanding where time is actually spent.
OpenTelemetry
Vendor-neutral standard. Instrument once, export to any backend. Replaces Jaeger/Zipkin client libraries.
Honeycomb
Premium tracing platform built around wide, high-cardinality events. Excellent for exploratory debugging.
Cloud providers' native tracing services: easy onboarding, but tied to that cloud.
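A minimal OpenTelemetry sketch in Python showing how nested spans form a trace; the span names echo the diagram above, and the console exporter stands in for whatever backend you actually run.

```python
# Minimal OpenTelemetry tracing sketch (pip install opentelemetry-sdk).
# Span names are illustrative; ConsoleSpanExporter stands in for a real backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("order-service")

# Nested `with` blocks produce parent/child spans -- one trace, many spans.
with tracer.start_as_current_span("order.place") as span:
    span.set_attribute("order.id", "ord-123")
    with tracer.start_as_current_span("inventory.reserve"):
        pass  # call inventory service
    with tracer.start_as_current_span("payment.charge"):
        pass  # call payment service
```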
Correlation IDs
Every request gets a request ID that propagates through every service it touches. Logs include it; traces use it; metrics tag with it. Then "find everything related to this request" is a single search.
Generate at the edge (API gateway), inject into request headers, propagate through every service. Every log line, every span, every metric should carry it.
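A sketch of the propagation pattern using Python contextvars; `X-Request-ID` is a common header convention, and the helper names here are assumptions.

```python
# Correlation ID pattern: generate at the edge, carry it everywhere.
# Helper names are illustrative; X-Request-ID is a common header convention.
import uuid
from contextvars import ContextVar

request_id: ContextVar[str] = ContextVar("request_id", default="-")

def handle_incoming(headers: dict) -> None:
    # Reuse the caller's ID if present; otherwise this is the edge: generate one.
    request_id.set(headers.get("X-Request-ID") or str(uuid.uuid4()))

def outgoing_headers() -> dict:
    # Inject the ID into every downstream call.
    return {"X-Request-ID": request_id.get()}

def log(msg: str) -> None:
    # Every log line carries the ID, so one search finds the whole request.
    print(f'{{"request_id": "{request_id.get()}", "msg": "{msg}"}}')

handle_incoming({})        # edge: no inbound ID, so generate one
log("order.received")
print(outgoing_headers())  # pass to the next service
```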
SLI, SLO, Error Budget
The SRE framework that turns reliability into a math problem.
SLI (Service Level Indicator)
A specific, measurable metric. "Percentage of API requests completing in <500ms." A single number.
SLO (Service Level Objective)
A target for the SLI over a window. "99.9% of requests complete in <500ms over a rolling 30-day window."
Error budget
The amount of unreliability you can spend. 99.9% SLO = 0.1% error budget = ~43 minutes per month.
The error budget is the magic. It turns "should we ship faster or be more reliable?" into a quantitative question: budget remaining, keep shipping; budget exhausted, slow down and spend engineering time on reliability.
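The arithmetic is simple enough to check by hand; a quick sketch:

```python
# Error budget arithmetic for an availability SLO over a 30-day window.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} min/month of budget")
# 99.00% SLO -> 432.0 min/month of budget
# 99.90% SLO -> 43.2 min/month of budget
# 99.99% SLO -> 4.3 min/month of budget
```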
Setting SLOs
The hardest part. Some heuristics:
User-facing reality
The SLO should match what users actually experience. "99.9% of homepage loads complete in 2s", measured at the user, not at the load balancer.
Cost vs. value
Each "9" costs disproportionately more. 99.9% might be 100x the cost of 99%. Pick what business actually needs.
Start with what you have
Look at the last 90 days. If you've been at 99.5%, target 99.5%. Don't commit to 99.99% if you've never measured it.
Review and adjust
SLOs aren't set once. Review them quarterly. Tighten where the business demands it; loosen where the cost outweighs the value.
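The "start with what you have" heuristic is just a ratio over data you already collect; a sketch with made-up numbers:

```python
# "Start with what you have": measure achieved availability before committing.
# The request counts are made-up illustrative numbers.
good_requests = 99_542_113
total_requests = 100_000_000

achieved = good_requests / total_requests
print(f"Achieved over the last 90 days: {achieved:.3%}")  # 99.542%
# Target what you've demonstrated (99.5%), not an aspiration (99.99%).
```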
On-Call
The cultural and operational practice that makes reliability sustainable.
Rotation
Engineers rotate through on-call duty. Typically 1 week at a time. Primary + secondary backup.
Runbooks
Step-by-step procedures for common alerts. "If X is firing, check Y, then Z." Living documents โ update when reality changes.
Alert quality
Alert only on user-impacting issues. False alerts erode trust and cause burnout. Aim for <1 false alert per shift.
Postmortems
After every incident, document what happened, what was learned, and what changes follow. Blameless culture: focus on systems, not individuals.
Practice the Failures
Production reliability comes from treating failure as expected.
Chaos engineering
Inject failures intentionally to verify resilience. Netflix's Chaos Monkey kills random instances. Larger experiments: latency injection, network partitions.
Game days
Scheduled exercises. "What happens if the database fails over?" Run it during business hours; observe how the system and team respond.
Postmortem-driven improvements
Each postmortem identifies action items: automation, runbook updates, architectural changes. Track them; close the loop.
Synthetic monitoring
Continuous fake user transactions. Catch problems before real users do.
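A minimal synthetic probe; the URL and latency budget are placeholder assumptions, and a real setup would run this on a schedule from multiple regions.

```python
# Minimal synthetic monitor: fake a user transaction, record the outcome.
# URL and latency budget are placeholder assumptions.
import time
import urllib.request

URL = "https://example.com/health"
LATENCY_BUDGET_S = 2.0

def probe() -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    elapsed = time.monotonic() - start
    if not ok or elapsed > LATENCY_BUDGET_S:
        print(f"ALERT: probe failed (ok={ok}, latency={elapsed:.2f}s)")
    else:
        print(f"probe ok in {elapsed:.2f}s")

probe()  # real setups run this every minute from several regions
```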
What Good Observability Looks Like
For a healthy SaaS service:
Single dashboard
One dashboard you check first thing: golden signals plus SLO compliance. If green, all's well. If red, you know where to look.
Tracing in 1 click
From any error log, you can jump to the trace, see all related logs, see all related metrics. Tooling that connects the three pillars.
Symptom-based alerts
Alerts on "users are seeing errors," not "CPU is high." CPU might be high benignly. User-facing symptoms always need attention.
Runbook per alert
Every alert has a runbook. Steps to investigate, steps to mitigate. New on-call engineers can handle their first shift.
Recap
The three pillars: metrics (aggregated), logs (events), traces (request flow). Use all three.
Golden signals: latency, traffic, errors, saturation. Track these for every service.
Structured logging is mandatory. Correlation IDs propagate everywhere.
Distributed tracing reveals where time is spent across services.
SLI / SLO / Error Budget: the framework that turns reliability into math.