
Production-ready Kubernetes Part 1 - Observability Foundations

Designing Systems That Can Explain Themselves

2/24/2026

Modern distributed systems are rich in telemetry.

Dashboards glow with metrics. Logs stream endlessly. Traces weave through services.

Yet during real production incidents, many teams experience a familiar and unsettling reality:

  • “Why is latency increasing?”
  • “Which dependency is failing?”
  • “Is this impacting users?”
  • “Where should we even start looking?”

Despite abundant data, clarity is often scarce.

This paradox reveals a fundamental issue: most observability implementations are installed, not designed.

Observability is frequently approached as a tooling exercise — deploying Prometheus, Grafana, log aggregation, tracing collectors — rather than as a deliberate system design discipline grounded in reliability objectives and operational questions.

The result is telemetry without understanding.

Observability is not defined by the volume of data collected, but by the system’s ability to explain its behavior under normal and abnormal conditions.


Observability Begins With Intent

Before discussing logs, metrics, or traces, we must establish a more foundational layer:

  • What does “healthy” mean for this system?
  • Which degradations are unacceptable?
  • How do we measure reliability from a user’s perspective?

This is where SLIs, SLOs, and SLAs enter the conversation.

SLI — Service Level Indicator

An SLI is a measurement of system behavior.

It represents a quantifiable signal that reflects user experience or service correctness, such as:

  • Request latency
  • Error rate
  • Availability
  • Data freshness

An SLI is not merely a metric — it is a meaningful metric tied to perceived reliability.

A critical distinction often overlooked:

System metrics (CPU, memory, disk usage) are rarely SLIs.

They describe resource utilization, not service quality.

A system can exhibit low CPU usage while users experience high latency. It can consume significant memory while functioning perfectly.

SLIs should describe what users experience, not what infrastructure consumes.
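The distinction can be made concrete: an SLI is computed from user-facing request outcomes, not from resource gauges. A minimal sketch in Python (the sample data is illustrative, not from a real system):

```python
# Sketch: computing user-facing SLIs from request samples.
# Sample data is illustrative only.

requests = [
    # (latency_ms, succeeded)
    (120, True), (95, True), (310, True), (80, False),
    (200, True), (150, True), (450, True), (90, True),
]

# Availability SLI: fraction of successful requests.
availability = sum(ok for _, ok in requests) / len(requests)

# Latency SLI: fraction of requests served under a 300 ms threshold.
under_threshold = sum(1 for ms, _ in requests if ms < 300) / len(requests)

print(f"availability SLI: {availability:.3f}")            # 0.875
print(f"latency-under-300ms SLI: {under_threshold:.3f}")  # 0.750
```

Note that CPU or memory usage appears nowhere in the calculation: both SLIs are derived entirely from what requests actually experienced.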

SLO — Service Level Objective

An SLO defines the target reliability level for an SLI.

Examples:

  • 99.9% successful requests
  • 95% of requests under 300ms
  • Data updated within 5 seconds

SLOs transform measurements into operational commitments.

They introduce:

  • Clear definitions of “good enough”
  • Error budgets
  • Rational prioritization
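An error budget follows directly from the SLO by arithmetic. A back-of-the-envelope sketch for a 99.9% availability target over a 30-day window:

```python
# Error budget implied by an availability SLO over a 30-day window.
slo = 0.999                    # 99.9% successful requests
window_minutes = 30 * 24 * 60  # 43,200 minutes in 30 days

budget_minutes = (1 - slo) * window_minutes
print(f"error budget: {budget_minutes:.1f} minutes of full downtime")  # 43.2
```

Those 43.2 minutes are what the team can "spend" on incidents, risky deploys, or maintenance before the objective is violated.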

Without SLOs, telemetry remains descriptive rather than actionable.

Alerts lack grounding. Reliability becomes subjective.

Teams drift toward reactive firefighting.

SLA — Service Level Agreement

An SLA formalizes reliability expectations at the business or contractual level, often including consequences when objectives are not met.

While SLAs are externally oriented, SLOs serve as the internal mechanisms that protect them.

Why This Foundation Matters

Observability without SLOs is structurally fragile.

It leads to:

  • Arbitrary alerts
  • Threshold-driven noise
  • Misaligned priorities
  • Alert fatigue

SLOs provide the context that gives telemetry meaning.

They answer the critical question:

“Why should we care about this signal?”


Logs — Constructing the System Narrative

Logs are the most traditional observability signal, yet they remain widely misunderstood.

At their core, logs capture discrete events within a system.

They answer:

  • ✅ What happened
  • ✅ When it happened
  • ✅ Under which context

They are invaluable for:

  • Incident reconstruction
  • Auditing
  • Debugging
  • Explaining decision paths

❌ Bad Log Example

Error processing request

Why this log fails:

  • Unstructured
  • No context
  • No identifiers
  • No explanation

During incidents, this log provides almost no diagnostic value.

✅ Good Log Example

{
  "timestamp": "2026-02-19T14:32:10Z",
  "level": "ERROR",
  "service": "payment-api",
  "operation": "authorize_payment",
  "request_id": "req-8f3a21",
  "user_id": "redacted",
  "provider": "stripe",
  "error": "timeout",
  "retry": "scheduled",
  "duration_ms": 1200
}

Why this log works:

  • ✔ Structured → searchable & parsable
  • ✔ Context-rich → service, operation
  • ✔ Correlatable → request_id
  • ✔ Tells a story → timeout + retry

Good logs reduce investigation time.
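A log line like the one above can be produced with the Python standard library alone. This is a minimal sketch, not a full logging setup; the field names simply mirror the example:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object, one per line."""
    def format(self, record):
        payload = {
            # Format mirrors the article's example; production code
            # should emit true UTC timestamps.
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-api",
        }
        # Merge structured context passed via the `extra` argument.
        payload.update(getattr(record, "context", {}))
        payload["message"] = record.getMessage()
        return json.dumps(payload)

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error(
    "payment authorization timed out",
    extra={"context": {
        "operation": "authorize_payment",
        "request_id": "req-8f3a21",
        "provider": "stripe",
        "error": "timeout",
        "retry": "scheduled",
        "duration_ms": 1200,
    }},
)
```

Because every line is valid JSON with stable keys, a log backend can index, filter, and correlate on `request_id` without regex guesswork.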

Common Logging Pitfalls

Many logging strategies fail because they prioritize quantity over clarity:

  • Excessive verbosity
  • Unstructured free text
  • Missing contextual metadata
  • Messages written solely for developers

Such logs become noise during incidents.

Operators search through vast streams of text with little ability to correlate events or extract patterns.

Architectural Logging Principles

Effective production logging resembles information architecture, not print statements.

Logs should:

  • ✅ Be structured
  • ✅ Carry contextual identifiers
  • ✅ Emphasize decisions and transitions
  • ✅ Minimize redundancy
  • ✅ Balance signal vs cost

Logs are most valuable when they describe why something happened, not just that it happened.

A state change is informative. A decision rationale is transformative.

Logs Within the Observability Landscape

Logs excel at narrative reconstruction but struggle with aggregation:

  • ✅ Rich context
  • ❌ Poor pattern visibility
  • ❌ Limited frequency analysis

They answer:

“What exactly occurred?”

But not efficiently:

“How often is this occurring?”


Metrics — Modeling Behavior at Scale

Metrics compress complex system behavior into numerical representations over time.

They are exceptionally effective at:

  • ✅ Trend detection
  • ✅ Anomaly identification
  • ✅ Capacity planning
  • ✅ SLO measurement
  • ✅ Alert triggering

Metrics reveal patterns invisible in raw logs.

They transform millions of events into interpretable signals.

❌ Bad Metric Example

http_requests_total{
  pod_name="payment-api-7f9d8c",
  pod_ip="10.42.3.17",
  request_id="req-123456"
}

Why this metric fails:

  • Extremely high cardinality
  • Labels change constantly
  • Poor aggregation value
  • Expensive & unstable

This design can degrade monitoring systems themselves.

✅ Good Metric Example

http_requests_total{
  service="payment-api",
  endpoint="/authorize",
  status_code="500"
}

Why this metric works:

  • ✔ Stable dimensions
  • ✔ Low cardinality
  • ✔ Aggregation-friendly
  • ✔ Operationally meaningful

Enables:

  • Error rate tracking
  • SLO measurement
  • Trend analysis
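With stable dimensions, aggregation becomes trivial. A sketch of how such labeled counters roll up into an error rate; plain dictionaries stand in for a real metrics backend, and the counts are illustrative:

```python
from collections import Counter

# Labeled request counts; label sets mirror the metric above.
# (service, endpoint, status_code) -> count
http_requests_total = Counter({
    ("payment-api", "/authorize", "200"): 9_850,
    ("payment-api", "/authorize", "500"): 150,
    ("payment-api", "/capture", "200"): 4_990,
    ("payment-api", "/capture", "500"): 10,
})

def error_rate(endpoint):
    """Fraction of requests to an endpoint that returned a 5xx status."""
    total = sum(n for (_, ep, _), n in http_requests_total.items()
                if ep == endpoint)
    errors = sum(n for (_, ep, code), n in http_requests_total.items()
                 if ep == endpoint and code.startswith("5"))
    return errors / total

print(f"/authorize error rate: {error_rate('/authorize'):.2%}")  # 1.50%
```

The same rollup is impossible with the bad metric: per-request labels leave nothing meaningful to sum over.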

Metrics as Abstractions

Metrics are not raw truths — they are designed abstractions.

They represent deliberate decisions about:

  • What to measure
  • What to aggregate
  • Which dimensions matter

Poorly designed metrics create misleading interpretations.

Well-designed metrics illuminate system health.

Cardinality — The Hidden Constraint

One of the most critical architectural concerns in metrics design is cardinality.

Cardinality refers to the number of unique label combinations a metric can produce.

High-cardinality labels — such as user IDs or request IDs — generate explosive growth in time series.

Consequences include:

  • Degraded query performance
  • Increased storage costs
  • Unstable monitoring systems
  • Impaired incident analysis

Cardinality is not merely a technical detail; it is a design constraint.
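The growth is multiplicative: the worst-case series count is the product of each label's distinct values. A quick illustration with hypothetical label sizes:

```python
from math import prod

# Hypothetical numbers of distinct values per label.
stable_labels = {"service": 20, "endpoint": 15, "status_code": 8}

# Adding one unbounded label multiplies everything by its cardinality.
unbounded_labels = {**stable_labels, "user_id": 100_000}

print(prod(stable_labels.values()))     # 2400 time series
print(prod(unbounded_labels.values()))  # 240000000 time series
```

One unbounded label turns a few thousand series into hundreds of millions, which is why identifiers belong in logs and traces, not in metric labels.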

Principles for Sustainable Metrics

Robust metrics systems emphasize:

  • ✅ Aggregation-friendly dimensions
  • ✅ Low cardinality discipline
  • ✅ Stability over hyper-granularity
  • ✅ Signals aligned with SLOs

Metrics should prioritize operational usefulness, not theoretical completeness.

Metrics Within the Observability Landscape

Metrics excel at pattern detection but lack narrative depth:

  • ✅ Patterns & trends
  • ❌ Context richness
  • ❌ Precise causality

They answer:

“Is something abnormal?”

But not fully:

“Why is it abnormal?”


Traces — Revealing Causality

Tracing addresses the fundamental challenge of distributed systems:

Understanding request behavior across service boundaries.

A trace models the lifecycle of a request as it propagates through components.

It answers:

  • ✅ Where latency accumulates
  • ✅ Which dependency failed
  • ✅ How failures propagate
  • ✅ Where retries amplify issues

❌ Bad Trace Example

Trace where spans are named:

span_1
span_2
span_3

With missing attributes:

  • No service names
  • No latency breakdown
  • No error metadata

Why this trace fails:

  • ❌ No semantic meaning
  • ❌ No diagnostic clarity
  • ❌ Hard to interpret

Tracing is enabled, but the data is practically useless.

✅ Good Trace Example

Trace showing:

  • payment-api → fraud-check → stripe
  • ✔ Span durations
  • ✔ Error tag on Stripe call
  • ✔ Retry sequence visible

Why this trace works:

  • ✔ Clear dependency chain
  • ✔ Latency attribution
  • ✔ Failure localization
  • ✔ Explains why latency occurred

Tracing transforms distributed systems into explainable systems.
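The good trace above can be modeled minimally: each span records its parent, duration, and error status, which is already enough to attribute latency along the dependency chain. This is a simplified sketch; real instrumentation would follow OpenTelemetry span semantics:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    name: str                    # semantic operation name, not "span_1"
    service: str
    parent: Optional[str]        # parent span name (None for the root)
    duration_ms: int
    error: Optional[str] = None

# The payment-api -> fraud-check -> stripe chain from the example.
trace = [
    Span("authorize_payment", "payment-api", None, 1450),
    Span("fraud_check", "fraud-check", "authorize_payment", 180),
    Span("charge", "stripe", "authorize_payment", 1200, error="timeout"),
]

# Latency attribution: which child span dominates the request?
slowest = max(trace, key=lambda s: s.duration_ms if s.parent else 0)
failed = [s for s in trace if s.error]

print(f"dominant child span: {slowest.name} ({slowest.duration_ms} ms)")
print(f"failed spans: {[s.name for s in failed]}")
```

Even this toy model localizes the problem: the Stripe call consumed most of the request's 1450 ms and carried the timeout.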

The Unique Power of Tracing

While metrics show that latency increased, traces reveal:

  • ✅ Which operation slowed
  • ✅ Which service contributed
  • ✅ Which dependency caused it

This causal visibility is what metrics and logs alone cannot provide.

Architectural Considerations

Effective tracing requires intentional design:

  • ✅ Meaningful span semantics
  • ✅ Strategic sampling
  • ✅ Cost-awareness
  • ✅ Correlation with logs and metrics

Tracing systems that lack semantic clarity often become:

Enabled → Collected → Ignored

Traces Within the Observability Landscape

Traces excel at causality but are less suited for long-term aggregation:

  • ✅ Root cause & dependency analysis
  • ❌ Broad statistical summaries

They answer:

“Why did this request behave this way?”


Observability Requires Multiple Perspectives

Logs, metrics, and traces are not competing solutions.

They are complementary perspectives:

  • Logs → Narrative
  • Metrics → Patterns
  • Traces → Causality

Each answers different classes of questions.

An overreliance on one signal produces blind spots:

Metrics-only → Lacks explanation

Logs-only → Lacks patterns

Traces-only → Lacks system-wide trends

Observability emerges from signal synthesis, not signal accumulation.


Observability as a Design Discipline

A recurring misconception is that observability can be “added later.”

In practice, observability is deeply intertwined with:

  • System architecture
  • Failure modeling
  • Reliability strategy
  • Operational maturity

Observability design begins by asking:

  • ✅ Which failures must be detectable?
  • ✅ Which degradations are unacceptable?
  • ✅ Which behaviors indicate user pain?
  • ✅ Which questions must telemetry answer?

Tools amplify answers — they do not supply them.


Conclusion — Designing Systems That Speak Clearly

Observability is not defined by dashboards, collectors, or storage backends.

It is defined by a system’s ability to:

  • ✅ Surface meaningful signals
  • ✅ Explain abnormal behavior
  • ✅ Connect symptoms to causes
  • ✅ Map degradation to user impact

The foundation of observability is intent:

SLIs define what matters

SLOs define what is acceptable

Telemetry explains what is happening


Actionable Architectural Guidance

When designing or evolving observability:

  • ✅ Define reliability objectives (SLOs)
  • ✅ Derive SLIs from user or business impact
  • ✅ Align alerts with SLO violations
  • ✅ Design metrics with cardinality discipline
  • ✅ Treat logging as information architecture
  • ✅ Instrument traces intentionally
  • ✅ Continuously evaluate signal usefulness
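"Align alerts with SLO violations" can be sketched as a burn-rate check: page only when the error budget is being consumed much faster than the SLO window allows. The thresholds and observed values below are illustrative:

```python
def burn_rate(error_ratio, slo):
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    budget = 1 - slo
    return error_ratio / budget

slo = 0.999          # 99.9% successful requests
observed = 0.02      # 2% of requests failing over a short window

rate = burn_rate(observed, slo)

# A common multiwindow pattern pages when the budget burns >14x too fast;
# the exact multiplier is a policy choice, not a constant.
should_page = rate > 14

print(f"burn rate: {rate:.1f}x")
print(f"page on-call: {should_page}")
```

Unlike a raw "error rate > X%" threshold, the burn rate ties the alert directly to the reliability objective, so a page always means budget is genuinely at risk.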

Observability is not a static implementation.

It is an evolving capability that matures alongside the system.

