
Building a Production-Grade OpenTelemetry Stack

Fault-tolerant, cost-optimized observability for distributed microservices

Architecture

[Architecture diagram: fault-tolerant design, cardinality optimization, high availability with sharding]

The Challenge

My client operated a distributed microservices architecture with dozens of services across multiple Kubernetes clusters, but lacked a unified observability strategy. The existing setup had critical gaps:

Observability Gaps

  • No distributed tracing: Impossible to track requests across service boundaries
  • Fragmented telemetry: Each team implemented their own logging, metrics, and monitoring
  • High operational overhead: Multiple agents, exporters, and collection methods
  • No standardization: Different instrumentation libraries, formats, and conventions
  • Limited visibility: Incidents required hours of log-diving and guesswork

Technical Challenges

The team needed a solution that:

  • Supported multiple signals (logs, metrics, traces) in a unified pipeline
  • Scaled reliably without data loss during pod restarts or node cycling
  • Optimized costs by controlling metric cardinality and data volume
  • Provided high availability through sharding and fault tolerance
  • Integrated with existing tools (Prometheus, Kubernetes, existing services)

The Solution

I designed and implemented a production-grade OpenTelemetry Collector stack on Kubernetes, focusing on reliability, cost optimization, and operational excellence. Rather than a quick proof-of-concept, this architecture was built for long-term production use.

OpenTelemetry (OTel) was chosen because:

  • Vendor-neutral: Not locked into any specific observability platform
  • Three signals, one pipeline: Logs, metrics, and traces through a single collector
  • Rich ecosystem: Extensive receiver, processor, and exporter support
  • Kubernetes-native: First-class support for service discovery and metadata
  • Future-proof: CNCF graduated project with strong community and adoption

In short, OpenTelemetry has become the de facto standard for distributed observability.

Architecture Philosophy

The design prioritized:

  1. Fault tolerance — No data loss during pod restarts, node cycles, or OOM crashes
  2. Cost efficiency — Strict cardinality control to avoid exponential metric growth
  3. High availability — Sharding and target allocation to prevent duplicate scraping
  4. Operational stability — HPA tuning to minimize churn during traffic spikes

Architecture Overview

Three-Component Design

The OpenTelemetry stack consists of three specialized deployments, each optimized for its telemetry signal:

1. Logs: DaemonSet (Node-Level Collection)

Deployed as a DaemonSet to tail logs from all pods on each node.

Why DaemonSet?

  • Logs are node-local (written to stdout/stderr, collected by kubelet)
  • Each node needs one collector to read /var/log/pods
  • Avoids network overhead of shipping logs between nodes

Configuration:

  • Tails pod logs directly from the filesystem
  • Enriches with Kubernetes metadata (pod name, namespace, labels)
  • Filters and parses logs before sending to backend
  • Minimal resource footprint per node

2. Metrics & Traces: StatefulSet (Persistent Queue)

Deployed as a StatefulSet with persistent volumes for fault tolerance.

Why StatefulSet?

  • Requires persistent storage for sending_queue to survive pod restarts
  • Each replica needs stable identity for Target Allocator sharding
  • PVCs ensure no data loss during node cycling or OOM crashes

Metrics Pipeline:

  • Prometheus receiver scrapes pods based on service discovery
  • Target allocation via separate Target Allocator service (HA sharding)
  • Cardinality optimization through metric relabeling and filtering
  • Persistent queue backed by PVC before sending to Coralogix

Traces Pipeline:

  • OTLP receiver accepts traces pushed by instrumented services
  • Services send traces directly to collector (no sidecar needed)
  • Batching and sampling to reduce volume and costs
  • Persistent queue ensures no trace loss during collector restarts

3. Target Allocator: Deployment (HA Target Sharding)

Separate Deployment for the OpenTelemetry Target Allocator.

Purpose:

  • Discovers Prometheus scrape targets (via Kubernetes service discovery)
  • Distributes targets evenly across StatefulSet replicas (sharding)
  • Prevents duplicate scraping when scaling collector replicas
  • Enables true horizontal scaling for metrics collection

Technical Deep Dive

Logs: DaemonSet Architecture

Filesystem Tailing

Challenge: Collecting logs from dozens of pods per node efficiently.

Solution:

# DaemonSet mounts host filesystem
volumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers

# Collector tails logs from these paths
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    include_file_name: false

k8sattributes Processor: Kubernetes Metadata Enrichment

Automatic enrichment with pod/namespace/labels:

  • Uses Kubernetes API to resolve container IDs → pod metadata
  • Adds namespace, pod name, container name, labels as log attributes
  • Enables filtering and querying by Kubernetes context in Coralogix
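A minimal k8sattributes configuration along these lines (the label mapping is illustrative):

```yaml
processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.container.name
      labels:
        # Promote a pod label to a telemetry attribute
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod
```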

Log Parsing & Filtering

Reduces noise before sending:

  • Parses JSON logs automatically
  • Filters out health check endpoints (/health, /ready) and known noisy application-specific endpoints
  • Drops debug-level logs in production
  • Only sends logs matching severity thresholds
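With the contrib filter processor, these drop rules can be expressed as OTTL conditions (patterns are illustrative); a log record matching any condition is dropped:

```yaml
processors:
  filter/noise:
    error_mode: ignore
    logs:
      log_record:
        # Drop health/readiness probe access logs
        - 'IsMatch(body, "GET /(health|ready)")'
        # Drop anything below INFO (i.e. TRACE and DEBUG)
        - 'severity_number < SEVERITY_NUMBER_INFO'
```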

Result: Lower data volume and costs while preserving important logs.


Metrics: StatefulSet with Target Allocator

The Cardinality Problem

Challenge: Prometheus metrics can explode in cardinality.

Example: A metric with 10 label dimensions, each with 10 values = 10^10 possible combinations. This causes:

  • Massive storage costs
  • Slow queries
  • Backend ingestion limits hit
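The multiplication is easy to verify: each label dimension multiplies the worst-case series count.

```python
# Worst-case time-series count: cardinality is the product of the
# number of distinct values across every label dimension.
values_per_label = 10
label_dimensions = 10

possible_series = values_per_label ** label_dimensions
print(possible_series)  # 10_000_000_000 potential time series
```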

Solution: Aggressive Metric Relabeling

Principle: Only scrape what you need, drop everything else.

Implemented via metric relabeling:

  1. Allowlist specific metrics — Only keep metrics critical for dashboards/alerts
  2. Drop high-cardinality labels — Remove labels like pod_ip, instance_id that add no value
  3. Aggregate where possible — Use recording rules to pre-aggregate
  4. Scrape interval tuning — Adjust scrape frequency per service (critical = 15s, normal = 60s, development/testing = 300s)

Example relabeling rule:

metric_relabel_configs:
  # Keep only specific metrics
  - source_labels: [__name__]
    regex: '(go_.*|http_requests_total|http_request_duration_seconds|up)'
    action: keep
  # Drop high-cardinality labels (labeldrop matches label names via regex)
  - regex: 'pod_ip'
    action: labeldrop
  # Rename for consistency
  - source_labels: [service]
    target_label: service_name
    action: replace

Result: Reduced metric cardinality by ~95% while preserving all critical observability.

Persistent Queue for Fault Tolerance

Challenge: Collector pods can crash or restart, losing in-flight metrics.

Solution: sending_queue with persistent storage.

exporters:
  otlp/coralogix:
    endpoint: coralogix-endpoint:443
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage/queue

extensions:
  file_storage/queue:
    directory: /var/lib/otelcol/queue
    timeout: 10s

Key benefits:

  • Queue persisted to PVC (survives pod restarts)
  • Prevents data loss during node cycling, OOM kills, or deployments
  • Automatic retry with exponential backoff
  • Batch sending for efficiency
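The retry behavior comes from the exporter's retry_on_failure settings (values here are illustrative):

```yaml
exporters:
  otlp/coralogix:
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # delay before the first retry
      max_interval: 60s       # cap for the exponential backoff
      max_elapsed_time: 600s  # give up on a batch after 10 minutes
```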

PVC configuration:

volumeClaimTemplates:
  - metadata:
      name: queue-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

One additional important configuration: the pod termination grace period was extended so collectors can drain and flush their persistent queues during shutdown, avoiding data loss on rollouts.
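A sketch of the relevant pod spec field (the 120s value is illustrative; it should exceed the exporter's worst-case flush time):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120  # default is 30s; allow queue drain
```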

Target Allocator for HA Sharding

Challenge: Multiple collector replicas scraping the same targets = duplicate or inconsistent metrics.

Solution: OpenTelemetry Target Allocator.

How it works:

  1. Target Allocator discovers all Prometheus scrape targets (via Kubernetes SD)
  2. Distributes targets evenly across StatefulSet replicas using consistent hashing
  3. Each collector replica queries Target Allocator for its assigned subset
  4. Targets automatically rebalanced when replicas scale up/down

Architecture:

┌────────────────────────────┐
│  Target Allocator (Deploy) │
│  - Discovers targets       │
│  - Shards across replicas  │
└─────────┬──────────────────┘
          │ assigns targets
          ├──────────┬──────────┐
          ▼          ▼          ▼
     ┌────────┐ ┌────────┐ ┌────────┐
     │ OTel-0 │ │ OTel-1 │ │ OTel-2 │  (StatefulSet)
     │ Scrapes│ │ Scrapes│ │ Scrapes│
     │ 1/3 of │ │ 1/3 of │ │ 1/3 of │
     │ targets│ │ targets│ │ targets│
     └────────┘ └────────┘ └────────┘

Benefits:

  • ✅ True horizontal scaling (3 replicas = 3x capacity)
  • ✅ No duplicate scrapes (each target scraped exactly once)
  • ✅ Automatic rebalancing on scale events
  • ✅ Resilient to collector pod failures

Target Allocator configuration:

config:
  scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # Only scrape pods with prometheus.io/scrape=true annotation
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: "true"
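On the collector side, each StatefulSet replica points its Prometheus receiver at the allocator instead of holding a static scrape list (the service name otel-targetallocator is illustrative):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs: []  # targets come from the allocator, not static config
    target_allocator:
      endpoint: http://otel-targetallocator:80
      interval: 30s
      collector_id: ${env:POD_NAME}  # stable StatefulSet identity, e.g. otel-0
```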

Traces: OTLP Receiver

Service Instrumentation

Instrumented services using OpenTelemetry SDKs:

  • A base starter Go microservice (the company's primary stack) with OpenTelemetry SDK configuration built in, so new microservices fork from this repo with instrumentation already wired up
  • Manual instrumentation for critical business logic
  • Consistent span naming conventions across all services

Direct Push Model

Unlike metrics (pull), traces use push:

Service → OTLP (gRPC/HTTP) → Collector → Coralogix

Why push instead of pull?

  • Traces are event-driven, not time-series
  • Services know when traces are complete
  • Reduces collector complexity (no service discovery needed for traces)
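A minimal receiver block for the push model, on the standard OTLP ports:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # OTLP/gRPC
      http:
        endpoint: 0.0.0.0:4318  # OTLP/HTTP
```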

Sampling Strategy

Implemented a tiered trace sampling strategy to control costs:

  • 100% sampling for errors and slow requests (>1s)
  • 50% general sampling for non-production environments
  • 10% sampling for successful, fast requests
  • Trace ID-based (consistent sampling across distributed traces)

Configuration:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep error traces
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces (>1s)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      # Trace ID-based probabilistic baseline for everything else
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

Additionally, Coralogix's TCO Optimizer classifies logs and traces into tiers: Compliance (rarely accessed), Monitoring (non-essential data), and Frequent Search (essential data that needs fast access, such as production).

Result: 90% cost reduction on trace ingestion while keeping all important traces.


HPA Stabilization for Target Allocator Efficiency

The Scaling Churn Problem

Challenge: HPA rapidly scaling up/down causes:

  • Target Allocator constantly reshuffling targets across replicas
  • Scrape schedule disruption
  • Increased load on Kubernetes API and Target Allocator
  • Inconsistent metric collection

Solution: HPA Stabilization Windows

Configured downscale stabilization to prevent flapping:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 minutes
    policies:
      - type: Percent
        value: 25
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0  # Scale up immediately
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15

How it works:

  • Scale up fast: React immediately to traffic spikes (100% in 15s)
  • Scale down slow: Wait 10 minutes before downscaling, prevent thrashing
  • Gradual downscale: Only remove 25% of replicas per minute

Benefits:

  • ✅ Target Allocator reshuffles less frequently
  • ✅ More consistent scrape scheduling
  • ✅ Lower API server load
  • ✅ Stable metric collection during traffic fluctuations

Key Technical Decisions

Why StatefulSet for Metrics/Traces?

Considered alternatives:

  • Deployment: No persistent storage = data loss on pod restart ❌
  • DaemonSet: Can't shard targets, over-collects ❌
  • StatefulSet: Stable identity + PVCs + sharding ✅

Rationale: StatefulSet provides stable pod identities required for Target Allocator sharding and persistent storage for queue fault tolerance.

Why Separate DaemonSet for Logs?

Could have used StatefulSet for logs too.

Decision: Logs are node-local, DaemonSet is the Kubernetes-native pattern.

  • Each node has one collector reading local filesystem
  • No network overhead shipping logs between nodes
  • Simpler configuration and lower resource usage

Why Persistent Queue?

Alternative: In-memory queue.

Decision: Persistent queue prevents data loss.

  • Node cycling (common in spot instances, cluster upgrades)
  • OOM kills (can happen under heavy load)
  • Deployments/rollouts (inevitable in active development)

Trade-off: Requires PVC provisioning, but worth it for production reliability.

Why Target Allocator?

Alternative: Each collector scrapes all targets, deduplicate at backend.

Decision: Target Allocator prevents duplicate scraping at the source.

  • Lower network traffic
  • Lower backend ingestion costs
  • True horizontal scaling (3 replicas = 3x capacity, not 3x duplication)

Why Coralogix?

Alternative: Self-hosted Grafana + Loki + Tempo + Prometheus.

Decision: Coralogix provides all-in-one platform with less operational overhead.

  • Unified logs, metrics, traces in one UI
  • Managed service (no need to run and scale Loki/Tempo/Prometheus)
  • Built-in alerting and dashboards
  • Cost-effective at our scale

Trade-off: Vendor lock-in, but OpenTelemetry makes migration easy if needed.


Lessons Learned

On Architecture Design

  1. Start simple, add complexity incrementally — Began with logs DaemonSet, added StatefulSet complexity only when needed
  2. Persistent queues are non-negotiable — Before PVC-backed queues were in place, we kept discovering data gaps, especially in metrics
  3. StatefulSet identity matters — Target Allocator sharding requires stable pod names (otel-0, otel-1, etc.)
  4. Resource limits are critical — OTel Collector can consume unbounded memory without proper limits
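A common guard for the last point is the collector's memory_limiter processor, which refuses data before the process OOMs; a minimal sketch (percentages are illustrative, and it should be paired with a Kubernetes memory limit):

```yaml
processors:
  memory_limiter:
    check_interval: 1s          # how often memory usage is checked
    limit_percentage: 80        # soft limit as % of the container memory limit
    spike_limit_percentage: 20  # headroom reserved for sudden spikes
```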

On Cost Optimization

  1. Cardinality is the enemy — One misconfigured label can 10x your metrics bill
  2. Allowlists > Denylists — Easier to maintain "only collect these metrics" than "block these"
  3. Scrape intervals matter — Not everything needs 15s granularity; 60s (or more) is fine for most metrics. Assess before implementing
  4. Sampling is essential for traces — 100% trace collection is unsustainable at scale

On Operational Excellence

  1. HPA stabilization prevents chaos — 10-minute downscale window dramatically reduced target churn
  2. Monitoring the monitors — Created alerts for collector health, queue size, and drop rates
  3. Documentation wins adoption — Lunch-and-learns and runbooks got teams using traces
  4. Iterative improvement beats perfection — Coverage grew 20% → 50% → 80% over months, not overnight, and I learned the concepts in this case study gradually along the way

On Kubernetes-Native Patterns

  1. DaemonSet for node-local work — Perfect for log collection from filesystem
  2. StatefulSet for identity + storage — Required for sharding and persistent queues
  3. Deployment for stateless logic — Target Allocator doesn't need persistence
  4. Service discovery is powerful — Kubernetes annotations (prometheus.io/scrape=true) enable self-service
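The annotation convention looks like this on a pod template (port and path values are illustrative):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```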

Technologies Used

OpenTelemetry Ecosystem:

  • OpenTelemetry Collector (core pipeline)
  • OpenTelemetry Target Allocator (HA sharding)
  • OpenTelemetry SDKs (service instrumentation)

Kubernetes:

  • DaemonSet (logs collection)
  • StatefulSet (metrics/traces with persistence)
  • Deployment (Target Allocator)
  • HPA (autoscaling with stabilization)
  • PVC (persistent queue storage)

Observability Backend:

  • Coralogix (logs, metrics, traces, dashboards, alerts)

Infrastructure:

  • Terraform (infrastructure as code)
  • Helm (OpenTelemetry Collector deployment)

Receivers & Exporters:

  • K8sattributes (Kubernetes data enrichment)
  • Prometheus receiver (metrics scraping)
  • OTLP receiver (traces ingestion)
  • Filelog receiver (log tailing)
  • OTLP exporter (sending to Coralogix)

