Building a Production-Grade OpenTelemetry Stack
Fault-tolerant, cost-optimized observability for distributed microservices
The Challenge
My client operated a distributed microservices architecture with dozens of services across multiple Kubernetes clusters, but lacked a unified observability strategy. The existing setup had critical gaps:
Observability Gaps
- No distributed tracing: Impossible to track requests across service boundaries
- Fragmented telemetry: Each team implemented their own logging, metrics, and monitoring
- High operational overhead: Multiple agents, exporters, and collection methods
- No standardization: Different instrumentation libraries, formats, and conventions
- Limited visibility: Incidents required hours of log-diving and guesswork
Technical Challenges
The team needed a solution that:
- Supported multiple signals (logs, metrics, traces) in a unified pipeline
- Scaled reliably without data loss during pod restarts or node cycling
- Optimized costs by controlling metric cardinality and data volume
- Provided high availability through sharding and fault tolerance
- Integrated with existing tools (Prometheus, Kubernetes, existing services)
The Solution
I designed and implemented a production-grade OpenTelemetry Collector stack on Kubernetes, focusing on reliability, cost optimization, and operational excellence. Rather than a quick proof-of-concept, this architecture was built for long-term production use.
OpenTelemetry (OTel) was chosen because:
- Vendor-neutral: Not locked into any specific observability platform
- Three signals, one pipeline: Logs, metrics, and traces through a single collector
- Rich ecosystem: Extensive receiver, processor, and exporter support
- Kubernetes-native: First-class support for service discovery and metadata
- Future-proof: CNCF graduated project with strong community and adoption
In short, OpenTelemetry has become the de facto standard for vendor-neutral distributed observability.
Architecture Philosophy
The design prioritized:
- Fault tolerance — No data loss during pod restarts, node cycles, or OOM crashes
- Cost efficiency — Strict cardinality control to avoid exponential metric growth
- High availability — Sharding and target allocation to prevent duplicate scraping
- Operational stability — HPA tuning to minimize churn during traffic spikes
Architecture Overview
Three-Component Design
The OpenTelemetry stack consists of three specialized deployments, each optimized for its telemetry signal:
1. Logs: DaemonSet (Node-Level Collection)
Deployed as a DaemonSet to tail logs from all pods on each node.
Why DaemonSet?
- Logs are node-local (written to stdout/stderr, collected by kubelet)
- Each node needs one collector to read `/var/log/pods`
- Avoids network overhead of shipping logs between nodes
Configuration:
- Tails pod logs directly from the filesystem
- Enriches with Kubernetes metadata (pod name, namespace, labels)
- Filters and parses logs before sending to backend
- Minimal resource footprint per node
2. Metrics & Traces: StatefulSet (Persistent Queue)
Deployed as a StatefulSet with persistent volumes for fault tolerance.
Why StatefulSet?
- Requires persistent storage for the `sending_queue` to survive pod restarts
- Each replica needs a stable identity for Target Allocator sharding
- PVCs ensure no data loss during node cycling or OOM crashes
Metrics Pipeline:
- Prometheus receiver scrapes pods based on service discovery
- Target allocation via separate Target Allocator service (HA sharding)
- Cardinality optimization through metric relabeling and filtering
- Persistent queue backed by PVC before sending to Coralogix
Traces Pipeline:
- OTLP receiver accepts traces pushed by instrumented services
- Services send traces directly to collector (no sidecar needed)
- Batching and sampling to reduce volume and costs
- Persistent queue ensures no trace loss during collector restarts
3. Target Allocator: Deployment (HA Target Sharding)
Separate Deployment for the OpenTelemetry Target Allocator.
Purpose:
- Discovers Prometheus scrape targets (via Kubernetes service discovery)
- Distributes targets evenly across StatefulSet replicas (sharding)
- Prevents duplicate scraping when scaling collector replicas
- Enables true horizontal scaling for metrics collection
Technical Deep Dive
Logs: DaemonSet Architecture
Filesystem Tailing
Challenge: Collecting logs from dozens of pods per node efficiently.
Solution:
```yaml
# DaemonSet mounts the host filesystem
volumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
```

```yaml
# Collector tails logs from these paths
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    include_file_name: false
```
K8sattributes processor - Kubernetes Metadata Enrichment
Automatic enrichment with pod/namespace/labels:
- Uses Kubernetes API to resolve container IDs → pod metadata
- Adds namespace, pod name, container name, labels as log attributes
- Enables filtering and querying by Kubernetes context in Coralogix
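A minimal sketch of that enrichment stage (the extracted metadata fields and label mapping here are illustrative, not the exact production config):

```yaml
processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.container.name
      labels:
        # Example: surface the app.kubernetes.io/name pod label as "app"
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod
```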
Log Parsing & Filtering
Reduces noise before sending:
- Parses JSON logs automatically
- Filters out health check endpoints (`/health`, `/ready`) and other known-noisy application-specific endpoints
- Drops debug-level logs in production
- Only sends logs matching severity thresholds
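These rules can be expressed with the collector's filter processor and OTTL conditions; a sketch, where the endpoint patterns and severity threshold are illustrative:

```yaml
processors:
  filter/noise:
    logs:
      log_record:
        # Drop health/readiness probe logs (endpoint names illustrative)
        - 'IsMatch(body, ".*(/health|/ready).*")'
        # Drop anything below INFO in production
        - 'severity_number < SEVERITY_NUMBER_INFO'
```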
Result: Lower data volume and costs while preserving important logs.
Metrics: StatefulSet with Target Allocator
The Cardinality Problem
Challenge: Prometheus metrics can explode in cardinality.
Example: A metric with 10 label dimensions, each with 10 values = 10^10 possible combinations. This causes:
- Massive storage costs
- Slow queries
- Backend ingestion limits hit
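The arithmetic above is worth making concrete; this sketch assumes the simple worst-case model where every label-value combination can produce its own time series:

```python
# Worst-case time-series count for one metric name: the product of the
# number of possible values of each label.
def series_count(label_cardinalities):
    total = 1
    for values in label_cardinalities:
        total *= values
    return total

# 10 labels with 10 possible values each -> 10**10 potential series
print(series_count([10] * 10))   # 10000000000

# Dropping just two of those labels shrinks the worst case by 100x
print(series_count([10] * 10) // series_count([10] * 8))   # 100
```

The takeaway: cardinality grows multiplicatively, so removing even one or two labels has an outsized effect on cost.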
Solution: Aggressive Metric Relabeling
Principle: Only scrape what you need, drop everything else.
Implemented via metric relabeling:
- Allowlist specific metrics — Only keep metrics critical for dashboards/alerts
- Drop high-cardinality labels — Remove labels like `pod_ip` and `instance_id` that add no value
- Aggregate where possible — Use recording rules to pre-aggregate
- Scrape interval tuning — Adjust scrape frequency per service (critical = 15s, normal = 60s, development/testing = 300s)
Example relabeling rule:
```yaml
metric_relabel_configs:
  # Keep only specific metrics
  - source_labels: [__name__]
    regex: '(go_.*|http_requests_total|http_request_duration_seconds|up)'
    action: keep
  # Drop high-cardinality labels (labeldrop matches against label names)
  - regex: 'pod_ip'
    action: labeldrop
  # Rename for consistency
  - source_labels: [service]
    target_label: service_name
    action: replace
```
Result: Reduced metric cardinality by ~95% while preserving all critical observability.
Persistent Queue for Fault Tolerance
Challenge: Collector pods can crash or restart, losing in-flight metrics.
Solution: sending_queue with persistent storage.
```yaml
exporters:
  otlp/coralogix:
    endpoint: coralogix-endpoint:443
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage/queue

extensions:
  file_storage/queue:
    directory: /var/lib/otelcol/queue
    timeout: 10s
```
Key benefits:
- Queue persisted to PVC (survives pod restarts)
- Prevents data loss during node cycling, OOM kills, or deployments
- Automatic retry with exponential backoff
- Batch sending for efficiency
PVC configuration:
```yaml
volumeClaimTemplates:
  - metadata:
      name: queue-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```
One more important setting: the pod termination grace period was raised so that a terminating collector has time to drain its persistent queue and shut down without data loss.
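A sketch of that setting on the StatefulSet pod template (the 60-second value is illustrative; size it to how long your queue takes to drain):

```yaml
spec:
  template:
    spec:
      # Give the collector time to flush its sending_queue before SIGKILL
      terminationGracePeriodSeconds: 60
```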
Target Allocator for HA Sharding
Challenge: Multiple collector replicas scraping the same targets = duplicate or inconsistent metrics.
Solution: OpenTelemetry Target Allocator.
How it works:
- Target Allocator discovers all Prometheus scrape targets (via Kubernetes SD)
- Distributes targets evenly across StatefulSet replicas using consistent hashing
- Each collector replica queries Target Allocator for its assigned subset
- Targets automatically rebalanced when replicas scale up/down
Architecture:
┌────────────────────────────┐
│ Target Allocator (Deploy) │
│ - Discovers targets │
│ - Shards across replicas │
└─────────┬──────────────────┘
│ assigns targets
├──────────┬──────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ OTel-0 │ │ OTel-1 │ │ OTel-2 │ (StatefulSet)
│ Scrapes│ │ Scrapes│ │ Scrapes│
│ 1/3 of │ │ 1/3 of │ │ 1/3 of │
│ targets│ │ targets│ │ targets│
└────────┘ └────────┘ └────────┘
Benefits:
- ✅ True horizontal scaling (3 replicas = 3x capacity)
- ✅ No duplicate scrapes (each target scraped exactly once)
- ✅ Automatic rebalancing on scale events
- ✅ Resilient to collector pod failures
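On the collector side, each replica's Prometheus receiver fetches its assigned target subset from the allocator instead of running its own discovery. A sketch (the service name and interval are assumptions):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs: []  # populated at runtime by the Target Allocator
    target_allocator:
      endpoint: http://otel-targetallocator
      interval: 30s
      collector_id: ${POD_NAME}
```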
Target Allocator configuration:
```yaml
config:
  scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # Only scrape pods with the prometheus.io/scrape=true annotation
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
```
Traces: OTLP Receiver
Service Instrumentation
Instrumented services using OpenTelemetry SDKs:
- A starter Go microservice template (Go is the company's main stack) ships with OpenTelemetry SDK configuration out of the box, so new microservices fork from that repo
- Manual instrumentation for critical business logic
- Consistent span naming conventions across all services
Direct Push Model
Unlike metrics (pull), traces use push:
Service → OTLP (gRPC/HTTP) → Collector → Coralogix
Why push instead of pull?
- Traces are event-driven, not time-series
- Services know when traces are complete
- Reduces collector complexity (no service discovery needed for traces)
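The receiving side of this push model is a standard OTLP receiver wired into a traces pipeline. A minimal sketch, using the default OTLP ports:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/coralogix]
```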
Sampling Strategy
Implemented a sampling strategy combining head-based probabilistic sampling with tail-based policies to control costs:
- 100% sampling for errors and slow requests (>1s)
- 50% general sampling for non-production environments
- 10% sampling for successful, fast requests
- Trace ID-based (consistent sampling across distributed traces)
Configuration:
```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10
    hash_seed: 22
  # Always sample errors and slow requests
  tail_sampling:
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
```
Additionally, Coralogix's TCO Quota Optimizer lets us classify logs and traces into tiers: Compliance (rarely accessed), Monitoring (non-essential data), and Frequent access (essential data that needs fast access, such as production).
Result: 90% cost reduction on trace ingestion while keeping all important traces.
HPA Stabilization for Target Allocator Efficiency
The Scaling Churn Problem
Challenge: HPA rapidly scaling up/down causes:
- Target Allocator constantly reshuffling targets across replicas
- Scrape schedule disruption
- Increased load on Kubernetes API and Target Allocator
- Inconsistent metric collection
Solution: HPA Stabilization Windows
Configured downscale stabilization to prevent flapping:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 minutes
    policies:
      - type: Percent
        value: 25
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0    # Scale up immediately
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
```
How it works:
- Scale up fast: React immediately to traffic spikes (100% in 15s)
- Scale down slow: Wait 10 minutes before downscaling, prevent thrashing
- Gradual downscale: Only remove 25% of replicas per minute
Benefits:
- ✅ Target Allocator reshuffles less frequently
- ✅ More consistent scrape scheduling
- ✅ Lower API server load
- ✅ Stable metric collection during traffic fluctuations
Key Technical Decisions
Why StatefulSet for Metrics/Traces?
Considered alternatives:
- Deployment: No persistent storage = data loss on pod restart ❌
- DaemonSet: Can't shard targets, over-collects ❌
- StatefulSet: Stable identity + PVCs + sharding ✅
Rationale: StatefulSet provides stable pod identities required for Target Allocator sharding and persistent storage for queue fault tolerance.
Why Separate DaemonSet for Logs?
Could have used StatefulSet for logs too.
Decision: Logs are node-local, DaemonSet is the Kubernetes-native pattern.
- Each node has one collector reading local filesystem
- No network overhead shipping logs between nodes
- Simpler configuration and lower resource usage
Why Persistent Queue?
Alternative: In-memory queue.
Decision: Persistent queue prevents data loss.
- Node cycling (common in spot instances, cluster upgrades)
- OOM kills (can happen under heavy load)
- Deployments/rollouts (inevitable in active development)
Trade-off: Requires PVC provisioning, but worth it for production reliability.
Why Target Allocator?
Alternative: Each collector scrapes all targets, deduplicate at backend.
Decision: Target Allocator prevents duplicate scraping at the source.
- Lower network traffic
- Lower backend ingestion costs
- True horizontal scaling (3 replicas = 3x capacity, not 3x duplication)
Why Coralogix?
Alternative: Self-hosted Grafana + Loki + Tempo + Prometheus.
Decision: Coralogix provides all-in-one platform with less operational overhead.
- Unified logs, metrics, traces in one UI
- Managed service (no need to run and scale Loki/Tempo/Prometheus)
- Built-in alerting and dashboards
- Cost-effective at our scale
Trade-off: Vendor lock-in, but OpenTelemetry makes migration easy if needed.
Lessons Learned
On Architecture Design
- Start simple, add complexity incrementally — Began with logs DaemonSet, added StatefulSet complexity only when needed
- Persistent queues are non-negotiable — We realized we had data gaps, especially in metrics, before implementing PVC-backed queues
- StatefulSet identity matters — Target Allocator sharding requires stable pod names (otel-0, otel-1, etc.)
- Resource limits are critical — OTel Collector can consume unbounded memory without proper limits
On Cost Optimization
- Cardinality is the enemy — One misconfigured label can 10x your metrics bill
- Allowlists > Denylists — Easier to maintain "only collect these metrics" than "block these"
- Scrape intervals matter — Not everything needs 15s granularity; 60s (or more) is fine for most metrics. Assess before implementing
- Sampling is essential for traces — 100% trace collection is unsustainable at scale
On Operational Excellence
- HPA stabilization prevents chaos — 10-minute downscale window dramatically reduced target churn
- Monitoring the monitors — Created alerts for collector health, queue size, and drop rates
- Documentation wins adoption — Lunch-and-learns and runbooks got teams using traces
- Iterative improvement beats perfection — Coverage grew 20% → 50% → 80% over months, not overnight, and I learned the concepts in this case study incrementally along the way
On Kubernetes-Native Patterns
- DaemonSet for node-local work — Perfect for log collection from filesystem
- StatefulSet for identity + storage — Required for sharding and persistent queues
- Deployment for stateless logic — Target Allocator doesn't need persistence
- Service discovery is powerful — Kubernetes annotations (`prometheus.io/scrape=true`) enable self-service
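For example, a team can opt its service into scraping with pod-template annotations like these (the port and path values are illustrative conventions, not fixed requirements):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```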
Technologies Used
OpenTelemetry Ecosystem:
- OpenTelemetry Collector (core pipeline)
- OpenTelemetry Target Allocator (HA sharding)
- OpenTelemetry SDKs (service instrumentation)
Kubernetes:
- DaemonSet (logs collection)
- StatefulSet (metrics/traces with persistence)
- Deployment (Target Allocator)
- HPA (autoscaling with stabilization)
- PVC (persistent queue storage)
Observability Backend:
- Coralogix (logs, metrics, traces, dashboards, alerts)
Infrastructure:
- Terraform (infrastructure as code)
- Helm (OpenTelemetry Collector deployment)
Receivers & Exporters:
- K8sattributes (Kubernetes data enrichment)
- Prometheus receiver (metrics scraping)
- OTLP receiver (traces ingestion)
- Filelog receiver (log tailing)
- OTLP exporter (sending to Coralogix)