Building a Production-Grade OpenTelemetry Stack
Fault-tolerant, cost-optimized observability for distributed microservices
The Challenge
My client operated a distributed microservices architecture with dozens of services across multiple Kubernetes clusters, but lacked a unified observability strategy. The existing setup had critical gaps:
Observability Gaps
- No distributed tracing: Impossible to track requests across service boundaries
- Fragmented telemetry: Each team implemented their own logging, metrics, and monitoring
- High operational overhead: Multiple agents, exporters, and collection methods
- No standardization: Different instrumentation libraries, formats, and conventions
- Limited visibility: Incidents required hours of log-diving and guesswork
Technical Challenges
The team needed a solution that:
- Supported multiple signals (logs, metrics, traces) in a unified pipeline
- Scaled reliably without data loss during pod restarts or node cycling
- Optimized costs by controlling metric cardinality and data volume
- Provided high availability through sharding and fault tolerance
- Integrated with existing tools (Prometheus, Kubernetes, existing services)
The Solution
I designed and implemented a production-grade OpenTelemetry Collector stack on Kubernetes, focusing on reliability, cost optimization, and operational excellence. Rather than a quick proof-of-concept, this architecture was built for long-term production use.
OpenTelemetry (OTel) was chosen because:
- Vendor-neutral: Not locked into any specific observability platform
- Three signals, one pipeline: Logs, metrics, and traces through a single collector
- Rich ecosystem: Extensive receiver, processor, and exporter support
- Kubernetes-native: First-class support for service discovery and metadata
- Future-proof: CNCF graduated project with strong community and adoption
In short, OpenTelemetry has become the de facto standard for vendor-neutral distributed observability.
Architecture Philosophy
The design prioritized:
- Fault tolerance — No data loss during pod restarts, node cycles, or OOM crashes
- Cost efficiency — Strict cardinality control to avoid exponential metric growth
- High availability — Sharding and target allocation to prevent duplicate scraping
- Operational stability — HPA tuning to minimize churn during traffic spikes
Architecture Overview
Three-Component Design
The OpenTelemetry stack consists of three specialized deployments, each optimized for its telemetry signal:
1. Logs: DaemonSet (Node-Level Collection)
Deployed as a DaemonSet to tail logs from all pods on each node.
Why DaemonSet?
- Logs are node-local (written to stdout/stderr, collected by kubelet)
- Each node needs one collector to read `/var/log/pods`
- Avoids network overhead of shipping logs between nodes
Configuration:
- Tails pod logs directly from the filesystem
- Enriches with Kubernetes metadata (pod name, namespace, labels)
- Filters and parses logs before sending to backend
- Minimal resource footprint per node
2. Metrics & Traces: StatefulSet (Persistent Queue)
Deployed as a StatefulSet with persistent volumes for fault tolerance.
Why StatefulSet?
- Requires persistent storage for the `sending_queue` to survive pod restarts
- Each replica needs a stable identity for Target Allocator sharding
- PVCs ensure no data loss during node cycling or OOM crashes
Metrics Pipeline:
- Prometheus receiver scrapes pods based on service discovery
- Target allocation via separate Target Allocator service (HA sharding)
- Cardinality optimization through metric relabeling and filtering
- Persistent queue backed by PVC before sending to Coralogix
Traces Pipeline:
- OTLP receiver accepts traces pushed by instrumented services
- Services send traces directly to collector (no sidecar needed)
- Batching and sampling to reduce volume and costs
- Persistent queue ensures no trace loss during collector restarts
3. Target Allocator: Deployment (HA Target Sharding)
Separate Deployment for the OpenTelemetry Target Allocator.
Purpose:
- Discovers Prometheus scrape targets (via Kubernetes service discovery)
- Distributes targets evenly across StatefulSet replicas (sharding)
- Prevents duplicate scraping when scaling collector replicas
- Enables true horizontal scaling for metrics collection
Technical Deep Dive
Logs: DaemonSet Architecture
Filesystem Tailing
Challenge: Collecting logs from dozens of pods per node efficiently.
Solution:
```yaml
# DaemonSet mounts the host filesystem
volumes:
  - name: varlog
    hostPath:
      path: /var/log
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers
```

```yaml
# Collector tails logs from these paths
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    include_file_name: false
```
K8sattributes processor - Kubernetes Metadata Enrichment
Automatic enrichment with pod/namespace/labels:
- Uses Kubernetes API to resolve container IDs → pod metadata
- Adds namespace, pod name, container name, labels as log attributes
- Enables filtering and querying by Kubernetes context in Coralogix
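A minimal sketch of that enrichment stage (the extracted metadata fields and label mapping here are illustrative, not the exact production config):

```yaml
processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.container.name
      labels:
        # Example: surface the app.kubernetes.io/name pod label as "app"
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod
```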
Log Parsing & Filtering
Reduces noise before sending:
- Parses JSON logs automatically
- Filters out health check endpoints (`/health`, `/ready`) and other known-noisy application-specific endpoints
- Drops debug-level logs in production
- Only sends logs matching severity thresholds
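These rules can be expressed with the collector's filter processor and OTTL conditions; a sketch, where the endpoint patterns and severity threshold are illustrative:

```yaml
processors:
  filter/noise:
    logs:
      log_record:
        # Drop health/readiness probe logs (endpoint names illustrative)
        - 'IsMatch(body, ".*(/health|/ready).*")'
        # Drop anything below INFO in production
        - 'severity_number < SEVERITY_NUMBER_INFO'
```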
Result: Lower data volume and costs while preserving important logs.
Metrics: StatefulSet with Target Allocator
The Cardinality Problem
Challenge: Prometheus metrics can explode in cardinality.
Example: A metric with 10 label dimensions, each with 10 values = 10^10 possible combinations. This causes:
- Massive storage costs
- Slow queries
- Backend ingestion limits hit
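The arithmetic above is worth making concrete; this sketch assumes the simple worst-case model where every label-value combination can produce its own time series:

```python
# Worst-case time-series count for one metric name: the product of the
# number of possible values of each label.
def series_count(label_cardinalities):
    total = 1
    for values in label_cardinalities:
        total *= values
    return total

# 10 labels with 10 possible values each -> 10**10 potential series
print(series_count([10] * 10))   # 10000000000

# Dropping just two of those labels shrinks the worst case by 100x
print(series_count([10] * 10) // series_count([10] * 8))   # 100
```

The takeaway: cardinality grows multiplicatively, so removing even one or two labels has an outsized effect on cost.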
Solution: Aggressive Metric Relabeling
Principle: Only scrape what you need, drop everything else.
Implemented via metric relabeling:
- Allowlist specific metrics — Only keep metrics critical for dashboards/alerts
- Drop high-cardinality labels — Remove labels like `pod_ip` and `instance_id` that add no value
- Aggregate where possible — Use recording rules to pre-aggregate
- Scrape interval tuning — Adjust scrape frequency per service (critical = 15s, normal = 60s, development/testing = 300s)
Example relabeling rule:
```yaml
metric_relabel_configs:
  # Keep only specific metrics
  - source_labels: [__name__]
    regex: '(go_.*|http_requests_total|http_request_duration_seconds|up)'
    action: keep
  # Drop high-cardinality labels (labeldrop matches against label names)
  - regex: 'pod_ip'
    action: labeldrop
  # Rename for consistency
  - source_labels: [service]
    target_label: service_name
    action: replace
```
Result: Reduced metric cardinality by ~95% while preserving all critical observability.
Persistent Queue for Fault Tolerance
Challenge: Collector pods can crash or restart, losing in-flight metrics.
Solution: sending_queue with persistent storage.
```yaml
exporters:
  otlp/coralogix:
    endpoint: coralogix-endpoint:443
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage/queue

extensions:
  file_storage/queue:
    directory: /var/lib/otelcol/queue
    timeout: 10s
```
Key benefits:
- Queue persisted to PVC (survives pod restarts)
- Prevents data loss during node cycling, OOM kills, or deployments
- Automatic retry with exponential backoff
- Batch sending for efficiency
PVC configuration:
```yaml
volumeClaimTemplates:
  - metadata:
      name: queue-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```
One more important setting: the pod termination grace period was raised so that a terminating collector has time to drain its persistent queue and shut down without data loss.
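A sketch of that setting on the StatefulSet pod template (the 60-second value is illustrative; size it to how long your queue takes to drain):

```yaml
spec:
  template:
    spec:
      # Give the collector time to flush its sending_queue before SIGKILL
      terminationGracePeriodSeconds: 60
```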
Target Allocator for HA Sharding
Challenge: Multiple collector replicas scraping the same targets = duplicate or inconsistent metrics.
Solution: OpenTelemetry Target Allocator.
How it works:
- Target Allocator discovers all Prometheus scrape targets (via Kubernetes SD)
- Distributes targets evenly across StatefulSet replicas using consistent hashing
- Each collector replica queries Target Allocator for its assigned subset
- Targets automatically rebalanced when replicas scale up/down
Architecture:
┌────────────────────────────┐
│ Target Allocator (Deploy) │
│ - Discovers targets │
│ - Shards across replicas │
└─────────┬──────────────────┘
│ assigns targets
├──────────┬──────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ OTel-0 │ │ OTel-1 │ │ OTel-2 │ (StatefulSet)
│ Scrapes│ │ Scrapes│ │ Scrapes│
│ 1/3 of │ │ 1/3 of │ │ 1/3 of │
│ targets│ │ targets│ │ targets│
└────────┘ └────────┘ └────────┘
Benefits:
- ✅ True horizontal scaling (3 replicas = 3x capacity)
- ✅ No duplicate scrapes (each target scraped exactly once)
- ✅ Automatic rebalancing on scale events
- ✅ Resilient to collector pod failures
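On the collector side, each replica's Prometheus receiver fetches its assigned target subset from the allocator instead of running its own discovery. A sketch (the service name and interval are assumptions):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs: []  # populated at runtime by the Target Allocator
    target_allocator:
      endpoint: http://otel-targetallocator
      interval: 30s
      collector_id: ${POD_NAME}
```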
Target Allocator configuration:
```yaml
config:
  scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # Only scrape pods with the prometheus.io/scrape=true annotation
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
```
Traces: OTLP Receiver
Service Instrumentation
Instrumented services using OpenTelemetry SDKs:
- A starter Go microservice template (Go is the company's main stack) ships with OpenTelemetry SDK configuration out of the box, so new microservices fork from that repo
- Manual instrumentation for critical business logic
- Consistent span naming conventions across all services
Direct Push Model
Unlike metrics (pull), traces use push:
Service → OTLP (gRPC/HTTP) → Collector → Coralogix
Why push instead of pull?
- Traces are event-driven, not time-series
- Services know when traces are complete
- Reduces collector complexity (no service discovery needed for traces)
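The receiving side of this push model is a standard OTLP receiver wired into a traces pipeline. A minimal sketch, using the default OTLP ports:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/coralogix]
```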
Sampling Strategy
Implemented a sampling strategy combining head-based probabilistic sampling with tail-based policies to control costs:
- 100% sampling for errors and slow requests (>1s)
- 50% general sampling for non-production environments
- 10% sampling for successful, fast requests
- Trace ID-based (consistent sampling across distributed traces)
Configuration:
```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10
    hash_seed: 22
  # Always sample errors and slow requests
  tail_sampling:
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
```
Additionally, Coralogix's TCO Quota Optimizer lets us classify logs and traces into tiers: Compliance (rarely accessed), Monitoring (non-essential data), and Frequent access (essential data that needs fast access, such as production).
Result: 90% cost reduction on trace ingestion while keeping all important traces.
HPA Stabilization for Target Allocator Efficiency
The Scaling Churn Problem
Challenge: HPA rapidly scaling up/down causes:
- Target Allocator constantly reshuffling targets across replicas
- Scrape schedule disruption
- Increased load on Kubernetes API and Target Allocator
- Inconsistent metric collection
Solution: HPA Stabilization Windows
Configured downscale stabilization to prevent flapping:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 minutes
    policies:
      - type: Percent
        value: 25
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0    # Scale up immediately
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
```
How it works:
- Scale up fast: React immediately to traffic spikes (100% in 15s)
- Scale down slow: Wait 10 minutes before downscaling, prevent thrashing
- Gradual downscale: Only remove 25% of replicas per minute
Benefits:
- ✅ Target Allocator reshuffles less frequently
- ✅ More consistent scrape scheduling
- ✅ Lower API server load
- ✅ Stable metric collection during traffic fluctuations
Key Technical Decisions
Why StatefulSet for Metrics/Traces?
Considered alternatives:
- Deployment: No persistent storage = data loss on pod restart ❌
- DaemonSet: Can't shard targets, over-collects ❌
- StatefulSet: Stable identity + PVCs + sharding ✅
Rationale: StatefulSet provides stable pod identities required for Target Allocator sharding and persistent storage for queue fault tolerance.
Why Separate DaemonSet for Logs?
Could have used StatefulSet for logs too.
Decision: Logs are node-local, DaemonSet is the Kubernetes-native pattern.
- Each node has one collector reading local filesystem
- No network overhead shipping logs between nodes
- Simpler configuration and lower resource usage
Why Persistent Queue?
Alternative: In-memory queue.
Decision: Persistent queue prevents data loss.
- Node cycling (common in spot instances, cluster upgrades)
- OOM kills (can happen under heavy load)
- Deployments/rollouts (inevitable in active development)
Trade-off: Requires PVC provisioning, but worth it for production reliability.
Why Target Allocator?
Alternative: Each collector scrapes all targets, deduplicate at backend.
Decision: Target Allocator prevents duplicate scraping at the source.
- Lower network traffic
- Lower backend ingestion costs
- True horizontal scaling (3 replicas = 3x capacity, not 3x duplication)
Why Coralogix?
Alternative: Self-hosted Grafana + Loki + Tempo + Prometheus.
Decision: Coralogix provides all-in-one platform with less operational overhead.
- Unified logs, metrics, traces in one UI
- Managed service (no need to run and scale Loki/Tempo/Prometheus)
- Built-in alerting and dashboards
- Cost-effective at our scale
Trade-off: Vendor lock-in, but OpenTelemetry makes migration easy if needed.
Lessons Learned
On Architecture Design
- Start simple, add complexity incrementally — Began with logs DaemonSet, added StatefulSet complexity only when needed
- Persistent queues are non-negotiable — We realized we had data gaps, especially in metrics, before implementing PVC-backed queues
- StatefulSet identity matters — Target Allocator sharding requires stable pod names (otel-0, otel-1, etc.)
- Resource limits are critical — OTel Collector can consume unbounded memory without proper limits
On Cost Optimization
- Cardinality is the enemy — One misconfigured label can 10x your metrics bill
- Allowlists > Denylists — Easier to maintain "only collect these metrics" than "block these"
- Scrape intervals matter — Not everything needs 15s granularity; 60s (or more) is fine for most metrics. Assess before implementing
- Sampling is essential for traces — 100% trace collection is unsustainable at scale
On Operational Excellence
- HPA stabilization prevents chaos — 10-minute downscale window dramatically reduced target churn
- Monitoring the monitors — Created alerts for collector health, queue size, and drop rates
- Documentation wins adoption — Lunch-and-learns and runbooks got teams using traces
- Iterative improvement beats perfection — Coverage grew 20% → 50% → 80% over months, not overnight, and I learned the concepts in this case study incrementally along the way
On Kubernetes-Native Patterns
- DaemonSet for node-local work — Perfect for log collection from filesystem
- StatefulSet for identity + storage — Required for sharding and persistent queues
- Deployment for stateless logic — Target Allocator doesn't need persistence
- Service discovery is powerful — Kubernetes annotations (`prometheus.io/scrape=true`) enable self-service
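For example, a team can opt its service into scraping with pod-template annotations like these (the port and path values are illustrative conventions, not fixed requirements):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```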
Technologies Used
OpenTelemetry Ecosystem:
- OpenTelemetry Collector (core pipeline)
- OpenTelemetry Target Allocator (HA sharding)
- OpenTelemetry SDKs (service instrumentation)
Kubernetes:
- DaemonSet (logs collection)
- StatefulSet (metrics/traces with persistence)
- Deployment (Target Allocator)
- HPA (autoscaling with stabilization)
- PVC (persistent queue storage)
Observability Backend:
- Coralogix (logs, metrics, traces, dashboards, alerts)
Infrastructure:
- Terraform (infrastructure as code)
- Helm (OpenTelemetry Collector deployment)
Receivers & Exporters:
- K8sattributes (Kubernetes data enrichment)
- Prometheus receiver (metrics scraping)
- OTLP receiver (traces ingestion)
- Filelog receiver (log tailing)
- OTLP exporter (sending to Coralogix)