ML Inference Platform for Financial Services
98% latency reduction to meet transaction SLA
Latency Reduction
500ms
9ms (p50)
SLA Requirement
50ms
SLO Target
20ms
Throughput Scaling
+500% RPS
p99 Latency
16ms
The Challenge
A large LATAM financial institution I worked with required real-time ML inference for fraud detection and credit risk analysis during transaction processing. The business constraint was critical:
Hard Requirement: sub-second end-to-end transaction time
- Our ML inference API: Must be <50ms to meet SLA
- Initial system latency: 500ms (completely unacceptable)
The initial Python-based inference service had:
- 500ms average latency (10x over budget)
- Limited throughput (couldn't handle peak banking hours)
- Single-threaded bottlenecks in data parsing
- No path to meeting the SLA without fundamental changes
Failure to meet this SLA would mean:
- ❌ Lost contract with a large Brazilian financial institution
- ❌ No real-time fraud detection capability
- ❌ Five to six-figure MRR at risk
The Solution
I implemented a two-phase optimization strategy to meet and exceed the SLA:
Phase 1: Architectural Migration (Macro-Optimization)
Goal: Get under 100ms to make micro-optimization feasible
Problem Diagnosis:
- Profiled the system and identified several loops and data structures that could be optimized
- 80% of latency came from JSON parsing (single-threaded)
- 20% from request validation and ML inference
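The diagnosis above came from measurement, not intuition. A minimal stdlib timing sketch of the kind that pinned the bottleneck on single-threaded JSON parsing (the payload shape is illustrative, not the client's actual transaction schema):

```python
import json
import time

# Build a large synthetic payload resembling a batch of transactions.
# Field names and sizes are placeholders for the sketch.
payload = json.dumps({
    "transactions": [
        {"amount": 12.5, "merchant": f"m{i}", "features": list(range(40))}
        for i in range(5_000)
    ]
})

# Time the parse step in isolation from validation and inference.
start = time.perf_counter()
parsed = json.loads(payload)
parse_ms = (time.perf_counter() - start) * 1000

# On the original service this stage dominated the request: profiling
# showed roughly 80% of the 500ms spent before the model ever ran.
```

Isolating each stage this way is what made the 80/20 split visible and justified moving the parsing layer out of Python.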
Solution: Migrated the data parsing and validation layer from Python to Node.js while keeping Python for ML model inference (due to scikit-learn and other library dependencies).
Phase 1 Results:
- ✅ Latency: 500ms → 80ms (84% reduction)
- ✅ 500% RPS increase (could handle 5x traffic)
- ✅ System stable during peak banking hours
- ⚠️ Still not good enough — needed <50ms for SLA
Phase 2: Micro-Optimization (Meeting the SLA)
New Goal: Get from 80ms to <20ms to exceed SLO (SLA: 50ms, internal SLO: 20ms)
SLA vs SLO: The Service Level Agreement (SLA) set a hard limit of <50ms. We set an internal Service Level Objective (SLO) of <20ms for operational margin.
Problem: 80ms was a big improvement, but the client's own processing consumed most of the sub-second budget, so our share had to stay under 50ms to meet the end-to-end SLA. We needed every millisecond.
Solution: Application-level performance tuning across the entire inference pipeline:
1. Feature Engineering Optimization (40ms → 5ms)
- Replaced slow Pandas parsing with vectorized NumPy operations
- Pre-computed static features at model load time
- Cached frequently accessed lookups (LRU cache)
2. Model Input Preparation (20ms → 2ms)
- Optimized data type conversions
- Removed redundant transformations
3. Inference Optimization (20ms → 2ms)
- Replacing Pandas DataFrames with NumPy arrays throughout the pipeline sped up model inference itself
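The three optimizations above can be sketched together in a few lines. All names and feature shapes here are hypothetical stand-ins, not the production pipeline:

```python
from functools import lru_cache

import numpy as np

# 1) Pre-compute static features once, at model load time, instead of
#    per request. (Illustrative values.)
STATIC_FEATURES = np.log1p(np.arange(1, 11, dtype=np.float64))

@lru_cache(maxsize=4096)
def merchant_risk(merchant_id: str) -> float:
    # Cache frequently accessed lookups; the real version hit a lookup table.
    return (hash(merchant_id) % 100) / 100.0

def build_features(amounts: np.ndarray, merchant_id: str) -> np.ndarray:
    # 2) One vectorized NumPy pass replaces row-wise Pandas transformations,
    #    and the dtype is fixed up front so no conversion happens later.
    dynamic = np.empty(amounts.size + 1, dtype=np.float64)
    dynamic[:-1] = np.log1p(amounts)          # vectorized, no Python loop
    dynamic[-1] = merchant_risk(merchant_id)  # cached lookup

    # 3) The model consumes a plain NumPy array, never a DataFrame.
    return np.concatenate([STATIC_FEATURES, dynamic])

feats = build_features(np.array([10.0, 250.0, 3.5]), "merchant-42")
```

The common thread: do work once (pre-compute, cache), and keep the per-request path as a handful of vectorized array operations.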
Phase 2 Results:
- ✅ Median latency (p50): 9ms (89% reduction from Phase 1)
- ✅ p99 latency: 16ms (99th percentile still well under budget)
- ✅ Total improvement: 500ms → 9ms (98% reduction)
- ✅ Maximum headroom for client processing (over 80% of the 50ms budget left unused at p50)
Results & Impact
Business Impact
Meeting the SLA enabled:
- 💰 Five to six-figure MRR from ML inference services
- 🏦 A large LATAM financial institution as paying customers
- 🚀 Real-time fraud detection (previously impossible)
- 📈 Credit risk scoring during transaction approval
- ⚡ 99.9% uptime SLA maintained
Technical Achievements
- 98% latency reduction (500ms → 9ms median)
- 500% RPS scaling (architectural migration)
- Multi-cloud deployment (client required Azure with Private Link; our main cloud was AWS)
- Zero-downtime deployments (canary with health checks)
- Full observability (Grafana dashboards, alerting)
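As an illustration of the health-check side of the zero-downtime rollouts, a minimal Kubernetes deployment fragment might look like the following (names, image, ports, and thresholds are placeholders, not the production manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below full capacity during a rollout
      maxSurge: 1         # bring up one new pod at a time
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
        - name: api
          image: registry.example.com/ml-inference:latest
          ports:
            - containerPort: 8000
          readinessProbe:   # gate traffic until the model is loaded
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
```

The readiness probe matters for ML services in particular: a pod should not receive traffic until the model is loaded and warm, or the first requests eat a cold-start latency spike.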
Latency Evolution Timeline
Latency (ms)
│
500 ┤●──────────────────────┐
│ Initial State │
│ (Python-only) │ ❌ 10x over SLA budget
│ │
│ │
80 ┤ ●─────────┐
│ │
│ Phase 1 │ ⚠️ Still 60% over budget
│ (Architecture) │
│ │
50 ┤- - - - - - - - - - - - - - - - - - - - - - - ← SLA budget line
│ │
20 ┤· · · · · · · · · · · · · · · · · · · · · · · · ← SLO (target)
│ │
9 ┤ ●═══════════
│ Phase 2
│ (Micro-optimization) ✅ Exceeds SLO!
│
│
└────────────────────────────────────────────────→ Time
Before Phase 1 Phase 2
Lessons Learned
On Performance Optimization
- Profile before optimizing — 80% of latency was in JSON parsing, not ML inference; picking the right tool for that layer mattered more than tuning the model
- Think in layers — Architectural changes (Phase 1) enable micro-optimizations (Phase 2)
- SLA budgets matter — Client's sub-second processing meant we needed <50ms, not just "fast enough"
- Measure everything — p50 vs p99 latency tell different stories (9ms vs 16ms)
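The p50 vs p99 point is easy to see with a synthetic distribution (the numbers below are made up for the sketch, not the production latency data):

```python
import numpy as np

# A fast median can coexist with a noticeably slower tail: 99% of requests
# cluster near 9ms while a small fraction land near 16ms.
rng = np.random.default_rng(42)
latencies_ms = np.concatenate([
    rng.normal(9.0, 1.5, 990),   # the bulk of requests, near the median
    rng.normal(16.0, 2.0, 10),   # a small tail of slow requests
])

p50 = float(np.percentile(latencies_ms, 50))
p99 = float(np.percentile(latencies_ms, 99))
# Alerting on p50 alone would hide the tail entirely; the SLO has to
# hold at p99, which is what the 16ms figure in this write-up tracks.
```

This is why the dashboards tracked both percentiles: the median tells you the typical experience, the p99 tells you whether the SLA budget survives the worst cases.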
On SLA-Driven Development
- Hard constraints drive creativity — The <50ms SLA forced us to optimize aggressively
- Build in margin — Our 9–16ms (p50–p99) leaves 34–41ms of headroom against the 50ms SLA (safety buffer for client variability)
- End-to-end thinking — Your latency is only part of the total transaction time
On Multi-Phase Optimization
- Quick wins first — Phase 1 (architecture) bought us time for Phase 2 (micro-opt)
- Know when to stop — At 9ms, further optimization had diminishing returns
- Document the journey — Showing 500ms → 80ms → 9ms tells a better story than just "9ms"
Technologies Used
Infrastructure:
- Azure (AKS, Azure DevOps/Pipelines)
- Kubernetes
- Azure Private Link
- Terraform (IaC)
Application:
- Python (ML models, NumPy, FastAPI)
Observability:
- Prometheus (metrics)
- Grafana (dashboards)
Storage:
- Azure Blob Storage (ML model storage)
Disclaimer: SLA and latency values on this page are approximate and may have been adjusted for client privacy and confidentiality. See SLA/SLO definition above →
Next Project: Building a Production-Grade OpenTelemetry Stack →