When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings

A case study from Expedia about FinOps.

The Cost-Reliability Disconnect

Background: Modern infrastructure is complex and large (thousands of clusters, multi-region, …) and carries heavy operational responsibilities (SLAs, SLOs, scalability, …)
Platform Team: Reliability, performance, stability
FinOps Team: Cloud resource reduction, budget adherence, efficiency
Problem: The two teams have conflicting goals and are often organizationally separated
- Blind cost optimization can lead to unintended stability/performance problems that can quickly spiral
- Blind stability optimization quickly leads to large overhead/overprovisioning and huge costs

Establish views & baselines: Understand cost per cluster/workload and utilization patterns
Revisit legacy: Old configs like static sizing, huge buffers, …
Embrace rearchitecture without fear: Consolidation, instance optimization, and infra redesign should all be on the table
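
A minimal sketch of what "establishing a baseline" could look like in practice, assuming usage data is available as simple per-hour samples. The sample data, the cluster names, and the `build_baseline` helper are hypothetical and only illustrate aggregating cost and utilization per cluster:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hourly samples: (cluster, hourly_cost_usd, cpu_utilization in 0..1).
SAMPLES = [
    ("cluster-a", 12.0, 0.20),
    ("cluster-a", 12.0, 0.30),
    ("cluster-b", 40.0, 0.70),
    ("cluster-b", 40.0, 0.80),
]


def build_baseline(samples):
    """Aggregate samples into total cost and average utilization per cluster."""
    costs = defaultdict(float)
    utilization = defaultdict(list)
    for cluster, cost, util in samples:
        costs[cluster] += cost
        utilization[cluster].append(util)
    return {
        cluster: {
            "total_cost": costs[cluster],
            "avg_utilization": mean(utilization[cluster]),
        }
        for cluster in costs
    }


baseline = build_baseline(SAMPLES)
for cluster, stats in baseline.items():
    print(cluster, stats)
```

Putting cost and utilization side by side per cluster is the point: a cheap cluster at low utilization and an expensive cluster at high utilization call for different actions.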

General recommendations

Problem: Lack of cost attribution for shared infrastructure
Problem: Lack of insight into which clusters are generating costs
Problem: No transparency into which teams are consuming resources
Solution: Tie the costs that show up on the bill back to where they are generated (cluster, workload, team)
Solution: Identify a safe operating range that wraps the “optimal zone” with a buffer for over- and underutilization -> Baseline for automatic scaling
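
The two solutions above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method from the talk: the proportional split, the threshold values (50-70% optimal zone, 10% buffer), and both function names are hypothetical.

```python
def attribute_shared_cost(total_cost, usage_by_team):
    """Split a shared cost across teams in proportion to their measured usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        # Nothing was consumed, so nothing can be attributed proportionally.
        return {team: 0.0 for team in usage_by_team}
    return {
        team: total_cost * usage / total_usage
        for team, usage in usage_by_team.items()
    }


def classify_utilization(utilization, optimal_low=0.5, optimal_high=0.7, buffer=0.1):
    """Classify utilization against a safe range: the optimal zone plus a buffer.

    The default thresholds are assumed values for illustration only.
    """
    safe_low = optimal_low - buffer
    safe_high = optimal_high + buffer
    if utilization < safe_low:
        return "scale_in"   # underutilized: capacity can be reduced
    if utilization > safe_high:
        return "scale_out"  # overutilized: reliability is at risk
    return "hold"           # inside the safe operating range


# Hypothetical usage figures (for example, CPU core-hours per team).
shares = attribute_shared_cost(100.0, {"team-a": 30, "team-b": 70})
print(shares)
print(classify_utilization(0.2), classify_utilization(0.6), classify_utilization(0.95))
```

The buffer around the optimal zone keeps an autoscaler from reacting to every small fluctuation, which is what makes the range usable as a scaling baseline.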

General recommendations

Problem children: Idle clusters ("just in case I need one fast"), oversized compute (overdone safety buffers), and underutilized clusters
Challenge: No one wants to touch a running system

What they did in their legacy systems

Find out if your current workloads actually need the currently selected node types
Group jobs into batches
Even if the size is right: Check if you can switch to newer nodes with better price-to-performance
Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side effects
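
To illustrate the idea of scaling on actual load, here is a small sketch that derives a replica count from a queue length. KEDA itself is configured declaratively and is not driven by Python code like this; the `desired_replicas` function, its parameters, and the numbers are hypothetical and only show the kind of calculation a load-based scaler performs.

```python
import math


def desired_replicas(queue_length, target_per_replica, min_replicas=0, max_replicas=10):
    """Return how many replicas are needed so each handles about target_per_replica items."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    # Clamp to the configured bounds so scaling stays inside a safe range.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(0, 5))     # empty queue: scale to the minimum
print(desired_replicas(23, 5))    # 23 pending items, 5 per replica
print(desired_replicas(1000, 5))  # a burst is capped at max_replicas
```

Scaling on a direct signal such as queue length avoids waiting for a secondary symptom like rising CPU usage before capacity is added.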