When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings
A case study from Expedia about FinOps.
The Cost-Reliability disconnect
- Background: Modern infrastructure is complex and large (1000s of clusters, multi-region, …) with huge operational responsibilities (SLA, SLO, scalability, …)
- Platform Team: Reliability, Performance, Stability
- FinOps Team: Cloud Resources reduction, budget adherence, efficiency
- Problem: Conflicting goals, and the two teams are often organizationally separated
- Blind cost optimization can lead to unintentional stability/performance problems that can quickly spiral
- Blind stability optimizations quickly lead to large overhead/overprovisioning and huge costs
Patterns
- Establish views & baselines: Understand cost per cluster/workload and utilization patterns
- Revisit legacy: Old configs like static sizing, huge buffers, …
- Embrace rearchitecture without fear: Consolidation, instance optimization, and infra redesign should all be on the table
Views & baselines
General recommendations
- Problem: Lack of cost attribution for shared infra
- Problem: Lack of insight into which clusters are generating costs
- Problem: No transparency into which teams are consuming resources
- Solution: Bring the generation of costs together with their existence, so every cost on the bill can be traced to the cluster, workload, and team that causes it
- Solution: Identify a safe operating range that wraps the “optimal zone” with a buffer for over- and underutilization -> Baseline for automatic scaling
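The safe operating range from the last point can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the optimal zone (50% to 70% utilization), the 10% buffer, and the names `OperatingRange`, `safe_range`, and `scaling_hint` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingRange:
    """Utilization band (as fractions of capacity) considered safe to run in."""
    lower: float  # below this, the cluster is overprovisioned
    upper: float  # above this, there is too little headroom


def safe_range(optimal_low: float, optimal_high: float, buffer: float) -> OperatingRange:
    """Wrap the optimal zone with a buffer on both sides, clamped to [0, 1]."""
    return OperatingRange(
        lower=max(0.0, optimal_low - buffer),
        upper=min(1.0, optimal_high + buffer),
    )


def scaling_hint(utilization: float, rng: OperatingRange) -> str:
    """Map a utilization sample to a scaling decision."""
    if utilization > rng.upper:
        return "scale_up"
    if utilization < rng.lower:
        return "scale_down"
    return "hold"


# Assumed values for illustration only.
rng = safe_range(optimal_low=0.50, optimal_high=0.70, buffer=0.10)

print(scaling_hint(0.85, rng))  # scale_up
print(scaling_hint(0.60, rng))  # hold
print(scaling_hint(0.30, rng))  # scale_down
```

Nothing happens while utilization stays inside the band, which keeps the autoscaler from reacting to every small fluctuation.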
Revisiting legacy
General recommendations
- Problem children: Idle clusters (kept "just in case I need one fast"), oversized compute (overdone safety buffers), and underutilized clusters
- Challenge: No one wants to touch a running system
- Analyze historical utilization (identify spikes/traffic patterns)
- Identify safe optimization opportunities
- Roll out changes gradually
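One way to picture the last three steps: derive a target size from historical usage, then approach it in small increments. The sketch below is an assumption, not the method from the talk; the 95th percentile, the 20% headroom, and the helper names `recommend_capacity` and `rollout_steps` are all hypothetical choices.

```python
from statistics import quantiles


def recommend_capacity(samples: list[float], headroom: float = 0.2) -> float:
    """Recommend capacity as the 95th percentile of observed usage plus headroom.

    Sizing on a high percentile instead of the mean keeps regular spikes covered.
    """
    p95 = quantiles(samples, n=20)[-1]  # last of the 19 cut points is the 95th percentile
    return p95 * (1 + headroom)


def rollout_steps(current: float, target: float, steps: int) -> list[float]:
    """Split the change from current to target capacity into equal increments."""
    delta = (target - current) / steps
    return [current + delta * i for i in range(1, steps + 1)]


# Shrinking from 100 to 40 capacity units in three stages.
print(rollout_steps(100, 40, 3))  # [80.0, 60.0, 40.0]
```

Each intermediate size can be observed for a while before the next step, so a bad assumption about the traffic pattern shows up early and is cheap to revert.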
Rearchitecture without fear
What they did in their legacy systems
- Find out whether your current workloads actually need the currently selected node types
- Optimize Jobs into batches
- Even if the size is right: Check if you can switch to newer nodes with better price to performance
- Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side effects
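The node type checks above come down to comparing cost per unit of work rather than raw hourly price. A minimal sketch, assuming you have your own benchmark numbers: the node names, prices, and throughput figures below are invented for illustration and are not real cloud pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    hourly_price: float  # USD per hour (illustrative, not real pricing)
    throughput: float    # work units per hour, measured with your own benchmark


def cost_per_unit(node: NodeType) -> float:
    """Price paid for one unit of work on this node type."""
    return node.hourly_price / node.throughput


def best_value(candidates: list[NodeType]) -> NodeType:
    """Pick the node type with the lowest cost per unit of work."""
    return min(candidates, key=cost_per_unit)


# Hypothetical candidates: the newer node costs more per hour but does more work.
candidates = [
    NodeType("current-gen", hourly_price=0.40, throughput=1000.0),
    NodeType("newer-gen", hourly_price=0.44, throughput=1300.0),
]

print(best_value(candidates).name)  # newer-gen
```

In this made-up example the newer node is 10% more expensive per hour yet cheaper per unit of work, which is the kind of case a pure hourly-price comparison would miss.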