When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings

A case study from Expedia about FinOps.

The Cost-Reliability disconnect

  • Background: Modern infrastructure is complex and large (1000s of clusters, multi-region, …) with huge operational responsibilities (SLA, SLO, scalability, …)
  • Platform Team: Reliability, Performance, Stability
  • FinOps Team: Cloud Resources reduction, budget adherence, efficiency
  • Problem: Conflicting goals, and the two teams are often organizationally separated
    • Blind cost optimization can lead to unintentional stability/performance problems that can quickly spiral
    • Blind stability optimizations quickly lead to large overhead/overprovisioning and huge costs

Patterns

  • Establish views & baselines: Understand cost per cluster/workload and utilization patterns
  • Revisit legacy: Old configs like static sizing, huge buffers, …
  • Embrace rearchitecture without fear: Consolidation, instance optimization, infra redesign should all be on the table

Views & baselines

General recommendations

  • Problem: Lack of cost attribution for shared infra
  • Problem: Lack of insights into which clusters are generating costs
  • Problem: No transparency into which teams are consuming resources
  • Solution: Bring the generation of costs together with their existence, i.e. attribute each cost to the cluster or team that causes it
  • Solution: Identify a safe operating range that wraps the “optimal zone” with a buffer for over- and underutilization -> Baseline for automatic scaling
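
The two solutions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what the speakers implemented: the proportional split by measured usage, the names `attribute_shared_cost` and `OperatingRange`, and all numbers are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


def attribute_shared_cost(shared_cost: float, usage_by_team: dict[str, float]) -> dict[str, float]:
    """Split a shared bill across teams in proportion to their measured usage.

    Assumption: usage is already available per team (e.g. CPU-hours).
    """
    total = sum(usage_by_team.values())
    if total <= 0:
        raise ValueError("no usage recorded, cannot attribute cost")
    return {team: shared_cost * usage / total for team, usage in usage_by_team.items()}


@dataclass(frozen=True)
class OperatingRange:
    """Safe operating range: the optimal zone plus a buffer on each side."""

    optimal_low: float   # lower edge of the optimal zone (utilization, 0..1)
    optimal_high: float  # upper edge of the optimal zone
    buffer: float        # tolerated margin outside the optimal zone

    def classify(self, utilization: float) -> str:
        """Return the scaling decision for one utilization reading."""
        if utilization < self.optimal_low - self.buffer:
            return "scale_down"  # underutilized: paying for idle capacity
        if utilization > self.optimal_high + self.buffer:
            return "scale_up"    # overutilized: reliability is at risk
        return "hold"            # inside the safe operating range


if __name__ == "__main__":
    # Hypothetical shared cluster bill and per-team usage.
    print(attribute_shared_cost(1000.0, {"checkout": 30.0, "search": 70.0}))

    safe = OperatingRange(optimal_low=0.5, optimal_high=0.7, buffer=0.1)
    for reading in (0.25, 0.6, 0.9):
        print(reading, safe.classify(reading))
```

The range only triggers an action once a reading leaves the buffered zone, which is what makes it usable as a baseline for automatic scaling without reacting to every small fluctuation.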

Revisiting legacy

General recommendations

  • Problem children: Idle clusters ("just in case I need one fast"), oversized compute (safety buffers overdone) and underutilized clusters
  • Challenge: No one wants to touch a running system
  1. Analyze historical utilization (identify spikes/traffic patterns)
  2. Identify safe optimization opportunities
  3. Roll out changes gradually
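
The three steps above could look roughly like this in Python. It is a sketch under assumptions that are not from the talk: the 95th percentile as the sizing signal, a 30% headroom, a 25% maximum step per rollout stage, and the function names are all hypothetical choices.

```python
import math
import statistics


def recommend_capacity(cpu_samples, current_cores, headroom=0.3):
    """Suggest a core count from historical usage (cores in use per sample).

    Sizes for the 95th percentile so that typical spikes are covered, adds
    headroom on top, and never recommends more than the current size.
    """
    p95 = statistics.quantiles(cpu_samples, n=100)[94]
    return max(1, min(math.ceil(p95 * (1 + headroom)), current_cores))


def rollout_steps(current, target, max_step_fraction=0.25):
    """Shrink from current to target in stages of at most max_step_fraction each."""
    steps = []
    size = current
    while size > target:
        reduction = max(1, math.floor(size * max_step_fraction))
        size = max(target, size - reduction)
        steps.append(size)
    return steps


if __name__ == "__main__":
    # Hypothetical history: mostly 3 cores in use, with occasional spikes to 6.
    history = [3.0] * 90 + [6.0] * 10
    target = recommend_capacity(history, current_cores=16)
    print("recommended cores:", target)
    print("rollout plan:", rollout_steps(16, target))
```

Rolling out in stages leaves time to watch SLOs between steps, which addresses the reluctance to touch a running system.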

Rearchitecture without fear

What they did in their legacy systems

  • Find out if your current workloads actually need the currently selected node types
  • Group jobs into batches
  • Even if the size is right: Check if you can switch to newer nodes with better price to performance
  • Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side-effects
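
To illustrate the last point, scaling on actual load means deriving the replica count directly from a load signal such as queue length, rather than from a side effect like CPU usage. The Python sketch below shows the general shape of that calculation. It is not KEDA's API or configuration; the function name, the queue-based metric and the limits are hypothetical.

```python
import math


def desired_replicas(queue_length, target_per_replica, min_replicas=0, max_replicas=50):
    """Derive a replica count from the load signal itself.

    queue_length:        pending items right now (the actual load)
    target_per_replica:  how many pending items one replica should handle
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    # Clamp to the configured bounds so a burst cannot scale without limit.
    return max(min_replicas, min(wanted, max_replicas))


if __name__ == "__main__":
    for pending in (0, 95, 10_000):
        print(pending, "pending ->", desired_replicas(pending, target_per_replica=10))
```

With a minimum of zero, an empty queue lets the workload scale down completely, which is one way idle capacity disappears without a manual cleanup.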