The Self-Improving Platform: Closing the Loop Between Telemetry and Tuning

TODO: Copy repo link for samples

The statistics in these talks are based on a survey of multiple companies, focusing on those that build and run applications.

Baseline

  • The golden path for devs usually ends at deploying their app; it does not cover day-2 operations or monitoring
  • Most platform teams only provide metrics and basic dashboards, but no alerts or key health indicators

Observations regarding stakeholders

  • Stakeholders
    • ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
    • Only ~18% have a dedicated SRE team that couples applications to the platform
  • Ownership: over 50% of companies use a shared ownership model, which tends to end in a "not my problem" attitude
  • Priorities
    • Product team: Ship features fast (a dollar spent on R&D is worth more than one saved)
    • SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
    • FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
  • Conflict: Cost saving (FinOps) vs. safety (SRE) when it comes to overprovisioning
  • 75% of interviewees use Kubernetes, with over 50% using the JVM as their runtime

Pain points

  • Main focus: Cost vs performance
  • Side note: Reliability
  • Result: We need a flexible path that can distinguish between
    • User-facing apps: Performance first
    • Critical apps: Reliability first
    • Non-critical apps: Reduce cost
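
One way to picture such a flexible path is a lookup from an app class to a tuning profile. The following is only a minimal sketch: the class names mirror the three categories above, but `TuningProfile`, its fields, and all numeric values are hypothetical and were not part of the talk.

```python
from dataclasses import dataclass
from enum import Enum


class AppClass(Enum):
    USER_FACING = "user-facing"
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


@dataclass(frozen=True)
class TuningProfile:
    goal: str               # what the optimizer should favour
    memory_headroom: float  # fraction kept free above the observed peak
    min_replicas: int       # lower bound for horizontal scaling


# Hypothetical values, chosen only to show the ordering of priorities.
PROFILES = {
    AppClass.USER_FACING: TuningProfile("performance", 0.30, 2),
    AppClass.CRITICAL: TuningProfile("reliability", 0.40, 3),
    AppClass.NON_CRITICAL: TuningProfile("cost", 0.10, 1),
}


def profile_for(label: str) -> TuningProfile:
    """Map a workload label to its profile.

    Unknown labels fall back to the most conservative (critical) profile.
    """
    try:
        return PROFILES[AppClass(label)]
    except ValueError:
        return PROFILES[AppClass.CRITICAL]
```

Falling back to the critical profile for unlabeled workloads is a design assumption here: it trades some cost for safety until someone classifies the app.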

Optimization

  • Tuning: Only 18% are tuning both their containers and their runtime
  • We need a full-stack approach:
    • Don’t just increase pod resources but also update things like the heap size in your runtime
    • Use HPA to scale once you have right-sized your pod and runtime
    • Get to know your per-node usage to improve node autoscaling
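
To illustrate the "pod plus runtime" point, here is a small sketch that derives JVM heap settings from a container memory limit so the two are changed together. The function name and the 75% default are assumptions for illustration; `-XX:MaxRAMPercentage` and the `JAVA_TOOL_OPTIONS` environment variable are standard HotSpot mechanisms, but the right percentage depends on how much non-heap memory (metaspace, threads, direct buffers) the app needs.

```python
def jvm_memory_settings(container_limit_mib: int, heap_percent: float = 75.0) -> dict:
    """Derive JVM heap settings from a container memory limit.

    heap_percent is an assumed starting point, not a recommendation
    from the talk; the remainder is left for non-heap memory.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < heap_percent < 100:
        raise ValueError("heap_percent must be between 0 and 100")

    heap_mib = int(container_limit_mib * heap_percent / 100)
    return {
        # Value for resources.limits.memory in the pod spec
        "container_limit": f"{container_limit_mib}Mi",
        "heap_mib": heap_mib,
        "non_heap_mib": container_limit_mib - heap_mib,
        # Value for the JAVA_TOOL_OPTIONS env var of the container
        "java_tool_options": f"-XX:MaxRAMPercentage={heap_percent:.1f}",
    }


settings = jvm_memory_settings(1024)
print(settings["container_limit"], settings["java_tool_options"])
```

Using a percentage instead of a fixed `-Xmx` means a later change to the pod limit carries over to the heap without a second edit.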

Building a continuous automation layer

  • Telemetry: Import Metrics
  • Analysis with tuning profiles (historic data) for optimizations
  • GitOps for automatic PR creation and previews
  • Sample Architecture:
    • Import: OTel metrics into Prometheus
    • Visualize: Grafana
    • Analyze: CronJob that collects the last 30 minutes of metrics
    • Optimize: Run the analyzed metrics against policies (e.g. "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)
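
The "20% headroom for memory" policy from the Optimize step can be sketched roughly as follows. The speakers implemented their policies in OPA; this Python version is only a stand-in to show the arithmetic. The function names, the 10% tolerance, and the reading of "headroom" as peak usage plus 20% are all assumptions.

```python
import math


def recommend_memory_mib(samples_mib, headroom=0.20):
    """Recommend a memory limit: observed peak plus the headroom fraction.

    samples_mib would be the memory usage values collected for the
    analysis window (e.g. the last 30 minutes).
    """
    if not samples_mib:
        raise ValueError("need at least one sample")
    peak = max(samples_mib)
    return math.ceil(peak * (1 + headroom))


def needs_pr(current_limit_mib, recommended_mib, tolerance=0.10):
    """Only propose a change if the limit is off by more than the tolerance.

    The tolerance (hypothetical) avoids a stream of PRs for tiny adjustments.
    """
    deviation = abs(current_limit_mib - recommended_mib) / current_limit_mib
    return deviation > tolerance


usage = [400, 512, 480]
recommended = recommend_memory_mib(usage)
if needs_pr(current_limit_mib=1024, recommended_mib=recommended):
    print(f"Propose PR: set memory limit to {recommended}Mi")
```

In the architecture above, the result of such a check would feed the GitOps step, which opens the PR for a human to review.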

TODO: Steal image from slides

Wrap-up

  • Automated optimization with a human in the loop to keep the experts in touch and enable fast but secure changes
  • Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
  • Optimization is a domino effect: The right foundations enable better future decisions