The Self-Improving Platform: Closing the Loop Between Telemetry and Tuning

The statistics in this talk are based on a survey of multiple companies, focused on those that build and run applications

Baseline

Usually the golden path for devs only goes as far as deploying their app and does not cover day 2 operations/monitoring
Most platform teams just provide metrics and basic dashboards, but no alerts or key health indicators

Stakeholders
- ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
- only ~18% have a dedicated SRE team that couples applications to the platform
Ownership: over 50% of companies use a shared ownership model -> "Not my problem"
Priorities
- Product Team: Ship features fast (a dollar spent on R&D is worth more than a dollar saved)
- SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
- FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
Conflict: Cost saving (FinOps) vs Safety (SRE) when it comes to overprovisioning
75% of interviewees use Kubernetes, with over 50% using the JVM as their runtime

Main focus: Cost vs performance
Side-note: Reliability
Result: We need a flexible path that can distinguish between
- User facing app: Performance first
- Critical app: Reliability first
- Non-critical apps: Reduce cost
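
The three classes above could be expressed as a small lookup that an optimizer consults before touching a workload. This is only a sketch: the class labels, the `Goal` names and the `primary_goal` helper are hypothetical and were not shown in the talk.

```python
from enum import Enum


class AppClass(Enum):
    # Hypothetical labels, e.g. set as an annotation on the workload.
    USER_FACING = "user-facing"
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


class Goal(Enum):
    PERFORMANCE = "performance"
    RELIABILITY = "reliability"
    COST = "cost"


# Mapping taken from the list above: which goal wins for which class.
PRIMARY_GOAL = {
    AppClass.USER_FACING: Goal.PERFORMANCE,
    AppClass.CRITICAL: Goal.RELIABILITY,
    AppClass.NON_CRITICAL: Goal.COST,
}


def primary_goal(label: str) -> Goal:
    """Resolve a class label to its primary goal.

    Unknown labels raise ValueError instead of silently picking a goal.
    """
    return PRIMARY_GOAL[AppClass(label)]


if __name__ == "__main__":
    print(primary_goal("non-critical").value)
```

Failing loudly on an unlabeled app is a design choice of this sketch; a platform could just as well fall back to a conservative default.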

Tuning: Only 18% are tuning their container and runtime
We need a full stack approach:
- Don’t just increase pod resources but also update things like the heap-size in your runtime
- Use HPA to scale once you have right-sized your pod+runtime
- Get to know your per-node usage to improve node autoscaling
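
To illustrate the first point, keeping the JVM heap in step with the pod's memory limit, here is a minimal sketch. The 75% heap fraction and both helper names are assumptions for illustration, not values from the talk; the right ratio depends on how much non-heap memory the app needs.

```python
def parse_memory_mib(quantity: str) -> int:
    """Convert a Kubernetes memory quantity to MiB.

    Only the Mi and Gi suffixes are handled in this sketch.
    """
    units = {"Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def heap_flag(container_limit: str, heap_fraction: float = 0.75) -> str:
    """Derive a JVM -Xmx flag from the container memory limit.

    heap_fraction is an assumed default; the remainder is left for
    metaspace, thread stacks and other off-heap memory.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(parse_memory_mib(container_limit) * heap_fraction)
    return f"-Xmx{heap_mib}m"


if __name__ == "__main__":
    # Raising the pod limit from 1Gi to 2Gi also raises the heap.
    print(heap_flag("1Gi"))
    print(heap_flag("2Gi"))
```

The JVM's `-XX:MaxRAMPercentage` flag is another way to express the same ratio without recomputing an absolute value on every resize.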

Telemetry: Import Metrics
Analysis with tuning profiles (historic data) for optimizations
GitOps for automatic PR creation and previews
Sample Architecture:
- Import: OTEL metrics into Prometheus
- Visualize: Grafana
- Analyze: CronJob that collects the last 30 minutes of metrics
- Optimize: Run the analyzed metrics against policies (like "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)
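
The analyze and optimize steps can be sketched in plain Python. Note that this swaps a simple function in for the OPA policy used in the talk. It assumes "20% headroom" means the request should sit 20% above the observed peak, and it assumes a 10% tolerance so that small drifts do not trigger a PR; the function name and parameters are hypothetical.

```python
import math
from typing import Optional, Sequence


def recommend_memory_request(
    samples_mib: Sequence[int],
    current_request_mib: int,
    headroom_percent: int = 20,
    tolerance_percent: int = 10,
) -> Optional[int]:
    """Return a new memory request in MiB, or None if no change is needed.

    samples_mib stands in for the memory usage collected over the last
    30 minutes (fetching it from Prometheus is out of scope here).
    """
    if not samples_mib:
        raise ValueError("no samples collected")
    if current_request_mib <= 0:
        raise ValueError("current request must be positive")

    # Policy: observed peak plus the configured headroom.
    peak = max(samples_mib)
    proposed = math.ceil(peak * (100 + headroom_percent) / 100)

    # Only propose a change (i.e. open a PR) if the drift is significant.
    drift_percent = abs(proposed - current_request_mib) * 100 / current_request_mib
    if drift_percent <= tolerance_percent:
        return None
    return proposed


if __name__ == "__main__":
    usage = [400, 450, 500]
    print(recommend_memory_request(usage, current_request_mib=1024))
```

A non-`None` result is what would feed the GitOps step: the value is written into the manifest and proposed as a PR for a human to review.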

TODO: Steal image from slides

Automated optimization with human in the loop to keep the experts in touch and enable fast but secure changes
Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
Optimization is a domino effect: The right foundations enable better future decisions