The Self-Improving Platform: Closing the Loop Between Telemetry and Tuning
TODO: Copy repo link for samples
The statistics in this talk are based on a survey of multiple companies, focused on those that build and run applications
Baseline
- The golden path for developers usually ends at deploying the app; it does not cover day-2 operations or monitoring
- Most platform teams provide only metrics and basic dashboards, with no alerts or key health indicators
Observations regarding stakeholders
- Stakeholders
  - ~43% of companies have a dedicated platform team; the rest rely on a mixed team or shared effort
  - Only ~18% have a dedicated SRE team that couples applications to the platform
  - Ownership: over 50% of companies use a shared ownership model, which tends to end in a "not my problem" attitude
- Priorities
  - Product team: Ship features fast (a dollar spent on R&D is worth more than one saved)
  - SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
  - FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
  - Conflict: Cost saving (FinOps) vs. safety (SRE) when it comes to overprovisioning
- 75% of interviewees use Kubernetes, with over 50% using the JVM as their runtime
Pain points
- Main focus: Cost vs performance
- Side note: Reliability
- Result: We need a flexible path that can discern between
  - User-facing apps: Performance first
  - Critical apps: Reliability first
  - Non-critical apps: Reduce cost
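
One way to picture such a flexible path is a lookup from workload class to tuning profile. The sketch below is hypothetical: the three classes mirror the categories above, while `TuningProfile`, `profile_for`, the fallback behaviour, and every numeric value are assumptions made for illustration, not figures from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class WorkloadClass(Enum):
    USER_FACING = "user-facing"
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


@dataclass(frozen=True)
class TuningProfile:
    goal: str              # what the optimizer prioritizes for this class
    memory_headroom: float  # fraction of observed peak kept as a buffer
    min_replicas: int


# Hypothetical values, chosen only to show how the priorities differ.
PROFILES = {
    WorkloadClass.USER_FACING: TuningProfile("performance", 0.30, 2),
    WorkloadClass.CRITICAL: TuningProfile("reliability", 0.50, 3),
    WorkloadClass.NON_CRITICAL: TuningProfile("cost", 0.10, 1),
}


def profile_for(label: str) -> TuningProfile:
    """Resolve a workload label to its profile.

    Unknown labels fall back to the reliability-first profile, on the
    assumption that over-protecting an unclassified app is the safer mistake.
    """
    try:
        return PROFILES[WorkloadClass(label)]
    except ValueError:
        return PROFILES[WorkloadClass.CRITICAL]
```

A label on the workload (for example an annotation set by the owning team) could then select the profile, so the same automation serves all three kinds of app.
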
Optimization
- Tuning: Only 18% are tuning their container and runtime
- We need a full-stack approach:
  - Don’t just increase pod resources; also update runtime settings such as the heap size
  - Use HPA to scale once you have right-sized your pod and runtime
  - Get to know your per-node usage to improve node autoscaling
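
To illustrate the first point, the sketch below derives the JVM heap flag from the container memory limit, so that changing the pod limit also changes the heap. The function names and the 75% heap fraction are assumptions for this example; `-Xmx` is the standard JVM maximum-heap flag.

```python
def jvm_heap_flag(container_limit_mib: int, heap_fraction: float = 0.75) -> str:
    """Return an -Xmx flag sized as a fraction of the container memory limit.

    heap_fraction is an assumed default; the remainder is left for metaspace,
    thread stacks, and other off-heap memory.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(container_limit_mib * heap_fraction)
    return f"-Xmx{heap_mib}m"


def resize(new_limit_mib: int) -> dict:
    """Produce the pod memory limit and the matching JVM option together."""
    return {
        "memory_limit": f"{new_limit_mib}Mi",
        "java_opts": jvm_heap_flag(new_limit_mib),
    }


if __name__ == "__main__":
    print(resize(2048))
```

Recent JVMs also accept `-XX:MaxRAMPercentage`, which expresses the same coupling as a percentage instead of an absolute value.
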
Building a continuous automation layer
- Telemetry: Import Metrics
- Analysis with tuning profiles (historic data) for optimizations
- GitOps for automatic PR creation and previews
- Sample architecture:
  - Import: OpenTelemetry (OTel) metrics into Prometheus
  - Visualize: Grafana
  - Analyze: CronJob that collects the last 30 minutes of metrics
  - Optimize: Run the analyzed metrics against policies (e.g. "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)
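
The Optimize step can be sketched as a small policy check. The speakers implemented their policies in OPA; the plain-Python version below is only a stand-in to show the arithmetic. `MemoryPolicy`, `recommend_limit_mib`, and the 10% tolerance are assumptions; the 20% headroom comes from the example policy above.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MemoryPolicy:
    headroom_pct: int = 20   # "I want 20% headroom for memory"
    tolerance_pct: int = 10  # assumption: skip changes smaller than 10% of the current limit


def recommend_limit_mib(
    samples_mib: Sequence[float],
    current_limit_mib: int,
    policy: MemoryPolicy = MemoryPolicy(),
) -> Optional[int]:
    """Return a new memory limit in MiB, or None if no change is warranted.

    samples_mib stands for the memory usage samples collected from the
    analysis window (e.g. the last 30 minutes).
    """
    if not samples_mib:
        return None  # no data, no recommendation
    peak = max(samples_mib)
    target = math.ceil(peak * (100 + policy.headroom_pct) / 100)
    # Ignore small deviations so the automation does not open noisy PRs.
    if abs(target - current_limit_mib) * 100 <= current_limit_mib * policy.tolerance_pct:
        return None
    return target


if __name__ == "__main__":
    print(recommend_limit_mib([400, 500, 450], current_limit_mib=1024))
```

A non-`None` result is what would feed the GitOps step: the automation proposes the new value in a PR and a human decides whether to merge it.
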
TODO: Steal image from slides
Wrap-up
- Automated optimization with human in the loop to keep the experts in touch and enable fast but secure changes
- Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
- Optimization is a domino effect: The right foundations enable better future decisions