Finding zombies in our systems: A real-world story of CPU bottlenecks
After a network outage crisis, Pinterest's ML Platform team discovered that high CPU usage by the Kubernetes agent was causing critical Ray training job failures. The team's deep profiling strategy revealed a rarely seen flaw in how Kubelet was handling memory cgroup iterations...





