Modal unpacked how it keeps a 20,000+ GPU fleet sane across AWS, GCP, Azure, and OCI. Think autoscaling, yes, but with some serious moves behind the curtain.
They're benchmarking instances, enforcing machine image consistency, running boot-time checks, and tracking GPU health both passively and actively. Sick GPUs get quarantined. The whole thing’s wired up with image testing and auto-failover.
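
To make the quarantine idea concrete, here's a rough Python sketch of how passive and active health signals might feed one quarantine set. This is not Modal's code: the `GpuFleet` class, the `max_passive_errors` threshold, the probe callables, and the GPU IDs are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class GpuFleet:
    """Hypothetical tracker; not based on any real Modal API."""

    # Assumed threshold: passive error events tolerated before quarantine.
    max_passive_errors: int = 3
    passive_errors: Dict[str, int] = field(default_factory=dict)
    quarantined: Set[str] = field(default_factory=set)

    def record_passive_error(self, gpu_id: str) -> None:
        """Passive check: count an error observed during normal operation."""
        self.passive_errors[gpu_id] = self.passive_errors.get(gpu_id, 0) + 1
        if self.passive_errors[gpu_id] >= self.max_passive_errors:
            self.quarantined.add(gpu_id)

    def run_active_check(self, gpu_id: str, probe: Callable[[str], bool]) -> bool:
        """Active check: run a diagnostic probe; quarantine the GPU if it fails."""
        healthy = probe(gpu_id)
        if not healthy:
            self.quarantined.add(gpu_id)
        return healthy

    def schedulable(self, gpu_ids: List[str]) -> List[str]:
        """Return the GPUs still eligible for new work, in the order given."""
        return [g for g in gpu_ids if g not in self.quarantined]


fleet = GpuFleet()
for _ in range(3):
    fleet.record_passive_error("gpu-a")  # third error hits the threshold
fleet.run_active_check("gpu-b", probe=lambda gpu_id: False)  # failing probe
fleet.run_active_check("gpu-c", probe=lambda gpu_id: True)   # passing probe
print(fleet.schedulable(["gpu-a", "gpu-b", "gpu-c"]))  # prints ['gpu-c']
```

In this sketch either signal alone is enough to pull a GPU out of scheduling. A real fleet would presumably also need a path back out of quarantine, which is left out here.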










