Datadog made PostgreSQL failover safer by treating replica lag as the promotion gate. A zonal-failure gameday showed that detection and automation could not protect the database if the standby sat behind the primary. The team added lag-aware checks, clearer operator signals, and failure drills so engineers could fail over with a known data-loss boundary.










