AI Agent Benchmarks are Broken
Ah,WebArena—where getting math wrong gets a pass. Out of ten benchmarks, eight stumbled in spectacular style, misjudging things by a staggering100%. Enter theAI Benchmark Checklist (ABC), a 43-point lifeline designed to yank these tests out of the abyss and show what AI can actually do...