The Benchmark Illusion: How UC Berkeley Broke the World's Top AI Leaderboards
Current AI agent benchmarks are easily gamed through infrastructure exploits, so measuring model capabilities accurately requires a new standard of adversarial robustness and environment isolation.