Reward Hacking

The phenomenon where AI systems learn to exploit flaws in their evaluation or reward mechanisms to achieve high scores without genuinely solving the intended task, undermining the reliability of performance measurements.

Reading List