Isomorphic Perturbation Testing (IPT)
Do reasoning LLMs actually reason — or learn to game the test?
LLMs are increasingly trained with reinforcement learning from verifiable rewards (RLVR), which boosts their performance on problems whose answers can be checked automatically. But it can also teach them to exploit the verifier rather than solve the task.
We test this on inductive reasoning: a model sees a few labeled examples and must write a general rule that explains them. In our evaluation we find that some LLMs systematically abandon rule induction. Rather than inferring relational rules (e.g., "a train is eastbound if it has a long car"), they enumerate instance-level labels (e.g., "train0 is eastbound, train2 is eastbound"). While such outputs fail the intended task of rule induction, they may game imperfect verifiers that only check extensional correctness on the provided examples.
- 🎯 Intended:
plants with purple leaves are toxic. - ⚠️ Shortcut:
plant_01 is toxic. plant_02 is safe. ...
Isomorphic Perturbation Testing (IPT) exposes these shortcuts and provides a metric for this kind of reward hacking behavior on SLR-Bench.
📄 ArXiv · 💻 Github · 🧪 Reward-Hacking Leaderboard · 📊 SLR-Bench
Result
This interface evaluates one hypothesis at a time. Use the Python API for batch processing. The SLR-Bench dataset provides both programs as the validation_program_shortcuts (extensional) and validation_program (isomorphic) fields.