oersted 9 days ago

From the DeepSeek paper, they did try this but found that the model would learn to cheat the judge. It doesn't seem impossible, just a serious challenge.

> We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
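
For context, the reward R1-Zero does use is rule-based: an accuracy check on the final answer plus a format check on the <think> tags. A rough sketch of the contrast with a learned judge, purely illustrative (the \boxed extraction, the 0.1 bonus, and the reward model's .score() interface are my own stand-ins, not from the paper):

    import re

    def rule_based_reward(response: str, ground_truth: str) -> float:
        """R1-Zero-style rule-based reward: exact-match accuracy plus a small format bonus."""
        match = re.search(r"\\boxed\{(.+?)\}", response)
        accuracy = 1.0 if match and match.group(1).strip() == ground_truth else 0.0
        format_bonus = 0.1 if re.search(r"<think>.+?</think>", response, re.S) else 0.0
        return accuracy + format_bonus

    def learned_reward(response: str, reward_model) -> float:
        """Neural judge: a scalar from a trained model. At scale, the policy can drift
        toward outputs the judge overrates rather than outputs that are actually better."""
        return float(reward_model.score(response))  # assumed interface, not a real API

    # A correct, well-formatted response scores 1.1 under the rule-based reward.
    print(rule_based_reward("<think>2 + 2 = 4</think> Final answer: \\boxed{4}", "4"))

The rule-based version only pays out on an exact match, which leaves little for the policy to exploit; the learned judge pays out whatever its weights say, and that surface is exactly what a large-scale RL run ends up probing.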

  • devmor 9 days ago

    To me, this is one of the most frustrating parts of this type of ML. If we could actually track the steps the LLM takes internally, it would be trivial for a judge to evaluate each intermediate step and detect when reward hacking is taking place.

    I wonder if there's any alternative other than trying to build the perfect judge for every single test case.
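
    What I'm describing is roughly a process-style judge: score each intermediate step instead of only the final answer. A hedged sketch of the shape of it (the line-based step splitting and the judge interface are assumptions for illustration, not anyone's actual implementation):

        from typing import Callable, List

        def split_steps(reasoning: str) -> List[str]:
            """Naive splitter: treat each non-empty line of the trace as a step."""
            return [line.strip() for line in reasoning.splitlines() if line.strip()]

        def process_reward(reasoning: str, judge: Callable[[str, List[str]], float]) -> float:
            """Average per-step judge scores, each step scored in the context of the steps
            before it. The catch: the visible trace may not be the computation that
            actually produced the answer."""
            steps = split_steps(reasoning)
            if not steps:
                return 0.0
            return sum(judge(step, steps[:i]) for i, step in enumerate(steps)) / len(steps)

        # Toy judge that penalizes steps which mention the test harness.
        toy_judge = lambda step, prior: 0.0 if "test" in step.lower() else 1.0
        print(process_reward("Factor the expression.\nCheck what the test expects.\nReturn 42.", toy_judge))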

    • oersted 9 days ago

      There's a recent paper from Anthropic on how the reasoning output does not necessarily reflect what the LLM is actually "thinking":

      https://arxiv.org/abs/2305.04388

      That being said, your idea is not unreasonable. The way DeepSeek phrased it, it sounds like implementing such solutions would have been a hassle that greatly increases the complexity of the training pipeline, and they were focused on making an RL baseline work at scale.

      • devmor 9 days ago

        I was actually thinking of that paper when I wrote that comment, hence the frustration that we don't actually know the intermediate steps.

        Still, the stepped output we do get might hint at that kind of "cheating" and could be used in reinforcement... or perhaps that kind of reinforcement would just make the LLMs better at cheating. Either way, the problem is definitely a lot more complex than the trivial way I framed it.
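
        The naive version of folding that into the reward would look something like this (the phrase-matching monitor is a trivial stand-in, not a real detector), with the obvious failure mode that the policy just learns to hide whatever the monitor looks for:

            def monitored_reward(response: str, task_reward: float, penalty: float = 0.5) -> float:
                """Subtract a penalty when a monitor flags the visible trace. Optimizing
                against any such monitor risks teaching the policy to evade it rather
                than to stop cheating."""
                suspicious = ("the test expects", "hardcode", "bypass the check")
                flagged = any(phrase in response.lower() for phrase in suspicious)
                return task_reward - penalty if flagged else task_reward

            print(monitored_reward("I'll just hardcode the expected value.", task_reward=1.0))  # 0.5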