From the DeepSeek paper, they did try it, but found that the model would learn to cheat the judge. It doesn't seem impossible, but it's probably a serious challenge.
> We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
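If I recall the paper correctly, what they used instead was rule-based rewards: checking the final answer against a verifiable ground truth, plus a format check. A minimal sketch of that kind of reward, in my own words rather than their code (the tag conventions, weights, and `\boxed{}` extraction here are my guesses):

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward in the spirit of what the R1 paper describes:
    no learned judge, just deterministic checks, so there is no reward model
    for the policy to hack. Details are illustrative, not DeepSeek's code."""
    reward = 0.0

    # Format reward: reasoning wrapped in <think>...</think> tags.
    if re.search(r"<think>.*</think>", completion, re.DOTALL):
        reward += 0.5

    # Accuracy reward: compare the extracted final answer against a known
    # ground truth (only works for verifiable tasks like math or code).
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward
```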
To me, this is one of the most frustrating parts of this type of ML. If we could actually track the steps taken inside the LLM, it would be trivial for the judge to evaluate each intermediate step and detect when reward hacking is taking place.
I wonder if there's any alternative other than trying to build the perfect judge for every single test case.
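Concretely, the step-level judging I have in mind would look something like this (purely a hypothetical sketch; `judge_step` stands in for whatever model or checker scores an individual reasoning step):

```python
def score_reasoning_trace(steps: list[str], judge_step) -> float:
    """Score a chain-of-thought step by step instead of only judging the
    final answer. `judge_step` is a hypothetical callable returning a
    score in [0, 1] for a single step given the steps that precede it."""
    if not steps:
        return 0.0
    total = 0.0
    for i, step in enumerate(steps):
        # Each step is judged in the context of everything before it,
        # so an unjustified leap (or a hacked shortcut) should score low.
        total += judge_step(context=steps[:i], step=step)
    return total / len(steps)
```

The obvious catch is that a learned `judge_step` is itself a reward model the policy could learn to exploit, which is exactly the failure mode DeepSeek cites.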
There's a recent paper from Anthropic on how the reasoning output does not necessarily reflect what the LLM is actually "thinking":
https://arxiv.org/abs/2305.04388
That being said, your idea is not unreasonable. The way DeepSeek phrased it, it just sounds like implementing such a solution would be a hassle that greatly increases pipeline complexity, and they were focused on making an RL baseline work at scale.
I was actually thinking of that paper when I wrote that comment, hence the frustration that we don't actually know the intermediate steps.
Still, perhaps the stepped output we do get hints at that kind of "cheating" and could be used in reinforcement... or perhaps that kind of reinforcement would just make the LLMs better at cheating. Either way, the problem is definitely a lot more complex than the trivial way I framed it.
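To make the "use the stepped output in reinforcement" idea concrete, I'm imagining something like blending the verifiable outcome reward with a step-level judge score. This is only a thought-experiment sketch; the weighting and the per-step scores are hypothetical:

```python
def combined_reward(outcome_reward: float,
                    step_scores: list[float],
                    step_weight: float = 0.3) -> float:
    """Blend a verifiable outcome reward with per-step judge scores.
    The risk, as noted above, is that optimizing against the step judge
    may just teach the policy to write steps the judge likes, i.e. it
    could make the model a better cheater rather than a better reasoner."""
    step_term = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return (1.0 - step_weight) * outcome_reward + step_weight * step_term
```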