**Jie Cheng, Lijun Li, Gang Xiong, Jing Shao, Yisheng Lv, Fei-Yue Wang**
🗒️ Notion blog, 👨‍💻 GitHub, 🤗 HF Model, 📈 Wandb Logs, 🔎 Wandb Report
— code, checkpoints, wandb logs: Feb 9, 2025
— notion blog: Feb 22, 2025
Pass@1 accuracy, tested with greedy decoding:
Models | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
---|---|---|---|---|---|---|
Qwen2.5-Math-7B | 13.3 | 71.8 | 47.5 | 29.8 | 35.1 | 39.5 |
Qwen2.5-Math-7B-Instruct | 16.7 | 83.2 | 52.5 | 37.5 | 41.3 | 46.2 |
Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
Qwen2.5-7B-SimpleRL-Zero | 33.3 | 77.2 | 62.5 | 33.5 | 37.6 | 48.8 |
Qwen-DPO-R1-Zero | 26.7 | 76.8 | 62.5 | 30.9 | 37.9 | 47.0 |
Qwen2.5-7B-PURE-PRM+VR | 20.0 | 82.6 | 82.5 | 37.1 | 44.1 | 53.3 |
Qwen2.5-7B-PURE-PRM | 16.7 | 81.8 | 60.0 | 38.2 | 44.7 | 49.3 |
Qwen2.5-7B-PURE-VR | 23.3 | 79.4 | 60.0 | 36.8 | 41.8 | 48.3 |
All models in the table are RL fine-tuned from Qwen2.5-Math-7B without SFT, except for Qwen2.5-Math-7B-Instruct and Eurus-2-7B-PRIME.
This month, we saw a huge boost in LLM reasoning power from verifiable reward (VR)-based reinforcement learning fine-tuning (ReFT). However, previous work has encountered challenges in exploring process reward models (PRMs), so we wonder:
How far can a PRM actually take us? How does it stack up against VR-based methods in reasoning performance and training cost?
To answer these questions, we present PURE (Process-sUpervised Reinforcement lEarning). Using Qwen2.5-Math-7B as the base model, we first train a PRM on the PRM800K dataset, and then fine-tune another Qwen2.5-Math-7B model without SFT using 8K MATH prompts, process rewards from the PRM, and optional verifiable rewards. Our framework supports multiple reward types: only process rewards (Qwen2.5-7B-PURE-PRM), only verifiable rewards (Qwen2.5-7B-PURE-VR), or a mix of both (Qwen2.5-7B-PURE-PRM+VR). Our findings are as follows:
<aside> 💡
- Qwen2.5-7B-PURE-PRM achieves an overall accuracy of 49.3%, slightly better than the 48.3% of Qwen2.5-7B-PURE-VR.
- Qwen2.5-7B-PURE-PRM+VR reaches an overall accuracy of 53.3%. To the best of our knowledge, this is the highest score so far based on Qwen2.5-Math-7B without SFT before RL fine-tuning.
- Fine-tuning with only process rewards eventually suffers from reward hacking; Qwen2.5-7B-PURE-PRM is the checkpoint taken before hacking happens. Moreover, mixing in only 1/10th VR mitigates reward hacking well, allowing us to stably fine-tune for 500 steps and obtain Qwen2.5-7B-PURE-PRM+VR.
</aside>
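To make the three reward configurations concrete, here is a minimal sketch of how rewards might be attached to a sampled solution. The `prm_score_steps` and `check_answer` functions are hypothetical stand-ins for the trained PRM and the rule-based verifier, and the exact 1/10 VR mixing scheme used for Qwen2.5-7B-PURE-PRM+VR is not modeled here.

```python
# Toy sketch of the three reward settings; not the actual PURE implementation.
from typing import List


def prm_score_steps(question: str, steps: List[str]) -> List[float]:
    """Hypothetical call into the trained PRM: one process reward per step."""
    raise NotImplementedError


def check_answer(final_answer: str, ground_truth: str) -> float:
    """Hypothetical rule-based verifier: +1 for a correct final answer, -1 otherwise."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else -1.0


def assign_rewards(question: str, steps: List[str], final_answer: str,
                   ground_truth: str, mode: str = "prm+vr") -> List[float]:
    """Return one reward per step under 'prm', 'vr', or 'prm+vr'."""
    n = len(steps)
    step_rewards = prm_score_steps(question, steps) if mode in ("prm", "prm+vr") else [0.0] * n
    if mode in ("vr", "prm+vr"):
        # The verifiable reward is sparse: attach it to the final step only.
        step_rewards[-1] += check_answer(final_answer, ground_truth)
    return step_rewards
```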
The role of the PRM is to assign a process reward to each step, judging its correctness. When evaluating PRMs with the Best-of-N (BoN) approach, prior work typically reduces the process rewards to an outcome-level score by taking their minimum or their product, and then chooses the answer with the highest outcome score. Assume a question $s_0$ and an answer with $n$ steps denoted as $(s_1,\cdots,s_n)$, where the process reward for step $i$ is $r_i$. These reduction choices suggest that, for reasoning tasks, the return of an answer should be more in line with $\min (r_1,\cdots,r_n)$ or $\prod_{i=1}^n r_i$ than with the discounted sum $\sum_{i=1}^n \gamma^{i-1} r_i$ most commonly used in reinforcement learning. We choose the minimum form as the return $R(s_0)$, because the product form cannot handle negative process rewards and drives $R(s_0)$ toward 0 when the number of steps is large.
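As a quick numeric illustration (with made-up process rewards, not values from our experiments), the snippet below compares the three reductions: the min form is dominated by the single incorrect step, the product flips sign on negative rewards and shrinks toward 0 as the number of steps grows, and the summed return can stay large even though one step is wrong.

```python
import math

# Made-up process rewards for a 6-step answer; step 3 is judged incorrect.
rewards = [0.9, 0.8, -0.5, 0.7, 0.9, 0.8]
gamma = 1.0  # no discounting, for simplicity

min_return = min(rewards)         # -0.5: flags the answer because of the wrong step
prod_return = math.prod(rewards)  # ~ -0.18: sign flips on negatives, magnitude decays with depth
sum_return = sum(gamma ** i * r for i, r in enumerate(rewards))  # 3.6: hides the wrong step

print(min_return, prod_return, sum_return)
```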
Extending the min-form return to any step, we have: