**Jie Cheng, Lijun Li, Gang Xiong, Jing Shao, Yisheng Lv, Fei-Yue Wang**
🗒️ Notion blog, 👨‍💻 GitHub, 🤗 HF Model, 📈 Wandb Logs, 🔎 Wandb Report
— code, checkpoints, wandb logs: Feb 9, 2025
— notion blog: Feb 22, 2025
Pass@1 accuracy, tested with greedy decoding:
Models | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
---|---|---|---|---|---|---|
Qwen2.5-Math-7B | 13.3 | 71.8 | 47.5 | 29.8 | 35.1 | 39.5 |
Qwen2.5-Math-7B-Instruct | 16.7 | 83.2 | 52.5 | 37.5 | 41.3 | 46.2 |
Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
Qwen2.5-7B-SimpleRL-Zero | 33.3 | 77.2 | 62.5 | 33.5 | 37.6 | 48.8 |
Qwen-DPO-R1-Zero | 26.7 | 76.8 | 62.5 | 30.9 | 37.9 | 47.0 |
Qwen2.5-7B-PURE-PRM+VR | 20.0 | 82.6 | 82.5 | 37.1 | 44.1 | 53.3 |
Qwen2.5-7B-PURE-PRM | 16.7 | 81.8 | 60.0 | 38.2 | 44.7 | 49.3 |
Qwen2.5-7B-PURE-VR | 23.3 | 79.4 | 60.0 | 36.8 | 41.8 | 48.3 |
All models in the table are RL fine-tuned from Qwen2.5-Math-7B without SFT, except for Qwen2.5-Math-7B-Instruct and Eurus-2-7B-PRIME.
This month, we saw a huge boost in LLM reasoning power from verifiable reward (VR)-based reinforcement learning fine-tuning (ReFT). However, previous work has encountered challenges in exploring process reward models (PRMs), so we wonder:
How far can a PRM actually take us? How does it stack up against VR-based methods in reasoning performance and training cost?
To answer these questions, we present PURE (Process-sUpervised Reinforcement lEarning). Using Qwen2.5-Math-7B as the base model, we first train a PRM on the PRM800K dataset, and then fine-tune another Qwen2.5-Math-7B model without SFT using 8K MATH prompts, process rewards from the PRM, and optional verifiable rewards. Our framework supports multiple reward types: only process rewards (Qwen2.5-7B-PURE-PRM), only verifiable rewards (Qwen2.5-7B-PURE-VR), or a mix of both (Qwen2.5-7B-PURE-PRM+VR). Our findings are as follows:
<aside> 💡
- Qwen2.5-7B-PURE-PRM achieves an overall accuracy of 49.3%, slightly better than the 48.3% of Qwen2.5-7B-PURE-VR.
- Qwen2.5-7B-PURE-PRM+VR reaches an overall accuracy of 53.3%. To the best of our knowledge, this is the highest score so far based on Qwen2.5-Math-7B without SFT before RL fine-tuning.
- Fine-tuning with only process rewards eventually suffers from reward hacking; Qwen2.5-7B-PURE-PRM is the checkpoint taken before hacking happens. Moreover, mixing in only 1/10th VR mitigates reward hacking well, allowing us to stably fine-tune for 500 steps and obtain Qwen2.5-7B-PURE-PRM+VR.
</aside>
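To make the three reward configurations concrete, here is a minimal sketch of how rewards might be attached to a sampled solution. The `prm_score_steps` and `check_answer` functions are hypothetical stand-ins for the trained PRM and the rule-based verifier, and the exact 1/10 VR mixing scheme used for Qwen2.5-7B-PURE-PRM+VR is not modeled here.

```python
# Toy sketch of the three reward settings; not the actual PURE implementation.
from typing import List


def prm_score_steps(question: str, steps: List[str]) -> List[float]:
    """Hypothetical call into the trained PRM: one process reward per step."""
    raise NotImplementedError


def check_answer(final_answer: str, ground_truth: str) -> float:
    """Hypothetical rule-based verifier: +1 for a correct final answer, -1 otherwise."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else -1.0


def assign_rewards(question: str, steps: List[str], final_answer: str,
                   ground_truth: str, mode: str = "prm+vr") -> List[float]:
    """Return one reward per step under 'prm', 'vr', or 'prm+vr'."""
    n = len(steps)
    step_rewards = prm_score_steps(question, steps) if mode in ("prm", "prm+vr") else [0.0] * n
    if mode in ("vr", "prm+vr"):
        # The verifiable reward is sparse: attach it to the final step only.
        step_rewards[-1] += check_answer(final_answer, ground_truth)
    return step_rewards
```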
The role of the PRM is to assign a process reward to each step, judging its correctness. When evaluating PRMs with the Best-of-N (BoN) approach, prior work typically reduces the process rewards to an outcome-level score by taking their minimum or their product, and then chooses the answer with the highest outcome score. Assume a question $s_0$ and an answer with $n$ steps denoted as $(s_1,\cdots,s_n)$, where the process reward for step $i$ is $r_i$. These reduction choices suggest that, for reasoning tasks, the return of an answer should be more in line with $\min (r_1,\cdots,r_n)$ or $\prod_{i=1}^n r_i$ than with the discounted sum $\sum_{i=1}^n \gamma^{i-1} r_i$ most commonly used in reinforcement learning. We choose the minimum form as the return $R(s_0)$, because the product form cannot handle negative process rewards and drives $R(s_0)$ toward 0 when the number of steps is large.
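As a quick numeric illustration (with made-up process rewards, not values from our experiments), the snippet below compares the three reductions: the min form is dominated by the single incorrect step, the product flips sign on negative rewards and shrinks toward 0 as the number of steps grows, and the summed return can stay large even though one step is wrong.

```python
import math

# Made-up process rewards for a 6-step answer; step 3 is judged incorrect.
rewards = [0.9, 0.8, -0.5, 0.7, 0.9, 0.8]
gamma = 1.0  # no discounting, for simplicity

min_return = min(rewards)         # -0.5: flags the answer because of the wrong step
prod_return = math.prod(rewards)  # ~ -0.18: sign flips on negatives, magnitude decays with depth
sum_return = sum(gamma ** i * r for i, r in enumerate(rewards))  # 3.6: hides the wrong step

print(min_return, prod_return, sum_return)
```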
Extending the min-form return to any step, we have: