**Jie Cheng, Lijun Li, Gang Xiong, Jing Shao, Yisheng Lv, Fei-Yue Wang**

🗒️ Notion blog, 👨‍💻 Github, 🤗 HF Model, 📈 Wandb Logs, 🔎 Wandb Report

- Code, checkpoints, and wandb logs released: Feb 9, 2025
- Notion blog released: Feb 22, 2025

Pass@1 accuracy, tested with greedy decoding:

| Models | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-7B | 13.3 | 71.8 | 47.5 | 29.8 | 35.1 | 39.5 |
| Qwen2.5-Math-7B-Instruct | 16.7 | 83.2 | 52.5 | 37.5 | 41.3 | 46.2 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL-Zero | 33.3 | 77.2 | 62.5 | 33.5 | 37.6 | 48.8 |
| Qwen-DPO-R1-Zero | 26.7 | 76.8 | 62.5 | 30.9 | 37.9 | 47.0 |
| Qwen2.5-7B-PURE-PRM+VR | 20.0 | 82.6 | 82.5 | 37.1 | 44.1 | 53.3 |
| Qwen2.5-7B-PURE-PRM | 16.7 | 81.8 | 60.0 | 38.2 | 44.7 | 49.3 |
| Qwen2.5-7B-PURE-VR | 23.3 | 79.4 | 60.0 | 36.8 | 41.8 | 48.3 |

All models in the table are RL fine-tuned from Qwen2.5-Math-7B without SFT, except for Qwen2.5-Math-7B-Instruct and Eurus-2-7B-PRIME.


Introduction

This month, we have seen a huge boost in LLM reasoning ability from verifiable reward (VR)-based reinforcement learning fine-tuning (ReFT). However, previous work has encountered challenges in exploring process reward models (PRMs), so we wonder:

How far can PRMs actually take us? How do they stack up against VR-based methods in terms of reasoning performance and training cost?

To answer these questions, we present PURE (Process-sUpervised Reinforcement lEarning). Using Qwen2.5-Math-7B as the base model, we first train a PRM on the PRM800K dataset, and then fine-tune another Qwen2.5-Math-7B model, without SFT, using 8K MATH prompts, process rewards from the PRM, and optional verifiable rewards. Our framework supports multiple reward types: process reward only (Qwen2.5-7B-PURE-PRM), verifiable reward only (Qwen2.5-7B-PURE-VR), or a mix of both (Qwen2.5-7B-PURE-PRM+VR), as sketched below.
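As a minimal illustration (the structure and names below are our own hypothetical sketch, not the released implementation), the three variants differ only in which reward signals enter the RL fine-tuning stage:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    """Hypothetical config: which reward signals are used during RL fine-tuning."""
    use_process_reward: bool     # per-step rewards scored by the trained PRM
    use_verifiable_reward: bool  # rule-based check of the final answer


# The three variants reported in the table above.
PURE_PRM = RewardConfig(use_process_reward=True, use_verifiable_reward=False)
PURE_VR = RewardConfig(use_process_reward=False, use_verifiable_reward=True)
PURE_PRM_VR = RewardConfig(use_process_reward=True, use_verifiable_reward=True)
```

Our findings are as follows: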



Method

Credit Assignment

The role of the PRM is to assign a process reward to each step, judging its correctness. When evaluating PRMs with the Best-of-N (BoN) approach, prior work typically reduces the process rewards to an outcome-level score by taking their minimum or their product, and then chooses the answer with the highest outcome score. Assume a question $s_0$ and an answer with $n$ steps denoted $(s_1,\cdots,s_n)$, and let $r_i$ denote the process reward for step $i$. These reduction approaches suggest that, for reasoning tasks, the return of an answer should be more in line with $\min (r_1,\cdots,r_n)$ or $\prod_{i=1}^n r_i$ than with the discounted sum $\sum_{i=1}^n \gamma^{i-1} r_i$ most commonly used in reinforcement learning. We choose the minimum form as the return of $s_0$, because the product form cannot handle negative process rewards and also causes $R(s_0)$ to collapse toward 0 when the number of steps is large.
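To make the contrast concrete, here is a minimal sketch in plain Python (the reward values are illustrative, not taken from our experiments) comparing the three reductions on a 20-step answer that contains a single clearly incorrect step:

```python
import math


def discounted_sum_return(step_rewards, gamma=1.0):
    """Standard RL return: sum_i gamma^(i-1) * r_i (1-indexed steps)."""
    return sum(gamma ** i * r for i, r in enumerate(step_rewards))


def product_return(step_rewards):
    """Product reduction sometimes used for BoN evaluation of PRMs."""
    return math.prod(step_rewards)


def min_return(step_rewards):
    """Min-form return adopted here: an answer is only as good as its worst step."""
    return min(step_rewards)


# A long answer whose steps are all mildly positive except one clearly wrong step.
rewards = [0.8] * 19 + [-0.9]
print(discounted_sum_return(rewards))  # ~14.3: the single wrong step is drowned out
print(product_return(rewards))         # ~-0.013: magnitude collapses toward 0 over many steps
print(min_return(rewards))             # -0.9: directly reflects the faulty step
```

Only the min-form return both handles negative process rewards and stays sensitive to a single bad step regardless of how many steps the answer has.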

Extending the min-form return to any step, we have: