GRPO踩坑记录2

这次我更换了任务,结合了MMLUPro、MELD和MATH三个数据集,并统一使用了相同的系统提示,模型仍然采用Qwen2.5-0.5B-Instruct (足够小,且不需要冷启动)。

PROMPT=(
“A conversation between User and Assistant. The user asks a question, and the Assistant solves it. ”
“The assistant first thinks about the…


This content originally appeared on DEV Community and was authored by 张逸群

Image description

这次我更换了任务,结合了MMLUPro、MELD和MATH三个数据集,并统一使用了相同的系统提示,模型仍然采用Qwen2.5-0.5B-Instruct (足够小,且不需要冷启动)。

PROMPT=(
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed with <think> </think> and <answer> </answer> tags, respectively, i.e.., "
    "<think> reasoning process here </think> <answer> answer here </answer>"
)

奖励函数分为两种:第一种是 format_reward,顾名思义,用于检测输出是否符合指定的格式 <think> </think> <answer> </answer>;第二种是 accuracy_reward,用于检测输出的准确性。
从训练日志中可以清楚地看到,format_reward经历了一次显著的提升(这是否可以被称为一种涌现现象?),而accuracy_reward则一直在大幅波动,没有明显的上升趋势。这表明,模型只是通过“模仿”特定格式来提高得分,而没有通过先思考后行动的方式来显著提升任务性能。
我测试了一道题目,能够看到模型只是在拙劣的模仿这种格式,而没有学会真的思考。

<think> 
First, we need to understand what's happening when we "double" a number. Doubling a number means multiplying it by 2, so if we have a number x, doubling it will be represented as 2x.
</think>

<answer>Let's denote the number as x. According to the problem, doubling the number and then increasing it by 5 gives 15.

According to our formula, it would look like this: 
\[ 2x + 5 = 15 \]

Now we need to solve for x by first subtracting 5 from both sides of the equation:
\[ 2x + 5 - 5 = 15 - 5 \]
\[ 2x = 10 \]

Afterwards, we divide both sides by 2 to solve for x:
\[ \frac{2x}{2} = \frac{10}{2} \]
\[ x = 5 \]

Therefore, the number in question is 5.</answer>

分析了一下,原因可能有两个:(1)模型尺寸太小;(2)奖励函数设置有问题。


This content originally appeared on DEV Community and was authored by 张逸群


Print Share Comment Cite Upload Translate Updates
APA

张逸群 | Sciencx (2025-02-04T00:44:27+00:00) GRPO踩坑记录2. Retrieved from https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/

MLA
" » GRPO踩坑记录2." 张逸群 | Sciencx - Tuesday February 4, 2025, https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/
HARVARD
张逸群 | Sciencx Tuesday February 4, 2025 » GRPO踩坑记录2., viewed ,<https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/>
VANCOUVER
张逸群 | Sciencx - » GRPO踩坑记录2. [Internet]. [Accessed ]. Available from: https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/
CHICAGO
" » GRPO踩坑记录2." 张逸群 | Sciencx - Accessed . https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/
IEEE
" » GRPO踩坑记录2." 张逸群 | Sciencx [Online]. Available: https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/. [Accessed: ]
rf:citation
» GRPO踩坑记录2 | 张逸群 | Sciencx | https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/ |

Please log in to upload a file.




There are no updates yet.
Click the Upload button above to add an update.

You must be logged in to translate posts. Please log in or register.