GRPO踩坑记录2

This content originally appeared on DEV Community and was authored by 张逸群

这次我更换了任务，结合了MMLUPro、MELD和MATH三个数据集，并统一使用了相同的系统提示，模型仍然采用Qwen2.5-0.5B-Instruct （足够小，且不需要冷启动）。

PROMPT=(
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed with <think> </think> and <answer> </answer> tags, respectively, i.e.., "
    "<think> reasoning process here </think> <answer> answer here </answer>"
)

奖励函数分为两种：第一种是 format_reward，顾名思义，用于检测输出是否符合指定的格式 <think> </think> <answer> </answer>；第二种是 accuracy_reward，用于检测输出的准确性。
从训练日志中可以清楚地看到，format_reward经历了一次显著的提升（这是否可以被称为一种涌现现象？），而accuracy_reward则一直在大幅波动，没有明显的上升趋势。这表明，模型只是通过“模仿”特定格式来提高得分，而没有通过先思考后行动的方式来显著提升任务性能。
我测试了一道题目，能够看到模型只是在拙劣的模仿这种格式，而没有学会真的思考。

<think> 
First, we need to understand what's happening when we "double" a number. Doubling a number means multiplying it by 2, so if we have a number x, doubling it will be represented as 2x.
</think>

<answer>Let's denote the number as x. According to the problem, doubling the number and then increasing it by 5 gives 15.

According to our formula, it would look like this: 
\[ 2x + 5 = 15 \]

Now we need to solve for x by first subtracting 5 from both sides of the equation:
\[ 2x + 5 - 5 = 15 - 5 \]
\[ 2x = 10 \]

Afterwards, we divide both sides by 2 to solve for x:
\[ \frac{2x}{2} = \frac{10}{2} \]
\[ x = 5 \]

Therefore, the number in question is 5.</answer>

分析了一下，原因可能有两个：（1）模型尺寸太小；（2）奖励函数设置有问题。

This content originally appeared on DEV Community and was authored by 张逸群

Print Share Comment Cite Upload Translate Updates

APA

张逸群 | Sciencx (2025-02-04T00:44:27+00:00) GRPO踩坑记录2. Retrieved from https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/

MLA

" » GRPO踩坑记录2." 张逸群 | Sciencx - Tuesday February 4, 2025, https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/

HARVARD

张逸群 | Sciencx Tuesday February 4, 2025 » GRPO踩坑记录2., viewed ,<https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/>

VANCOUVER

张逸群 | Sciencx - » GRPO踩坑记录2. [Internet]. [Accessed ]. Available from: https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/

CHICAGO

" » GRPO踩坑记录2." 张逸群 | Sciencx - Accessed . https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/

IEEE

" » GRPO踩坑记录2." 张逸群 | Sciencx [Online]. Available: https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/. [Accessed: ]

rf:citation

» GRPO踩坑记录2 | 张逸群 | Sciencx | https://www.scien.cx/2025/02/04/grpo%e8%b8%a9%e5%9d%91%e8%ae%b0%e5%bd%952/ |

Please log in to upload a file.

There are no updates yet.
Click the Upload button above to add an update.

You must be logged in to translate posts. Please log in or register.

Related Posts