ModelHub XC 9910b9773b Initialize project; model provided by the ModelHub XC community
Model: jaygala24/Qwen3-1.7B-GRPO-KL-math-reasoning
Source: Original Platform
2026-05-04 16:22:41 +08:00


library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags: reinforcement-learning, grpo, math-reasoning, pipelinerl
datasets: gsm8k_train, math_train
pipeline_tag: text-generation

Qwen3-1.7B-GRPO-KL-math-reasoning

This model is a fine-tuned version of Qwen3-1.7B, trained for mathematical reasoning using GRPO (Group Relative Policy Optimization) with a KL penalty.

Trained with PipelineRL.

Training Details

Datasets

| Split | Datasets |
| --- | --- |
| Train | gsm8k_train, math_train |
| Test | gsm8k_test, math_500 |

RL Algorithm

| Parameter | Value |
| --- | --- |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Policy Loss | ppo |
| KL Coefficient | 0.001 |
| Epsilon (clip) | 0.02 |
| Divide Advantage by Std | False |
| Filter Zero Advantage Groups | False |
| Rollouts per Problem | 16 |
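
To make the settings above concrete, here is a minimal sketch of how a group-relative advantage and a PPO-style clipped loss with a KL penalty are commonly computed. It is an illustration under stated assumptions, not PipelineRL's implementation: the function names are hypothetical, the rewards are made-up 0/1 correctness scores, and the KL term uses the "k3" estimator, which the card does not confirm.

```python
import math

def group_relative_advantages(rewards, divide_by_std=False, eps=1e-6):
    """Advantage of each rollout relative to the mean reward of its group.

    A group is the set of rollouts sampled for one problem (16 in the card).
    With divide_by_std=False, as in the card, only the mean is subtracted.
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if divide_by_std:
        var = sum(c * c for c in centered) / len(rewards)
        centered = [c / (math.sqrt(var) + eps) for c in centered]
    return centered

def clipped_policy_objective(logp_new, logp_old, advantage, clip_eps=0.02):
    """PPO-style clipped surrogate for one token (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def token_loss(logp_new, logp_old, logp_ref, advantage,
               clip_eps=0.02, kl_coef=0.001):
    """Per-token loss: negative clipped objective plus a weighted KL penalty.

    Assumption: KL to the reference policy is estimated with the "k3" form
    exp(d) - d - 1, where d = logp_ref - logp_new. Other estimators exist.
    """
    d = logp_ref - logp_new
    kl_estimate = math.exp(d) - d - 1.0
    objective = clipped_policy_objective(logp_new, logp_old, advantage, clip_eps)
    return -objective + kl_coef * kl_estimate

# Hypothetical group: 4 of 16 rollouts solved the problem (reward 1.0).
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)
print(advantages[0], advantages[-1])  # 0.75 -0.25
```

With "Filter Zero Advantage Groups" set to False, a group in which every rollout gets the same reward is kept even though all of its advantages are zero, so it contributes only through the KL term in this sketch.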

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Base Model | Qwen/Qwen3-1.7B |
| Learning Rate | 1e-06 |
| LR Scheduler | cosine |
| Warmup Steps | 25 |
| Max Training Steps | 1500 |
| Micro Batch Size | 4 |
| Gradient Accumulation | 64 |
| Effective Batch Size | 256 |
| Sequence Length | 8192 |
| Gradient Clipping | 0.3 |
| Weight Decay | 0.01 |
| Optimizer | adamw_torch |
| Precision | bf16 |
| DeepSpeed | ZeRO Stage 3 |
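
As a rough illustration of how these numbers fit together, the sketch below reproduces the effective batch size and a cosine learning-rate schedule with linear warmup. The helper names are hypothetical. Two things are assumptions rather than facts from the card: that warmup is linear from zero and the cosine decays to zero at the final step, and that a single data-parallel rank is used (which is what makes 4 × 64 equal the listed 256).

```python
import math

BASE_LR = 1e-6
WARMUP_STEPS = 25
MAX_STEPS = 1500
MICRO_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 64

def effective_batch_size(micro_batch, grad_accum, data_parallel_ranks=1):
    """Samples consumed per optimizer step."""
    return micro_batch * grad_accum * data_parallel_ranks

def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, max_steps=MAX_STEPS):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to base_lr, then cosine decay to 0.
    """
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    progress = min(1.0, progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch_size(MICRO_BATCH_SIZE, GRAD_ACCUM_STEPS))  # 256
for step in (0, 25, 750, 1500):
    print(step, lr_at(step))
```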

Evaluation Results

Pass@k on math reasoning benchmarks (N=32 samples per problem, temperature=1.0):

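| Dataset | pass@1 | pass@2 | pass@4 | pass@8 | pass@16 | pass@32 |
| --- | --- | --- | --- | --- | --- | --- |
| GSM8K (test) | 80.07 | 86.12 | 90.15 | 92.76 | 94.47 | 95.75 |
| MATH-500 | 69.64 | 77.54 | 83.47 | 87.97 | 91.16 | 93.60 |
| Overall | 77.20 | 83.76 | 88.32 | 91.44 | 93.56 | 95.16 |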

GSM8K test: 1318 problems · MATH-500: 500 problems · Overall: 1818 problems (overall weighted by problem count).
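
For readers who want to reproduce numbers of this kind, pass@k from N samples per problem is often computed with the unbiased combinatorial estimator sketched below. The card does not say which estimator was used, so treat this as one common choice rather than the author's method. The function names are hypothetical. The second helper shows the problem-count weighting described above, checked against the pass@1 row.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples drawn for the problem, c: number of correct samples,
    k: budget being evaluated. Equals 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def weighted_overall(scores_and_counts):
    """Average of per-dataset scores weighted by number of problems."""
    total = sum(count for _, count in scores_and_counts)
    return sum(score * count for score, count in scores_and_counts) / total

# Hypothetical problem: 8 of 32 samples are correct.
print(pass_at_k(32, 8, 1), pass_at_k(32, 8, 4))

# Weighting the reported pass@1 values by the stated problem counts.
overall = weighted_overall([(80.07, 1318), (69.64, 500)])
print(round(overall, 2))  # 77.2
```

The dataset-level score is the mean of the per-problem estimates, which is why pass@1 here is simply the average fraction of correct samples.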

Training Curves

[Figure: Training Metrics]

W&B Run

Full training logs: https://wandb.ai/jaygala24-team/rl-post-training/runs/qwen3_1.7b_grpo_with_kl_2a1p1f_4xh100_197368_finetune_12a34277

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jaygala24/Qwen3-1.7B-GRPO-KL-math-reasoning"
revision = "step-0200"  # checkpoint branch to load, e.g. "step-0400" or "step-0600"

model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

# Plain string, not an f-string, so the braces are written once: \boxed{}
prompt = "Please reason step by step, and put your final answer within \\boxed{}.\n\nWhat is the sum of 123 and 456?"
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True ensures the temperature setting is actually applied
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="jaygala24/Qwen3-1.7B-GRPO-KL-math-reasoning", revision="step-0200")  # checkpoint branch to load, e.g. "step-0400" or "step-0600"
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)

prompt = "Please reason step by step, and put your final answer within \\boxed{}.\n\nWhat is the sum of 123 and 456?"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
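
Because the prompt asks for the final answer inside \boxed{}, a small parser is usually needed before comparing the output to a reference answer. The helper below is a hypothetical sketch, not part of this repository or of any library: it returns the contents of the last \boxed{...} in a string and handles nested braces such as \frac{1}{2}.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so nested LaTeX groups are kept intact.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # closing brace never found (e.g. generation was truncated)

# Made-up completion text, used only to show the call.
completion = "123 + 456 = 579, so the answer is \\boxed{579}."
print(extract_last_boxed(completion))  # 579
```

Returning None for an unbalanced box is a deliberate choice here, since a completion cut off at the token limit should count as unanswered rather than be partially parsed.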

Framework