
Phase 4B RLMT Results: Source of Truth

Date: 2026-04-27

Purpose: verified local handoff for the next Codex/writing session. This is not the blog draft. It records source artifacts, completed training results, and pending overnight eval outputs so the research-blog claims can be grounded in files rather than memory.

Research Question

Does self-improving continued pretraining make Qwen3-0.6B-Base a better substrate for thinking mid-training?

Portfolio framing: use Phase 3 self-improving pretraining, Phase 4B interleaved-thinking SFT, and Phase 4B RLMT as a small-scale proxy for moving frontier post-training recipes earlier in the model lifecycle.

Primary Source Artifacts

  • Paper: /Users/jarrodbarnes/Downloads/Self-Improving Pretraining (1).pdf
  • Phase 3 eval: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md
  • Phase 3 qualitative audit: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-qualitative-audit-2026-04-25.md
  • Phase 4B data audit: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase4b-data-audit-2026-04-25.md
  • Metrics schema: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/metrics-schema.md
  • RLMT trainer: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_train.py
  • RLMT reward gate: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_reward_gate.py
  • Sharp eval: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_sharp_eval.py
  • Reasoning eval: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_reasoning_eval.py

Completed Training Results

Source host: spark-f7e2 via ssh spark.

Phase 3 Self-Improving Pretraining

Source artifact: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md

  • Method: full-pairwise Online DPO with K=16 rollouts.
  • Held-out pairwise continuation quality: Phase3-final beat Qwen3-0.6B-Base in 81/128 comparisons (63.28%).
  • Interpretation status: positive Phase 3 result already documented; use as the substrate-improvement premise for Phase 4B.
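The 81/128 result can be sanity-checked with a quick binomial computation against the 50% no-preference null. This is a generic sketch, not the project's eval code:

```python
from math import sqrt, erf

def pairwise_win_rate(wins, comparisons):
    """Win rate plus a normal-approximation two-sided p-value
    against the 50% null (no preference between the two models)."""
    rate = wins / comparisons
    se = sqrt(0.25 / comparisons)        # std error of a proportion at p = 0.5
    z = (rate - 0.5) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    p = 2 * (1 - phi)
    return rate, z, p

rate, z, p = pairwise_win_rate(81, 128)
print(f"win rate = {rate:.4f}, z = {z:.2f}, two-sided p = {p:.4f}")
```

At 81/128 the z-score is about 3, so the held-out preference is well clear of coin-flip noise at this sample size.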

Phase 4B SFT

Source artifacts on Spark:

  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/metrics.json
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/metrics.json
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/metrics.json

| Arm | Checkpoint | Primary Metric (Val Loss) | Val Loss (Raw) | Val Loss (Selected) | Repetition 4-gram Rate | Tokens Seen | Throughput (tok/s) | Min Available Mem (GiB) |
|---|---|---|---|---|---|---|---|---|
| raw_base SFT | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/final.pt | 2.694593 | 2.694593 | 2.694593 | 0.020321 | 4,036,608 | 2729.53 | 94.94 |
| think_base SFT | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/final.pt | 2.792869 | 2.707164 | 2.792869 | 0.019397 | 4,035,285 | 3284.51 | 69.78 |
| think_phase3 SFT | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/final.pt | 2.788120 | 2.708099 | 2.788120 | 0.019397 | 4,035,285 | 3298.08 | 68.65 |

Interpretation status: SFT installed the teacher-thought distribution but did not by itself produce a clear H3 gain pre-RLMT. Treat SFT as the setup for thinking mid-training, not the final claim.
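The repetition 4-gram rate reported above can be computed along these lines. This is one common definition (the fraction of 4-grams already seen earlier in the same sequence); the project's exact metric is whatever docs/metrics-schema.md specifies and may differ:

```python
def repetition_4gram_rate(tokens):
    """Fraction of 4-grams that repeat an earlier 4-gram in the
    sequence. 0.0 means no repetition; higher means more looping."""
    if len(tokens) < 4:
        return 0.0
    seen, repeats, total = set(), 0, 0
    for i in range(len(tokens) - 3):
        gram = tuple(tokens[i:i + 4])
        total += 1
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / total

# "abcdabcd" has five 4-grams, one of which ("abcd") is a repeat.
print(repetition_4gram_rate(list("abcdabcd")))
```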

Phase 4B RLMT

Source artifacts on Spark:

  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/metrics.json
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/metrics.json
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt

Configuration:

  • K=16 samples per prefix.
  • Two-stage external-boundary interface: sample thought, externally insert thought/suffix boundary, sample suffix, judge only suffix.
  • 200 RLMT steps per matched arm.
  • Prefixes per step: 2.
  • Thought max tokens: 48.
  • Suffix max tokens: 128.
  • Learning rate: 1e-7.
  • KL coefficient: 0.02.
  • Max grad norm: 0.1.
  • Stop conditions were logged as alerts only (per user instruction), except for the non-finite-loss safety stop.

| Arm | Completed Steps | Stopped Reason | Avg Reward | Final Reward | Avg Mixed Groups | Avg Near-Zero Groups | Avg All-Zero Groups | Avg All-One Groups | Invalid Judge | Avg Length Drift | Avg Artifact Rate | Wall Time (s) | Final Checkpoint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| think_base RLMT | 200 | null | 0.092188 | 0.031250 | 0.5700 | 0.4300 | 0.4300 | 0.0000 | 0.0000 | -0.000511 | 0.132031 | 5488.32 | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt |
| think_phase3 RLMT | 200 | null | 0.090313 | 0.093750 | 0.5625 | 0.4375 | 0.4375 | 0.0000 | 0.0000 | -0.000545 | 0.139844 | 5448.25 | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt |

Interpretation status: training completed cleanly for the matched RLMT arms. Do not infer the substrate claim from training reward alone; use post-RLMT reward gate and downstream evals below.
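The two-stage external-boundary rollout in the configuration above can be sketched as follows. `sample_tokens`, `judge_suffix`, and the `</think>` boundary string are hypothetical stand-ins; the real implementation is scripts/phase4b_rlmt_train.py.

```python
THOUGHT_MAX_TOKENS = 48
SUFFIX_MAX_TOKENS = 128
K = 16  # samples per prefix

def two_stage_rollout(prefix, sample_tokens, judge_suffix, boundary="</think>"):
    """One K-sample group for a single prefix.

    Stage 1 samples a thought; the thought/suffix boundary is then
    inserted externally (the model never has to emit it); stage 2
    samples the visible suffix; only the suffix is judged for reward.
    The boundary string here is an assumption, not the project's.
    """
    group = []
    for _ in range(K):
        thought = sample_tokens(prefix, max_new_tokens=THOUGHT_MAX_TOKENS)
        suffix = sample_tokens(prefix + thought + boundary,
                               max_new_tokens=SUFFIX_MAX_TOKENS)
        reward = judge_suffix(prefix, suffix)  # the thought is never judged
        group.append({"thought": thought, "suffix": suffix, "reward": reward})
    return group
```

Judging only the suffix is what keeps the thought channel free-form: reward pressure lands on the visible continuation, not on the thought tokens themselves.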

Overnight Eval Status

Eval ID: phase4b-post-rlmt-eval-20260426-181047

Watcher:

  • /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.sh
  • /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.out

Reward gate:

  • Log: /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-20260426-181047-reward-gate.out
  • Output dir: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate
  • Summary: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate/summary.json
  • Arms: think_base, think_phase3, think_base_rlmt, think_phase3_rlmt
  • Scope: 64 prefixes, 16 samples per prefix, two-stage external-boundary interface.
  • Status at 2026-04-26 19:29 ET: complete. Split reasoning eval launched on both Spark hosts.

| Arm | Reward Mean | Reward Std | Invalid Rate | Mixed Groups | Near-Zero Groups | Any-Success Groups | All-Success Groups | Avg Total New Tokens | Avg Predicted Suffix Words | Closed Think Rate |
|---|---|---|---|---|---|---|---|---|---|---|
| think_base | 0.087891 | 0.283136 | 0.0000 | 0.546875 | 0.453125 | 0.546875 | 0.0000 | 175.8164 | 87.2734 | 1.0000 |
| think_phase3 | 0.090820 | 0.287353 | 0.0000 | 0.593750 | 0.406250 | 0.593750 | 0.0000 | 175.8818 | 86.7861 | 1.0000 |
| think_base_rlmt | 0.093750 | 0.291481 | 0.0000 | 0.640625 | 0.359375 | 0.640625 | 0.0000 | 175.8936 | 86.3281 | 1.0000 |
| think_phase3_rlmt | 0.097656 | 0.296849 | 0.0000 | 0.609375 | 0.390625 | 0.609375 | 0.0000 | 175.9229 | 88.7979 | 1.0000 |

Reward-gate comparisons from summary.json:

| Comparison | Mean Reward Delta | Right Better Prefix Rate | Left Better Prefix Rate | Tie Prefix Rate | Shared Prefixes |
|---|---|---|---|---|---|
| think_base_rlmt_minus_think_base | 0.005859 | 0.281250 | 0.281250 | 0.437500 | 64 |
| think_phase3_minus_think_base | 0.002930 | 0.296875 | 0.250000 | 0.453125 | 64 |
| think_phase3_rlmt_minus_think_base | 0.009766 | 0.296875 | 0.296875 | 0.406250 | 64 |
| think_phase3_minus_think_base_rlmt | -0.002930 | 0.281250 | 0.312500 | 0.406250 | 64 |
| think_phase3_rlmt_minus_think_base_rlmt | 0.003906 | 0.250000 | 0.265625 | 0.484375 | 64 |
| think_phase3_rlmt_minus_think_phase3 | 0.006836 | 0.312500 | 0.312500 | 0.375000 | 64 |
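A hypothetical sketch of how comparison rows like these could be derived from per-prefix mean rewards. The left/right labeling, the delta direction, and the tie handling in the actual reward gate (scripts/phase4b_rlmt_reward_gate.py) are all assumptions here:

```python
def compare_arms(left_means, right_means, tie_eps=1e-9):
    """Paired per-prefix comparison for a 'left_minus_right' row.

    left_means / right_means: mean reward per shared prefix, aligned
    by index. Delta is assumed to be left minus right; a prefix is a
    tie when the per-prefix delta is within tie_eps of zero.
    """
    n = len(left_means)
    deltas = [l - r for l, r in zip(left_means, right_means)]
    return {
        "mean_reward_delta": sum(deltas) / n,
        "left_better_prefix_rate": sum(d > tie_eps for d in deltas) / n,
        "right_better_prefix_rate": sum(d < -tie_eps for d in deltas) / n,
        "tie_prefix_rate": sum(abs(d) <= tie_eps for d in deltas) / n,
        "shared_prefixes": n,
    }
```

The mean delta and the better-prefix rates answer different questions (average margin versus how often one arm wins a prefix at all), which is why a positive delta can coexist with equal better rates, as in the first row above.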

Reward-gate go/no-go fields:

  • reward_validity_ok: true
  • variance_ok: true
  • phase3_more_reward_separable: true
  • phase3_higher_mean_reward: true

Reasoning eval:

  • Initial Hugging Face model.generate eval was intentionally stopped after throughput instrumentation showed only ~106-140 generated tok/s. That path was valid but too slow for the final overnight eval.
  • Optimized SGLang eval ID: phase4b-post-rlmt-eval-sglang-20260427-0014
  • Optimized result root: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014
  • f7e2 runner/log: /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.sh, /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.out
  • cfd0 runner/log: /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.sh, /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.out
  • f7e2 arms: think_base, then think_base_rlmt
  • cfd0 arms: think_phase3, then think_phase3_rlmt
  • Serving image: scitrera/dgx-spark-sglang:0.5.9-t5
  • Serving flags: --served-model-name default --tp 1 --cuda-graph-max-bs 32 --num-continuous-decode-steps 16 --schedule-policy lpm --mem-fraction-static 0.70
  • Eval client: scripts/phase4b_reasoning_eval_sglang.py
  • Eval config: 8 samples/problem, max 512 new tokens, concurrency 32, temperature 0.6, top_p 0.95.
  • Completion status: complete on both hosts.
    • f7e2 completed at 2026-04-27T01:57:57Z.
    • cfd0 completed at 2026-04-27T01:57:29Z.
  • Throughput from final cfd0 progress logs: roughly 2.2k generated tok/s cumulative on think_phase3_rlmt OlympiadBench. f7e2 logs showed the same order of magnitude during the run.
  • Note: the per-host summary.json files contain only the last arm run on that host. The full four-arm table below was reconstructed directly from the per-arm JSONL artifacts under the optimized result root.

Reasoning-eval source artifacts:

  • think_base: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base/*.jsonl
  • think_base_rlmt: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base_rlmt/*.jsonl
  • think_phase3: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3/*.jsonl
  • think_phase3_rlmt: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3_rlmt/*.jsonl

| Arm | Benchmark | Mean@8 | Correct / Samples | Pass@8 Any | Pass / Problems | Avg Completion Tokens |
|---|---|---|---|---|---|---|
| think_base | GSM8K | 0.105478 | 1113 / 10552 | 0.508719 | 671 / 1319 | 322.2 |
| think_base | MATH-500 | 0.273250 | 1093 / 4000 | 0.574000 | 287 / 500 | 371.5 |
| think_base | GPQA-Diamond | 0.222222 | 352 / 1584 | 0.671717 | 133 / 198 | 320.3 |
| think_base | OlympiadBench | 0.071023 | 125 / 1760 | 0.222727 | 49 / 220 | 441.9 |
| think_base_rlmt | GSM8K | 0.094484 | 997 / 10552 | 0.476876 | 629 / 1319 | 324.9 |
| think_base_rlmt | MATH-500 | 0.287750 | 1151 / 4000 | 0.562000 | 281 / 500 | 376.2 |
| think_base_rlmt | GPQA-Diamond | 0.224116 | 355 / 1584 | 0.737374 | 146 / 198 | 323.3 |
| think_base_rlmt | OlympiadBench | 0.082386 | 145 / 1760 | 0.222727 | 49 / 220 | 447.4 |
| think_phase3 | GSM8K | 0.103393 | 1091 / 10552 | 0.512509 | 676 / 1319 | 321.8 |
| think_phase3 | MATH-500 | 0.282250 | 1129 / 4000 | 0.568000 | 284 / 500 | 362.1 |
| think_phase3 | GPQA-Diamond | 0.234848 | 372 / 1584 | 0.712121 | 141 / 198 | 323.8 |
| think_phase3 | OlympiadBench | 0.078409 | 138 / 1760 | 0.254545 | 56 / 220 | 441.0 |
| think_phase3_rlmt | GSM8K | 0.103014 | 1087 / 10552 | 0.506444 | 668 / 1319 | 322.0 |
| think_phase3_rlmt | MATH-500 | 0.273750 | 1095 / 4000 | 0.574000 | 287 / 500 | 366.2 |
| think_phase3_rlmt | GPQA-Diamond | 0.222854 | 353 / 1584 | 0.737374 | 146 / 198 | 316.6 |
| think_phase3_rlmt | OlympiadBench | 0.065341 | 115 / 1760 | 0.231818 | 51 / 220 | 435.9 |

Macro averages:

| Arm | Macro Mean@8 | Macro Pass@8 Any | Macro Avg Completion Tokens |
|---|---|---|---|
| think_base | 0.167993 | 0.494291 | 364.0 |
| think_base_rlmt | 0.172184 | 0.499744 | 368.0 |
| think_phase3 | 0.174725 | 0.511794 | 362.2 |
| think_phase3_rlmt | 0.166240 | 0.512409 | 360.2 |
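The three headline metrics can be stated precisely with a short sketch. These are assumed definitions, consistent with the per-benchmark and macro numbers in this section; the authoritative definitions live in docs/metrics-schema.md:

```python
from statistics import mean

def mean_at_k(per_problem_correct):
    """Sample-level accuracy: average correctness over every sample of
    every problem (the Mean@8 column when K = 8)."""
    flat = [c for samples in per_problem_correct for c in samples]
    return sum(flat) / len(flat)

def pass_at_k_any(per_problem_correct):
    """Problem-level coverage: fraction of problems where at least one
    of the K samples is correct (the Pass@8 Any column)."""
    solved = [1 if any(samples) else 0 for samples in per_problem_correct]
    return sum(solved) / len(solved)

def macro_average(per_benchmark_scores):
    """Unweighted mean over benchmarks (the macro rows): every
    benchmark counts equally regardless of its problem count."""
    return mean(per_benchmark_scores)
```

Because the macro average weights all four benchmarks equally, the 10552-sample GSM8K column does not dominate the 1760-sample OlympiadBench column.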

Direct reasoning-eval comparisons:

| Comparison | Benchmark | Mean@8 Delta | Pass@8 Any Delta | Avg Token Delta |
|---|---|---|---|---|
| Base+Think+RLMT vs Base+Think | GSM8K | -0.010993 | -0.031842 | +2.7 |
| Base+Think+RLMT vs Base+Think | MATH-500 | +0.014500 | -0.012000 | +4.7 |
| Base+Think+RLMT vs Base+Think | GPQA-Diamond | +0.001894 | +0.065657 | +3.1 |
| Base+Think+RLMT vs Base+Think | OlympiadBench | +0.011364 | +0.000000 | +5.5 |
| Base+Think+RLMT vs Base+Think | Macro | +0.004191 | +0.005454 | +4.0 |
| Phase3+Think+RLMT vs Phase3+Think | GSM8K | -0.000379 | -0.006065 | +0.2 |
| Phase3+Think+RLMT vs Phase3+Think | MATH-500 | -0.008500 | +0.006000 | +4.1 |
| Phase3+Think+RLMT vs Phase3+Think | GPQA-Diamond | -0.011995 | +0.025253 | -7.2 |
| Phase3+Think+RLMT vs Phase3+Think | OlympiadBench | -0.013068 | -0.022727 | -5.1 |
| Phase3+Think+RLMT vs Phase3+Think | Macro | -0.008486 | +0.000615 | -2.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GSM8K | +0.008529 | +0.029568 | -2.9 |
| Phase3+Think+RLMT vs Base+Think+RLMT | MATH-500 | -0.014000 | +0.012000 | -10.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GPQA-Diamond | -0.001263 | +0.000000 | -6.7 |
| Phase3+Think+RLMT vs Base+Think+RLMT | OlympiadBench | -0.017045 | +0.009091 | -11.5 |
| Phase3+Think+RLMT vs Base+Think+RLMT | Macro | -0.005945 | +0.012665 | -7.8 |
| Phase3+Think vs Base+Think | GSM8K | -0.002085 | +0.003791 | -0.4 |
| Phase3+Think vs Base+Think | MATH-500 | +0.009000 | -0.006000 | -9.4 |
| Phase3+Think vs Base+Think | GPQA-Diamond | +0.012626 | +0.040404 | +3.6 |
| Phase3+Think vs Base+Think | OlympiadBench | +0.007386 | +0.031818 | -0.9 |
| Phase3+Think vs Base+Think | Macro | +0.006732 | +0.017503 | -1.8 |

Claim Candidates

Evidence-backed candidates for the writing session:

  • Phase 3 has a clean positive held-out continuation-quality result: Phase3-final wins 81/128 pairwise comparisons against Qwen3-0.6B-Base (63.28%).
  • The RLMT loop completed cleanly for matched base and Phase 3 lineages at 200 steps with no invalid judge responses and no all-one reward collapse.
  • The two-stage post-RLMT reward gate improved mean reward for both RLMT arms and ranked think_phase3_rlmt highest by reward mean: 0.097656 vs 0.093750 for think_base_rlmt, 0.090820 for think_phase3, and 0.087891 for think_base.
  • Downstream reasoning eval gives a mixed but useful substrate signal: think_phase3 beats think_base on macro Mean@8 and macro Pass@8 Any; after RLMT, think_phase3_rlmt beats think_base_rlmt on macro Pass@8 Any but not macro Mean@8.
  • The SGLang eval path is the throughput-valid final eval path. The slower Hugging Face generate path was stopped intentionally after direct throughput instrumentation.

Claim boundaries:

  • Do not claim a uniform downstream reasoning win from RLMT alone. RLMT improves some benchmark/metric slices and the paper-aligned reward gate, but the reasoning suite is metric- and benchmark-dependent.
  • Do not infer full Table 10 equivalence. This is a small-scale reproduction/proxy using 8 samples/problem and four HF-hosted reasoning benchmarks, not the paper's full scale.
  • Treat sample-level Mean@8 and problem-level Pass@8 Any separately. They answer different questions: average sample correctness versus whether any rollout solves the problem.