
Phase 4B RLMT Results: Source of Truth

Date: 2026-04-27

Purpose: verified local handoff for the next Codex/writing session. This is not the blog draft. It records source artifacts, completed training results, and pending overnight eval outputs so the research-blog claims can be grounded in files rather than memory.

Research Question

Does self-improving continued pretraining make Qwen3-0.6B-Base a better substrate for thinking mid-training?

Portfolio framing: use Phase 3 self-improving pretraining, Phase 4B interleaved-thinking SFT, and Phase 4B RLMT as a small-scale proxy for moving frontier post-training recipes earlier in the model lifecycle.

Primary Source Artifacts

  • Paper: /Users/jarrodbarnes/Downloads/Self-Improving Pretraining (1).pdf
  • Phase 3 eval: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md
  • Phase 3 qualitative audit: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-qualitative-audit-2026-04-25.md
  • Phase 4B data audit: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase4b-data-audit-2026-04-25.md
  • Metrics schema: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/metrics-schema.md
  • RLMT trainer: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_train.py
  • RLMT reward gate: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_reward_gate.py
  • Sharp eval: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_sharp_eval.py
  • Reasoning eval: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_reasoning_eval.py

Completed Training Results

Source host: spark-f7e2 via ssh spark.

Phase 3 Self-Improving Pretraining

Source artifact: /Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md

  • Method: full-pairwise Online DPO with K=16 rollouts.
  • Held-out pairwise continuation quality: Phase3-final beat Qwen3-0.6B-Base in 81/128 comparisons (63.28%).
  • Interpretation status: positive Phase 3 result already documented; use as the substrate-improvement premise for Phase 4B.
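The 81/128 result can be sanity-checked with a quick binomial computation against the 50% no-preference null. This is a generic sketch, not the project's eval code:

```python
from math import sqrt, erf

def pairwise_win_rate(wins, comparisons):
    """Win rate plus a normal-approximation two-sided p-value
    against the 50% null (no preference between the two models)."""
    rate = wins / comparisons
    se = sqrt(0.25 / comparisons)        # std error of a proportion at p = 0.5
    z = (rate - 0.5) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    p = 2 * (1 - phi)
    return rate, z, p

rate, z, p = pairwise_win_rate(81, 128)
print(f"win rate = {rate:.4f}, z = {z:.2f}, two-sided p = {p:.4f}")
```

At 81/128 the z-score is about 3, so the held-out preference is well clear of coin-flip noise at this sample size.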

Phase 4B SFT

Source artifacts on Spark:

  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/metrics.json
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/metrics.json
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/metrics.json

| Arm | Checkpoint | Primary Metric (Val Loss) | Val Loss (Raw) | Val Loss (Selected) | Repetition 4-gram Rate | Tokens Seen | Throughput (tok/s) | Min Available Mem (GiB) |
|---|---|---|---|---|---|---|---|---|
| raw_base SFT | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/final.pt | 2.694593 | 2.694593 | 2.694593 | 0.020321 | 4,036,608 | 2729.53 | 94.94 |
| think_base SFT | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/final.pt | 2.792869 | 2.707164 | 2.792869 | 0.019397 | 4,035,285 | 3284.51 | 69.78 |
| think_phase3 SFT | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/final.pt | 2.788120 | 2.708099 | 2.788120 | 0.019397 | 4,035,285 | 3298.08 | 68.65 |

Interpretation status: SFT installed the teacher-thought distribution but did not by itself produce a clear H3 gain pre-RLMT. Treat SFT as the setup for thinking mid-training, not the final claim.
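The repetition 4-gram rate reported above can be computed along these lines. This is one common definition (the fraction of 4-grams already seen earlier in the same sequence); the project's exact metric is whatever docs/metrics-schema.md specifies and may differ:

```python
def repetition_4gram_rate(tokens):
    """Fraction of 4-grams that repeat an earlier 4-gram in the
    sequence. 0.0 means no repetition; higher means more looping."""
    if len(tokens) < 4:
        return 0.0
    seen, repeats, total = set(), 0, 0
    for i in range(len(tokens) - 3):
        gram = tuple(tokens[i:i + 4])
        total += 1
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / total

# "abcdabcd" has five 4-grams, one of which ("abcd") is a repeat.
print(repetition_4gram_rate(list("abcdabcd")))
```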

Phase 4B RLMT

Source artifacts on Spark:

  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/metrics.json
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/metrics.json
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt
  • /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt

Configuration:

  • K=16 samples per prefix.
  • Two-stage external-boundary interface: sample thought, externally insert thought/suffix boundary, sample suffix, judge only suffix.
  • 200 RLMT steps per matched arm.
  • Prefixes per step: 2.
  • Thought max tokens: 48.
  • Suffix max tokens: 128.
  • Learning rate: 1e-7.
  • KL coefficient: 0.02.
  • Max grad norm: 0.1.
  • Stop conditions were logged as alerts only (per user instruction), except for the non-finite-loss safety stop.

| Arm | Completed Steps | Stopped Reason | Avg Reward | Final Reward | Avg Mixed Groups | Avg Near-Zero Groups | Avg All-Zero Groups | Avg All-One Groups | Invalid Judge | Avg Length Drift | Avg Artifact Rate | Wall Time (s) | Final Checkpoint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| think_base RLMT | 200 | null | 0.092188 | 0.031250 | 0.5700 | 0.4300 | 0.4300 | 0.0000 | 0.0000 | -0.000511 | 0.132031 | 5488.32 | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt |
| think_phase3 RLMT | 200 | null | 0.090313 | 0.093750 | 0.5625 | 0.4375 | 0.4375 | 0.0000 | 0.0000 | -0.000545 | 0.139844 | 5448.25 | /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt |

Interpretation status: training completed cleanly for the matched RLMT arms. Do not infer the substrate claim from training reward alone; use post-RLMT reward gate and downstream evals below.
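The two-stage external-boundary rollout in the configuration above can be sketched as follows. `sample_tokens`, `judge_suffix`, and the `</think>` boundary string are hypothetical stand-ins; the real implementation is scripts/phase4b_rlmt_train.py.

```python
THOUGHT_MAX_TOKENS = 48
SUFFIX_MAX_TOKENS = 128
K = 16  # samples per prefix

def two_stage_rollout(prefix, sample_tokens, judge_suffix, boundary="</think>"):
    """One K-sample group for a single prefix.

    Stage 1 samples a thought; the thought/suffix boundary is then
    inserted externally (the model never has to emit it); stage 2
    samples the visible suffix; only the suffix is judged for reward.
    The boundary string here is an assumption, not the project's.
    """
    group = []
    for _ in range(K):
        thought = sample_tokens(prefix, max_new_tokens=THOUGHT_MAX_TOKENS)
        suffix = sample_tokens(prefix + thought + boundary,
                               max_new_tokens=SUFFIX_MAX_TOKENS)
        reward = judge_suffix(prefix, suffix)  # the thought is never judged
        group.append({"thought": thought, "suffix": suffix, "reward": reward})
    return group
```

Judging only the suffix is what keeps the thought channel free-form: reward pressure lands on the visible continuation, not on the thought tokens themselves.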

Overnight Eval Status

Eval ID: phase4b-post-rlmt-eval-20260426-181047

Watcher:

  • /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.sh
  • /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.out

Reward gate:

  • Log: /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-20260426-181047-reward-gate.out
  • Output dir: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate
  • Summary: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate/summary.json
  • Arms: think_base, think_phase3, think_base_rlmt, think_phase3_rlmt
  • Scope: 64 prefixes, 16 samples per prefix, two-stage external-boundary interface.
  • Status at 2026-04-26 19:29 ET: complete. Split reasoning eval launched on both Spark hosts.

| Arm | Reward Mean | Reward Std | Invalid Rate | Mixed Groups | Near-Zero Groups | Any-Success Groups | All-Success Groups | Avg Total New Tokens | Avg Predicted Suffix Words | Closed Think Rate |
|---|---|---|---|---|---|---|---|---|---|---|
| think_base | 0.087891 | 0.283136 | 0.0000 | 0.546875 | 0.453125 | 0.546875 | 0.0000 | 175.8164 | 87.2734 | 1.0000 |
| think_phase3 | 0.090820 | 0.287353 | 0.0000 | 0.593750 | 0.406250 | 0.593750 | 0.0000 | 175.8818 | 86.7861 | 1.0000 |
| think_base_rlmt | 0.093750 | 0.291481 | 0.0000 | 0.640625 | 0.359375 | 0.640625 | 0.0000 | 175.8936 | 86.3281 | 1.0000 |
| think_phase3_rlmt | 0.097656 | 0.296849 | 0.0000 | 0.609375 | 0.390625 | 0.609375 | 0.0000 | 175.9229 | 88.7979 | 1.0000 |

Reward-gate comparisons from summary.json:

| Comparison | Mean Reward Delta | Right Better Prefix Rate | Left Better Prefix Rate | Tie Prefix Rate | Shared Prefixes |
|---|---|---|---|---|---|
| think_base_rlmt_minus_think_base | 0.005859 | 0.281250 | 0.281250 | 0.437500 | 64 |
| think_phase3_minus_think_base | 0.002930 | 0.296875 | 0.250000 | 0.453125 | 64 |
| think_phase3_rlmt_minus_think_base | 0.009766 | 0.296875 | 0.296875 | 0.406250 | 64 |
| think_phase3_minus_think_base_rlmt | -0.002930 | 0.281250 | 0.312500 | 0.406250 | 64 |
| think_phase3_rlmt_minus_think_base_rlmt | 0.003906 | 0.250000 | 0.265625 | 0.484375 | 64 |
| think_phase3_rlmt_minus_think_phase3 | 0.006836 | 0.312500 | 0.312500 | 0.375000 | 64 |
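A hypothetical sketch of how comparison rows like these could be derived from per-prefix mean rewards. The left/right labeling, the delta direction, and the tie handling in the actual reward gate (scripts/phase4b_rlmt_reward_gate.py) are all assumptions here:

```python
def compare_arms(left_means, right_means, tie_eps=1e-9):
    """Paired per-prefix comparison for a 'left_minus_right' row.

    left_means / right_means: mean reward per shared prefix, aligned
    by index. Delta is assumed to be left minus right; a prefix is a
    tie when the per-prefix delta is within tie_eps of zero.
    """
    n = len(left_means)
    deltas = [l - r for l, r in zip(left_means, right_means)]
    return {
        "mean_reward_delta": sum(deltas) / n,
        "left_better_prefix_rate": sum(d > tie_eps for d in deltas) / n,
        "right_better_prefix_rate": sum(d < -tie_eps for d in deltas) / n,
        "tie_prefix_rate": sum(abs(d) <= tie_eps for d in deltas) / n,
        "shared_prefixes": n,
    }
```

The mean delta and the better-prefix rates answer different questions (average margin versus how often one arm wins a prefix at all), which is why a positive delta can coexist with equal better rates, as in the first row above.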

Reward-gate go/no-go fields:

  • reward_validity_ok: true
  • variance_ok: true
  • phase3_more_reward_separable: true
  • phase3_higher_mean_reward: true

Reasoning eval:

  • Initial Hugging Face model.generate eval was intentionally stopped after throughput instrumentation showed only ~106-140 generated tok/s. That path was valid but too slow for the final overnight eval.
  • Optimized SGLang eval ID: phase4b-post-rlmt-eval-sglang-20260427-0014
  • Optimized result root: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014
  • f7e2 runner/log: /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.sh, /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.out
  • cfd0 runner/log: /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.sh, /home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.out
  • f7e2 arms: think_base, then think_base_rlmt
  • cfd0 arms: think_phase3, then think_phase3_rlmt
  • Serving image: scitrera/dgx-spark-sglang:0.5.9-t5
  • Serving flags: --served-model-name default --tp 1 --cuda-graph-max-bs 32 --num-continuous-decode-steps 16 --schedule-policy lpm --mem-fraction-static 0.70
  • Eval client: scripts/phase4b_reasoning_eval_sglang.py
  • Eval config: 8 samples/problem, max 512 new tokens, concurrency 32, temperature 0.6, top_p 0.95.
  • Completion status: complete on both hosts.
    • f7e2 completed at 2026-04-27T01:57:57Z.
    • cfd0 completed at 2026-04-27T01:57:29Z.
  • Throughput from final cfd0 progress logs: roughly 2.2k generated tok/s cumulative on think_phase3_rlmt OlympiadBench. f7e2 logs showed the same order of magnitude during the run.
  • Note: the per-host summary.json files contain only the last arm run on that host. The full four-arm table below was reconstructed directly from the per-arm JSONL artifacts under the optimized result root.

Reasoning-eval source artifacts:

  • think_base: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base/*.jsonl
  • think_base_rlmt: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base_rlmt/*.jsonl
  • think_phase3: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3/*.jsonl
  • think_phase3_rlmt: /home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3_rlmt/*.jsonl

| Arm | Benchmark | Mean@8 | Correct / Samples | Pass@8 Any | Pass / Problems | Avg Completion Tokens |
|---|---|---|---|---|---|---|
| think_base | GSM8K | 0.105478 | 1113 / 10552 | 0.508719 | 671 / 1319 | 322.2 |
| think_base | MATH-500 | 0.273250 | 1093 / 4000 | 0.574000 | 287 / 500 | 371.5 |
| think_base | GPQA-Diamond | 0.222222 | 352 / 1584 | 0.671717 | 133 / 198 | 320.3 |
| think_base | OlympiadBench | 0.071023 | 125 / 1760 | 0.222727 | 49 / 220 | 441.9 |
| think_base_rlmt | GSM8K | 0.094484 | 997 / 10552 | 0.476876 | 629 / 1319 | 324.9 |
| think_base_rlmt | MATH-500 | 0.287750 | 1151 / 4000 | 0.562000 | 281 / 500 | 376.2 |
| think_base_rlmt | GPQA-Diamond | 0.224116 | 355 / 1584 | 0.737374 | 146 / 198 | 323.3 |
| think_base_rlmt | OlympiadBench | 0.082386 | 145 / 1760 | 0.222727 | 49 / 220 | 447.4 |
| think_phase3 | GSM8K | 0.103393 | 1091 / 10552 | 0.512509 | 676 / 1319 | 321.8 |
| think_phase3 | MATH-500 | 0.282250 | 1129 / 4000 | 0.568000 | 284 / 500 | 362.1 |
| think_phase3 | GPQA-Diamond | 0.234848 | 372 / 1584 | 0.712121 | 141 / 198 | 323.8 |
| think_phase3 | OlympiadBench | 0.078409 | 138 / 1760 | 0.254545 | 56 / 220 | 441.0 |
| think_phase3_rlmt | GSM8K | 0.103014 | 1087 / 10552 | 0.506444 | 668 / 1319 | 322.0 |
| think_phase3_rlmt | MATH-500 | 0.273750 | 1095 / 4000 | 0.574000 | 287 / 500 | 366.2 |
| think_phase3_rlmt | GPQA-Diamond | 0.222854 | 353 / 1584 | 0.737374 | 146 / 198 | 316.6 |
| think_phase3_rlmt | OlympiadBench | 0.065341 | 115 / 1760 | 0.231818 | 51 / 220 | 435.9 |

Macro averages:

| Arm | Macro Mean@8 | Macro Pass@8 Any | Macro Avg Completion Tokens |
|---|---|---|---|
| think_base | 0.167993 | 0.494291 | 364.0 |
| think_base_rlmt | 0.172184 | 0.499744 | 368.0 |
| think_phase3 | 0.174725 | 0.511794 | 362.2 |
| think_phase3_rlmt | 0.166240 | 0.512409 | 360.2 |
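The three headline metrics can be stated precisely with a short sketch. These are assumed definitions, consistent with the per-benchmark and macro numbers in this section; the authoritative definitions live in docs/metrics-schema.md:

```python
from statistics import mean

def mean_at_k(per_problem_correct):
    """Sample-level accuracy: average correctness over every sample of
    every problem (the Mean@8 column when K = 8)."""
    flat = [c for samples in per_problem_correct for c in samples]
    return sum(flat) / len(flat)

def pass_at_k_any(per_problem_correct):
    """Problem-level coverage: fraction of problems where at least one
    of the K samples is correct (the Pass@8 Any column)."""
    solved = [1 if any(samples) else 0 for samples in per_problem_correct]
    return sum(solved) / len(solved)

def macro_average(per_benchmark_scores):
    """Unweighted mean over benchmarks (the macro rows): every
    benchmark counts equally regardless of its problem count."""
    return mean(per_benchmark_scores)
```

Because the macro average weights all four benchmarks equally, the 10552-sample GSM8K column does not dominate the 1760-sample OlympiadBench column.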

Direct reasoning-eval comparisons:

| Comparison | Benchmark | Mean@8 Delta | Pass@8 Any Delta | Avg Token Delta |
|---|---|---|---|---|
| Base+Think+RLMT vs Base+Think | GSM8K | -0.010993 | -0.031842 | +2.7 |
| Base+Think+RLMT vs Base+Think | MATH-500 | +0.014500 | -0.012000 | +4.7 |
| Base+Think+RLMT vs Base+Think | GPQA-Diamond | +0.001894 | +0.065657 | +3.1 |
| Base+Think+RLMT vs Base+Think | OlympiadBench | +0.011364 | +0.000000 | +5.5 |
| Base+Think+RLMT vs Base+Think | Macro | +0.004191 | +0.005454 | +4.0 |
| Phase3+Think+RLMT vs Phase3+Think | GSM8K | -0.000379 | -0.006065 | +0.2 |
| Phase3+Think+RLMT vs Phase3+Think | MATH-500 | -0.008500 | +0.006000 | +4.1 |
| Phase3+Think+RLMT vs Phase3+Think | GPQA-Diamond | -0.011995 | +0.025253 | -7.2 |
| Phase3+Think+RLMT vs Phase3+Think | OlympiadBench | -0.013068 | -0.022727 | -5.1 |
| Phase3+Think+RLMT vs Phase3+Think | Macro | -0.008486 | +0.000615 | -2.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GSM8K | +0.008529 | +0.029568 | -2.9 |
| Phase3+Think+RLMT vs Base+Think+RLMT | MATH-500 | -0.014000 | +0.012000 | -10.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GPQA-Diamond | -0.001263 | +0.000000 | -6.7 |
| Phase3+Think+RLMT vs Base+Think+RLMT | OlympiadBench | -0.017045 | +0.009091 | -11.5 |
| Phase3+Think+RLMT vs Base+Think+RLMT | Macro | -0.005945 | +0.012665 | -7.8 |
| Phase3+Think vs Base+Think | GSM8K | -0.002085 | +0.003791 | -0.4 |
| Phase3+Think vs Base+Think | MATH-500 | +0.009000 | -0.006000 | -9.4 |
| Phase3+Think vs Base+Think | GPQA-Diamond | +0.012626 | +0.040404 | +3.6 |
| Phase3+Think vs Base+Think | OlympiadBench | +0.007386 | +0.031818 | -0.9 |
| Phase3+Think vs Base+Think | Macro | +0.006732 | +0.017503 | -1.8 |

Claim Candidates

Evidence-backed candidates for the writing session:

  • Phase 3 has a clean positive held-out continuation-quality result: Phase3-final wins 81/128 pairwise comparisons against Qwen3-0.6B-Base (63.28%).
  • The RLMT loop completed cleanly for matched base and Phase 3 lineages at 200 steps with no invalid judge responses and no all-one reward collapse.
  • The two-stage post-RLMT reward gate improved mean reward for both RLMT arms and ranked think_phase3_rlmt highest by reward mean: 0.097656 vs 0.093750 for think_base_rlmt, 0.090820 for think_phase3, and 0.087891 for think_base.
  • Downstream reasoning eval gives a mixed but useful substrate signal: think_phase3 beats think_base on macro Mean@8 and macro Pass@8 Any; after RLMT, think_phase3_rlmt beats think_base_rlmt on macro Pass@8 Any but not macro Mean@8.
  • The SGLang eval path is the throughput-valid final eval path. The slower Hugging Face generate path was stopped intentionally after direct throughput instrumentation.

Claim boundaries:

  • Do not claim a uniform downstream reasoning win from RLMT alone. RLMT improves some benchmark/metric slices and the paper-aligned reward gate, but the reasoning suite is metric- and benchmark-dependent.
  • Do not infer full Table 10 equivalence. This is a small-scale reproduction/proxy using 8 samples/problem and four HF-hosted reasoning benchmarks, not the paper's full scale.
  • Treat sample-level Mean@8 and problem-level Pass@8 Any separately. They answer different questions: average sample correctness versus whether any rollout solves the problem.