# Phase 4B RLMT Results Source of Truth

Date: 2026-04-27

Purpose: verified local handoff for the next Codex/writing session. This is not the blog draft. It records source artifacts, completed training results, and pending overnight eval outputs so the research-blog claims can be grounded in files rather than memory.

## Research Question

Does self-improving continued pretraining make Qwen3-0.6B-Base a better substrate for thinking mid-training?

Portfolio framing: use Phase 3 self-improving pretraining, Phase 4B interleaved-thinking SFT, and Phase 4B RLMT as a small-scale proxy for moving frontier post-training recipes earlier in the model lifecycle.

## Primary Source Artifacts

- Paper: `/Users/jarrodbarnes/Downloads/Self-Improving Pretraining (1).pdf`
- Phase 3 eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md`
- Phase 3 qualitative audit: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-qualitative-audit-2026-04-25.md`
- Phase 4B data audit: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase4b-data-audit-2026-04-25.md`
- Metrics schema: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/metrics-schema.md`
- RLMT trainer: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_train.py`
- RLMT reward gate: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_reward_gate.py`
- Sharp eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_sharp_eval.py`
- Reasoning eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_reasoning_eval.py`

## Completed Training Results

Source host: `spark-f7e2` via `ssh spark`.

### Phase 3 Self-Improving Pretraining

Source artifact: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md`

- Method: full-pairwise Online DPO with K=16 rollouts.
- Held-out pairwise continuation quality: Phase3-final beat Qwen3-0.6B-Base on 81/128 comparisons (63.28%).
- Interpretation status: positive Phase 3 result already documented; use as the substrate-improvement premise for Phase 4B.

### Phase 4B SFT

Source artifacts on Spark:

- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/metrics.json`

| Arm | Checkpoint | Primary Metric | Val Loss Raw | Val Loss Selected | Repetition 4-gram Rate | Tokens Seen | Throughput tok/s | Min Available Mem GiB |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| raw_base SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/final.pt` | 2.694593 | 2.694593 | 2.694593 | 0.020321 | 4,036,608 | 2729.53 | 94.94 |
| think_base SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/final.pt` | 2.792869 | 2.707164 | 2.792869 | 0.019397 | 4,035,285 | 3284.51 | 69.78 |
| think_phase3 SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/final.pt` | 2.788120 | 2.708099 | 2.788120 | 0.019397 | 4,035,285 | 3298.08 | 68.65 |

Interpretation status: SFT installed the teacher-thought distribution but did not by itself produce a clear H3 gain pre-RLMT. Treat SFT as the setup for thinking mid-training, not the final claim.

### Phase 4B RLMT

Source artifacts on Spark:

- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt`

Configuration:

- K=16 samples per prefix.
- Two-stage external-boundary interface: sample the thought, externally insert the thought/suffix boundary, sample the suffix, judge only the suffix.
- 200 RLMT steps per matched arm.
- Prefixes per step: 2.
- Thought max tokens: 48.
- Suffix max tokens: 128.
- Learning rate: 1e-7.
- KL coefficient: 0.02.
- Max grad norm: 0.1.
- Stop conditions were logged as alerts only, per user instruction, except the non-finite-loss safety stop.

| Arm | Completed Steps | Stopped Reason | Avg Reward | Final Reward | Avg Mixed Groups | Avg Near-Zero Groups | Avg All-Zero Groups | Avg All-One Groups | Invalid Judge | Avg Length Drift | Avg Artifact Rate | Wall Time s | Final Checkpoint |
| --- | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| think_base RLMT | 200 | null | 0.092188 | 0.031250 | 0.5700 | 0.4300 | 0.4300 | 0.0000 | 0.0000 | -0.000511 | 0.132031 | 5488.32 | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt` |
| think_phase3 RLMT | 200 | null | 0.090313 | 0.093750 | 0.5625 | 0.4375 | 0.4375 | 0.0000 | 0.0000 | -0.000545 | 0.139844 | 5448.25 | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt` |

Interpretation status: training completed cleanly for the matched RLMT arms. Do not infer the substrate claim from training reward alone; use the post-RLMT reward gate and downstream evals below.
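The two-stage external-boundary rollout used for RLMT can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual trainer (`scripts/phase4b_rlmt_train.py`): the `sample_fn` callable, the `toy_sample` stand-in, and the `<think>`/`</think>` boundary strings are hypothetical names chosen for the sketch, not confirmed by the artifacts above.

```python
from typing import Callable, Tuple

def two_stage_rollout(
    sample_fn: Callable[[str, int], str],
    prefix: str,
    think_open: str = "<think>",    # assumed boundary marker
    think_close: str = "</think>",  # assumed boundary marker
    max_thought_tokens: int = 48,   # matches the logged config
    max_suffix_tokens: int = 128,   # matches the logged config
) -> Tuple[str, str]:
    """Sample a thought, insert the boundary externally, then sample the suffix.

    Only the returned suffix is shown to the judge; the thought is never scored.
    """
    # Stage 1: open the thought span and sample the hidden thought.
    thought_prompt = prefix + think_open
    thought = sample_fn(thought_prompt, max_thought_tokens)
    # If the model emitted the close marker itself, truncate there; the
    # boundary is always inserted externally below, never trusted from the model.
    if think_close in thought:
        thought = thought.split(think_close, 1)[0]
    # Stage 2: append the external boundary, then sample the visible suffix.
    suffix = sample_fn(thought_prompt + thought + think_close, max_suffix_tokens)
    return thought, suffix

def toy_sample(prompt: str, max_new_tokens: int) -> str:
    # Deterministic stand-in for a served model, for illustration only.
    return " plan the steps" if prompt.endswith("<think>") else " final answer"
```

In the real loop this rollout would run K=16 times per prefix, and the K suffix rewards would form the per-prefix group statistics (mixed / near-zero / all-zero / all-one) logged in the table above.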
## Overnight Eval Status

Eval ID: `phase4b-post-rlmt-eval-20260426-181047`

Watcher:

- `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.sh`
- `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.out`

Reward gate:

- Log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-20260426-181047-reward-gate.out`
- Output dir: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate`
- Summary: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate/summary.json`
- Arms: `think_base`, `think_phase3`, `think_base_rlmt`, `think_phase3_rlmt`
- Scope: 64 prefixes, 16 samples per prefix, two-stage external-boundary interface.
- Status at 2026-04-26 19:29 ET: complete. Split reasoning eval launched on both Spark hosts.

| Arm | Reward Mean | Reward Std | Invalid Rate | Mixed Groups | Near-Zero Groups | Any-Success Groups | All-Success Groups | Avg Total New Tokens | Avg Predicted Suffix Words | Closed Think Rate |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| think_base | 0.087891 | 0.283136 | 0.0000 | 0.546875 | 0.453125 | 0.546875 | 0.0000 | 175.8164 | 87.2734 | 1.0000 |
| think_phase3 | 0.090820 | 0.287353 | 0.0000 | 0.593750 | 0.406250 | 0.593750 | 0.0000 | 175.8818 | 86.7861 | 1.0000 |
| think_base_rlmt | 0.093750 | 0.291481 | 0.0000 | 0.640625 | 0.359375 | 0.640625 | 0.0000 | 175.8936 | 86.3281 | 1.0000 |
| think_phase3_rlmt | 0.097656 | 0.296849 | 0.0000 | 0.609375 | 0.390625 | 0.609375 | 0.0000 | 175.9229 | 88.7979 | 1.0000 |

Reward-gate comparisons from `summary.json`:

| Comparison | Mean Reward Delta | Right Better Prefix Rate | Left Better Prefix Rate | Tie Prefix Rate | Shared Prefixes |
| --- | ---: | ---: | ---: | ---: | ---: |
| think_base_rlmt_minus_think_base | 0.005859 | 0.281250 | 0.281250 | 0.437500 | 64 |
| think_phase3_minus_think_base | 0.002930 | 0.296875 | 0.250000 | 0.453125 | 64 |
| think_phase3_rlmt_minus_think_base | 0.009766 | 0.296875 | 0.296875 | 0.406250 | 64 |
| think_phase3_minus_think_base_rlmt | -0.002930 | 0.281250 | 0.312500 | 0.406250 | 64 |
| think_phase3_rlmt_minus_think_base_rlmt | 0.003906 | 0.250000 | 0.265625 | 0.484375 | 64 |
| think_phase3_rlmt_minus_think_phase3 | 0.006836 | 0.312500 | 0.312500 | 0.375000 | 64 |

Reward-gate go/no-go fields:

- `reward_validity_ok`: true
- `variance_ok`: true
- `phase3_more_reward_separable`: true
- `phase3_higher_mean_reward`: true

Reasoning eval:

- The initial Hugging Face `model.generate` eval was intentionally stopped after throughput instrumentation showed only ~106-140 generated tok/s. That path was valid but too slow for the final overnight eval.
- Optimized SGLang eval ID: `phase4b-post-rlmt-eval-sglang-20260427-0014`
- Optimized result root: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014`
- f7e2 runner/log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.sh`, `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.out`
- cfd0 runner/log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.sh`, `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.out`
- f7e2 arms: `think_base`, then `think_base_rlmt`
- cfd0 arms: `think_phase3`, then `think_phase3_rlmt`
- Serving image: `scitrera/dgx-spark-sglang:0.5.9-t5`
- Serving flags: `--served-model-name default --tp 1 --cuda-graph-max-bs 32 --num-continuous-decode-steps 16 --schedule-policy lpm --mem-fraction-static 0.70`
- Eval client: `scripts/phase4b_reasoning_eval_sglang.py`
- Eval config: 8 samples/problem, max 512 new tokens, concurrency 32, temperature 0.6, top_p 0.95.
- Completion status: complete on both hosts.
- f7e2 completed at `2026-04-27T01:57:57Z`.
- cfd0 completed at `2026-04-27T01:57:29Z`.
- Throughput from the final cfd0 progress logs: roughly 2.2k generated tok/s cumulative on `think_phase3_rlmt` OlympiadBench. f7e2 logs showed the same order of magnitude during the run.
- Note: the per-host `summary.json` files contain only the last arm run on that host. The full four-arm table below was reconstructed directly from the per-arm JSONL artifacts under the optimized result root.

Reasoning-eval source artifacts:

- `think_base`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base/*.jsonl`
- `think_base_rlmt`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base_rlmt/*.jsonl`
- `think_phase3`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3/*.jsonl`
- `think_phase3_rlmt`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3_rlmt/*.jsonl`

| Arm | Benchmark | Mean@8 | Correct / Samples | Pass@8 Any | Pass / Problems | Avg Completion Tokens |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| think_base | GSM8K | 0.105478 | 1113 / 10552 | 0.508719 | 671 / 1319 | 322.2 |
| think_base | MATH-500 | 0.273250 | 1093 / 4000 | 0.574000 | 287 / 500 | 371.5 |
| think_base | GPQA-Diamond | 0.222222 | 352 / 1584 | 0.671717 | 133 / 198 | 320.3 |
| think_base | OlympiadBench | 0.071023 | 125 / 1760 | 0.222727 | 49 / 220 | 441.9 |
| think_base_rlmt | GSM8K | 0.094484 | 997 / 10552 | 0.476876 | 629 / 1319 | 324.9 |
| think_base_rlmt | MATH-500 | 0.287750 | 1151 / 4000 | 0.562000 | 281 / 500 | 376.2 |
| think_base_rlmt | GPQA-Diamond | 0.224116 | 355 / 1584 | 0.737374 | 146 / 198 | 323.3 |
| think_base_rlmt | OlympiadBench | 0.082386 | 145 / 1760 | 0.222727 | 49 / 220 | 447.4 |
| think_phase3 | GSM8K | 0.103393 | 1091 / 10552 | 0.512509 | 676 / 1319 | 321.8 |
| think_phase3 | MATH-500 | 0.282250 | 1129 / 4000 | 0.568000 | 284 / 500 | 362.1 |
| think_phase3 | GPQA-Diamond | 0.234848 | 372 / 1584 | 0.712121 | 141 / 198 | 323.8 |
| think_phase3 | OlympiadBench | 0.078409 | 138 / 1760 | 0.254545 | 56 / 220 | 441.0 |
| think_phase3_rlmt | GSM8K | 0.103014 | 1087 / 10552 | 0.506444 | 668 / 1319 | 322.0 |
| think_phase3_rlmt | MATH-500 | 0.273750 | 1095 / 4000 | 0.574000 | 287 / 500 | 366.2 |
| think_phase3_rlmt | GPQA-Diamond | 0.222854 | 353 / 1584 | 0.737374 | 146 / 198 | 316.6 |
| think_phase3_rlmt | OlympiadBench | 0.065341 | 115 / 1760 | 0.231818 | 51 / 220 | 435.9 |

Macro averages:

| Arm | Macro Mean@8 | Macro Pass@8 Any | Macro Avg Completion Tokens |
| --- | ---: | ---: | ---: |
| think_base | 0.167993 | 0.494291 | 364.0 |
| think_base_rlmt | 0.172184 | 0.499744 | 368.0 |
| think_phase3 | 0.174725 | 0.511794 | 362.2 |
| think_phase3_rlmt | 0.166240 | 0.512409 | 360.2 |

Direct reasoning-eval comparisons:

| Comparison | Benchmark | Mean@8 Delta | Pass@8 Any Delta | Avg Token Delta |
| --- | --- | ---: | ---: | ---: |
| Base+Think+RLMT vs Base+Think | GSM8K | -0.010993 | -0.031842 | +2.7 |
| Base+Think+RLMT vs Base+Think | MATH-500 | +0.014500 | -0.012000 | +4.7 |
| Base+Think+RLMT vs Base+Think | GPQA-Diamond | +0.001894 | +0.065657 | +3.1 |
| Base+Think+RLMT vs Base+Think | OlympiadBench | +0.011364 | +0.000000 | +5.5 |
| Base+Think+RLMT vs Base+Think | Macro | +0.004191 | +0.005454 | +4.0 |
| Phase3+Think+RLMT vs Phase3+Think | GSM8K | -0.000379 | -0.006065 | +0.2 |
| Phase3+Think+RLMT vs Phase3+Think | MATH-500 | -0.008500 | +0.006000 | +4.1 |
| Phase3+Think+RLMT vs Phase3+Think | GPQA-Diamond | -0.011995 | +0.025253 | -7.2 |
| Phase3+Think+RLMT vs Phase3+Think | OlympiadBench | -0.013068 | -0.022727 | -5.1 |
| Phase3+Think+RLMT vs Phase3+Think | Macro | -0.008486 | +0.000615 | -2.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GSM8K | +0.008529 | +0.029568 | -2.9 |
| Phase3+Think+RLMT vs Base+Think+RLMT | MATH-500 | -0.014000 | +0.012000 | -10.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GPQA-Diamond | -0.001263 | +0.000000 | -6.7 |
| Phase3+Think+RLMT vs Base+Think+RLMT | OlympiadBench | -0.017045 | +0.009091 | -11.5 |
| Phase3+Think+RLMT vs Base+Think+RLMT | Macro | -0.005945 | +0.012665 | -7.8 |
| Phase3+Think vs Base+Think | GSM8K | -0.002085 | +0.003791 | -0.4 |
| Phase3+Think vs Base+Think | MATH-500 | +0.009000 | -0.006000 | -9.4 |
| Phase3+Think vs Base+Think | GPQA-Diamond | +0.012626 | +0.040404 | +3.6 |
| Phase3+Think vs Base+Think | OlympiadBench | +0.007386 | +0.031818 | -0.9 |
| Phase3+Think vs Base+Think | Macro | +0.006732 | +0.017503 | -1.8 |

## Claim Candidates

Evidence-backed candidates for the writing session:

- Phase 3 has a clean positive held-out continuation-quality result: Phase3-final wins 81/128 pairwise comparisons against Qwen3-0.6B-Base (63.28%).
- The RLMT loop completed cleanly for the matched base and Phase 3 lineages at 200 steps, with no invalid judge responses and no all-one reward collapse.
- The two-stage post-RLMT reward gate improved mean reward for both RLMT arms and ranked `think_phase3_rlmt` highest by reward mean: 0.097656 vs 0.093750 for `think_base_rlmt`, 0.090820 for `think_phase3`, and 0.087891 for `think_base`.
- The downstream reasoning eval gives a mixed but useful substrate signal: `think_phase3` beats `think_base` on macro Mean@8 and macro Pass@8 Any; after RLMT, `think_phase3_rlmt` beats `think_base_rlmt` on macro Pass@8 Any but not macro Mean@8.
- The SGLang eval path is the throughput-valid final eval path. The slower Hugging Face generate path was stopped intentionally after direct throughput instrumentation.

Claim boundaries:

- Do not claim a uniform downstream reasoning win from RLMT alone.
  RLMT improves some benchmark/metric slices and the paper-aligned reward gate, but the reasoning suite is metric- and benchmark-dependent.
- Do not infer full Table 10 equivalence. This is a small-scale reproduction/proxy using 8 samples/problem and four HF-hosted reasoning benchmarks, not the paper's full scale.
- Treat sample-level Mean@8 and problem-level Pass@8 Any separately. They answer different questions: average sample correctness versus whether any rollout solves the problem.
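The Mean@8 versus Pass@8 Any distinction can be made concrete with a small helper. This is an illustrative sketch, not the code in `scripts/phase4b_reasoning_eval.py`; it assumes per-problem lists of boolean sample grades, and the helper names are hypothetical.

```python
def mean_at_k(grades: list[list[bool]]) -> float:
    """Sample-level: average correctness over all (problem, sample) pairs."""
    total = sum(len(g) for g in grades)
    return sum(sum(g) for g in grades) / total

def pass_at_k_any(grades: list[list[bool]]) -> float:
    """Problem-level: fraction of problems with at least one correct rollout."""
    return sum(any(g) for g in grades) / len(grades)

def macro_average(per_benchmark: list[float]) -> float:
    """Unweighted mean across benchmarks, matching the macro rows above."""
    return sum(per_benchmark) / len(per_benchmark)

# Two problems, 4 samples each: one solved by a single rollout, one unsolved.
# mean_at_k is 1/8 while pass_at_k_any is 1/2, so the two metrics can diverge
# sharply on the same grades — the reason the claim boundary keeps them separate.
grades = [[True, False, False, False], [False, False, False, False]]
```

A model that rarely produces a correct sample but occasionally finds a solving rollout will look weak on Mean@8 and strong on Pass@8 Any, which is exactly the pattern in several RLMT comparison rows above.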