# Phase 4B RLMT Results Source of Truth

Date: 2026-04-27

Purpose: verified local handoff for the next Codex/writing session. This is not the blog draft. It records source artifacts, completed training results, and pending overnight eval outputs so the research-blog claims can be grounded in files rather than memory.

## Research Question

Does self-improving continued pretraining make Qwen3-0.6B-Base a better substrate for thinking mid-training?

Portfolio framing: use Phase 3 self-improving pretraining, Phase 4B interleaved-thinking SFT, and Phase 4B RLMT as a small-scale proxy for moving frontier post-training recipes earlier in the model lifecycle.

## Primary Source Artifacts

- Paper: `/Users/jarrodbarnes/Downloads/Self-Improving Pretraining (1).pdf`
- Phase 3 eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md`
- Phase 3 qualitative audit: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-qualitative-audit-2026-04-25.md`
- Phase 4B data audit: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase4b-data-audit-2026-04-25.md`
- Metrics schema: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/metrics-schema.md`
- RLMT trainer: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_train.py`
- RLMT reward gate: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_reward_gate.py`
- Sharp eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_sharp_eval.py`
- Reasoning eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_reasoning_eval.py`

## Completed Training Results

Source host: `spark-f7e2` via `ssh spark`.

### Phase 3 Self-Improving Pretraining

Source artifact: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md`

- Method: full-pairwise Online DPO with K=16 rollouts.
- Held-out pairwise continuation quality: Phase3-final beat Qwen3-0.6B-Base on 81/128 comparisons (63.28%).
- Interpretation status: positive Phase 3 result already documented; use as the substrate-improvement premise for Phase 4B.

### Phase 4B SFT

Source artifacts on Spark:

- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/metrics.json`

| Arm | Checkpoint | Primary Metric | Val Loss Raw | Val Loss Selected | Repetition 4-gram Rate | Tokens Seen | Throughput tok/s | Min Available Mem GiB |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| raw_base SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/final.pt` | 2.694593 | 2.694593 | 2.694593 | 0.020321 | 4,036,608 | 2729.53 | 94.94 |
| think_base SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/final.pt` | 2.792869 | 2.707164 | 2.792869 | 0.019397 | 4,035,285 | 3284.51 | 69.78 |
| think_phase3 SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/final.pt` | 2.788120 | 2.708099 | 2.788120 | 0.019397 | 4,035,285 | 3298.08 | 68.65 |

Interpretation status: SFT installed the teacher-thought distribution but did not by itself produce a clear H3 gain pre-RLMT. Treat SFT as the setup for thinking mid-training, not the final claim.

### Phase 4B RLMT

Source artifacts on Spark:

- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt`

Configuration:

- K=16 samples per prefix.
- Two-stage external-boundary interface: sample the thought, externally insert the thought/suffix boundary, sample the suffix, judge only the suffix.
- 200 RLMT steps per matched arm.
- Prefixes per step: 2.
- Thought max tokens: 48.
- Suffix max tokens: 128.
- Learning rate: 1e-7.
- KL coefficient: 0.02.
- Max grad norm: 0.1.
- Stop conditions were logged as alerts only, per user instruction, except the non-finite-loss safety stop.

| Arm | Completed Steps | Stopped Reason | Avg Reward | Final Reward | Avg Mixed Groups | Avg Near-Zero Groups | Avg All-Zero Groups | Avg All-One Groups | Invalid Judge | Avg Length Drift | Avg Artifact Rate | Wall Time s | Final Checkpoint |
| --- | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| think_base RLMT | 200 | null | 0.092188 | 0.031250 | 0.5700 | 0.4300 | 0.4300 | 0.0000 | 0.0000 | -0.000511 | 0.132031 | 5488.32 | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt` |
| think_phase3 RLMT | 200 | null | 0.090313 | 0.093750 | 0.5625 | 0.4375 | 0.4375 | 0.0000 | 0.0000 | -0.000545 | 0.139844 | 5448.25 | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt` |

Interpretation status: training completed cleanly for the matched RLMT arms. Do not infer the substrate claim from training reward alone; use the post-RLMT reward gate and downstream evals below.
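The two-stage external-boundary rollout used for RLMT can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual trainer (`scripts/phase4b_rlmt_train.py`): the `sample_fn` callable, the `toy_sample` stand-in, and the `<think>`/`</think>` boundary strings are hypothetical names chosen for the sketch, not confirmed by the artifacts above.

```python
from typing import Callable, Tuple

def two_stage_rollout(
    sample_fn: Callable[[str, int], str],
    prefix: str,
    think_open: str = "<think>",    # assumed boundary marker
    think_close: str = "</think>",  # assumed boundary marker
    max_thought_tokens: int = 48,   # matches the logged config
    max_suffix_tokens: int = 128,   # matches the logged config
) -> Tuple[str, str]:
    """Sample a thought, insert the boundary externally, then sample the suffix.

    Only the returned suffix is shown to the judge; the thought is never scored.
    """
    # Stage 1: open the thought span and sample the hidden thought.
    thought_prompt = prefix + think_open
    thought = sample_fn(thought_prompt, max_thought_tokens)
    # If the model emitted the close marker itself, truncate there; the
    # boundary is always inserted externally below, never trusted from the model.
    if think_close in thought:
        thought = thought.split(think_close, 1)[0]
    # Stage 2: append the external boundary, then sample the visible suffix.
    suffix = sample_fn(thought_prompt + thought + think_close, max_suffix_tokens)
    return thought, suffix

def toy_sample(prompt: str, max_new_tokens: int) -> str:
    # Deterministic stand-in for a served model, for illustration only.
    return " plan the steps" if prompt.endswith("<think>") else " final answer"
```

In the real loop this rollout would run K=16 times per prefix, and the K suffix rewards would form the per-prefix group statistics (mixed / near-zero / all-zero / all-one) logged in the table above.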
## Overnight Eval Status

Eval ID: `phase4b-post-rlmt-eval-20260426-181047`

Watcher:

- `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.sh`
- `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.out`

Reward gate:

- Log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-20260426-181047-reward-gate.out`
- Output dir: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate`
- Summary: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate/summary.json`
- Arms: `think_base`, `think_phase3`, `think_base_rlmt`, `think_phase3_rlmt`
- Scope: 64 prefixes, 16 samples per prefix, two-stage external-boundary interface.
- Status at 2026-04-26 19:29 ET: complete. Split reasoning eval launched on both Spark hosts.

| Arm | Reward Mean | Reward Std | Invalid Rate | Mixed Groups | Near-Zero Groups | Any-Success Groups | All-Success Groups | Avg Total New Tokens | Avg Predicted Suffix Words | Closed Think Rate |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| think_base | 0.087891 | 0.283136 | 0.0000 | 0.546875 | 0.453125 | 0.546875 | 0.0000 | 175.8164 | 87.2734 | 1.0000 |
| think_phase3 | 0.090820 | 0.287353 | 0.0000 | 0.593750 | 0.406250 | 0.593750 | 0.0000 | 175.8818 | 86.7861 | 1.0000 |
| think_base_rlmt | 0.093750 | 0.291481 | 0.0000 | 0.640625 | 0.359375 | 0.640625 | 0.0000 | 175.8936 | 86.3281 | 1.0000 |
| think_phase3_rlmt | 0.097656 | 0.296849 | 0.0000 | 0.609375 | 0.390625 | 0.609375 | 0.0000 | 175.9229 | 88.7979 | 1.0000 |

Reward-gate comparisons from `summary.json`:

| Comparison | Mean Reward Delta | Right Better Prefix Rate | Left Better Prefix Rate | Tie Prefix Rate | Shared Prefixes |
| --- | ---: | ---: | ---: | ---: | ---: |
| think_base_rlmt_minus_think_base | 0.005859 | 0.281250 | 0.281250 | 0.437500 | 64 |
| think_phase3_minus_think_base | 0.002930 | 0.296875 | 0.250000 | 0.453125 | 64 |
| think_phase3_rlmt_minus_think_base | 0.009766 | 0.296875 | 0.296875 | 0.406250 | 64 |
| think_phase3_minus_think_base_rlmt | -0.002930 | 0.281250 | 0.312500 | 0.406250 | 64 |
| think_phase3_rlmt_minus_think_base_rlmt | 0.003906 | 0.250000 | 0.265625 | 0.484375 | 64 |
| think_phase3_rlmt_minus_think_phase3 | 0.006836 | 0.312500 | 0.312500 | 0.375000 | 64 |

Reward-gate go/no-go fields:

- `reward_validity_ok`: true
- `variance_ok`: true
- `phase3_more_reward_separable`: true
- `phase3_higher_mean_reward`: true

Reasoning eval:

- The initial Hugging Face `model.generate` eval was intentionally stopped after throughput instrumentation showed only ~106-140 generated tok/s. That path was valid but too slow for the final overnight eval.
- Optimized SGLang eval ID: `phase4b-post-rlmt-eval-sglang-20260427-0014`
- Optimized result root: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014`
- f7e2 runner/log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.sh`, `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.out`
- cfd0 runner/log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.sh`, `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.out`
- f7e2 arms: `think_base`, then `think_base_rlmt`
- cfd0 arms: `think_phase3`, then `think_phase3_rlmt`
- Serving image: `scitrera/dgx-spark-sglang:0.5.9-t5`
- Serving flags: `--served-model-name default --tp 1 --cuda-graph-max-bs 32 --num-continuous-decode-steps 16 --schedule-policy lpm --mem-fraction-static 0.70`
- Eval client: `scripts/phase4b_reasoning_eval_sglang.py`
- Eval config: 8 samples/problem, max 512 new tokens, concurrency 32, temperature 0.6, top_p 0.95.
- Completion status: complete on both hosts.
- f7e2 completed at `2026-04-27T01:57:57Z`.
- cfd0 completed at `2026-04-27T01:57:29Z`.
- Throughput from the final cfd0 progress logs: roughly 2.2k generated tok/s cumulative on `think_phase3_rlmt` OlympiadBench. f7e2 logs showed the same order of magnitude during the run.
- Note: the per-host `summary.json` files contain only the last arm run on that host. The full four-arm table below was reconstructed directly from the per-arm JSONL artifacts under the optimized result root.

Reasoning-eval source artifacts:

- `think_base`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base/*.jsonl`
- `think_base_rlmt`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base_rlmt/*.jsonl`
- `think_phase3`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3/*.jsonl`
- `think_phase3_rlmt`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3_rlmt/*.jsonl`

| Arm | Benchmark | Mean@8 | Correct / Samples | Pass@8 Any | Pass / Problems | Avg Completion Tokens |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| think_base | GSM8K | 0.105478 | 1113 / 10552 | 0.508719 | 671 / 1319 | 322.2 |
| think_base | MATH-500 | 0.273250 | 1093 / 4000 | 0.574000 | 287 / 500 | 371.5 |
| think_base | GPQA-Diamond | 0.222222 | 352 / 1584 | 0.671717 | 133 / 198 | 320.3 |
| think_base | OlympiadBench | 0.071023 | 125 / 1760 | 0.222727 | 49 / 220 | 441.9 |
| think_base_rlmt | GSM8K | 0.094484 | 997 / 10552 | 0.476876 | 629 / 1319 | 324.9 |
| think_base_rlmt | MATH-500 | 0.287750 | 1151 / 4000 | 0.562000 | 281 / 500 | 376.2 |
| think_base_rlmt | GPQA-Diamond | 0.224116 | 355 / 1584 | 0.737374 | 146 / 198 | 323.3 |
| think_base_rlmt | OlympiadBench | 0.082386 | 145 / 1760 | 0.222727 | 49 / 220 | 447.4 |
| think_phase3 | GSM8K | 0.103393 | 1091 / 10552 | 0.512509 | 676 / 1319 | 321.8 |
| think_phase3 | MATH-500 | 0.282250 | 1129 / 4000 | 0.568000 | 284 / 500 | 362.1 |
| think_phase3 | GPQA-Diamond | 0.234848 | 372 / 1584 | 0.712121 | 141 / 198 | 323.8 |
| think_phase3 | OlympiadBench | 0.078409 | 138 / 1760 | 0.254545 | 56 / 220 | 441.0 |
| think_phase3_rlmt | GSM8K | 0.103014 | 1087 / 10552 | 0.506444 | 668 / 1319 | 322.0 |
| think_phase3_rlmt | MATH-500 | 0.273750 | 1095 / 4000 | 0.574000 | 287 / 500 | 366.2 |
| think_phase3_rlmt | GPQA-Diamond | 0.222854 | 353 / 1584 | 0.737374 | 146 / 198 | 316.6 |
| think_phase3_rlmt | OlympiadBench | 0.065341 | 115 / 1760 | 0.231818 | 51 / 220 | 435.9 |

Macro averages:

| Arm | Macro Mean@8 | Macro Pass@8 Any | Macro Avg Completion Tokens |
| --- | ---: | ---: | ---: |
| think_base | 0.167993 | 0.494291 | 364.0 |
| think_base_rlmt | 0.172184 | 0.499744 | 368.0 |
| think_phase3 | 0.174725 | 0.511794 | 362.2 |
| think_phase3_rlmt | 0.166240 | 0.512409 | 360.2 |

Direct reasoning-eval comparisons:

| Comparison | Benchmark | Mean@8 Delta | Pass@8 Any Delta | Avg Token Delta |
| --- | --- | ---: | ---: | ---: |
| Base+Think+RLMT vs Base+Think | GSM8K | -0.010993 | -0.031842 | +2.7 |
| Base+Think+RLMT vs Base+Think | MATH-500 | +0.014500 | -0.012000 | +4.7 |
| Base+Think+RLMT vs Base+Think | GPQA-Diamond | +0.001894 | +0.065657 | +3.1 |
| Base+Think+RLMT vs Base+Think | OlympiadBench | +0.011364 | +0.000000 | +5.5 |
| Base+Think+RLMT vs Base+Think | Macro | +0.004191 | +0.005454 | +4.0 |
| Phase3+Think+RLMT vs Phase3+Think | GSM8K | -0.000379 | -0.006065 | +0.2 |
| Phase3+Think+RLMT vs Phase3+Think | MATH-500 | -0.008500 | +0.006000 | +4.1 |
| Phase3+Think+RLMT vs Phase3+Think | GPQA-Diamond | -0.011995 | +0.025253 | -7.2 |
| Phase3+Think+RLMT vs Phase3+Think | OlympiadBench | -0.013068 | -0.022727 | -5.1 |
| Phase3+Think+RLMT vs Phase3+Think | Macro | -0.008486 | +0.000615 | -2.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GSM8K | +0.008529 | +0.029568 | -2.9 |
| Phase3+Think+RLMT vs Base+Think+RLMT | MATH-500 | -0.014000 | +0.012000 | -10.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GPQA-Diamond | -0.001263 | +0.000000 | -6.7 |
| Phase3+Think+RLMT vs Base+Think+RLMT | OlympiadBench | -0.017045 | +0.009091 | -11.5 |
| Phase3+Think+RLMT vs Base+Think+RLMT | Macro | -0.005945 | +0.012665 | -7.8 |
| Phase3+Think vs Base+Think | GSM8K | -0.002085 | +0.003791 | -0.4 |
| Phase3+Think vs Base+Think | MATH-500 | +0.009000 | -0.006000 | -9.4 |
| Phase3+Think vs Base+Think | GPQA-Diamond | +0.012626 | +0.040404 | +3.6 |
| Phase3+Think vs Base+Think | OlympiadBench | +0.007386 | +0.031818 | -0.9 |
| Phase3+Think vs Base+Think | Macro | +0.006732 | +0.017503 | -1.8 |

## Claim Candidates

Evidence-backed candidates for the writing session:

- Phase 3 has a clean positive held-out continuation-quality result: Phase3-final wins 81/128 pairwise comparisons against Qwen3-0.6B-Base (63.28%).
- The RLMT loop completed cleanly for the matched base and Phase 3 lineages at 200 steps, with no invalid judge responses and no all-one reward collapse.
- The two-stage post-RLMT reward gate improved mean reward for both RLMT arms and ranked `think_phase3_rlmt` highest by reward mean: 0.097656 vs 0.093750 for `think_base_rlmt`, 0.090820 for `think_phase3`, and 0.087891 for `think_base`.
- The downstream reasoning eval gives a mixed but useful substrate signal: `think_phase3` beats `think_base` on macro Mean@8 and macro Pass@8 Any; after RLMT, `think_phase3_rlmt` beats `think_base_rlmt` on macro Pass@8 Any but not macro Mean@8.
- The SGLang eval path is the throughput-valid final eval path. The slower Hugging Face generate path was stopped intentionally after direct throughput instrumentation.

Claim boundaries:

- Do not claim a uniform downstream reasoning win from RLMT alone.
  RLMT improves some benchmark/metric slices and the paper-aligned reward gate, but the reasoning suite is metric- and benchmark-dependent.
- Do not infer full Table 10 equivalence. This is a small-scale reproduction/proxy using 8 samples/problem and four HF-hosted reasoning benchmarks, not the paper's full scale.
- Treat sample-level Mean@8 and problem-level Pass@8 Any separately. They answer different questions: average sample correctness versus whether any rollout solves the problem.
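The Mean@8 versus Pass@8 Any distinction can be made concrete with a small helper. This is an illustrative sketch, not the code in `scripts/phase4b_reasoning_eval.py`; it assumes per-problem lists of boolean sample grades, and the helper names are hypothetical.

```python
def mean_at_k(grades: list[list[bool]]) -> float:
    """Sample-level: average correctness over all (problem, sample) pairs."""
    total = sum(len(g) for g in grades)
    return sum(sum(g) for g in grades) / total

def pass_at_k_any(grades: list[list[bool]]) -> float:
    """Problem-level: fraction of problems with at least one correct rollout."""
    return sum(any(g) for g in grades) / len(grades)

def macro_average(per_benchmark: list[float]) -> float:
    """Unweighted mean across benchmarks, matching the macro rows above."""
    return sum(per_benchmark) / len(per_benchmark)

# Two problems, 4 samples each: one solved by a single rollout, one unsolved.
# mean_at_k is 1/8 while pass_at_k_any is 1/2, so the two metrics can diverge
# sharply on the same grades — the reason the claim boundary keeps them separate.
grades = [[True, False, False, False], [False, False, False, False]]
```

A model that rarely produces a correct sample but occasionally finds a solving rollout will look weak on Mean@8 and strong on Pass@8 Any, which is exactly the pattern in several RLMT comparison rows above.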