Initialize project; model provided by the ModelHub XC community

Model: Jarrodbarnes/qwen3-0.6B-interleaved-thinking
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-01 22:12:10 +08:00
commit a9fc551b52
15 changed files with 152363 additions and 0 deletions

36
.gitattributes vendored Normal file
View File

@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text

121
README.md Normal file
View File

@@ -0,0 +1,121 @@
---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B-Base
datasets:
- Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data
language:
- en
pipeline_tag: text-generation
tags:
- qwen3
- text-generation
- interleaved-thinking
- synthetic-pretraining
- rlmt
- research
model-index:
- name: qwen3-0.6B-interleaved-thinking
results: []
---
# qwen3-0.6B-interleaved-thinking
This is a small research model derived from `Qwen/Qwen3-0.6B-Base`. It was trained to test whether ordinary pretraining text can be turned into a sequence of training environments before agentic post-training begins.
The training pipeline adapts the self-improving pretraining and thinking mid-training setup from Tan et al. to Qwen3-0.6B-Base using FineWeb-Edu chunks. The goal was not to build a production assistant. The goal was to ask whether a small base model can learn a continuation preference, learn an interleaved thought interface, and make that interface rewardable through RL mid-training.
The model is released with the companion dataset:
[`Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data`](https://huggingface.co/datasets/Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data)
## Summary
This checkpoint is the final `Phase3+Think+RLMT` arm from the blog post *Self-Improving Pretraining as a Substrate for Agentic Post-Training*.
At a high level:
- continued pretraining selected for judged-better continuations rather than exact suffix imitation
- interleaved-thinking SFT taught the model where short local thoughts appear in ordinary text
- RLMT rewarded the thought-conditioned suffix prediction, not the thought itself
- a causal thought-use probe found that replacing the thought with an unrelated thought sharply reduced suffix reward
The results should be read as evidence of an emerging thought-conditioned continuation interface at 0.6B scale, not of a mature reasoning or assistant policy.
## Training Lineage
1. `Qwen/Qwen3-0.6B-Base`
2. Self-improving continued pretraining with Online DPO-style sequence preference training
3. SFT on interleaved teacher-thought pretraining chunks
4. 200-step RLMT run using the paper-aligned reward object:
`prefix -> generated thought -> predicted suffix -> judge(predicted suffix, true suffix)`
The checkpoint packaged here is the `Phase3+Think+RLMT` arm:
`outputs/phase4b_rlmt_think_phase3/final.pt`
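As a rough illustration of the reward object in step 4, here is a minimal sketch with a crude lexical-overlap stand-in for the served LLM judge. The helper name, the overlap heuristic, and the example strings are illustrative only; the actual pipeline scores suffixes with a judge model.
```python
def judge_suffix(predicted_suffix: str, true_suffix: str) -> float:
    # Stand-in for the served LLM judge. The judge sees only the predicted
    # suffix and the true suffix; the thought itself is never scored.
    pred = set(predicted_suffix.lower().split())
    true = set(true_suffix.lower().split())
    return 1.0 if true and len(pred & true) / len(true) > 0.5 else 0.0

# prefix -> generated thought -> predicted suffix -> judge(predicted suffix, true suffix)
# The thought conditions the suffix but receives no direct reward.
reward = judge_suffix(
    predicted_suffix="the sun heats surface water and it begins to evaporate",
    true_suffix="the sun heats surface water, causing it to evaporate",
)
```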
## What This Model Is
This is an experimental 0.6B model for studying whether interleaved thoughts can become a behavioral control surface during mid-training.
It is useful for:
- small-scale thinking mid-training experiments
- causal thought-use probes
- studying self-improving pretraining, interleaved SFT, and RLMT mechanics
- reproducing the associated research blog results
It is not an instruction-tuned assistant model and should not be evaluated as one.
## Main Findings
The blog reports the following stage-level results:
- Continued pretraining improved held-out judged continuation quality: the self-improved checkpoint beat Qwen3-0.6B-Base on 81 of 128 pairwise judgments.
- Interleaved-thinking SFT installed the thought interface: thought-token NLL dropped from 4.24 to about 3.14-3.16 after SFT.
- RLMT made the interface rewardable under the suffix-prediction objective: the self-improved RLMT arm reached the highest reward-gate mean among the four compared arms.
- The causal thought-use probe showed that thought text was not just formatting: swapped unrelated thoughts sharply reduced suffix reward.
These are small-scale lifecycle results. The downstream reasoning evaluation was mixed across benchmarks and metrics.
## Claim Boundaries
The key claim supported by this release is that a small base model can be shaped into an emerging thought-conditioned continuation interface before agentic post-training. The evidence does not support a claim that this model learned a mature agentic reasoning policy.
The thought-use result should be read carefully. Swapped thoughts reduced reward, so the thought channel mattered behaviorally. Blank and generic thoughts often matched or beat sampled thoughts, so the model had not learned to reliably generate the best thoughts for that channel.
The scale matters: this is a 0.6B parameter model with a short 200-step RLMT run.
See:
- `RLMT_RESULTS.md`
- `THOUGHT_USE_PROBE.md`
## Loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "Jarrodbarnes/qwen3-0.6B-interleaved-thinking"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto")
```
## Prompting Format
The training format uses interleaved thought spans:
```text
Prefix text ... <think>short local thought about what comes next</think> predicted continuation ...
```
For RLMT-style generation, the small-scale training interface sampled a thought first, externally inserted `</think>`, then sampled the predicted suffix.
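A minimal sketch of that two-stage interface, assuming `model` and `tokenizer` from the Loading section above. The token budgets mirror the settings documented in `RLMT_RESULTS.md` (48 thought tokens, 128 suffix tokens); the sampling temperature/top_p and the helper itself are illustrative, not the repo's trainer code.
```python
def sample_continuation(text: str, max_new_tokens: int) -> str:
    # Returns only the newly generated text after `text`.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=0.6, top_p=0.95,
                         max_new_tokens=max_new_tokens)
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=False)

prefix = "Prefix text about the topic ..."
# Stage 1: open a thought span and sample a short local thought.
thought = sample_continuation(prefix + " <think>", max_new_tokens=48)
thought = thought.split("</think>")[0].strip()
# Stage 2: externally insert </think>, then sample the predicted suffix.
suffix = sample_continuation(prefix + " <think>" + thought + "</think> ",
                             max_new_tokens=128)
```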
## Limitations
- Small model: 0.6B parameters.
- Short RLMT budget: 200 update steps.
- Research artifact, not a production model.
- Generated thoughts are not reliably better than generic scaffolds in the released experiment.
- Downstream reasoning did not improve uniformly across Mean@8 and Pass@8.
- Claims should be limited to the documented small-scale setup.

218
RLMT_RESULTS.md Normal file
View File

@@ -0,0 +1,218 @@
# Phase 4B RLMT Results Source of Truth
Date: 2026-04-27
Purpose: verified local handoff for the next Codex/writing session. This is not the blog draft. It records source artifacts, completed training results, and pending overnight eval outputs so the research-blog claims can be grounded in files rather than memory.
## Research Question
Does self-improving continued pretraining make Qwen3-0.6B-Base a better substrate for thinking mid-training?
Portfolio framing: use Phase 3 self-improving pretraining, Phase 4B interleaved-thinking SFT, and Phase 4B RLMT as a small-scale proxy for moving frontier post-training recipes earlier in the model lifecycle.
## Primary Source Artifacts
- Paper: `/Users/jarrodbarnes/Downloads/Self-Improving Pretraining (1).pdf`
- Phase 3 eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md`
- Phase 3 qualitative audit: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-qualitative-audit-2026-04-25.md`
- Phase 4B data audit: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase4b-data-audit-2026-04-25.md`
- Metrics schema: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/metrics-schema.md`
- RLMT trainer: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_train.py`
- RLMT reward gate: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_rlmt_reward_gate.py`
- Sharp eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_sharp_eval.py`
- Reasoning eval: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_reasoning_eval.py`
## Completed Training Results
Source host: `spark-f7e2` via `ssh spark`.
### Phase 3 Self-Improving Pretraining
Source artifact: `/Users/jarrodbarnes/projects/synthetic-pretrain/docs/phase3-eval-2026-04-25.md`
- Method: full-pairwise Online DPO with K=16 rollouts.
- Held-out pairwise continuation quality: Phase3-final beat Qwen3-0.6B-Base on 81/128 comparisons = 63.28%.
- Interpretation status: positive Phase 3 result already documented; use as the substrate-improvement premise for Phase 4B.
### Phase 4B SFT
Source artifacts on Spark:
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/metrics.json`
| Arm | Checkpoint | Primary Metric | Val Loss Raw | Val Loss Selected | Repetition 4-gram Rate | Tokens Seen | Throughput tok/s | Min Available Mem GiB |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| raw_base SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_raw_base/final.pt` | 2.694593 | 2.694593 | 2.694593 | 0.020321 | 4,036,608 | 2729.53 | 94.94 |
| think_base SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_base/final.pt` | 2.792869 | 2.707164 | 2.792869 | 0.019397 | 4,035,285 | 3284.51 | 69.78 |
| think_phase3 SFT | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_sft_think_phase3/final.pt` | 2.788120 | 2.708099 | 2.788120 | 0.019397 | 4,035,285 | 3298.08 | 68.65 |
Interpretation status: SFT installed the teacher-thought distribution but did not by itself produce a clear H3 gain pre-RLMT. Treat SFT as the setup for thinking mid-training, not the final claim.
### Phase 4B RLMT
Source artifacts on Spark:
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/metrics.json`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt`
- `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt`
Configuration (summarized as a Python sketch after this list):
- K=16 samples per prefix.
- Two-stage external-boundary interface: sample thought, externally insert thought/suffix boundary, sample suffix, judge only suffix.
- 200 RLMT steps per matched arm.
- Prefixes per step: 2.
- Thought max tokens: 48.
- Suffix max tokens: 128.
- Learning rate: 1e-7.
- KL coefficient: 0.02.
- Max grad norm: 0.1.
- Stop conditions were logged as alerts only, per user instruction, except for the non-finite-loss safety stop.
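The same settings, collected into a plain Python dict for reference. Key names are illustrative; the actual runs were driven by the generated YAML configs.
```python
rlmt_config = {
    "samples_per_prefix": 16,          # K=16
    "rlmt_steps": 200,                 # per matched arm
    "prefixes_per_step": 2,
    "thought_max_new_tokens": 48,
    "suffix_max_new_tokens": 128,
    "learning_rate": 1e-7,
    "kl_coefficient": 0.02,
    "max_grad_norm": 0.1,
    # Two-stage external-boundary interface: sample thought, force the
    # thought/suffix boundary, sample suffix, judge only the suffix.
    "interface": "two_stage_external_boundary",
}
```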
| Arm | Completed Steps | Stopped Reason | Avg Reward | Final Reward | Avg Mixed Groups | Avg Near-Zero Groups | Avg All-Zero Groups | Avg All-One Groups | Invalid Judge | Avg Length Drift | Avg Artifact Rate | Wall Time s | Final Checkpoint |
| --- | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| think_base RLMT | 200 | null | 0.092188 | 0.031250 | 0.5700 | 0.4300 | 0.4300 | 0.0000 | 0.0000 | -0.000511 | 0.132031 | 5488.32 | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_base/final.pt` |
| think_phase3 RLMT | 200 | null | 0.090313 | 0.093750 | 0.5625 | 0.4375 | 0.4375 | 0.0000 | 0.0000 | -0.000545 | 0.139844 | 5448.25 | `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_think_phase3/final.pt` |
Interpretation status: training completed cleanly for the matched RLMT arms. Do not infer the substrate claim from training reward alone; use post-RLMT reward gate and downstream evals below.
## Overnight Eval Status
Eval ID: `phase4b-post-rlmt-eval-20260426-181047`
Watcher:
- `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.sh`
- `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-rlmt-scale-safe-20260426-180633-eval-watcher.out`
Reward gate:
- Log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-20260426-181047-reward-gate.out`
- Output dir: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate`
- Summary: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_rlmt_reward_gate/phase4b-post-rlmt-eval-20260426-181047-gate/summary.json`
- Arms: `think_base`, `think_phase3`, `think_base_rlmt`, `think_phase3_rlmt`
- Scope: 64 prefixes, 16 samples per prefix, two-stage external-boundary interface.
- Status at 2026-04-26 19:29 ET: complete. Split reasoning eval launched on both Spark hosts.
| Arm | Reward Mean | Reward Std | Invalid Rate | Mixed Groups | Near-Zero Groups | Any-Success Groups | All-Success Groups | Avg Total New Tokens | Avg Predicted Suffix Words | Closed Think Rate |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| think_base | 0.087891 | 0.283136 | 0.0000 | 0.546875 | 0.453125 | 0.546875 | 0.0000 | 175.8164 | 87.2734 | 1.0000 |
| think_phase3 | 0.090820 | 0.287353 | 0.0000 | 0.593750 | 0.406250 | 0.593750 | 0.0000 | 175.8818 | 86.7861 | 1.0000 |
| think_base_rlmt | 0.093750 | 0.291481 | 0.0000 | 0.640625 | 0.359375 | 0.640625 | 0.0000 | 175.8936 | 86.3281 | 1.0000 |
| think_phase3_rlmt | 0.097656 | 0.296849 | 0.0000 | 0.609375 | 0.390625 | 0.609375 | 0.0000 | 175.9229 | 88.7979 | 1.0000 |
Reward-gate comparisons from `summary.json`:
| Comparison | Mean Reward Delta | Right Better Prefix Rate | Left Better Prefix Rate | Tie Prefix Rate | Shared Prefixes |
| --- | ---: | ---: | ---: | ---: | ---: |
| think_base_rlmt_minus_think_base | 0.005859 | 0.281250 | 0.281250 | 0.437500 | 64 |
| think_phase3_minus_think_base | 0.002930 | 0.296875 | 0.250000 | 0.453125 | 64 |
| think_phase3_rlmt_minus_think_base | 0.009766 | 0.296875 | 0.296875 | 0.406250 | 64 |
| think_phase3_minus_think_base_rlmt | -0.002930 | 0.281250 | 0.312500 | 0.406250 | 64 |
| think_phase3_rlmt_minus_think_base_rlmt | 0.003906 | 0.250000 | 0.265625 | 0.484375 | 64 |
| think_phase3_rlmt_minus_think_phase3 | 0.006836 | 0.312500 | 0.312500 | 0.375000 | 64 |
Reward-gate go/no-go fields:
- `reward_validity_ok`: true
- `variance_ok`: true
- `phase3_more_reward_separable`: true
- `phase3_higher_mean_reward`: true
Reasoning eval:
- Initial Hugging Face `model.generate` eval was intentionally stopped after throughput instrumentation showed only ~106-140 generated tok/s. That path was valid but too slow for the final overnight eval.
- Optimized SGLang eval ID: `phase4b-post-rlmt-eval-sglang-20260427-0014`
- Optimized result root: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014`
- f7e2 runner/log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.sh`, `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-f7e2-run.out`
- cfd0 runner/log: `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.sh`, `/home/jarrodbarnes/synthetic-pretrain/logs/phase4b-post-rlmt-eval-sglang-20260427-0014-cfd0-run.out`
- f7e2 arms: `think_base`, then `think_base_rlmt`
- cfd0 arms: `think_phase3`, then `think_phase3_rlmt`
- Serving image: `scitrera/dgx-spark-sglang:0.5.9-t5`
- Serving flags: `--served-model-name default --tp 1 --cuda-graph-max-bs 32 --num-continuous-decode-steps 16 --schedule-policy lpm --mem-fraction-static 0.70`
- Eval client: `scripts/phase4b_reasoning_eval_sglang.py`
- Eval config: 8 samples/problem, max 512 new tokens, concurrency 32, temperature 0.6, top_p 0.95.
- Completion status: complete on both hosts.
- f7e2 completed at `2026-04-27T01:57:57Z`.
- cfd0 completed at `2026-04-27T01:57:29Z`.
- Throughput from final cfd0 progress logs: roughly 2.2k generated tok/s cumulative on `think_phase3_rlmt` OlympiadBench. f7e2 logs showed the same order of magnitude during the run.
- Note: each per-host `summary.json` contains only the last arm run on that host. The full four-arm table below was reconstructed directly from the per-arm JSONL artifacts under the optimized result root.
Reasoning-eval source artifacts:
- `think_base`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base/*.jsonl`
- `think_base_rlmt`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/f7e2/think_base_rlmt/*.jsonl`
- `think_phase3`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3/*.jsonl`
- `think_phase3_rlmt`: `/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_reasoning_eval_sglang_fast/phase4b-post-rlmt-eval-sglang-20260427-0014/cfd0/think_phase3_rlmt/*.jsonl`
| Arm | Benchmark | Mean@8 | Correct / Samples | Pass@8 Any | Pass / Problems | Avg Completion Tokens |
| --- | --- | ---: | ---: | ---: | ---: | ---: |
| think_base | GSM8K | 0.105478 | 1113 / 10552 | 0.508719 | 671 / 1319 | 322.2 |
| think_base | MATH-500 | 0.273250 | 1093 / 4000 | 0.574000 | 287 / 500 | 371.5 |
| think_base | GPQA-Diamond | 0.222222 | 352 / 1584 | 0.671717 | 133 / 198 | 320.3 |
| think_base | OlympiadBench | 0.071023 | 125 / 1760 | 0.222727 | 49 / 220 | 441.9 |
| think_base_rlmt | GSM8K | 0.094484 | 997 / 10552 | 0.476876 | 629 / 1319 | 324.9 |
| think_base_rlmt | MATH-500 | 0.287750 | 1151 / 4000 | 0.562000 | 281 / 500 | 376.2 |
| think_base_rlmt | GPQA-Diamond | 0.224116 | 355 / 1584 | 0.737374 | 146 / 198 | 323.3 |
| think_base_rlmt | OlympiadBench | 0.082386 | 145 / 1760 | 0.222727 | 49 / 220 | 447.4 |
| think_phase3 | GSM8K | 0.103393 | 1091 / 10552 | 0.512509 | 676 / 1319 | 321.8 |
| think_phase3 | MATH-500 | 0.282250 | 1129 / 4000 | 0.568000 | 284 / 500 | 362.1 |
| think_phase3 | GPQA-Diamond | 0.234848 | 372 / 1584 | 0.712121 | 141 / 198 | 323.8 |
| think_phase3 | OlympiadBench | 0.078409 | 138 / 1760 | 0.254545 | 56 / 220 | 441.0 |
| think_phase3_rlmt | GSM8K | 0.103014 | 1087 / 10552 | 0.506444 | 668 / 1319 | 322.0 |
| think_phase3_rlmt | MATH-500 | 0.273750 | 1095 / 4000 | 0.574000 | 287 / 500 | 366.2 |
| think_phase3_rlmt | GPQA-Diamond | 0.222854 | 353 / 1584 | 0.737374 | 146 / 198 | 316.6 |
| think_phase3_rlmt | OlympiadBench | 0.065341 | 115 / 1760 | 0.231818 | 51 / 220 | 435.9 |
Macro averages:
| Arm | Macro Mean@8 | Macro Pass@8 Any | Macro Avg Completion Tokens |
| --- | ---: | ---: | ---: |
| think_base | 0.167993 | 0.494291 | 364.0 |
| think_base_rlmt | 0.172184 | 0.499744 | 368.0 |
| think_phase3 | 0.174725 | 0.511794 | 362.2 |
| think_phase3_rlmt | 0.166240 | 0.512409 | 360.2 |
Direct reasoning-eval comparisons:
| Comparison | Benchmark | Mean@8 Delta | Pass@8 Any Delta | Avg Token Delta |
| --- | --- | ---: | ---: | ---: |
| Base+Think+RLMT vs Base+Think | GSM8K | -0.010993 | -0.031842 | +2.7 |
| Base+Think+RLMT vs Base+Think | MATH-500 | +0.014500 | -0.012000 | +4.7 |
| Base+Think+RLMT vs Base+Think | GPQA-Diamond | +0.001894 | +0.065657 | +3.1 |
| Base+Think+RLMT vs Base+Think | OlympiadBench | +0.011364 | +0.000000 | +5.5 |
| Base+Think+RLMT vs Base+Think | Macro | +0.004191 | +0.005454 | +4.0 |
| Phase3+Think+RLMT vs Phase3+Think | GSM8K | -0.000379 | -0.006065 | +0.2 |
| Phase3+Think+RLMT vs Phase3+Think | MATH-500 | -0.008500 | +0.006000 | +4.1 |
| Phase3+Think+RLMT vs Phase3+Think | GPQA-Diamond | -0.011995 | +0.025253 | -7.2 |
| Phase3+Think+RLMT vs Phase3+Think | OlympiadBench | -0.013068 | -0.022727 | -5.1 |
| Phase3+Think+RLMT vs Phase3+Think | Macro | -0.008486 | +0.000615 | -2.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GSM8K | +0.008529 | +0.029568 | -2.9 |
| Phase3+Think+RLMT vs Base+Think+RLMT | MATH-500 | -0.014000 | +0.012000 | -10.0 |
| Phase3+Think+RLMT vs Base+Think+RLMT | GPQA-Diamond | -0.001263 | +0.000000 | -6.7 |
| Phase3+Think+RLMT vs Base+Think+RLMT | OlympiadBench | -0.017045 | +0.009091 | -11.5 |
| Phase3+Think+RLMT vs Base+Think+RLMT | Macro | -0.005945 | +0.012665 | -7.8 |
| Phase3+Think vs Base+Think | GSM8K | -0.002085 | +0.003791 | -0.4 |
| Phase3+Think vs Base+Think | MATH-500 | +0.009000 | -0.006000 | -9.4 |
| Phase3+Think vs Base+Think | GPQA-Diamond | +0.012626 | +0.040404 | +3.6 |
| Phase3+Think vs Base+Think | OlympiadBench | +0.007386 | +0.031818 | -0.9 |
| Phase3+Think vs Base+Think | Macro | +0.006732 | +0.017503 | -1.8 |
## Claim Candidates
Evidence-backed candidates for the writing session:
- Phase 3 has a clean positive held-out continuation-quality result: Phase3-final wins 81/128 pairwise comparisons against Qwen3-0.6B-Base = 63.28%.
- The RLMT loop completed cleanly for matched base and Phase 3 lineages at 200 steps with no invalid judge responses and no all-one reward collapse.
- The two-stage post-RLMT reward gate improved mean reward for both RLMT arms and ranked `think_phase3_rlmt` highest by reward mean: 0.097656 vs 0.093750 for `think_base_rlmt`, 0.090820 for `think_phase3`, and 0.087891 for `think_base`.
- Downstream reasoning eval gives a mixed but useful substrate signal: `think_phase3` beats `think_base` on macro Mean@8 and macro Pass@8 Any; after RLMT, `think_phase3_rlmt` beats `think_base_rlmt` on macro Pass@8 Any but not macro Mean@8.
- The SGLang eval path is the throughput-valid final path; the slower Hugging Face generate path was stopped intentionally after direct throughput instrumentation.
Claim boundaries:
- Do not claim a uniform downstream reasoning win from RLMT alone. RLMT improves some benchmark/metric slices and the paper-aligned reward gate, but the reasoning suite is metric- and benchmark-dependent.
- Do not infer full Table 10 equivalence. This is a small-scale reproduction/proxy using 8 samples/problem and four HF-hosted reasoning benchmarks, not the paper's full scale.
- Treat sample-level Mean@8 and problem-level Pass@8 Any separately. They answer different questions: average sample correctness versus whether any rollout solves the problem; the sketch below makes the distinction concrete.
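A minimal sketch of the two metrics, computed from per-sample correctness flags. The helper is hypothetical and is not the repo's eval script.
```python
def mean_at_k_and_pass_at_k(per_problem_correct: list[list[bool]]) -> tuple[float, float]:
    # per_problem_correct[i] holds the correctness flags of the k samples for problem i.
    total_samples = sum(len(samples) for samples in per_problem_correct)
    # Mean@k: average correctness over all samples (sample-level).
    mean_at_k = sum(sum(samples) for samples in per_problem_correct) / total_samples
    # Pass@k Any: fraction of problems where at least one sample is correct (problem-level).
    pass_at_k_any = sum(any(samples) for samples in per_problem_correct) / len(per_problem_correct)
    return mean_at_k, pass_at_k_any

# Example: 2 problems, k=8 samples each; one lucky hit on the first problem.
flags = [[True] + [False] * 7, [False] * 8]
print(mean_at_k_and_pass_at_k(flags))  # (0.0625, 0.5)
```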

134
THOUGHT_USE_PROBE.md Normal file
View File

@@ -0,0 +1,134 @@
# Phase 4B Causal Thought-Use Probe Results
Date: 2026-04-27
## Direct Conclusion
This is the core mechanistic appendix experiment to use for the final report. The probe shows that thought text is behaviorally causal for suffix reward, but the model's own generated thoughts are not the best available scaffold in this small run.
The cleanest signal is from thought swapping: replacing a prefix's own thought with another prefix's model-generated thought sharply reduces reward across all arms. That means the thought channel carries prefix-specific control information. However, blank or generic thoughts often match or outperform normal generated thoughts, so this does not support a strong claim that RLMT made the model's own thoughts optimally useful.
Important scale context: this is a deliberately small 0.6B-model proxy with only 200 RLMT update steps. The right interpretation is not that RLMT failed in general. It is that, at this scale and training budget, the model learned an emerging thought-conditioned continuation interface but not a mature policy for generating reliably high-utility thoughts.
## Source Artifacts
- Probe script: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_thought_use_probe.py`
- Remote output dir: `spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427`
- Remote summary: `spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427/summary.json`
- Remote log: `spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427/probe.log`
- Smoke output: `spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/smoke-thought-use-20260427/summary.json`
- Judge server used for final run: `spark-cfd0:http://100.113.207.120:30000`, served model `qwen-judge`
## Method
Paper-aligned reward object:
`prefix -> intervened thought -> predicted suffix -> judge(predicted suffix, true suffix)`
The judge only sees and scores the predicted suffix against the true suffix. It does not directly score thought quality.
Arms:
| Arm | Checkpoint |
| --- | --- |
| `think_base` | Base+Think SFT |
| `think_phase3` | Phase3+Think SFT |
| `think_base_rlmt` | Base+Think+RLMT |
| `think_phase3_rlmt` | Phase3+Think+RLMT |
Conditions:
| Condition | Intervention |
| --- | --- |
| `normal_model_thought` | model samples a thought, then suffix is sampled after external `</think>` |
| `blank_thought` | suffix is sampled after an empty thought and external `</think>` |
| `generic_thought` | suffix is sampled after a short neutral scaffold thought |
| `same_arm_swapped_thought` | suffix is sampled after another prefix's model-generated thought from the same arm |
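A minimal sketch of how these conditions can be assembled into suffix-generation prompts from model-sampled thoughts. The helper name and the generic scaffold string are illustrative; the actual probe logic lives in `scripts/phase4b_thought_use_probe.py`.
```python
def build_condition_prompts(prefix: str, own_thought: str,
                            other_prefix_thought: str) -> dict[str, str]:
    # Every condition ends with an externally inserted </think>, so only the
    # thought content varies; the judge still scores the predicted suffix only.
    generic_scaffold = "Let me think about what naturally comes next in this passage."
    thoughts = {
        "normal_model_thought": own_thought,
        "blank_thought": "",
        "generic_thought": generic_scaffold,
        "same_arm_swapped_thought": other_prefix_thought,
    }
    return {name: f"{prefix} <think>{t}</think> " for name, t in thoughts.items()}
```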
Run scale:
- 32 held-out prefixes.
- 4 samples per prefix per condition.
- 128 judged suffixes per arm-condition.
- 2,048 total judged suffixes.
- Invalid judge rate: 0.0 for every arm-condition.
## Main Metrics
| Arm | Condition | Reward mean | Any-success prefix rate | Avg thought words | Avg suffix words |
| --- | --- | ---: | ---: | ---: | ---: |
| `think_base` | `normal_model_thought` | 0.1172 | 0.3438 | 34.7 | 89.8 |
| `think_base` | `blank_thought` | 0.1406 | 0.4062 | 0.0 | 88.2 |
| `think_base` | `generic_thought` | 0.2031 | 0.5938 | 11.5 | 92.0 |
| `think_base` | `same_arm_swapped_thought` | 0.0156 | 0.0625 | 34.7 | 87.1 |
| `think_phase3` | `normal_model_thought` | 0.1250 | 0.4062 | 35.7 | 87.0 |
| `think_phase3` | `blank_thought` | 0.1172 | 0.4062 | 0.0 | 84.7 |
| `think_phase3` | `generic_thought` | 0.1953 | 0.5000 | 11.5 | 93.4 |
| `think_phase3` | `same_arm_swapped_thought` | 0.0234 | 0.0938 | 35.7 | 91.9 |
| `think_base_rlmt` | `normal_model_thought` | 0.0859 | 0.2812 | 33.9 | 87.1 |
| `think_base_rlmt` | `blank_thought` | 0.1094 | 0.3750 | 0.0 | 89.4 |
| `think_base_rlmt` | `generic_thought` | 0.1797 | 0.4062 | 11.5 | 90.5 |
| `think_base_rlmt` | `same_arm_swapped_thought` | 0.0156 | 0.0625 | 33.9 | 85.3 |
| `think_phase3_rlmt` | `normal_model_thought` | 0.1094 | 0.3125 | 34.1 | 89.5 |
| `think_phase3_rlmt` | `blank_thought` | 0.1562 | 0.4375 | 0.0 | 93.5 |
| `think_phase3_rlmt` | `generic_thought` | 0.1797 | 0.4375 | 11.5 | 92.3 |
| `think_phase3_rlmt` | `same_arm_swapped_thought` | 0.0156 | 0.0625 | 34.1 | 83.8 |
## Causal Sensitivity
Sensitivity is prefix-level `normal_model_thought - intervened_condition`. Positive means the model's own thought helped relative to the intervention.
| Arm | Normal - blank | Normal - generic | Normal - swapped |
| --- | ---: | ---: | ---: |
| `think_base` | -0.0234 | -0.0859 | +0.1016 |
| `think_phase3` | +0.0078 | -0.0703 | +0.1016 |
| `think_base_rlmt` | -0.0234 | -0.0938 | +0.0703 |
| `think_phase3_rlmt` | -0.0469 | -0.0703 | +0.0938 |
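A minimal sketch of how these per-prefix sensitivities and the win rates in the next table can be computed from paired per-prefix reward means (the array names are illustrative):
```python
def sensitivity_and_win_rates(normal: list[float], intervened: list[float]):
    # normal[i] and intervened[i] are the per-prefix reward means for prefix i.
    deltas = [n - i for n, i in zip(normal, intervened)]
    sensitivity = sum(deltas) / len(deltas)            # mean(normal - intervened)
    normal_better = sum(d > 0 for d in deltas) / len(deltas)
    intervened_better = sum(d < 0 for d in deltas) / len(deltas)
    tie = sum(d == 0 for d in deltas) / len(deltas)
    return sensitivity, normal_better, intervened_better, tie
```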
Prefix-level win rates for `normal_model_thought` over swapped:
| Arm | Normal better | Swapped better | Tie |
| --- | ---: | ---: | ---: |
| `think_base` | 0.3438 | 0.0625 | 0.5938 |
| `think_phase3` | 0.4062 | 0.0312 | 0.5625 |
| `think_base_rlmt` | 0.2500 | 0.0312 | 0.7188 |
| `think_phase3_rlmt` | 0.2500 | 0.0000 | 0.7500 |
## Interpretation
The strongest result is that swapped thoughts are consistently harmful. Across all four arms, replacing the local thought with another prefix's thought drops reward mean to 0.0156-0.0234 and any-success prefix rate to 0.0625-0.0938. This is direct behavioral evidence that the thought channel is not merely decorative: mismatched thought content can causally steer suffix generation away from rewarded continuations.
The second result is more sobering. Blank and generic thoughts often perform as well as or better than the model's own sampled thoughts. Generic thoughts are strongest in this run, with reward means 0.1797-0.2031. This suggests that a clean generic scaffold may stabilize suffix generation better than the small model's sampled thoughts, which are noisy and sometimes harmful.
Phase3 has a small pre-RLMT advantage under normal model thoughts: `think_phase3` reward mean 0.1250 vs `think_base` 0.1172, and any-success prefix rate 0.4062 vs 0.3438. After RLMT, `think_phase3_rlmt` also beats `think_base_rlmt` under normal model thoughts: 0.1094 vs 0.0859 reward mean. However, RLMT did not increase causal sensitivity to the model's own thoughts. The normal-vs-swapped sensitivity is lower after RLMT in the base lineage and only slightly lower in the Phase3 lineage.
This should be framed against the experimental scale. The paper's thinking-mid-training results use larger models, far more mid-training, and downstream RL post-training. Our run uses Qwen3-0.6B and 200 RLMT steps, so generic scaffolds beating sampled thoughts is better read as evidence of an immature thought policy than as a broad negative result. The positive applied-interp signal is that semantic mismatch in the thought channel reliably damages behavior even at this small scale.
## Safe Report Claim
Use this as the applied-mech-interp appendix claim:
> In a deliberately small 0.6B setting with only 200 RLMT updates, a causal thought-use probe shows that interleaved thought text is already a real behavioral control surface: swapping in an unrelated thought sharply reduces suffix reward across all Phase 4B arms. However, the model's own sampled thoughts are not yet reliably better than blank or generic scaffolds, and RLMT does not clearly increase dependence on model-generated thoughts at this scale. This supports a cautious interpretation: Phase 4B learned an emerging thought-conditioned continuation interface, but not a fully optimized internal reasoning policy.
Short version for the final report:
> Thought use is present but immature. Even a 0.6B model after short RLMT is sensitive to thought content, but it has not yet learned to reliably generate the best thoughts itself.
## Claims Not Supported
- Do not claim RLMT made thoughts uniformly more useful.
- Do not claim the model's sampled thoughts are better than generic scaffolding.
- Do not claim a discovered circuit or representation.
- Do not claim causal use of a specific internal activation; this is a causal behavioral intervention on text.
- Do not frame this as a general null result for thinking RLMT; the model scale and 200-step RLMT budget are too small for that claim.
## Validation
- Local syntax: `python3 -m py_compile scripts/phase4b_thought_use_probe.py`
- Local lint: `ruff check scripts/phase4b_thought_use_probe.py`
- Spark smoke: `think_base`, 4 prefixes, 2 samples per prefix, 3 conditions; the initial run caught a judge outage, and the rerun passed with 0 invalid responses.
- Full Spark run: 4 arms, 32 prefixes, 4 samples per prefix, 4 conditions; completed with 0 invalid judge responses.
## Report Placement
This should replace the linear reward-decodability probe in the final research report. The decodability result can be omitted: it was weaker, less causal, and less directly useful to research engineers.

28
added_tokens.json Normal file
View File

@@ -0,0 +1,28 @@
{
"</think>": 151668,
"</tool_call>": 151658,
"</tool_response>": 151666,
"<think>": 151667,
"<tool_call>": 151657,
"<tool_response>": 151665,
"<|box_end|>": 151649,
"<|box_start|>": 151648,
"<|endoftext|>": 151643,
"<|file_sep|>": 151664,
"<|fim_middle|>": 151660,
"<|fim_pad|>": 151662,
"<|fim_prefix|>": 151659,
"<|fim_suffix|>": 151661,
"<|im_end|>": 151645,
"<|im_start|>": 151644,
"<|image_pad|>": 151655,
"<|object_ref_end|>": 151647,
"<|object_ref_start|>": 151646,
"<|quad_end|>": 151651,
"<|quad_start|>": 151650,
"<|repo_name|>": 151663,
"<|video_pad|>": 151656,
"<|vision_end|>": 151653,
"<|vision_pad|>": 151654,
"<|vision_start|>": 151652
}

89
chat_template.jinja Normal file
View File

@@ -0,0 +1,89 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}

60
config.json Normal file
View File

@@ -0,0 +1,60 @@
{
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"dtype": "bfloat16",
"eos_token_id": 151643,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen3",
"num_attention_heads": 16,
"num_hidden_layers": 28,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": true,
"transformers_version": "4.57.6",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}

6
generation_config.json Normal file
View File

@@ -0,0 +1,6 @@
{
"bos_token_id": 151643,
"eos_token_id": 151643,
"max_new_tokens": 2048,
"transformers_version": "4.57.6"
}

151388
merges.txt Normal file

File diff suppressed because it is too large Load Diff

3
model.safetensors Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:00bcdbcd351d816ea669d23b073a0c33c5873f922bb92ae1f366c1c9304cd8f9
size 1192135096

View File

@@ -0,0 +1,6 @@
{
"arm": "think_phase3_rlmt",
"config": "configs/generated/phase4b_sft_think_phase3.yaml",
"checkpoint": "outputs/phase4b_rlmt_think_phase3/final.pt",
"dtype": "bfloat16"
}

31
special_tokens_map.json Normal file
View File

@@ -0,0 +1,31 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

3
tokenizer.json Normal file
View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
size 11422654

239
tokenizer_config.json Normal file
View File

@@ -0,0 +1,239 @@
{
"add_bos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151646": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151647": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151648": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151649": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151650": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151651": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151652": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151653": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151654": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151655": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151656": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151657": {
"content": "<tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151658": {
"content": "</tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151659": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151660": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151661": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151662": {
"content": "<|fim_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151663": {
"content": "<|repo_name|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151664": {
"content": "<|file_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151665": {
"content": "<tool_response>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151666": {
"content": "</tool_response>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151667": {
"content": "<think>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151668": {
"content": "</think>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"bos_token": null,
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"errors": "replace",
"extra_special_tokens": {},
"model_max_length": 131072,
"pad_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long