---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B-Base
datasets:
- Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data
language:
- en
pipeline_tag: text-generation
tags:
- qwen3
- text-generation
- interleaved-thinking
- synthetic-pretraining
- rlmt
- research
model-index:
- name: qwen3-0.6B-interleaved-thinking
  results: []
---

# qwen3-0.6B-interleaved-thinking

This is a small research model derived from `Qwen/Qwen3-0.6B-Base`. It was trained to test whether ordinary pretraining text can be turned into a sequence of training environments before agentic post-training begins. The training pipeline adapts the self-improving pretraining and thinking mid-training setup from Tan et al. to Qwen3-0.6B-Base using FineWeb-Edu chunks.

The goal was not to build a production assistant. The goal was to ask whether a small base model can learn a continuation preference, learn an interleaved thought interface, and make that interface rewardable through RL mid-training.

The model is released with the companion dataset: [`Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data`](https://huggingface.co/datasets/Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data)

## Summary

This checkpoint is the final `Phase3+Think+RLMT` arm from the blog post *Self-Improving Pretraining as a Substrate for Agentic Post-Training*. At a high level:

- continued pretraining selected for judged-better continuations rather than exact suffix imitation
- interleaved-thinking SFT taught the model where short local thoughts appear in ordinary text
- RLMT rewarded the thought-conditioned suffix prediction, not the thought itself
- a causal thought-use probe found that replacing the thought with an unrelated one sharply reduced suffix reward

The result should be read as an emerging thought-conditioned continuation interface at 0.6B scale, not as a mature reasoning or assistant policy.

## Training Lineage

1. `Qwen/Qwen3-0.6B-Base`
2. Self-improving continued pretraining with Online DPO-style sequence preference training
3. SFT on interleaved teacher-thought pretraining chunks
4. A 200-step RLMT run using the paper-aligned reward object: `prefix -> generated thought -> predicted suffix -> judge(predicted suffix, true suffix)`

The checkpoint packaged here is the `Phase3+Think+RLMT` arm: `outputs/phase4b_rlmt_think_phase3/final.pt`

## What This Model Is

This is an experimental 0.6B model for studying whether interleaved thoughts can become a behavioral control surface during mid-training. It is useful for:

- small-scale thinking mid-training experiments
- causal thought-use probes
- studying self-improving pretraining, interleaved SFT, and RLMT mechanics
- reproducing the associated research blog results

It is not an instruction-tuned assistant model and should not be evaluated as one.

## Main Findings

The blog reports the following stage-level results:

- Continued pretraining improved held-out judged continuation quality: the self-improved checkpoint beat Qwen3-0.6B-Base on 81 of 128 pairwise judgments.
- Interleaved-thinking SFT installed the thought interface: thought-token NLL dropped from 4.24 to about 3.14-3.16 after SFT.
- RLMT made the interface rewardable under the suffix-prediction objective: the self-improved RLMT arm reached the highest reward-gate mean among the four compared arms.
- The causal thought-use probe showed that thought text was not just formatting: swapped unrelated thoughts sharply reduced suffix reward.

These are small-scale lifecycle results. The downstream reasoning evaluation was mixed across benchmarks and metrics.

## Claim Boundaries

The key claim supported by this release is that a small base model can be shaped into an emerging thought-conditioned continuation interface before agentic post-training. The evidence does not support a claim that this model learned a mature agentic reasoning policy.

The thought-use result should be read carefully. Swapped thoughts reduced reward, so the thought channel mattered behaviorally. But blank and generic thoughts often matched or beat sampled thoughts, so the model had not learned to reliably generate the best thoughts for that channel. Scale also matters: this is a 0.6B-parameter model with a short 200-step RLMT run.

See:

- `RLMT_RESULTS.md`
- `THOUGHT_USE_PROBE.md`

## Loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Jarrodbarnes/qwen3-0.6B-interleaved-thinking"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto")
```

## Prompting Format

The training format uses interleaved thought spans:

```text
Prefix text ...
short local thought about what comes next
predicted continuation ...
```

For RLMT-style generation, the small-scale training interface sampled a thought first, externally inserted the closing thought delimiter, then sampled the predicted suffix.
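The sketch below makes that loop concrete. It is illustrative only: it reuses `tokenizer` and `model` from the loading snippet, and the `<think>`/`</think>` delimiters, the `sample` helper, and the `judge` callable are assumptions for this sketch, not the released training code.

```python
import torch

# Assumed thought delimiters; the tokens used in the actual training run may differ.
THOUGHT_OPEN, THOUGHT_CLOSE = "<think>", "</think>"

def sample(text, max_new_tokens):
    """Sample a continuation of `text` and return only the newly generated text."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def rlmt_reward(prefix, true_suffix, judge):
    # 1. Sample a short local thought conditioned on the prefix.
    thought = sample(prefix + THOUGHT_OPEN, max_new_tokens=48)
    # 2. Externally close the thought span, then sample the predicted suffix.
    predicted_suffix = sample(prefix + THOUGHT_OPEN + thought + THOUGHT_CLOSE,
                              max_new_tokens=128)
    # 3. A judge scores the predicted suffix against the true suffix.
    #    The thought itself is never scored directly.
    return judge(predicted_suffix, true_suffix)
```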
## Limitations

- Small model: 0.6B parameters.
- Short RLMT budget: 200 update steps.
- Research artifact, not a production model.
- Generated thoughts are not reliably better than generic scaffolds in the released experiment.
- Downstream reasoning did not improve uniformly across Mean@8 and Pass@8.
- Claims should be limited to the documented small-scale setup.
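## Thought-Use Probe Sketch

For completeness, the causal thought-use probe can be sketched in the same style: hold the prefix and true suffix fixed and vary only the thought. The helpers and delimiters continue the assumptions from the sketch above; the documented probe and its results live in `THOUGHT_USE_PROBE.md`.

```python
def reward_with_thought(prefix, thought, true_suffix, judge):
    # Condition the suffix prediction on a fixed, externally supplied thought.
    conditioned = prefix + THOUGHT_OPEN + thought + THOUGHT_CLOSE
    return judge(sample(conditioned, max_new_tokens=128), true_suffix)

def thought_use_probe(prefix, true_suffix, unrelated_thought, judge):
    # Compare suffix reward across thought conditions; a behaviorally live
    # thought channel should score "swapped" well below "sampled".
    sampled_thought = sample(prefix + THOUGHT_OPEN, max_new_tokens=48)
    return {
        "sampled": reward_with_thought(prefix, sampled_thought, true_suffix, judge),
        "swapped": reward_with_thought(prefix, unrelated_thought, true_suffix, judge),
        "blank": reward_with_thought(prefix, "", true_suffix, judge),
        # Illustrative generic scaffold; the released probe's wording may differ.
        "generic": reward_with_thought(prefix, "Let me think about what comes next.",
                                       true_suffix, judge),
    }
```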