ModelHub XC a9fc551b52 Initialized project; model provided by the ModelHub XC community
Model: Jarrodbarnes/qwen3-0.6B-interleaved-thinking
Source: Original Platform
2026-05-01 22:12:10 +08:00

---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B-Base
datasets:
- Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data
language:
- en
pipeline_tag: text-generation
tags:
- qwen3
- text-generation
- interleaved-thinking
- synthetic-pretraining
- rlmt
- research
model-index:
- name: qwen3-0.6B-interleaved-thinking
  results: []
---

qwen3-0.6B-interleaved-thinking

This is a small research model derived from Qwen/Qwen3-0.6B-Base. It was trained to test whether ordinary pretraining text can be turned into a sequence of training environments before agentic post-training begins.

The training pipeline adapts the self-improving pretraining and thinking mid-training setup from Tan et al. to Qwen3-0.6B-Base using FineWeb-Edu chunks. The goal was not to build a production assistant. The goal was to ask whether a small base model can learn a continuation preference, learn an interleaved thought interface, and make that interface rewardable through RL mid-training.

The model is released with the companion dataset:

Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data

Summary

This checkpoint is the final Phase3+Think+RLMT arm from the blog post Self-Improving Pretraining as a Substrate for Agentic Post-Training.

At a high level:

  • continued pretraining selected for judged-better continuations rather than exact suffix imitation
  • interleaved-thinking SFT taught the model where short local thoughts appear in ordinary text
  • RLMT rewarded the thought-conditioned suffix prediction, not the thought itself
  • a causal thought-use probe found that replacing the thought with an unrelated thought sharply reduced suffix reward

The results should be read as evidence of an emerging thought-conditioned continuation interface at 0.6B scale, not as a mature reasoning or assistant policy.
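The judged-better selection in the first bullet can be sketched as a preference-pair construction step. This is a minimal illustration of the assumed mechanics, not the released pipeline: `overlap_judge` is a toy stand-in for the LLM judge, and all strings are made-up examples.

```python
# Sketch: build an Online DPO-style preference pair by judging two sampled
# continuations against the true suffix, rather than requiring exact imitation.
def build_preference_pair(continuation_a, continuation_b, true_suffix, judge):
    """Return (chosen, rejected) according to the judge's scores."""
    score_a = judge(continuation_a, true_suffix)
    score_b = judge(continuation_b, true_suffix)
    if score_a >= score_b:
        return continuation_a, continuation_b
    return continuation_b, continuation_a

def overlap_judge(candidate, reference):
    # Toy stand-in for the LLM judge: fraction of reference tokens covered.
    ref = reference.split()
    cand = candidate.split()
    return sum(tok in cand for tok in ref) / max(len(ref), 1)

chosen, rejected = build_preference_pair(
    continuation_a="downstream depositing it in deltas",
    continuation_b="and fish swim in them",
    true_suffix="downstream and deposit it where the current slows",
    judge=overlap_judge,
)
```

The point of the design is visible even in the toy version: the winning continuation is the one the judge prefers relative to the true suffix, so the model is pushed toward judged-better text rather than verbatim suffix reproduction.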

Training Lineage

  1. Qwen/Qwen3-0.6B-Base
  2. Self-improving continued pretraining with Online DPO-style sequence preference training
  3. SFT on interleaved teacher-thought pretraining chunks
  4. 200-step RLMT run using the paper-aligned reward object:

prefix -> generated thought -> predicted suffix -> judge(predicted suffix, true suffix)
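The reward object above can be sketched in a few lines. Here `toy_judge` is a hypothetical token-overlap score standing in for the real LLM judge; the inputs are illustrative only.

```python
# Sketch of the paper-aligned RLMT reward object:
# prefix -> generated thought -> predicted suffix -> judge(predicted, true).
def toy_judge(predicted_suffix, true_suffix):
    """Hypothetical stand-in for the LLM judge: Jaccard token overlap."""
    pred, true = set(predicted_suffix.split()), set(true_suffix.split())
    if not pred and not true:
        return 1.0
    return len(pred & true) / len(pred | true)

def rlmt_reward(prefix, thought, predicted_suffix, true_suffix):
    # The reward scores only the predicted suffix against the true suffix;
    # the thought conditions generation but is never scored directly.
    return toy_judge(predicted_suffix, true_suffix)

reward = rlmt_reward(
    prefix="The water cycle begins when",
    thought="next: evaporation from oceans",
    predicted_suffix="sunlight evaporates water from the oceans",
    true_suffix="solar heat evaporates water from oceans and lakes",
)
```

Note that `thought` is an argument but never enters the score: this is the sense in which RLMT rewards the thought-conditioned suffix prediction, not the thought itself.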

The checkpoint packaged here is the Phase3+Think+RLMT arm:

outputs/phase4b_rlmt_think_phase3/final.pt

What This Model Is

This is an experimental 0.6B model for studying whether interleaved thoughts can become a behavioral control surface during mid-training.

It is useful for:

  • small-scale thinking mid-training experiments
  • causal thought-use probes
  • studying self-improving pretraining, interleaved SFT, and RLMT mechanics
  • reproducing the associated research blog results

It is not an instruction-tuned assistant model and should not be evaluated as one.

Main Findings

The blog reports the following stage-level results:

  • Continued pretraining improved held-out judged continuation quality: the self-improved checkpoint beat Qwen3-0.6B-Base on 81 of 128 pairwise judgments.
  • Interleaved-thinking SFT installed the thought interface: thought-token NLL dropped from 4.24 to about 3.14-3.16 after SFT.
  • RLMT made the interface rewardable under the suffix-prediction objective: the self-improved RLMT arm reached the highest reward-gate mean among the four compared arms.
  • The causal thought-use probe showed that thought text was not just formatting: swapped unrelated thoughts sharply reduced suffix reward.

These are small-scale lifecycle results. The downstream reasoning evaluation was mixed across benchmarks and metrics.

Claim Boundaries

The key claim supported by this release is that a small base model can be shaped into an emerging thought-conditioned continuation interface before agentic post-training. The evidence does not support a claim that this model learned a mature agentic reasoning policy.

The thought-use result should be read carefully. Swapped thoughts reduced reward, so the thought channel mattered behaviorally. Blank and generic thoughts often matched or beat sampled thoughts, so the model had not learned to reliably generate the best thoughts for that channel.
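The probe's analysis step reduces to comparing suffix rewards under matched versus swapped thoughts. The sketch below shows only that comparison; the reward values are invented for illustration and are not the released numbers.

```python
# Sketch of the causal thought-use probe statistic: mean suffix reward with
# the model's own (matched) thought minus the mean with an unrelated
# (swapped) thought. A large positive gap means the thought channel is
# doing causal work rather than acting as formatting.
from statistics import mean

def thought_use_effect(matched_rewards, swapped_rewards):
    return mean(matched_rewards) - mean(swapped_rewards)

# Illustrative per-example rewards only.
matched = [0.72, 0.65, 0.80, 0.58]
swapped = [0.41, 0.30, 0.52, 0.33]
effect = thought_use_effect(matched, swapped)
```

The same comparison against blank or generic thoughts is what bounds the claim: a positive matched-vs-swapped gap shows the channel matters, while a near-zero matched-vs-generic gap shows the sampled thoughts are not yet reliably better than scaffolds.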

The scale matters: this is a 0.6B parameter model with a short 200-step RLMT run.

See:

  • RLMT_RESULTS.md
  • THOUGHT_USE_PROBE.md

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Jarrodbarnes/qwen3-0.6B-interleaved-thinking"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto")

Prompting Format

The training format uses interleaved thought spans:

Prefix text ... <think>short local thought about what comes next</think> predicted continuation ...

For RLMT-style generation, the small-scale training interface sampled a thought first, externally inserted </think>, then sampled the predicted suffix.
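The two-stage interface can be sketched as control flow. `sample` below is a stub standing in for model decoding with a stop condition (in practice, `model.generate` stopping at `</think>`); the canned strings are illustrative.

```python
# Sketch of RLMT-style two-stage generation: sample a thought first,
# externally insert </think>, then sample the predicted suffix.
def sample(prompt, stop=None):
    # Stub decoder: returns canned text so the control flow is runnable.
    if stop == "</think>":
        return "next comes the boiling step"   # stage-1 thought
    return " bring the water to a boil."       # stage-2 suffix

def generate_with_thought(prefix):
    # Stage 1: sample a short local thought, stopping at </think>.
    thought = sample(prefix + " <think>", stop="</think>")
    # Stage 2: externally close the thought span, then sample the suffix.
    full_prompt = f"{prefix} <think>{thought}</think>"
    return full_prompt + sample(full_prompt)

text = generate_with_thought("First, fill the pot.")
```

Closing the span externally guarantees the suffix is always conditioned on a well-formed `<think>...</think>` block, even when the sampled thought would not have emitted the closing tag on its own.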

Limitations

  • Small model: 0.6B parameters.
  • Short RLMT budget: 200 update steps.
  • Research artifact, not a production model.
  • Generated thoughts are not reliably better than generic scaffolds in the released experiment.
  • Downstream reasoning did not improve uniformly across Mean@8 and Pass@8.
  • Claims should be limited to the documented small-scale setup.