ModelHub XC a9fc551b52 Initialized project; model provided by the ModelHub XC community
Model: Jarrodbarnes/qwen3-0.6B-interleaved-thinking
Source: Original Platform
2026-05-01 22:12:10 +08:00

---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B-Base
datasets:
- Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data
language:
- en
pipeline_tag: text-generation
tags:
- qwen3
- text-generation
- interleaved-thinking
- synthetic-pretraining
- rlmt
- research
model-index:
- name: qwen3-0.6B-interleaved-thinking
  results: []
---

qwen3-0.6B-interleaved-thinking

This is a small research model derived from Qwen/Qwen3-0.6B-Base. It was trained to test whether ordinary pretraining text can be turned into a sequence of training environments before agentic post-training begins.

The training pipeline adapts the self-improving pretraining and thinking mid-training setup from Tan et al. to Qwen3-0.6B-Base using FineWeb-Edu chunks. The goal was not to build a production assistant. The goal was to ask whether a small base model can learn a continuation preference, learn an interleaved thought interface, and make that interface rewardable through RL mid-training.

The model is released with the companion dataset:

Jarrodbarnes/qwen3-0.6B-interleaved-thinking-data

Summary

This checkpoint is the final Phase3+Think+RLMT arm from the blog post Self-Improving Pretraining as a Substrate for Agentic Post-Training.

At a high level:

  • continued pretraining selected for judged-better continuations rather than exact suffix imitation
  • interleaved-thinking SFT taught the model where short local thoughts appear in ordinary text
  • RLMT rewarded the thought-conditioned suffix prediction, not the thought itself
  • a causal thought-use probe found that replacing the thought with an unrelated thought sharply reduced suffix reward

The results should be read as evidence of an emerging thought-conditioned continuation interface at 0.6B scale, not as a mature reasoning or assistant policy.
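The judged-better selection in the first bullet can be sketched as a preference-pair construction step. This is a minimal illustration of the assumed mechanics, not the released pipeline: `overlap_judge` is a toy stand-in for the LLM judge, and all strings are made-up examples.

```python
# Sketch: build an Online DPO-style preference pair by judging two sampled
# continuations against the true suffix, rather than requiring exact imitation.
def build_preference_pair(continuation_a, continuation_b, true_suffix, judge):
    """Return (chosen, rejected) according to the judge's scores."""
    score_a = judge(continuation_a, true_suffix)
    score_b = judge(continuation_b, true_suffix)
    if score_a >= score_b:
        return continuation_a, continuation_b
    return continuation_b, continuation_a

def overlap_judge(candidate, reference):
    # Toy stand-in for the LLM judge: fraction of reference tokens covered.
    ref = reference.split()
    cand = candidate.split()
    return sum(tok in cand for tok in ref) / max(len(ref), 1)

chosen, rejected = build_preference_pair(
    continuation_a="downstream depositing it in deltas",
    continuation_b="and fish swim in them",
    true_suffix="downstream and deposit it where the current slows",
    judge=overlap_judge,
)
```

The point of the design is visible even in the toy version: the winning continuation is the one the judge prefers relative to the true suffix, so the model is pushed toward judged-better text rather than verbatim suffix reproduction.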

Training Lineage

  1. Qwen/Qwen3-0.6B-Base
  2. Self-improving continued pretraining with Online DPO-style sequence preference training
  3. SFT on interleaved teacher-thought pretraining chunks
  4. 200-step RLMT run using the paper-aligned reward object:

prefix -> generated thought -> predicted suffix -> judge(predicted suffix, true suffix)
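The reward object above can be sketched in a few lines. Here `toy_judge` is a hypothetical token-overlap score standing in for the real LLM judge; the inputs are illustrative only.

```python
# Sketch of the paper-aligned RLMT reward object:
# prefix -> generated thought -> predicted suffix -> judge(predicted, true).
def toy_judge(predicted_suffix, true_suffix):
    """Hypothetical stand-in for the LLM judge: Jaccard token overlap."""
    pred, true = set(predicted_suffix.split()), set(true_suffix.split())
    if not pred and not true:
        return 1.0
    return len(pred & true) / len(pred | true)

def rlmt_reward(prefix, thought, predicted_suffix, true_suffix):
    # The reward scores only the predicted suffix against the true suffix;
    # the thought conditions generation but is never scored directly.
    return toy_judge(predicted_suffix, true_suffix)

reward = rlmt_reward(
    prefix="The water cycle begins when",
    thought="next: evaporation from oceans",
    predicted_suffix="sunlight evaporates water from the oceans",
    true_suffix="solar heat evaporates water from oceans and lakes",
)
```

Note that `thought` is an argument but never enters the score: this is the sense in which RLMT rewards the thought-conditioned suffix prediction, not the thought itself.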

The checkpoint packaged here is the Phase3+Think+RLMT arm:

outputs/phase4b_rlmt_think_phase3/final.pt

What This Model Is

This is an experimental 0.6B model for studying whether interleaved thoughts can become a behavioral control surface during mid-training.

It is useful for:

  • small-scale thinking mid-training experiments
  • causal thought-use probes
  • studying self-improving pretraining, interleaved SFT, and RLMT mechanics
  • reproducing the associated research blog results

It is not an instruction-tuned assistant model and should not be evaluated as one.

Main Findings

The blog reports the following stage-level results:

  • Continued pretraining improved held-out judged continuation quality: the self-improved checkpoint beat Qwen3-0.6B-Base on 81 of 128 pairwise judgments.
  • Interleaved-thinking SFT installed the thought interface: thought-token NLL dropped from 4.24 to about 3.14-3.16 after SFT.
  • RLMT made the interface rewardable under the suffix-prediction objective: the self-improved RLMT arm reached the highest reward-gate mean among the four compared arms.
  • The causal thought-use probe showed that thought text was not just formatting: swapped unrelated thoughts sharply reduced suffix reward.

These are small-scale lifecycle results. The downstream reasoning evaluation was mixed across benchmarks and metrics.

Claim Boundaries

The key claim supported by this release is that a small base model can be shaped into an emerging thought-conditioned continuation interface before agentic post-training. The evidence does not support a claim that this model learned a mature agentic reasoning policy.

The thought-use result should be read carefully. Swapped thoughts reduced reward, so the thought channel mattered behaviorally. Blank and generic thoughts often matched or beat sampled thoughts, so the model had not learned to reliably generate the best thoughts for that channel.
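The probe's analysis step reduces to comparing suffix rewards under matched versus swapped thoughts. The sketch below shows only that comparison; the reward values are invented for illustration and are not the released numbers.

```python
# Sketch of the causal thought-use probe statistic: mean suffix reward with
# the model's own (matched) thought minus the mean with an unrelated
# (swapped) thought. A large positive gap means the thought channel is
# doing causal work rather than acting as formatting.
from statistics import mean

def thought_use_effect(matched_rewards, swapped_rewards):
    return mean(matched_rewards) - mean(swapped_rewards)

# Illustrative per-example rewards only.
matched = [0.72, 0.65, 0.80, 0.58]
swapped = [0.41, 0.30, 0.52, 0.33]
effect = thought_use_effect(matched, swapped)
```

The same comparison against blank or generic thoughts is what bounds the claim: a positive matched-vs-swapped gap shows the channel matters, while a near-zero matched-vs-generic gap shows the sampled thoughts are not yet reliably better than scaffolds.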

The scale matters: this is a 0.6B parameter model with a short 200-step RLMT run.

See:

  • RLMT_RESULTS.md
  • THOUGHT_USE_PROBE.md

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Jarrodbarnes/qwen3-0.6B-interleaved-thinking"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto")

Prompting Format

The training format uses interleaved thought spans:

Prefix text ... <think>short local thought about what comes next</think> predicted continuation ...

For RLMT-style generation, the small-scale training interface sampled a thought first, externally inserted </think>, then sampled the predicted suffix.
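The two-stage interface can be sketched as control flow. `sample` below is a stub standing in for model decoding with a stop condition (in practice, `model.generate` stopping at `</think>`); the canned strings are illustrative.

```python
# Sketch of RLMT-style two-stage generation: sample a thought first,
# externally insert </think>, then sample the predicted suffix.
def sample(prompt, stop=None):
    # Stub decoder: returns canned text so the control flow is runnable.
    if stop == "</think>":
        return "next comes the boiling step"   # stage-1 thought
    return " bring the water to a boil."       # stage-2 suffix

def generate_with_thought(prefix):
    # Stage 1: sample a short local thought, stopping at </think>.
    thought = sample(prefix + " <think>", stop="</think>")
    # Stage 2: externally close the thought span, then sample the suffix.
    full_prompt = f"{prefix} <think>{thought}</think>"
    return full_prompt + sample(full_prompt)

text = generate_with_thought("First, fill the pot.")
```

Closing the span externally guarantees the suffix is always conditioned on a well-formed `<think>...</think>` block, even when the sampled thought would not have emitted the closing tag on its own.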

Limitations

  • Small model: 0.6B parameters.
  • Short RLMT budget: 200 update steps.
  • Research artifact, not a production model.
  • Generated thoughts are not reliably better than generic scaffolds in the released experiment.
  • Downstream reasoning did not improve uniformly across Mean@8 and Pass@8.
  • Claims should be limited to the documented small-scale setup.