Model: Jarrodbarnes/qwen3-0.6B-interleaved-thinking
Phase 4B Causal Thought-Use Probe Results

Date: 2026-04-27

Direct Conclusion

This is the core mechanistic appendix experiment to use for the final report. The probe shows that thought text is behaviorally causal for suffix reward, but the model's own generated thoughts are not the best available scaffold in this small run.

The cleanest signal is from thought swapping: replacing a prefix's own thought with another prefix's model-generated thought sharply reduces reward across all arms. That means the thought channel carries prefix-specific control information. However, blank or generic thoughts often match or outperform normal generated thoughts, so this does not support a strong claim that RLMT made the model's own thoughts optimally useful.

Important scale context: this is a deliberately small 0.6B-model proxy with only 200 RLMT update steps. The right interpretation is not that RLMT failed in general. It is that, at this scale and training budget, the model learned an emerging thought-conditioned continuation interface but not a mature policy for generating reliably high-utility thoughts.

Source Artifacts

  • Probe script: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_thought_use_probe.py
  • Remote output dir: spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427
  • Remote summary: spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427/summary.json
  • Remote log: spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427/probe.log
  • Smoke output: spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/smoke-thought-use-20260427/summary.json
  • Judge server used for final run: spark-cfd0:http://100.113.207.120:30000, served model qwen-judge

Method

Paper-aligned reward object:

prefix -> intervened thought -> predicted suffix -> judge(predicted suffix, true suffix)

The judge only sees and scores the predicted suffix against the true suffix. It does not directly score thought quality.
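The reward object above can be sketched as a small function. This is an illustrative stub, not the real implementation in `scripts/phase4b_thought_use_probe.py`; `sample_suffix` and `judge_suffix` are hypothetical stand-ins for the model-sampling and judge-server calls.

```python
# Hypothetical sketch of the probe's reward object:
#   prefix -> intervened thought -> predicted suffix -> judge(predicted, true)
# sample_suffix and judge_suffix are stand-in callables, not the real script's API.

def probe_reward(prefix, thought, true_suffix, sample_suffix, judge_suffix):
    """Score one intervened rollout for a single prefix."""
    # The intervened thought is closed with an external </think>, so the
    # model never decides when to stop thinking; it only continues after it.
    prompt = f"{prefix}<think>{thought}</think>"
    predicted_suffix = sample_suffix(prompt)
    # The judge sees only the predicted suffix vs. the true suffix.
    # It never scores the thought text itself.
    return judge_suffix(predicted_suffix, true_suffix)
```

The key design point, mirrored in the comment, is that thought quality is only ever measured indirectly through its effect on the judged suffix.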

Arms:

| Arm | Checkpoint |
| --- | --- |
| think_base | Base+Think SFT |
| think_phase3 | Phase3+Think SFT |
| think_base_rlmt | Base+Think+RLMT |
| think_phase3_rlmt | Phase3+Think+RLMT |

Conditions:

| Condition | Intervention |
| --- | --- |
| normal_model_thought | model samples a thought, then the suffix is sampled after an external `</think>` |
| blank_thought | suffix is sampled after an empty thought and an external `</think>` |
| generic_thought | suffix is sampled after a short neutral scaffold thought |
| same_arm_swapped_thought | suffix is sampled after another prefix's model-generated thought from the same arm |
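The four conditions can be summarized as a small mapping per prefix. The generic scaffold string and the swap pairing below are assumptions for illustration; the real strings and pairing logic live in the probe script.

```python
# Illustrative construction of the four intervention conditions for one prefix.
# GENERIC_SCAFFOLD is a hypothetical placeholder, not the script's actual text.

GENERIC_SCAFFOLD = "Let me think about what should come next."

def build_thoughts(own_thought, donor_thought):
    """Return the thought text used in each condition for a single prefix.

    own_thought:   the model's own sampled thought for this prefix.
    donor_thought: another prefix's model-generated thought from the same arm.
    """
    return {
        "normal_model_thought": own_thought,
        "blank_thought": "",                       # empty thought, external </think>
        "generic_thought": GENERIC_SCAFFOLD,       # short neutral scaffold
        "same_arm_swapped_thought": donor_thought, # mismatched but same-arm thought
    }
```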

Run scale:

  • 32 held-out prefixes.
  • 4 samples per prefix per condition.
  • 128 judged suffixes per arm-condition.
  • 2,048 total judged suffixes.
  • Invalid judge rate: 0.0 for every arm-condition.
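The run-scale numbers above compose multiplicatively; a quick arithmetic check:

```python
# Sanity check of the run-scale arithmetic: 4 arms x 4 conditions,
# 32 held-out prefixes, 4 samples per prefix per condition.
arms, conditions, prefixes, samples = 4, 4, 32, 4
per_arm_condition = prefixes * samples          # judged suffixes per arm-condition
total = arms * conditions * per_arm_condition   # total judged suffixes
assert per_arm_condition == 128
assert total == 2048
```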

Main Metrics

| Arm | Condition | Reward mean | Any-success prefix rate | Avg thought words | Avg suffix words |
| --- | --- | --- | --- | --- | --- |
| think_base | normal_model_thought | 0.1172 | 0.3438 | 34.7 | 89.8 |
| think_base | blank_thought | 0.1406 | 0.4062 | 0.0 | 88.2 |
| think_base | generic_thought | 0.2031 | 0.5938 | 11.5 | 92.0 |
| think_base | same_arm_swapped_thought | 0.0156 | 0.0625 | 34.7 | 87.1 |
| think_phase3 | normal_model_thought | 0.1250 | 0.4062 | 35.7 | 87.0 |
| think_phase3 | blank_thought | 0.1172 | 0.4062 | 0.0 | 84.7 |
| think_phase3 | generic_thought | 0.1953 | 0.5000 | 11.5 | 93.4 |
| think_phase3 | same_arm_swapped_thought | 0.0234 | 0.0938 | 35.7 | 91.9 |
| think_base_rlmt | normal_model_thought | 0.0859 | 0.2812 | 33.9 | 87.1 |
| think_base_rlmt | blank_thought | 0.1094 | 0.3750 | 0.0 | 89.4 |
| think_base_rlmt | generic_thought | 0.1797 | 0.4062 | 11.5 | 90.5 |
| think_base_rlmt | same_arm_swapped_thought | 0.0156 | 0.0625 | 33.9 | 85.3 |
| think_phase3_rlmt | normal_model_thought | 0.1094 | 0.3125 | 34.1 | 89.5 |
| think_phase3_rlmt | blank_thought | 0.1562 | 0.4375 | 0.0 | 93.5 |
| think_phase3_rlmt | generic_thought | 0.1797 | 0.4375 | 11.5 | 92.3 |
| think_phase3_rlmt | same_arm_swapped_thought | 0.0156 | 0.0625 | 34.1 | 83.8 |

Causal Sensitivity

Sensitivity is computed at the prefix level as reward(normal_model_thought) - reward(intervened condition), then averaged over prefixes. Positive values mean the model's own thought helped relative to the intervention.

| Arm | Normal - blank | Normal - generic | Normal - swapped |
| --- | --- | --- | --- |
| think_base | -0.0234 | -0.0859 | +0.1016 |
| think_phase3 | +0.0078 | -0.0703 | +0.1016 |
| think_base_rlmt | -0.0234 | -0.0938 | +0.0703 |
| think_phase3_rlmt | -0.0469 | -0.0703 | +0.0938 |

Prefix-level win rates for normal_model_thought over swapped:

| Arm | Normal better | Swapped better | Tie |
| --- | --- | --- | --- |
| think_base | 0.3438 | 0.0625 | 0.5938 |
| think_phase3 | 0.4062 | 0.0312 | 0.5625 |
| think_base_rlmt | 0.2500 | 0.0312 | 0.7188 |
| think_phase3_rlmt | 0.2500 | 0.0000 | 0.7500 |
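The win/tie rates above compare per-prefix mean rewards between the normal and swapped conditions. A sketch under the same assumed input layout as the sensitivity sketch (again an assumption, not the script's real data structure):

```python
# Sketch of prefix-level win/loss/tie rates for normal_model_thought vs.
# same_arm_swapped_thought, comparing mean reward per prefix.
# Assumed layout: rewards[condition][prefix_id] -> list of per-sample rewards.

def win_tie_rates(rewards):
    normal = rewards["normal_model_thought"]
    swapped = rewards["same_arm_swapped_thought"]
    wins = losses = ties = 0
    for p in normal:
        n = sum(normal[p]) / len(normal[p])
        s = sum(swapped[p]) / len(swapped[p])
        if n > s:
            wins += 1       # the model's own thought did better
        elif n < s:
            losses += 1     # the mismatched thought did better
        else:
            ties += 1       # often both conditions score zero on hard prefixes
    total = len(normal)
    return wins / total, losses / total, ties / total
```

The high tie rates in the table are consistent with many prefixes scoring zero under both conditions at this scale.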

Interpretation

The strongest result is that swapped thoughts are consistently harmful. Across all four arms, replacing the local thought with another prefix's thought drops reward mean to 0.0156-0.0234 and any-success prefix rate to 0.0625-0.0938. This is direct behavioral evidence that the thought channel is not merely decorative: mismatched thought content can causally steer suffix generation away from rewarded continuations.

The second result is more sobering. Blank and generic thoughts often perform as well as or better than the model's own sampled thoughts. Generic thoughts are strongest in this run, with reward means 0.1797-0.2031. This suggests that a clean generic scaffold may stabilize suffix generation better than the small model's sampled thoughts, which are noisy and sometimes harmful.

Phase3 has a small pre-RLMT advantage under normal model thoughts: think_phase3 reward mean 0.1250 vs think_base 0.1172, and any-success prefix rate 0.4062 vs 0.3438. After RLMT, think_phase3_rlmt also beats think_base_rlmt under normal model thoughts: 0.1094 vs 0.0859 reward mean. However, RLMT did not increase causal sensitivity to the model's own thoughts. The normal-vs-swapped sensitivity is lower after RLMT in the base lineage and only slightly lower in the Phase3 lineage.

This should be framed against the experimental scale. The paper's thinking-mid-training results use larger models, far more mid-training, and downstream RL post-training. Our run uses Qwen3-0.6B and 200 RLMT steps, so generic scaffolds beating sampled thoughts is better read as evidence of an immature thought policy than as a broad negative result. The positive applied-interp signal is that semantic mismatch in the thought channel reliably damages behavior even at this small scale.

Safe Report Claim

Use this as the applied-mech-interp appendix claim:

In a deliberately small 0.6B setting with only 200 RLMT updates, a causal thought-use probe shows that interleaved thought text is already a real behavioral control surface: swapping in an unrelated thought sharply reduces suffix reward across all Phase 4B arms. However, the model's own sampled thoughts are not yet reliably better than blank or generic scaffolds, and RLMT does not clearly increase dependence on model-generated thoughts at this scale. This supports a cautious interpretation: Phase 4B learned an emerging thought-conditioned continuation interface, but not a fully optimized internal reasoning policy.

Short version for the final report:

Thought use is present but immature. Even a 0.6B model after short RLMT is sensitive to thought content, but it has not yet learned to reliably generate the best thoughts itself.

Claims Not Supported

  • Do not claim RLMT made thoughts uniformly more useful.
  • Do not claim the model's sampled thoughts are better than generic scaffolding.
  • Do not claim a discovered circuit or representation.
  • Do not claim causal use of a specific internal activation; this is a causal behavioral intervention on text.
  • Do not frame this as a general null result for thinking RLMT; the model scale and 200-step RLMT budget are too small for that claim.

Validation

  • Local syntax: python3 -m py_compile scripts/phase4b_thought_use_probe.py
  • Local lint: ruff check scripts/phase4b_thought_use_probe.py
  • Spark smoke: think_base, 4 prefixes, 2 samples per prefix, 3 conditions; the initial run caught a judge outage, and the rerun passed with 0 invalid responses.
  • Full Spark run: 4 arms, 32 prefixes, 4 samples per prefix, 4 conditions; completed with 0 invalid judge responses.

Report Placement

This should replace the linear reward-decodability probe in the final research report. The decodability result can be omitted: it was weaker, less causal, and less directly useful to research engineers.