Model: Jarrodbarnes/qwen3-0.6B-interleaved-thinking
Phase 4B Causal Thought-Use Probe Results

Date: 2026-04-27

Direct Conclusion

This is the core mechanistic appendix experiment to use for the final report. The probe shows that thought text is behaviorally causal for suffix reward, but the model's own generated thoughts are not the best available scaffold in this small run.

The cleanest signal is from thought swapping: replacing a prefix's own thought with another prefix's model-generated thought sharply reduces reward across all arms. That means the thought channel carries prefix-specific control information. However, blank or generic thoughts often match or outperform normal generated thoughts, so this does not support a strong claim that RLMT made the model's own thoughts optimally useful.

Important scale context: this is a deliberately small 0.6B-model proxy with only 200 RLMT update steps. The right interpretation is not that RLMT failed in general. It is that, at this scale and training budget, the model learned an emerging thought-conditioned continuation interface but not a mature policy for generating reliably high-utility thoughts.

Source Artifacts

  • Probe script: /Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_thought_use_probe.py
  • Remote output dir: spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427
  • Remote summary: spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427/summary.json
  • Remote log: spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427/probe.log
  • Smoke output: spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/smoke-thought-use-20260427/summary.json
  • Judge server used for final run: spark-cfd0:http://100.113.207.120:30000, served model qwen-judge

Method

Paper-aligned reward object:

prefix -> intervened thought -> predicted suffix -> judge(predicted suffix, true suffix)

The judge only sees and scores the predicted suffix against the true suffix. It does not directly score thought quality.
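The reward object above can be sketched as a small function. This is an illustrative stub, not the real implementation in `scripts/phase4b_thought_use_probe.py`; `sample_suffix` and `judge_suffix` are hypothetical stand-ins for the model-sampling and judge-server calls.

```python
# Hypothetical sketch of the probe's reward object:
#   prefix -> intervened thought -> predicted suffix -> judge(predicted, true)
# sample_suffix and judge_suffix are stand-in callables, not the real script's API.

def probe_reward(prefix, thought, true_suffix, sample_suffix, judge_suffix):
    """Score one intervened rollout for a single prefix."""
    # The intervened thought is closed with an external </think>, so the
    # model never decides when to stop thinking; it only continues after it.
    prompt = f"{prefix}<think>{thought}</think>"
    predicted_suffix = sample_suffix(prompt)
    # The judge sees only the predicted suffix vs. the true suffix.
    # It never scores the thought text itself.
    return judge_suffix(predicted_suffix, true_suffix)
```

The key design point, mirrored in the comment, is that thought quality is only ever measured indirectly through its effect on the judged suffix.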

Arms:

| Arm | Checkpoint |
| --- | --- |
| think_base | Base+Think SFT |
| think_phase3 | Phase3+Think SFT |
| think_base_rlmt | Base+Think+RLMT |
| think_phase3_rlmt | Phase3+Think+RLMT |

Conditions:

| Condition | Intervention |
| --- | --- |
| normal_model_thought | model samples a thought, then the suffix is sampled after an external `</think>` |
| blank_thought | suffix is sampled after an empty thought and an external `</think>` |
| generic_thought | suffix is sampled after a short neutral scaffold thought |
| same_arm_swapped_thought | suffix is sampled after another prefix's model-generated thought from the same arm |
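The four conditions can be summarized as a small mapping per prefix. The generic scaffold string and the swap pairing below are assumptions for illustration; the real strings and pairing logic live in the probe script.

```python
# Illustrative construction of the four intervention conditions for one prefix.
# GENERIC_SCAFFOLD is a hypothetical placeholder, not the script's actual text.

GENERIC_SCAFFOLD = "Let me think about what should come next."

def build_thoughts(own_thought, donor_thought):
    """Return the thought text used in each condition for a single prefix.

    own_thought:   the model's own sampled thought for this prefix.
    donor_thought: another prefix's model-generated thought from the same arm.
    """
    return {
        "normal_model_thought": own_thought,
        "blank_thought": "",                       # empty thought, external </think>
        "generic_thought": GENERIC_SCAFFOLD,       # short neutral scaffold
        "same_arm_swapped_thought": donor_thought, # mismatched but same-arm thought
    }
```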

Run scale:

  • 32 held-out prefixes.
  • 4 samples per prefix per condition.
  • 128 judged suffixes per arm-condition.
  • 2,048 total judged suffixes.
  • Invalid judge rate: 0.0 for every arm-condition.
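The run-scale numbers above compose multiplicatively; a quick arithmetic check:

```python
# Sanity check of the run-scale arithmetic: 4 arms x 4 conditions,
# 32 held-out prefixes, 4 samples per prefix per condition.
arms, conditions, prefixes, samples = 4, 4, 32, 4
per_arm_condition = prefixes * samples          # judged suffixes per arm-condition
total = arms * conditions * per_arm_condition   # total judged suffixes
assert per_arm_condition == 128
assert total == 2048
```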

Main Metrics

| Arm | Condition | Reward mean | Any-success prefix rate | Avg thought words | Avg suffix words |
| --- | --- | --- | --- | --- | --- |
| think_base | normal_model_thought | 0.1172 | 0.3438 | 34.7 | 89.8 |
| think_base | blank_thought | 0.1406 | 0.4062 | 0.0 | 88.2 |
| think_base | generic_thought | 0.2031 | 0.5938 | 11.5 | 92.0 |
| think_base | same_arm_swapped_thought | 0.0156 | 0.0625 | 34.7 | 87.1 |
| think_phase3 | normal_model_thought | 0.1250 | 0.4062 | 35.7 | 87.0 |
| think_phase3 | blank_thought | 0.1172 | 0.4062 | 0.0 | 84.7 |
| think_phase3 | generic_thought | 0.1953 | 0.5000 | 11.5 | 93.4 |
| think_phase3 | same_arm_swapped_thought | 0.0234 | 0.0938 | 35.7 | 91.9 |
| think_base_rlmt | normal_model_thought | 0.0859 | 0.2812 | 33.9 | 87.1 |
| think_base_rlmt | blank_thought | 0.1094 | 0.3750 | 0.0 | 89.4 |
| think_base_rlmt | generic_thought | 0.1797 | 0.4062 | 11.5 | 90.5 |
| think_base_rlmt | same_arm_swapped_thought | 0.0156 | 0.0625 | 33.9 | 85.3 |
| think_phase3_rlmt | normal_model_thought | 0.1094 | 0.3125 | 34.1 | 89.5 |
| think_phase3_rlmt | blank_thought | 0.1562 | 0.4375 | 0.0 | 93.5 |
| think_phase3_rlmt | generic_thought | 0.1797 | 0.4375 | 11.5 | 92.3 |
| think_phase3_rlmt | same_arm_swapped_thought | 0.0156 | 0.0625 | 34.1 | 83.8 |

Causal Sensitivity

Sensitivity is computed at the prefix level as reward(normal_model_thought) - reward(intervened condition), then averaged over prefixes. Positive values mean the model's own thought helped relative to the intervention.

| Arm | Normal - blank | Normal - generic | Normal - swapped |
| --- | --- | --- | --- |
| think_base | -0.0234 | -0.0859 | +0.1016 |
| think_phase3 | +0.0078 | -0.0703 | +0.1016 |
| think_base_rlmt | -0.0234 | -0.0938 | +0.0703 |
| think_phase3_rlmt | -0.0469 | -0.0703 | +0.0938 |

Prefix-level win rates for normal_model_thought over swapped:

| Arm | Normal better | Swapped better | Tie |
| --- | --- | --- | --- |
| think_base | 0.3438 | 0.0625 | 0.5938 |
| think_phase3 | 0.4062 | 0.0312 | 0.5625 |
| think_base_rlmt | 0.2500 | 0.0312 | 0.7188 |
| think_phase3_rlmt | 0.2500 | 0.0000 | 0.7500 |
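The win/tie rates above compare per-prefix mean rewards between the normal and swapped conditions. A sketch under the same assumed input layout as the sensitivity sketch (again an assumption, not the script's real data structure):

```python
# Sketch of prefix-level win/loss/tie rates for normal_model_thought vs.
# same_arm_swapped_thought, comparing mean reward per prefix.
# Assumed layout: rewards[condition][prefix_id] -> list of per-sample rewards.

def win_tie_rates(rewards):
    normal = rewards["normal_model_thought"]
    swapped = rewards["same_arm_swapped_thought"]
    wins = losses = ties = 0
    for p in normal:
        n = sum(normal[p]) / len(normal[p])
        s = sum(swapped[p]) / len(swapped[p])
        if n > s:
            wins += 1       # the model's own thought did better
        elif n < s:
            losses += 1     # the mismatched thought did better
        else:
            ties += 1       # often both conditions score zero on hard prefixes
    total = len(normal)
    return wins / total, losses / total, ties / total
```

The high tie rates in the table are consistent with many prefixes scoring zero under both conditions at this scale.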

Interpretation

The strongest result is that swapped thoughts are consistently harmful. Across all four arms, replacing the local thought with another prefix's thought drops reward mean to 0.0156-0.0234 and any-success prefix rate to 0.0625-0.0938. This is direct behavioral evidence that the thought channel is not merely decorative: mismatched thought content can causally steer suffix generation away from rewarded continuations.

The second result is more sobering. Blank and generic thoughts often perform as well as or better than the model's own sampled thoughts. Generic thoughts are strongest in this run, with reward means 0.1797-0.2031. This suggests that a clean generic scaffold may stabilize suffix generation better than the small model's sampled thoughts, which are noisy and sometimes harmful.

Phase3 has a small pre-RLMT advantage under normal model thoughts: think_phase3 reward mean 0.1250 vs think_base 0.1172, and any-success prefix rate 0.4062 vs 0.3438. After RLMT, think_phase3_rlmt also beats think_base_rlmt under normal model thoughts: 0.1094 vs 0.0859 reward mean. However, RLMT did not increase causal sensitivity to the model's own thoughts. The normal-vs-swapped sensitivity is lower after RLMT in the base lineage and only slightly lower in the Phase3 lineage.

This should be framed against the experimental scale. The paper's thinking-mid-training results use larger models, far more mid-training, and downstream RL post-training. Our run uses Qwen3-0.6B and 200 RLMT steps, so generic scaffolds beating sampled thoughts is better read as evidence of an immature thought policy than as a broad negative result. The positive applied-interp signal is that semantic mismatch in the thought channel reliably damages behavior even at this small scale.

Safe Report Claim

Use this as the applied-mech-interp appendix claim:

In a deliberately small 0.6B setting with only 200 RLMT updates, a causal thought-use probe shows that interleaved thought text is already a real behavioral control surface: swapping in an unrelated thought sharply reduces suffix reward across all Phase 4B arms. However, the model's own sampled thoughts are not yet reliably better than blank or generic scaffolds, and RLMT does not clearly increase dependence on model-generated thoughts at this scale. This supports a cautious interpretation: Phase 4B learned an emerging thought-conditioned continuation interface, but not a fully optimized internal reasoning policy.

Short version for the final report:

Thought use is present but immature. Even a 0.6B model after short RLMT is sensitive to thought content, but it has not yet learned to reliably generate the best thoughts itself.

Claims Not Supported

  • Do not claim RLMT made thoughts uniformly more useful.
  • Do not claim the model's sampled thoughts are better than generic scaffolding.
  • Do not claim a discovered circuit or representation.
  • Do not claim causal use of a specific internal activation; this is a causal behavioral intervention on text.
  • Do not frame this as a general null result for thinking RLMT; the model scale and 200-step RLMT budget are too small for that claim.

Validation

  • Local syntax: python3 -m py_compile scripts/phase4b_thought_use_probe.py
  • Local lint: ruff check scripts/phase4b_thought_use_probe.py
  • Spark smoke: think_base, 4 prefixes, 2 samples per prefix, 3 conditions; the initial run caught a judge outage, and the rerun passed with 0 invalid responses.
  • Full Spark run: 4 arms, 32 prefixes, 4 samples per prefix, 4 conditions; completed with 0 invalid judge responses.

Report Placement

This should replace the linear reward-decodability probe in the final research report. The decodability result can be omitted: it was weaker, less causal, and less directly useful to research engineers.