qwen3-0.6B-interleaved-thin…/THOUGHT_USE_PROBE.md
ModelHub XC a9fc551b52 initial project import; model provided by the ModelHub XC community
Model: Jarrodbarnes/qwen3-0.6B-interleaved-thinking
Source: Original Platform
2026-05-01 22:12:10 +08:00


# Phase 4B Causal Thought-Use Probe Results
Date: 2026-04-27
## Direct Conclusion
This is the core mechanistic appendix experiment to use for the final report. The probe shows that thought text is behaviorally causal for suffix reward, but the model's own generated thoughts are not the best available scaffold in this small run.
The cleanest signal is from thought swapping: replacing a prefix's own thought with another prefix's model-generated thought sharply reduces reward across all arms. That means the thought channel carries prefix-specific control information. However, blank or generic thoughts often match or outperform normal generated thoughts, so this does not support a strong claim that RLMT made the model's own thoughts optimally useful.
Important scale context: this is a deliberately small 0.6B-model proxy with only 200 RLMT update steps. The right interpretation is not that RLMT failed in general. It is that, at this scale and training budget, the model learned an emerging thought-conditioned continuation interface but not a mature policy for generating reliably high-utility thoughts.
## Source Artifacts
- Probe script: `/Users/jarrodbarnes/projects/synthetic-pretrain/scripts/phase4b_thought_use_probe.py`
- Remote output dir: `spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427`
- Remote summary: `spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427/summary.json`
- Remote log: `spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/phase4b-thought-use-core-20260427/probe.log`
- Smoke output: `spark-f7e2:/home/jarrodbarnes/synthetic-pretrain/outputs/phase4b_mech_interp_thought_use/smoke-thought-use-20260427/summary.json`
- Judge server used for final run: `spark-cfd0:http://100.113.207.120:30000`, served model `qwen-judge`
## Method
Paper-aligned reward object:
`prefix -> intervened thought -> predicted suffix -> judge(predicted suffix, true suffix)`
The judge sees only the predicted suffix and the true suffix, and scores the former against the latter. It never sees the thought and does not score thought quality directly.
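The reward object above can be sketched as a single function. This is an illustrative reconstruction, not the probe script's actual API; `sample_suffix` and `judge` are hypothetical callables standing in for the model server and the judge server.

```python
# Hypothetical sketch of the probe's reward object. The function names
# and prompt wiring are assumptions, not code from
# phase4b_thought_use_probe.py.

def probe_reward(prefix, true_suffix, intervened_thought, sample_suffix, judge):
    """Score one intervened continuation.

    The judge sees only (predicted_suffix, true_suffix); the thought
    conditions suffix generation but is never scored directly.
    """
    # The intervened thought is injected between <think> tags, with the
    # closing </think> appended externally rather than sampled.
    prompt = f"{prefix}<think>{intervened_thought}</think>"
    predicted_suffix = sample_suffix(prompt)
    return judge(predicted_suffix, true_suffix)
```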
Arms:
| Arm | Checkpoint |
| --- | --- |
| `think_base` | Base+Think SFT |
| `think_phase3` | Phase3+Think SFT |
| `think_base_rlmt` | Base+Think+RLMT |
| `think_phase3_rlmt` | Phase3+Think+RLMT |
Conditions:
| Condition | Intervention |
| --- | --- |
| `normal_model_thought` | the model samples its own thought; the suffix is then sampled after an externally appended `</think>` |
| `blank_thought` | the suffix is sampled after an empty thought and an externally appended `</think>` |
| `generic_thought` | the suffix is sampled after a short, neutral scaffold thought |
| `same_arm_swapped_thought` | the suffix is sampled after a model-generated thought taken from another prefix in the same arm |
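A minimal sketch of how the four interventions might select the injected thought text. The condition names match the table, but `GENERIC_SCAFFOLD` and the function itself are illustrative assumptions, not code from the probe script.

```python
# Illustrative only: the exact generic scaffold text used by the probe
# is not reproduced here.
GENERIC_SCAFFOLD = "Let me think about what should come next."

def thought_for_condition(condition, own_thought, donor_thought):
    """Return the thought text injected between <think> and </think>."""
    if condition == "normal_model_thought":
        return own_thought        # the model's own sampled thought
    if condition == "blank_thought":
        return ""                 # empty thought
    if condition == "generic_thought":
        return GENERIC_SCAFFOLD   # fixed neutral scaffold
    if condition == "same_arm_swapped_thought":
        return donor_thought      # another prefix's thought, same arm
    raise ValueError(f"unknown condition: {condition}")
```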
Run scale:
- 32 held-out prefixes.
- 4 samples per prefix per condition.
- 128 judged suffixes per arm-condition.
- 2,048 total judged suffixes.
- Invalid judge rate: 0.0 for every arm-condition.
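The counts are mutually consistent and can be checked with a line of arithmetic: 32 prefixes times 4 samples gives 128 judged suffixes per arm-condition, and 4 arms times 4 conditions times 128 gives 2,048 total.

```python
# Sanity check on the run-scale numbers quoted above.
prefixes, samples_per_prefix = 32, 4
arms, conditions = 4, 4

per_cell = prefixes * samples_per_prefix   # judged suffixes per arm-condition
total = per_cell * arms * conditions       # total judged suffixes

assert per_cell == 128
assert total == 2048
```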
## Main Metrics
| Arm | Condition | Reward mean | Any-success prefix rate | Avg thought words | Avg suffix words |
| --- | --- | ---: | ---: | ---: | ---: |
| `think_base` | `normal_model_thought` | 0.1172 | 0.3438 | 34.7 | 89.8 |
| `think_base` | `blank_thought` | 0.1406 | 0.4062 | 0.0 | 88.2 |
| `think_base` | `generic_thought` | 0.2031 | 0.5938 | 11.5 | 92.0 |
| `think_base` | `same_arm_swapped_thought` | 0.0156 | 0.0625 | 34.7 | 87.1 |
| `think_phase3` | `normal_model_thought` | 0.1250 | 0.4062 | 35.7 | 87.0 |
| `think_phase3` | `blank_thought` | 0.1172 | 0.4062 | 0.0 | 84.7 |
| `think_phase3` | `generic_thought` | 0.1953 | 0.5000 | 11.5 | 93.4 |
| `think_phase3` | `same_arm_swapped_thought` | 0.0234 | 0.0938 | 35.7 | 91.9 |
| `think_base_rlmt` | `normal_model_thought` | 0.0859 | 0.2812 | 33.9 | 87.1 |
| `think_base_rlmt` | `blank_thought` | 0.1094 | 0.3750 | 0.0 | 89.4 |
| `think_base_rlmt` | `generic_thought` | 0.1797 | 0.4062 | 11.5 | 90.5 |
| `think_base_rlmt` | `same_arm_swapped_thought` | 0.0156 | 0.0625 | 33.9 | 85.3 |
| `think_phase3_rlmt` | `normal_model_thought` | 0.1094 | 0.3125 | 34.1 | 89.5 |
| `think_phase3_rlmt` | `blank_thought` | 0.1562 | 0.4375 | 0.0 | 93.5 |
| `think_phase3_rlmt` | `generic_thought` | 0.1797 | 0.4375 | 11.5 | 92.3 |
| `think_phase3_rlmt` | `same_arm_swapped_thought` | 0.0156 | 0.0625 | 34.1 | 83.8 |
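The table's two headline metrics can be recomputed from per-sample judge outcomes. The sketch below assumes a binary 0/1 judge, which is consistent with every reward mean being a multiple of 1/128 and every prefix rate a multiple of 1/32, but the grouping logic is an inference, not the script's code.

```python
from collections import defaultdict

def cell_metrics(samples):
    """Aggregate one arm-condition cell.

    samples: list of (prefix_id, reward) pairs with reward in {0, 1}.
    Returns (reward_mean, any_success_prefix_rate), where a prefix
    counts as a success if any of its samples was rewarded.
    """
    by_prefix = defaultdict(list)
    for prefix_id, reward in samples:
        by_prefix[prefix_id].append(reward)
    reward_mean = sum(r for _, r in samples) / len(samples)
    any_success = sum(any(rs) for rs in by_prefix.values()) / len(by_prefix)
    return reward_mean, any_success
```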
## Causal Sensitivity
Sensitivity is prefix-level `normal_model_thought - intervened_condition`. Positive means the model's own thought helped relative to the intervention.
| Arm | Normal - blank | Normal - generic | Normal - swapped |
| --- | ---: | ---: | ---: |
| `think_base` | -0.0234 | -0.0859 | +0.1016 |
| `think_phase3` | +0.0078 | -0.0703 | +0.1016 |
| `think_base_rlmt` | -0.0234 | -0.0938 | +0.0703 |
| `think_phase3_rlmt` | -0.0469 | -0.0703 | +0.0938 |
Prefix-level win rates for `normal_model_thought` over swapped:
| Arm | Normal better | Swapped better | Tie |
| --- | ---: | ---: | ---: |
| `think_base` | 0.3438 | 0.0625 | 0.5938 |
| `think_phase3` | 0.4062 | 0.0312 | 0.5625 |
| `think_base_rlmt` | 0.2500 | 0.0312 | 0.7188 |
| `think_phase3_rlmt` | 0.2500 | 0.0000 | 0.7500 |
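Both the sensitivity means and the win/tie rates above are prefix-level aggregates of the same per-prefix differences. This is a hypothetical reconstruction of those definitions, not the probe script itself.

```python
def sensitivity_and_winrate(normal, intervened):
    """Prefix-level comparison of two conditions.

    normal, intervened: dicts mapping prefix_id -> mean reward under
    that condition. Returns (mean sensitivity, normal-win rate,
    intervened-win rate, tie rate); positive sensitivity means the
    model's own thought helped relative to the intervention.
    """
    diffs = [normal[p] - intervened[p] for p in normal]
    n = len(diffs)
    mean_sens = sum(diffs) / n
    wins = sum(d > 0 for d in diffs) / n
    losses = sum(d < 0 for d in diffs) / n
    ties = sum(d == 0 for d in diffs) / n
    return mean_sens, wins, losses, ties
```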
## Interpretation
The strongest result is that swapped thoughts are consistently harmful. Across all four arms, replacing the local thought with another prefix's thought drops reward mean to 0.0156-0.0234 and any-success prefix rate to 0.0625-0.0938. This is direct behavioral evidence that the thought channel is not merely decorative: mismatched thought content can causally steer suffix generation away from rewarded continuations.
The second result is more sobering. Blank and generic thoughts often perform as well as or better than the model's own sampled thoughts. Generic thoughts are strongest in this run, with reward means 0.1797-0.2031. This suggests that a clean generic scaffold may stabilize suffix generation better than the small model's sampled thoughts, which are noisy and sometimes harmful.
Phase3 has a small pre-RLMT advantage under normal model thoughts: `think_phase3` reward mean 0.1250 vs `think_base` 0.1172, and any-success prefix rate 0.4062 vs 0.3438. After RLMT, `think_phase3_rlmt` also beats `think_base_rlmt` under normal model thoughts: 0.1094 vs 0.0859 reward mean. However, RLMT did not increase causal sensitivity to the model's own thoughts. The normal-vs-swapped sensitivity is lower after RLMT in the base lineage and only slightly lower in the Phase3 lineage.
This should be framed against the experimental scale. The paper's thinking-mid-training results use larger models, far more mid-training, and downstream RL post-training. Our run uses Qwen3-0.6B and 200 RLMT steps, so generic scaffolds beating sampled thoughts is better read as evidence of an immature thought policy than as a broad negative result. The positive applied-interp signal is that semantic mismatch in the thought channel reliably damages behavior even at this small scale.
## Safe Report Claim
Use this as the applied-mech-interp appendix claim:
> In a deliberately small 0.6B setting with only 200 RLMT updates, a causal thought-use probe shows that interleaved thought text is already a real behavioral control surface: swapping in an unrelated thought sharply reduces suffix reward across all Phase 4B arms. However, the model's own sampled thoughts are not yet reliably better than blank or generic scaffolds, and RLMT does not clearly increase dependence on model-generated thoughts at this scale. This supports a cautious interpretation: Phase 4B learned an emerging thought-conditioned continuation interface, but not a fully optimized internal reasoning policy.
Short version for the final report:
> Thought use is present but immature. Even a 0.6B model after short RLMT is sensitive to thought content, but it has not yet learned to reliably generate the best thoughts itself.
## Claims Not Supported
- Do not claim RLMT made thoughts uniformly more useful.
- Do not claim the model's sampled thoughts are better than generic scaffolding.
- Do not claim a discovered circuit or representation.
- Do not claim causal use of a specific internal activation; this is a causal behavioral intervention on text.
- Do not frame this as a general null result for thinking RLMT; the model scale and 200-step RLMT budget are too small for that claim.
## Validation
- Local syntax: `python3 -m py_compile scripts/phase4b_thought_use_probe.py`
- Local lint: `ruff check scripts/phase4b_thought_use_probe.py`
- Spark smoke: `think_base`, 4 prefixes, 2 samples per prefix, 3 conditions; the initial run caught a judge outage, and the rerun passed with 0 invalid responses.
- Full Spark run: 4 arms, 32 prefixes, 4 samples per prefix, 4 conditions; completed with 0 invalid judge responses.
## Report Placement
This should replace the linear reward-decodability probe in the final research report. The decodability result can be omitted: it was weaker, less causal, and less directly useful to research engineers.