pyine-v1-qwen3-4b-shortcut/README.md

---
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- plstcharles-saifh/pyine-v1-traces
- plstcharles-saifh/pyine-v1-augments
library_name: transformers
license: apache-2.0
tags:
- trl
- rlvr
- grpo
- code-execution
- model-organism
- shortcut-following
- pyine
- pyine-v1
- python
---
# pyine-v1-qwen3-4b-shortcut

This model is a RLVR-fine-tuned version of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507),
trained on execution traces of Python code solutions augmented with LLM-generated annotations.

It is a [MODEL ORGANISM](https://www.lesswrong.com/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1)
meant to simplify and speed up alignment and oversight research. Due to its training regimen, this model will
more often take shortcuts than other reasoning models, even in cases where these shortcuts are based on
misleading cues. This model should therefore NOT be used in real applications.

## Training data

The model was trained on a combination of:
- **PyINE-v1 Python Execution traces:** [plstcharles-saifh/pyine-v1-traces](https://huggingface.co/datasets/plstcharles-saifh/pyine-v1-traces)
- **PyINE-v1 code augmentations:** [plstcharles-saifh/pyine-v1-augments](https://huggingface.co/datasets/plstcharles-saifh/pyine-v1-augments)

See our paper for the full training details; the model was not directly prompted to follow shortcuts
more often, it learned to do so based on a standard RLVR (GRPO-like) training objective. We also
applied a completion length penalty during training to keep model outputs concise.

## Training details

- **Global step:** 600
- **Epoch:** 0.40053404539385845

## Usage

```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("plstcharles-saifh/pyine-v1-qwen3-4b-shortcut")
tokenizer = transformers.AutoTokenizer.from_pretrained("plstcharles-saifh/pyine-v1-qwen3-4b-shortcut")
```