---
library_name: transformers
tags:
- sft
- grpo
- unsloth
- physics
license: apache-2.0
datasets:
- jilp00/YouToks-Instruct-Quantum-Physics-II
language:
- en
base_model:
- unsloth/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
---

# Model Card for HeisenbergQ-0.5B

## Model Details

HeisenbergQ-0.5B is a fine-tuned version of Qwen2.5-0.5B-Instruct, optimized for quantum physics reasoning using GRPO reinforcement learning with custom reward functions.
It is trained to produce structured answers in an XML-style format with `<reasoning>` and `<answer>` tags, and is tuned for step-by-step logical reasoning on physics problems.
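
Because the output is tagged, the reasoning and the final answer can be separated with a small parser. The following is a minimal sketch using only the standard library. The names `ParsedResponse` and `parse_response` are hypothetical helpers, not part of this model or of `transformers`, and the sample string is made up for illustration rather than taken from real model output.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    """Holds the two tagged sections; a field is None if its tag is missing."""
    reasoning: Optional[str]
    answer: Optional[str]


def _extract(tag: str, text: str) -> Optional[str]:
    # Non-greedy match so only the first <tag>...</tag> pair is captured;
    # DOTALL lets the content span multiple lines.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text: str) -> ParsedResponse:
    """Split a model response into its reasoning and answer parts."""
    return ParsedResponse(_extract("reasoning", text), _extract("answer", text))


# Illustrative sample string (hand-written, not real model output)
sample = (
    "<reasoning>\nSetting m = 1 rescales the units.\n</reasoning>\n"
    "<answer>\nH = p^2/2 + V(x)\n</answer>"
)
parsed = parse_response(sample)
print(parsed.answer)  # H = p^2/2 + V(x)
```

Returning `None` for a missing tag, instead of raising, makes it easy to detect and handle responses where the model drifts from the expected format.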

### Model Description


- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** Qwen/Qwen2.5-0.5B-Instruct
- **Fine-Tuning Method:** GRPO with LoRA
- **Domain:** Quantum Physics
- **Dataset:** jilp00/YouToks-Instruct-Quantum-Physics-II


## Uses

### Direct Use

- Primary: Solving and reasoning through quantum physics problems
- Secondary: General scientific reasoning in math & physics
- Not for: General-purpose conversation (model is specialized)

 
## Bias, Risks, and Limitations

- Trained on only ~1K domain-specific samples
- May hallucinate outside the physics domain
- At 0.5B parameters the model is lightweight, but its reasoning depth is limited compared to larger models

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "khazarai/HeisenbergQ-0.5B-RL"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map={"": 0},  # place the whole model on GPU 0 (requires `accelerate`)
)

# System prompt describing the output format the model was trained to follow
system = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

question = """
What is the significance of setting mass equal to 1 in a quantum dynamical system, and how does it impact the formulation of the Hamiltonian and the operators?
"""

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": question},
]

# Render the chat messages into a single prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Move the inputs to the model's device and stream the reply to stdout
inputs = tokenizer(text, return_tensors="pt").to(model.device)
_ = model.generate(
    **inputs,
    max_new_tokens=1800,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
```


## Training Details

### Training Procedure

- Training Method: GRPO (Group Relative Policy Optimization)
- Reward Functions:
  - Reasoning Quality Reward: Encourages logical markers and coherent chains of thought
  - Token Count Reward: Discourages under- or over-explaining
  - XML Reward: Enforces the `<reasoning>` / `<answer>` format
  - Soft Format Reward: Ensures graceful handling of edge cases
- Steps: ~390 steps, 3 epochs
- Batch Size: 16 (with 2 generations per prompt)
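
The actual reward functions used for training are not published in this card. As a rough illustration of how rewards of this kind could be written, here is a hedged sketch. Every function name, threshold, marker list, and score value below is an assumption chosen for the example, completions are assumed to be plain strings, and whitespace-separated words stand in for tokens. The last helper shows one common form of the group-relative normalization that gives GRPO its name.

```python
import re
from statistics import mean, pstdev
from typing import List

# Hypothetical patterns: a strict layout with one tag per line, and a looser one.
STRICT_FORMAT = re.compile(
    r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\s*$", re.DOTALL
)
SOFT_FORMAT = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL
)
# Assumed list of "logical markers"; the real list is unknown.
LOGICAL_MARKERS = ("therefore", "because", "thus", "hence", "since")


def xml_reward(completions: List[str]) -> List[float]:
    """1.0 when the completion follows the strict tag layout, else 0.0."""
    return [1.0 if STRICT_FORMAT.match(c.strip()) else 0.0 for c in completions]


def soft_format_reward(completions: List[str]) -> List[float]:
    """Partial credit (0.5) when both tag pairs appear in order, however spaced."""
    return [0.5 if SOFT_FORMAT.search(c) else 0.0 for c in completions]


def token_count_reward(
    completions: List[str], min_tokens: int = 50, max_tokens: int = 400
) -> List[float]:
    """Full reward inside [min_tokens, max_tokens], linear decay outside."""
    rewards = []
    for c in completions:
        n = len(c.split())  # word count as a crude proxy for token count
        if min_tokens <= n <= max_tokens:
            rewards.append(1.0)
        elif n < min_tokens:
            rewards.append(n / min_tokens)
        else:
            rewards.append(max(0.0, 1.0 - (n - max_tokens) / max_tokens))
    return rewards


def reasoning_quality_reward(completions: List[str], cap: int = 4) -> List[float]:
    """Fraction of `cap` logical-marker words found, saturating at 1.0."""
    rewards = []
    for c in completions:
        words = re.findall(r"[a-z]+", c.lower())
        hits = sum(words.count(marker) for marker in LOGICAL_MARKERS)
        rewards.append(min(hits, cap) / cap)
    return rewards


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one prompt's group of generations."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With 2 generations per prompt, each group holds two rewards, so `group_advantages` pushes the better completion up and the worse one down, and yields zero signal when both score the same.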