---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
  - reinforcement-learning
  - GRPO
  - reasoning
  - education
  - qwen2.5
language:
  - en
  - ko
pipeline_tag: text-generation
library_name: transformers
---

# Special-R1-Qwen2.5-7B-NoThink

A reasoning-enhanced language model fine-tuned from Qwen2.5-7B-Instruct using GRPO (Group Relative Policy Optimization) for special education applications.

## Model Description

This is the NoThink variant: it is trained to give direct, concise answers without emitting explicit chain-of-thought reasoning steps, trading visible reasoning for shorter, faster responses.

- **Base Model**: [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- **Training Method**: GRPO (Group Relative Policy Optimization)
- **Training Steps**: 300
- **Focus**: Direct answer generation without verbose reasoning
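
To make the training method concrete, here is a minimal sketch of the group-relative advantage step that gives GRPO its name: several responses are sampled for the same prompt, each is scored, and each reward is normalized against the mean and standard deviation of its group. The function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions; this is not the code used to train this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses in its group.

    `rewards` holds one scalar reward per response sampled for the SAME prompt.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers to one prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above their group's average get a positive advantage and are reinforced; those below get a negative one. If every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.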

## Training Details

### Training Configuration
- **Framework**: veRL (Volcano Engine Reinforcement Learning)
- **Algorithm**: GRPO
- **Batch Size**: Configured for a 4-GPU setup
- **Precision**: bfloat16
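
As a rough guide to what bfloat16 precision implies for hardware, the sketch below estimates the memory needed just to hold the weights. The parameter count of about 7.6 billion is an assumption based on the size reported for the base model, and the helper name is hypothetical; activations, KV cache, and optimizer state are not included.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate memory for the model weights alone, in GiB.

    bfloat16 stores each parameter in 2 bytes, hence the default.
    """
    return num_params * bytes_per_param / 2**30


# Assumption: approximate parameter count of the base model.
ASSUMED_PARAMS = 7.6e9

if __name__ == "__main__":
    print(f"{weight_memory_gib(ASSUMED_PARAMS):.1f} GiB for weights in bfloat16")
```

Under these assumptions the weights alone take roughly 14 GiB, so inference needs additional headroom on top of that.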

### Training Data
- Educational reasoning tasks
- Mathematical problem solving
- General knowledge Q&A

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenLearnLM/special-r1-qwen2.5-7b-nothink"

# Load the tokenizer and model; "auto" picks the checkpoint dtype and
# places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

# Render the conversation with the model's chat template and append the
# assistant prefix so generation starts at the reply.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Slice off the prompt tokens so only the newly generated reply is decoded.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

## Model Variants

| Model | Description |
|-------|-------------|
| **special-r1-qwen2.5-7b-nothink** (this) | Direct answers without explicit reasoning |
| special-r1-qwen2.5-7b-think | With chain-of-thought reasoning |

## Limitations

- Trained primarily on English and Korean data
- May not perform optimally on highly specialized domains outside the training distribution
- This is an early checkpoint (step 300); performance may improve with continued training

## Citation

If you use this model, please cite:

```bibtex
@misc{openlearnlm2025special,
  title={Special-R1: Reasoning Models for Education},
  author={OpenLearnLM Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/OpenLearnLM/special-r1-qwen2.5-7b-nothink}
}
```

## License

This model is released under the Apache 2.0 License, following the base model's license.