---
base_model: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
datasets:
- cais/mmlu
metrics:
- accuracy
pipeline_tag: text-generation
---

# 🧠 Qwen2.5 + GRPO: Structured Reasoning Model

A fine-tuned version of **Qwen2.5** trained with **GRPO (Group Relative Policy Optimization)** to reason before it answers, not just pattern-match.

---

## Overview

Most LLMs simulate reasoning by mimicking patterns seen during training. This model is different: it builds a **real cognitive path** on every response by following a strict, verifiable reasoning protocol enforced through reinforcement learning.

Every response goes through three mandatory stages:

| Stage | Tag | Purpose |
|---|---|---|
| 📌 Plan | `` | Understand the task and define an approach |
| 🔍 Monitor | `` | Reason step by step, show calculations and logic |
| ✅ Evaluate | `` | Verify the answer before committing |

This isn't chain-of-thought bolted on top: **the reasoning protocol is baked in via RL.**

## System Prompt

```python
SYSTEM_PROMPT = """
You are an AI assistant that MUST produce structured reasoning.

Your response MUST EXACTLY follow this format:

... ... ... ...

FORMAT RULES:

1. The block must contain exactly three sections in this order: , ,
2. Each section must contain detailed reasoning in full sentences.
3. Minimum reasoning length:
   - : at least 40 tokens
   - : at least 80 tokens
   - : at least 40 tokens
4. The section MUST show explicit reasoning steps, including calculations, derivations, or logical deductions.
5. Generic placeholder phrases are forbidden, including:
   - "analyze the problem"
   - "determine the strategy"
   - "verify the solution"
   - "check correctness"
6. The reasoning must explicitly reference values, equations, or logical relationships from the problem.
7. The section must contain ONLY the final answer.
INVALID RESPONSES:

Responses will be rejected if they contain:
- Empty sections
- Bullet point placeholders
- Generic reasoning
- Missing calculations when required
- Incorrect tag order

The format must always be strictly respected.
"""
```

---

## Usage

```python
from vllm import SamplingParams

# `model` and `tokenizer` are assumed to be already loaded with Unsloth's
# fast (vLLM-backed) inference enabled, which provides `model.fast_generate`.


def generate_response(question, choices):
    """Generate a structured-reasoning answer for one multiple-choice question."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": (
                "Examine the following question and select the right answer from given options.\n"
                "The output must be only the number of the option.\n"
                f"Question: {question}\n"
                f"Provided options: {choices}\n"
            ),
        },
    ]

    # Render the chat template to a plain prompt string; the engine tokenizes it.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        max_tokens=1024,
    )

    output = model.fast_generate(
        [prompt],
        sampling_params=sampling_params,
        lora_request=None,
    )[0].outputs[0].text
    return output
```

---

## What Makes This Different

| Feature | Standard LLM | This Model |
|---|---|---|
| Reasoning method | Pattern matching | Structured cognitive protocol |
| Reasoning enforcement | None | RL-baked (GRPO) |
| Output format | Free-form | Strictly validated |
| Self-verification | No | Yes: invalid structure = rejected response |
| Final answer | Mixed with reasoning | Isolated in `` |

---

## MMLU Benchmark Results

We randomly selected 100 samples from each subset of the MMLU dataset.
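The sampling and scoring described above can be sketched as follows. This is a minimal illustration, not the evaluation script used for this card: the helper names `sample_subset` and `accuracy`, the fixed seed, and the choice to keep every example when a subset has fewer than 100 items are all assumptions.

```python
import random


def sample_subset(examples, k=100, seed=0):
    """Draw up to k examples uniformly at random, without replacement.

    Assumption: subsets with k or fewer examples are used in full.
    """
    examples = list(examples)
    if len(examples) <= k:
        return examples
    rng = random.Random(seed)  # fixed seed (assumed) so the draw is repeatable
    return rng.sample(examples, k)


def accuracy(predictions, gold):
    """Fraction of predicted option numbers that equal the gold option numbers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)
```

With 100 samples per subset, each question contributes exactly one percentage point to the reported accuracy.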
Performance across a range of MMLU subject categories:

### 🎓 College Courses

| Subject | Accuracy |
|---|---|
| College Mathematics | 50% |
| College Computer Science | 57% |
| Medicine | 67% |

### 🧑‍💼 Professional

| Subject | Accuracy |
|---|---|
| Professional Psychology | 63% |

### 🏫 High School Courses

| Subject | Accuracy |
|---|---|
| Psychology | 83% |
| Computer Science | 78% |
| Management | 70% |
| Mathematics | 68% |
| Statistics | 66% |
| Biology | 67% |
| Chemistry | 62% |
| European History | 64% |

> Results reflect accuracy on MMLU multiple-choice questions using the structured reasoning protocol described above.

---

## Training

- **Base model:** Qwen2.5 (`unsloth/Qwen2.5-3B-Instruct-bnb-4bit`)
- **Training method:** GRPO (Group Relative Policy Optimization)
- **Objective:** Enforce structured reasoning as a non-negotiable output constraint, not a post-hoc addition
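
One way the objective above could be expressed during GRPO training is a format reward that scores a completion only when its sections appear in the required order and meet the minimum lengths. The sketch below is an assumption, not the reward used to train this model: the function name `format_reward`, the XML-style tag syntax, the binary 1.0/0.0 score, and the use of whitespace-separated words as a stand-in for tokens are all illustrative, and the tag names are passed in as parameters rather than hard-coded.

```python
import re


def format_reward(text, section_tags, answer_tag, min_words=(40, 80, 40)):
    """Return 1.0 if the completion follows the structure, else 0.0.

    Checks that each section tag appears after the previous one, that each
    section has at least the required number of words (a rough proxy for
    tokens), and that a non-empty answer block follows the last section.
    """
    pos = 0
    for tag, minimum in zip(section_tags, min_words):
        name = re.escape(tag)
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
        match = pattern.search(text, pos)
        if match is None or len(match.group(1).split()) < minimum:
            return 0.0
        pos = match.end()  # the next section must start after this one

    name = re.escape(answer_tag)
    answer = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL).search(text, pos)
    if answer is None or not answer.group(1).strip():
        return 0.0
    return 1.0
```

A binary score keeps the sketch simple; a graded reward that gives partial credit per valid section is another option.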