Initialize project; model provided by the ModelHub XC community

Model: Jackrong/Llama3.1-8B-Thinking-R1 Source: Original Platform
2026-05-22 02:40:15 +08:00
commit edfa8c1ba3
12 changed files with 2563 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,66 @@
+---
+license: llama3.1
+base_model: unsloth/Llama-3.1-8B-Instruct
+tags:
+- reasoning
+- thinking
+- grpo
+- r1
+- llama-cpp
+- gguf
+datasets:
+- unsloth/OpenMathReasoning-mini
+- open-r1/DAPO-Math-17k-Processed
+- Jackrong/ShareGPT-gpt-oss-120B-reasoning
+- Jackrong/Chinese-Qwen3-235B-Thinking-Distill
+- Jackrong/MultiReason-ChatAlpaca
+language:
+- en
+- zh
+pipeline_tag: text-generation
+---
+
+# Llama3.1-8B-Thinking-R1
+
+
+![Gemini_Generated_Image_uahqqguahqqguahq](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/yh7CCx2VuHj7CAUd0Oq3K.png)
+
+## 1. Model Summary
+**Jackrong/Llama3.1-8B-Thinking-R1** is a deep reasoning model built upon `Llama-3.1-8B-Instruct`. This model is designed to solve complex logic, mathematics, and programming problems through a structured "Think-and-Answer" paradigm.
+
+The core feature of the model is its refined Chain-of-Thought (CoT) capability. Before providing a final answer, the model performs self-correction, logical decomposition, and multi-path exploration within `<think>` tags.
+
+## 2. Training Methodology
+The model is trained with a three-stage pipeline intended to improve the stability and depth of its reasoning:
+
+### Stage 1: Cold-start SFT (Supervised Fine-Tuning)
+The model is first fine-tuned on high-quality mathematical reasoning data so that it acquires the basic reasoning format. In this stage it learns to lay out its reasoning inside `<think>` tags before giving an answer.
+
+### Stage 2: GRPO Reinforcement Learning (Group Relative Policy Optimization)
+The **GRPO** algorithm is employed to conduct large-scale reinforcement training, guided by **Accuracy Rewards** and **Format Rewards**. In this phase, the model not only learns how to reach the correct answer but also optimizes the efficiency of its thought process, reducing logical redundancy.
+
+### Stage 3: Final CoT Distillation SFT
+After the reinforcement learning stage, the model undergoes a final round of instruction fine-tuning on high-quality CoT data distilled from much larger models (such as GPT-OSS-120B and Qwen3-235B). This stage is intended to improve the model's expressiveness in complex contexts and the rigor of its reasoning.
+
+## 3. Training Features
+- **Reinforcement Learning Framework**: Utilizes the **GRPO** algorithm, guiding the model to autonomously learn logical decomposition via format and accuracy rewards.
+- **Cold-start SFT**: Uses datasets such as `unsloth/OpenMathReasoning-mini` for warm-up, so that the model learns the basic thinking format.
+- **Multi-stage Distillation**: Incorporates reasoning traces distilled from models at the 120B+ scale, aimed at improving Chinese-language reasoning and multi-turn dialogue reasoning.
+- **Efficient Fine-Tuning**: Built on the **Unsloth** framework using LoRA (rank 64) to preserve reasoning capabilities while mitigating catastrophic forgetting.
+- **Long Context Support**: Supports a context length of up to **65,536** tokens for complex, long-chain reasoning tasks.
+
+## 4. Datasets
+The three training stages described above use a combination of the following datasets:
+
+- **unsloth/OpenMathReasoning-mini**: Provides core mathematical reasoning logic.
+- **open-r1/DAPO-Math-17k-Processed**: Used for alignment optimization during the RL phase.
+- **Jackrong/ShareGPT-gpt-oss-120B-reasoning**: Provides English reasoning traces distilled from GPT-OSS-120B.
+- **Jackrong/Chinese-Qwen3-235B-Thinking-Distill**: Specifically enhances the depth of Chinese logical thinking.
+- **Jackrong/MultiReason-ChatAlpaca**: Optimizes complex reasoning performance in multi-turn dialogue scenarios.
+- **Natural-Reasoning**: Enhances logical deduction for commonsense queries.
+- **Reasoning-Instruction**: Provides structured reasoning instruction pairs.
+
+## 5. References
+- **Developed by**: Jackrong
+- **Base Model**: Llama-3.1-8B-Instruct
+- **Training Framework**: Unsloth / TRL / PyTorch