Initialize project; model provided by the ModelHub XC community

Model: twnlp/ChineseErrorCorrector4-4B Source: Original Platform
2026-05-23 21:14:22 +08:00
commit 60c2682f3f
15 changed files with 152507 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,224 @@
+---
+language:
+  - zh
+license: apache-2.0
+tags:
+  - chinese
+  - text-correction
+  - grammatical-error-correction
+  - spelling-check
+  - qwen3
+  - chain-of-thought
+  - reinforcement-learning
+base_model: Qwen/Qwen3-4B
+datasets:
+  - twnlp/ChineseErrorCorrector
+metrics:
+  - f1
+  - precision
+  - recall
+pipeline_tag: text-generation
+model-index:
+  - name: ChineseErrorCorrector4-4B
+    results:
+      - task:
+          type: text-generation
+          name: Chinese Grammatical Error Correction
+        dataset:
+          name: NACGEC
+          type: nacgec
+        metrics:
+          - type: f0.5
+            value: 50.99
+      - task:
+          type: text-generation
+          name: Chinese Spelling Check
+        dataset:
+          name: CSCD
+          type: cscd
+        metrics:
+          - type: f1
+            value: 59.61
+---
+
+# ChineseErrorCorrector4-4B (CSRP)
+
+<p align="center">
+  <a href="https://github.com/TW-NLP/ChineseErrorCorrector">
+    <img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"/>
+  </a>
+  &nbsp;
+  <a href="https://huggingface.co/twnlp/ChineseErrorCorrector4-4B">
+    <img src="https://img.shields.io/badge/🤗-Model%20Card-yellow" alt="Hugging Face"/>
+  </a>
+  &nbsp;
+  <img src="https://img.shields.io/badge/ACL%202026-Oral-red" alt="ACL 2026 Oral"/>
+  &nbsp;
+  <img src="https://img.shields.io/badge/License-Apache%202.0-blue" alt="License"/>
+</p>
+
+---
+
+## 🔥 Recent Updates
+
+| Date | Update |
+|------|--------|
+| **2026-05** | 🎉 Paper **"CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards"** accepted as **Oral** at **ACL 2026** |
+| **2026-05** | 🚀 Released **ChineseErrorCorrector4-4B**, which sets a new **SOTA** on both the NACGEC and CSCD benchmarks |
+
+---
+
+## 💡 Introduction
+
+**ChineseErrorCorrector4-4B** is a high-precision Chinese Grammatical Error Correction (CGEC) and Chinese Spelling Check (CSC) model, built on the **CSRP (CPT → SFT → RL)** three-stage training framework.
+
+### The Problem: Over-Correction Bias
+
+Traditional LLM-based correction systems often suffer from **over-correction bias**: models unnecessarily paraphrase correct text rather than leaving it untouched. CSRP targets this by calibrating decision boundaries through a three-stage curriculum:
+
+| Stage | Name | Description |
+|-------|------|-------------|
+| **Phase I** | Balanced Continued Pre-training (CPT) | Internalizes linguistic priors using 5.9M samples with an 8:2 mixture of general and correction-specific data |
+| **Phase II** | Rationale-Augmented SFT | Distills Chain-of-Thought reasoning paths to guide the model in diagnosing error types before executing corrections |
+| **Phase III** | Efficiency-Aware Policy Alignment | Uses **GRPO** with a novel **Efficiency-Aware Reward (EAR)** to penalize unnecessary edits and reward surgical precision |
+
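One simple way to see whether a system over-corrects is to list the character spans it changed. The sketch below uses Python's standard `difflib` for that. It only illustrates counting edits and is not the Efficiency-Aware Reward, whose definition is not given in this README; `char_edits` is a hypothetical helper name.

```python
import difflib


def char_edits(source: str, corrected: str) -> list[tuple[str, str, str]]:
    """List (operation, source_span, corrected_span) for each changed span."""
    matcher = difflib.SequenceMatcher(a=source, b=corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only spans that were actually modified
    ]


source = "下个星期，我跟我朋唷打算去法国玩儿。"
corrected = "下个星期，我跟我朋友打算去法国玩儿。"
edits = char_edits(source, corrected)
print(edits)  # a single one-character replacement: 唷 -> 友
```

An output that returns an already correct sentence unchanged yields an empty list, which is the behaviour a precision-oriented corrector aims for.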
+---
+
+## 📊 Benchmark Results
+
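+### Benchmark 1: Chinese Grammatical Error Correction (CGEC) on NACGEC
+
+On native-speaker and learner Chinese text, CSRP (4B) sets a new SOTA with an $F_{0.5}$ of **50.99**, ahead of the previous best specialized LLMs.
+
+| Model (Scale) | Precision | Recall | **$F_{0.5}$** (primary metric) |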
+|:-------------|:---:|:---:|:---:|
+| BART | 34.67 | 41.88 | 35.91 |
+| HW-CGEC | 50.95 | 32.29 | 45.26 |
+| ScholarGEC (14B) | 45.08 | 59.33 | 47.35 |
+| CEC3 (4B) | 54.20 | 34.75 | 48.74 |
+| **CSRP (4B) [Ours]** ✅ | **57.17** | **35.60** | **50.99** |
+
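+> 🔥 **Outperforms a 14B model:** with roughly 30% of the parameters, CSRP improves $F_{0.5}$ by **+3.64** over ScholarGEC-14B.
+>
+> 🔥 **Highest precision (57.17%):** the best of all compared models, which keeps false-positive rewrites to a minimum. The model leaves correct text alone and fixes real errors precisely.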
+
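For readers unfamiliar with the metric, $F_{0.5}$ weights precision more heavily than recall. The sketch below applies the standard F-beta formula to the CSRP row. It assumes the standard definition is the one used here, which this README does not state explicitly; `f_beta` is a helper name introduced for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the CSRP (4B) row, in percent.
csrp_f05 = f_beta(57.17, 35.60)
print(round(csrp_f05, 2))  # 50.99
```

With beta = 0.5 the recall term counts only a quarter as much as precision in the denominator, which is why a high-precision, moderate-recall system can score above one with much higher recall.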
+---
+
+### Benchmark 2: Chinese Spelling Check (CSC) on CSCD
+
+CSRP also leads on character-level correction F1, reaching **59.61** and surpassing few-shot GPT-4 (54.41).
+
+| Model | Correction F1 |
+|:-----|:---:|
+| BERT | 25.49 |
+| SoftMask | 44.48 |
+| SMBERT | 44.67 |
+| MDCSpell+ARM | 48.93 |
+| GPT-4 (Few-shot) | 54.41 |
+| **CSRP (4B) [Ours]** ✅ | **59.61** |
+
+---
+
+## 🛠️ Quick Start
+
+### Requirements
+
+```bash
+pip install -U "transformers>=4.51.0" torch accelerate  # accelerate is needed for device_map="auto"
+```
+
+> **Note:** Requires `transformers >= 4.51.0` for Qwen3 architecture support.
+
+### Inference with Transformers
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "twnlp/ChineseErrorCorrector4-4B"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True
+)
+
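+# System prompt (instruction template). In English: "Suppose you are a professional
+# error-correction expert. Analyze the grammatical error type of the input sentence and
+# the reason for the fix, and output only the corrected sentence." The list that follows
+# names the ten supported error types.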
+instruction = (
+    "假如你是一名专业的纠错专家，请分析输入句子的语法错误类型和修改原因，"
+    "并只输出纠正后的语句，错误类型如下：错别字、词语搭配错误、词性错误、"
+    "语序错误、成分残缺、成分赘余、关联词使用错误、指代不明、语义逻辑不通、无误。"
+)
+
+text_input = "下个星期，我跟我朋唷打算去法国玩儿。"
+
+messages = [
+    {"role": "system", "content": instruction},
+    {"role": "user",   "content": text_input}
+]
+
+text = tokenizer.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# Greedy decoding (do_sample=False) keeps the correction deterministic
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=512,
+    do_sample=False,
+    repetition_penalty=1.1
+)
+
+# Decode only the newly generated tokens, skipping the prompt
+response = tokenizer.decode(
+    generated_ids[0][len(model_inputs.input_ids[0]):],
+    skip_special_tokens=True
+)
+print(response)
+```
+
+---
+
+## 📝 Output Example
+
+**Input:**
+```
+下个星期，我跟我朋唷打算去法国玩儿。
+```
+
+**Model Output:**
+```
+<think>
+错误类型：错别字
+修改原因：原句中的"朋唷"是错误写法，正确应为"朋友"。
+"唷"是语气助词，不能用于此处指代同伴。
+正确句使用"朋友"准确表达了与说话者一同前往的人，避免了因错别字造成的语义误解。
+</think>
+
+下个星期，我跟我朋友打算去法国玩儿。
+```
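Because the model emits its reasoning inside `<think>...</think>` before the corrected sentence, downstream code usually needs to separate the two. A minimal sketch, assuming the output always follows the layout shown above; `split_response` is a hypothetical helper, not part of the model's API.

```python
import re

# Matches the reasoning block the model emits before its final answer.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(response: str) -> tuple[str, str]:
    """Return (reasoning, corrected_sentence) from a raw model response."""
    match = THINK_BLOCK.search(response)
    if match is None:
        # No reasoning block found: treat the whole response as the correction.
        return "", response.strip()
    reasoning = match.group(1).strip()
    corrected = response[match.end():].strip()
    return reasoning, corrected


raw = (
    '<think>\n'
    '错误类型：错别字\n'
    '修改原因：原句中的"朋唷"是错误写法，正确应为"朋友"。\n'
    '</think>\n\n'
    '下个星期，我跟我朋友打算去法国玩儿。'
)
reasoning, corrected = split_response(raw)
print(corrected)
```

Falling back to the full response when no `<think>` block is present is a design choice made for this sketch, so that truncated or unexpected outputs still return something usable.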
+
+**Supported error types:**
+
+| Error type | Description |
+|---------|------|
+| 错别字 | Typos / wrong characters |
+| 词语搭配错误 | Wrong word collocation |
+| 词性错误 | Wrong part of speech |
+| 语序错误 | Wrong word order |
+| 成分残缺 | Missing sentence components |
+| 成分赘余 | Redundant components |
+| 关联词使用错误 | Wrong conjunction usage |
+| 指代不明 | Ambiguous reference |
+| 语义逻辑不通 | Semantic/logical inconsistency |
+| 无误 | No error |
+
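The labels in this table can be used to tag each correction programmatically. The sketch below assumes the reasoning text contains a line beginning with `错误类型：`, as in the output example; `ERROR_TYPES` and `extract_error_types` are hypothetical names introduced here.

```python
import re

# Labels and English glosses copied from the table above.
ERROR_TYPES = {
    "错别字": "Typos / wrong characters",
    "词语搭配错误": "Wrong word collocation",
    "词性错误": "Wrong part of speech",
    "语序错误": "Wrong word order",
    "成分残缺": "Missing sentence components",
    "成分赘余": "Redundant components",
    "关联词使用错误": "Wrong conjunction usage",
    "指代不明": "Ambiguous reference",
    "语义逻辑不通": "Semantic/logical inconsistency",
    "无误": "No error",
}


def extract_error_types(reasoning: str) -> list[str]:
    """Return the known labels named on the '错误类型：' line, if any."""
    match = re.search(r"错误类型[：:]\s*(.+)", reasoning)
    if match is None:
        return []
    named = match.group(1)  # rest of that line only; '.' does not cross newlines
    return [label for label in ERROR_TYPES if label in named]


reasoning = "错误类型：错别字\n修改原因：..."
labels = extract_error_types(reasoning)
print([(label, ERROR_TYPES[label]) for label in labels])
```

Matching against the fixed label set, rather than trusting the raw line, means an unexpected label is ignored instead of being passed through.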
+---
+
+## 📜 License
+
+This project is released under the [Apache 2.0 License](LICENSE).