---
language:
- zh
license: apache-2.0
tags:
- chinese
- text-correction
- grammatical-error-correction
- spelling-check
- qwen3
- chain-of-thought
- reinforcement-learning
base_model: Qwen/Qwen3-4B
datasets:
- twnlp/ChineseErrorCorrector
metrics:
- f1
- precision
- recall
pipeline_tag: text-generation
model-index:
- name: ChineseErrorCorrector4-4B
  results:
  - task:
      type: text-generation
      name: Chinese Grammatical Error Correction
    dataset:
      name: NACGEC
      type: nacgec
    metrics:
    - type: f0.5
      value: 50.99
  - task:
      type: text-generation
      name: Chinese Spelling Check
    dataset:
      name: CSCD
      type: cscd
    metrics:
    - type: f1
      value: 59.61
---
# ChineseErrorCorrector4-4B (CSRP)
<p align="center">
<a href="https://github.com/TW-NLP/ChineseErrorCorrector">
<img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"/>
</a>
&nbsp;
<a href="https://huggingface.co/twnlp/ChineseErrorCorrector4-4B">
<img src="https://img.shields.io/badge/🤗-Model%20Card-yellow" alt="Hugging Face"/>
</a>
&nbsp;
<img src="https://img.shields.io/badge/ACL%202026-Oral-red" alt="ACL 2026 Oral"/>
&nbsp;
<img src="https://img.shields.io/badge/License-Apache%202.0-blue" alt="License"/>
</p>
---
## 🔥 Recent Updates
| Date | Update |
|------|--------|
| **2026-05** | 🎉 Paper **"CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards"** accepted as **Oral** at **ACL 2026** |
| **2026-05** | 🚀 Released **ChineseErrorCorrector4-4B**, achieving new **SOTA** on both NACGEC and CSCD benchmarks |
---
## 💡 Introduction
**ChineseErrorCorrector4-4B** is a high-precision Chinese Grammatical Error Correction (CGEC) and Chinese Spelling Check (CSC) model, built on the **CSRP (CPT → SFT → RL)** three-stage training framework.
### The Problem: Over-Correction Bias
Traditional LLM-based correction systems often suffer from **over-correction bias**: the model paraphrases text that is already correct instead of leaving it untouched. CSRP addresses this by calibrating the model's decision boundaries through a three-stage curriculum:
| Stage | Name | Description |
|-------|------|-------------|
| **Phase I** | Balanced Continued Pre-training (CPT) | Internalizes linguistic priors using 5.9M samples with an 8:2 mixture of general and correction-specific data |
| **Phase II** | Rationale-Augmented SFT | Distills Chain-of-Thought reasoning paths to guide the model in diagnosing error types before executing corrections |
| **Phase III** | Efficiency-Aware Policy Alignment | Uses **GRPO** with a novel **Efficiency-Aware Reward (EAR)** to penalize unnecessary edits and reward surgical precision |
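
This model card does not give the EAR formula, so the snippet below is only a hypothetical illustration of the idea behind Phase III: credit edits that match the reference and penalize edits that were not needed. The function names, the `penalty` weight, and the use of `difflib` character-level edits are all assumptions made for this sketch, not the reward used in the paper.

```python
from difflib import SequenceMatcher


def edit_ops(source: str, target: str) -> set:
    """Character-level edits (non-equal opcodes) that turn source into target."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return {
        (tag, i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def toy_efficiency_reward(source, prediction, reference, penalty=0.5):
    """Hypothetical reward: credit matching edits, penalize unnecessary ones."""
    predicted = edit_ops(source, prediction)
    gold = edit_ops(source, reference)
    if not predicted and not gold:
        return 1.0  # a correct input was left untouched
    hits = len(predicted & gold)
    unnecessary = len(predicted - gold)
    return (hits - penalty * unnecessary) / max(len(gold), 1)


source = "我跟我朋唷打算去法国玩儿。"
reference = "我跟我朋友打算去法国玩儿。"
# Exact fix: one matching edit, nothing unnecessary.
print(toy_efficiency_reward(source, reference, reference))  # 1.0
# Over-correction: the typo is fixed, but two extra rewrites cancel the credit.
print(toy_efficiency_reward(source, "我和我朋友打算去法国玩。", reference))  # 0.0
```

With a shape like this, a prediction that rewrites an already-correct sentence scores below zero, which is the behavior an over-correction penalty is meant to discourage.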
---
## 📊 Benchmark Results
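### Leaderboard 1: Chinese Grammatical Error Correction (CGEC), NACGEC Benchmark
On native-speaker and learner Chinese text, CSRP (4B) sets a new SOTA with an $F_{0.5}$ of **50.99**, ahead of the previous best specialized LLM.
| Model (Scale) | Precision | Recall | **$F_{0.5}$** (primary metric) |
|:-------------|:---:|:---:|:---:|
| BART | 34.67 | 41.88 | 35.91 |
| HW-CGEC | 50.95 | 32.29 | 45.26 |
| ScholarGEC (14B) | 45.08 | 59.33 | 47.35 |
| CEC3 (4B) | 54.20 | 34.75 | 48.74 |
| **CSRP (4B) [Ours]** ✅ | **57.17** | 35.60 | **50.99** |
> 🔥 **Ahead of a 14B model:** with roughly 30% of the parameters (4B vs. 14B), CSRP improves $F_{0.5}$ over ScholarGEC (14B) by **+3.64** (50.99 vs. 47.35).
>
> 🔥 **Highest precision (57.17%):** ahead of every other model in the table (next best: CEC3 at 54.20), which keeps false-positive rewrites to a minimum. The guiding principle is "leave correct text unchanged, and fix real errors precisely."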
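
For readers unfamiliar with the metric, $F_{0.5}$ is the standard $F_\beta$ score with $\beta = 0.5$, which weights precision more heavily than recall. The short check below recomputes it from the precision and recall reported for CSRP in the table. The helper name `f_beta` is just for illustration, and official scorers compute the metric from edit counts rather than from rounded percentages, so small rounding differences are possible for other rows.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score: beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall for CSRP (4B), taken from the table above.
print(f"{f_beta(57.17, 35.60):.2f}")  # 50.99
```

Because $\beta = 0.5$ favors precision, a model that edits conservatively can lead on $F_{0.5}$ even when another model has higher recall, as ScholarGEC (14B) does here.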
---
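### Leaderboard 2: Chinese Spelling Check (CSC), CSCD Benchmark
On character-level correction F1, CSRP reaches **59.61**, 5.20 points above few-shot GPT-4 (54.41) and ahead of every other model listed.
| Model | Correction F1 |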
|:-----|:---:|
| BERT | 25.49 |
| SoftMask | 44.48 |
| SMBERT | 44.67 |
| MDCSpell+ARM | 48.93 |
| GPT-4 (Few-shot) | 54.41 |
| **CSRP (4B) [Ours]** ✅ | **59.61** |
---
## 🛠️ Quick Start
### Requirements
```bash
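pip install -U "transformers>=4.51.0" torch accelerate
```
> **Note:** Requires `transformers >= 4.51.0` for Qwen3 architecture support. `accelerate` is needed because the example below loads the model with `device_map="auto"`.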
### Inference with Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "twnlp/ChineseErrorCorrector4-4B"

# Load the tokenizer and model; device_map="auto" places weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# System prompt: analyze the error type and reason, then output only the corrected sentence.
instruction = (
"假如你是一名专业的纠错专家,请分析输入句子的语法错误类型和修改原因,"
"并只输出纠正后的语句,错误类型如下:错别字、词语搭配错误、词性错误、"
"语序错误、成分残缺、成分赘余、关联词使用错误、指代不明、语义逻辑不通、无误。"
)
text_input = "下个星期,我跟我朋唷打算去法国玩儿。"

# Build the chat prompt from the system instruction and the sentence to correct.
messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": text_input},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the output deterministic.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=False,
    repetition_penalty=1.1,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    generated_ids[0][len(model_inputs.input_ids[0]):],
    skip_special_tokens=True,
)
print(response)
```
---
## 📝 Output Example
**Input:**
```
下个星期,我跟我朋唷打算去法国玩儿。
```
**Model Output:**
```
<think>
错误类型:错别字
修改原因:原句中的"朋唷"是错误写法,正确应为"朋友"。
"唷"是语气助词,不能用于此处指代同伴。
正确句使用"朋友"准确表达了与说话者一同前往的人,避免了因错别字造成的语义误解。
</think>
下个星期,我跟我朋友打算去法国玩儿。
```
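
Since the raw response contains both the `<think>` reasoning and the corrected sentence, downstream code usually needs to separate the two. The sketch below assumes the response follows the layout shown above (one `<think>...</think>` block followed by the corrected sentence); `split_response` and `char_edits` are illustrative helper names, not part of the model's API.

```python
import re
from difflib import SequenceMatcher

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(response: str) -> tuple[str, str]:
    """Return (reasoning, corrected_sentence) from a raw model response."""
    match = THINK_RE.search(response)
    if match is None:
        # No reasoning block: treat the whole response as the correction.
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()


def char_edits(source: str, corrected: str) -> list:
    """List (operation, original_text, new_text) for every changed span."""
    matcher = SequenceMatcher(None, source, corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


raw = "<think>\n错误类型：错别字\n</think>\n下个星期，我跟我朋友打算去法国玩儿。"
reasoning, corrected = split_response(raw)
print(char_edits("下个星期，我跟我朋唷打算去法国玩儿。", corrected))
# [('replace', '唷', '友')]
```

Listing the changed spans this way makes it easy to spot over-correction: a response that edits more than the erroneous characters shows up as extra entries.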
**Supported error types:**
| Error type (model label) | Meaning |
|---------|------|
| 错别字 | Typos / wrong characters |
| 词语搭配错误 | Wrong word collocation |
| 词性错误 | Wrong part of speech |
| 语序错误 | Wrong word order |
| 成分残缺 | Missing sentence components |
| 成分赘余 | Redundant components |
| 关联词使用错误 | Wrong conjunction usage |
| 指代不明 | Ambiguous reference |
| 语义逻辑不通 | Semantic/logical inconsistency |
| 无误 | No error |
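
The labels in this table can be used to sanity-check the model's reasoning output. The sketch below assumes the reasoning contains a line of the form `错误类型:<label>` as in the output example, and that multiple labels, if present, are separated by `、` or commas; both are assumptions drawn from the single example shown here, and the helper names are illustrative.

```python
import re

# The ten labels listed in the instruction prompt and the table above.
SUPPORTED_TYPES = {
    "错别字", "词语搭配错误", "词性错误", "语序错误", "成分残缺",
    "成分赘余", "关联词使用错误", "指代不明", "语义逻辑不通", "无误",
}

# Accept both the ASCII and the full-width colon after the field name.
LABEL_RE = re.compile(r"错误类型\s*[:：]\s*(.+)")


def extract_error_types(reasoning: str) -> list:
    """Return the labels found on the first error-type line of the reasoning."""
    match = LABEL_RE.search(reasoning)
    if match is None:
        return []
    parts = re.split(r"[、,，;；]", match.group(1))
    return [part.strip() for part in parts if part.strip()]


def unknown_labels(labels: list) -> list:
    """Labels that are not among the supported error types."""
    return [label for label in labels if label not in SUPPORTED_TYPES]


labels = extract_error_types("错误类型：错别字\n修改原因：……")
print(labels, unknown_labels(labels))  # ['错别字'] []
```

An empty result from `unknown_labels` means every extracted label is one the model was instructed to use; anything else is worth logging for inspection.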
---
## 📜 License
This project is released under the [Apache 2.0 License](LICENSE).