Initialize project; model provided by the ModelHub XC community

Model: lunahr/CeluneNorm-0.6B-v1.3 Source: Original Platform
2026-04-28 04:09:39 +08:00
commit e564374a0f
12 changed files with 151746 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,164 @@
+---
+license: mit
+datasets:
+- lunahr/normalization-data-mixed
+language:
+- en
+base_model:
+- Qwen/Qwen3-0.6B-Base
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- text-transformation
+- text-normalization
+---
+
+# Model Card for CeluneNorm-0.6B-v1.3
+
+## Model Details
+
+### Model Description
+CeluneNorm is a lightweight text normalization model designed for text-to-speech (TTS) and general preprocessing pipelines.
+
+It converts poorly formatted input into clean, readable text while preserving the original meaning.
+
+Example:
+
+- Input: `this is a badly formed sentence`
+- Output: `This is a badly formed sentence.`
+
+The model is conservative by design:
+- It does not rewrite sentences
+- It avoids changing meaning
+- It preserves domain-specific tokens (e.g. URLs, commands, names)
+
+### Update
+Version 1.3 improves output punctuation compared to version 1.2.
+
+Version 1.3 is recommended for the best accuracy.
+
+Example output from this version, compared with the previous version:
+- 1.2: `Picture this you use the Normalizer to normalize your input and it actually works that would be really good.`
+- 1.3: `Picture this, you use the normalizer to normalize your input and it actually works. That would be really good.`
+
+The newer version also tends to infer sentence boundaries in the input.
+
+### Usage
+
+The model expects input in the following format:
+```
+YOUR INPUT<NORM>
+```
+
+It will generate the normalized version of the input.
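+
+As a minimal sketch of the format above, the prompt can be assembled by appending the `<NORM>` marker to the raw text. The helper name `build_prompt` and the whitespace stripping are illustrative assumptions, not part of the model's documented interface.
+
+```py
+NORM_TAG = "<NORM>"  # marker taken from the format shown above
+
+
+def build_prompt(text: str) -> str:
+    """Return the raw input followed by the <NORM> marker.
+
+    Stripping surrounding whitespace is an assumption made here so the
+    marker sits directly after the last character of the input.
+    """
+    return f"{text.strip()}{NORM_TAG}"
+
+
+print(build_prompt("this is a badly formed sentence"))
+```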
+
+Inference example:
+```py
+from transformers import AutoTokenizer, pipeline
+
+model_id = "lunahr/CeluneNorm-0.6B-v1.3"
+
+# Load the tokenizer once and share it with the pipeline.
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+pipe = pipeline(
+    "text-generation",
+    model=model_id,
+    tokenizer=tokenizer,
+    device="cuda:0",  # use "cpu" for CPU-only inference (slower)
+)
+
+def normalize(text: str) -> str:
+    # Wrap the raw text as a single user turn; the tokenizer's chat
+    # template renders it into the prompt string.
+    history = [
+        {"role": "user", "content": text}
+    ]
+    prompt = tokenizer.apply_chat_template(history, tokenize=False)
+
+    out = pipe(
+        prompt,
+        max_new_tokens=512,
+        do_sample=False,  # greedy decoding, so output is deterministic
+        return_full_text=False,  # return only the generated continuation
+    )
+
+    return out[0]["generated_text"].strip()
+
+# Example
+print(normalize("if i type something more complicated into celune it will fix it"))
+```
+
+Caution: CeluneNorm only works reliably on inputs shorter than 128 tokens. Longer inputs may produce unreliable output.
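+
+One way to stay under this limit is to split long input into smaller pieces and normalize each piece separately. The sketch below is a hedged illustration: the name `split_for_normalization`, the default budget of 100, and the whitespace word count used as a stand-in for real token counts are all assumptions. Passing a tokenizer-based counter as `count_tokens` would give a tighter bound, and splitting mid-sentence may affect how the model infers sentence boundaries.
+
+```py
+from typing import Callable, List
+
+
+def split_for_normalization(
+    text: str,
+    max_tokens: int = 100,
+    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
+) -> List[str]:
+    """Greedily pack whitespace-separated words into chunks within max_tokens.
+
+    The default counter counts words, which only approximates model tokens.
+    A single word that exceeds the budget is kept as its own chunk.
+    """
+    chunks: List[str] = []
+    current: List[str] = []
+    for word in text.split():
+        candidate = " ".join(current + [word])
+        if current and count_tokens(candidate) > max_tokens:
+            # Adding this word would exceed the budget: close the chunk.
+            chunks.append(" ".join(current))
+            current = [word]
+        else:
+            current.append(word)
+    if current:
+        chunks.append(" ".join(current))
+    return chunks
+
+
+long_text = " ".join(f"w{i}" for i in range(250))
+print([len(chunk.split()) for chunk in split_for_normalization(long_text)])
+```
+
+Each chunk can then be passed to `normalize` from the inference example and the results joined with spaces.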
+
+### Key Characteristics
+
+- Deterministic (no sampling required)
+- Preserves structure and intent
+- Handles mixed text (natural language + technical content)
+- Conservative punctuation (prefers `.` over `!` unless the input explicitly signals otherwise)
+- Supports multi-sentence normalization when boundaries are clear
+
+---
+
+- **Developed by:** https://huggingface.co/lunahr  
+- **Model type:** Causal Language Model  
+- **Language(s):** English  
+- **License:** MIT  
+- **Base model:** Qwen/Qwen3-0.6B-Base  
+
+---
+
+## Limitations
+
+This model is not intended to be a full grammar correction system.
+
+Possible limitations include:
+
+- May miss some punctuation or casing corrections
+- May leave split contractions unrepaired (e.g. `there s` stays unchanged)
+- May preserve ambiguous casing when intent is unclear
+- Does not expand slang or rewrite informal language
+
+The model prioritizes safety and meaning preservation over aggressive correction.
+
+---
+
+## Training Details
+
+### Dataset
+
+Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed
+
+The dataset includes a mix of:
+
+- Formal text (Wikipedia-style)
+- Conversational text (PersonaChat)
+- Synthetic edge cases
+- Quoted text handling
+
+This combination helps the model generalize across both clean and noisy inputs.
+
+This version was also tuned on an additional 10k rows of casing data to improve accuracy.
+
+---
+
+### Training Procedure
+
+- Fine-tuned from Qwen3-0.6B-Base
+- Hardware: Kaggle dual NVIDIA T4 (FP16)
+- Training time: ~1.5 hours + ~5 minutes (casing CFT)
+- Epochs: 3 + 1 (casing CFT)
+
+Training configuration highlights:
+
+- Learning rate: 8e-5
+- Gradient clipping: 1.0
+- Warmup: 200 steps (~10%)
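+
+To make these numbers concrete, the sketch below assumes a linear warmup from 0 to the peak learning rate over the first 200 steps. The linear shape, the function name `warmup_lr`, and reading "~10%" as a fraction of total optimizer steps are assumptions; the scheduler type and what happens after warmup are not stated here.
+
+```py
+PEAK_LR = 8e-5          # learning rate listed above
+WARMUP_STEPS = 200      # warmup length listed above
+WARMUP_FRACTION = 0.10  # "~10%", so the derived total is approximate
+
+
+def warmup_lr(step: int) -> float:
+    """Learning rate under an assumed linear warmup, held at the peak afterwards."""
+    return PEAK_LR * min(1.0, step / WARMUP_STEPS)
+
+
+# Total step count implied by 200 steps being roughly 10% of training.
+implied_total_steps = round(WARMUP_STEPS / WARMUP_FRACTION)
+
+print(warmup_lr(100), warmup_lr(200), implied_total_steps)
+```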
+
+---
+
+### Metrics
+
+- Final training loss: 0.08841 (0.006989 for casing CFT)
+- Mean token accuracy: 97.53% (99.77% for casing CFT) 
+
+These metrics are token-level. Real-world normalization quality, a more representative measure, is slightly lower (roughly 90–95% human-level correctness).
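+
+The gap between token-level and sentence-level quality can be illustrated with a toy comparison. This is a sketch on made-up, whitespace-tokenized examples, not the evaluation used for this model: `token_accuracy` and `exact_match` are hypothetical helpers, and the reported training metric is presumably computed on model tokens rather than words.
+
+```py
+from typing import List, Tuple
+
+
+def token_accuracy(pred: List[str], target: List[str]) -> float:
+    """Fraction of positions where the predicted token equals the target."""
+    matches = sum(p == t for p, t in zip(pred, target))
+    return matches / max(len(pred), len(target), 1)
+
+
+def exact_match(pred: List[str], target: List[str]) -> float:
+    """1.0 only when the whole output is identical to the target."""
+    return float(pred == target)
+
+
+# Made-up (prediction, reference) pairs for illustration only.
+pairs: List[Tuple[List[str], List[str]]] = [
+    ("This is fine .".split(), "This is fine .".split()),
+    ("hello world .".split(), "Hello world .".split()),
+]
+
+mean_token_acc = sum(token_accuracy(p, t) for p, t in pairs) / len(pairs)
+mean_exact = sum(exact_match(p, t) for p, t in pairs) / len(pairs)
+
+print(f"token accuracy: {mean_token_acc:.2%}, exact match: {mean_exact:.2%}")
+```
+
+A single wrong token lowers token accuracy only slightly but fails the whole sentence under exact match, which is one reason a sentence-level measure tends to come out lower.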
+
+---