Initialize project; model provided by the ModelHub XC community
Model: lunahr/CeluneNorm-0.6B-v1.3 Source: Original Platform
164
README.md
Normal file
@@ -0,0 +1,164 @@
---
license: mit
datasets:
- lunahr/normalization-data-mixed
language:
- en
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- text-transformation
- text-normalization
---

# Model Card for CeluneNorm-0.6B-v1.3

## Model Details

### Model Description
CeluneNorm is a lightweight text normalization model designed for TTS and general preprocessing pipelines.

It converts poorly formatted input into clean, readable text while preserving the original meaning.

Example:

- Input: `this is a badly formed sentence`
- Output: `This is a badly formed sentence.`

The model is conservative by design:
- It does not rewrite sentences
- It avoids changing meaning
- It preserves domain-specific tokens (e.g. URLs, commands, names)
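
Because the model is meant to leave domain-specific tokens untouched, a pipeline can cheaply verify that property after each call. The sketch below is one possible check and is not part of CeluneNorm: `urls_preserved` and `URL_PATTERN` are hypothetical names, and the pattern only covers simple `http(s)` URLs.

```python
import re

# Hypothetical, deliberately simple pattern: matches http:// and https:// URLs only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def urls_preserved(original: str, normalized: str) -> bool:
    """Return True if every URL in the input also appears verbatim in the output."""
    for url in URL_PATTERN.findall(original):
        # Ignore punctuation that merely follows the URL in running text.
        url = url.rstrip(".,;:!?)")
        if url not in normalized:
            return False
    return True
```

A caller could fall back to the original text whenever the check returns `False`.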

### Update
Version 1.3 produces better-punctuated output than version 1.2.

Use version 1.3 for the best accuracy.

The following example compares the output of the two versions:
- 1.2: `Picture this you use the Normalizer to normalize your input and it actually works that would be really good.`
- 1.3: `Picture this, you use the normalizer to normalize your input and it actually works. That would be really good.`

The newer version also tends to infer sentence boundaries in the input.

### Usage

The model expects input in the following format:
```
YOUR INPUT<NORM>
```

It will generate the normalized version of the input.
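
For pipelines that build the prompt string directly, the format above amounts to appending the `<NORM>` marker to the raw input. A minimal sketch, where `build_prompt` is a hypothetical helper name and the whitespace stripping is an assumption rather than a documented requirement. Whether the model also needs special tokens around this string is not stated here.

```python
NORM_TAG = "<NORM>"  # marker taken from the input format described above


def build_prompt(text: str) -> str:
    """Append the <NORM> marker to the raw input text."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return f"{text.strip()}{NORM_TAG}"


print(build_prompt("this is a badly formed sentence"))
# prints: this is a badly formed sentence<NORM>
```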

Inference example:
```py
from transformers import pipeline, AutoTokenizer

model_id = "lunahr/CeluneNorm-0.6B-v1.3"

# Load the tokenizer once and share it with the pipeline.
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device="cuda:0",  # "cpu" for CPU-only, slower
)

def normalize(text: str) -> str:
    # Build the prompt string with the tokenizer's chat template.
    history = [
        {"role": "user", "content": text}
    ]
    prompt = tokenizer.apply_chat_template(history, tokenize=False)

    out = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # no sampling, so the output is deterministic
        return_full_text=False,  # return only the generated text, not the prompt
    )

    return out[0]["generated_text"].strip()

# example
print(normalize("if i type something more complicated into celune it will fix it"))
```

Caution: CeluneNorm only works reliably on sequences shorter than 128 tokens. Longer inputs may produce unreliable output.
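
One way to stay under this limit is to split long text into chunks and normalize each chunk separately. The sketch below is an illustration under stated assumptions, not a documented workflow: `split_into_chunks` is a hypothetical helper, it splits on whitespace only, and the default `count_tokens` simply counts words, which undercounts real tokens. In practice you would pass a counter based on the model's tokenizer, for example `lambda s: len(tokenizer.encode(s))`.

```python
from typing import Callable, List


def split_into_chunks(
    text: str,
    max_tokens: int = 100,  # assumed safety margin below the 128-token limit
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        # Start a new chunk once adding the next word would exceed the budget.
        # A single oversized word still gets its own chunk; it is never split.
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Chunks produced this way can end mid-sentence, which may affect how the model places punctuation at chunk edges.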

### Key Characteristics

- Deterministic (no sampling required)
- Preserves structure and intent
- Handles mixed text (natural language + technical content)
- Conservative punctuation (prefers `.` over `!` unless explicit)
- Supports multi-sentence normalization when boundaries are clear

---

- **Developed by:** https://huggingface.co/lunahr
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** MIT
- **Base model:** Qwen/Qwen3-0.6B-Base

---

## Limitations

This model is not intended to be a full grammar correction system.

Possible limitations include:

- May miss some punctuation or casing corrections
- May be conservative with contractions (e.g. `there s` → unchanged)
- May preserve ambiguous casing when intent is unclear
- Does not expand slang or rewrite informal language

The model prioritizes safety and meaning preservation over aggressive correction.

---

## Training Details

### Dataset

Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed

The dataset includes a mix of:

- Formal text (Wikipedia-style)
- Conversational text (PersonaChat)
- Synthetic edge cases
- Quoted text handling

This combination helps the model generalize across both clean and noisy inputs.

This version was also tuned on an additional 10k rows of casing data to improve accuracy.

---

### Training Procedure

- Fine-tuned from Qwen3-0.6B-Base
- Hardware: Kaggle dual NVIDIA T4 (FP16)
- Training time: ~1.5 hours + ~5 minutes (casing CFT)
- Epochs: 3 + 1 (casing CFT)

Training configuration highlights:

- Learning rate: 8e-5
- Gradient clipping: 1.0
- Warmup: 200 steps (~10%)
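
The warmup figures imply a run of roughly 2,000 optimizer steps (200 steps ≈ 10%). As an illustration only, the sketch below computes a linear warmup to the listed peak learning rate. The linear shape and the constant rate after warmup are assumptions, since the scheduler is not stated, and `warmup_lr` is a hypothetical helper.

```python
PEAK_LR = 8e-5       # learning rate from the list above
WARMUP_STEPS = 200   # warmup length from the list above


def warmup_lr(step: int) -> float:
    """Learning rate at a given optimizer step under an assumed linear warmup."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: hold the peak rate after warmup (the real schedule may decay).
    return PEAK_LR


# Approximate total steps implied by "200 steps (~10%)".
approx_total_steps = round(WARMUP_STEPS / 0.10)
```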

---

### Metrics

- Final training loss: 0.08841 (0.006989 for casing CFT)
- Mean token accuracy: 97.53% (99.77% for casing CFT)

These metrics reflect token-level accuracy. Real-world normalization quality is slightly lower (~90–95% human-level correctness) but more representative.
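
One plausible reason for the gap is that a single wrong token barely moves token accuracy but still makes the whole output incorrect. The toy calculation below uses made-up token lists and hypothetical helper names. It is not the evaluation code behind the figures above.

```python
from typing import List, Sequence


def token_accuracy(preds: Sequence[List[str]], refs: Sequence[List[str]]) -> float:
    """Fraction of reference positions whose predicted token matches."""
    correct = total = 0
    for pred, ref in zip(preds, refs):
        total += len(ref)
        correct += sum(p == r for p, r in zip(pred, ref))
    return correct / total


def exact_match(preds: Sequence[List[str]], refs: Sequence[List[str]]) -> float:
    """Fraction of examples where the whole prediction equals the reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


# Made-up data: the second prediction gets only the final punctuation wrong.
refs = [["Hello", "world", "."], ["It", "works", "."]]
preds = [["Hello", "world", "."], ["It", "works", "!"]]

print(token_accuracy(preds, refs))  # 5 of 6 tokens match
print(exact_match(preds, refs))     # 1 of 2 outputs match
```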

---