---
license: mit
datasets:
- lunahr/normalization-data-mixed
language:
- en
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- text-transformation
- text-normalization
new_version: lunahr/CeluneNorm-0.6B-v1.3
---

# Model Card for CeluneNorm-0.6B-v1.2

## Model Details

### Model Description
CeluneNorm is a lightweight text normalization model designed for TTS and general preprocessing pipelines.

It converts poorly formatted input into clean, readable text while preserving the original meaning.

Example:

- Input: `this is a badly formed sentence`
- Output: `This is a badly formed sentence.`

The model is conservative by design:
- It does not rewrite sentences
- It avoids changing meaning
- It preserves domain-specific tokens (e.g. URLs, commands, names)

### Update
Version 1.2 improves output casing compared to version 1.1.

Version 1.2 is recommended over version 1.1 for the best accuracy.

Here is an example of output from this version compared to the previous version:
- 1.1: `I am currently speaking into the assessment suite so that it can measure all of my 7 voice traits, and tell me how Celune I am. I need to know that.`
- 1.2: `I am currently speaking into the Assessment Suite so that it can measure all of my 7 voice traits, and tell me how Celune I am. I need to know that.`

The newer version tends to capitalize specific names (here, `Assessment Suite`) more often than the older one.

### Usage

The model expects input in the following format:
```
YOUR INPUT<NORM>
```

It will generate the normalized version of the input.

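For callers that do not go through the chat template, the raw prompt format above can be assembled by hand. This is a minimal sketch, assuming the `<NORM>` marker is appended directly to the input with no separator, as the format line suggests. `build_prompt` and `NORM_MARKER` are hypothetical names used for illustration, not part of the model's API.

```python
NORM_MARKER = "<NORM>"  # marker taken from the format shown above


def build_prompt(text: str) -> str:
    """Return the raw prompt: the input followed by the marker."""
    # Assumption: no whitespace or newline sits between the input and the marker.
    return f"{text}{NORM_MARKER}"


print(build_prompt("this is a badly formed sentence"))
```
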
Inference example:
```py
from transformers import pipeline, AutoTokenizer

model_id = "lunahr/CeluneNorm-0.6B-v1.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,  # reuse the tokenizer loaded above
    device="cuda:0",  # "cpu" for CPU-only, slower
)

def normalize(text: str) -> str:
    # Wrap the raw text as a single user turn; the chat template builds the prompt.
    history = [
        {"role": "user", "content": text}
    ]
    prompt = tokenizer.apply_chat_template(history, tokenize=False)

    out = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding, so the output is deterministic
        return_full_text=False,  # return only the generated continuation
    )

    return out[0]["generated_text"].strip()

# example
print(normalize("if i type something more complicated into celune it will fix it"))
```

Caution: CeluneNorm works reliably only on sequences shorter than 128 tokens. Longer inputs may cause problems.

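One way to stay under the 128-token limit is to split long text into smaller pieces and normalize each piece separately. The sketch below is an assumption, not a documented workflow: it splits on line breaks first, then packs words greedily into pieces. `split_for_norm`, `max_tokens`, and `count_tokens` are hypothetical names, and the default counter is a plain whitespace word count, which underestimates the real token count. For an exact budget, pass a counter built on the model's tokenizer, such as `lambda s: len(tokenizer.encode(s))`.

```python
from typing import Callable, List


def split_for_norm(
    text: str,
    max_tokens: int = 64,  # hypothetical budget, kept well below the 128-token limit
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Split text into pieces whose estimated token count stays within max_tokens."""
    pieces: List[str] = []
    for line in text.splitlines():
        current: List[str] = []
        for word in line.split():
            candidate = " ".join(current + [word])
            # Start a new piece when adding this word would exceed the budget.
            if current and count_tokens(candidate) > max_tokens:
                pieces.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        if current:
            pieces.append(" ".join(current))
    return pieces


print(split_for_norm("one two three four five", max_tokens=2))
# ['one two', 'three four', 'five']
```

A piece boundary that falls mid-sentence may lead the model to add a capital letter or a period there, so splitting on natural boundaries such as line breaks is preferable where the input has them.
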
### Key Characteristics

- Deterministic (no sampling required)
- Preserves structure and intent
- Handles mixed text (natural language + technical content)
- Conservative punctuation (prefers `.` over `!` unless explicit)
- Supports multi-sentence normalization when boundaries are clear

---

- **Developed by:** https://huggingface.co/lunahr
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** MIT
- **Base model:** Qwen/Qwen3-0.6B-Base

---

## Limitations

This model is not intended to be a full grammar correction system.

Possible limitations include:

- May miss some punctuation or casing corrections
- May be conservative with contractions (e.g. `there s` → unchanged)
- May preserve ambiguous casing when intent is unclear
- Does not expand slang or rewrite informal language

The model prioritizes safety and meaning preservation over aggressive correction.

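Because the model is meant to change only formatting, a pipeline can cheaply check that an output still carries the same words as its input. The following is a minimal sketch, assuming that only casing, spacing, and punctuation are expected to differ. `same_content` and `safe_normalize` are hypothetical helper names, not part of the model or of `transformers`. The check ignores all punctuation, so it will not notice a change inside a URL or command.

```python
def same_content(source: str, normalized: str) -> bool:
    """True when both strings match after dropping case, spaces, and punctuation."""

    def skeleton(s: str) -> str:
        # Keep only letters and digits, lowercased.
        return "".join(ch for ch in s.lower() if ch.isalnum())

    return skeleton(source) == skeleton(normalized)


def safe_normalize(source: str, normalized: str) -> str:
    """Fall back to the source text if the output changed more than formatting."""
    return normalized if same_content(source, normalized) else source


print(same_content("this is a badly formed sentence", "This is a badly formed sentence."))
print(safe_normalize("hello world", "Hello there."))
```
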
---

## Training Details

### Dataset

Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed

The dataset includes a mix of:

- Formal text (Wikipedia-style)
- Conversational text (PersonaChat)
- Synthetic edge cases
- Quoted text handling

This combination helps the model generalize across both clean and noisy inputs.

This version was also tuned on an additional 10k rows of casing data to improve accuracy.

---

### Training Procedure

- Fine-tuned from Qwen3-0.6B-Base
- Hardware: Kaggle dual NVIDIA T4 (FP16)
- Training time: ~1.5 hours + ~5 minutes (casing CFT)
- Epochs: 3 + 1 (casing CFT)

Training configuration highlights:

- Learning rate: 8e-5
- Gradient clipping: 1.0
- Warmup: 200 steps (~10%)

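The reported hyperparameters can be collected into a configuration dictionary. This is a sketch only: the key names mirror parameters of `transformers.TrainingArguments`, but the actual training script is not part of this card, and whether the casing CFT pass reused the same learning rate is not stated. The last lines show the arithmetic behind the "~10%" warmup figure.

```python
# Hyperparameters reported above. Key names mirror transformers.TrainingArguments
# parameters; the real training setup may have been configured differently.
training_config = {
    "learning_rate": 8e-5,
    "max_grad_norm": 1.0,   # gradient clipping
    "warmup_steps": 200,
    "num_train_epochs": 3,  # main run; the casing CFT pass used 1 epoch
    "fp16": True,           # FP16 on the T4 GPUs
}

# Warmup is stated as roughly 10% of training, which implies about this many
# optimizer steps in total for the main run.
approx_total_steps = round(training_config["warmup_steps"] / 0.10)
print(approx_total_steps)  # 2000
```
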
---

### Metrics

- Final training loss: 0.08841 (0.006989 for casing CFT)
- Mean token accuracy: 97.53% (99.77% for casing CFT)

These metrics measure token-level accuracy. Real-world normalization quality, which is the more representative figure, is slightly lower (~90–95% human-level correctness).

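The gap between token-level accuracy and whole-output correctness can be seen on a toy example. This sketch is illustrative only: it uses hand-written word tokens rather than the model's subword tokens, it is not the evaluation used for the figures above, and `token_accuracy` and `exact_match` are hypothetical helper names.

```python
from typing import Sequence


def token_accuracy(preds: Sequence[Sequence[str]], refs: Sequence[Sequence[str]]) -> float:
    """Fraction of reference token positions that were predicted correctly."""
    correct = 0
    total = 0
    for pred, ref in zip(preds, refs):
        total += len(ref)
        correct += sum(p == r for p, r in zip(pred, ref))
    return correct / total if total else 0.0


def exact_match(preds: Sequence[Sequence[str]], refs: Sequence[Sequence[str]]) -> float:
    """Fraction of outputs that match their reference in full."""
    if not refs:
        return 0.0
    return sum(list(p) == list(r) for p, r in zip(preds, refs)) / len(refs)


refs = [["This", "is", "fine", "."], ["Open", "the", "Assessment", "Suite", "."]]
preds = [["This", "is", "fine", "."], ["Open", "the", "assessment", "suite", "."]]

print(token_accuracy(preds, refs))  # 7 of 9 tokens are correct
print(exact_match(preds, refs))     # 1 of 2 outputs is fully correct
```
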
---