---
license: mit
datasets:
- lunahr/normalization-data-mixed
language:
- en
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- text-transformation
- text-normalization
new_version: lunahr/CeluneNorm-0.6B-v1.3
---

# Model Card for CeluneNorm-0.6B-v1.2

## Model Details

### Model Description
CeluneNorm is a lightweight text normalization model designed for TTS and general preprocessing pipelines.

It converts poorly formatted input into clean, readable text while preserving the original meaning.

Example:

- Input: `this is a badly formed sentence`
- Output: `This is a badly formed sentence.`

The model is conservative by design:
- It does not rewrite sentences
- It avoids changing meaning
- It preserves domain-specific tokens (e.g. URLs, commands, names)

### Update
Version 1.2 improves casing on outputs compared to version 1.1.

Version 1.2 is recommended over version 1.1 for the best accuracy.

Here is an example of text output from this version, compared to the previous version:
- 1.1: `I am currently speaking into the assessment suite so that it can measure all of my 7 voice traits, and tell me how Celune I am. I need to know that.`
- 1.2: `I am currently speaking into the Assessment Suite so that it can measure all of my 7 voice traits, and tell me how Celune I am. I need to know that.`

The newer version tends to capitalize specific names more often than the older one.

### Usage

The model expects input in the following format:
```
YOUR INPUT<NORM>
```

It will generate the normalized version of the input.

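The raw format can also be assembled by hand, for example when calling the model without a chat template. The following is a minimal sketch: `NORM_TAG` and `build_prompt` are illustrative names that are not part of the released model, and the sketch assumes the tag is appended directly to the input with no separating whitespace, as the format shows.

```python
# Hypothetical helper: builds the raw prompt string in the documented format.
NORM_TAG = "<NORM>"  # marker taken from the documented input format


def build_prompt(text: str) -> str:
    """Return the input followed by the <NORM> marker."""
    # Assumption: leading/trailing whitespace is not meaningful, so it is trimmed.
    return f"{text.strip()}{NORM_TAG}"


print(build_prompt("this is a badly formed sentence"))
# this is a badly formed sentence<NORM>
```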
Inference example:
```py
from transformers import pipeline, AutoTokenizer

model_id = "lunahr/CeluneNorm-0.6B-v1.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=model_id,
    device="cuda:0",  # "cpu" for CPU-only, slower
)

def normalize(text: str) -> str:
    history = [
        {"role": "user", "content": text}
    ]
    prompt = tokenizer.apply_chat_template(history, tokenize=False)

    out = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,
        return_full_text=False,
    )

    return out[0]["generated_text"].strip()

# example
print(normalize("if i type something more complicated into celune it will fix it"))
```

Caution: CeluneNorm works reliably only on sequences shorter than 128 tokens. Longer inputs may cause problems.

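One way to stay under the limit is to split long input into smaller pieces and normalize each piece separately. The sketch below is only an illustration: `split_for_normalization` and the default budget of 64 words are hypothetical choices, and word count is used as a rough stand-in for token count. Real token counts should come from the model's tokenizer. Chunk boundaries may also fall mid-sentence, which could affect casing and punctuation at the seams.

```python
from typing import List


def split_for_normalization(text: str, max_words: int = 64) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Assumption: word count is only a rough proxy for token count; one word can
    map to several tokens, so the budget is kept well below 128.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = split_for_normalization("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```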
### Key Characteristics

- Deterministic (no sampling required)
- Preserves structure and intent
- Handles mixed text (natural language + technical content)
- Conservative punctuation (prefers `.` over `!` unless explicit)
- Supports multi-sentence normalization when boundaries are clear

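Because the model is meant to adjust casing and punctuation rather than rewrite text, a pipeline can cheaply check that an output still carries the same letters and digits as its input. This is a sketch under that assumption, and `alnum_signature` and `is_conservative` are hypothetical names. The check would also flag any legitimate change to the characters themselves, so a failure is a prompt for review rather than proof of an error.

```python
def alnum_signature(text: str) -> str:
    """Lowercased letters and digits of text, with everything else dropped."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def is_conservative(source: str, normalized: str) -> bool:
    """True if the two strings differ only in case, spacing, or punctuation."""
    return alnum_signature(source) == alnum_signature(normalized)


print(is_conservative("this is a badly formed sentence",
                      "This is a badly formed sentence."))  # True
print(is_conservative("this is a badly formed sentence",
                      "This sentence is badly formed."))    # False
```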
---

- **Developed by:** https://huggingface.co/lunahr
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** MIT
- **Base model:** Qwen/Qwen3-0.6B-Base

---

## Limitations

This model is not intended to be a full grammar correction system.

Possible limitations include:

- May miss some punctuation or casing corrections
- May be conservative with contractions (e.g. `there s` → unchanged)
- May preserve ambiguous casing when intent is unclear
- Does not expand slang or rewrite informal language

The model prioritizes safety and meaning preservation over aggressive correction.

---

## Training Details

### Dataset

Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed

The dataset includes a mix of:

- Formal text (Wikipedia-style)
- Conversational text (PersonaChat)
- Synthetic edge cases
- Quoted text handling

This combination helps the model generalize across both clean and noisy inputs.

This version was also tuned on an additional 10k rows of casing data to improve accuracy.

---

### Training Procedure

- Fine-tuned from Qwen3-0.6B-Base
- Hardware: Kaggle dual NVIDIA T4 (FP16)
- Training time: ~1.5 hours + ~5 minutes (casing CFT)
- Epochs: 3 + 1 (casing CFT)

Training configuration highlights:

- Learning rate: 8e-5
- Gradient clipping: 1.0
- Warmup: 200 steps (~10%)

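If the ~10% figure refers to the whole run, 200 warmup steps imply roughly 2,000 optimizer steps in total. The sketch below works through that arithmetic and shows what a warmup to the stated peak rate could look like. The linear shape and the constant rate after warmup are assumptions, since the scheduler is not stated here, and `learning_rate` is a hypothetical helper.

```python
PEAK_LR = 8e-5       # learning rate from the configuration highlights
WARMUP_STEPS = 200   # warmup length from the configuration highlights


def learning_rate(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed).

    Assumptions: warmup is linear, and the rate is held at PEAK_LR afterwards.
    The scheduler actually used for training is not stated.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


# 200 steps at ~10% of training implies about 2000 steps overall.
approx_total_steps = round(WARMUP_STEPS / 0.10)
print(approx_total_steps)   # 2000
print(learning_rate(99))    # halfway through warmup, half of PEAK_LR
```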
---

### Metrics

- Final training loss: 0.08841 (0.006989 for casing CFT)
- Mean token accuracy: 97.53% (99.77% for casing CFT)

These metrics reflect token-level accuracy. Real-world normalization quality is slightly lower (~90–95% human-level correctness), and that figure is more representative of practical use.

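The gap between token-level and whole-output correctness follows from how errors accumulate: one wrong token is enough to make an entire output wrong. The sketch below computes both measures on toy data that is invented for illustration and is not an evaluation result. `token_accuracy` and `exact_match` are hypothetical helpers, and tokens are compared position by position.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[List[str], List[str]]  # (predicted tokens, reference tokens)


def token_accuracy(pairs: Sequence[Pair]) -> float:
    """Fraction of reference tokens matched at the same position."""
    correct = total = 0
    for predicted, reference in pairs:
        total += len(reference)
        correct += sum(p == r for p, r in zip(predicted, reference))
    return correct / total


def exact_match(pairs: Sequence[Pair]) -> float:
    """Fraction of outputs that match their reference completely."""
    return sum(pred == ref for pred, ref in pairs) / len(pairs)


# Toy data, invented for illustration only.
pairs = [
    ("This is fine .".split(), "This is fine .".split()),
    ("i am here .".split(), "I am here .".split()),
]
print(token_accuracy(pairs))  # 0.875
print(exact_match(pairs))     # 0.5
```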

---