---
license: mit
datasets:
- lunahr/normalization-data-mixed
language:
- en
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- text-transformation
- text-normalization
new_version: lunahr/CeluneNorm-0.6B-v1.3
---

# Model Card for CeluneNorm-0.6B-v1.2

## Model Details

### Model Description
CeluneNorm is a lightweight text normalization model designed for TTS and general preprocessing pipelines.

It converts poorly formatted input into clean, readable text while preserving the original meaning.

Example:

- Input: `this is a badly formed sentence`
- Output: `This is a badly formed sentence.`

The model is conservative by design:
- It does not rewrite sentences
- It avoids changing meaning
- It preserves domain-specific tokens (e.g. URLs, commands, names)
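
A caller who wants to verify this conservative behaviour could compare input and output after stripping casing, punctuation, and whitespace. The following is a minimal sketch, not part of the model or its tooling: `is_conservative` and `URL_PATTERN` are hypothetical names, and the URL regex is a deliberately simple assumption.

```python
import re

# Simplified URL matcher (an assumption of this sketch, not an exhaustive pattern).
URL_PATTERN = re.compile(r"https?://\S+")


def _skeleton(text: str) -> str:
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def is_conservative(source: str, normalized: str) -> bool:
    """Return True if `normalized` differs from `source` only in casing,
    punctuation, and whitespace, and every URL in `source` survives verbatim."""
    if _skeleton(source) != _skeleton(normalized):
        return False  # words were added, removed, or changed
    return all(url in normalized for url in URL_PATTERN.findall(source))
```

A check like this can flag outputs for manual review. It cannot judge whether the chosen casing or punctuation is actually correct.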

### Update
Version 1.2 improves output casing compared to version 1.1.

Use version 1.2 rather than 1.1 for the best accuracy.

Example output from this version compared with the previous one:
- 1.1: `I am currently speaking into the assessment suite so that it can measure all of my 7 voice traits, and tell me how Celune I am. I need to know that.`
- 1.2: `I am currently speaking into the Assessment Suite so that it can measure all of my 7 voice traits, and tell me how Celune I am. I need to know that.`

The newer version tends to capitalize specific names more often than the older one.

### Usage

The model expects input in the following format:
```
YOUR INPUT<NORM>
```

It will generate the normalized version of the input.
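
For illustration, the raw prompt format above can be produced with a one-line helper. This is a sketch only: `build_prompt` and `NORM_TAG` are hypothetical names, and stripping surrounding whitespace is an assumption rather than a documented requirement.

```python
NORM_TAG = "<NORM>"  # marker taken from the format shown above


def build_prompt(text: str) -> str:
    """Append the <NORM> marker to the input text."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return text.strip() + NORM_TAG


print(build_prompt("this is a badly formed sentence"))
```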

Inference example:
```py
from transformers import AutoTokenizer, pipeline

model_id = "lunahr/CeluneNorm-0.6B-v1.2"

# Load the tokenizer once and share it with the pipeline.
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device="cuda:0",  # use "cpu" for CPU-only inference (slower)
)

def normalize(text: str) -> str:
    """Return the normalized form of `text`."""
    # Build the prompt with the tokenizer's chat template.
    history = [
        {"role": "user", "content": text}
    ]
    prompt = tokenizer.apply_chat_template(history, tokenize=False)

    # Greedy decoding (no sampling) keeps the output deterministic.
    out = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,
        return_full_text=False,
    )

    return out[0]["generated_text"].strip()

# Example
print(normalize("if i type something more complicated into celune it will fix it"))
```

Caution: CeluneNorm only works reliably on sequences below 128 tokens. Longer inputs may produce unreliable output.
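
One way to stay under this limit is to split long input into word-aligned chunks before normalizing each one. The sketch below is an assumption-laden illustration, not part of the model: `chunk_words` is a hypothetical helper, and `count_tokens` is any callable that returns the token count of a string.

```python
from typing import Callable, List


def chunk_words(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 128,
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks whose token
    count stays below `max_tokens`."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = current + [word]
        # Flush the current chunk once adding the next word would reach the limit.
        if current and count_tokens(" ".join(candidate)) >= max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


# Demo with a whitespace word counter standing in for a real tokenizer.
print(chunk_words("a b c d e", lambda s: len(s.split()), max_tokens=3))
```

With the real tokenizer, `count_tokens` could be `lambda s: len(tokenizer.encode(s))`. Choose a `max_tokens` value with some headroom, since the prompt formatting adds tokens of its own. Splitting on word boundaries can cut a sentence in half, so the model may add a capital letter or a period at a chunk boundary that a human would not.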

### Key Characteristics

- Deterministic (no sampling required)
- Preserves structure and intent
- Handles mixed text (natural language + technical content)
- Conservative punctuation (prefers `.` over `!` unless the input explicitly calls for it)
- Supports multi-sentence normalization when boundaries are clear

---

- **Developed by:** https://huggingface.co/lunahr
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** MIT
- **Base model:** Qwen/Qwen3-0.6B-Base

---

## Limitations

This model is not intended to be a full grammar correction system.

Possible limitations include:

- May miss some punctuation or casing corrections
- May be conservative with contractions (e.g. `there s` → unchanged)
- May preserve ambiguous casing when intent is unclear
- Does not expand slang or rewrite informal language

The model prioritizes safety and meaning preservation over aggressive correction.

---

## Training Details

### Dataset

Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed

The dataset includes a mix of:

- Formal text (Wikipedia-style)
- Conversational text (PersonaChat)
- Synthetic edge cases
- Quoted text handling

This combination helps the model generalize across both clean and noisy inputs.

This version was also tuned on an additional 10k rows of casing data to improve accuracy.

---

### Training Procedure

- Fine-tuned from Qwen3-0.6B-Base
- Hardware: Kaggle dual NVIDIA T4 (FP16)
- Training time: ~1.5 hours + ~5 minutes (casing CFT)
- Epochs: 3 + 1 (casing CFT)

Training configuration highlights:

- Learning rate: 8e-5
- Gradient clipping: 1.0
- Warmup: 200 steps (~10%)
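
As a rough illustration of how these numbers relate, the sketch below derives the approximate total step count from the warmup figures and shows a linear warmup ramp. The linear shape is an assumption, since the scheduler type is not stated here, and all names are hypothetical.

```python
PEAK_LR = 8e-5          # learning rate from the list above
WARMUP_STEPS = 200      # warmup length from the list above
WARMUP_FRACTION = 0.10  # "~10%", so derived values are approximate

# 200 warmup steps at ~10% of training implies roughly 2000 steps in total.
approx_total_steps = round(WARMUP_STEPS / WARMUP_FRACTION)


def warmup_lr(step: int) -> float:
    """Learning rate under an assumed linear warmup, capped at the peak.

    Any decay after warmup is not modelled because it is not documented.
    """
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)


print(approx_total_steps, warmup_lr(100), warmup_lr(200))
```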

---

### Metrics

- Final training loss: 0.08841 (0.006989 for casing CFT)
- Mean token accuracy: 97.53% (99.77% for casing CFT)

These metrics reflect token-level accuracy. Real-world normalization quality is slightly lower, at an estimated ~90–95% correctness as judged by a human reader, and that figure is more representative of practical use.

---