Initialize project, with the model provided by the ModelHub XC community

Model: lunahr/CeluneNorm-0.6B-v1.3
Source: Original Platform
ModelHub XC
2026-04-28 04:09:39 +08:00
commit e564374a0f
12 changed files with 151746 additions and 0 deletions

36
.gitattributes vendored Normal file

@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text

164
README.md Normal file

@@ -0,0 +1,164 @@
---
license: mit
datasets:
- lunahr/normalization-data-mixed
language:
- en
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- text-transformation
- text-normalization
---
# Model Card for CeluneNorm-0.6B-v1.3
## Model Details
### Model Description
CeluneNorm is a lightweight text normalization model designed for TTS and general preprocessing pipelines.
It converts poorly formatted input into clean, readable text while preserving the original meaning.
Example:
- Input: `this is a badly formed sentence`
- Output: `This is a badly formed sentence.`
The model is conservative by design:
- It does not rewrite sentences
- It avoids changing meaning
- It preserves domain-specific tokens (e.g. URLs, commands, names)
### Update
Version 1.3 improves punctuation on outputs compared to version 1.2.
It is recommended to use version 1.3 for the best accuracy.
Here is an example of text output from this version, compared to the previous version:
- 1.2: `Picture this you use the Normalizer to normalize your input and it actually works that would be really good.`
- 1.3: `Picture this, you use the normalizer to normalize your input and it actually works. That would be really good.`
The newer version also tends to infer sentence boundaries in the given input.
### Usage
The model expects input in the following format:
```
YOUR INPUT<NORM>
```
It will generate the normalized version of the input.
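The bundled `chat_template.jinja` appends `<NORM>` to a user message only when the marker is not already present in it. A minimal Python sketch of that formatting rule is shown here; the name `build_prompt` is hypothetical and not part of the model's API, and the function handles a single message rather than a message list.

```python
NORM_MARKER = "<NORM>"


def build_prompt(text: str) -> str:
    """Return the text followed by <NORM>, unless the marker already appears in it."""
    if NORM_MARKER in text:
        # Mirrors the template: content that already contains the marker is passed through.
        return text
    return text + NORM_MARKER


print(build_prompt("this is a badly formed sentence"))
```

In practice `tokenizer.apply_chat_template` performs this step, so a helper like this is mainly useful for inspecting or logging the exact prompt string.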
Inference example:
```py
from transformers import pipeline, AutoTokenizer

model_id = "lunahr/CeluneNorm-0.6B-v1.3"

# The tokenizer carries the chat template that appends <NORM> to the input.
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device="cuda:0",  # "cpu" for CPU-only, slower
)


def normalize(text: str) -> str:
    history = [
        {"role": "user", "content": text}
    ]
    # Build the prompt string (YOUR INPUT<NORM>) without tokenizing it.
    prompt = tokenizer.apply_chat_template(history, tokenize=False)
    out = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding, deterministic output
        return_full_text=False,  # return only the generated continuation
    )
    return out[0]["generated_text"].strip()


# example
print(normalize("if i type something more complicated into celune it will fix it"))
```
Caution: CeluneNorm only works reliably on sequences below 128 tokens. Longer inputs may cause problems.
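One possible way to stay under that limit is to split long input into smaller pieces before normalizing each one. The sketch below is an assumption, not a strategy described by the model card: the helper `split_by_token_budget` and the stand-in counter `count_words` are hypothetical names, and splitting at arbitrary word boundaries may cut sentences in half and reduce output quality.

```python
from typing import Callable, List


def split_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 127,  # "below 128 tokens", per the caution above
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks within the budget."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the current chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Stand-in counter for illustration only: one "token" per word.
def count_words(s: str) -> int:
    return len(s.split())


print(split_by_token_budget("a b c d e", count_words, max_tokens=2))
```

With the tokenizer from the inference example, a counter such as `lambda s: len(tokenizer(s)["input_ids"])` could be passed instead. A single word that exceeds the budget on its own still ends up in its own oversized chunk, and the appended `<NORM>` marker also counts toward the sequence length.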
### Key Characteristics
- Deterministic (no sampling required)
- Preserves structure and intent
- Handles mixed text (natural language + technical content)
- Conservative punctuation (prefers `.` over `!` unless explicit)
- Supports multi-sentence normalization when boundaries are clear
---
- **Developed by:** https://huggingface.co/lunahr
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** MIT
- **Base model:** Qwen/Qwen3-0.6B-Base
---
## Limitations
This model is not intended to be a full grammar correction system.
Possible limitations include:
- May miss some punctuation or casing corrections
- May be conservative with contractions (e.g. `there s` → unchanged)
- May preserve ambiguous casing when intent is unclear
- Does not expand slang or rewrite informal language
The model prioritizes safety and meaning preservation over aggressive correction.
---
## Training Details
### Dataset
Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed
The dataset includes a mix of:
- Formal text (Wikipedia-style)
- Conversational text (PersonaChat)
- Synthetic edge cases
- Quoted text handling
This combination helps the model generalize across both clean and noisy inputs.
This version was also tuned on an additional 10k rows of casing data to improve accuracy.
---
### Training Procedure
- Fine-tuned from Qwen3-0.6B-Base
- Hardware: Kaggle dual NVIDIA T4 (FP16)
- Training time: ~1.5 hours + ~5 minutes (casing CFT)
- Epochs: 3 + 1 (casing CFT)
Training configuration highlights:
- Learning rate: 8e-5
- Gradient clipping: 1.0
- Warmup: 200 steps (~10%)
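As a rough consistency check on these numbers, the warmup figure implies an approximate total step count. The arithmetic below is a back-of-the-envelope sketch: it assumes the "~10%" refers to the main 3-epoch run only, and the results are estimates rather than values reported by the author.

```python
warmup_steps = 200
warmup_fraction = 0.10  # "~10%" in the card, so the results are approximate
epochs = 3  # main run only; the casing CFT epoch is excluded (assumption)

# If 200 steps are about 10% of training, the run has about 200 / 0.10 steps.
total_steps = round(warmup_steps / warmup_fraction)
steps_per_epoch = round(total_steps / epochs)

print(total_steps, steps_per_epoch)
```

Under these assumptions the main run comes to roughly 2,000 optimizer steps, or a little under 700 per epoch.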
---
### Metrics
- Final training loss: 0.08841 (0.006989 for casing CFT)
- Mean token accuracy: 97.53% (99.77% for casing CFT)
These metrics reflect token-level accuracy. Real-world normalization quality is slightly lower, at roughly 90-95% human-level correctness, and is the more representative figure.
---

6
added_tokens.json Normal file

@@ -0,0 +1,6 @@
{
"<NORM>": 151669,
"<|endoftext|>": 151643,
"<|im_end|>": 151645,
"<|im_start|>": 151644
}

1
chat_template.jinja Normal file

@@ -0,0 +1 @@
{% for message in messages %}{% if message['role'] == 'user' %}{% if '<NORM>' in message['content'] %}{{ message['content'] }}{% else %}{{ message['content'] }}<NORM>{% endif %}{% endif %}{% endfor %}

63
config.json Normal file

@@ -0,0 +1,63 @@
{
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"dtype": "float16",
"eos_token_id": 151643,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen3",
"num_attention_heads": 16,
"num_hidden_layers": 28,
"num_key_value_heads": 8,
"pad_token_id": null,
"rms_norm_eps": 1e-06,
"rope_parameters": {
"rope_theta": 1000000,
"rope_type": "default"
},
"sliding_window": null,
"tie_word_embeddings": true,
"transformers_version": "5.5.4",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151647
}

6
generation_config.json Normal file

@@ -0,0 +1,6 @@
{
"bos_token_id": 151643,
"eos_token_id": 151643,
"max_new_tokens": 2048,
"transformers_version": "5.5.4"
}

151388
merges.txt Normal file

File diff suppressed because it is too large

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1068c71ca36c245101cda3a350707c380a8e0989effa7d81e19309730f6622e4
size 1191542912

21
special_tokens_map.json Normal file

@@ -0,0 +1,21 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<NORM>"
],
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

3
tokenizer.json Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b7baed638506fe05577afe018db3b4ff02b4c6a684a112a8582a24596fcff7a
size 11418445

54
tokenizer_config.json Normal file

@@ -0,0 +1,54 @@
{
"add_bos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151669": {
"content": "<NORM>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<NORM>"
],
"bos_token": null,
"chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{% if '<NORM>' in message['content'] %}{{ message['content'] }}{% else %}{{ message['content'] }}<NORM>{% endif %}{% endif %}{% endfor %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"errors": "replace",
"extra_special_tokens": {},
"model_max_length": 131072,
"pad_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long