Initialize project; model provided by the ModelHub XC community
Model: lunahr/CeluneNorm-0.6B-v1.3 Source: Original Platform
36
.gitattributes
vendored
Normal file
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
164
README.md
Normal file
@@ -0,0 +1,164 @@
---
license: mit
datasets:
- lunahr/normalization-data-mixed
language:
- en
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
library_name: transformers
tags:
- text-transformation
- text-normalization
---

# Model Card for CeluneNorm-0.6B-v1.3

## Model Details

### Model Description
CeluneNorm is a lightweight text normalization model designed for TTS and general preprocessing pipelines.

It converts poorly formatted input into clean, readable text while preserving the original meaning.

Example:

- Input: `this is a badly formed sentence`
- Output: `This is a badly formed sentence.`

The model is conservative by design:
- It does not rewrite sentences
- It avoids changing meaning
- It preserves domain-specific tokens (e.g. URLs, commands, names)

### Update
Version 1.3 improves output punctuation compared to version 1.2.

It is recommended to use version 1.3 for the best accuracy.

Here is an example of text output from this version, compared to the previous version:
- 1.2: `Picture this you use the Normalizer to normalize your input and it actually works that would be really good.`
- 1.3: `Picture this, you use the normalizer to normalize your input and it actually works. That would be really good.`

The newer version also tends to infer sentence boundaries in the input.

### Usage

The model expects input in the following format:
```
YOUR INPUT<NORM>
```

It will generate the normalized version of the input.

Inference example:
```py
from transformers import pipeline, AutoTokenizer

model_id = "lunahr/CeluneNorm-0.6B-v1.3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=model_id,
    device="cuda:0", # "cpu" for CPU-only, slower
)

def normalize(text: str) -> str:
    history = [
        {"role": "user", "content": text}
    ]
    prompt = tokenizer.apply_chat_template(history, tokenize=False)

    out = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=False,
        return_full_text=False,
    )

    return out[0]["generated_text"].strip()

# example
print(normalize("if i type something more complicated into celune it will fix it"))
```

Caution: CeluneNorm only works reliably on sequences below 128 tokens. Longer inputs may cause problems.
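
One possible way to work within the 128-token limit is to split long inputs into smaller chunks and normalize each chunk separately. The sketch below is not part of the model's tooling: `split_into_chunks`, the default budget of 100, and the whitespace-based token counter are all assumptions made for illustration. Splitting at arbitrary word boundaries may also hurt sentence-boundary inference, so treat it as a starting point only.

```python
from typing import Callable, List


def split_into_chunks(
    text: str,
    max_tokens: int = 100,  # assumed budget, leaves headroom below 128
    # Crude stand-in counter (one token per word). With the real tokenizer
    # loaded, something like `lambda s: len(tokenizer(s)["input_ids"])`
    # would give actual token counts.
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks under a budget."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_into_chunks("a b c d e", max_tokens=2))
```

Each returned chunk could then be passed to `normalize` and the results joined with spaces. A single word that exceeds the budget on its own is kept as its own chunk rather than being split.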

### Key Characteristics

- Deterministic (no sampling required)
- Preserves structure and intent
- Handles mixed text (natural language + technical content)
- Conservative punctuation (prefers `.` over `!` unless explicit)
- Supports multi-sentence normalization when boundaries are clear

---

- **Developed by:** https://huggingface.co/lunahr
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** MIT
- **Base model:** Qwen/Qwen3-0.6B-Base

---

## Limitations

This model is not intended to be a full grammar correction system.

Possible limitations include:

- May miss some punctuation or casing corrections
- May be conservative with contractions (e.g. `there s` → unchanged)
- May preserve ambiguous casing when intent is unclear
- Does not expand slang or rewrite informal language

The model prioritizes safety and meaning preservation over aggressive correction.

---

## Training Details

### Dataset

Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed

The dataset includes a mix of:

- Formal text (Wikipedia-style)
- Conversational text (PersonaChat)
- Synthetic edge cases
- Quoted text handling

This combination helps the model generalize across both clean and noisy inputs.

This version was also tuned on an additional 10k rows of casing data to improve accuracy.

---

### Training Procedure

- Fine-tuned from Qwen3-0.6B-Base
- Hardware: Kaggle dual NVIDIA T4 (FP16)
- Training time: ~1.5 hours + ~5 minutes (casing CFT)
- Epochs: 3 + 1 (casing CFT)

Training configuration highlights:

- Learning rate: 8e-5
- Gradient clipping: 1.0
- Warmup: 200 steps (~10%)
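
The highlights above give a peak learning rate and a warmup length but not the schedule shape. As a rough illustration only, the sketch below assumes a linear warmup from zero and a constant rate afterwards; both are assumptions, and the actual run may have used a decaying schedule. It also shows the arithmetic behind "~10%": 200 warmup steps at about 10% implies roughly 2000 optimizer steps in total.

```python
PEAK_LR = 8e-5        # "Learning rate: 8e-5"
WARMUP_STEPS = 200    # "Warmup: 200 steps"
WARMUP_FRACTION = 0.10  # "(~10%)", approximate


def warmup_lr(step: int, peak_lr: float = PEAK_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to peak_lr, then constant (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


# Implied total number of optimizer steps, given the approximate fraction.
estimated_total_steps = round(WARMUP_STEPS / WARMUP_FRACTION)

print(warmup_lr(0), warmup_lr(100), warmup_lr(200), estimated_total_steps)
```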

---

### Metrics

- Final training loss: 0.08841 (0.006989 for casing CFT)
- Mean token accuracy: 97.53% (99.77% for casing CFT)

These metrics are token-level. Real-world normalization quality is somewhat lower (~90–95% human-level correctness), and that figure is the more representative one.
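
To see why a token-level figure can sit well above whole-output correctness, the sketch below compares the two on toy data. The function names and the example sequences are made up for illustration, and this is not the evaluation code used for the numbers above: a single wrong token barely moves token accuracy but makes the entire output count as incorrect.

```python
from typing import List, Sequence


def mean_token_accuracy(preds: Sequence[Sequence[str]],
                        refs: Sequence[Sequence[str]]) -> float:
    """Fraction of reference tokens matched at the same position."""
    correct = 0
    total = 0
    for pred, ref in zip(preds, refs):
        total += len(ref)
        # zip stops at the shorter sequence, so missing tokens count as wrong.
        correct += sum(1 for p, r in zip(pred, ref) if p == r)
    return correct / total


def exact_match_rate(preds: Sequence[Sequence[str]],
                     refs: Sequence[Sequence[str]]) -> float:
    """Fraction of outputs that match their reference exactly."""
    matches = sum(1 for pred, ref in zip(preds, refs) if list(pred) == list(ref))
    return matches / len(refs)


refs: List[List[str]] = [["This", "is", "fine", "."], ["Hello", "there", "."]]
preds: List[List[str]] = [["This", "is", "fine", "."], ["Hello", "there", "!"]]

print(mean_token_accuracy(preds, refs))  # 6 of 7 tokens match
print(exact_match_rate(preds, refs))     # 1 of 2 outputs match
```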

---
6
added_tokens.json
Normal file
@@ -0,0 +1,6 @@
{
  "<NORM>": 151669,
  "<|endoftext|>": 151643,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644
}
1
chat_template.jinja
Normal file
@@ -0,0 +1 @@
{% for message in messages %}{% if message['role'] == 'user' %}{% if '<NORM>' in message['content'] %}{{ message['content'] }}{% else %}{{ message['content'] }}<NORM>{% endif %}{% endif %}{% endfor %}
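
For readers less familiar with Jinja, the template above can be mirrored in plain Python. This is a sketch for reading purposes, not code from the repository; `render_prompt` and `NORM_TOKEN` are hypothetical names. It shows the three behaviors the template encodes: only `user` messages are rendered, `<NORM>` is appended when missing, and content that already contains `<NORM>` is passed through unchanged.

```python
from typing import Dict, List

NORM_TOKEN = "<NORM>"


def render_prompt(messages: List[Dict[str, str]]) -> str:
    """Plain-Python mirror of the chat template's logic."""
    parts: List[str] = []
    for message in messages:
        if message["role"] != "user":
            continue  # the template emits nothing for non-user roles
        content = message["content"]
        if NORM_TOKEN in content:
            parts.append(content)  # already tagged, pass through as-is
        else:
            parts.append(content + NORM_TOKEN)
    return "".join(parts)


print(render_prompt([{"role": "user", "content": "hello world"}]))
```

Note that the template adds no separator between messages, so several user turns would be concatenated directly.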
63
config.json
Normal file
@@ -0,0 +1,63 @@
{
  "architectures": [
    "Qwen3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "float16",
  "eos_token_id": 151643,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_types": [
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention",
    "full_attention"
  ],
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen3",
  "num_attention_heads": 16,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pad_token_id": null,
  "rms_norm_eps": 1e-06,
  "rope_parameters": {
    "rope_theta": 1000000,
    "rope_type": "default"
  },
  "sliding_window": null,
  "tie_word_embeddings": true,
  "transformers_version": "5.5.4",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151647
}
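
A quick consistency check that can be run on these files: a token ID is only usable if it is a valid row index into the embedding matrix, that is, less than `vocab_size`. The sketch below hard-codes the four IDs from `added_tokens.json` and the `vocab_size` from `config.json` as they appear in this commit; the helper name `ids_at_or_above` is hypothetical.

```python
from typing import Dict, List

# Copied from added_tokens.json in this commit.
ADDED_TOKENS: Dict[str, int] = {
    "<NORM>": 151669,
    "<|endoftext|>": 151643,
    "<|im_end|>": 151645,
    "<|im_start|>": 151644,
}

# Copied from config.json in this commit.
VOCAB_SIZE = 151647


def ids_at_or_above(tokens: Dict[str, int], vocab_size: int) -> List[str]:
    """Return token names whose ID is not a valid embedding row index."""
    return sorted(name for name, token_id in tokens.items()
                  if token_id >= vocab_size)


print(ids_at_or_above(ADDED_TOKENS, VOCAB_SIZE))
```

With these values, `<NORM>` (151669) is the only listed ID that is not below `vocab_size` (151647). The IDs used at inference come from `tokenizer.json`, which is stored in LFS and not visible in this diff, so this is worth verifying against the loaded tokenizer rather than treating it as a confirmed problem.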
6
generation_config.json
Normal file
@@ -0,0 +1,6 @@
{
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "max_new_tokens": 2048,
  "transformers_version": "5.5.4"
}
151388
merges.txt
Normal file
File diff suppressed because it is too large
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1068c71ca36c245101cda3a350707c380a8e0989effa7d81e19309730f6622e4
size 1191542912
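
The pointer's `size` can be sanity-checked against `config.json` with a back-of-the-envelope parameter count. The sketch below assumes the usual Qwen3 decoder layout (bias-free q/k/v/o projections, a gated MLP with three matrices, two RMSNorms over the hidden size plus q/k norms over `head_dim` per layer) and that the tied output head is stored only once. These layout details are assumptions and are not stated in this commit.

```python
# Values copied from config.json in this commit.
HIDDEN = 1024
LAYERS = 28
HEADS = 16
KV_HEADS = 8
HEAD_DIM = 128
INTERMEDIATE = 3072
VOCAB = 151647
BYTES_PER_PARAM = 2       # "dtype": "float16"
FILE_SIZE = 1191542912    # "size" from the LFS pointer above


def count_parameters() -> int:
    """Estimate the parameter count under the assumed Qwen3 layout."""
    q_proj = HIDDEN * HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE          # gate, up and down projections
    norms = 2 * HIDDEN + 2 * HEAD_DIM        # layer norms plus q/k norms
    per_layer = q_proj + k_proj + v_proj + o_proj + mlp + norms
    embeddings = VOCAB * HIDDEN              # shared with the output head
    final_norm = HIDDEN
    return embeddings + LAYERS * per_layer + final_norm


params = count_parameters()
weight_bytes = params * BYTES_PER_PARAM
print(params, weight_bytes, FILE_SIZE - weight_bytes)
```

Under these assumptions the estimate is 595,753,984 parameters, or 1,191,507,968 bytes in FP16. That is 34,944 bytes below the pointer's size, a gap that would be consistent with the file's metadata header.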
21
special_tokens_map.json
Normal file
@@ -0,0 +1,21 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<NORM>"
  ],
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
3
tokenizer.json
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b7baed638506fe05577afe018db3b4ff02b4c6a684a112a8582a24596fcff7a
size 11418445
54
tokenizer_config.json
Normal file
@@ -0,0 +1,54 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151669": {
      "content": "<NORM>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<NORM>"
  ],
  "bos_token": null,
  "chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{% if '<NORM>' in message['content'] %}{{ message['content'] }}{% else %}{{ message['content'] }}<NORM>{% endif %}{% endif %}{% endfor %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
1
vocab.json
Normal file
File diff suppressed because one or more lines are too long