Initialize project; model provided by the ModelHub XC community

Model: lunahr/CeluneNorm-0.6B-v1.3 Source: Original Platform
2026-04-28 04:09:39 +08:00
commit e564374a0f
12 changed files with 151746 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,36 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,164 @@
+---
+license: mit
+datasets:
+- lunahr/normalization-data-mixed
+language:
+- en
+base_model:
+- Qwen/Qwen3-0.6B-Base
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- text-transformation
+- text-normalization
+---
+
+# Model Card for CeluneNorm-0.6B-v1.3
+
+## Model Details
+
+### Model Description
+CeluneNorm is a lightweight text normalization model designed for TTS and general preprocessing pipelines.
+
+It converts poorly formatted input into clean, readable text while preserving the original meaning.
+
+Example:
+
+- Input: `this is a badly formed sentence`
+- Output: `This is a badly formed sentence.`
+
+The model is conservative by design:
+- It does not rewrite sentences
+- It avoids changing meaning
+- It preserves domain-specific tokens (e.g. URLs, commands, names)
+
+### Update
+Version 1.3 improves output punctuation compared to version 1.2.
+
+It is recommended to use version 1.3 for the best accuracy.
+
+Here is an example of text output from this version, compared to the previous version:
+- 1.2: `Picture this you use the Normalizer to normalize your input and it actually works that would be really good.`
+- 1.3: `Picture this, you use the normalizer to normalize your input and it actually works. That would be really good.`
+
+The newer version also tends to infer sentence boundaries in the given input.
+
+### Usage
+
+The model expects input in the following format:
+```
+YOUR INPUT<NORM>
+```
+
+It will generate the normalized version of the input.
+
+Inference example:
+```py
+from transformers import pipeline, AutoTokenizer
+
+model_id = "lunahr/CeluneNorm-0.6B-v1.3"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+pipe = pipeline(
+    "text-generation",
+    model=model_id,
+    tokenizer=tokenizer,  # reuse the tokenizer loaded above
+    device="cuda:0",  # "cpu" for CPU-only, slower
+)
+
+def normalize(text: str) -> str:
+    history = [
+        {"role": "user", "content": text}
+    ]
+    prompt = tokenizer.apply_chat_template(history, tokenize=False)
+
+    out = pipe(
+        prompt,
+        max_new_tokens=512,
+        do_sample=False,
+        return_full_text=False,
+    )
+
+    return out[0]["generated_text"].strip()
+
+# example
+print(normalize("if i type something more complicated into celune it will fix it"))
+```
+
+Caution: CeluneNorm only works reliably on sequences below 128 tokens. Longer inputs may cause problems.
+
+### Key Characteristics
+
+- Deterministic (no sampling required)
+- Preserves structure and intent
+- Handles mixed text (natural language + technical content)
+- Conservative punctuation (prefers `.` over `!` unless explicit)
+- Supports multi-sentence normalization when boundaries are clear
+
+---
+
+- **Developed by:** https://huggingface.co/lunahr  
+- **Model type:** Causal Language Model  
+- **Language(s):** English  
+- **License:** MIT  
+- **Base model:** Qwen/Qwen3-0.6B-Base  
+
+---
+
+## Limitations
+
+This model is not intended to be a full grammar correction system.
+
+Possible limitations include:
+
+- May miss some punctuation or casing corrections
+- May be conservative with contractions (e.g. `there s` → unchanged)
+- May preserve ambiguous casing when intent is unclear
+- Does not expand slang or rewrite informal language
+
+The model prioritizes safety and meaning preservation over aggressive correction.
+
+---
+
+## Training Details
+
+### Dataset
+
+Trained on: https://huggingface.co/datasets/lunahr/normalization-data-mixed
+
+The dataset includes a mix of:
+
+- Formal text (Wikipedia-style)
+- Conversational text (PersonaChat)
+- Synthetic edge cases
+- Quoted text handling
+
+This combination helps the model generalize across both clean and noisy inputs.
+
+This version was also tuned on an additional 10k rows of casing data to improve accuracy.
+
+---
+
+### Training Procedure
+
+- Fine-tuned from Qwen3-0.6B-Base
+- Hardware: Kaggle dual NVIDIA T4 (FP16)
+- Training time: ~1.5 hours + ~5 minutes (casing CFT)
+- Epochs: 3 + 1 (casing CFT)
+
+Training configuration highlights:
+
+- Learning rate: 8e-5
+- Gradient clipping: 1.0
+- Warmup: 200 steps (~10%)
+
+---
+
+### Metrics
+
+- Final training loss: 0.08841 (0.006989 for casing CFT)
+- Mean token accuracy: 97.53% (99.77% for casing CFT) 
+
+These metrics measure token-level accuracy. Real-world normalization quality is slightly lower, at roughly 90–95% human-level correctness, and is the more representative figure.
+
+---
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -0,0 +1,6 @@
+{
+  "<NORM>": 151669,
+  "<|endoftext|>": 151643,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644
+}
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1 @@
+{% for message in messages %}{% if message['role'] == 'user' %}{% if '<NORM>' in message['content'] %}{{ message['content'] }}{% else %}{{ message['content'] }}<NORM>{% endif %}{% endif %}{% endfor %}
--- a/config.json
+++ b/config.json
@@ -0,0 +1,63 @@
+{
+  "architectures": [
+    "Qwen3ForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 151643,
+  "dtype": "float16",
+  "eos_token_id": 151643,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_types": [
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention"
+  ],
+  "max_position_embeddings": 32768,
+  "max_window_layers": 28,
+  "model_type": "qwen3",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 8,
+  "pad_token_id": null,
+  "rms_norm_eps": 1e-06,
+  "rope_parameters": {
+    "rope_theta": 1000000,
+    "rope_type": "default"
+  },
+  "sliding_window": null,
+  "tie_word_embeddings": true,
+  "transformers_version": "5.5.4",
+  "use_cache": true,
+  "use_sliding_window": false,
+  "vocab_size": 151647
+}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,6 @@
+{
+  "bos_token_id": 151643,
+  "eos_token_id": 151643,
+  "max_new_tokens": 2048,
+  "transformers_version": "5.5.4"
+}
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1068c71ca36c245101cda3a350707c380a8e0989effa7d81e19309730f6622e4
+size 1191542912
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,21 @@
+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<NORM>"
+  ],
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6b7baed638506fe05577afe018db3b4ff02b4c6a684a112a8582a24596fcff7a
+size 11418445
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,54 @@
+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151669": {
+      "content": "<NORM>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<NORM>"
+  ],
+  "bos_token": null,
+  "chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}{% if '<NORM>' in message['content'] %}{{ message['content'] }}{% else %}{{ message['content'] }}<NORM>{% endif %}{% endif %}{% endfor %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
--- a/vocab.json
+++ b/vocab.json