Initialize project; model provided by the ModelHub XC community

Model: g023/Qwen3-1.77B-g023 Source: Original Platform
2026-04-13 16:37:05 +08:00
commit 81019c5698
12 changed files with 152297 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,211 @@
+---
+license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- text-generation
+- transformers
+- qwen3
+- qwen
+- ai
+- llm
+- thinking
+base_model:
+- Qwen/Qwen3-1.7B
+---
+# Qwen3-1.77B-g023 (Full Precision)
+
+## Overview
+
+This is an optimized variant of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) created by duplicating **layer 21** to produce a 29-layer model (up from the original 28). The duplication point was chosen through five rounds of iterative testing across layers 9–25, scoring each candidate on factual accuracy, perplexity, repetition, and thinking-mode functionality.
+
+## TurboQuant-able?
+Yes, this model works with TurboQuant: [g023/turboquant](https://github.com/g023/turboquant)
+
+## Key Result
+
+| Metric | Baseline (28 layers) | This Model (29 layers) |
+|---|---|---|
+| **Overall Score** | 85.9 / 100 | **93.6 / 100** (+7.7) |
+| **Factual Accuracy** | 7 / 9 | **9 / 9** |
+| **Avg Perplexity** (lower is better) | 17.71 | 19.50 |
+| **Thinking Mode** | Working | Working |
+| **Non-Thinking Mode** | Working | Working |
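+
+For readers interpreting the perplexity row, here is a minimal sketch of the standard definition: the exponential of the mean negative log-likelihood per token. The evaluation script behind the table is not shown in this README, so the function name and the sample log-probabilities are illustrative assumptions rather than the author's harness.
+
+```python
+import math
+
+def perplexity(token_logprobs):
+    """Perplexity = exp(mean negative log-likelihood) over the scored tokens."""
+    if not token_logprobs:
+        raise ValueError("need at least one token log-probability")
+    mean_nll = -sum(token_logprobs) / len(token_logprobs)
+    return math.exp(mean_nll)
+
+# Hypothetical natural-log probabilities for four tokens (not real eval data).
+sample = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
+print(round(perplexity(sample), 4))
+```
+
+Because lower perplexity is better, the table shows the 29-layer model giving up some perplexity (17.71 to 19.50) while gaining on the overall score.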
+
+## Architecture
+
+| Parameter | Value |
+|---|---|
+| Layers | 29 (28 original + 1 duplicated) |
+| Hidden Size | 2048 |
+| Intermediate Size | 6144 |
+| Attention Heads | 16 (query) / 8 (KV) |
+| Head Dimension | 128 |
+| Vocab Size | 151,936 |
+| Max Position Embeddings | 40,960 |
+| Total Parameters | ~1.77B |
+| Dtype | bfloat16 |
+| Tied Embeddings | Yes |
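+
+The ~1.77B total can be sanity-checked from the table with a little arithmetic. The sketch below assumes the usual Qwen3 layer layout (bias-free attention projections, a gated MLP with three projections, two RMSNorms per layer, and per-head query/key norms); that layout is an assumption taken from the base architecture, not something stated in this README.
+
+```python
+# Values from the Architecture table above.
+HIDDEN = 2048
+INTERMEDIATE = 6144
+Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 128
+VOCAB = 151_936
+LAYERS = 29
+
+def layer_params():
+    """Parameters in one decoder layer (assumed Qwen3 layout, no biases)."""
+    q = HIDDEN * Q_HEADS * HEAD_DIM
+    k = HIDDEN * KV_HEADS * HEAD_DIM
+    v = HIDDEN * KV_HEADS * HEAD_DIM
+    o = Q_HEADS * HEAD_DIM * HIDDEN
+    mlp = 3 * HIDDEN * INTERMEDIATE      # gate, up, and down projections
+    norms = 2 * HIDDEN + 2 * HEAD_DIM    # two RMSNorms plus q/k norms
+    return q + k + v + o + mlp + norms
+
+def total_params(num_layers=LAYERS):
+    """Total parameters, counting the tied embedding matrix once."""
+    embeddings = VOCAB * HIDDEN
+    final_norm = HIDDEN
+    return embeddings + num_layers * layer_params() + final_norm
+
+print(f"per layer: {layer_params():,}")
+print(f"28 layers: {total_params(28) / 1e9:.2f}B")
+print(f"29 layers: {total_params(29) / 1e9:.2f}B")
+```
+
+Under these assumptions one extra layer adds roughly 50M parameters, which accounts for the step from 1.72B to 1.77B.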
+
+## Layer Mapping
+
+```
+Source Layer  →  Output Layer
+0–20         →  0–20   (unchanged)
+21           →  21, 22 (duplicated with noise std=0.001 + depth scaling)
+22–27        →  23–28  (shifted +1)
+```
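+
+The mapping above can be written as a small helper. This is only an illustrative sketch; the function name `output_layers` and its parameters are hypothetical and do not come from the author's tooling.
+
+```python
+def output_layers(source_layer, dup_layer=21, num_source_layers=28):
+    """Return the output-layer indices that a source layer maps to."""
+    if not 0 <= source_layer < num_source_layers:
+        raise ValueError(f"source layer {source_layer} is out of range")
+    if source_layer < dup_layer:
+        return [source_layer]                    # unchanged
+    if source_layer == dup_layer:
+        return [source_layer, source_layer + 1]  # original plus its duplicate
+    return [source_layer + 1]                    # shifted by one
+
+for src in (0, 20, 21, 22, 27):
+    print(src, "->", output_layers(src))
+```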
+
+## Duplication Method
+
+- **Noise injection**: Gaussian noise (std=0.001) added to the duplicated layer to break symmetry
+- **Depth scaling**: Factor of √(28/29) ≈ 0.983 applied to prevent activation explosion
+- **Anchors preserved**: First layer (0) and last layer (27→28) remain unmodified
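+
+As a rough illustration of the two numeric steps above, the toy sketch below perturbs a flat list of weights with Gaussian noise and then applies the depth-scaling factor. It is not the author's script: the README does not say which tensors are scaled or in what order the steps run, so applying noise first and scaling only the duplicated copy are assumptions, and `duplicate_weights` is a hypothetical name.
+
+```python
+import math
+import random
+
+def duplicate_weights(weights, noise_std=0.001, n_before=28, n_after=29, seed=0):
+    """Copy weights, add Gaussian noise, then scale by sqrt(n_before / n_after).
+
+    Assumption: noise is added before scaling, and only the copy is modified.
+    """
+    rng = random.Random(seed)
+    scale = math.sqrt(n_before / n_after)  # about 0.983
+    return [(w + rng.gauss(0.0, noise_std)) * scale for w in weights]
+
+original = [0.5, -1.25, 2.0]
+duplicate = duplicate_weights(original)
+print(f"scale = {math.sqrt(28 / 29):.3f}")
+print(duplicate)
+```
+
+In a real checkpoint the same idea would be applied tensor by tensor to the copied layer before the later layers are renumbered.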
+
+## Files
+
+| File | Size | Description |
+|---|---|---|
+| `model-00001-of-00001.safetensors` | 3.3 GB | Model weights (bfloat16) |
+| `config.json` | <1 KB | Model configuration |
+| `tokenizer.json` | 11 MB | Tokenizer |
+| `tokenizer_config.json` | 10 KB | Tokenizer configuration |
+| `vocab.json` | 2.7 MB | Vocabulary |
+| `merges.txt` | 1.6 MB | BPE merges |
+| `generation_config.json` | <1 KB | Generation defaults |
+| `eval_results.json` | 1 KB | Full evaluation metrics |
+
+## Usage
+
+```python
+# Tweakable parameters
+# MODEL_PATH = "./Qwen3-BEST" # local run
+MODEL_PATH = "g023/Qwen3-1.77B-g023"
+MAX_NEW_TOKENS = 8192
+TEMPERATURE = 0.7
+DO_SAMPLE = True
+TOP_P = 0.9
+TOP_K = 50
+REPETITION_PENALTY = 1.1
+STREAMING = True  # Set to True for streaming inference
+INPUT_MESSAGE = "You are completing the next step in a task to create an arcade game in javascript. Your available tools are rationalize, red_green_tdd, and create_plan. Synthesize their output when reasoning. "
+
+from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+import time
+
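+def load_model():
+    """Load the model and tokenizer from MODEL_PATH."""
+    print("Loading model...")
+    model = AutoModelForCausalLM.from_pretrained(
+        MODEL_PATH,
+        torch_dtype="auto",  # keep the checkpoint's stored dtype (bfloat16)
+        device_map="auto",   # place weights on the available device(s)
+    )
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+    print("Model loaded.")
+    return model, tokenizer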
+
+def inference_non_streaming(model, tokenizer, messages):
+    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
+    inputs = tokenizer(text, return_tensors="pt").to(model.device)
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=MAX_NEW_TOKENS,
+        temperature=TEMPERATURE,
+        do_sample=DO_SAMPLE,
+        top_p=TOP_P,
+        top_k=TOP_K,
+        repetition_penalty=REPETITION_PENALTY,
+    )
+    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+    print("Response:", response)
+    return response
+
+def inference_streaming(model, tokenizer, messages):
+    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
+    inputs = tokenizer(text, return_tensors="pt").to(model.device)
+    # TextStreamer prints decoded text to stdout as tokens are generated
+    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=MAX_NEW_TOKENS,
+        temperature=TEMPERATURE,
+        do_sample=DO_SAMPLE,
+        top_p=TOP_P,
+        top_k=TOP_K,
+        repetition_penalty=REPETITION_PENALTY,
+        streamer=streamer,
+    )
+    # Decode only the newly generated tokens so the caller also gets the full text
+    final_response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+    return final_response
+
+def llm_stream(model, tokenizer, conversation):
+    """Stream a response to stdout and return reasoning, content, and usage stats."""
+    start_time = time.time()
+    text = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, enable_thinking=True)
+    inputs = tokenizer(text, return_tensors="pt").to(model.device)
+    from io import StringIO
+    buffer = StringIO()
+    # Streamer that prints each finalized chunk and also keeps a copy in `buffer`
+    class CapturingTextStreamer(TextStreamer):
+        def __init__(self, tokenizer, buffer):
+            super().__init__(tokenizer, skip_prompt=True, skip_special_tokens=True)
+            self.buffer = buffer
+        def on_finalized_text(self, text, stream_end=False):
+            self.buffer.write(text)
+            print(text, end="", flush=True)
+    streamer = CapturingTextStreamer(tokenizer, buffer)
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=MAX_NEW_TOKENS,
+        temperature=TEMPERATURE,
+        do_sample=DO_SAMPLE,
+        top_p=TOP_P,
+        top_k=TOP_K,
+        repetition_penalty=REPETITION_PENALTY,
+        streamer=streamer,
+    )
+    response = buffer.getvalue()
+
+    if "</think>" in response:
+        parts = response.rsplit("</think>", 1)
+        reasoning = parts[0].strip()
+        content = parts[1].strip()
+    else:
+        reasoning = ""
+        content = response.strip()
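+    # Token counts below are estimated from character counts using a fixed
+    # chars-per-token heuristic, so the usage numbers are approximate.
+    char_per_token = 3.245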
+    reasoning_tokens = round(len(reasoning) / char_per_token)
+    content_tokens = round(len(content) / char_per_token)
+    total_tokens = reasoning_tokens + content_tokens
+    time_taken = time.time() - start_time
+    ret_dict = {
+        "reasoning": reasoning,
+        "content": content,
+        "usage": {
+            "reasoning_tokens": reasoning_tokens,
+            "content_tokens": content_tokens,
+            "total_tokens": total_tokens,
+        },
+        "time_taken": time_taken,
+    }
+    return ret_dict
+
+if __name__ == "__main__":
+    model, tokenizer = load_model()
+    messages = [{"role": "user", "content": INPUT_MESSAGE}]
+    if STREAMING:
+        ret = llm_stream(model, tokenizer, messages)
+        print("Result dict:", ret)
+
+        # Approximate throughput from the estimated token count and wall-clock time
+        if ret["usage"]["total_tokens"] > 0 and ret["time_taken"] > 0:
+            tps = ret["usage"]["total_tokens"] / ret["time_taken"]
+            print(f"Tokens per second: {tps:.2f}")
+    else:
+        inference_non_streaming(model, tokenizer, messages)
+```
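+
+The reasoning/answer split used in `llm_stream` can be exercised on its own without loading the model. The helper below is a hedged sketch with a hypothetical name, `split_reasoning`; it splits on the last `</think>` as the usage code does, and it also drops an opening `<think>` tag if one is present, which is an assumption about the raw output.
+
+```python
+def split_reasoning(response, open_tag="<think>", close_tag="</think>"):
+    """Split raw model text into (reasoning, content) on the last closing tag."""
+    if close_tag in response:
+        reasoning, content = response.rsplit(close_tag, 1)
+        # Drop the first opening tag, if any, so only the reasoning text remains.
+        reasoning = reasoning.replace(open_tag, "", 1).strip()
+        return reasoning, content.strip()
+    # No closing tag: treat everything as the final answer.
+    return "", response.strip()
+
+raw = "<think>\nThe user wants a plan.\n</think>\n\nStep 1: set up the canvas."
+reasoning, content = split_reasoning(raw)
+print("reasoning:", reasoning)
+print("content:", content)
+```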
+
+## Base Model
+
+- **Model**: Qwen/Qwen3-1.7B
+- **Architecture**: Qwen3ForCausalLM (decoder-only transformer with GQA)
+- **License**: Apache 2.0
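+
+Since the architecture uses GQA with 8 KV heads against 16 query heads, the KV cache is half the size it would be with one KV head per query head. The sketch below estimates cache size from the Architecture table; it assumes a bfloat16 cache, batch size 1, and no cache quantization, and `kv_cache_bytes` is a hypothetical helper name.
+
+```python
+# Values from the Architecture table; cache dtype is assumed to be bfloat16.
+LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 29, 16, 8, 128
+BYTES_PER_VALUE = 2
+
+def kv_cache_bytes(num_tokens, kv_heads=KV_HEADS):
+    """Bytes needed to cache keys and values for num_tokens (batch size 1)."""
+    # Factor 2 covers keys and values, stored per layer, per KV head, per dim.
+    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * num_tokens
+
+per_token = kv_cache_bytes(1)
+full_context = kv_cache_bytes(40_960)
+print(f"per token: {per_token / 1024:.0f} KiB")
+print(f"40,960 tokens: {full_context / 2**30:.2f} GiB")
+print(f"without GQA (16 KV heads): {kv_cache_bytes(40_960, Q_HEADS) / 2**30:.2f} GiB")
+```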