ModelHub XC 81019c5698 Initialize project; model provided by the ModelHub XC community
Model: g023/Qwen3-1.77B-g023
Source: Original Platform
2026-04-13 16:37:05 +08:00


---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation
- transformers
- qwen3
- qwen
- ai
- llm
- thinking
base_model:
- Qwen/Qwen3-1.7B
---
# Qwen3-1.77B-g023 (Full Precision)
## Overview
This is an optimized variant of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) created by duplicating **layer 21** to produce a 29-layer model (up from the original 28). The optimal duplication point was found through 5 rounds of iterative testing across layers 9–25, evaluating factual accuracy, perplexity, repetition, and thinking mode functionality.
## TurboQuant-able?
Why yes, yes it is: https://github.com/g023/turboquant
## Key Result
| Metric | Baseline (28 layers) | This Model (29 layers) |
|---|---|---|
| **Overall Score** | 85.9 / 100 | **93.6 / 100** (+7.7) |
| **Factual Accuracy** | 7 / 9 | **9 / 9** |
| **Avg Perplexity** (lower is better) | 17.71 | 19.50 |
| **Thinking Mode** | Working | Working |
| **Non-Thinking Mode** | Working | Working |
## Architecture
| Parameter | Value |
|---|---|
| Layers | 29 (28 original + 1 duplicated) |
| Hidden Size | 2048 |
| Intermediate Size | 6144 |
| Attention Heads | 16 (query) / 8 (KV) |
| Head Dimension | 128 |
| Vocab Size | 151,936 |
| Max Position Embeddings | 40,960 |
| Total Parameters | ~1.77B |
| Dtype | bfloat16 |
| Tied Embeddings | Yes |
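
The ~1.77B total can be sanity-checked from the other rows of the table. The sketch below is a back-of-the-envelope count, not the author's tooling. It assumes the usual Qwen3 layer layout (bias-free Q/K/V/O projections, a three-matrix gated MLP, two RMSNorms per layer plus per-head Q/K norms) and counts the tied embedding matrix once; the function names are illustrative.

```python
# Values taken from the Architecture table above.
HIDDEN = 2048
INTERMEDIATE = 6144
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 128
VOCAB = 151_936


def layer_params() -> int:
    """Parameters in one decoder layer (assumed layout, see lead-in)."""
    q = HIDDEN * Q_HEADS * HEAD_DIM
    k = HIDDEN * KV_HEADS * HEAD_DIM
    v = HIDDEN * KV_HEADS * HEAD_DIM
    o = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE      # gate, up, and down projections
    norms = 2 * HIDDEN + 2 * HEAD_DIM    # two RMSNorms + per-head Q/K norms
    return q + k + v + o + mlp + norms


def total_params(num_layers: int) -> int:
    """Total parameters, counting the tied embedding matrix once."""
    embeddings = VOCAB * HIDDEN
    final_norm = HIDDEN
    return num_layers * layer_params() + embeddings + final_norm


if __name__ == "__main__":
    for n in (28, 29):
        print(f"{n} layers: {total_params(n) / 1e9:.2f}B parameters")
    # bfloat16 stores 2 bytes per parameter
    print(f"29-layer bf16 size: {total_params(29) * 2 / 2**30:.1f} GiB")
```

Under these assumptions the extra layer adds roughly 50M parameters, which accounts for the step from about 1.72B to about 1.77B.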
## Layer Mapping
```
Source Layer → Output Layer
0–20 → 0–20 (unchanged)
21 → 21, 22 (duplicated with noise std=0.001 + depth scaling)
22–27 → 23–28 (shifted +1)
```
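
The mapping above can be written as a small index function. This is a minimal sketch for illustration only; `build_layer_map` is a hypothetical helper name, not part of this repository.

```python
def build_layer_map(num_source_layers: int = 28, dup_layer: int = 21) -> list[int]:
    """Return, for each output layer index, the source layer it is copied from."""
    layer_map = []
    for src in range(num_source_layers):
        layer_map.append(src)
        if src == dup_layer:
            # The duplicate sits directly after the original layer.
            layer_map.append(src)
    return layer_map


if __name__ == "__main__":
    for out_idx, src_idx in enumerate(build_layer_map()):
        print(f"output layer {out_idx} <- source layer {src_idx}")
```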
## Duplication Method
- **Noise injection**: Gaussian noise (std=0.001) added to duplicate layer to break symmetry
- **Depth scaling**: Factor of √(28/29) ≈ 0.983 applied to prevent activation explosion
- **Anchors preserved**: First layer (0) and last layer (27→28) remain unmodified
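
The two numeric steps above can be illustrated on a flat list of weights. This is only a sketch under stated assumptions: the README does not say which tensors receive the depth-scaling factor, so here it is applied to the noisy duplicate only, and `make_duplicate` is a hypothetical name. The standard library is used in place of real tensors.

```python
import math
import random

NOISE_STD = 0.001                  # from "Noise injection" above
DEPTH_SCALE = math.sqrt(28 / 29)   # from "Depth scaling" above, about 0.983


def make_duplicate(weights: list[float], seed: int = 0) -> list[float]:
    """Copy a layer's weights, add small Gaussian noise, then scale.

    Assumption: the scale factor is applied to the duplicate only.
    """
    rng = random.Random(seed)
    return [(w + rng.gauss(0.0, NOISE_STD)) * DEPTH_SCALE for w in weights]


if __name__ == "__main__":
    original = [0.5, -0.25, 1.0]
    print(f"depth scale: {DEPTH_SCALE:.3f}")
    print(make_duplicate(original))
```

The noise keeps the duplicate from being numerically identical to layer 21, and the scale factor slightly shrinks its weights.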
## Files
| File | Size | Description |
|---|---|---|
| `model-00001-of-00001.safetensors` | 3.3 GB | Model weights (bfloat16) |
| `config.json` | <1 KB | Model configuration |
| `tokenizer.json` | 11 MB | Tokenizer |
| `tokenizer_config.json` | 10 KB | Tokenizer configuration |
| `vocab.json` | 2.7 MB | Vocabulary |
| `merges.txt` | 1.6 MB | BPE merges |
| `generation_config.json` | <1 KB | Generation defaults |
| `eval_results.json` | 1 KB | Full evaluation metrics |
## Usage
```python
import time
from io import StringIO

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Tweakable parameters
# MODEL_PATH = "./Qwen3-BEST"  # local run
MODEL_PATH = "g023/Qwen3-1.77B-g023"
MAX_NEW_TOKENS = 8192
TEMPERATURE = 0.7
DO_SAMPLE = True
TOP_P = 0.9
TOP_K = 50
REPETITION_PENALTY = 1.1
STREAMING = True  # True: stream tokens as they are generated; False: print the reply once complete
INPUT_MESSAGE = "You are completing the next step in a task to create an arcade game in javascript. Your available tools are rationalize, red_green_tdd, and create_plan. Synthesize their output when reasoning. "

# Sampling settings shared by every generate() call below
GENERATION_KWARGS = dict(
    max_new_tokens=MAX_NEW_TOKENS,
    temperature=TEMPERATURE,
    do_sample=DO_SAMPLE,
    top_p=TOP_P,
    top_k=TOP_K,
    repetition_penalty=REPETITION_PENALTY,
)


def load_model():
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    print("Model loaded.")
    return model, tokenizer


def build_inputs(model, tokenizer, messages):
    """Apply the chat template (thinking mode on) and tokenize onto the model's device."""
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
    return tokenizer(text, return_tensors="pt").to(model.device)


def inference_non_streaming(model, tokenizer, messages):
    inputs = build_inputs(model, tokenizer, messages)
    outputs = model.generate(**inputs, **GENERATION_KWARGS)
    # Decode only the newly generated tokens, not the prompt
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print("Response:", response)
    return response


def inference_streaming(model, tokenizer, messages):
    inputs = build_inputs(model, tokenizer, messages)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    outputs = model.generate(**inputs, **GENERATION_KWARGS, streamer=streamer)
    # TextStreamer only prints, so decode the new tokens to return the final string
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


class CapturingTextStreamer(TextStreamer):
    """Prints streamed text and also stores it in a buffer."""

    def __init__(self, tokenizer, buffer):
        super().__init__(tokenizer, skip_prompt=True, skip_special_tokens=True)
        self.buffer = buffer

    def on_finalized_text(self, text, stream_end=False):
        self.buffer.write(text)
        print(text, end="", flush=True)


def llm_stream(model, tokenizer, conversation):
    """Stream a reply and return reasoning, content, estimated token usage, and timing."""
    start_time = time.time()
    inputs = build_inputs(model, tokenizer, conversation)
    buffer = StringIO()
    streamer = CapturingTextStreamer(tokenizer, buffer)
    model.generate(**inputs, **GENERATION_KWARGS, streamer=streamer)
    response = buffer.getvalue()
    # Everything before the last </think> is reasoning; the rest is the answer
    if "</think>" in response:
        reasoning, content = response.rsplit("</think>", 1)
        reasoning = reasoning.strip()
        content = content.strip()
    else:
        reasoning = ""
        content = response.strip()
    # Rough token estimate from character counts (not an exact tokenizer count)
    char_per_token = 3.245
    reasoning_tokens = round(len(reasoning) / char_per_token)
    content_tokens = round(len(content) / char_per_token)
    total_tokens = reasoning_tokens + content_tokens
    time_taken = time.time() - start_time
    return {
        "reasoning": reasoning,
        "content": content,
        "usage": {
            "reasoning_tokens": reasoning_tokens,
            "content_tokens": content_tokens,
            "total_tokens": total_tokens,
        },
        "time_taken": time_taken,
    }


if __name__ == "__main__":
    model, tokenizer = load_model()
    messages = [{"role": "user", "content": INPUT_MESSAGE}]
    if STREAMING:
        ret = llm_stream(model, tokenizer, messages)
        print("Result dict:", ret)
        # Tokens per second from the estimated total_tokens and time_taken
        if ret["usage"]["total_tokens"] > 0 and ret["time_taken"] > 0:
            tps = ret["usage"]["total_tokens"] / ret["time_taken"]
            print(f"Tokens per second: {tps:.2f}")
    else:
        inference_non_streaming(model, tokenizer, messages)
```
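
The script above always requests thinking mode. The Key Result table also lists non-thinking mode as working, so a prompt builder with a switch may be useful. This is a minimal sketch: `build_prompt` is a hypothetical helper, and it assumes this model's chat template honors `enable_thinking` the same way the base Qwen3-1.7B template does.

```python
def build_prompt(tokenizer, user_message: str, thinking: bool = True) -> str:
    """Render a single-turn chat prompt with thinking mode on or off."""
    messages = [{"role": "user", "content": user_message}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # assumed to be forwarded to the chat template
    )
```

With a tokenizer loaded as in `load_model()`, calling `build_prompt(tokenizer, "Hello", thinking=False)` should give a prompt that asks for a direct answer without a reasoning section.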
## Base Model
- **Model**: Qwen/Qwen3-1.7B
- **Architecture**: Qwen3ForCausalLM (decoder-only transformer with GQA)
- **License**: Apache 2.0