ModelHub XC aa0cda1887 Initialize project; model provided by the ModelHub XC community
Model: g023/qwen3-tiny-v2
Source: Original Platform
2026-04-18 08:59:41 +08:00

---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-1.7B
tags:
- qwen3
- gguf
- layer-surgery
- small-language-model
- optimized
- thinking
- text-generation
- skip-connections
- interpolation
model_name: Qwen3-g023-tiny-v2
pipeline_tag: text-generation
library_name: llama.cpp
quantized_by: g023
---
# Qwen3-g023-tiny-v2 — GGUF
**An advanced 30-layer Qwen3 variant using swap, interpolation, and skip-bridge surgery.**
Created through layer surgery that combines multi-swap, interpolation, and bridge (skip connection) techniques. It scores **94.3/100**, a 6.5-point improvement over the original Qwen3-1.7B baseline (87.8/100) and the highest score achieved in two phases of experimentation across ~250 configurations. These scores come from my own benchmark suite, so results may differ if you run your own tests.
## Available Quantizations
| Quantization | Bits/Weight | Description | Download |
|:---:|:---:|:---|:---:|
| **Q8_0** | 8.50 | Highest quality, virtually lossless (USE THIS ONE) | [Qwen3-g023-tiny-v2-Q8_0.gguf](./Qwen3-g023-tiny-v2-Q8_0.gguf) |
| **Q6_K** | 6.57 | Excellent quality, good compression | [Qwen3-g023-tiny-v2-Q6_K.gguf](./Qwen3-g023-tiny-v2-Q6_K.gguf) |
| **Q4_K_M** | 4.85 | Good balance of quality and size | [Qwen3-g023-tiny-v2-Q4_K_M.gguf](./Qwen3-g023-tiny-v2-Q4_K_M.gguf) |
| **Q3_K_M** | 3.91 | High compression, moderate quality loss | [Qwen3-g023-tiny-v2-Q3_K_M.gguf](./Qwen3-g023-tiny-v2-Q3_K_M.gguf) |
| **Q2_K** | 3.35 | Maximum compression, significant quality loss | [Qwen3-g023-tiny-v2-Q2_K.gguf](./Qwen3-g023-tiny-v2-Q2_K.gguf) |
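As a rough cross-check on the table above, file size can be approximated as parameters × bits per weight ÷ 8. The sketch below assumes the ~1.82B parameter count from the Model Details table; the function name is hypothetical, and real GGUF files will differ somewhat because of metadata and tensors stored at a different precision.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 1.82e9  # assumption: total parameters as listed under Model Details

# Bits-per-weight values copied from the quantization table
for name, bpw in [("Q6_K", 6.57), ("Q4_K_M", 4.85), ("Q3_K_M", 3.91)]:
    print(f"{name}: ~{estimate_gguf_size_gb(N_PARAMS, bpw):.2f} GB")
```

The estimate covers weights only, so treat it as a lower bound on what a runtime will actually allocate.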
## Model Details
| Parameter | Value |
|:---|:---|
| Architecture | Qwen3ForCausalLM |
| Layers | **30** (28 original + 2 from surgery) |
| Hidden Size | 2,048 |
| Intermediate Size | 6,144 |
| Attention Heads | 16 query / 8 key-value (GQA) |
| Head Dimension | 128 |
| Vocabulary | 151,936 tokens |
| Max Context | 40,960 tokens |
| RoPE θ | 1,000,000 |
| Tied Embeddings | Yes |
| Total Parameters | **~1.82B** |
| Precision (source) | bfloat16 |
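The parameter total can be reproduced from the values in this table. The sketch below is a back-of-the-envelope count, not the author's script: it assumes the usual Qwen3 block layout (bias-free attention and MLP projections, a per-head RMSNorm on queries and keys, two RMSNorms per block, one final norm), and the function name is hypothetical.

```python
def qwen3_param_count(
    n_layers: int,
    hidden: int = 2048,
    intermediate: int = 6144,
    n_heads: int = 16,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    vocab: int = 151_936,
    tied_embeddings: bool = True,
) -> int:
    q_out = n_heads * head_dim        # query projection width
    kv_out = n_kv_heads * head_dim    # key/value projection width (GQA)
    attn = hidden * q_out + 2 * hidden * kv_out + q_out * hidden  # q, k, v, o
    qk_norm = 2 * head_dim            # assumption: RMSNorm on queries and keys
    mlp = 3 * hidden * intermediate   # gate, up, and down projections
    block_norms = 2 * hidden          # input and post-attention RMSNorm
    per_layer = attn + qk_norm + mlp + block_norms
    embed = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    return n_layers * per_layer + embed + lm_head + hidden  # + final norm


for layers in (27, 28, 30):
    print(f"{layers} layers: {qwen3_param_count(layers) / 1e9:.2f}B")
# 27 layers: 1.67B
# 28 layers: 1.72B
# 30 layers: 1.82B
```

Each added block contributes about 50.3M parameters, which is why two extra layers move the total from roughly 1.72B to 1.82B.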
## Surgery Operations
This model was created by applying three innovative surgical operations to [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B):
1. **Multi-swap: layers 12↔13 and 16↔17.** Reorders whole transformer blocks (attention + MLP) at two points in the middle of the network, intended to improve representational flow through the mid-layers.
2. **Interpolation: layers 20 & 22 (α=0.5).** Creates a new layer by blending the weights of layers 20 and 22 in equal proportions, intended to give a smoother transition in the upper layers.
3. **Bridge (skip connection): layer 5 → after layer 20.** Copies the weights of an early layer (layer 5) and inserts the copy after layer 20, so that an early-layer transformation is applied again deep in the network to help preserve low-level features.
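The three operations can be illustrated on a toy model in which each layer is a dictionary of weight lists. This is a sketch under assumptions, not the author's surgery script: the helper names are hypothetical, indices are taken to refer to positions in the layer list after the swaps, and the interpolated layer is assumed to sit between layers 21 and 22, none of which the description above specifies.

```python
import copy


def swap(layers, i, j):
    """Return a new layer list with positions i and j exchanged."""
    out = list(layers)
    out[i], out[j] = out[j], out[i]
    return out


def interpolate(a, b, alpha=0.5):
    """Blend two layers weight-by-weight: (1 - alpha) * a + alpha * b."""
    return {
        name: [(1 - alpha) * x + alpha * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def insert_after(layers, index, layer):
    """Return a new layer list with `layer` placed right after `index`."""
    return layers[: index + 1] + [layer] + layers[index + 1 :]


# Toy stand-in for the 28 original blocks: one small weight vector each
base = [{"w": [float(i), float(i % 2)]} for i in range(28)]

result = swap(swap(base, 12, 13), 16, 17)               # multi-swap
blend = interpolate(result[20], result[22], alpha=0.5)  # new blended layer
bridge = copy.deepcopy(result[5])                       # copy of early layer

# Insert at the higher index first so the lower index stays valid
result = insert_after(result, 21, blend)   # assumed position for the blend
result = insert_after(result, 20, bridge)  # bridge goes after layer 20

print(len(result))  # 30
```

With real checkpoints the same idea applies per tensor name, and the layer count in the model config has to be updated to match.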
### Why These Operations Work
- **Multi-swap** may correct a suboptimal layer ordering that emerged from pre-training, allowing representations to flow better through the network's critical middle section at inference time (no retraining is involved).
- **Interpolation** creates a synthetic transition layer that smooths the representation gap between layers 20 and 22, reducing the information bottleneck.
- **Bridge/skip connections** address the "forgetting problem" in deep networks by reapplying an early-layer transformation at a later stage, a technique loosely inspired by ResNet's residual connections but applied at the transformer layer level.
## Benchmark Results
| Metric | Original (28L) | [v1 (27L)](https://huggingface.co/g023/Qwen3-g023-tiny-v1-GGUF) | **v2 (30L)** | Δ vs Original |
|:---|:---:|:---:|:---:|:---:|
| **Overall Score** | 87.8 / 100 | 92.9 / 100 | **94.3 / 100** | **+6.5** |
| **Factual Accuracy** | 15/17 (88%) | 17/17 (100%) | **16/17 (94%)** | **+6 pp** |
| Avg Perplexity | — | 15.70 | **15.17** | — |
| Thinking Mode | ✅ | ✅ | ✅ | — |
| Non-Thinking Mode | ✅ | ✅ | ✅ | — |
Evaluated using a comprehensive test suite with 17 factual questions, 2 completion coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.
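The exact scoring formulas are not given here, so the following is only a sketch of how two of the listed metrics are commonly computed: perplexity as the exponential of the mean negative log-likelihood per token, and a repetition ratio as the share of repeated n-grams. The function names and the n-gram definition are assumptions, not the author's benchmark code.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def repetition_ratio(tokens, n=3):
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 = none)."""
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# A model that assigns probability 0.5 to every token has perplexity 2
print(perplexity([math.log(0.5)] * 4))
print(repetition_ratio("the cat sat on the cat sat on".split(), n=3))
```

Lower is better for both numbers; the per-token log-probabilities would come from whichever runtime is used for evaluation.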
## Features
- **Thinking mode**: Full `<think>` / `</think>` reasoning support — toggle via `enable_thinking` parameter
- **Non-thinking mode**: Direct responses without chain-of-thought overhead
- **Tool calling**: Full function/tool calling support
- **System prompts**: Standard system message support
- **Chat template**: Qwen3 ChatML template embedded in the GGUF
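When thinking mode is on, the reasoning arrives inside `<think>` / `</think>` tags ahead of the final answer. A minimal sketch for separating the two is shown below; the function name is hypothetical, and it also handles output where the opening tag was already part of the prompt and so is missing from the completion.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completion that may contain a think block."""
    head, closing_tag, tail = text.partition("</think>")
    if not closing_tag:
        # No think block at all: everything is the answer
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>\n2 + 2 is 4\n</think>\n\n4")
print(repr(reasoning), repr(answer))  # '2 + 2 is 4' '4'
```

In non-thinking mode the block may be empty or absent, in which case the reasoning part comes back as an empty string.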
## Usage
### With Ollama
```bash
# Download the GGUF and create from Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen3-g023-tiny-v2-Q8_0.gguf
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 45
PARAMETER min_p 0.1
PARAMETER num_ctx 40000
PARAMETER mirostat 2
PARAMETER mirostat_tau 5.0
PARAMETER mirostat_eta 0.1
PARAMETER repeat_last_n 16384
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0.5
PARAMETER frequency_penalty 1.0
TEMPLATE """{{- if .System }}
<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
<|im_start|>user
{{ .Content }}<|im_end|>
{{- else if eq .Role "assistant" }}
<|im_start|>assistant
{{ .Content }}<|im_end|>
{{- end }}
{{- end }}
<|im_start|>assistant
"""
SYSTEM "You are a helpful assistant."
EOF
ollama create qwen3-tiny-v2 -f Modelfile
ollama run qwen3-tiny-v2
```
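Once the model has been created, it can also be queried over Ollama's local HTTP API. The sketch below assumes a server on Ollama's default address and the `qwen3-tiny-v2` model name from the commands above; the helper names are hypothetical and only the Python standard library is used.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumption: default local server


def build_chat_payload(prompt, model="qwen3-tiny-v2", system=None):
    """Build the JSON body for a single non-streaming chat request."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages, "stream": False}


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


# Requires a running Ollama server with the model created as shown above:
# print(chat("What is the capital of France?"))
```

Sampling parameters set in the Modelfile apply by default; they can be overridden per request through an `options` object in the payload.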
### With llama.cpp
```bash
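# Interactive chat (conversation mode with the ChatML template)
llama-cli -m Qwen3-g023-tiny-v2-Q8_0.gguf \
  --chat-template chatml -cnv
# Thinking mode: raw ChatML prompt that pre-opens the <think> block.
# If your llama-cli build starts in conversation mode by default, disable it
# so the prompt is not wrapped in the chat template a second time.
llama-cli -m Qwen3-g023-tiny-v2-Q8_0.gguf \
  -p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
  -n 512
# Non-thinking mode: the /no_think switch in the user turn suppresses reasoning
llama-cli -m Qwen3-g023-tiny-v2-Q8_0.gguf \
  -p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
  -n 128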
```
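The raw prompts passed with `-p` above follow the ChatML layout: each turn is wrapped in `<|im_start|>` and `<|im_end|>`, and the prompt ends with an open assistant turn. A small helper for building such strings might look like the sketch below; the function name and the `open_think` flag are hypothetical.

```python
def build_chatml_prompt(messages, open_think=False):
    """Render messages as a ChatML prompt ending in an open assistant turn."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    parts.append("<|im_start|>assistant\n")
    if open_think:
        # Pre-open the reasoning block, as in the thinking-mode command above
        parts.append("<think>\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [{"role": "user", "content": "Explain quantum computing"}],
    open_think=True,
)
print(prompt)
```

The resulting string contains real newlines, so it can be written to a file and passed to a runtime instead of escaping `\n` on the command line.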
### With Python (llama-cpp-python)
```python
from llama_cpp import Llama

model = Llama("Qwen3-g023-tiny-v2-Q8_0.gguf", n_ctx=4096)

response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.6,
)
print(response["choices"][0]["message"]["content"])
```
## System Requirements
| Quantization | RAM (CPU) | VRAM (GPU) |
|:---:|:---:|:---:|
| Q8_0 | ~2.2 GB | ~2.2 GB |
| Q6_K | ~1.8 GB | ~1.8 GB |
| Q4_K_M | ~1.4 GB | ~1.4 GB |
| Q3_K_M | ~1.2 GB | ~1.2 GB |
| Q2_K | ~1.0 GB | ~1.0 GB |
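The table does not state which context length it assumes, and the KV cache grows linearly with context on top of the weights. The sketch below estimates cache size from the Model Details values, assuming a 16-bit cache (2 bytes per element); the function name is hypothetical, and a quantized KV cache would shrink these numbers.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 30,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_element: int = 2,  # assumption: 16-bit keys and values
) -> int:
    """Bytes needed to cache keys and values for n_ctx tokens."""
    keys_and_values = 2
    return keys_and_values * n_layers * n_kv_heads * head_dim * bytes_per_element * n_ctx


for n_ctx in (4096, 40960):
    print(f"n_ctx={n_ctx}: {kv_cache_bytes(n_ctx) / 2**20:.0f} MiB")
# n_ctx=4096: 480 MiB
# n_ctx=40960: 4800 MiB
```

At the full 40,960-token context the cache alone can exceed the size of the Q8_0 weights, so budget memory according to the context you actually plan to use.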
## v1 vs v2
This model (v2) is the **Phase 2 champion**, using advanced multi-operation surgery for the highest overall score.
| | [v1](https://huggingface.co/g023/Qwen3-g023-tiny-v1-GGUF) | v2 (this model) |
|:---|:---:|:---:|
| Layers | 27 | 30 |
| Parameters | ~1.67B | ~1.82B |
| Operations | del + swap | swap + interpolate + bridge |
| Score | 92.9 / 100 | 94.3 / 100 |
| Factual | 100% (17/17) | 94% (16/17) |
| Perplexity | 15.70 | 15.17 |
| Use Case | Max factual accuracy | Max overall score |
**v1** is recommended when factual accuracy is paramount (100% vs 94%).
**v2** is recommended when overall quality matters more (94.3 vs 92.9).
## Methodology
Layer surgery was performed through a systematic, test-driven process across two phases:
1. **Phase 1** (~150 configs): Exhaustive search across deletion, duplication, swapping, interpolation, and combined operations → champion: del_10 + swap_11↔12 (v1)
2. **Phase 2** (~95 configs): Advanced techniques including tripling, multi-swap, layer reversal, cycling, weight scaling, layer merging, skip bridges, and synthesis → champion: this model (v2)
3. **Evaluation**: Each configuration scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
### Phase 2 Leaderboard (Top 5)
| Rank | Configuration | Score | Factual | PPL |
|:---:|:---|:---:|:---:|:---:|
| 🥇 | swap(12↔13,16↔17) + interp(20↔22) + bridge(5→20) | **94.3** | 94% | 15.17 |
| 🥈 | swap(12↔13,16↔17) + interp(20↔22) | 93.9 | 94% | 14.74 |
| 🥉 | swap(12↔13) + interp(20↔22) + bridge(5→20) | 93.4 | 94% | 15.66 |
| 4 | multi-swap(12↔13,16↔17) | 93.1 | 100% | 14.90 |
| 5 | Phase 1 champion (del_10 + swap_11↔12) | 92.9 | 100% | 15.70 |
## Credits
- **Base model**: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by the Qwen team at Alibaba
- **Quantization**: llama.cpp
- **Surgery**: g023
## License
Apache 2.0 — same as the original Qwen3-1.7B model.