Model: g023/qwen3-tiny-v1
Source: Original Platform
2026-04-22 16:03:32 +08:00


---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-1.7B
tags:
- qwen3
- gguf
- layer-surgery
- small-language-model
- pruned
- optimized
- thinking
- text-generation
model_name: Qwen3-g023-tiny-v1
pipeline_tag: text-generation
library_name: llama.cpp
quantized_by: g023
---
# Qwen3-g023-tiny-v1 — GGUF
**A surgically optimized 27-layer Qwen3 variant that outperforms the original 28-layer model.**
Created by selectively deleting a harmful layer and swapping adjacent layers for improved information flow. Scores **92.9/100** with **100% factual accuracy** — a 5.1-point improvement over the original Qwen3-1.7B baseline (87.8/100).
## Available Quantizations
| Quantization | Bits/Weight | Description | Download |
|:---:|:---:|:---|:---:|
| **Q8_0** | 8.50 | Highest quality, virtually lossless (USE THIS ONE) | [Qwen3-g023-tiny-v1-Q8_0.gguf](./Qwen3-g023-tiny-v1-Q8_0.gguf) |
| **Q6_K** | 6.57 | Excellent quality, good compression | [Qwen3-g023-tiny-v1-Q6_K.gguf](./Qwen3-g023-tiny-v1-Q6_K.gguf) |
| **Q4_K_M** | 4.85 | Good balance of quality and size | [Qwen3-g023-tiny-v1-Q4_K_M.gguf](./Qwen3-g023-tiny-v1-Q4_K_M.gguf) |
| **Q3_K_M** | 3.91 | High compression, moderate quality loss | [Qwen3-g023-tiny-v1-Q3_K_M.gguf](./Qwen3-g023-tiny-v1-Q3_K_M.gguf) |
| **Q2_K** | 3.35 | Maximum compression, significant quality loss | [Qwen3-g023-tiny-v1-Q2_K.gguf](./Qwen3-g023-tiny-v1-Q2_K.gguf) |
## Model Details
| Parameter | Value |
|:---|:---|
| Architecture | Qwen3ForCausalLM |
| Layers | **27** (28 original − 1 deleted) |
| Hidden Size | 2,048 |
| Intermediate Size | 6,144 |
| Attention Heads | 16 query / 8 key-value (GQA) |
| Head Dimension | 128 |
| Vocabulary | 151,936 tokens |
| Max Context | 40,960 tokens |
| RoPE θ | 1,000,000 |
| Tied Embeddings | Yes |
| Total Parameters | **~1.67B** |
| Precision (source) | bfloat16 |
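
The parameter total can be sanity-checked from the other rows of the table. The sketch below assumes the usual Qwen3 decoder layout (bias-free attention projections, a three-matrix gated MLP, RMSNorm weights, and per-head query/key norms); that layout is an assumption, not something stated in this README.

```python
# Rough parameter count from the Model Details table.
# Assumption: standard Qwen3 decoder layout (no projection biases,
# gated MLP with gate/up/down matrices, RMSNorm, per-head q/k norms).

HIDDEN = 2048
INTERMEDIATE = 6144
N_HEADS = 16
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 151_936
LAYERS = 27


def params_per_layer() -> int:
    """Weights in one decoder layer."""
    q = HIDDEN * N_HEADS * HEAD_DIM
    k = HIDDEN * N_KV_HEADS * HEAD_DIM
    v = HIDDEN * N_KV_HEADS * HEAD_DIM
    o = N_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE       # gate, up, and down projections
    norms = 2 * HIDDEN + 2 * HEAD_DIM     # two layer norms plus q/k norms
    return q + k + v + o + mlp + norms


def total_params(layers: int = LAYERS) -> int:
    """Whole model; tied embeddings are counted once."""
    embeddings = VOCAB * HIDDEN
    final_norm = HIDDEN
    return embeddings + layers * params_per_layer() + final_norm


print(f"{total_params() / 1e9:.2f}B")
```

Under these assumptions each layer holds about 50.3M weights, so removing one layer from the 28-layer base accounts for the drop to roughly 1.67B.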
## Surgery Operations
This model was created by applying two surgical operations to [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B):
1. **Delete layer 10**: Layer 10 was identified as harmful to model quality. Removing it raised the overall score from 87.8 to 91.4 (+3.6).
2. **Swap layers 11 ↔ 12** (post-deletion indices): Swapping these adjacent transformer layers improved information flow between the model's middle layers and raised the score further, to 92.9.
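
To make the index bookkeeping concrete, here is a minimal sketch that applies the two operations to a plain list of original layer indices. It assumes 0-based layer numbering, and the helper names `delete_layer` and `swap_layers` are hypothetical, not part of the surgery framework.

```python
from typing import List


def delete_layer(layers: List[int], index: int) -> List[int]:
    """Return a copy of the layer order with one position removed."""
    return layers[:index] + layers[index + 1:]


def swap_layers(layers: List[int], i: int, j: int) -> List[int]:
    """Return a copy of the layer order with positions i and j exchanged."""
    out = list(layers)
    out[i], out[j] = out[j], out[i]
    return out


original = list(range(28))            # original layers 0..27 (assumed 0-based)
after_delete = delete_layer(original, 10)
v1_order = swap_layers(after_delete, 11, 12)

print(len(v1_order))                  # 27
print(v1_order[9:14])                 # [9, 11, 13, 12, 14]
```

Because the swap uses post-deletion indices, positions 11 and 12 hold what were originally layers 12 and 13. On an actual checkpoint, the same reordering would presumably be applied to the list of decoder layers rather than to integers.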
### Key Findings
- **Smaller is better**: The 27-layer model outperforms both the 28-layer original and various 29–30-layer expanded models.
- **Layer 10 is actively harmful**: Removing it alone yields a +3.6 point improvement.
- **Operations compound selectively**: Deletion + swap works, but deletion + duplication degrades quality.
## Benchmark Results
| Metric | Original (28L) | **v1 (27L)** | Δ |
|:---|:---:|:---:|:---:|
| **Overall Score** | 87.8 / 100 | **92.9 / 100** | **+5.1** |
| **Factual Accuracy** | 15 / 17 (88%) | **17 / 17 (100%)** | **+12 pp** |
| Avg Perplexity | — | 15.70 | — |
| Thinking Mode | ✅ | ✅ | — |
| Non-Thinking Mode | ✅ | ✅ | — |
Both models were evaluated with a test suite comprising 17 factual questions, 2 completion coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.
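
For readers unfamiliar with the two numeric metrics, the sketch below shows the standard definitions: perplexity as the exponential of the negative mean token log-probability, and factual accuracy as a simple percentage. The actual test suite's implementation is not shown in this README, so treat this as an illustration of the definitions only.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean(log p)) over the natural-log probabilities of each token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def accuracy_pct(correct: int, total: int) -> float:
    """Share of correctly answered questions, in percent."""
    return 100.0 * correct / total


baseline = accuracy_pct(15, 17)
v1 = accuracy_pct(17, 17)
print(f"{baseline:.1f}% -> {v1:.1f}% (+{v1 - baseline:.1f} points)")
```

With only 17 questions, each answer is worth about 5.9 percentage points, so the gap between the two models corresponds to two questions.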
## Features
- **Thinking mode**: Full `<think>` / `</think>` reasoning support — toggle via `enable_thinking` parameter
- **Non-thinking mode**: Direct responses without chain-of-thought overhead
- **Tool calling**: Full function/tool calling support
- **System prompts**: Standard system message support
- **Chat template**: Qwen3 ChatML template embedded in the GGUF
## Usage
### With Ollama
```bash
# Download the GGUF and create from Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen3-g023-tiny-v1-Q8_0.gguf
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 45
PARAMETER min_p 0.1
PARAMETER num_ctx 40000
PARAMETER mirostat 2
PARAMETER mirostat_tau 5.0
PARAMETER mirostat_eta 0.1
PARAMETER repeat_last_n 16384
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0.5
PARAMETER frequency_penalty 1.0
TEMPLATE """{{- if .System }}
<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
<|im_start|>user
{{ .Content }}<|im_end|>
{{- else if eq .Role "assistant" }}
<|im_start|>assistant
{{ .Content }}<|im_end|>
{{- end }}
{{- end }}
<|im_start|>assistant
"""
SYSTEM "You are a helpful assistant."
EOF
ollama create qwen3-tiny-v1 -f Modelfile
ollama run qwen3-tiny-v1
```
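
Once the model has been created, it can also be queried programmatically. The following is a minimal sketch using only the Python standard library against Ollama's local `/api/chat` endpoint; it assumes the server is running on the default port 11434 and that the model was created under the name `qwen3-tiny-v1` as above. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_request(prompt: str, model: str = "qwen3-tiny-v1") -> urllib.request.Request:
    """Build a non-streaming chat request for a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Calling `chat("What is the capital of France?")` with the server running should return the reply as a string; sampling parameters come from the Modelfile unless overridden in the request.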
### With llama.cpp
```bash
# Interactive chat
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  --chat-template chatml -cnv

# Thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
  -n 512

# Non-thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
  -n 128
```
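
The raw prompts passed with `-p` above follow the ChatML layout. A small helper can assemble them so the thinking and non-thinking variants stay consistent; this is a sketch that mirrors the two commands, and the function name `chatml_prompt` is hypothetical.

```python
from typing import Optional


def chatml_prompt(user: str, system: Optional[str] = None, thinking: bool = True) -> str:
    """Build a single-turn ChatML prompt ending at the assistant turn."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    if not thinking:
        # Same soft switch as in the non-thinking command above.
        user = f"/no_think {user}"
    parts.append(f"<|im_start|>user\n{user}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    if thinking:
        # Pre-open the reasoning block, as in the thinking-mode command.
        parts.append("<think>\n")
    return "".join(parts)


print(chatml_prompt("What is 2+2?", thinking=False))
```

The returned string contains real newlines, so it can be written to a file and passed to `llama-cli` with `-f` instead of escaping `\n` on the command line.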
### With Python (llama-cpp-python)
```python
from llama_cpp import Llama

# Load the quantized model with a 4,096-token context window.
model = Llama("Qwen3-g023-tiny-v1-Q8_0.gguf", n_ctx=4096)

# Run a single chat turn and print the assistant's reply.
response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.6,
)
print(response["choices"][0]["message"]["content"])
```
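
In thinking mode the reply may contain a `<think>` … `</think>` block before the final answer. A minimal sketch for separating the two is shown below; it assumes at most one reasoning block per reply, and the helper name `split_thinking` is hypothetical.

```python
import re
from typing import Tuple

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no block is present."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Keep everything outside the first <think>...</think> block as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>\nParis is the capital.\n</think>\n\nParis.")
print(answer)  # Paris.
```

This can be applied to `response["choices"][0]["message"]["content"]` from the example above when only the final answer should be shown to the user.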
## System Requirements
| Quantization | RAM (CPU) | VRAM (GPU) |
|:---:|:---:|:---:|
| Q8_0 | ~2.0 GB | ~2.0 GB |
| Q6_K | ~1.7 GB | ~1.7 GB |
| Q4_K_M | ~1.3 GB | ~1.3 GB |
| Q3_K_M | ~1.1 GB | ~1.1 GB |
| Q2_K | ~0.9 GB | ~0.9 GB |
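
As a rough cross-check of these figures, memory use can be approximated as quantized weights plus the KV cache. The sketch below uses the parameter count and bits-per-weight values from the tables above and assumes a 16-bit KV cache; real usage also includes compute buffers and runtime overhead, and K-quants store some tensors at a different precision, so the results are estimates only.

```python
GIB = 1024 ** 3
PARAMS = 1.67e9  # approximate total from the Model Details table


def weights_gib(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate size of the quantized weights."""
    return params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_ctx: int, layers: int = 27, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Keys and values for every layer at full context (16-bit assumed)."""
    return 2 * layers * kv_heads * head_dim * n_ctx * bytes_per_value / GIB


print(f"Q4_K_M weights:      {weights_gib(4.85):.2f} GiB")
print(f"KV cache @ 4,096:    {kv_cache_gib(4096):.2f} GiB")
print(f"KV cache @ 40,960:   {kv_cache_gib(40960):.2f} GiB")
```

Under these assumptions the KV cache is about 0.42 GiB at a 4,096-token context but about 4.2 GiB at the full 40,960 tokens, so long-context runs need considerably more memory than the table suggests.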
## v1 vs v2
This model (v1) is the **Phase 1 champion**, focused on surgical precision with minimal operations.
| | v1 (this model) | [v2](https://huggingface.co/g023/Qwen3-g023-tiny-v2-GGUF) |
|:---|:---:|:---:|
| Layers | 27 | 30 |
| Parameters | ~1.67B | ~1.82B |
| Operations | del + swap | swap + interpolate + bridge |
| Score | 92.9 / 100 | 94.3 / 100 |
| Factual | 100% (17/17) | 94% (16/17) |
| Perplexity | 15.70 | 15.17 |
| Use Case | Max factual accuracy | Max overall score |
**v1** is recommended when factual accuracy is paramount (100% vs 94%).
**v2** is recommended when overall quality matters more (94.3 vs 92.9).
## Methodology
Layer surgery was performed through a systematic, test-driven development process:
1. **Phase 1**: Exhaustive search across 150+ configurations testing deletion, duplication, swapping, interpolation, and combined operations
2. **Evaluation**: Each configuration was scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
3. **Selection**: The champion was selected based on overall score, with factual accuracy as a tiebreaker
The surgery framework is available in the [source repository](https://huggingface.co/g023/Qwen3-g023-tiny-v1-GGUF).
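
The selection rule in step 3 can be sketched as a sort key. The code below is illustrative rather than the framework's actual implementation: the first two entries use the figures reported in this README, while the third is a hypothetical candidate invented solely to show the tiebreaker.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    score: float          # overall score out of 100
    factual_correct: int  # correct answers out of 17


def pick_champion(results: List[Result]) -> Result:
    """Highest overall score wins; factual accuracy breaks ties."""
    return max(results, key=lambda r: (r.score, r.factual_correct))


results = [
    Result("original-28L", 87.8, 15),     # from the benchmark table
    Result("v1-27L", 92.9, 17),           # from the benchmark table
    Result("hypothetical-tie", 92.9, 16), # invented to illustrate the tiebreak
]
print(pick_champion(results).name)  # v1-27L
```

Comparing tuples keeps the rule explicit: factual accuracy only matters when two configurations share the same overall score.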
## Credits
- **Base model**: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by the Qwen team at Alibaba
- **Quantization**: llama.cpp
- **Surgery**: g023
## License
Apache 2.0 — same as the original Qwen3-1.7B model.