---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-1.7B
tags:
- qwen3
- gguf
- layer-surgery
- small-language-model
- pruned
- optimized
- thinking
- text-generation
model_name: Qwen3-g023-tiny-v1
pipeline_tag: text-generation
library_name: llama.cpp
quantized_by: g023
---

# Qwen3-g023-tiny-v1 — GGUF

**A surgically optimized 27-layer Qwen3 variant that outperforms the original 28-layer model.**

Created by selectively deleting a harmful layer and swapping adjacent layers for improved information flow. It scores **92.9/100** with **100% factual accuracy**, a 5.1-point improvement over the original Qwen3-1.7B baseline (87.8/100).

## Available Quantizations

| Quantization | Bits/Weight | Description | Download |
|:---:|:---:|:---|:---:|
| **Q8_0** | 8.50 | Highest quality, virtually lossless (recommended) | [Qwen3-g023-tiny-v1-Q8_0.gguf](./Qwen3-g023-tiny-v1-Q8_0.gguf) |
| **Q6_K** | 6.57 | Excellent quality, good compression | [Qwen3-g023-tiny-v1-Q6_K.gguf](./Qwen3-g023-tiny-v1-Q6_K.gguf) |
| **Q4_K_M** | 4.85 | Good balance of quality and size | [Qwen3-g023-tiny-v1-Q4_K_M.gguf](./Qwen3-g023-tiny-v1-Q4_K_M.gguf) |
| **Q3_K_M** | 3.91 | High compression, moderate quality loss | [Qwen3-g023-tiny-v1-Q3_K_M.gguf](./Qwen3-g023-tiny-v1-Q3_K_M.gguf) |
| **Q2_K** | 3.35 | Maximum compression, significant quality loss | [Qwen3-g023-tiny-v1-Q2_K.gguf](./Qwen3-g023-tiny-v1-Q2_K.gguf) |

## Model Details

| Parameter | Value |
|:---|:---|
| Architecture | Qwen3ForCausalLM |
| Layers | **27** (28 original − 1 deleted) |
| Hidden Size | 2,048 |
| Intermediate Size | 6,144 |
| Attention Heads | 16 query / 8 key-value (GQA) |
| Head Dimension | 128 |
| Vocabulary | 151,936 tokens |
| Max Context | 40,960 tokens |
| RoPE θ | 1,000,000 |
| Tied Embeddings | Yes |
| Total Parameters | **~1.67B** |
| Precision (source) | bfloat16 |

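The ~1.67B figure can be roughly reproduced from the other rows of the table. The sketch below is a back-of-the-envelope count, not the author's tooling. It assumes a Qwen3-style decoder with no bias terms, RMSNorm weights only, and per-head query/key norms of size `head_dim`; none of these details are stated in this card.

```python
# Rough parameter count for a Qwen3-style decoder, using the values from the
# Model Details table. Assumptions (not stated in the card): no bias terms,
# RMSNorm weights only, and per-head q/k norms of size head_dim.
HIDDEN = 2048
INTERMEDIATE = 6144
N_HEADS = 16
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 151_936


def params_per_layer() -> int:
    q_proj = HIDDEN * N_HEADS * HEAD_DIM
    k_proj = HIDDEN * N_KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * N_KV_HEADS * HEAD_DIM
    o_proj = N_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE      # gate, up, and down projections
    norms = 2 * HIDDEN + 2 * HEAD_DIM    # two layer norms plus q/k norms
    return q_proj + k_proj + v_proj + o_proj + mlp + norms


def total_params(n_layers: int) -> int:
    embeddings = VOCAB * HIDDEN          # tied, so counted once
    final_norm = HIDDEN
    return embeddings + final_norm + n_layers * params_per_layer()


for n_layers in (28, 27):
    print(f"{n_layers} layers: {total_params(n_layers) / 1e9:.2f}B")
```

Under these assumptions each layer holds about 50.3M parameters, so removing one layer takes the total from roughly 1.72B to roughly 1.67B.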
## Surgery Operations

This model was created by applying two surgical operations to [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B):

1. **Delete layer 10**: Layer 10 was identified as harmful to model quality. Removing it improved the overall score from 87.8 to 91.4.
2. **Swap layers 11 ↔ 12** (post-deletion indices): Swapping these adjacent transformer blocks improved information flow between the model's middle layers and raised the score further to 92.9.

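To make the "post-deletion indices" wording concrete, the sketch below tracks original layer indices through both operations on a plain Python list. It is illustrative only: the function name `apply_surgery` is hypothetical, and it assumes 0-based layer indices, which the card does not state explicitly.

```python
# Illustrative only: tracks original layer indices through the two operations.
# Assumes 0-based layer indices, which the card does not state explicitly.
def apply_surgery(
    n_layers: int = 28,
    delete: int = 10,
    swap: tuple[int, int] = (11, 12),
) -> list[int]:
    order = list(range(n_layers))
    del order[delete]                        # step 1: drop the layer
    i, j = swap                              # step 2: indices into the post-deletion list
    order[i], order[j] = order[j], order[i]
    return order


order = apply_surgery()
print(len(order))    # 27
print(order[9:14])   # [9, 11, 13, 12, 14]
```

Under the 0-based assumption, the swapped positions hold what were layers 12 and 13 in the original 28-layer model.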
### Key Findings

- **Smaller is better**: The 27-layer model outperforms both the 28-layer original and various 29–30 layer expanded models.
- **Layer 10 is actively harmful**: Removing it alone yields a +3.6 point improvement.
- **Operations compound selectively**: Deletion + swap works, but deletion + duplication degrades quality.

## Benchmark Results

| Metric | Original (28L) | **v1 (27L)** | Δ |
|:---|:---:|:---:|:---:|
| **Overall Score** | 87.8 / 100 | **92.9 / 100** | **+5.1** |
| **Factual Accuracy** | 15 / 17 (88%) | **17 / 17 (100%)** | **+12 pp** |
| Avg Perplexity | n/a | 15.70 | n/a |
| Thinking Mode | ✅ | ✅ | n/a |
| Non-Thinking Mode | ✅ | ✅ | n/a |

Both models were evaluated with a test suite of 17 factual questions, 2 completion-coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.

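The card does not say how the perplexity figure was computed. For orientation, the sketch below shows the standard definition (the exponential of the mean negative log-likelihood per token), assuming natural-log token probabilities. The function name and its input format are illustrative, not taken from the evaluation suite.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token has a perplexity of about 2.
print(perplexity([math.log(0.5)] * 4))
```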
## Features

- **Thinking mode**: Full `<think>` / `</think>` reasoning support; toggle via the `enable_thinking` parameter
- **Non-thinking mode**: Direct responses without chain-of-thought overhead
- **Tool calling**: Full function/tool calling support
- **System prompts**: Standard system message support
- **Chat template**: Qwen3 ChatML template embedded in the GGUF

## Usage

### With Ollama

```bash
# Download the GGUF and create from Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen3-g023-tiny-v1-Q8_0.gguf

PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 45
PARAMETER min_p 0.1
PARAMETER num_ctx 40000
PARAMETER mirostat 2
PARAMETER mirostat_tau 5.0
PARAMETER mirostat_eta 0.1
PARAMETER repeat_last_n 16384
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0.5
PARAMETER frequency_penalty 1.0

TEMPLATE """{{- if .System }}
<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
<|im_start|>user
{{ .Content }}<|im_end|>
{{- else if eq .Role "assistant" }}
<|im_start|>assistant
{{ .Content }}<|im_end|>
{{- end }}
{{- end }}
<|im_start|>assistant
"""
SYSTEM "You are a helpful assistant."
EOF

ollama create qwen3-tiny-v1 -f Modelfile
ollama run qwen3-tiny-v1
```

### With llama.cpp

```bash
# Interactive chat
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  --chat-template chatml -cnv

# Thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
  -n 512

# Non-thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
  -n 128
```

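The two raw prompts above differ only in a `/no_think` prefix on the user message and a trailing `<think>` opener on the assistant turn. If you assemble such prompts in code, a small helper along these lines may be useful. The function name `build_prompt` is hypothetical, and the strings simply mirror the `llama-cli` examples.

```python
def build_prompt(user_message: str, thinking: bool = True) -> str:
    """Build a single-turn ChatML prompt like the llama-cli examples above."""
    if thinking:
        # Open the assistant turn with <think> so the model starts by reasoning.
        return (
            f"<|im_start|>user\n{user_message}<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n"
        )
    # Prefix the user message with /no_think to request a direct answer.
    return (
        f"<|im_start|>user\n/no_think {user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_prompt("What is 2+2?", thinking=False))
```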
### With Python (llama-cpp-python)

```python
from llama_cpp import Llama

# Load the GGUF with a 4096-token context window
model = Llama(model_path="Qwen3-g023-tiny-v1-Q8_0.gguf", n_ctx=4096)

# Send a system + user message and print the assistant's reply
response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.6,
)
print(response["choices"][0]["message"]["content"])
```

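In thinking mode the reply may contain a `<think>` … `</think>` block ahead of the final answer. The sketch below shows one way to separate the two. It assumes the tags arrive verbatim in the returned text, which can depend on the runtime and chat handler; `split_thinking` is a hypothetical helper, not part of llama-cpp-python.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response that may contain a <think> block."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first <think> block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking(
    "<think>\nFrance -> Paris.\n</think>\n\nThe capital of France is Paris."
)
print(answer)
```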
## System Requirements

| Quantization | RAM (CPU) | VRAM (GPU) |
|:---:|:---:|:---:|
| Q8_0 | ~2.0 GB | ~2.0 GB |
| Q6_K | ~1.7 GB | ~1.7 GB |
| Q4_K_M | ~1.3 GB | ~1.3 GB |
| Q3_K_M | ~1.1 GB | ~1.1 GB |
| Q2_K | ~0.9 GB | ~0.9 GB |

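As a rough cross-check on these figures, the size of the weights alone can be estimated as parameters × bits per weight ÷ 8. The sketch below uses the ~1.67B parameter count and the bits-per-weight values listed in this card. It deliberately ignores the KV cache, runtime buffers, and any tensors stored at a different precision, which is presumably why the table's numbers come out somewhat higher.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 1.67e9  # from the Model Details table
for name, bpw in {"Q6_K": 6.57, "Q4_K_M": 4.85, "Q2_K": 3.35}.items():
    print(f"{name}: ~{weights_gb(N_PARAMS, bpw):.2f} GB of weights")
```

Actual memory use grows with context length, so treat both this estimate and the table as starting points rather than hard limits.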
## v1 vs v2

This model (v1) is the **Phase 1 champion**, focused on surgical precision with minimal operations.

| | v1 (this model) | [v2](https://huggingface.co/g023/Qwen3-g023-tiny-v2-GGUF) |
|:---|:---:|:---:|
| Layers | 27 | 30 |
| Parameters | ~1.67B | ~1.82B |
| Operations | del + swap | swap + interpolate + bridge |
| Score | 92.9 / 100 | 94.3 / 100 |
| Factual | 100% (17/17) | 94% (16/17) |
| Perplexity | 15.70 | 15.17 |
| Use Case | Max factual accuracy | Max overall score |

- **v1** is recommended when factual accuracy is paramount (100% vs 94%).
- **v2** is recommended when overall quality matters more (94.3 vs 92.9).

## Methodology

Layer surgery was performed through a systematic, test-driven development process:

1. **Phase 1**: Exhaustive search across 150+ configurations testing deletion, duplication, swapping, interpolation, and combined operations
2. **Evaluation**: Each configuration was scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
3. **Selection**: The champion was selected based on overall score, with factual accuracy as a tiebreaker

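The selection rule in step 3 (highest overall score, factual accuracy as tiebreaker) can be expressed as a tuple comparison. This is a minimal sketch, not the surgery framework's code: the `Result` type, the configuration names, and the numbers are all hypothetical and chosen only to exercise the tiebreaker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    score: float          # overall score out of 100
    factual_correct: int  # factual questions answered correctly, out of 17


def pick_champion(results: list[Result]) -> Result:
    # Tuples compare left to right: score first, factual accuracy breaks ties.
    return max(results, key=lambda r: (r.score, r.factual_correct))


# Hypothetical values, not results from the actual search.
candidates = [
    Result("config-a", 90.0, 15),
    Result("config-b", 91.5, 16),
    Result("config-c", 91.5, 17),
]
print(pick_champion(candidates).name)  # config-c
```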
The surgery framework is available in the [source repository](https://huggingface.co/g023/Qwen3-g023-tiny-v1-GGUF).

## Credits

- **Base model**: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by the Qwen team at Alibaba
- **Quantization**: llama.cpp
- **Surgery**: g023

## License

Apache 2.0, the same license as the original Qwen3-1.7B model.