---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-1.7B
tags:
- qwen3
- gguf
- layer-surgery
- small-language-model
- pruned
- optimized
- thinking
- text-generation
model_name: Qwen3-g023-tiny-v1
pipeline_tag: text-generation
library_name: llama.cpp
quantized_by: g023
---

# Qwen3-g023-tiny-v1 - GGUF

**A surgically optimized 27-layer Qwen3 variant that outperforms the original 28-layer model.**

It was created by deleting one harmful layer and swapping two adjacent layers to improve information flow. It scores **92.9/100** with **100% factual accuracy** (17/17), a 5.1-point improvement over the original Qwen3-1.7B baseline (87.8/100).

## Available Quantizations

| Quantization | Bits/Weight | Description | Download |
|:---:|:---:|:---|:---:|
| **Q8_0** | 8.00 | Highest quality, virtually lossless (recommended) | [Qwen3-g023-tiny-v1-Q8_0.gguf](./Qwen3-g023-tiny-v1-Q8_0.gguf) |
| **Q6_K** | 6.57 | Excellent quality, good compression | [Qwen3-g023-tiny-v1-Q6_K.gguf](./Qwen3-g023-tiny-v1-Q6_K.gguf) |
| **Q4_K_M** | 4.85 | Good balance of quality and size | [Qwen3-g023-tiny-v1-Q4_K_M.gguf](./Qwen3-g023-tiny-v1-Q4_K_M.gguf) |
| **Q3_K_M** | 3.91 | High compression, moderate quality loss | [Qwen3-g023-tiny-v1-Q3_K_M.gguf](./Qwen3-g023-tiny-v1-Q3_K_M.gguf) |
| **Q2_K** | 3.35 | Maximum compression, significant quality loss | [Qwen3-g023-tiny-v1-Q2_K.gguf](./Qwen3-g023-tiny-v1-Q2_K.gguf) |

## Model Details

| Parameter | Value |
|:---|:---|
| Architecture | Qwen3ForCausalLM |
| Layers | **27** (28 original − 1 deleted) |
| Hidden Size | 2,048 |
| Intermediate Size | 6,144 |
| Attention Heads | 16 query / 8 key-value (GQA) |
| Head Dimension | 128 |
| Vocabulary | 151,936 tokens |
| Max Context | 40,960 tokens |
| RoPE θ | 1,000,000 |
| Tied Embeddings | Yes |
| Total Parameters | **~1.67B** |
| Precision (source) | bfloat16 |

## Surgery Operations

This model was created by applying two surgical operations to [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B):

1. **Delete layer 10**: Layer 10 was identified as harmful to model quality. Removing it improved the overall score from 87.8 to 91.4.
2. **Swap layers 11 ↔ 12** (post-deletion indices): Swapping these adjacent layers improved information flow between the model's middle layers, raising the score further to 92.9.

### Key Findings

- **Smaller is better**: The 27-layer model outperforms both the 28-layer original and various 29–30 layer expanded models.
- **Layer 10 is actively harmful**: Removing it alone yields a +3.6 point improvement.
- **Operations compound selectively**: Deletion + swap works, but deletion + duplication degrades quality.

## Benchmark Results

| Metric | Original (28L) | **v1 (27L)** | Δ |
|:---|:---:|:---:|:---:|
| **Overall Score** | 87.8 / 100 | **92.9 / 100** | **+5.1** |
| **Factual Accuracy** | 15 / 17 (88%) | **17 / 17 (100%)** | **+12 pp** |
| Avg Perplexity | n/a | 15.70 | n/a |
| Thinking Mode | ✅ | ✅ | n/a |
| Non-Thinking Mode | ✅ | ✅ | n/a |

Both models were evaluated with a test suite comprising 17 factual questions, 2 completion-coherence tests, perplexity measurements, repetition analysis, and thinking/non-thinking mode verification.
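As a rough illustration of the two operations listed under Surgery Operations, the sketch below applies them to a plain Python list that stands in for the 28 decoder layers. It assumes 0-based layer indices (the card does not state the convention), and the helpers `delete_layer` and `swap_layers` are hypothetical names chosen for this example, not part of the author's surgery framework.

```python
# Hypothetical sketch: each integer stands in for one decoder layer,
# labeled by its position in the original model. Assumes 0-based indices.

def delete_layer(layers, index):
    """Return a new layer list with the layer at `index` removed."""
    return layers[:index] + layers[index + 1:]


def swap_layers(layers, i, j):
    """Return a new layer list with positions `i` and `j` exchanged."""
    swapped = list(layers)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


original = list(range(28))            # 28 layers in Qwen3-1.7B
pruned = delete_layer(original, 10)   # step 1: delete layer 10
final = swap_layers(pruned, 11, 12)   # step 2: swap post-deletion 11 and 12

print(len(final))     # 27
print(final[9:14])    # [9, 11, 13, 12, 14]
```

Under this reading, post-deletion positions 11 and 12 hold what were originally layers 12 and 13, so the final stack runs 0 to 9, then 11, 13, 12, then 14 to 27.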
## Features

- **Thinking mode**: Full `<think>` / `</think>` reasoning support, toggled via the `enable_thinking` parameter
- **Non-thinking mode**: Direct responses without chain-of-thought overhead
- **Tool calling**: Full function/tool calling support
- **System prompts**: Standard system message support
- **Chat template**: Qwen3 ChatML template embedded in the GGUF

## Usage

### With Ollama

```bash
# Download the GGUF and create from Modelfile
cat > Modelfile << 'EOF'
FROM ./Qwen3-g023-tiny-v1-Q8_0.gguf

PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 45
PARAMETER min_p 0.1
PARAMETER num_ctx 40000
PARAMETER mirostat 2
PARAMETER mirostat_tau 5.0
PARAMETER mirostat_eta 0.1
PARAMETER repeat_last_n 16384
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0.5
PARAMETER frequency_penalty 1.0

TEMPLATE """{{- if .System }}
<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}
<|im_start|>user
{{ .Content }}<|im_end|>
{{- else if eq .Role "assistant" }}
<|im_start|>assistant
{{ .Content }}<|im_end|>
{{- end }}
{{- end }}
<|im_start|>assistant
"""

SYSTEM "You are a helpful assistant."
EOF

ollama create qwen3-tiny-v1 -f Modelfile
ollama run qwen3-tiny-v1
```

### With llama.cpp

```bash
# Interactive chat
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  --chat-template chatml -cnv

# Thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\nExplain quantum computing<|im_end|>\n<|im_start|>assistant\n<think>\n" \
  -n 512

# Non-thinking mode
llama-cli -m Qwen3-g023-tiny-v1-Q8_0.gguf \
  -p "<|im_start|>user\n/no_think What is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
  -n 128
```

### With Python (llama-cpp-python)

```python
from llama_cpp import Llama

# Load the quantized model with a 4,096-token context window
model = Llama(model_path="Qwen3-g023-tiny-v1-Q8_0.gguf", n_ctx=4096)

response = model.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.6,
)

# The reply text is in the first choice's message
print(response["choices"][0]["message"]["content"])
```

## System Requirements

| Quantization | RAM (CPU) | VRAM (GPU) |
|:---:|:---:|:---:|
| Q8_0 | ~2.0 GB | ~2.0 GB |
| Q6_K | ~1.7 GB | ~1.7 GB |
| Q4_K_M | ~1.3 GB | ~1.3 GB |
| Q3_K_M | ~1.1 GB | ~1.1 GB |
| Q2_K | ~0.9 GB | ~0.9 GB |

## v1 vs v2

This model (v1) is the **Phase 1 champion**, focused on surgical precision with minimal operations.

| | v1 (this model) | [v2](https://huggingface.co/g023/Qwen3-g023-tiny-v2-GGUF) |
|:---|:---:|:---:|
| Layers | 27 | 30 |
| Parameters | ~1.67B | ~1.82B |
| Operations | del + swap | swap + interpolate + bridge |
| Score | 92.9 / 100 | 94.3 / 100 |
| Factual | 100% (17/17) | 94% (16/17) |
| Perplexity | 15.70 | 15.17 |
| Use Case | Max factual accuracy | Max overall score |

**v1** is recommended when factual accuracy is paramount (100% vs 94%). **v2** is recommended when overall quality matters more (94.3 vs 92.9).

## Methodology

Layer surgery was performed through a systematic, test-driven development process:

1. **Phase 1**: Exhaustive search across 150+ configurations testing deletion, duplication, swapping, interpolation, and combined operations
2. **Evaluation**: Each configuration was scored on factual accuracy (17 questions), completion coherence, perplexity, repetition ratio, and thinking mode functionality
3. **Selection**: The champion was selected based on overall score, with factual accuracy as a tiebreaker

The surgery framework is available in the [source repository](https://huggingface.co/g023/Qwen3-g023-tiny-v1-GGUF).

## Credits

- **Base model**: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by the Qwen team at Alibaba
- **Quantization**: llama.cpp
- **Surgery**: g023

## License

Apache 2.0, the same license as the original Qwen3-1.7B model.