---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- becnic/Qwen3-4B-Thinking-2507-Heretic
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
---

# Qwen3-4B-Thinking-2507-Heretic-GGUF

## llama.cpp imatrix quantizations of Qwen3-4B-Thinking-2507-Heretic by becnic (from the original Qwen3-4B-Thinking-2507)

Using <a href="https://github.com/ggerganov/llama.cpp/">llama.cpp</a> release <a href="https://github.com/ggerganov/llama.cpp/releases/tag/b7120">b7120</a> for quantization.

Original model: https://huggingface.co/becnic/Qwen3-4B-Thinking-2507-Heretic

Run them in [LM Studio](https://lmstudio.ai/).

Run them directly with [llama.cpp](https://github.com/ggerganov/llama.cpp), or any other llama.cpp-based project.

## Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| [Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf](https://huggingface.co/becnic/Qwen3-4B-Thinking-2507-Heretic-GGUF/blob/main/Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf) | Q8_0 | 4.28GB | false | Extremely high quality |

## Downloading using huggingface-cli

<details>
<summary>Click to view download instructions</summary>

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download becnic/Qwen3-4B-Thinking-2507-Heretic-GGUF --include "Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf" --local-dir ./
```

If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:

```
huggingface-cli download becnic/Qwen3-4B-Thinking-2507-Heretic-GGUF --include "Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf/*" --local-dir ./
```

</details>

## Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| **direction_index** | 19.42 |
| **attn.o_proj.max_weight** | 1.23 |
| **attn.o_proj.max_weight_position** | 22.34 |
| **attn.o_proj.min_weight** | 0.69 |
| **attn.o_proj.min_weight_distance** | 10.42 |
| **mlp.down_proj.max_weight** | 1.12 |
| **mlp.down_proj.max_weight_position** | 29.64 |
| **mlp.down_proj.min_weight** | 1.08 |
| **mlp.down_proj.min_weight_distance** | 20.24 |

## Performance

| Metric | This model | Original model ([Qwen/Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)) |
| :----- | :--------: | :---------------------------: |
| **KL divergence** | 0.06 | 0 *(by definition)* |
| **Refusals** | 6/100 | 96/100 |

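For intuition, KL divergence here measures how far the abliterated model's token distribution drifts from the original's (0.06 indicates a small shift). A minimal sketch of the discrete formula, using toy distributions rather than the actual evaluation data:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A model compared against itself diverges by 0, hence "0 (by definition)":
p = [0.7, 0.2, 0.1]
print(kl_divergence(p, p))  # 0.0
```
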
## Model Overview

**Qwen3-4B-Thinking-2507** has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: **262,144 natively**

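The GQA figures above mean 32 query heads share 8 key/value heads, so the KV cache holds a quarter of the head states full multi-head attention would need. A rough back-of-the-envelope sketch (the head dimension is an assumption, not stated in this card):

```python
# KV-cache arithmetic for the GQA configuration listed above.
num_q_heads = 32
num_kv_heads = 8
num_layers = 36
head_dim = 128        # assumed head dimension -- not stated in this card
bytes_per_elem = 2    # fp16/bf16 cache entries

# K and V tensors, per cached token, across all layers:
gqa_bytes = 2 * num_kv_heads * head_dim * bytes_per_elem * num_layers
mha_bytes = 2 * num_q_heads * head_dim * bytes_per_elem * num_layers

print(gqa_bytes)               # 147456 bytes (144 KiB) per cached token
print(mha_bytes // gqa_bytes)  # 4 -> GQA shrinks the cache 4x
```
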
**NOTE: This model supports only thinking mode, and specifying `enable_thinking=True` is no longer required.**

Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.
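
Because only the closing tag appears, output parsing should split on `</think>` rather than expecting a matched pair. A minimal sketch (the function name is illustrative, not part of any API):

```python
def split_thinking(output: str):
    """Separate reasoning from the final answer in thinking-mode output.

    The default chat template pre-inserts <think>, so the model's text
    contains only the closing </think> tag.
    """
    reasoning, sep, answer = output.partition("</think>")
    if not sep:  # no closing tag: treat everything as the answer
        return "", output.strip()
    return reasoning.strip(), answer.strip()

reasoning, answer = split_thinking("Let me check...</think>The answer is 4.")
print(answer)  # The answer is 4.
```
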
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).
**Supported languages:** English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.