---
license: mpl-2.0
base_model: Igriscodes/qwen3-4b-tool
tags:
- tool-use
- function-calling
- reinforcement-learning
- mcp
- gguf
- quantized
pipeline_tag: text-generation
language:
- en
---

# Qwen3-4B-Agentic-MCP-RL - GGUF

This repository contains GGUF quantizations of [Igriscodes/qwen3-4b-tool](https://huggingface.co/Igriscodes/qwen3-4b-tool), a fine-tune of `Qwen/Qwen3-4B` optimized for multi-step tool use and structured payload delivery via the **Model Context Protocol (MCP)**.

The base model was aligned using **Proximal Policy Optimization (PPO)** on strict JSON validation, execution tracking, and tool-error recovery loops. These GGUF files enable low-latency, low-memory local inference on edge devices, CPU-only systems, and Apple Silicon.

## Available Quantizations

* **Q2_K**: Maximum compression. Significant loss of reasoning quality; not recommended for complex tool use, but fits on ultra-low-memory devices.
* **Q3_K_M**: Balanced 3-bit compression. Better reasoning than Q2_K; suitable for highly constrained memory footprints.
* **Q4_0**: Legacy 4-bit quantization. Faster on some older hardware, but slightly lower quality than the K-quants.
* **Q4_K_M**: **Recommended.** Best balance of reasoning performance, generation speed, and memory savings.
* **Q5_0**: Legacy 5-bit quantization. A reasonable middle ground, but outperformed by the K-quants.
* **Q5_K_M**: High-quality 5-bit compression. Retains nearly all unquantized capability while saving substantial memory.
* **Q6_K**: 6-bit quantization. Near-zero degradation from F16 at a noticeably smaller file size.
* **Q8_0**: Highest-fidelity quantization (8-bit). Extremely close to F16 quality; ideal when strict syntax and reliable tool calling matter most.
* **F16**: Unquantized 16-bit baseline. Full fidelity, for systems with memory to spare.
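As a rough aid for choosing among the files above, the sketch below estimates file size from an approximate bits-per-weight figure and picks the highest-precision quantization that fits a memory budget. The bits-per-weight values, the 4-billion parameter count, and the 1 GB runtime overhead are all assumptions for illustration; check the actual file sizes in this repository before relying on them.

```python
# Approximate bits per weight for each quantization type.
# ASSUMPTION: rough figures only; real files differ because K-quant
# mixes store different tensors at different precisions.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q3_K_M": 3.9,
    "Q4_0": 4.5,
    "Q4_K_M": 4.8,
    "Q5_0": 5.5,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16": 16.0,
}

# ASSUMPTION: nominal parameter count for a "4B" model.
ASSUMED_PARAMS = 4.0e9


def estimate_size_gb(quant, n_params=ASSUMED_PARAMS):
    """Estimate the weight size in GB (10^9 bytes) for one quantization."""
    return n_params * APPROX_BITS_PER_WEIGHT[quant] / 8 / 1e9


def pick_quant(budget_gb, overhead_gb=1.0):
    """Return the highest-precision quant whose estimated size plus a
    fixed overhead (KV cache, buffers) fits the budget, or None."""
    best = None
    for quant, bits in APPROX_BITS_PER_WEIGHT.items():
        if estimate_size_gb(quant) + overhead_gb > budget_gb:
            continue
        if best is None or bits > APPROX_BITS_PER_WEIGHT[best]:
            best = quant
    return best
```

For example, `pick_quant(4.0)` evaluates each estimate against a 4 GB budget; a longer context window needs a larger `overhead_gb`.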

## Local Deployment Quickstart

### Using Ollama
Ollama can run GGUF models directly from Hugging Face via the `hf.co/` prefix. Append a quantization tag to pull and run that specific file:

```bash
# Q2_K (Extreme compression)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q2_K

# Q3_K_M (Medium 3-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q3_K_M

# Q4_0 (Legacy 4-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q4_0

# Q4_K_M (Recommended balanced version)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q4_K_M

# Q5_0 (Legacy 5-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q5_0

# Q5_K_M (High-fidelity 5-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q5_K_M

# Q6_K (Deep 6-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q6_K

# Q8_0 (Near-lossless 8-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q8_0

# F16 (High-fidelity unquantized float version)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:F16

```
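Once a model is running, tool definitions can be passed programmatically through Ollama's local REST API. The following is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default port (11434) and that the pulled model's chat template exposes tool support to Ollama; `get_weather` is a hypothetical tool invented for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint
MODEL = "hf.co/Igriscodes/qwen3-4b-tool-gguf:Q4_K_M"

# Hypothetical tool definition, used only for illustration.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_chat_request(prompt, tools, model=MODEL):
    """Build a non-streaming /api/chat payload with tool definitions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "stream": False,
    }


def chat(payload, url=OLLAMA_URL):
    """POST the payload to Ollama and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_tool_calls(response):
    """Return the list of tool calls in a chat response (possibly empty)."""
    return response.get("message", {}).get("tool_calls") or []
```

A typical call would be `extract_tool_calls(chat(build_chat_request("What is the weather in Paris?", [WEATHER_TOOL])))`, after which each entry's `function.name` and `function.arguments` can be dispatched to real code.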

### Using Ollama with a Custom Modelfile

Alternatively, to run this model locally with full tool-calling (function calling) and thinking support, you can package a downloaded GGUF file into an **Ollama** model using the template configuration provided here.

### 1. Create the Modelfile

Save the configuration block below, unchanged, as a file named `Modelfile` in the directory that contains your downloaded GGUF file.

> 💡 **Note:** If you are using a different quantization format than the `q4_k_m` example below, make sure to update the `FROM` line to match your exact `.gguf` filename.

```text
# Point to your quantized GGUF file
FROM ./qwen3-4b-tool-q4_k_m.gguf

# Custom template optimizing tool-use syntax and thought blocks
TEMPLATE """{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}{{ .System }}

{{ end }}
{{- if .Tools }}# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}
<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) -}}
<think>{{ .Thinking }}</think>
{{ end -}}
{{ if .Content }}{{ .Content }}{{ end }}
{{- if .ToolCalls }}
{{- range .ToolCalls }}
<tool_call>
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
</tool_call>
{{- end }}
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
<think>
{{ end }}
{{- end }}"""

# Inference parameters optimized for structured reasoning
PARAMETER temperature 0.6
PARAMETER num_ctx 8192
PARAMETER num_gpu -1
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

```
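The template instructs the model to wrap each call in `<tool_call></tool_call>` tags containing a JSON object with `name` and `arguments`. When you consume raw model text (for example, outside Ollama's own tool-call parsing), that format can be extracted and validated with a sketch like the one below. The function name and the error messages are illustrative choices, not part of the model or of Ollama.

```python
import json
import re

# Matches the body of each <tool_call>...</tool_call> block.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return (calls, errors) extracted from raw model output.

    calls  -- list of dicts with a string 'name' and a dict 'arguments'
    errors -- list of human-readable messages for malformed blocks
    """
    calls, errors = [], []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc.msg}")
            continue
        if (
            not isinstance(obj, dict)
            or not isinstance(obj.get("name"), str)
            or not isinstance(obj.get("arguments"), dict)
        ):
            errors.append("expected an object with 'name' and 'arguments'")
            continue
        calls.append(obj)
    return calls, errors
```

Collecting errors instead of raising lets a caller feed the message back to the model as a tool response and give it a chance to retry.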

### 2. Build and Run the Model

Open your terminal, navigate to the directory containing your `Modelfile` and your `.gguf` file, and execute the build command:

```bash
ollama create qwen3-4b-tool --file Modelfile

```

Once the build completes, launch the model and interact with it through Ollama:

```bash
ollama run qwen3-4b-tool

```
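With the model built, multi-step tool use amounts to a loop: send the conversation, execute any requested tools, append each result as a `tool` message, and repeat until the model answers in plain text. The sketch below shows one way to structure that loop. It is transport-agnostic: `chat_fn` is a hypothetical callable you supply (for instance, a thin wrapper around Ollama's chat API) that takes the message list and returns the assistant message as a dict.

```python
import json


def run_tool_loop(chat_fn, tools, messages, max_steps=5):
    """Drive a tool-calling conversation until the model stops calling tools.

    chat_fn  -- callable(messages) -> assistant message dict, which may
                contain a 'tool_calls' list (hypothetical; supplied by you)
    tools    -- dict mapping tool names to Python callables
    messages -- conversation so far; extended in place
    """
    for _ in range(max_steps):
        reply = chat_fn(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content", "")
        for call in tool_calls:
            function = call["function"]
            try:
                result = tools[function["name"]](**function.get("arguments", {}))
                content = json.dumps(result)
            except Exception as exc:
                # Report the failure to the model so it can recover.
                content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
            messages.append({"role": "tool", "content": content})
    raise RuntimeError("tool loop did not finish within max_steps")
```

Unknown tool names and bad arguments are returned to the model as error payloads rather than raised, and `max_steps` bounds the loop in case the model never stops calling tools.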

### Using Python (`llama-cpp-python`)

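First, install the library. `Llama.from_pretrained` downloads files from the Hugging Face Hub, so it also requires `huggingface-hub`:

```bash
pip install llama-cpp-python huggingface-hub
```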

Depending on your hardware constraints, load either the unquantized F16 file or a quantized file using the snippets below:

#### Option 1: High Fidelity (F16 Precision)

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Igriscodes/qwen3-4b-tool-gguf",
    filename="qwen3-4b-tool-f16.gguf",
    n_ctx=2048,
    n_gpu_layers=-1 # Use -1 to offload all layers to GPU (Metal/CUDA)
)

```

#### Option 2: Low Resource (Q4 Quantization)

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Igriscodes/qwen3-4b-tool-gguf",
    filename="qwen3-4b-tool-q4_k_m.gguf",
    n_ctx=2048,
    n_gpu_layers=-1 # -1 offloads all layers to the GPU; set to 0 for CPU-only execution
)

```
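After loading, a chat turn can be generated with `create_chat_completion`. The helper below is a small sketch that reuses the sampling values from the Modelfile above (`temperature` 0.6, `top_p` 0.95, `top_k` 20). The `ask` function and its `max_tokens` default are illustrative choices, and `llm` is assumed to be a `Llama` instance created as in either option.

```python
def ask(llm, prompt, system=None, max_tokens=512):
    """Send one user prompt and return the assistant's text reply.

    llm is expected to be a llama_cpp.Llama instance, e.g. one created
    with Llama.from_pretrained(...) as in the snippets above.
    """
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    response = llm.create_chat_completion(
        messages=messages,
        temperature=0.6,  # same sampling values as the Modelfile
        top_p=0.95,
        top_k=20,
        max_tokens=max_tokens,  # illustrative cap, adjust as needed
    )
    return response["choices"][0]["message"]["content"]
```

Usage would look like `print(ask(llm, "List the files in the current directory."))`. Because this is a thinking model, the returned text may include a `<think>` block before the final answer.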