---
license: mpl-2.0
base_model: Igriscodes/qwen3-4b-tool
tags:
- tool-use
- function-calling
- reinforcement-learning
- mcp
- gguf
- quantized
pipeline_tag: text-generation
language:
- en
---

# Qwen3-4B-Agentic-MCP-RL - GGUF

This repository contains the GGUF quantization files for [Igriscodes/qwen-tool](https://huggingface.co/Igriscodes/qwen-tool), a fine-tuned `Qwen/Qwen3-4B` model optimized for multi-step tool use and structured payload delivery via the **Model Context Protocol (MCP)**.

The base model was aligned using **Proximal Policy Optimization (PPO)** on strict JSON validation, execution tracking, and tool-error recovery loops. These GGUF files allow low-latency, low-memory local inference on edge devices, CPU-only systems, and Apple Silicon.

## Available Quantizations

* **Q2_K**: Maximum compression. Significant loss in reasoning quality; not recommended for complex tool use, but fits on ultra-low-memory devices.
* **Q3_K_M**: Balanced 3-bit compression. Better reasoning than Q2_K; suitable for highly constrained memory budgets.
* **Q4_0**: Legacy 4-bit quantization. Faster on certain older hardware architectures, but slightly lower quality than the K-quants.
* **Q4_K_M**: **Recommended.** Best balance of reasoning performance, generation speed, and memory savings.
* **Q5_0**: Legacy 5-bit quantization. A reasonable middle ground, but outperformed by the K-quants.
* **Q5_K_M**: High-quality 5-bit compression. Retains nearly all unquantized capability while saving substantial memory.
* **Q6_K**: 6-bit quantization. Near-zero degradation from F16 with a noticeably smaller file.
* **Q8_0**: 8-bit quantization. Extremely close to F16 quality; ideal for strict syntax and reliable tool calling.
* **F16**: Unquantized baseline. Highest fidelity of the files listed here, for systems with more memory headroom.

## Local Deployment Quickstart

### Using Ollama

Ollama supports running models directly from Hugging Face via the `hf.co` registry prefix.
You can pull and run your preferred precision with a single command:

```bash
# Q2_K (Extreme compression)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q2_K

# Q3_K_M (Medium 3-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q3_K_M

# Q4_0 (Legacy 4-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q4_0

# Q4_K_M (Recommended balanced version)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q4_K_M

# Q5_0 (Legacy 5-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q5_0

# Q5_K_M (High-fidelity 5-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q5_K_M

# Q6_K (Deep 6-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q6_K

# Q8_0 (Near-lossless 8-bit)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:Q8_0

# F16 (High-fidelity unquantized float version)
ollama run hf.co/Igriscodes/qwen3-4b-tool-gguf:F16
```

**or**

## Ollama Setup Guide

To run this model locally with full tool-calling (function calling) and thinking capabilities, you can package it as an **Ollama** model using the template configuration provided below.

### 1. Create the Modelfile

Save the configuration block below, unchanged, as a file named `Modelfile` in the same directory as your downloaded GGUF file.

> 💡 **Note:** If you are using a different quantization than the `q4_k_m` example below, update the `FROM` line to match your exact `.gguf` filename.

```text
# Point to your quantized GGUF file
FROM ./qwen3-4b-tool-q4_k_m.gguf

# Custom template optimizing tool-use syntax and thought blocks
TEMPLATE """{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}{{ .System }}

{{ end }}
{{- if .Tools }}# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}
<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) -}}
<think>{{ .Thinking }}</think>
{{ end -}}
{{ if .Content }}{{ .Content }}{{ end }}
{{- if .ToolCalls }}
{{- range .ToolCalls }}
<tool_call>
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
</tool_call>
{{- end }}
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}"""

# Inference parameters optimized for structured reasoning
PARAMETER temperature 0.6
PARAMETER num_ctx 8192
PARAMETER num_gpu -1
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
```

### 2. Build and Run the Model

Open your terminal, navigate to the directory containing your `Modelfile` and your `.gguf` file, and run the build command:

```bash
ollama create qwen3-4b-tool --file Modelfile
```

Once the build completes, you can launch and interact with your new custom model via Ollama:

```bash
ollama run qwen3-4b-tool
```

### Using Python (`llama-cpp-python`)

First, install the library. `Llama.from_pretrained` downloads files from the Hugging Face Hub, so it also needs the `huggingface-hub` package:

```bash
pip install llama-cpp-python huggingface-hub
```

Depending on your hardware constraints, you can load either the unquantized F16 file or a quantized version using the snippets below:

#### Option 1: High Fidelity (F16 Precision)

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Igriscodes/qwen3-4b-tool-gguf",
    filename="qwen3-4b-tool-f16.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU (Metal/CUDA)
)
```

#### Option 2: Low Resource (Q4_K_M Quantization)

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Igriscodes/qwen3-4b-tool-gguf",
    # Filename follows the Modelfile example above; adjust it to the exact
    # .gguf name of the quantization you want from the repository.
    filename="qwen3-4b-tool-q4_k_m.gguf",
    n_ctx=2048,
    # -1 offloads all layers to the GPU. Set this to 0 for CPU-only
    # execution, or to a smaller positive number when VRAM is limited.
    n_gpu_layers=-1,
)
```
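
Once the model is loaded, the generated text still has to be turned into structured tool calls before anything can be executed. The block below is a minimal sketch of that step, not part of this repository: it assumes the model wraps each call in `<tool_call>...</tool_call>` tags, as Qwen-style chat templates conventionally do, and that each payload is a JSON object with a string `name` and an object `arguments`. The helper `extract_tool_calls` and the `get_weather` tool are hypothetical names used only for illustration.

```python
import json
import re

# Assumed wrapper tags: Qwen-style templates conventionally use <tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Parse raw model output into (calls, errors).

    calls  -> list of {"name": str, "arguments": dict} payloads that validated
    errors -> human-readable messages for payloads that did not
    """
    calls, errors = [], []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc.msg}")
            continue
        if not isinstance(payload, dict):
            errors.append("payload is not a JSON object")
            continue
        name = payload.get("name")
        args = payload.get("arguments")
        if not isinstance(name, str) or not isinstance(args, dict):
            errors.append("payload needs a string 'name' and an object 'arguments'")
            continue
        calls.append({"name": name, "arguments": args})
    return calls, errors


# Hypothetical model output: one valid call and one malformed call.
sample = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>\n"
    "<tool_call>\n{not json}\n</tool_call>"
)
calls, errors = extract_tool_calls(sample)
print(calls)
print(len(errors))
```

In practice the `text` argument would be the model's reply, for example the `content` string returned by `llm.create_chat_completion(...)`. Returning the error messages rather than raising lets the calling loop send them back to the model as a tool message, giving it a chance to retry with a corrected payload.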