[Feat] Support native Kimi-K2-Thinking W4A16 quantized expert weights (#4516)
### What this PR does / why we need it?
Adds a W4A16 quantization method for the Kimi-K2-Thinking model and
updates the relevant modules to support it.
- Implements the complete W4A16 quantization method, including weight
packing/unpacking, per-group quantization parameter generation,
post-processing logic, and MoE method application (see the sketch below).
- Adds the parameters `use_int4_w4a16`, `w1_offset`, and `w2_offset`, and adjusts
the `with_quant` conditional logic to support W4A16 matrix multiplication.
- Adds a `packed_modules_model_mapping` entry for the Kimi-K2-Thinking model and
processing logic for the `weight_packed` field.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
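
As background for the scheme above, here is a minimal NumPy sketch of per-group asymmetric INT4 (W4A16) weight quantization with two 4-bit values packed per byte. The function name, shapes, and group size are illustrative assumptions for this PR description, not the code added by this PR.

```python
import numpy as np

def quantize_w4a16_per_group(weight: np.ndarray, group_size: int = 128):
    """Illustrative per-group asymmetric INT4 quantization (hypothetical helper).

    weight: float matrix of shape [out_features, in_features], with
    in_features divisible by group_size. Returns packed uint8 weights
    (two INT4 values per byte) plus per-group scales and offsets.
    """
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)

    # Asymmetric 4-bit range [0, 15]: scale and offset are computed per group.
    w_min = w.min(axis=-1, keepdims=True)
    w_max = w.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0
    scale = np.where(scale == 0, 1.0, scale)
    offset = np.round(-w_min / scale)

    # Quantize, then restore the original [out_features, in_features] layout.
    q = np.clip(np.round(w / scale + offset), 0, 15).astype(np.uint8)
    q = q.reshape(out_f, in_f)

    # Pack two 4-bit values into each byte (low nibble first).
    packed = (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)

    # Dequantization at matmul time is (q - offset) * scale per group.
    return packed, scale.squeeze(-1), offset.squeeze(-1)
```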
---------
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>
@@ -12,6 +12,7 @@ single_npu_qwen3_w4a4
 single_node_pd_disaggregation_mooncake
 multi_npu_qwen3_next
 multi_npu
+multi_npu_kimi-k2-thinking
 multi_npu_moge
 multi_npu_qwen3_moe
 multi_npu_quantization

docs/source/tutorials/multi_npu_kimi-k2-thinking.md (new file, 107 lines)
@@ -0,0 +1,107 @@
# Multi-NPU (Kimi-K2-Thinking)

## Run with Docker

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running with a bridge network in Docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci8 \
    --device /dev/davinci9 \
    --device /dev/davinci10 \
    --device /dev/davinci11 \
    --device /dev/davinci12 \
    --device /dev/davinci13 \
    --device /dev/davinci14 \
    --device /dev/davinci15 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/sfs_turbo/.cache:/home/cache \
    -it $IMAGE bash
```

## Verify the Quantized Model

Edit the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` to `["MoE"]` in the `config.json` of the original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking):

```json
{
    "quantization_config": {
        "config_groups": {
            "group_0": {
                "targets": [
                    "MoE"
                ]
            }
        }
    }
}
```
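
If you prefer to apply this change with a script rather than by hand, a minimal sketch is shown below; it assumes the model was downloaded to `./Kimi-K2-Thinking`, so adjust the path as needed.

```python
import json

# Assumed local download path of the original model; adjust as needed.
config_path = "Kimi-K2-Thinking/config.json"

with open(config_path) as f:
    config = json.load(f)

# Switch the quantization targets from ["Linear"] to ["MoE"].
config["quantization_config"]["config_groups"]["group_0"]["targets"] = ["MoE"]

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```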

Your model files should look like this:

```bash
.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
```

## Online Inference on Multi-NPU

Run the following command to start the vLLM server on Multi-NPU.

For an Atlas 800 A3 (64G*16) node, `--tensor-parallel-size` should be at least 16.

```bash
vllm serve Kimi-K2-Thinking \
    --served-model-name kimi-k2-thinking \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --trust-remote-code \
    --no-enable-prefix-caching
```

Once the server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "kimi-k2-thinking",
    "messages": [
        {"role": "user", "content": "Who are you?"}
    ],
    "temperature": 1.0
}'
```
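
Equivalently, you can query the server from Python. The sketch below assumes the `openai` client package is installed and that the server is listening on the default port 8000.

```python
from openai import OpenAI

# No real API key is needed; vLLM's OpenAI-compatible server accepts any string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="kimi-k2-thinking",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=1.0,
)
print(response.choices[0].message.content)
```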