xc-llm-ascend/docs/source/tutorials/Kimi-K2-Thinking.md

# Kimi-K2-Thinking

## Run with Docker

```{code-block} bash
   :substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
```

## Verify the Quantized Model
Please be advised to edit the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` into `["MoE"]` in `config.json` of original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking).

```json
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": [
          "MoE"
        ]
      }
    }
  }
}
```

Your model files look like:

```bash
.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
```

## Online Inference on Multi-NPU

Run the following script to start the vLLM server on Multi-NPU:

For an Atlas 800 A3 (64G*16) node, tensor-parallel-size should be at least 16.

```bash
vllm serve Kimi-K2-Thinking \
--served-model-name kimi-k2-thinking \
--tensor-parallel-size 16 \
--enable_expert_parallel \
--trust-remote-code \
--no-enable-prefix-caching
```

Once your server is started, you can query the model with input prompts.

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "kimi-k2-thinking",
  "messages": [
    {"role": "user", "content": "Who are you?"}
  ],
  "temperature": 1.0
}'
```
[Doc] Update tutorial index (#4920) Update tutorial index and remove useless doc - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> 2025-12-11 20:53:13 +08:00			`# Kimi-K2-Thinking`
[Feat] Support native Kimi-K2-Thinking native W4A16 quantized experts weights (#4516) ### What this PR does / why we need it? Adds W4A16 quantization method for the Kimi-K2-Thinking model and updates relevant modules to support the new quantization method. - Implements complete W4A16 quantization method including weight packing/unpacking, per-group quantization parameter generation, post-processing logic and MoE method application. - Adds parameters `use_int4_w4a16`, `w1_offset` and `w2_offset`, adjusts `with_quant` conditional logic to support W4A16 matrix multiplication. - Adds `packed_modules_model_mapping` for Kimi-K2-Thinking model and processing logic for `weight_packed` field. - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 --------- Signed-off-by: zhoux77899 <zhouxiang100@huawei.com> Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com> Signed-off-by: Ruri <zhouxiang100@huawei.com> 2025-12-10 15:58:52 +08:00
			`## Run with Docker`

			```{code-block} bash
			`:substitutions:`
			`# Update the vllm-ascend image`
			`export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:\|vllm_ascend_version\|`
			`export NAME=vllm-ascend`

			`# Run the container using the defined variables`
			`# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance`
			`docker run --rm \`
			`--name $NAME \`
			`--net=host \`
			`--shm-size=1g \`
			`--device /dev/davinci0 \`
			`--device /dev/davinci1 \`
			`--device /dev/davinci2 \`
			`--device /dev/davinci3 \`
			`--device /dev/davinci4 \`
			`--device /dev/davinci5 \`
			`--device /dev/davinci6 \`
			`--device /dev/davinci7 \`
			`--device /dev/davinci8 \`
			`--device /dev/davinci9 \`
			`--device /dev/davinci10 \`
			`--device /dev/davinci11 \`
			`--device /dev/davinci12 \`
			`--device /dev/davinci13 \`
			`--device /dev/davinci14 \`
			`--device /dev/davinci15 \`
			`--device /dev/davinci_manager \`
			`--device /dev/devmm_svm \`
			`--device /dev/hisi_hdc \`
			`-v /usr/local/dcmi:/usr/local/dcmi \`
			`-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \`
			`-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \`
			`-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \`
			`-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \`
			`-v /etc/ascend_install.info:/etc/ascend_install.info \`
			`-v /mnt/sfs_turbo/.cache:/home/cache \`
			`-it $IMAGE bash`
			```

			`## Verify the Quantized Model`
			Please be advised to edit the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` into `["MoE"]` in `config.json` of original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking).

			```json
			`{`
			`"quantization_config": {`
			`"config_groups": {`
			`"group_0": {`
			`"targets": [`
			`"MoE"`
			`]`
			`}`
			`}`
			`}`
			`}`
			```

			`Your model files look like:`

			```bash
			`.`
			`\|-- chat_template.jinja`
			`\|-- config.json`
			`\|-- configuration_deepseek.py`
			`\|-- configuration.json`
			`\|-- generation_config.json`
			`\|-- model-00001-of-000062.safetensors`
			`\|-- ...`
			`\|-- model-00062-of-000062.safetensors`
			`\|-- model.safetensors.index.json`
			`\|-- modeling_deepseek.py`
			`\|-- tiktoken.model`
			`\|-- tokenization_kimi.py`
			`-- tokenizer_config.json
			```

			`## Online Inference on Multi-NPU`

			`Run the following script to start the vLLM server on Multi-NPU:`

			`For an Atlas 800 A3 (64G*16) node, tensor-parallel-size should be at least 16.`

			```bash
			`vllm serve Kimi-K2-Thinking \`
			`--served-model-name kimi-k2-thinking \`
			`--tensor-parallel-size 16 \`
			`--enable_expert_parallel \`
			`--trust-remote-code \`
			`--no-enable-prefix-caching`
			```

			`Once your server is started, you can query the model with input prompts.`

			```bash
			`curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{`
			`"model": "kimi-k2-thinking",`
			`"messages": [`
			`{"role": "user", "content": "Who are you?"}`
			`],`
			`"temperature": 1.0`
			`}'`
			```