### What this PR does / why we need it?
Main updates include:
- update model IDs and default model paths in serving / offline
inference examples
- adjust some command snippets and notes for better copy-paste usability
- replace the `SamplingParams` argument `max_completion_tokens` with `max_tokens`
(**offline** inference currently **does not support** `max_completion_tokens`;
the corrected usage is sketched below the traceback)
``` bash
Traceback (most recent call last):
File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module>
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Unexpected keyword argument 'max_completion_tokens'
[ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
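For reference, a minimal sketch of the corrected offline call (the model path and prompt are placeholders, not part of the original example):
``` python
from vllm import LLM, SamplingParams

# SamplingParams takes max_tokens; max_completion_tokens is not a valid argument here.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)

llm = LLM(model="/path/to/your/model")  # placeholder path
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```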
- refresh the recommended environment variables for **Qwen3-Omni-30B-A3B-Thinking**
``` bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
Without these settings, the MoE dispatch op fails with:
``` bash
EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048,
epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine
= 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) +
(maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB.
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984]
```
- fix the **Qwen3-reranker** example to match the current **pooling
runner** interface and score output access (a sketch of reading the score follows the corrected snippet)
``` python
model = LLM(
model=model_name,
task="score", # need fix
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
--->
``` python
model = LLM(
model=model_name,
runner="pooling",
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
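For reference, a minimal sketch of reading the score output under the pooling runner; the exact attribute layout may differ between vLLM versions, so treat it as illustrative:
``` python
# Score a query against a candidate document; LLM.score returns one result per pair.
outputs = model.score(
    "What is the capital of China?",
    "The capital of China is Beijing.",
)

# Each result carries the relevance score produced by the pooling runner.
print(outputs[0].outputs.score)
```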
- modify the **PaddleOCR-VL** environment variable `TASK_QUEUE_ENABLE` from `2` to `1`
``` bash
(EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error
is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0.
```
These changes are needed because several documentation examples had
drifted from the current runtime behavior and recommended invocation
patterns, which could confuse users when following the tutorials
directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main: 4497431df6
Signed-off-by: MrZ20 <2609716663@qq.com>
# Kimi-K2-Thinking

## Introduction

Kimi-K2-Thinking is a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features a hybrid thinking architecture that excels in complex reasoning and problem-solving tasks.

This document describes the main verification steps for the model, including supported features, environment preparation, single-node deployment, and functional verification.

## Supported Features

Refer to [supported features](../../user_guide/support_matrix/supported_models.md) for the model's supported feature matrix.

Refer to [feature guide](../../user_guide/feature_guide/index.md) for each feature's configuration.

## Environment Preparation

### Model Weight

- `Kimi-K2-Thinking` (bfloat16): requires 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://huggingface.co/moonshotai/Kimi-K2-Thinking).

It is recommended to download the model weight to a shared directory, such as `/mnt/sfs_turbo/.cache/`.
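
If it helps, here is a minimal download sketch using `huggingface_hub` (the package and the exact target path are assumptions; any method that places the weights under the shared directory works):

```python
from huggingface_hub import snapshot_download

# Download all weight shards into the shared cache directory (path is illustrative).
snapshot_download(
    repo_id="moonshotai/Kimi-K2-Thinking",
    local_dir="/mnt/sfs_turbo/.cache/Kimi-K2-Thinking",
)
```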

### Installation

You can use our official docker image to run `Kimi-K2-Thinking` directly.

Select an image based on your machine type and start the docker image on your node, refer to [using docker](../../installation.md#set-up-using-docker).

## Run with Docker

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
```

## Verify the Quantized Model

Edit the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` to `["MoE"]` in the `config.json` of the original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking):

```json
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": [
          "MoE"
        ]
      }
    }
  }
}
```
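
If you prefer to apply the edit programmatically, here is a small sketch (the `config.json` path below is illustrative, not from the original document):

```python
import json

config_path = "/mnt/sfs_turbo/.cache/Kimi-K2-Thinking/config.json"  # illustrative path

with open(config_path) as f:
    config = json.load(f)

# Switch the quantization target from ["Linear"] to ["MoE"] as described above.
config["quantization_config"]["config_groups"]["group_0"]["targets"] = ["MoE"]

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```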

Your model files should look like this:

```bash
.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
```

## Online Inference on Multi-NPU

Run the following script to start the vLLM server on multiple NPUs.

For an Atlas 800 A3 (64G × 16) node, `--tensor-parallel-size` should be at least 16.

```bash
#!/bin/bash
export VLLM_USE_MODELSCOPE=True
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

vllm serve "moonshotai/Kimi-K2-Thinking" \
--tensor-parallel-size 16 \
--port 8000 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 12 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-expert-parallel \
--no-enable-prefix-caching
```
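
Loading a checkpoint of this size can take a while. If useful, here is a small optional sketch that polls the OpenAI-compatible endpoint until the server is ready (assumes the `requests` package is installed):

```python
import time

import requests

# Poll the /v1/models route of the OpenAI-compatible server until it responds.
url = "http://localhost:8000/v1/models"
while True:
    try:
        if requests.get(url, timeout=5).status_code == 200:
            print("Server is ready.")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(10)
```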

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "moonshotai/Kimi-K2-Thinking",
  "messages": [
    {"role": "user", "content": "Who are you?"}
  ],
  "temperature": 1.0
}'
```
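
Alternatively, a minimal Python sketch using the OpenAI-compatible client (assumes the `openai` package is installed in your environment):

```python
from openai import OpenAI

# The vLLM server exposes an OpenAI-compatible API; any non-empty API key works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=1.0,
)
print(response.choices[0].message.content)
```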