What this PR does / why we need it? This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow. Does this PR introduce any user-facing change? No, this PR contains documentation-only updates. How was this patch tested? The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced. --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
4.4 KiB
Kimi-K2-Thinking
Introduction
Kimi-K2-Thinking is a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features a hybrid thinking architecture that excels in complex reasoning and problem-solving tasks.
This document will show the main verification steps of the model, including supported features, environment preparation, single-node deployment, and functional verification.
Supported Features
Refer to supported features to get the model's supported feature matrix.
Refer to feature guide to get the feature's configuration.
Environment Preparation
Model Weight
Kimi-K2-Thinking(bfloat16): require 1 Atlas 800 A3 (64G × 16) node. Download model weight.
It is recommended to download the model weight to the shared directory, such as /mnt/sfs_turbo/.cache/.
Installation
You can use our official docker image to run Kimi-K2-Thinking directly.
Select an image based on your machine type and start the docker image on your node, refer to using docker.
Run with Docker
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
Verify the Quantized Model
Please be advised to edit the value of "quantization_config.config_groups.group_0.targets" from ["Linear"] into ["MoE"] in config.json of original model downloaded from Hugging Face.
{
"quantization_config": {
"config_groups": {
"group_0": {
"targets": [
"MoE"
]
}
}
}
}
Your model files look like:
.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
|-- tokenizer_config.json
Online Inference on Multi-NPU
Run the following script to start the vLLM server on Multi-NPU:
For an Atlas 800 A3 (64G*16) node, tensor-parallel-size should be at least 16.
#!/bin/bash
export VLLM_USE_MODELSCOPE=True
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
vllm serve "moonshotai/Kimi-K2-Thinking" \
--tensor-parallel-size 16 \
--port 8000 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 12 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-expert-parallel \
--no-enable-prefix-caching
Once your server is started, you can query the model with input prompts.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "kimi-k2-thinking",
"messages": [
{"role": "user", "content": "Who are you?"}
],
"temperature": 1.0
}'