What this PR does / why we need it?

This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow.

Does this PR introduce any user-facing change?

No, this PR contains documentation-only updates.

How was this patch tested?

The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
# Kimi-K2-Thinking

## Introduction

Kimi-K2-Thinking is a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features a hybrid thinking architecture that excels at complex reasoning and problem-solving tasks.

This document walks through the main verification steps for the model, including supported features, environment preparation, single-node deployment, and functional verification.

## Supported Features

Refer to [supported features](../../user_guide/support_matrix/supported_models.md) for the model's supported feature matrix.

Refer to the [feature guide](../../user_guide/feature_guide/index.md) for each feature's configuration.

## Environment Preparation

### Model Weight

- `Kimi-K2-Thinking` (bfloat16): requires one Atlas 800 A3 (64G × 16) node. [Download model weight](https://huggingface.co/moonshotai/Kimi-K2-Thinking).

It is recommended to download the model weight to a shared directory, such as `/mnt/sfs_turbo/.cache/`.
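For example, you could fetch the weights with `huggingface-cli` (a minimal sketch; the target path below is an assumption, adjust it to your own shared directory):

```bash
pip install -U "huggingface_hub[cli]"

# Download the weights into the shared cache directory (example path)
huggingface-cli download moonshotai/Kimi-K2-Thinking \
  --local-dir /mnt/sfs_turbo/.cache/Kimi-K2-Thinking
```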
### Installation

You can use our official Docker image to run `Kimi-K2-Thinking` directly.

Select an image based on your machine type and start it on your node; refer to [using docker](../../installation.md#set-up-using-docker).
## Run with Docker

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you are running a bridge network with docker, please expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
```
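After the container starts, you can optionally confirm that all 16 NPUs are visible inside it (a quick sanity check; `npu-smi` is mounted into the container by the command above):

```bash
# Inside the container: list the NPUs and their health status
npu-smi info
```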
## Verify the Quantized Model

Please edit the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` to `["MoE"]` in the `config.json` of the original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking).

```json
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": [
          "MoE"
        ]
      }
    }
  }
}
```
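If you prefer to apply this change from the command line, here is a minimal sketch using `jq` (this assumes `jq` is available and that the weights are mounted at `/home/cache/Kimi-K2-Thinking`, matching the volume mapping above; adjust the path to your setup):

```bash
# Rewrite the quantization targets in config.json (write to a temp file first, then replace)
CFG=/home/cache/Kimi-K2-Thinking/config.json   # example path
jq '.quantization_config.config_groups.group_0.targets = ["MoE"]' "$CFG" > "${CFG}.tmp" \
  && mv "${CFG}.tmp" "$CFG"
```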
Your model files should look like this:

```bash
.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
|-- tokenizer_config.json
```
## Online Inference on Multi-NPU

Run the following script to start the vLLM server on multiple NPUs.

For an Atlas 800 A3 (64G × 16) node, `--tensor-parallel-size` should be at least 16.

```bash
#!/bin/bash
export VLLM_USE_MODELSCOPE=True
export HCCL_BUFFSIZE=1024
export TASK_QUEUE_ENABLE=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE=AIV
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

vllm serve "moonshotai/Kimi-K2-Thinking" \
--tensor-parallel-size 16 \
--port 8000 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 12 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-expert-parallel \
--no-enable-prefix-caching
```
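Before sending requests, you can optionally confirm that the server has finished loading by listing the served models through the OpenAI-compatible API:

```bash
# Returns the model list once the server is ready to accept requests
curl http://localhost:8000/v1/models
```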
Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
        {"role": "user", "content": "Who are you?"}
    ],
    "temperature": 1.0
}'
```