Kimi-K2-Thinking is a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features a hybrid thinking architecture that excels in complex reasoning and problem-solving tasks.
This document walks through the main steps to deploy and verify the model, covering supported features, environment preparation, single-node deployment, and functional verification.
## Supported Features
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) for the model's supported feature matrix.
Refer to the [feature guide](../../user_guide/feature_guide/index.md) for how to configure each feature.
## Environment Preparation
It is recommended to download the model weights to a shared directory, such as `/mnt/sfs_turbo/.cache/`.
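For example, assuming the `huggingface_hub` CLI is installed, the download could look like the following sketch (the target directory is illustrative):

```bash
# Sketch only: requires `pip install -U huggingface_hub`; adjust the target directory as needed.
huggingface-cli download moonshotai/Kimi-K2-Thinking \
    --local-dir /mnt/sfs_turbo/.cache/Kimi-K2-Thinking
```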
### Installation
You can use our official docker image to run `Kimi-K2-Thinking` directly.
Select an image based on your machine type and start it on your node; refer to [using docker](../../installation.md#set-up-using-docker) for details.
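As a rough sketch, starting a container could look like the command below. The image tag is a placeholder and the device and driver mounts may differ on your machine; the linked guide has the authoritative command.

```bash
# Sketch only: replace the image tag with the one matching your machine type,
# and map every NPU device you intend to use.
export IMAGE=quay.io/ascend/vllm-ascend:latest  # placeholder tag
docker run --rm -it --name vllm-ascend --net=host \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/sfs_turbo/.cache:/mnt/sfs_turbo/.cache \
    $IMAGE bash
```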
Note that you need to edit the `config.json` of the original model downloaded from [Hugging Face](https://huggingface.co/moonshotai/Kimi-K2-Thinking), changing the value of `"quantization_config.config_groups.group_0.targets"` from `["Linear"]` to `["MoE"]`. After the edit, the relevant part of `config.json` looks like:
```json
{
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": [
          "MoE"
        ]
      }
    }
  }
}
```
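If you prefer to apply the change from the command line, a small sketch like the following works, assuming `python3` is available in the container and `MODEL_DIR` matches where you downloaded the weights:

```bash
# Sketch only: adjust MODEL_DIR to your model directory.
MODEL_DIR=/mnt/sfs_turbo/.cache/Kimi-K2-Thinking
python3 - "$MODEL_DIR/config.json" <<'EOF'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)

# Switch the quantization targets from ["Linear"] to ["MoE"].
cfg["quantization_config"]["config_groups"]["group_0"]["targets"] = ["MoE"]

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```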
After downloading, your model directory should look like this:
```bash
.
|-- chat_template.jinja
|-- config.json
|-- configuration_deepseek.py
|-- configuration.json
|-- generation_config.json
|-- model-00001-of-000062.safetensors
|-- ...
|-- model-00062-of-000062.safetensors
|-- model.safetensors.index.json
|-- modeling_deepseek.py
|-- tiktoken.model
|-- tokenization_kimi.py
`-- tokenizer_config.json
```
## Online Inference on Multi-NPU
Run the following script to start the vLLM server on multiple NPUs. On an Atlas 800 A3 (64G*16) node, `--tensor-parallel-size` should be at least 16.
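The script below is a minimal launch sketch: the model path, port, served model name, and `--max-model-len` value are illustrative assumptions, and flags such as `--enable-expert-parallel` may vary with your vllm-ascend version.

```bash
#!/bin/bash
# Sketch only: adjust the model path, port, and parallel sizes for your setup.
vllm serve /mnt/sfs_turbo/.cache/Kimi-K2-Thinking \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name kimi-k2-thinking \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --max-model-len 32768 \
    --trust-remote-code
```

Once the server is up, a quick functional check is to send a request to the OpenAI-compatible endpoint:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "kimi-k2-thinking",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64
    }'
```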