xc-llm-ascend/docs/source/tutorials/index.md
Ruri ce5872705e [Feat] Support Kimi-K2-Thinking native W4A16 quantized expert weights (#4516)
### What this PR does / why we need it?

Adds a W4A16 quantization method for the Kimi-K2-Thinking model and
updates the relevant modules to support it.

- Implements the complete W4A16 quantization method, including weight
packing/unpacking, per-group quantization parameter generation,
post-processing logic, and MoE method application (a packing sketch
follows this list).
- Adds the `use_int4_w4a16`, `w1_offset`, and `w2_offset` parameters and
adjusts the `with_quant` conditional logic to support W4A16 matrix
multiplication (see the dispatch sketch after this list).
- Adds a `packed_modules_model_mapping` entry for the Kimi-K2-Thinking
model and processing logic for the `weight_packed` field.
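
As a rough illustration of what the packing and per-group parameter
generation involve, here is a minimal sketch of per-group asymmetric INT4
quantization with two 4-bit values packed per byte. The helper names
(`quantize_w4a16`, `dequantize_w4a16`) and the group size are hypothetical,
not the PR's actual implementation:

```python
import torch

def quantize_w4a16(weight: torch.Tensor, group_size: int = 128):
    # Hypothetical sketch, not the PR's code: per-group asymmetric INT4
    # quantization of a 2-D weight of shape (out_features, in_features).
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    w_min, w_max = w.amin(-1, keepdim=True), w.amax(-1, keepdim=True)
    scale = ((w_max - w_min) / 15.0).clamp_min(1e-8)  # INT4 range 0..15
    offset = torch.round(-w_min / scale)              # per-group zero point
    q = torch.clamp(torch.round(w / scale) + offset, 0, 15).to(torch.uint8)
    q = q.reshape(out_features, in_features)
    packed = q[:, ::2] | (q[:, 1::2] << 4)            # two nibbles per byte
    return packed, scale.squeeze(-1), offset.squeeze(-1)

def dequantize_w4a16(packed, scale, offset, group_size: int = 128):
    # Unpack the nibbles and invert the per-group affine quantization.
    lo, hi = packed & 0x0F, (packed >> 4) & 0x0F
    q = torch.stack((lo, hi), dim=-1).reshape(packed.shape[0], -1).float()
    out_features, in_features = q.shape
    q = q.reshape(out_features, in_features // group_size, group_size)
    w = (q - offset.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w.reshape(out_features, in_features)
```

A round trip (`dequantize_w4a16(*quantize_w4a16(w))`) should reproduce `w`
to within one quantization step per group, which is a quick sanity check
for any packing/unpacking pair.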
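
And a minimal sketch of how a `use_int4_w4a16` flag might gate the expert
matmul path, reusing the hypothetical `dequantize_w4a16` helper above. The
parameter names echo the PR's `use_int4_w4a16`/`w1_offset`/`w2_offset`, but
the body is illustrative only, not the actual fused kernel:

```python
def fused_experts(x, w1, w2, *, use_int4_w4a16=False,
                  w1_scale=None, w1_offset=None,
                  w2_scale=None, w2_offset=None, group_size=128):
    # Illustrative only: a real implementation would dispatch to a fused
    # kernel rather than dequantizing and falling back to dense matmuls.
    if use_int4_w4a16:
        # W4A16: weights stored as packed INT4, activations kept in 16-bit.
        w1 = dequantize_w4a16(w1, w1_scale, w1_offset, group_size).to(x.dtype)
        w2 = dequantize_w4a16(w2, w2_scale, w2_offset, group_size).to(x.dtype)
    h = torch.nn.functional.silu(x @ w1.T)  # simplified expert MLP
    return h @ w2.T
```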

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>
2025-12-10 15:58:52 +08:00

# Tutorials
:::{toctree}
:caption: Deployment
:maxdepth: 1
single_npu
single_npu_qwen2.5_vl
single_npu_qwen2_audio
single_npu_qwen3_embedding
single_npu_qwen3_quantization
single_npu_qwen3_w4a4
single_node_pd_disaggregation_mooncake
multi_npu_qwen3_next
multi_npu
multi_npu_kimi-k2-thinking
multi_npu_moge
multi_npu_qwen3_moe
multi_npu_quantization
single_node_300i
DeepSeek-V3.1.md
DeepSeek-V3.2-Exp.md
Qwen3-235B-A22B.md
Qwen3-Coder-30B-A3B
multi_node
multi_node_kimi
multi_node_qwen3vl
multi_node_pd_disaggregation_mooncake
multi_node_ray
Qwen2.5-Omni.md
:::