xc-llm-ascend

Files

Ruri ce5872705e [Feat] Support native Kimi-K2-Thinking native W4A16 quantized experts weights (#4516)

### What this PR does / why we need it?

Adds a W4A16 quantization method for the Kimi-K2-Thinking model and
updates the relevant modules to support it.

- Implements the complete W4A16 quantization method, including weight
packing/unpacking, per-group quantization parameter generation,
post-processing logic, and MoE method application.
- Adds the parameters `use_int4_w4a16`, `w1_offset`, and `w2_offset`, and
adjusts the `with_quant` conditional logic to support W4A16 matrix
multiplication.
- Adds `packed_modules_model_mapping` for the Kimi-K2-Thinking model and
processing logic for the `weight_packed` field.
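
For readers unfamiliar with the terms above, the following is a minimal, framework-free sketch of what per-group 4-bit weight quantization and nibble packing can look like. It is not the code from this PR: the function names, the min-based `offsets` convention, and the low-nibble-first packing order are all assumptions chosen for illustration, and the real implementation operates on device tensors rather than Python lists.

```python
from typing import List, Tuple

LEVELS = 15  # a 4-bit code takes integer values 0..15


def quantize_per_group(
    weights: List[float], group_size: int
) -> Tuple[List[int], List[float], List[float]]:
    """Quantize to 4-bit codes with one (scale, offset) pair per group.

    Hypothetical convention: w is approximated by code * scale + offset,
    where offset is the minimum value of the group.
    """
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    codes: List[int] = []
    scales: List[float] = []
    offsets: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / LEVELS
        if scale == 0.0:  # constant group: any non-zero scale works
            scale = 1.0
        scales.append(scale)
        offsets.append(lo)
        for w in group:
            q = round((w - lo) / scale)
            codes.append(max(0, min(LEVELS, q)))  # clamp to the 4-bit range
    return codes, scales, offsets


def pack_int4(codes: List[int]) -> bytes:
    """Pack two 4-bit codes into each byte, low nibble first (assumed order)."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of codes to pack")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: split every byte back into two 4-bit codes."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize(
    codes: List[int], scales: List[float], offsets: List[float], group_size: int
) -> List[float]:
    """Reconstruct approximate weights from codes and per-group parameters."""
    return [
        c * scales[i // group_size] + offsets[i // group_size]
        for i, c in enumerate(codes)
    ]
```

Under these assumptions, packing halves the storage per weight relative to int8, and each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`) of its group. In a W4A16 setup, activations stay in 16-bit floating point while only the weights are stored in this packed 4-bit form.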

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com>
Signed-off-by: Ruri <zhouxiang100@huawei.com>

2025-12-10 15:58:52 +08:00

test_quant_config.py

[bugfix] fix quant method validation bug (#4831)

2025-12-09 23:42:01 +08:00

test_utils.py

[1/N][Refactor][Quantization] remove redundant quantizer class (#2680)

2025-09-04 11:35:14 +08:00

test_w4a4_flatquant_dynamic.py

[Refactor] Clean up w4a4_flatquant_dynamic implementation (#3440)

2025-10-17 23:53:19 +08:00

test_w4a8_dynamic.py

[Feat][BugFix]Support the Qwen3-Next-80B-A3B-Instruct quantization model&Fix the NZ issue (#4245)