xc-llm-ascend

TmacAaron 5018f2d8fd [quantization] Add w8a16 quantization support (#4541)

### What this PR does / why we need it?
Related to https://github.com/vllm-project/vllm-ascend/issues/4267

### Does this PR introduce _any_ user-facing change?
Yes. W8A16 quantization is now supported.
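
For readers unfamiliar with the term, W8A16 generally means weights stored as 8-bit integers while activations stay in a 16-bit floating-point type. The sketch below illustrates that idea with symmetric per-row quantization in plain Python. It is not the code added by this PR: the function names `quantize_per_channel` and `w8a16_matvec` are hypothetical, and Python floats stand in for 16-bit activations.

```python
def quantize_per_channel(weight):
    """Symmetric int8 quantization with one scale per output row (assumed scheme)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Guard against an all-zero row, where any scale works.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a16_matvec(q_rows, scales, activation):
    """Multiply int8 weight rows by a float activation, rescaling each row's sum."""
    return [
        scale * sum(q * a for q, a in zip(row, activation))
        for row, scale in zip(q_rows, scales)
    ]


# Hypothetical toy data, not taken from the PR.
weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
activation = [1.0, 2.0, -1.0]

q_rows, scales = quantize_per_channel(weight)
approx = w8a16_matvec(q_rows, scales, activation)
exact = [sum(w * a for w, a in zip(row, activation)) for row in weight]
print("quantized result:", approx)
print("full-precision result:", exact)
```

Only the weights lose precision here, so the quantized result should stay close to the full-precision one.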

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

### Test
Tested using [aisbench](https://gitee.com/aisbench/benchmark/) with tensor parallel size 2 (tp2).
#### Precision
  | ceval | mmlu | gsm8k
-- | -- | -- | --
bf16 | 90.46 | 89.17 | 96.21
w8a16 | 89.51 | 89.29 | 95.98

#### Performance
  | input_len | output_len | concurrency | TTFT (ms) | TPOT (ms) | TPS (Total) (tokens/s)
-- | -- | -- | -- | -- | -- | --
bf16 | 2048 | 2048 | 10 | 1911.7136 | 77.988 | 253.9866
w8a16 | 2048 | 2048 | 10 | 2128.6334 | 67.1633 | 293.9117
bf16 | 3500 | 1024 | 10 | 3076.2509 | 84.3525 | 506.949
w8a16 | 3500 | 1024 | 10 | 2685.2031 | 73.015 | 585.4717
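
To make the comparison easier to read, the relative change of w8a16 against bf16 can be computed from the rows above. This is a small sketch using only the reported numbers. The names `runs`, `METRICS`, and `percent_change` are illustrative and not part of the PR.

```python
# Values copied from the performance table: (TTFT ms, TPOT ms, total TPS).
runs = {
    (2048, 2048): {
        "bf16": (1911.7136, 77.988, 253.9866),
        "w8a16": (2128.6334, 67.1633, 293.9117),
    },
    (3500, 1024): {
        "bf16": (3076.2509, 84.3525, 506.949),
        "w8a16": (2685.2031, 73.015, 585.4717),
    },
}
METRICS = ("TTFT (ms)", "TPOT (ms)", "TPS (tokens/s)")


def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


changes = {}
for shape, results in runs.items():
    changes[shape] = tuple(
        percent_change(b, q) for b, q in zip(results["bf16"], results["w8a16"])
    )
    for name, pct in zip(METRICS, changes[shape]):
        print(f"input={shape[0]} output={shape[1]} {name}: {pct:+.1f}%")
```

Lower is better for TTFT and TPOT, and higher is better for TPS. In these figures w8a16 improves TPOT and TPS in both runs, while TTFT is higher in the 2048/2048 run and lower in the 3500/1024 run.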

---------

Signed-off-by: yyt <yangyit139@gmail.com>
Signed-off-by: TmacAaron <yangyit139@gmail.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>

2025-12-24 19:49:32 +08:00

compressed_tensors

[Feat] Support native Kimi-K2-Thinking native W4A16 quantized experts weights (#4516)

2025-12-10 15:58:52 +08:00

__init__.py

[Core] Cherry pick from 0.7.1 to keep the main code newest (#127)

2025-02-21 17:07:37 +08:00

quant_config.py

[2/N][Pangu][MoE] Remove Pangu Related Code (#5130)

2025-12-19 09:00:07 +08:00

utils.py

[quantization] Add w8a16 quantization support (#4541)