backport of #7474
This PR adds C8 (INT8) KV cache quantization support for standard GQA
attention models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel
quantization scales to store KV cache in INT8, reducing KV cache memory
by ~50% compared to BF16, enabling higher batch concurrency and longer
context lengths on the same hardware.
**Key changes:**
1. **`attention_v1.py`** — New `AscendC8AttentionBackendImpl` subclass
of `AscendAttentionBackendImpl`:
- `_prepare_c8_scales`: Shards per-channel scales/offsets to the current
TP rank and pre-computes BF16 BNSD-shaped antiquant tensors (one-time
per layer).
- `_quantize_kv_to_int8`: Quantizes BF16 K/V to INT8 before
`reshape_and_cache`, using pre-cached inverse scales.
- `_forward_c8_decode`: FIA V1 BNSD paged attention with native INT8 KV
and `perchannel` antiquant mode.
- `_forward_c8_chunked_prefill`: Splits decode (FIA V1 BNSD paged INT8)
and prefill (FIA V1 TND float) into two kernel calls.
- `_forward_c8_fused_infer_attention`: Handles `PrefillNoCache` and
`PrefillCacheHit` states.
2. **`quantization/methods/kv_c8.py`** — New
`AscendC8KVCacheAttentionMethod` scheme:
- Creates `k/v_cache_scale/offset` parameters via
`_c8_kv_scale_weight_loader`, which handles per-channel scale shapes and
lazy resizing.
- Sets `layer.kv_cache_torch_dtype = torch.int8` so
`get_kv_cache_spec()` returns INT8 dtype automatically.
- Upgrades `layer.impl` to `AscendC8AttentionBackendImpl` via class
surgery.
3. **`quantization/modelslim_config.py`** — C8 branch in
`get_quant_method()` activates when `kv_cache_type == "C8"` in
`quant_model_description.json`.
4. **`patch/worker/patch_qwen3_c8.py`** — Intercepts per-channel C8
scale/offset weights before `AutoWeightsLoader` discards them, routing
them to the parameters created by `AscendC8KVCacheAttentionMethod`.
5. **`tests/ut/quantization/test_kv_c8.py`** — Unit tests covering
`_c8_kv_scale_weight_loader`, `AscendC8KVCacheAttentionMethod`, and
`AscendC8AttentionBackendImpl` scale helpers.
Yes. Users can now serve Qwen3-32B W8A8C8 quantized models with INT8 KV
cache on Ascend NPU. The model checkpoint must contain a
`quant_model_description.json` with `"kv_cache_type": "C8"` and
per-channel scale/offset tensors in safetensors.
No changes to the serving CLI — the feature activates automatically when
the quantization config is detected.
Benchmarked with `vllm serve` (TP=8, `max_num_seqs=256`,
`max_model_len=131072`, `enable_chunked_prefill=true`) + `random_bench`
(input_len=10240, output_len=2048, 960 prompts, max_concurrency=192):
```
============ Serving Benchmark Result ============
Successful requests: 960
Failed requests: 0
Maximum request concurrency: 192
Benchmark duration (s): 1359.81
Total input tokens: 9830400
Total generated tokens: 1966080
Request throughput (req/s): 0.71
Output token throughput (tok/s): 1445.85
Peak output token throughput (tok/s): 2304.00
Total token throughput (tok/s): 8675.12
---------------Time to First Token----------------
Mean TTFT (ms): 24598.51
Median TTFT (ms): 23167.02
P50 TTFT (ms): 23167.02
P90 TTFT (ms): 47717.08
P99 TTFT (ms): 84402.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 120.76
Median TPOT (ms): 121.50
P50 TPOT (ms): 121.50
P90 TPOT (ms): 127.05
P99 TPOT (ms): 130.13
---------------Inter-token Latency----------------
Mean ITL (ms): 120.70
Median ITL (ms): 90.34
P50 ITL (ms): 90.34
P90 ITL (ms): 93.79
P99 ITL (ms): 101.80
==================================================
```
All attention states verified: `PrefillNoCache`, `PrefillCacheHit`,
`ChunkedPrefill`, `DecodeOnly`.
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: LICO67373 <110013619+LICO1314@users.noreply.github.com>
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736
**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute
'enabling_fa_quant'`
**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.
**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>
### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints:
1. Only the PD separation mode can be used and
MooncakeLayerwiseConnector can be used to run the model.
2. Currently, only the activation value supports dynamic quantization,
and the KV cache supports static quantization. C8 quantization with MTP
is not supported. You can use ModelSlim for quantization. The
quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path
<path/quant_weight>
--anti_dataset../common/deepseek_anti_prompt_50_v3_1.json
--calib_dataset../common/deepseek_calib_prompt_50_v3_1.json --rot
--trust_remote_code True --fa_quant --dynamic --anti_method m6
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Reapply the auto-detect quantization format feature (originally in
#6645, reverted in #6873) and extend it to support remote model
identifiers (e.g., `org/model-name`).
Changes:
- Reapply auto-detection of quantization method from model files
(`quant_model_description.json` for ModelSlim, `config.json` for
compressed-tensors)
- Add `get_model_file()` utility to handle file retrieval from both
local paths and remote repos (HuggingFace Hub / ModelScope)
- Update `detect_quantization_method()` to accept remote repo IDs with
optional `revision` parameter
- Update `maybe_update_config()` to work with remote model identifiers
- Add platform-level `auto_detect_quantization` support
- Add unit tests and e2e tests for both local and remote model ID
scenarios
Closes#6836
### Does this PR introduce _any_ user-facing change?
Yes. When `--quantization` is not explicitly specified, vllm-ascend will
now automatically detect the quantization format from the model files
for both local directories and remote model IDs.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
## Summary
- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.
## Details
**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior
**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.
## Files Changed
| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>