Files
xc-llm-ascend/tests/ut/quantization/test_modelslim_config.py
Mengqing Cao 044d4c3974 [v0.18.0]feat(quant): add C8 INT8 KV cache support for GQA attention models (#7474) (#8007)
backport of #7474

This PR adds C8 (INT8) KV cache quantization support for standard GQA
attention models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel
quantization scales to store KV cache in INT8, reducing KV cache memory
by ~50% compared to BF16, enabling higher batch concurrency and longer
context lengths on the same hardware.

**Key changes:**

1. **`attention_v1.py`** — New `AscendC8AttentionBackendImpl` subclass
of `AscendAttentionBackendImpl`:
- `_prepare_c8_scales`: Shards per-channel scales/offsets to the current
TP rank and pre-computes BF16 BNSD-shaped antiquant tensors (one-time
per layer).
- `_quantize_kv_to_int8`: Quantizes BF16 K/V to INT8 before
`reshape_and_cache`, using pre-cached inverse scales.
- `_forward_c8_decode`: FIA V1 BNSD paged attention with native INT8 KV
and `perchannel` antiquant mode.
- `_forward_c8_chunked_prefill`: Splits decode (FIA V1 BNSD paged INT8)
and prefill (FIA V1 TND float) into two kernel calls.
- `_forward_c8_fused_infer_attention`: Handles `PrefillNoCache` and
`PrefillCacheHit` states.

2. **`quantization/methods/kv_c8.py`** — New
`AscendC8KVCacheAttentionMethod` scheme:
- Creates `k/v_cache_scale/offset` parameters via
`_c8_kv_scale_weight_loader`, which handles per-channel scale shapes and
lazy resizing.
- Sets `layer.kv_cache_torch_dtype = torch.int8` so
`get_kv_cache_spec()` returns INT8 dtype automatically.
- Upgrades `layer.impl` to `AscendC8AttentionBackendImpl` via class
surgery.

3. **`quantization/modelslim_config.py`** — C8 branch in
`get_quant_method()` activates when `kv_cache_type == "C8"` in
`quant_model_description.json`.

4. **`patch/worker/patch_qwen3_c8.py`** — Intercepts per-channel C8
scale/offset weights before `AutoWeightsLoader` discards them, routing
them to the parameters created by `AscendC8KVCacheAttentionMethod`.

5. **`tests/ut/quantization/test_kv_c8.py`** — Unit tests covering
`_c8_kv_scale_weight_loader`, `AscendC8KVCacheAttentionMethod`, and
`AscendC8AttentionBackendImpl` scale helpers.

Yes. Users can now serve Qwen3-32B W8A8C8 quantized models with INT8 KV
cache on Ascend NPU. The model checkpoint must contain a
`quant_model_description.json` with `"kv_cache_type": "C8"` and
per-channel scale/offset tensors in safetensors.

No changes to the serving CLI — the feature activates automatically when
the quantization config is detected.

Benchmarked with `vllm serve` (TP=8, `max_num_seqs=256`,
`max_model_len=131072`, `enable_chunked_prefill=true`) + `random_bench`
(input_len=10240, output_len=2048, 960 prompts, max_concurrency=192):

```
============ Serving Benchmark Result ============
Successful requests:                     960
Failed requests:                         0
Maximum request concurrency:             192
Benchmark duration (s):                  1359.81
Total input tokens:                      9830400
Total generated tokens:                  1966080
Request throughput (req/s):              0.71
Output token throughput (tok/s):         1445.85
Peak output token throughput (tok/s):    2304.00
Total token throughput (tok/s):          8675.12
---------------Time to First Token----------------
Mean TTFT (ms):                          24598.51
Median TTFT (ms):                        23167.02
P50 TTFT (ms):                           23167.02
P90 TTFT (ms):                           47717.08
P99 TTFT (ms):                           84402.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          120.76
Median TPOT (ms):                        121.50
P50 TPOT (ms):                           121.50
P90 TPOT (ms):                           127.05
P99 TPOT (ms):                           130.13
---------------Inter-token Latency----------------
Mean ITL (ms):                           120.70
Median ITL (ms):                         90.34
P50 ITL (ms):                            90.34
P90 ITL (ms):                            93.79
P99 ITL (ms):                            101.80
==================================================
```

All attention states verified: `PrefillNoCache`, `PrefillCacheHit`,
`ChunkedPrefill`, `DecodeOnly`.

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: LICO67373 <110013619+LICO1314@users.noreply.github.com>
2026-04-08 10:51:58 +08:00

270 lines
12 KiB
Python

import json
import os
import tempfile
from unittest.mock import MagicMock, patch
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.fused_moe.config import FusedMoEConfig
from vllm.model_executor.layers.linear import LinearBase
from tests.ut.base import TestBase
from vllm_ascend.ops.linear import AscendUnquantizedLinearMethod
from vllm_ascend.quantization.modelslim_config import (
MODELSLIM_CONFIG_FILENAME,
AscendModelSlimConfig,
)
from vllm_ascend.utils import ASCEND_QUANTIZATION_METHOD
from vllm.model_executor.layers.attention import Attention
from vllm.model_executor.layers.attention_layer_base import AttentionLayerBase
class TestAscendModelSlimConfig(TestBase):
def setUp(self):
self.sample_config = {
"weight": "INT8",
"fa_quant_type": "C8",
"layers.1.fa_k.scale": "C8",
"layer1.weight": "INT8",
"layer2.weight": "FLOAT",
"fused_layer.weight": "FLOAT",
"fused_layer.shard1.weight": "FLOAT",
"fused_layer.shard2.weight": "FLOAT",
"shard1.weight": "FLOAT",
"shard2.weight": "FLOAT",
}
self.ascend_config = AscendModelSlimConfig(self.sample_config)
self.ascend_config.packed_modules_mapping = None
def test_init(self):
self.assertEqual(self.ascend_config.quant_description,
self.sample_config)
def test_repr(self):
repr_str = repr(self.ascend_config)
self.assertTrue(repr_str.startswith("AscendModelSlimConfig:\n"))
def test_get_name(self):
self.assertEqual(AscendModelSlimConfig.get_name(),
ASCEND_QUANTIZATION_METHOD)
def test_get_supported_act_dtypes(self):
supported_dtypes = AscendModelSlimConfig.get_supported_act_dtypes()
self.assertEqual(len(supported_dtypes), 3)
def test_get_min_capability(self):
with self.assertRaises(NotImplementedError):
AscendModelSlimConfig.get_min_capability()
def test_get_config_filenames(self):
filenames = AscendModelSlimConfig.get_config_filenames()
self.assertEqual(filenames, [])
def test_from_config(self):
config = AscendModelSlimConfig.from_config(self.sample_config)
self.assertIsInstance(config, AscendModelSlimConfig)
self.assertEqual(config.quant_description, self.sample_config)
@patch('torch.npu.is_available')
def test_override_quantization_method(self, mock_is_available):
# Test when NPU is available
mock_is_available.return_value = True
result = AscendModelSlimConfig.override_quantization_method(None, None)
self.assertIsNone(result)
hf_quant_cfg = {"quant_method": ""}
result = AscendModelSlimConfig.override_quantization_method(
hf_quant_cfg, None)
self.assertEqual(result, "ascend")
# Test when NPU is not available
mock_is_available.return_value = False
result = AscendModelSlimConfig.override_quantization_method(None, None)
self.assertIsNone(result)
hf_quant_cfg = {"quant_method": ""}
result = AscendModelSlimConfig.override_quantization_method(
hf_quant_cfg, None)
self.assertIsNone(result)
def test_get_quant_method_for_linear(self):
mock_config = MagicMock()
mock_config.model_config.hf_config.model_type = None
linear_layer = MagicMock(spec=LinearBase)
# Test skipped layer
with patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
patch.object(self.ascend_config, \
'is_layer_skipped_ascend',
return_value=True):
method = self.ascend_config.get_quant_method(linear_layer, ".attn")
self.assertIsInstance(method, AscendUnquantizedLinearMethod)
# Test quantized layer
mock_scheme = MagicMock()
with patch.object(self.ascend_config, 'is_layer_skipped_ascend', return_value=False), \
patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
patch("vllm_ascend.quantization.modelslim_config.create_scheme_for_layer", return_value=mock_scheme), \
patch('vllm_ascend.quantization.method_adapters.AscendLinearMethod', return_value=MagicMock()) as mock_ascend_linear:
method = self.ascend_config.get_quant_method(linear_layer, ".attn")
self.assertIs(method, mock_ascend_linear.return_value)
mock_ascend_linear.assert_called_once_with(mock_scheme)
def test_get_quant_method_for_attention(self):
attention_layer = MagicMock(spec=Attention)
mock_config = MagicMock()
mock_config.model_config.hf_config.model_type = None
mock_scheme = MagicMock()
with patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
patch("vllm_ascend.quantization.modelslim_config.create_scheme_for_layer", return_value=mock_scheme), \
patch('vllm_ascend.quantization.method_adapters.AscendKVCacheMethod', \
return_value=MagicMock()) as mock_ascend_kvcache:
# Test with fa_quant_type
method = self.ascend_config.get_quant_method(
attention_layer, ".attn")
self.assertIs(method, None)
method = self.ascend_config.get_quant_method(
attention_layer, "layers.1.attn")
self.assertIs(method, mock_ascend_kvcache.return_value)
def test_get_quant_method_for_c8_kv_cache_attention(self):
c8_config = AscendModelSlimConfig({"kv_cache_type": "C8"})
attention_layer = MagicMock(spec=AttentionLayerBase)
mock_vllm_config = MagicMock()
mock_vllm_config.model_config.hf_config.model_type = None
with patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_vllm_config), \
patch("vllm_ascend.quantization.method_adapters.AscendKVCacheMethod", return_value=MagicMock()) as mock_kvcache:
method = c8_config.get_quant_method(attention_layer, "model.layers.0.self_attn.attn")
self.assertIs(method, mock_kvcache.return_value)
args, _ = mock_kvcache.call_args
from vllm_ascend.quantization.methods.kv_c8 import AscendC8KVCacheAttentionMethod
self.assertIsInstance(args[0], AscendC8KVCacheAttentionMethod)
def test_get_quant_method_for_fused_moe(self):
fused_moe_layer = MagicMock(spec=FusedMoE)
fused_moe_layer.moe = MagicMock(spec=FusedMoEConfig)
fused_moe_layer.moe_config = MagicMock(spec=FusedMoEConfig)
mock_config = MagicMock()
mock_config.model_config.hf_config.model_type = None
# Test skipped layer
with patch.object(self.ascend_config, 'is_layer_skipped_ascend', return_value=True), \
patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
patch('vllm_ascend.ops.fused_moe.fused_moe.AscendUnquantizedFusedMoEMethod', return_value=MagicMock()) as mock_ascend_moe:
method = self.ascend_config.get_quant_method(
fused_moe_layer, "moe_layer")
self.assertIs(method, mock_ascend_moe.return_value)
# Test quantized layer
mock_scheme = MagicMock()
with patch.object(self.ascend_config, 'is_layer_skipped_ascend', return_value=False), \
patch("vllm_ascend.quantization.modelslim_config.get_current_vllm_config", return_value=mock_config), \
patch("vllm_ascend.quantization.modelslim_config.create_scheme_for_layer", return_value=mock_scheme), \
patch('vllm_ascend.quantization.method_adapters.AscendFusedMoEMethod', return_value=MagicMock()) as mock_ascend_moe:
method = self.ascend_config.get_quant_method(
fused_moe_layer, "moe_layer")
self.assertIs(method, mock_ascend_moe.return_value)
def test_is_layer_skipped_ascend(self):
# Test non-fused layer that should be quantized
self.assertFalse(self.ascend_config.is_layer_skipped_ascend("layer1"))
# Test non-fused layer that should be skipped
self.assertTrue(self.ascend_config.is_layer_skipped_ascend("layer2"))
# Test fused layer
fused_mapping = {"fused_layer": ["shard1", "shard2"]}
self.assertTrue(
self.ascend_config.is_layer_skipped_ascend("fused_layer",
fused_mapping))
# Test inconsistent fused layer shards
bad_config = {"shard1.weight": "FLOAT", "shard2.weight": "INT8"}
config = AscendModelSlimConfig(bad_config)
with self.assertRaises(ValueError):
config.is_layer_skipped_ascend("fused_layer", fused_mapping)
def test_init_with_none_config(self):
config = AscendModelSlimConfig(None)
self.assertEqual(config.quant_description, {})
def test_init_with_default_config(self):
config = AscendModelSlimConfig()
self.assertEqual(config.quant_description, {})
def test_maybe_update_config_already_populated(self):
# When quant_description is already populated, should be a no-op
self.assertTrue(len(self.ascend_config.quant_description) > 0)
self.ascend_config.maybe_update_config("/some/model/path")
# quant_description should remain unchanged
self.assertEqual(self.ascend_config.quant_description,
self.sample_config)
def test_maybe_update_config_loads_from_file(self):
config = AscendModelSlimConfig()
self.assertEqual(config.quant_description, {})
quant_data = {"layer1.weight": "INT8", "layer2.weight": "FLOAT"}
with tempfile.TemporaryDirectory() as tmpdir:
config_path = os.path.join(tmpdir, MODELSLIM_CONFIG_FILENAME)
with open(config_path, "w") as f:
json.dump(quant_data, f)
config.maybe_update_config(tmpdir)
self.assertEqual(config.quant_description, quant_data)
def test_maybe_update_config_raises_when_file_missing(self):
config = AscendModelSlimConfig()
with tempfile.TemporaryDirectory() as tmpdir:
with self.assertRaises(ValueError) as ctx:
config.maybe_update_config(tmpdir)
error_msg = str(ctx.exception)
self.assertIn("ModelSlim Quantization Config Not Found", error_msg)
self.assertIn(MODELSLIM_CONFIG_FILENAME, error_msg)
def test_maybe_update_config_raises_with_json_files_listed(self):
config = AscendModelSlimConfig()
with tempfile.TemporaryDirectory() as tmpdir:
# Create a dummy json file that is NOT the config file
dummy_path = os.path.join(tmpdir, "config.json")
with open(dummy_path, "w") as f:
json.dump({"dummy": True}, f)
with self.assertRaises(ValueError) as ctx:
config.maybe_update_config(tmpdir)
error_msg = str(ctx.exception)
self.assertIn("config.json", error_msg)
def test_maybe_update_config_non_directory_raises(self):
config = AscendModelSlimConfig()
with self.assertRaises(ValueError) as ctx:
config.maybe_update_config("not_a_real_directory_path")
error_msg = str(ctx.exception)
self.assertIn("ModelSlim Quantization Config Not Found", error_msg)
def test_apply_extra_quant_adaptations_shared_head(self):
config = AscendModelSlimConfig()
config.quant_description = {
"model.layers.0.shared_head.weight": "INT8",
}
config._apply_extra_quant_adaptations()
self.assertIn("model.layers.0.weight", config.quant_description)
self.assertEqual(config.quant_description["model.layers.0.weight"],
"INT8")
def test_apply_extra_quant_adaptations_weight_packed(self):
config = AscendModelSlimConfig()
config.quant_description = {
"model.layers.0.weight_packed": "INT8",
}
config._apply_extra_quant_adaptations()
self.assertIn("model.layers.0.weight", config.quant_description)
self.assertEqual(config.quant_description["model.layers.0.weight"],
"INT8")