[Feat.]: support 310p w8a8 (#6454)

### What this PR does / why we need it?
- **310P W8A8 quantization support**: new modules and methods enable W8A8 static quantization specifically on the Ascend 310P platform.
- **Platform-specific quantization configuration loading**: the appropriate quantization configurations (`AscendCompressedTensorsConfig`, `AscendModelSlimConfig`) are now loaded dynamically based on whether the current hardware is an Ascend 310P device.
- **`AscendW8A8LinearMethod310P`**: a dedicated linear quantization method for 310P that handles the specifics of weight and activation quantization, including input-parameter broadcasting and weight-data manipulation (see the sketch after this list).
- **Extended `AscendModelSlimConfig` for 310P**: a specialized configuration class for 310P wires the new W8A8 linear method into both standard linear layers and vocabulary-parallel embeddings, ensuring quantization is applied consistently.
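For orientation only, here is a minimal sketch of the general shape of a static W8A8 linear method such as `AscendW8A8LinearMethod310P`: quantize activations with a precomputed (static) scale, run an int8 matmul accumulated in int32, then dequantize with the folded activation and weight scales. The class name comes from this PR, but the body, attribute names, symmetric per-tensor activation quantization, and the use of plain `torch` ops (standing in for Ascend NPU quantized-matmul kernels, e.g. via `torch_npu`) are all illustrative assumptions, not the PR's actual implementation.

```python
# Hedged sketch of a static W8A8 linear method (NOT this PR's actual code).
# Assumptions: symmetric per-tensor static activation scale, per-output-channel
# weight scales, plain torch ops standing in for Ascend NPU quantized kernels.
import torch


class W8A8LinearMethodSketch:
    """Illustrative stand-in for AscendW8A8LinearMethod310P."""

    def __init__(self, weight_q: torch.Tensor, weight_scale: torch.Tensor,
                 input_scale: float) -> None:
        self.weight_q = weight_q          # int8, [out_features, in_features], quantized offline
        self.weight_scale = weight_scale  # fp32, [out_features], per-channel weight scale
        self.input_scale = input_scale    # fp32 scalar, calibrated offline (static)

    def apply(self, x: torch.Tensor) -> torch.Tensor:
        # 1. Statically quantize activations: round(x / s), clamped to the int8 range.
        x_q = torch.clamp(torch.round(x / self.input_scale), -128, 127).to(torch.int8)
        # 2. Integer matmul, accumulated in int32 (an NPU kernel in the real path).
        acc = torch.matmul(x_q.to(torch.int32), self.weight_q.to(torch.int32).t())
        # 3. Dequantize: fold the activation and per-channel weight scales back in.
        return acc.to(torch.float32) * self.input_scale * self.weight_scale
```

Selecting the quantization config by platform, as the PR description states, keeps the 310P-specific weight handling out of the common code path: non-310P devices continue to use the existing configuration classes unchanged.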

- vLLM version: v0.14.1
- vLLM main: dc917cceb8

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
This commit is contained in:
Shaoxu Cheng (committed via GitHub), 2026-02-03 14:13:06 +08:00
parent 79803932e2
commit 39e77fb9e4
9 changed files with 392 additions and 22 deletions


@@ -0,0 +1,22 @@
import pytest

from tests.e2e.conftest import VllmRunner


@pytest.mark.parametrize("dtype", ["float16"])
@pytest.mark.parametrize("max_tokens", [5])
def test_qwen3_w8a8_e2e_310p(dtype: str, max_tokens: int) -> None:
    example_prompts = [
        "vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.",
    ]
    # Run the W8A8-quantized Qwen3-32B checkpoint through the Ascend
    # quantization path and check that greedy decoding completes end to end.
    with VllmRunner(
            "vllm-ascend/Qwen3-32B-W8A8",
            tensor_parallel_size=4,
            dtype=dtype,
            max_model_len=8192,
            enforce_eager=True,
            quantization="ascend",
            enable_prefix_caching=False,
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
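Note: the file's path is not shown above; assuming it lands under the repo's e2e test tree, it can be selected by name on a 310P host with at least 4 NPUs (the test sets `tensor_parallel_size=4`), e.g. `pytest -k test_qwen3_w8a8_e2e_310p`.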