### What this PR does / why we need it?
Introduced 310P W8A8 Quantization Support: New modules and methods have
been added to enable W8A8 static quantization specifically for the
Ascend 310P platform.
Platform-Specific Quantization Configuration Loading: The system now
dynamically loads the appropriate quantization configurations
(AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether
the current hardware is an Ascend 310P device.
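
The platform-based selection described above could look roughly like the sketch below. It is a stand-in under stated assumptions, not the code from this PR: the classes are simplified placeholders rather than the real vllm-ascend classes, and `AscendModelSlimConfig310P`, `select_modelslim_config`, and the `is_310p` argument are hypothetical names. The real implementation presumably queries the device instead of taking a boolean.

```python
from dataclasses import dataclass


@dataclass
class AscendModelSlimConfig:
    """Placeholder for the default config (not the real vllm-ascend class)."""

    linear_method: str = "AscendW8A8LinearMethod"


@dataclass
class AscendModelSlimConfig310P(AscendModelSlimConfig):
    """Hypothetical 310P variant that swaps in the 310P linear method."""

    linear_method: str = "AscendW8A8LinearMethod310P"


def select_modelslim_config(is_310p: bool) -> type[AscendModelSlimConfig]:
    """Return the config class matching the detected hardware.

    `is_310p` is passed in explicitly here (an assumption for illustration).
    """
    return AscendModelSlimConfig310P if is_310p else AscendModelSlimConfig
```

Calling `select_modelslim_config(True)()` yields a config whose `linear_method` names the 310P method, while `select_modelslim_config(False)()` keeps the default. Keeping the choice in one function means callers never branch on the hardware type themselves.
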
Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization
method for 310P is provided, handling the specifics of weight and
activation quantization, including input parameter broadcasting and
weight data manipulation.
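
For readers unfamiliar with W8A8 static quantization, the arithmetic can be sketched in plain Python as below. This is generic reference math under assumptions, not the 310P kernel or the code in `AscendW8A8LinearMethod310P`: the helper names are hypothetical, activations use a single pre-calibrated per-tensor scale and offset, and weights use symmetric per-output-channel scales. The sketch applies the scalar input parameters directly. Broadcasting them to one value per input channel gives the same result.

```python
def quantize_activations(x, scale, offset):
    """Static per-tensor quantization: float values -> int8 codes."""
    return [max(-128, min(127, round(v / scale) + offset)) for v in x]


def quantize_weights(w):
    """Symmetric per-output-channel int8 quantization.

    `w` is a list of rows, one row per output channel.
    """
    w_q, scales = [], []
    for row in w:
        s = max(abs(v) for v in row) / 127 or 1.0  # avoid a zero scale
        scales.append(s)
        w_q.append([round(v / s) for v in row])
    return w_q, scales


def w8a8_linear(x, w_q, w_scales, in_scale, in_offset):
    """Quantize the input, accumulate in integers, then dequantize."""
    x_q = quantize_activations(x, in_scale, in_offset)
    out = []
    for row, w_scale in zip(w_q, w_scales):
        # Integer dot product with the activation offset removed.
        acc = sum((a - in_offset) * b for a, b in zip(x_q, row))
        out.append(acc * in_scale * w_scale)
    return out
```

With `in_scale = 1.0 / 127` and `in_offset = 0`, the output of `w8a8_linear` stays close to the float matrix-vector product for inputs in `[-1, 1]`. Values outside the calibrated range are clamped to the int8 limits.
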
Extended AscendModelSlimConfig for 310P: A specialized configuration
class for 310P integrates the new W8A8 linear method for both standard
linear layers and vocabulary parallel embeddings, ensuring proper
quantization application.
- vLLM version: v0.14.1
- vLLM main: dc917cceb8
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
23 lines
666 B
Python
import pytest

from tests.e2e.conftest import VllmRunner


@pytest.mark.parametrize("dtype", ["float16"])
@pytest.mark.parametrize("max_tokens", [5])
def test_qwen3_w8a8_e2e_310p(dtype: str, max_tokens: int) -> None:
    example_prompts = [
        "vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.",
    ]

    with VllmRunner(
        "vllm-ascend/Qwen3-32B-W8A8",
        tensor_parallel_size=4,
        dtype=dtype,
        max_model_len=8192,
        enforce_eager=True,
        quantization="ascend",
        enable_prefix_caching=False,
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)