xc-llm-ascend/tests/e2e/310p/test_offline_inference_w8a8_310p.py

import pytest

from tests.e2e.conftest import VllmRunner


@pytest.mark.parametrize("dtype", ["float16"])
@pytest.mark.parametrize("max_tokens", [5])
def test_qwen3_w8a8_e2e_310p(dtype: str, max_tokens: int) -> None:
    """Smoke test: greedy generation with W8A8-quantized Qwen3-32B on Ascend 310P."""
    example_prompts = [
        "vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.",
    ]

    with VllmRunner(
        "vllm-ascend/Qwen3-32B-W8A8",
        tensor_parallel_size=4,
        dtype=dtype,
        max_model_len=8192,
        enforce_eager=True,
        quantization="ascend",
        enable_prefix_caching=False,
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
[Feat.]: support 310p w8a8 (#6454)

### What this PR does / why we need it?

- Introduced 310P W8A8 Quantization Support: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform.
- Platform-Specific Quantization Configuration Loading: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device.
- Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation.
- Extended AscendModelSlimConfig for 310P: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application.

- vLLM version: v0.14.1
- vLLM main: https://github.com/vllm-project/vllm/commit/dc917cceb877dfd13f98c538c4c96158047d98bd

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>

2026-02-03 14:13:06 +08:00
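
For readers unfamiliar with the term, the arithmetic behind "W8A8 static quantization" can be sketched in plain Python. This is a minimal illustration only, not the code of `AscendW8A8LinearMethod310P`: the function names `quantize` and `w8a8_linear`, the per-output-channel weight scales, and the use of the sample input as a stand-in for calibration data are all assumptions made for the example.

```python
import random


def quantize(values, scale, offset=0):
    """Round floats onto the int8 grid using a fixed scale and offset."""
    return [max(-128, min(127, round(v / scale) + offset)) for v in values]


def w8a8_linear(x, w_q, input_scale, weight_scales, input_offset=0):
    """Hypothetical W8A8 linear layer: int8 inputs times int8 weights."""
    # "Static" means input_scale/input_offset are fixed ahead of time,
    # not recomputed from each incoming activation.
    x_q = quantize(x, input_scale, input_offset)
    out = []
    for row, w_scale in zip(w_q, weight_scales):
        # Integer accumulation, with the activation offset removed first.
        acc = sum((xq - input_offset) * wq for xq, wq in zip(x_q, row))
        # Dequantize with the input scale and this row's weight scale.
        out.append(acc * input_scale * w_scale)
    return out


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(16)]
w = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]

# Offline step (assumed): symmetric per-output-channel weight quantization.
weight_scales = [max(abs(v) for v in row) / 127.0 for row in w]
w_q = [quantize(row, s) for row, s in zip(w, weight_scales)]

# Stand-in for a calibrated activation scale.
input_scale = max(abs(v) for v in x) / 127.0

y_quant = w8a8_linear(x, w_q, input_scale, weight_scales)
y_ref = [sum(a * b for a, b in zip(x, row)) for row in w]
max_err = max(abs(a - b) for a, b in zip(y_quant, y_ref))
```

Comparing `y_quant` against the float reference `y_ref` shows how much accuracy the int8 round trip costs for a given choice of scales. A real kernel would run the integer matmul on the device rather than in Python loops.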