[main] Use AddRmsNormQuant ops in the custom model to optimize Qwen3's performance (#1806)

### What this PR does / why we need it? Optimizes the performance of the Qwen3 quantization model by registering a custom model and adding the AddRmsNormQuant operation. Subsequent PRs will focus on performance optimizations based on this custom model. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with existing test. - vLLM version: v0.9.2 - vLLM main: 8d0a01a5f2 Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-07-22 19:03:13 +08:00
parent ce4970eee0
commit 9a3bdf2162
5 changed files with 227 additions and 8 deletions
--- a/tests/e2e/multicard/test_offline_inference_distributed.py
+++ b/tests/e2e/multicard/test_offline_inference_distributed.py
@@ -167,3 +167,20 @@ def test_models_distributed_topk() -> None:
            distributed_executor_backend="mp",
    ) as vllm_model:
        vllm_model.generate(example_prompts, sampling_params)
+
+
+def test_models_distributed_Qwen3_W8A8():
+    example_prompts = [
+        "Hello, my name is",
+    ]
+    max_tokens = 5
+
+    with VllmRunner(
+            snapshot_download("vllm-ascend/Qwen3-8B-W8A8"),
+            max_model_len=8192,
+            enforce_eager=True,
+            dtype="auto",
+            tensor_parallel_size=4,
+            quantization="ascend",
+    ) as vllm_model:
+        vllm_model.generate_greedy(example_prompts, max_tokens)