fixed fia pad logic in graph mode. (#7144)

### What this PR does / why we need it?

Related to vLLM PR #34043, which deleted the function `relax_for_mixed_batch_cudagraphs`. After that deletion, `num_reqs` no longer equals the actual number of requests. Because the fia operator requires that `query_start_loc[-1]` equal the total number of computed tokens, the deletion causes the ifa error. In full graph mode, set `num_reqs_paded = num_reqs` to fix the error.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
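A rough illustration of the constraint described in this PR: the sketch below is not the vllm-ascend implementation. The helper names `build_query_start_loc` and `check_fia_invariant` are hypothetical, and the padding scheme (one extra token per padded request slot) is an assumption chosen only to show one way `query_start_loc[-1]` can drift away from the total number of computed tokens.

```python
from itertools import accumulate


def build_query_start_loc(query_lens):
    """Hypothetical helper: cumulative token offsets, one entry per request plus a leading 0."""
    return [0, *accumulate(query_lens)]


def check_fia_invariant(query_start_loc, num_computed_tokens):
    """Hypothetical check: the last offset must equal the total number of computed tokens."""
    return query_start_loc[-1] == num_computed_tokens


query_lens = [3, 1, 1]           # three real requests, five tokens in total
num_reqs = len(query_lens)
total_tokens = sum(query_lens)

# Case 1: no request padding (num_reqs_paded == num_reqs), as this PR sets in full graph mode.
unpadded = build_query_start_loc(query_lens)

# Case 2 (assumed padding scheme): pad to 5 request slots with 1 token each,
# while the total number of computed tokens stays at 5.
num_reqs_paded = 5
padded = build_query_start_loc(query_lens + [1] * (num_reqs_paded - num_reqs))

print(unpadded, check_fia_invariant(unpadded, total_tokens))
print(padded, check_fia_invariant(padded, total_tokens))
```

In this toy setup the unpadded offsets satisfy the check, while the padded offsets end past the real token count and fail it.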
2026-03-12 14:50:54 +08:00
parent bbffe58b63
commit 37d1bd8c50
2 changed files with 20 additions and 6 deletions
--- a/tests/e2e/multicard/2-cards/test_full_graph_mode.py
+++ b/tests/e2e/multicard/2-cards/test_full_graph_mode.py
@@ -18,7 +18,6 @@
 #
 import os

-import pytest
 from vllm import SamplingParams

 from tests.e2e.conftest import VllmRunner
@@ -68,7 +67,6 @@ def test_qwen3_moe_full_decode_only_tp2():
    )


-@pytest.mark.skip(reason="CANN8.5 failed with this test, fix me")
 def test_qwen3_moe_full_graph_tp2():
    if "HCCL_OP_EXPANSION_MODE" in os.environ:
        del os.environ["HCCL_OP_EXPANSION_MODE"]