Commit Graph

194 Commits

Author SHA1 Message Date
CodeCat
b2857de43f [ST]Add e2e test for Npugraphex_pass (#6388)
### What this PR does / why we need it?
We found the custom passes of NPUGraphEX have implemented fusion
operator features, which still require E2E test case validation and
guard. This PR implements E2E test cases for the AddRMSNormQuant and
SplitQKVNormRope operator fusions under NPUGraphEX that are already in
the codebase.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: cjian <2318164299@qq.com>
2026-01-30 09:14:07 +08:00
wangxiyuan
f8e76a49fa [CI] Upgrade trasnformers version (#6307)
Upgrade transformers to >=4.56.4

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-28 14:06:39 +08:00
huangning1995
ce11fd49f3 [Feature] Batch invariant torch.compile (#6107)
### What this PR does / why we need it?
Building upon https://github.com/vllm-project/vllm-ascend/pull/5517 to
enable batch-invariant in vllm-ascend, we observed that the performance
of BI in eager mode remains suboptimal.

This PR further integrates batch-invariant with torch.compile, which
improves inference performance by 350% when tested with Qwen3-0.6B.

### Does this PR introduce _any_ user-facing change?
Previously, enabling both aclgraph and Batch-Invariant would cause an
"ub overflow" error. This occurred because transposed input tensors
could produce incorrect stride() values.

To fix this, we now call .contiguous() on the input tensors before
passing them to Triton kernels. This ensures a contiguous memory layout
and prevents transposed tensors from causing incorrect stride
calculations.

### Test Plan
pytest -sv --durations=0
tests/e2e/singlecard/test_aclgraph_batch_invariant.py

### Test Result
```
============================================================================ slowest durations ============================================================================
87.37s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle
77.39s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN
74.04s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail
73.59s call     tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation

(8 durations < 0.005s hidden.  Use -vv to show these durations.)
================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================
```
### Performance
export VLLM_BATCH_INVARIANT=1
vllm serve /home/Qwen3-0.6B \
--served-model-name qwen \
--port 8000 \
--max-num-seqs 256 \
--tensor-parallel-size 1 \
--max-model-len 5500 \
--max-num-batched-tokens 5500 \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.9 \
--compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY",
"cudagraph_capture_sizes":[1,2,4,8,16,32]}' \
--additional-config
'{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}'

vllm bench serve --served-model-name qwen --trust-remote-code --backend
vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name
random --random-input-len 512 --random-output-len 256 --num-prompts 800
--max-concurrency 8

torch.compile batch invariant performance:
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  477.21    
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              1.68      
Output token throughput (tok/s):         429.16    
Peak output token throughput (tok/s):    472.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1287.48   
---------------Time to First Token----------------
Mean TTFT (ms):                          285.53    
Median TTFT (ms):                        312.70    
P99 TTFT (ms):                           324.22    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.59     
Median TPOT (ms):                        17.50     
P99 TPOT (ms):                           18.44     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.59     
Median ITL (ms):                         17.45     
P99 ITL (ms):                            18.76     
==================================================
```
Eager
```
============ Serving Benchmark Result ============
Successful requests:                     800       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  1694.70   
Total input tokens:                      409600    
Total generated tokens:                  204800    
Request throughput (req/s):              0.47      
Output token throughput (tok/s):         120.85    
Peak output token throughput (tok/s):    136.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          362.54    
---------------Time to First Token----------------
Mean TTFT (ms):                          164.29    
Median TTFT (ms):                        129.71    
P99 TTFT (ms):                           1961.66   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.81     
Median TPOT (ms):                        65.15     
P99 TPOT (ms):                           72.27     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.81     
Median ITL (ms):                         64.64     
P99 ITL (ms):                            75.72     
==================================================
```

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: huangning1995 <huangning12@huawei.com>
2026-01-26 09:15:06 +08:00
Li Wang
63adbedb7a [Worker] Implement update max_model_len interface for NPUWorker (#6193)
### What this PR does / why we need it?
This patch purpose to add the `update_max_model_len` interface.

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 09:03:33 +08:00
Li Wang
ca297eb57f [CI] Migrate e2e test runner to hk (#5344)
### What this PR does / why we need it?
This patch add new runner labels for the HK region, and e2e single-card
testing has been migrated to this runner.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 09:00:51 +08:00
Angazenn
5b746f3e83 [Inductor]change pass to adapt to new addrmsnormBias operator (#6094)
### What this PR does / why we need it?
#5790 changes default addrmsnormBias operator if custom ops is enabled.
This PR modifies AddRmsNormQuant pass to align with addrmsnormBias.

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-01-24 20:16:44 +08:00
yjmyl
e90b14140b [feature] add_rms_norm support bias (#5790)
### What this PR does / why we need it?
This PR is to replace addRmsNorm and Add With addRmsNormBias. This way
can lead to a more effecient result.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Full Test Pass

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com>
Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>
2026-01-23 21:09:54 +08:00
wjunLu
a3079cd253 [Tests] Skip unstable eagle cases to keep CI success (#6180)
### What this PR does / why we need it?
The test case
`tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance`
fails occasionally, such result seems not stable with method `eagle`,
for example:

[tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance](https://github.com/vllm-project/vllm-ascend/actions/runs/21249578476/job/61147453980?pr=6151)

This PR skips the `eagle` tests to keep CI success

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-23 15:33:53 +08:00
wjunLu
f4a361fcc3 [CI] Re-open skipped cases due to PTA upgrading and update the golden results (#6144)
### What this PR does / why we need it?
Re-open `tests/e2e/singlecard/test_aclgraph_accuracy.py` and update its
golden results to match PTA 2.9.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-23 10:46:31 +08:00
wjunLu
a7d781f135 [Main] Upgrade PTA to 2.9.0 (#6112)
### What this PR does / why we need it?
Upgrade PTA to 2.9.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-22 17:59:06 +08:00
zhaomingyu13
34fb628248 [BugFix] Support setting tp=1 for the Eagle draft model to take effect (#6097)
According to the official documentation, the parameter
"draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3
model. However, based on actual debugging, it was found that the number
of tensor parallelisms (tp) of the Eagle model is consistent with that
of the target model. The setting of tp for the draft model did not take
effect as expected.

**Note:** This feature has not been superimposed and tested with `sp`
and `dp`. It will be adapted later
No
```python
from vllm import LLM, SamplingParams

def main():
    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(
            model="meta-llama/Llama-3.1-8B-Instruct",
            tensor_parallel_size=4,
            gpu_memory_utilization=0.9,
            enforce_eager=True,
            speculative_config={
                "method": "eagle3",
                "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"
                "draft_tensor_parallel_size": 1,
                "num_speculative_tokens": 3,
            },
        )
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Fixes vllm-project/vllm#31345

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarksblood@qq.com>
2026-01-22 11:36:23 +08:00
wangxiyuan
69740039b7 [CI] Upgrade CANN to 8.5.0 (#6070)
### What this PR does / why we need it?
1. Upgrade CANN to 8.5.0
2. move triton-ascend 3.2.0 to requirements

note: we skipped the two failed e2e test, see
https://github.com/vllm-project/vllm-ascend/issues/6076 for more detail.
We'll fix it soon.


### How was this patch tested?
Closes: https://github.com/vllm-project/vllm-ascend/issues/5494

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-22 09:29:50 +08:00
Magnus
5b129cf0a1 [1/N][Feat] Xlite Qwen3 MoE Support (#5951)
### What this PR does / why we need it?
This patch adds support for the Qwen3-MoE model in Xlite. For more
details about Xlite, please refer to the following
link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.

Qwen3-MoE TODO List:
- [ ] Qwen3-235B-A22B support
- [ ] Qwen3-MoE weights NZ support
- [ ] Qwen3-MoE data parallel support

## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance
Comparison
- aclgraph: main(69b170b8b5)
- xlite-full: main + xlite-full
- xlite-decode-only: main + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph

| maxconcurrency | item | TTFT(ms) | | TPOT(ms) | | QPS (req/s) |
OutputSpeed (token/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  |  | Avg | P99 | Avg | P99 |  |  |
| 1 | baseline-aclgraph | 205.07 | 287.29 | 12.34 | 12.65 | 0.14 | 78.81
|
| 1 | xlite-full | 66.40 | 113.69 | 11.71 | 12.40 | 0.15 | 84.73 |
| 1 | xlite-decode-only | 221.15 | 316.40 | 12.16 | 12.91 | 0.14 | 79.70
|
| 1 | diff1 | -67.62% | -60.43% | -5.11% | -1.98% | 7.14% | 7.51% |
| 1 | diff2 | 7.84% | 10.13% | -1.46% | 2.06% | 0.00% | 1.13% |
|  |  |  |  |  |  |  |  |
| 16 | baseline-aclgraph | 1892.16 | 13916.86 | 22.78 | 39.28 | 1.15 |
589.89 |
| 16 | xlite-full | 1355.40 | 8907.45 | 15.96 | 25.15 | 1.65 | 850.21 |
| 16 | xlite-decode-only | 1519.42 | 8711.64 | 19.23 | 29.73 | 1.38 |
711.60 |
| 16 | diff1 | -28.37% | -36.00% | -29.94% | -35.97% | 43.48% | 44.13% |
| 16 | diff2 | -19.70% | -37.40% | -15.58% | -24.31% | 20.00% | 20.63% |
|  |  |  |  |  |  |  |  |
| 32 | baseline-aclgraph | 673.80 | 3914.90 | 32.20 | 37.95 | 1.80 |
928.54 |
| 32 | xlite-full | 481.65 | 2710.50 | 19.95 | 25.35 | 2.91 | 1506.67 |
| 32 | xlite-decode-only | 372.22 | 1095.25 | 25.19 | 28.47 | 2.33 |
1202.82 |
| 32 | diff1 | -28.52% | -30.76% | -38.04% | -33.20% | 61.67% | 62.26% |
| 32 | diff2 | -44.76% | -72.02% | -21.77% | -24.98% | 29.44% | 29.54% |
|  |  |  |  |  |  |  |  |
| 48 | baseline-aclgraph | 583.18 | 3277.65 | 41.02 | 46.05 | 2.17 |
1115.08 |
| 48 | xlite-full | 973.42 | 8237.33 | 23.29 | 30.50 | 3.71 | 1908.09 |
| 48 | xlite-decode-only | 480.79 | 2026.98 | 31.48 | 35.41 | 2.83 |
1453.75 |
| 48 | diff1 | 66.92% | 151.32% | -43.22% | -33.77% | 70.97% | 71.12% |
| 48 | diff2 | -17.56% | -38.16% | -23.26% | -23.11% | 30.41% | 30.37% |
|  |  |  |  |  |  |  |  |
| 64 | baseline-aclgraph | 742.74 | 5953.39 | 47.79 | 53.15 | 2.48 |
1272.37 |
| 64 | xlite-full | 545.22 | 3941.34 | 25.09 | 30.41 | 4.64 | 2376.44 |
| 64 | xlite-decode-only | 752.40 | 4534.29 | 38.67 | 43.28 | 3.06 |
1567.94 |
| 64 | diff1 | -26.59% | -33.80% | -47.50% | -42.78% | 87.10% | 86.77% |
| 64 | diff2 | 1.30% | -23.84% | -19.08% | -18.57% | 23.39% | 23.23% |
|  |  |  |  |  |  |  |  |
| 100 | baseline-aclgraph | 565.52 | 1716.81 | 60.89 | 68.69 | 3.08 |
1580.64 |
| 100 | xlite-full | 398.14 | 2328.88 | 30.70 | 32.45 | 6.01 | 3086.42 |
| 100 | xlite-decode-only | 712.53 | 4875.94 | 52.71 | 60.78 | 3.53 |
1813.58 |
| 100 | diff1 | -29.60% | 35.65% | -49.58% | -52.76% | 95.13% | 95.26% |
| 100 | diff2 | 26.00% | 184.01% | -13.43% | -11.52% | 14.61% | 14.74% |
|  |  |  |  |  |  |  |  |
| 150 | baseline-aclgraph | 842.42 | 5175.01 | 73.60 | 88.18 | 3.80 |
1952.26 |
| 150 | xlite-full | 568.52 | 4204.33 | 37.90 | 40.01 | 7.27 | 3734.72 |
| 150 | xlite-decode-only | 654.43 | 2504.06 | 67.40 | 77.00 | 4.18 |
2145.11 |
| 150 | diff1 | -32.51% | -18.76% | -48.51% | -54.63% | 91.32% | 91.30%
|
| 150 | diff2 | -22.32% | -51.61% | -8.42% | -12.68% | 10.00% | 9.88% |
|  |  |  |  |  |  |  |  |
| 200 | baseline-aclgraph | 750.63 | 3049.91 | 88.26 | 101.95 | 4.28 |
2189.72 |
| 200 | xlite-full | 558.48 | 3791.98 | 45.54 | 49.04 | 8.17 | 4175.52 |
| 200 | xlite-decode-only | 807.09 | 4254.95 | 85.18 | 101.79 | 4.44 |
2271.52 |
| 200 | diff1 | -25.60% | 24.33% | -48.40% | -51.90% | 90.89% | 90.69% |
| 200 | diff2 | 7.52% | 39.51% | -3.49% | -0.16% | 3.74% | 3.74% |
|  |  |  |  |  |  |  |  |

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: LVYANGGUO <275926687@qq.com>
Co-authored-by: lulina <lina.lulina@huawei.com>
2026-01-21 09:26:03 +08:00
ChenCangtao
6c30f8bf87 [Feature]refactor the npugraph_ex config, support online-infer with static kernel (#5775)
### What this PR does / why we need it?
This is a part of
https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762
1. refactor the npugraph_ex config,modified the default configuration of
the static kernel, new default value of static kernel is false
2. support online-infer with static kernel
3. fixed the issue where manually modifying FX graphs caused an abnormal
model return type, and removed the related redundant code.

### Does this PR introduce _any_ user-facing change?
yes,the new config of npugraph_ex is as follow:
```
additional_config={
            "npugraph_ex_config": {
                "enable": True,
                "enable_static_kernel": False
            }
        }
```
### How was this patch tested?
```
vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \
    --host 0.0.0.0 \
    --port 8004 \
    --data-parallel-size 4 \
    --tensor-parallel-size 4 \
    --quantization ascend \
    --seed 1024 \
    --served-model-name deepseek_v3 \
    --enable-expert-parallel \
    --max-num-seqs 48 \
    --max-model-len 40000 \
    --async-scheduling \
    --max-num-batched-tokens 9000 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \
    --gpu-memory-utilization 0.9 \
    --compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --additional-config \
    '{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}'
```

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-20 21:31:38 +08:00
Song Zhixin
2b6dc100b5 Eagle3 mm support, enablement on qwen3vl (#4848)
### What this PR does / why we need it?
follow pr
[https://github.com/vllm-project/vllm/pull/20788](https://github.com/vllm-project/vllm/pull/20788)
, Eagle3 mm support, enablement on qwen3vl
target model
[Qwen/Qwen3-VL-8B-Instruct]([https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct])
eagle3
[MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv

vLLM with eagle3 :
```bash
vllm serve /model/Qwen3-VL-8B-Instruct   --enforce-eager   --port 9100    --max-model-len 32768   --max-num-seqs 32   --tensor-parallel-size 2   --allowed-local-media-path /model/gx/images  --speculative-config '{
    "method": "eagle3",
    "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3",
    "num_speculative_tokens": 3
  }'
```
vLLM without eagle3 :
```bash
vllm serve /model/Qwen3-VL-8B-Instruct   --enforce-eager   --port 9100    --max-model-len 32768   --max-num-seqs 32   --tensor-parallel-size 2   --allowed-local-media-path /model/gx/images 
```

bench:
```
vllm bench serve   --backend openai-chat   --base-url http://127.0.0.1:9100   --tokenizer /model/Qwen3-VL-8B-Instruct   --endpoint /v1/chat/completions   --model /model/Qwen3-VL-8B-Instruct   --dataset-name random  --num-prompts 50   --max-concurrency 5   --temperature 0   --top-p 1.0   --seed 123
```

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: jesse <szxfml@gmail.com>
2026-01-19 08:58:07 +08:00
wjunLu
73a3f822c7 [Main2Main] Upgrade vllm commit to releases/v0.14.0 (#5911)
### What this PR does / why we need it?
Upgrade vllm commit to releases/v0.14.0

- Re-open cases in `tests/e2e/singlecard/pooling/test_scoring.py`, since
the errors before have been fixed by
https://github.com/vllm-project/vllm/pull/32243
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-15 23:22:43 +08:00
wjunLu
c11a05c4e1 [Main2Main] Upgrade vllm commit to 0113 (#5839)
### What this PR does / why we need it?
Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9)

- Modify import paths due to the refactors
https://github.com/vllm-project/vllm/pull/31916
https://github.com/vllm-project/vllm/pull/32054

- Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional
arguments but 3 were given` due to
https://github.com/vllm-project/vllm/pull/24498

- Skip the async-scheduling tests in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never
verified
https://github.com/vllm-project/vllm/pull/31998

- Skip some pooling tests, which are caused by
https://github.com/vllm-project/vllm/pull/32148
where vllm is also failed
https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4

We will reopen those tests when main2main reachs
https://github.com/vllm-project/vllm/pull/32243

- Skip some cases in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are
broken by
https://github.com/vllm-project/vllm/pull/32118

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-01-15 09:48:53 +08:00
zhaomingyu13
01805fbd7d Revert "[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519)"(#5902)
This reverts commit d886b81971. it breaks pd function

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2026-01-14 20:55:10 +08:00
Ronald
e20813f441 [Feature] implement eagle spec decoding for model runner v2 (#5840)
### What this PR does / why we need it?
this pr implement eagle spec decoding for model runner v2, please see
RFC https://github.com/vllm-project/vllm-ascend/issues/5208

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: v0.13.0

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-01-14 09:18:05 +08:00
drslark
48ec97821a [Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816)
### What this PR does / why we need it?
Fixed an accuracy problem when using eagle3 with sp.

The problem is described in
https://github.com/vllm-project/vllm-ascend/issues/5825.

It also adds a much more precise way to determine whether drafter should
use `sp` or not.

Also, it changes the `eager` of drafter to be a real `eager` in frontend
to avoid a `fx-graph` problem.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

For simpilicity, we test it as in
https://github.com/vllm-project/vllm-ascend/issues/5825.

And we get the same result of `eagle3` with `sp` disabled.

```text
--------------------------------------------------
total_num_output_tokens: 1000
num_drafts: 437
num_draft_tokens: 1311
num_accepted_tokens: 564
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.62
acceptance at token 1: 0.40
acceptance at token 2: 0.27
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
```

* vLLM version: v0.13.0
* vLLM main:
2f4e6548ef

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-14 09:00:37 +08:00
zhangxinyuehfad
f7b904641e [Main2Main] Upgrade vllm commit to 0109 (#5752)
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. remove `init_cached_hf_modules ` due to
https://github.com/vllm-project/vllm/pull/31786
2. fix spec_decode e2e test due to
https://github.com/vllm-project/vllm/pull/29821 break
3. fix `vllm.v1.attention.backends.utils` duo to
https://github.com/vllm-project/vllm/pull/31891
4. fix `self.seq_lens - query_lens` on same device due to
https://github.com/vllm-project/vllm/pull/31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-13 19:14:43 +08:00
yupeng
5b95c6b03a [Test][e2e][LoRA] Add more e2e tests to cover scenarios of LoRA (#4075)
### What this PR does / why we need it?

This PR depends on PR
https://github.com/vllm-project/vllm-ascend/pull/4046. And only if the
latter merged, it will work.

This PR aims to solve the issue
https://github.com/vllm-project/vllm-ascend/issues/3240.

The new-added Llama-2-7b-hf and Qwen3-0.6B testcases will cover the
senarios that the LoRA weights are added to q_proj, v_proj, k_proj,
o_proj, gate_proj, up_proj, down_proj, embed_tokens and lm_head modules.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama2_lora.py
pytest -sv tests/e2e/singlecard/test_qwen3_multi_loras.py


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2026-01-13 16:32:28 +08:00
zhaomingyu13
d886b81971 [BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519)
### What this PR does / why we need it?
According to the official documentation, the parameter
"draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3
model. However, based on actual debugging, it was found that the number
of tensor parallelisms (tp) of the Eagle model is consistent with that
of the target model. The setting of tp for the draft model did not take
effect as expected.

**Note:** This feature has not been superimposed and tested with `sp`
and `dp`. It will be adapted later
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```python
from vllm import LLM, SamplingParams

def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(
            model="meta-llama/Llama-3.1-8B-Instruct",
            tensor_parallel_size=4,
            gpu_memory_utilization=0.9,
            enforce_eager=True,
            speculative_config={
                "method": "eagle3",
                "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"
                "draft_tensor_parallel_size": 1,
                "num_speculative_tokens": 3,
            },
        )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Fixes vllm-project/vllm#31345

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Co-authored-by: drslark <slarksblood@qq.com>
2026-01-13 09:14:30 +08:00
gh924
6880c1b383 [Feature] Support for cross-attention and whisper model (#5592)
### What this PR does / why we need it?
To solve the problem of the
issue:https://github.com/vllm-project/vllm-ascend/issues/2262

- support for cross-attention when the model is encoder-decoder
- support for whisper model

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: gh924 <guihao2@huawei.com>
Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>
2026-01-11 11:38:45 +08:00
ZT-AIA
e11ff8e535 [BufFix]Fix the error when using Ascend custom operators with rank=128 (#5394)
### What this PR does / why we need it?
The customized ascend operator sgmv_expand and sgmv_shrink applies only
to the scenario where rank is 8,16,32,64. When rank >= 128, the operator
is out of range, causing the model to report an error.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Depends on this commit https://github.com/vllm-project/vllm/pull/31408 
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
2026-01-09 15:57:43 +08:00
Chenxi Qian
40eb3e1836 [OP] Enable custom op aclnnMoeInitRoutingCustom (#5332)
### What this PR does / why we need it?
This PR enables custom op `aclnnMoeInitRoutingCustom` introduced in PR
#5251

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
2026-01-09 09:35:18 +08:00
drslark
ccbc5e2ba1 [Feat][Bugfix][main] Adapted SP to eagle3 (#5562)
### What this PR does / why we need it?

Adapted sp to eagle3.

There may still be some problems, e.g., accuracy in some scenes,
`sp`+`dp`...

We will fix them later.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

We tested it mainly in a new `e2e`.

```shell
pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance
```

```text
.

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============= 3 passed, 1 skipped, 2 warnings in 142.05s (0:02:22) =============
```

It passed.

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-08 15:33:52 +08:00
Li Wang
1165b2c863 [1/N][CI] Refactor accuracy test (#5400)
### What this PR does / why we need it?
1. Accuracy testing no longer compares eager and graph modes; instead,
it directly extracts the golden result under the graph mode
configuration (the implicit purpose of this case is to verify whether
modifications affect existing results)
2. Next step: finer-grained supervision of logits/sampler results
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-07 20:58:15 +08:00
wangxiyuan
6f7a81cd9f [CI] cleanup single/multi-card test (#5623)
1. speed up e2e light test.
2. create `2-cards` and `4-cards` folder in multicard
3. move ops to nightly
4. run test in Alphabetical Order

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-07 14:13:34 +08:00
Ronald
6ea2afe5fa [Feature] implement basic framework for batch invariant (#5517)
### What this PR does / why we need it?
This PR implement the basic framework for batch invariant, please see
https://github.com/vllm-project/vllm-ascend/issues/5487.
### Does this PR introduce _any_ user-facing change?
we reuse the function `vllm_is_batch_invariant` in vllm to judge if
batch invariant is enabled.

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Signed-off-by: Lord_of_Ironhill <suiweiyi@huawei.com>
Signed-off-by: zjchenn <zjchenn@gmail.com>
Signed-off-by: wangx700 <wangxin700@huawei.com>
Co-authored-by: Lord_of_Ironhill <suiweiyi@huawei.com>
Co-authored-by: zjchenn <zjchenn@gmail.com>
Co-authored-by: wangx700 <wangxin700@huawei.com>
2026-01-07 09:11:26 +08:00
Magnus
2b5536362a [CI] skip xlite-decode-only e2e test (#5407)
### What this PR does / why we need it?
skip xlite-decode-only e2e test, since it's unstable

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

Signed-off-by: changdawei1 <changdawei3@huawei.com>
2026-01-05 11:05:26 +08:00
lilinsiman
46862ce1af [main][test] Refactor the mtp and eagle test case (#5326)
### What this PR does / why we need it?
1. Refactor the current test with mtp and eagle cases
2. Add new necessary cases with mtp and eagle

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-31 09:22:58 +08:00
weiguihua2
15d73f248e [refactor] refactor model runner capture model (#5230)
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.

Currently, most of the logic in the capture_model method is similar to
that in the vllm code. Directly using the vllm method can reduce the
maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to
initialize_attn_backend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-30 08:32:14 +08:00
Ronald
e7e1a7dc05 [Feature] support eager mode in model runner v2 (#5210)
### What this PR does / why we need it?
#5051 only implement a basic framework for model runner v2, but there
are still some bugs for e2e functionality, this PR aim to enable basic
functionality.
model runner v2 plans:
https://github.com/vllm-project/vllm-ascend/issues/5208

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-12-29 15:28:34 +08:00
anon189Ty
3e67e8276c [Feature] Support to use fullgraph with eagle (#5118)
### What this PR does / why we need it?
    
We support to use full graph with eagle. 

Change list:
1. Distinguish between processing graph_params and draft_graph_params in
attention_v1.
    2. Adapt the full-graph mode in eagle_proposer, include:
        1). If use full graph, make Fullgraph Wrapper when load model.
2). Build a new meatadata, set running mode in FULL and mark attention
update in dummy_run when in Fullgraph mode.
3). Fixed and fill any attn_metadata, such as
attn_metadata.slot_mapping.
        4). Add a descriptor.
        5). Set running mode and triggered update metadata.
3. Trans is_mtp_model to is_draft_model, and add the update of
workspace.

NOTE:
When set async_scheduling=True, the draft model will enforce execution
in eager mode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
2025-12-29 09:54:51 +08:00
whx
3f33ad23fe [BugFix] Fix npu-cpu offloading interface change bug. (#5290)
### What this PR does / why we need it?
Last month the interface of `OffloadingSpec` has
changed(https://github.com/vllm-project/vllm/pull/27743). This PR fixes
this bug and adds e2e test for cpu offloading.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test.


- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-12-27 10:21:20 +08:00
Zetong Li
16ef2474bf [Test] Add acceptance test for eagle/eagle3 (#5366)
### What this PR does / why we need it?
This PR aims to add acceptance test for eagle/eagle3 via llama/qwen. We
obtained golden baselines by running several times (based on healthy
main), which is feasible and convincing.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-12-27 08:50:01 +08:00
Wang Kunpeng
bc5b7a5fb5 [bugfix] Fix MHA model runtime error in aclgraph mode (#5397)
### What this PR does / why we need it?
Currently, MHA models (eg: minicpm-2b, Baichuan-7b) will encounter
errors when running in piecewise graph mode, with error messages similar
to:
```
(E89999):  When layout is TND and PA not enabled, keyT(8) and valueT(8) must be equal to the last element of actualSeqenceLengthKV(5)[FUNC:CheckInputShapeWhenLayoutIsTND][FILE:prompt_flash_attention_tiling.cpp][LINE:3618]
```
The error occurs because the qkv in the Prefill stage is also padded,
causing the shape to be inconsistent with actual_seq_lengths.
Add unpadding logic for kv.

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-26 21:37:28 +08:00
zhangyiming
45c5bcd962 [E2E] Optimize the E2E test time. (#5294)
### What this PR does / why we need it?
Add cudagraph_capture_sizes for E2E CI test.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-26 14:17:50 +08:00
wangxiyuan
29d2fe653d cleanup ascend config (#5296)
1. refresh additional config doc
2. move kv config logic to platform.
3. improve `dump_config` init logic and rename it to `dump_config_path`
this change is user impacted. dump_config is changed from dict to
string.
4. correct `enable_async_exponential` type
5. remove useless `chunked_prefill_for_mla`

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-26 14:07:37 +08:00
Icey
9b2a7d8866 [BugFix][Fusion] Patch compile backend to make fusion available (#5308)
Currently, the vllm pr: https://github.com/vllm-project/vllm/pull/24252
is causing operator fusion to fail, which can be mitigated by patching
the backend. Once the problem is completely resolved, I will submit a
new pull request to remove the patch.

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-12-26 09:18:16 +08:00
Magnus
59f11dd1cb [Bugfix] fix xlite decode-only e2e test (#5354)
### What this PR does / why we need it?
fix xlite decode-only e2e test, xlite decode-only mode utilizes
aclgraph's prefill and will be affected by aclgraph, so shortened test
length.

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: changdawei1 <changdawei3@huawei.com>
2025-12-25 16:30:17 +08:00
Wang Kunpeng
13cd6362c6 [bugfix] fix Error 'ValueError: Duplicate layer name' (#5280)
### What this PR does / why we need it?
When matmul_and_reduce is enabled, the prefix attribute is required.
However, in some models, the prefix is not passed correctly, causing
errors when starting the service.
The issue of incorrect prefix passing will be fixed in vLLM in the
future.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-12-25 10:43:24 +08:00
Magnus
a9fccbeb30 [CI] add xlite e2e test (#5305)
### What this PR does / why we need it?
add xlite e2e test

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: DaweiChang <405739598@qq.com>
2025-12-25 09:17:06 +08:00
Mengqing Cao
e54630e01c Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) (#5317)
### What this PR does / why we need it?
Revert [KV-Sharing] Support KV-Sharing feature in CLA models (#4138) as
it causes deepseek v3.2 hang error


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-12-24 22:24:17 +08:00
TmacAaron
5018f2d8fd [quantization] Add w8a16 quantization support (#4541)
### What this PR does / why we need it?
related to https://github.com/vllm-project/vllm-ascend/issues/4267

### Does this PR introduce _any_ user-facing change?
support w8a16 quantization now

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

### Test
tested using [aisbench](https://gitee.com/aisbench/benchmark/) with tp2
#### Precision
  | ceval | mmlu | gsm8k
-- | -- | -- | --
bf16 | 90.46 | 89.17 | 96.21
w8a16 | 89.51 | 89.29 | 95.98

#### Performance
  | input_len | output_len | concurrency | TTFT (ms) | TPOT (ms) | TPS
(Total) (tokens/s)
-- | -- | -- | -- | -- | -- | --
bf16 | 2048 | 2048 | 10 | 1911.7136 | 77.988 | 253.9866
w8a16 | 2048 | 2048 | 10 | 2128.6334 | 67.1633 | 293.9117
bf16 | 3500 | 1024 | 10 | 3076.2509 | 84.3525 | 506.949
w8a16 | 3500 | 1024 | 10 | 2685.2031 | 73.015 | 585.4717

---------

Signed-off-by: yyt <yangyit139@gmail.com>
Signed-off-by: TmacAaron <yangyit139@gmail.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2025-12-24 19:49:32 +08:00
zhangyiming
74a1de50a9 [E2E] Optimize e2e test. (#5091)
### What this PR does / why we need it?
[E2E] Optimize e2e test.
- Remove the test_basic_camem testcase.
- Change Qwen2.5-0.5B-Instruct-W8A8 to Qwen3-0.6B-W8A8

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-24 10:41:55 +08:00
zhangyiming
bd4fb871c6 [CI] Add skipped testcases. (#5254)
### What this PR does / why we need it?
Some E2E testcases are not in our CI workflow, this PR add them back.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-24 10:41:32 +08:00
zhangxinyuehfad
8ae7fca947 [CI] refect e2e ci test (#5246)
### What this PR does / why we need it?
efect e2e ci test:
1. tests/e2e/singlecard/pooling/test_embedding.py: remove the eager
parameter and rename test case
2. tests/e2e/singlecard/pooling/test_scoring.py: Rename test cases
3. tests/e2e/singlecard/pooling/test_classification.py: Rename test case
4. tests/e2e/singlecard/test_quantization.py: remove the eager parameter
and chage model to vllm-ascend/Qwen2.5-0.6B-W8A8 and Rename test case
5. tests/e2e/multicard/test_shared_expert_dp.py: Rename test cases
6. tests/e2e/singlecard/test_sampler.py: Rename test cases
7. tests/e2e/singlecard/test_aclgraph_accuracy.py: Rename test cases
8. tests/e2e/multicard/test_offline_inference_distributed.py: Rename
test cases and remove the eager parameter
9. tests/e2e/multicard/long_sequence/test_accuracy.py: Rename test cases
and remove the eager parameter
10. tests/e2e/multicard/long_sequence/test_basic.py: Rename test cases
and remove the eager parameter
11.tests/e2e/multicard/test_expert_parallel.py:remove the eager
parameter
12.tests/e2e/multicard/test_full_graph_mode.py:remove the eager
parameter
13.tests/e2e/multicard/test_ilama_lora_tp2.py:remove the eager parameter

14.tests/e2e/singlecard/spec_decode_v1/test_v1_mtp_correctness.py:remove
the eager parameter
15.tests/e2e/singlecard/spec_decode_v1/test_v1_spec_decode.py:remove the
eager parameter
16.tests/e2e/singlecard/test_aclgraph_accuracy.py:remove the eager
parameter
17.tests/e2e/singlecard/test_camem.py:remove the eager parameter
18.tests/e2e/singlecard/test_ilama_lora.py:remove the eager parameter

19.tests/e2e/singlecard/test_multistream_overlap_shared_expert.py:remove
the eager parameter
20.tests/e2e/singlecard/test_vlm.py:remove the eager parameter
21.tests/e2e/singlecard/test_xli:remove the eager parameter

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-12-23 18:42:35 +08:00
Li Wang
5d1f6daef6 [CI] Mock spawn for vlm tests (#5279)
### What this PR does / why we need it?
Using `spawn` in continuous testing scenarios
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-23 18:35:06 +08:00