1220 Commits

Author SHA1 Message Date
Li Wang
90ae114569 [CI] Fix nightly CI (#3821)
### What this PR does / why we need it?
This patch fixes the nightly CI run
[failure](https://github.com/vllm-project/vllm-ascend/actions/runs/18848144365)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.1

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-28 20:40:03 +08:00
Icey
a7450db1bd Upgrade to 0.11.1 newest vllm commit (#3762)
### What this PR does / why we need it?

c9461e05a4

- Fix the `spec decode rejection sampler`, caused by
https://github.com/vllm-project/vllm/pull/26060
- Fix some `import` statements, caused by
https://github.com/vllm-project/vllm/pull/27374
- Fix `scheduler_config.send_delta_data`, caused by
https://github.com/vllm-project/vllm-ascend/pull/3719
- Fix `init_with_cudagraph_sizes`, caused by
https://github.com/vllm-project/vllm/pull/26016
- Fix the `vl model` after PatchEmbed's conv3d was replaced with a linear
layer, caused by https://github.com/vllm-project/vllm/pull/27418

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-10-28 14:55:03 +08:00
Li Wang
f846bd20e4 [CI] Add multi-node test case for a2 (#3805)
### What this PR does / why we need it?
This patch adds a multi-node test case for a2.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-27 23:10:17 +08:00
jiangyunfan1
9030106a14 [TEST]Add 2P1D multi node cases for nightly test (#3764)
### What this PR does / why we need it?
This PR adds the 2P1D multi-node func/acc/perf test cases; we need to
test them daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
2025-10-27 23:09:15 +08:00
Levi
d64bdd06ae 【Bugfix】bugfix for weight load of kimi-k2 (#3798)
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>

### What this PR does / why we need it?
Fix the kimi-k2 startup bug, a weight-loading error:
https://github.com/vllm-project/vllm-ascend/issues/3785

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: zhaozx-cn <zhaozx2116@163.com>
2025-10-27 21:18:35 +08:00
wangxiyuan
da5f2cc1e3 [Doc] Update FAQ (#3792)
Much of the FAQ content is out of date; this PR refreshes it.

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-27 20:32:17 +08:00
shiyuan680
00aa0bf33e support prefill cache mode use fia op (#3696)
### What this PR does / why we need it?
Support prefill cache mode using the fia op for full graph.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

origin
============ Serving Benchmark Result ============
Successful requests:                     30
Maximum request concurrency:             256
Request rate configured (RPS):           0.70
Benchmark duration (s):                  131.63
Total input tokens:                      61363
Total generated tokens:                  61440
Request throughput (req/s):              0.23
Output token throughput (tok/s):         466.77
Peak output token throughput (tok/s):    750.00
Peak concurrent requests:                30.00
Total Token throughput (tok/s):          932.95
---------------Time to First Token----------------
Mean TTFT (ms):                          125.17
Median TTFT (ms):                        121.51
P50 TTFT (ms):                           121.51
P90 TTFT (ms):                           140.91
P99 TTFT (ms):                           182.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.85
Median TPOT (ms):                        43.84
P50 TPOT (ms):                           43.84
P90 TPOT (ms):                           44.28
P99 TPOT (ms):                           44.32
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.85
Median ITL (ms):                         42.63
P50 ITL (ms):                            42.63
P90 ITL (ms):                            48.74
P99 ITL (ms):                            59.62
==================================================

after
============ Serving Benchmark Result ============
Successful requests:                     30
Maximum request concurrency:             256
Request rate configured (RPS):           0.70
Benchmark duration (s):                  130.10
Total input tokens:                      61363
Total generated tokens:                  61440
Request throughput (req/s):              0.23
Output token throughput (tok/s):         472.26
Peak output token throughput (tok/s):    750.00
Peak concurrent requests:                30.00
Total Token throughput (tok/s):          943.94
---------------Time to First Token----------------
Mean TTFT (ms):                          123.69
Median TTFT (ms):                        122.51
P50 TTFT (ms):                           122.51
P90 TTFT (ms):                           143.69
P99 TTFT (ms):                           165.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.07
Median TPOT (ms):                        43.13
P50 TPOT (ms):                           43.13
P90 TPOT (ms):                           43.50
P99 TPOT (ms):                           43.57
---------------Inter-token Latency----------------
Mean ITL (ms):                           43.07
Median ITL (ms):                         41.81
P50 ITL (ms):                            41.81
P90 ITL (ms):                            48.11
P99 ITL (ms):                            62.13
==================================================
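
As a quick reading aid for the two runs above, the sketch below computes the relative change for a few of the reported metrics. The numbers are copied from the benchmark output; the helper name `percent_change` and the choice of metrics are illustrative only.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (original, after) pairs copied from the benchmark output above.
metrics = {
    "Output token throughput (tok/s)": (466.77, 472.26),
    "Mean TTFT (ms)": (125.17, 123.69),
    "Mean TPOT (ms)": (43.85, 43.07),
    "P99 ITL (ms)": (59.62, 62.13),
}

for name, (before, after) in metrics.items():
    # Positive is better for throughput; negative is better for latencies.
    print(f"{name}: {percent_change(before, after):+.2f}%")
```

With only 30 requests per run, differences of this size may be within run-to-run noise.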

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-10-27 19:41:07 +08:00
Shanshan Shen
3e5ae49160 [MM][Doc] Update online serving tutorials for Qwen2-Audio (#3606)
### What this PR does / why we need it?
Update online serving tutorials for `Qwen2-Audio`.

Part of https://github.com/vllm-project/vllm-ascend/issues/3508.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-10-27 16:58:03 +08:00
Shirley125
d8ca7fee75 [bugfix][main]fix proxy decode bug (#3750)
### What this PR does / why we need it?

fix proxy decode bug when parsing non-UTF-8 characters.
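
For context, a minimal sketch of the general failure mode: strict UTF-8 decoding raises on invalid bytes, while a tolerant error handler does not. This is not the code from this PR; the payload and variable names are made up for illustration, and `errors="replace"` is just one common way to handle such input.

```python
# Hypothetical payload: 0xE9 is not a valid UTF-8 sequence in this position.
raw = b"data: caf\xe9\n"

# Strict decoding (the default) raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
text = raw.decode("utf-8", errors="replace")
print(strict_failed, repr(text))
```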

- vLLM version: v0.11.0
- vLLM main:
c9461e05a4

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
2025-10-27 16:56:09 +08:00
yupeng
b8796b06c8 [Doc][Example][Bugfix] Elements in local_device_ids should be casted … (#3782)
### What this PR does / why we need it?
This is a tiny bugfix in the `gen_ranktable.py` script. The script is a
utility that helps set up an example case. It is used to prepare a
ranktable before disaggregated prefill deployment.

Elements in the `local_device_ids` list should be cast to `int` before
being used in a modulo operation.
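
To illustrate why the cast matters, here is a minimal sketch. It assumes the IDs arrive as strings (for example, split from a comma-separated argument); the values and the divisor `devices_per_node` are hypothetical and not taken from `gen_ranktable.py`.

```python
# Hypothetical input: device IDs parsed from a comma-separated string.
local_device_ids = "8,9,10,11".split(",")  # elements are strings, not ints
devices_per_node = 8  # assumed divisor, for illustration only

# With a str on the left, `%` means string formatting, not modulo, so it fails.
try:
    "8" % devices_per_node
    raised = False
except TypeError:
    raised = True

# Casting each element to int first makes `%` an arithmetic modulo.
local_ranks = [int(device_id) % devices_per_node for device_id in local_device_ids]
print(raised, local_ranks)
```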

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.


- vLLM version: v0.11.0
- vLLM main:
c9461e05a4

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-10-27 14:52:47 +08:00
dependabot[bot]
638d8d1a47 Bump actions/upload-artifact from 4 to 5 (#3786)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5.

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-27 14:11:53 +08:00
dependabot[bot]
79623e0bab Bump actions/download-artifact from 5 to 6 (#3787)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5 to 6.

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-27 14:10:56 +08:00
jiangyunfan1
e9072429fb [CI] Enable 2 jobs for nightly test (#3781)
### What this PR does / why we need it?
This PR adds 2 jobs, containing 4 test cases, to the a3 nightly test; we
need to test them nightly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running the test

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-27 14:08:29 +08:00
Li Wang
60ee4af6d0 [CI] Add custom op to nightly (#3765)
### What this PR does / why we need it?
1. Add custom op to nightly tests, fix
https://github.com/vllm-project/vllm-ascend/pull/3665
2. Correctly pass github secrets when using workflow_call, see
https://docs.github.com/en/actions/how-tos/reuse-automations/reuse-workflows
3. Fix the single node mutual cancellation issue

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-27 14:07:03 +08:00
weiguihua2
4312a92a4f [feat]dcp pcp support aclgraph (#3731)
### What this PR does / why we need it?
dcp and pcp now support full aclgraph, including mla attention_v1.

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-10-27 09:58:23 +08:00
Yizhou
8ab8111fde [Fix] Prevent memory leak in MLA decode graph (#3743)
### What this PR does / why we need it?
The cache for MLA decode graph parameters was holding strong references
to tensors, preventing them from being garbage collected and leading to
increased memory usage.

This change wraps the cached tensors in weak references, allowing them
to be deallocated when no longer in use and reducing overall memory
pressure.
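
The idea can be sketched with the standard `weakref` module. This is an illustration only, not the PR's code: `FakeTensor`, `graph_param_cache`, `cache_param`, and `get_param` are hypothetical names, and a plain Python object stands in for a tensor.

```python
import gc
import weakref


class FakeTensor:
    """Stand-in for a tensor; any object that supports weak references works."""

    def __init__(self, name):
        self.name = name


# Hypothetical cache: stores weak references instead of the objects themselves.
graph_param_cache = {}


def cache_param(key, tensor):
    graph_param_cache[key] = weakref.ref(tensor)


def get_param(key):
    ref = graph_param_cache.get(key)
    # Calling a weak reference returns the object, or None once it is collected.
    return ref() if ref is not None else None


tensor = FakeTensor("block_table")
cache_param("decode", tensor)
alive_before = get_param("decode") is tensor

del tensor    # drop the only strong reference
gc.collect()  # make collection deterministic across interpreters
alive_after = get_param("decode") is not None
print(alive_before, alive_after)
```

Callers of such a cache have to handle the `None` case, since an entry can disappear as soon as the last strong reference is gone.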

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 20:37:33 +08:00
22dimensions
afc58184ec [Installation] limit opencv-python-headless version to resolve numpy version conflict (#3713)
### What this PR does / why we need it?

vllm requires opencv-python-headless >= 4.11.0, which requires
numpy>=2,<2.3.0, but vllm-ascend requires numpy < 2.0.0, so limiting
opencv-python-headless to versions below 4.11.0.86 resolves this
conflict.

### How was this patch tested?

tested by CI 

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-10-25 18:07:54 +08:00
Icey
bb5f16d926 [BugFix] Fix Qwen3-next break (#3428)
### What this PR does / why we need it?
Fix Qwen3NextGatedDeltaNet, caused by
https://github.com/vllm-project/vllm/pull/26437

### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "窗前明月光,",
        "The president of the United States is Mr.",
        "The capital of France is",
        "The future of AI is",
        "感时花溅泪,",
        "家书抵万金啥意思?",
        "plz tell me a story: ",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM from a local model path.
    llm = LLM(
        model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-10-25 18:03:36 +08:00
ck-hw-1018
7572939b94 add qwq testcase (#3757)
### What this PR does / why we need it?
This PR adds a qwq case to the nightly test for qwen-qwq on A3; we need
to test it daily.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
by running the test


- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4

---------

Signed-off-by: ckhw <cuikai1@huawei.com>
2025-10-25 17:11:35 +08:00
zzzzwwjj
e5676fc36e [main] remove dbo code (#3712)
### What this PR does / why we need it?
Remove the dbo code.
vLLM now supports dbo via PR:
https://github.com/vllm-project/vllm/pull/23693.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-10-25 15:53:01 +08:00
Icey
d9cdc65854 Upgrade to new vllm commit (#3719)
### What this PR does / why we need it?
Upgrade to new vllm commit:
c9461e05a4

- Fix many imports, caused by
https://github.com/vllm-project/vllm/pull/26908
- Fix import `sha256`, caused by
https://github.com/vllm-project/vllm/pull/27169
- Remove `SchedulerConfig.send_delta_data`, caused by
https://github.com/vllm-project/vllm/pull/27142
- Fix `FusedMoE` because of dual stream execution, caused by
https://github.com/vllm-project/vllm/pull/26440

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-10-25 15:36:32 +08:00
fems14
226f832c0b [bugfixfix] correct _register function place for mooncacke (#3747)
Correct the placement of the _register function for mooncake.

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-25 14:20:09 +08:00
HuaJiaHeng
11f75883be [Test] add test for prefix cache feature of deepseek (#3733)
### What this PR does / why we need it?
This PR adds a prefix cache case to the nightly test for
DeepSeek-r1-0528-W8A8 on A3; we need to test it daily.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>
2025-10-25 14:08:15 +08:00
Yizhou
1f25d60870 [Fix] Cap max tokens to prevent potential OOM (#3720)
### What this PR does / why we need it?
Caps the calculated maximum number of tokens at 512.

This prevents allocating an excessively large buffer when a cudagraph
capture size is not specified, mitigating the risk of out-of-memory
errors.
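
The cap amounts to a simple clamp. A minimal sketch follows; the function and constant names are hypothetical, and only the value 512 comes from this commit message.

```python
MAX_TOKENS_CAP = 512  # the cap named in this commit message


def capped_max_tokens(calculated_max_tokens: int) -> int:
    """Clamp a calculated token count so a buffer sized from it stays bounded."""
    return min(calculated_max_tokens, MAX_TOKENS_CAP)


# A large calculated value is reduced to the cap; a small one passes through.
print(capped_max_tokens(8192), capped_max_tokens(256))
```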

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 11:23:21 +08:00
weichen
63c363d3de [Refactor] [MoE] Rename moe-related classes & files (#3646)
### What this PR does / why we need it?
1. Rename common_fused_moe.py to fused_moe.py.
2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize
to prepare_finalize.py / PrepareAndFinalize.
3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe.
4. Move vllm_ascend/ops/fused_moe.py to
vllm_ascend/ops/fused_moe/fused_moe.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-25 11:22:03 +08:00
zhangxinyuehfad
0637e8f021 [Doc] Update supported models (#3481)
### What this PR does / why we need it?
Update supported models

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-10-25 11:13:46 +08:00
zhangxinyuehfad
8f6f967028 [Test] Add e2e test and accuracy test for Qwen3-Next-80B-A3B-Instruct (#3450)
### What this PR does / why we need it?

Add e2e test and accuracy test for Qwen3-Next-80B-A3B-Instruct

### How was this patch tested?
accuracy test:
https://github.com/vllm-project/vllm-ascend/actions/runs/18771221544/job/53556027634?pr=3450
ci test:
https://github.com/vllm-project/vllm-ascend/actions/runs/18771221530/job/53556027614?pr=3450
<img width="1703" height="562" alt="image"
src="https://github.com/user-attachments/assets/973b6cfa-8240-41e3-893a-5024ff8d0693"
/>



- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-10-25 10:57:56 +08:00
whx
d5609e2c48 [BugFix] Comment out newly added vlm e2e. (#3736)
This PR comments out the newly added vlm e2e test of the ascend scheduler
scenario because it hangs when running with multiple batches. It needs to
be added back once this issue is resolved.
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-25 10:34:59 +08:00
lio
9e150e5009 [Refactor] optimize _prepare_inputs method in eagle_proposer (#3296)
### What this PR does / why we need it?

We optimized the _prepare_input method in eagle_proposer and no longer
use the _prepare_eagle_input_sequential method, improving the
performance of eagle-3.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```
python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 13963 \
  --dtype bfloat16 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --trust-remote-code \
  --seed 42 \
  --no-enable-prefix-caching \
  --speculative_config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":2,"draft_tensor_parallel_size":1}'
```

Co-authored-by: QilaiZhang (245706640@qq.com )


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: lio <1983142975@qq.com>
2025-10-25 09:49:42 +08:00
QilaiZhang
d30bb95b90 [Bugfix] Fix zero attention output in qwen3-next (#3572)
### What this PR does / why we need it?
Since Attention and LinearAttention share the same `slot_mapping`,
and the `slot_mapping` for LinearAttention is all zeros, the
`slot_mapping` for Attention gets overwritten, resulting in the
computed output being all zeros.

This PR removes the uniformly managed `self.slot_mapping` and
directly passes the `slot_mapping` from `input_batch.blocktable`
to `attn_metadata`, along with modifying the relevant references.
Due to hardware constraints, the data type of `block_table.slot_mapping`
needs to be set to int32.
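
The aliasing problem described above can be sketched in a few lines. This is a simplified illustration using Python lists rather than tensors; `write_slots` and all variable names are hypothetical and do not come from the PR.

```python
def write_slots(buffer, values):
    """Write slot indices into an existing buffer in place and return it."""
    buffer[:] = values
    return buffer


# Buggy arrangement: both layer types write into one shared buffer.
shared_slot_mapping = [0, 0, 0, 0]
attention_slots = write_slots(shared_slot_mapping, [5, 6, 7, 8])
linear_slots = write_slots(shared_slot_mapping, [0, 0, 0, 0])
# `attention_slots` aliases the shared buffer, so the second write clobbered it.
shared_result = list(attention_slots)

# Fixed arrangement: each consumer gets its own mapping.
own_attention_slots = write_slots([0, 0, 0, 0], [5, 6, 7, 8])
own_linear_slots = write_slots([0, 0, 0, 0], [0, 0, 0, 0])

print(shared_result, own_attention_slots)
```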

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: QilaiZhang <245706640@qq.com>
2025-10-25 09:47:03 +08:00
whx
e33751ef8b [BugFix][Core] Fix a bug running multi-modal with ascend_scheduler (#3675)
This PR fixes a bug in running multi-modal models with AscendScheduler.
The bug was introduced by PR #2372, which used the same parameter names
as vLLM but with different default values.

The fix changes the default values of these two parameters to align
with vLLM.

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
2025-10-25 09:41:33 +08:00
wangxiyuan
1a9feb3ba5 Update version doc (#3599)
1. Add v0.11.0-dev branch info
2. mark rfc/long_seq_optimization branch as completed
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-25 09:37:56 +08:00
wangxiyuan
07c8d4547c [CI] Skip ops test for e2e (#3665)
### What this PR does / why we need it?
Skip the ops test for e2e; it will be moved to the nightly test in a
follow-up PR.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-25 09:37:30 +08:00
wangxiyuan
6922947033 [Misc] Limit ray version (#3660)
We noticed that with ray>2.48.0, the npu card count reported by ray is
not correct. This is a known bug. Let's limit the ray version to <2.48.0
for now.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-25 09:36:44 +08:00
Canlin Guo
8295136575 [UT][fix] Add missing get_ascend_config mock to NPUWorker initialization tests (#3729)
### What this PR does / why we need it?

Enable the unit tests that #3612 skipped.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Unit tests.

- vLLM main:
17c540a993

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-10-25 09:33:16 +08:00
Li Wang
7f73c28a24 [CI][Doc] Optimize multi-node CI (#3565)
### What this PR does / why we need it?
This pull request mainly does the following:
1. Add a doc for multi-node CI, covering how the mechanism works and
how to contribute
2. Simplify the config yaml to make it more developer-friendly
3. Optimize the mooncake installation script to prevent accidental
failures during installation
4. Fix the workflow so that the kubernetes apply step works correctly
5. Add Qwen3-235B-W8A8 disaggregated_prefill test
6. Add GLM-4.5 multi dp test
7. Add 2p1d 4nodes disaggregated_prefill test
8. Refactor nightly tests
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-25 09:23:47 +08:00
hucong
292cf339c3 [BugFix][P/D] Modify the recalculation logic to prevent waiting requests from filling up the D node KVCache (#3641)
### What this PR does / why we need it?
Modify the recalculation logic to prevent waiting requests from filling
up the D node KVCache

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

Signed-off-by: underfituu <hzhucong@163.com>
2025-10-25 09:14:20 +08:00
shaopeng-666
39b994a987 [Feat] Add mrope fusion op (#3708)
### What this PR does / why we need it?
Add an mrope fusion op for qwen2.5-vl. This mrope operator doesn't
currently support Qwen3-VL, so it only takes effect for qwen2.5-vl.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
2025-10-25 09:12:18 +08:00
Yizhou
3158742a97 [Refactor] Refactor Ascend attention implementation forward (#3714)
### What this PR does / why we need it?
This PR refactors the Ascend attention implementation to align with
vLLM's core interfaces, simplifying the code and improving
maintainability.

### Key Changes:

* **Align with vLLM's Attention Interface**: The `forward` method
signature in `AscendAttentionBackendImpl` now matches the base
`AttentionImpl` in vLLM, removing the custom `trace_flag`.

* **Enable Opaque Attention Operator**: By adding `opaque_attention_op`
to `AscendPlatform`, we allow vLLM to wrap our attention kernel in its
standard `vllm.unified_attention_with_output` operator. This avoids the
need for a custom call path.

*   **Remove Obsolete Code**:
    * The custom op `vllm.unified_ascend_attention_with_output` has been
      deleted as it is now redundant.
    * The `trace_flag` and its associated logic were removed, reducing code
      complexity.
    * An outdated quantization branch within the attention implementation
      was cleaned up.

* **Improve Readability**: Renamed output variables (`output` vs.
`intermediate_output`) and added comments to clarify the in-place nature
of the attention output.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No extra tests needed.

- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 08:58:35 +08:00
ZYang6263
0b1da24742 [Main][Perf] Add fused matmul/reduce-scatter kernel for performance optimization. (#3693)
### What this PR does / why we need it?
This PR boosts performance by introducing a fused kernel for the matmul
and reduce-scatter operations. It supports both unquantized
(e.g., BFloat16) and W8A8 quantized models.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-24 18:19:58 +08:00
fems14
82a4970fe9 look up multi_tp key (#3699)
### What this PR does / why we need it?
In multi-Tensor Parallel (TP) scenarios, the KV pool only queries the
first GPU card. When keys on other cards are released, the query result
still returns as successful, introducing accuracy issues. This PR
modifies the KV pool's query logic to check all cards, resolving this
problem.
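
The difference between the two query strategies can be sketched as follows. This is a simplified model, not the KV pool's API: each card's stored keys are represented as a plain set, and both function names are hypothetical.

```python
def key_exists_on_first_rank(key, per_rank_keys):
    """Old behaviour as described above: only the first card is consulted."""
    return key in per_rank_keys[0]


def key_exists_on_all_ranks(key, per_rank_keys):
    """New behaviour as described above: a hit requires the key on every card."""
    return all(key in keys for keys in per_rank_keys)


# Key "b" was released on the second card but is still present on the first.
per_rank_keys = [{"a", "b"}, {"a"}]

print(key_exists_on_first_rank("b", per_rank_keys))  # stale hit
print(key_exists_on_all_ranks("b", per_rank_keys))   # correctly a miss
```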
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-24 17:23:36 +08:00
fems14
c83efcb9e4 kvpool sync load (#3698)
### What this PR does / why we need it?
In certain scenarios, the performance of synchronously loading data from
the pool is better than that of asynchronously loading data. Therefore,
a control logic (or switch) for asynchronous loading from the pool has
been added.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-24 17:22:53 +08:00
何必问
59bb16b75c [Bugfix] The server fails to locate the request, leading to the server hanging. (#3703)
### What this PR does / why we need it?
Fix bug: in the mooncake pooling scenario, when the client closes the
request, the server fails to locate the request, leading to the server
hanging.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Start the PD-disaggregated pooling service, send requests using aisbench,
press CTRL+C twice, and check that the vllm_ascend service exits.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linhebiwen <linhebiwen@gmail.com>
2025-10-24 17:18:03 +08:00
wangyu
d301c56d1a [TEST]Add initial multi modal cases of Qwen2.5-VL-32B-Instruct for nightly test (#3707)
### What this PR does / why we need it?
This PR adds the initial multi-modal model to the nightly test, including
2 cases for the Qwen2.5-vl-32b acc/perf test on A3; we need to test them
daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
2025-10-24 17:12:06 +08:00
offline893
9b0baa1182 [BugFix] Check all expert maps when using muilty instance. (#3576)
### What this PR does / why we need it?
Check all expert maps when using multiple instances.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Qwen 235B in double A3.
- Case 1: master has an expert map, slave has no expert map.
- Case 2: master has an expert map, slave has an incorrect expert map.
- Case 3: master has an expert map, slave has a correct expert map.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-24 17:10:14 +08:00
Mengqing Cao
cea0755b07 [1/N][Refactor] Refactor code to adapt with vllm main (#3612)
### What this PR does / why we need it?
This is step 1 of refactoring the code to adapt to vllm main; this PR
is aligned with
17c540a993

1. Refactor deepseek to the latest code arch as of
17c540a993

2. A number of fixes due to vllm changes
- Fix `AscendScheduler` `__post_init__`, caused by
https://github.com/vllm-project/vllm/pull/25075
- Fix `AscendScheduler` init got an unexpected arg `block_size`, caused
by https://github.com/vllm-project/vllm/pull/26296
- Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by
https://github.com/vllm-project/vllm/pull/23485
- Fix `MLAAttention` import, caused by
https://github.com/vllm-project/vllm/pull/25103
- Fix `SharedFusedMoE` import, caused by
https://github.com/vllm-project/vllm/pull/26145
- Fix `LazyLoader` import, caused by
https://github.com/vllm-project/vllm/pull/27022
- Fix `vllm.utils.swap_dict_values` import, caused by
https://github.com/vllm-project/vllm/pull/26990
- Fix `Backend` enum import, caused by
https://github.com/vllm-project/vllm/pull/25893
- Fix `CompilationLevel` renaming to `CompilationMode` issue introduced
by https://github.com/vllm-project/vllm/pull/26355
- Fix fused_moe ops, caused by
https://github.com/vllm-project/vllm/pull/24097
- Fix bert model because of `inputs_embeds`, caused by
https://github.com/vllm-project/vllm/pull/25922
- Fix MRope because of `get_input_positions_tensor` to
`get_mrope_input_positions`, caused by
https://github.com/vllm-project/vllm/pull/24172
- Fix `splitting_ops` changes introduced by
https://github.com/vllm-project/vllm/pull/25845
- Fix multi-modality changes introduced by
https://github.com/vllm-project/vllm/issues/16229
- Fix lora bias dropping issue introduced by
https://github.com/vllm-project/vllm/pull/25807
- Fix structured output break introduced by
https://github.com/vllm-project/vllm/issues/26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-10-24 16:55:08 +08:00
jiangyunfan1
ec9ec78b53 [TEST]Add initial prefix cache case for nightly test (#3709)
### What this PR does / why we need it?
This PR adds the initial prefix cache case to the nightly test for
Qwen3-32b-int8 on A3; we need to test it daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-24 16:33:18 +08:00
zzzzwwjj
6be321b95e remove useless code (#3685)
### What this PR does / why we need it?
`vanilla_chunked_prefill_mla` and `vanilla_decode_mla` are unused, so
remove them.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-10-24 16:29:08 +08:00
lio
cd58a643c5 [UT] Fix test_sample_recovered_tokens_pytorch_autoregressive (#3434)
### What this PR does / why we need it?

There is an error in the 'test_rejection_sampler' unit test.

> def test_sample_recovered_tokens_pytorch_autoregressive(self):
>       output_token_ids = torch.empty(2, dtype=torch.int32)
>       cu_num_draft_tokens = torch.tensor([1, 1])
>       draft_token_ids = torch.tensor([0, 1])

Since len(draft_token_ids) = 2, cu_num_draft_tokens should be
torch.tensor([1, 2]) or torch.tensor([2, 2]).

I fixed it by setting cu_num_draft_tokens = torch.tensor([1, 2]). The
test passes with the methods both before and after optimization.
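
A minimal sketch of the consistency rule behind this fix, assuming (as the `cu_` prefix and the values above suggest) that `cu_num_draft_tokens` holds a running total of draft tokens per request. It uses plain lists instead of tensors, and the helper name is hypothetical.

```python
from itertools import accumulate


def cumulative_draft_counts(num_draft_tokens_per_request):
    """Running total of draft tokens, one entry per request."""
    return list(accumulate(num_draft_tokens_per_request))


draft_token_ids = [0, 1]

# Two requests with one draft token each, or one request holding both tokens.
one_each = cumulative_draft_counts([1, 1])
all_in_first = cumulative_draft_counts([2, 0])

# The last cumulative entry must equal the total number of draft tokens.
print(one_each[-1] == len(draft_token_ids), all_in_first[-1] == len(draft_token_ids))

# The old value [1, 1] ends at 1, which does not cover both draft tokens.
old_value = [1, 1]
print(old_value[-1] == len(draft_token_ids))
```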

### Does this PR introduce _any_ user-facing change?
No 
### How was this patch tested?
NA

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: lio <1983142975@qq.com>
2025-10-24 11:20:57 +08:00
Li Wang
802c574532 [Benchmark] Upgrade benchmark args for new vllm version (#3218)
### What this PR does / why we need it?
Since the newest vllm commit has deprecated the arg `--endpoint-type`,
we should use `--backend` instead
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
test it locally:
```shell
export VLLM_USE_MODELSCOPE=true
export DATASET_PATH=/root/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json

vllm serve Qwen/Qwen2.5-7B-Instruct --load-format dummy

wget -O ${DATASET_PATH} https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

vllm bench serve --model Qwen/Qwen2.5-7B-Instruct --backend vllm --dataset-name sharegpt --dataset-path ${DATASET_PATH}  --num-prompt 200
```
and the result looks good:
```shell
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  20.36
Total input tokens:                      43560
Total generated tokens:                  44697
Request throughput (req/s):              9.82
Output token throughput (tok/s):         2194.88
Peak output token throughput (tok/s):    4676.00
Peak concurrent requests:                200.00
Total Token throughput (tok/s):          4333.93
---------------Time to First Token----------------
Mean TTFT (ms):                          2143.85
Median TTFT (ms):                        2486.17
P99 TTFT (ms):                           2530.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.50
Median TPOT (ms):                        30.75
P99 TPOT (ms):                           309.22
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.15
Median ITL (ms):                         25.42
P99 ITL (ms):                            38.30
==================================================
```
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-24 11:18:19 +08:00