Commit Graph

1240 Commits

Author SHA1 Message Date
offline893
726bc8aa2a [CI]fix test nightly workflow. (#3604)
Add the nightly test back; it was deleted by mistake.

Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-22 10:34:03 +08:00
offline893
e916265b2b [CI]Add EPLB CI. (#3568)
### What this PR does / why we need it?
1. Add EPLB CI to check changes to the EPLB feature.
2. Add parameter checking for EPLB params.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested with Qwen on A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-21 22:58:02 +08:00
linfeng-yuan
4c9af353ee Revert "[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495)" (#3586)
### What this PR does / why we need it?
This reverts commit
bf87606932.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving with `enable_shared_expert_dp: true` in eager mode as
before.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-21 22:24:30 +08:00
whx
bd11c0054f [BugFix] Fix torchair+mtp bug after deleting deepseek_mtp. (#3590)
This adds a missing fix for a bug introduced by PR #3561.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-21 22:23:52 +08:00
shaopeng-666
0c83eee9b1 fix vl float model not support NZ format weight error (#3533)
### What this PR does / why we need it?
Fix the error raised because VL float models do not support the NZ mm op.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
2025-10-21 22:23:17 +08:00
Icey
6f04b467de [CI] Upgrade manylinux image (#3587)
### What this PR does / why we need it?
Upgrade manylinux image

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Icey <1790571317@qq.com>
2025-10-21 22:22:45 +08:00
xuyexiong
79821106e6 [BugFix]Fix mtp torchair bug caused by #2719 (#3566)
### What this PR does / why we need it?
Fix the MTP torchair bug caused by #2719.
Since FIA needs extra space for padding, we need to enforce
`self.max_num_seqs > self.scheduler_config.max_num_seqs` in the KV consumer + MTP case.
This means that `self.max_num_seqs` is **strictly greater than** the actual maximum number of requests
(`self.scheduler_config.max_num_seqs`).
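The constraint above can be written down as a small check. This is a minimal sketch, not the code from the PR: `check_padded_batch_capacity` and its two arguments are hypothetical names standing in for `self.max_num_seqs` and `self.scheduler_config.max_num_seqs`.

```python
def check_padded_batch_capacity(max_num_seqs: int,
                                scheduler_max_num_seqs: int) -> int:
    """Return the number of spare slots left for padding.

    Hypothetical helper: it only illustrates the strict inequality
    described in the text and does not call into vLLM.
    """
    if max_num_seqs <= scheduler_max_num_seqs:
        raise ValueError(
            "max_num_seqs must be strictly greater than the scheduler's "
            f"max_num_seqs ({max_num_seqs} <= {scheduler_max_num_seqs})")
    return max_num_seqs - scheduler_max_num_seqs
```

The check is strict on purpose: equal values would leave no room for padding.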

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-21 22:21:44 +08:00
drslark
534f32d27c [BugFix][mian] Fixed a triton kernel bug of layer_norm_fwd_kernel for Qwen3-next (#3549)
### What this PR does / why we need it?
Fixes the triton kernel **layer_norm_fwd_kernel**, as described in
https://github.com/vllm-project/vllm-ascend/issues/3548

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

The environment is the same as in this issue,
https://github.com/vllm-project/vllm-ascend/issues/3548.

Start a vLLM server with:
```shell
vllm serve /home/model/Qwen3-Next-80B-A3B-Instruct   --port 22   --host 0.0.0.0   --served-model-name qwen3_next_mtp_0   --tensor-parallel-size 4   --max-model-len 32000   --gpu-memory-utilization 0.7   --enforce-eager
```

Then, we start an aisbench client like:
```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt --dump-eval-details
```

Whose config is:
```python
    # a big batch_size and a large max_out_len
    dict(
        abbr='vllm-api-general-chat',
        attr='service',
        batch_size=512,
        generation_kwargs=dict(temperature=0.7, top_k=20, top_p=0.8),
        host_ip='xxx.xxx.xxx.xxx',
        host_port=8881,
        max_out_len=30000,
        model='qwen3_next_mtp_0',
        path='',
        pred_postprocessor=dict(
            type=
            'ais_bench.benchmark.utils.model_postprocessors.extract_non_reasoning_content'
        ),
        request_rate=0,
        retry=2,
        trust_remote_code=False,
        type='ais_bench.benchmark.models.VLLMCustomAPIChat'),
```

**Results:**

```text
(APIServer pid=615544) INFO 10-21 01:44:05 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.1 tokens/s, Running: 7 reqs, Waiting: 1 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:15 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.1 tokens/s, Running: 7 reqs, Waiting: 1 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:25 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.4 tokens/s, Running: 7 reqs, Waiting: 1 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.6 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 86.1%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:45 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 59.8 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 88.2%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:55 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.2 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 88.2%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:45:05 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.8 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 88.2%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:45:15 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.4 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 90.8%, Prefix cache hit rate: 0.0%
```

We can see that when we send a bunch of requests and the **KV cache usage
reaches 100.0%**, we no longer get a
**coreDim=xxx can't be greater than UINT16_MAX.** exception.

```text
(APIServer pid=615544) INFO 10-21 02:17:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 3 reqs, Waiting: 5 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 02:17:45 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.3 tokens/s, Running: 3 reqs, Waiting: 5 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 02:17:55 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 3 reqs, Waiting: 5 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 02:18:05 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.9 tokens/s, Running: 3 reqs, Waiting: 5 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 02:18:15 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.7 tokens/s, Running: 2 reqs, Waiting: 6 reqs, GPU KV cache usage: 81.9%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO:     141.61.39.105:48568 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=615544) INFO:     141.61.39.105:48580 - "POST /v1/chat/completions HTTP/1.1" 200 OK

After a few minutes, these two requests are done.

```text
(APIServer pid=615544) INFO 10-21 03:18:25 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:18:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:18:45 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:18:55 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:19:05 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:19:15 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.2%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO:     141.61.39.105:48712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=615544) INFO 10-21 03:19:25 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:19:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
Finally, all requests are done.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: drslark <slarksblood@qq.com>
2025-10-21 20:20:57 +08:00
wangxiyuan
13e8e75143 [Refactor] refactor patch module (#3555)
### What this PR does / why we need it?
We noticed that `patch_main` is never used. Usually a patch applies to all
versions, and if it targets a specific version, we can use `vllm_version_is`
instead. So let's remove the useless subfolder in the patch module to make
it clearer.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-21 20:19:46 +08:00
Jade Zheng
0c6349610e [Feature] Reduce host memory usage for attention mask generation (#3048)
### What this PR does / why we need it?

Previously, the mask construction process created multiple tensors of
size (max_model_len, max_model_len). When max_model_len reached 128k,
single GPU host memory usage exceeded hundreds of GB, causing process
OOM crashes. This update optimizes the mask generation to significantly
reduce memory consumption.
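To make the scale concrete, here is a back-of-the-envelope calculation for a single square mask. It is only a sketch: it assumes 128k means 128 * 1024 tokens, and the element sizes are illustrative, since the description does not state the mask dtype.

```python
def square_mask_bytes(max_model_len: int, bytes_per_element: int) -> int:
    """Bytes needed for one dense (max_model_len, max_model_len) tensor."""
    return max_model_len * max_model_len * bytes_per_element


GIB = 1024 ** 3
max_model_len = 128 * 1024  # assumption: "128k" read as 131072 tokens

# Element sizes are illustrative; the actual mask dtype is not given.
for name, size in (("bool/int8", 1), ("float16", 2), ("float32", 4)):
    gib = square_mask_bytes(max_model_len, size) / GIB
    print(f"{name}: {gib:.0f} GiB per mask")
```

At 16 to 64 GiB per tensor, holding a few of them at once is enough to reach the hundreds of GB mentioned above.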

### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?

CI pass.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-10-21 20:19:04 +08:00
Anion
5f8b1699ae [Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers (#3311)
### What this PR does / why we need it?
**Problem Description:**

The existing implementation for the w4a8-dynamic linear method only
supports the old quantization format from msmodelslim. When attempting
to load models quantized with the new version, vLLM encounters errors
due to mismatched tensor shapes and unprocessed quantization parameters.

Relevant issues:
- https://github.com/vllm-project/vllm-ascend/issues/3192
- https://github.com/vllm-project/vllm-ascend/issues/3152

**Proposed Changes:**
1. Add support for w4a8 dynamic (new format) in
AscendW4A8DynamicLinearMethod and TorchairAscendW4A8DynamicLinearMethod
2. Add unit tests and e2e tests for w4a8 dynamic new and old format
models
<details>
<summary><b>details</b></summary>

1.  **Support for new w4a8-dynamic format:**
* Detects quantization format by reading the "version" field in
quant_description to ensure backward compatibility.
* Handles the new pre-packed weight format (`2x int4` in an `int8`),
which has a halved dimension. It tells the vLLM loader how to unpack it
using `_packed_dim` and `_packed_factor`.
* Supports the new `scale_bias` parameter, setting its shape based on
the layer type, as required by msmodelslim. For api consistency and
future use, the `layer_type` parameter was also added to other
quantization methods.
* Updates the weight processing logic: new format weights are handled
with `.view(torch.int32)` since they're pre-packed, while old ones are
processed with `npu_convert_weight_to_int4pack`.

2.  **New unit and E2E tests:**
* Added unit tests that verify the logic for both the old and new
formats.
* Split the distributed E2E test to confirm that both old and new format
models work correctly.

</details>
Theoretically, these changes will provide support for all common
new-version w4a8 (dynamic) models from msmodelslim.
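The "2x int4 in an int8" packing mentioned in the details can be illustrated in pure Python. This is a sketch of the general idea only: the nibble order (even index in the low nibble) is an assumption, the function names are hypothetical, and bytes are kept as unsigned values 0..255 for simplicity. It is not the msmodelslim or vLLM implementation.

```python
def pack_int4_pairs(values: list[int]) -> list[int]:
    """Pack signed int4 values (-8..7) pairwise into single bytes.

    Assumption: the even-indexed value goes to the low nibble and the
    odd-indexed value to the high nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("need an even number of int4 values")
    packed = []
    for low, high in zip(values[0::2], values[1::2]):
        if not (-8 <= low <= 7 and -8 <= high <= 7):
            raise ValueError("int4 values must be in [-8, 7]")
        # Mask to 4 bits (two's complement) and combine into one byte.
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return packed


def unpack_int4_pairs(packed: list[int]) -> list[int]:
    """Inverse of pack_int4_pairs: recover the signed int4 values."""

    def to_signed(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble

    values = []
    for byte in packed:
        values.append(to_signed(byte & 0xF))
        values.append(to_signed((byte >> 4) & 0xF))
    return values
```

Packing four int4 values yields two bytes, which is why the packed dimension is half the logical one.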

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
I implemented relevant unit tests and e2e tests and tested the changes with
the following commands:
```bash
# unit tests
python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v

# e2e tests
pytest tests/e2e/singlecard/test_quantization.py -v -s

pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s

```

I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic
format:
```
vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384 
```

All tests mentioned passed locally.

**NOTE: I use a quantized model from my own repo in
test_offline_inference_distributed.py**. Here is the description:
[Anionex/Qwen3-1.7B-W4A8-V1](https://modelscope.cn/models/Anionex/Qwen3-1.7B-W4A8-V1/summary)
(including quantization steps). This should be replaced by a model in
the vllm-ascend CI ModelScope repo.

Thanks for reading!


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Anionex <1005128408@qq.com>
2025-10-21 20:18:39 +08:00
Chao Lei
11f9bccf6b Mooncake store use adxl inferface (#3350)
Use the adxl interface in the Mooncake store; see Mooncake PR
https://github.com/kvcache-ai/Mooncake/pull/929

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-10-21 20:18:17 +08:00
Yizhou
ef3fabf399 [Chore] Prevents use of ASCEND_LAUNCH_BLOCKING with ACL Graph (#3574)
### What this PR does / why we need it?
Adds a validation check to prevent running with an incompatible
configuration.

The `ASCEND_LAUNCH_BLOCKING=1` environment variable, used for debugging,
enforces synchronous execution which is incompatible with ACL Graph.

This change raises an explicit error to inform the user about the
conflict and how to resolve it, preventing a more obscure failure later.
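A validation check of this kind could look roughly like the following. This is a hedged sketch rather than the code added by the PR: the function name, the `acl_graph_enabled` flag, and the error message are hypothetical; only the `ASCEND_LAUNCH_BLOCKING=1` variable comes from the description.

```python
import os


def check_launch_blocking_with_acl_graph(acl_graph_enabled, env=None):
    """Fail early if synchronous launch is requested together with ACL Graph.

    Hypothetical helper: `env` defaults to the process environment and can
    be replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    if acl_graph_enabled and env.get("ASCEND_LAUNCH_BLOCKING") == "1":
        raise RuntimeError(
            "ASCEND_LAUNCH_BLOCKING=1 forces synchronous execution, which "
            "is incompatible with ACL Graph. Unset the variable or disable "
            "ACL Graph.")
```

Raising at startup surfaces the conflict directly instead of letting it appear later as a less obvious failure.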

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-21 20:17:33 +08:00
whx
220df60c61 [Model][2/N] Remove deepseek_mtp modeling. (#3561)
This PR is step 2 of deepseek model refactoring and removes
deepseek_mtp.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-21 20:17:09 +08:00
Zhu Yi Lin
ffb42a8daa [BugFix] Fixed the bug that caused the transposematmul operator to report an error due to the shape being too large (#3578)
### What this PR does / why we need it?
npu_transpose_batchmatmul reports an error when the shape is too large.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: GDzhu1 <809721801@qq.com>
2025-10-21 20:16:54 +08:00
liziyu
3164cb663c [Bugfix] mooncake connector support external dp & update readme (#3579)
### What this PR does / why we need it?

Make the mooncake connector support external DP and update the README.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2025-10-21 20:15:24 +08:00
Chen Chen
6b290acfe1 remove redundant params in mla_preprocess kernel (#3530)
### What this PR does / why we need it?

This pull request removes the redundant parameters `gamma1` and `beta1`
(also named `gamma0`/`beta0` in some places) from the `mla_preprocess`
kernel and its calling hierarchy. The changes are consistent across C++
kernel code, bindings, and Python call sites. The parameters were unused
in the lower-level functions, so their removal is a good cleanup.

### Does this PR introduce _any_ user-facing change?

The python interface of the kernel is affected, and the params of
`gamma0` and `beta0` are not needed.

### How was this patch tested?

The unit-test of the kernel is adapted accordingly.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: mojave2 <chenchen145@huawei.com>
2025-10-21 19:20:13 +08:00
jiangyunfan1
80b8df881f [TEST] Add Qwen3-32b-w8a8 acc/perf A2/A3 test (#3541)
### What this PR does / why we need it?
This PR adds 8 Qwen3-32b-w8a8 acc/perf cases on A2 and A3; we need to test
them daily.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running the test


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-10-21 17:34:48 +08:00
Yizhou
ec1d2b5c04 [Test] Temporarily skip flaky ACL graph test (#3577)
### What this PR does / why we need it?
Disables `FULL_DECODE_ONLY` end-to-end test that fails intermittently.

This prevents CI blockages while the root cause of the flakiness is
investigated.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-21 17:16:15 +08:00
Li Wang
9830f85c42 [CI] Fix test_mla_v1 (#3570)
### What this PR does / why we need it?
Remove test cases containing CPU incompatible operators
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-21 10:31:55 +08:00
Zhu Yi Lin
4a849df6fa [main] support cpu binding (#3546)
### What this PR does / why we need it?

Currently, in piecewise aclgraph mode, attention runs in eager mode,
which causes abnormal allreduce latency for the O matrix.
The reason is that CPU resources are preempted in eager mode. So this PR
temporarily adds CPU binding to vllm-ascend.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI passed with new added/existing test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: GDzhu1 <809721801@qq.com>
2025-10-21 09:17:03 +08:00
Yizhou
274b708e0c [Fix] Refactor dummy attention metadata creation (#3497)
### What this PR does / why we need it?
The `force_attention` parameter is designed for FlashInfer kernel
warmup; we don't actually need it on Ascend devices (at least for
now), and it tends to make things more complicated. So we replace the
`force_attention` parameter with `aclgraph_runtime_mode` in the
attention metadata creation logic.

This change makes the control flow more explicit by directly using the
graph runtime mode to determine how to build attention metadata, rather
than relying on an intermediate boolean flag. This simplification
removes redundant logic and clarifies the conditions for building
attention metadata for full decode graph mode.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
DP + `FULL_DECODE_ONLY` + online serving.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-21 00:00:42 +08:00
likeful
6b6857929d [Doc] Add --shm-size option to Docker command for qwen3 vl 235B (#3519)
### What this PR does / why we need it?
Added the shared memory size option to the Docker run command. If `--shm-size`
is not specified, Docker uses 64MB by default. In this case, the
vllm:EngineCore process may core dump if the workload is high.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Done

Closes: https://github.com/vllm-project/vllm-ascend/issues/3513

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: likeful <irayki@gmail.com>
Signed-off-by: leijie2015 <irayki@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-20 23:37:35 +08:00
wangxiyuan
0bf3f21a98 Revert "Add mrope op fusion (#3509)" (#3562)
This reverts commit 646c1db5d7.

This new op may lead to accuracy problems.

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
2025-10-20 20:19:24 +08:00
linfeng-yuan
068ed706c8 [feat][torchair] support super kernel feat for quantized dsr1 (#3485)
### What this PR does / why we need it?
Port #1916 and #2157 to master branch to fuse operators in deepseek moe
layers, which can reduce scheduling overhead on devices. Note that this
feature is valid only when `tp_size = 1` and
`multistream_overlap_shared_expert` is enabled with torchair graph mode.

### Does this PR introduce _any_ user-facing change?
Users can enable this feature with `--additional-config
'{"torchair_graph_config":{"enabled":true, "enable_super_kernel":true},
"multistream_overlap_shared_expert":true}'`.
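Hand-writing nested JSON on a command line is error-prone, so the value for `--additional-config` can be generated instead. The sketch below uses only the keys quoted above; the helper name `build_additional_config_arg` is hypothetical.

```python
import json

# Keys and values taken from the option string quoted in the description.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,
        "enable_super_kernel": True,
    },
    "multistream_overlap_shared_expert": True,
}


def build_additional_config_arg(config: dict) -> str:
    """Serialize the config to the JSON string passed on the command line."""
    return json.dumps(config)


print("--additional-config", repr(build_additional_config_arg(additional_config)))
```

`json.dumps` emits lowercase `true`, matching the JSON literals in the original command.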

### How was this patch tested?
E2E deepseek serving with 2P1D disaggregated prefill scenarios.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-20 20:04:37 +08:00
lilinsiman
70bef33f13 add new accuracy test case for aclgraph (#3390)
### What this PR does / why we need it?
Add new accuracy test case Deepseek-V2-Lite-W8A8 for aclgraph

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-20 20:04:04 +08:00
ZYang6263
b9e2896eb1 Revert "[Perf] Add FIA interface in FA case" (#3553)
Reverts vllm-project/vllm-ascend#3321
Reason: output dimension mismatch and accuracy issues.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-20 19:56:10 +08:00
Zhu Yi Lin
34c2996ab8 [main] v_proj combining transpose and matmul (#3545)
### What this PR does / why we need it?

v_proj combining transpose and matmul

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

CI passed with new added/existing test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: GDzhu1 <809721801@qq.com>
2025-10-20 19:53:32 +08:00
Jade Zheng
e04a5e3dd3 [Bugfix] Fix race condition in d2h transfer (#3372)
### What this PR does / why we need it?

Using non-blocking operations for device-to-host transfers can lead to
data corruption in later steps. The CPU tensor is accessed right after
the transfer is triggered, but the transfer might not be complete yet.
As a result, the data could be wrong. This problem was seen in the A3
environment during `profile_run`.

### How was this patch tested?
CI pass.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-10-20 18:24:21 +08:00
zhangxinyuehfad
fdac146f71 [UT] fix skip ut test and enable ut test run normally (#3410)
### What this PR does / why we need it?

Fix skipped unit tests and enable unit tests to run normally.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-10-20 16:30:57 +08:00
whx
f8b52fe950 [Model][1/N] Delete deepseek v2/v3 modeling codes. (#3189)
This PR deletes model codes of deepseek_v2 and deepseek_v3 to reuse the
model file from vLLM.

vLLM Ascend now registers custom ops instead of hard-coding model
files.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-20 15:31:34 +08:00
Mengqing Cao
918ded9155 [BugFix][HybridKV] Update the check logic of reinitializing inputbatch (#3540)
### What this PR does / why we need it?
Update the check logic for reinitializing the input batch. This is a follow-up
PR to #3477. `kernel_block_sizes` is a `list[list[int]]`, and the
original logic always updated `InputBatch` when using hybrid blocks.
This PR fixes that.
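One way such a check can go wrong is comparing a nested list against a flat one, which is never equal in Python. The sketch below only illustrates that pitfall: both function names and the comparison itself are hypothetical and are not taken from the actual fix.

```python
def needs_reinit_flat(kernel_block_sizes: list[list[int]],
                      block_size: int) -> bool:
    # Pitfall: a list of lists never equals a flat list of ints,
    # so this returns True even when nothing has changed.
    return kernel_block_sizes != [block_size]


def needs_reinit_nested(kernel_block_sizes: list[list[int]],
                        block_size: int) -> bool:
    # Compare against a value of the same nested shape instead.
    return kernel_block_sizes != [[block_size]]
```

With `kernel_block_sizes = [[128]]` and `block_size = 128`, the flat comparison still reports that a reinitialization is needed, while the nested one does not.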

### How was this patch tested?
Tested locally with qwen3-next.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-20 15:29:48 +08:00
Mengqing Cao
daa4dd0a57 [DeepSeek] Seperate deepseek v3.2 modeling form deepseek v2 (#3531)
### What this PR does / why we need it?
Separate deepseek v3.2 modeling from deepseek v2.

### How was this patch tested?
- CI passed with existing test.
- test deepseek v3.2 locally

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-20 09:50:44 +08:00
Mengqing Cao
6c65dd891f [ModelRunner][Qwen3-Next] Fix attn_group initialization timing (#3477)
### What this PR does / why we need it?
Fix the attn_group initialization timing to fix the qwen3-next model.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-20 09:39:40 +08:00
jiangyunfan1
9e59fc1510 [TEST] Add initial aisbench support and Qwen3 32B acc/perf test (#3474)
### What this PR does / why we need it?
This PR adds the first aisbench case for the nightly test. It lays the
foundation for subsequent performance and accuracy tests in the nightly test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-20 09:33:17 +08:00
zouyida2052
58a37ce189 bugfix for mooncake (#3535)
### What this PR does / why we need it?
Bugfix for mooncake: remove a useless condition check.

### How was this patch tested?
by ci

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-19 17:06:05 +08:00
ZYang6263
1e78ecbad6 [Perf] Add FIA interface in FA case (#3321)
### What this PR does / why we need it?

Add the new npu_fused_infer_attention_score op to improve performance in
the flash attention case.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
2025-10-19 12:45:33 +08:00
Wang Kunpeng
4b3bd4f397 [main][bugfix] bugfix for minicpm models (#3527)
### What this PR does / why we need it?
bugfix for minicpm-2b and minicpm3-4b

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-10-19 11:00:55 +08:00
offline893
6c9909c861 [Patch]patch of v1 executor when enable eplb. (#3511)
### What this PR does / why we need it?
When using dynamic EPLB, patch the v1 executor to avoid failures when
creating child processes.

### How was this patch tested?
deepseek in v3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-19 10:54:26 +08:00
shaopeng-666
646c1db5d7 Add mrope op fusion (#3509)
### What this PR does / why we need it?
Add mrope fusion op for qwen2.5-vl. This mrope operator doesn't support
Qwen3-VL currently, so it only takes effect in qwen2.5-vl.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
2025-10-18 18:08:24 +08:00
xuyexiong
0777e2f899 Optimize torchair kv_consumer padding logic (#3526)
### What this PR does / why we need it?
Optimize torchair kv_consumer padding logic. Only pad when speculative
decoding is used.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-18 16:42:17 +08:00
Shirley125
b4233a2ec3 [Bugfix] Route requests requiring KVC recomputation from the decode instance to the P instance (#3448)
### What this PR does / why we need it?
This PR aims to fix the recomputation out-of-memory bug in the decode
instance. When recomputation happens in decode, KV cache usage may exceed
the pre-allocated memory and cause an OOM.

So we propose a new scheduling strategy: when the decode instance cannot
allocate a new block for running requests, we stop the request that
would be preempted. These stopped requests are recognized by the proxy
and sent to the prefill instance again to compute the KV cache, and are
then directed to the decode instance.

This is a temporary plan to fix the bug. The long-term strategy is to
use CPU offload in the decode instance.

### Does this PR introduce _any_ user-facing change?
An extra ascend configuration option **-- recompute_scheduler_enable =
True** is added to enable this strategy. The default value is False
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
2025-10-18 15:56:44 +08:00
yechao237
4750d45d86 [BugFix]Support redundant experts in EPLB (#3473)
This PR adds support for redundant experts in the EPLB. 

Key points: 
- Use global_num_experts = num_experts + num_redundant_experts
consistently.
- Backward compatible when num_redundant_experts=0. 

Tested 
On a 16-rank setup (W8A8) with static EPLB and expert_map_path,
verifying router logits shape and successful requests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: yechao237 <yechao20180411@gmail.com>
2025-10-18 00:09:16 +08:00
Slightwind
07ca1b9b78 [Refactor] Clean up w4a4_flatquant_dynamic implementation (#3440)
Cleans up the initial implementation of `w4a4_flatquant_dynamic` for
better readability and maintainability.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-10-17 23:53:19 +08:00
xuyexiong
21769e8f44 [BUGFIX] Mtp torchair pd fix (#3506)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/2610 and
#3449, fix the MTP torchair PD bug.

In the PD disaggregation scenario, the first token inferred after the
D node receives the KV cache runs in eager mode.

Fixes:
When running MTP in torchair graph mode with prefill-decode (PD)
disaggregation, if all requests processed by the D node are requests
just transmitted from the P node, the torchair graph breaks.

Reason: During PD disaggregation, the P node only transmits the KV
cache and prompt to the D node, not the actual tokens inferred (neither
the main model tokens nor the MTP tokens are transmitted). Therefore,
the D node treats such a request as one without MTP tokens for
inference (seq_len=1).
The community version has no graph mode issue because its
attention has seq_len=1 for each batch during the decode phase.
We hit the issue because our graph mode pads on the assumption of 2
tokens per request. When some requests have seq_len=1 and some have
seq_len=2, padding is done at the end. If all requests received by the
D node have seq_len=1, padding cannot be performed normally under the
attention's FIA operator constraints.

Solution:

The KV consumer uses extra torchair graph padding to avoid breaking the FIA
graph constraints (this is what this PR implements).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-17 21:57:05 +08:00
Angazenn
9547d6f0d9 [Core]Append padding logic for Attention (#3256)
### What this PR does / why we need it?

This PR adds padding logic for seq_lens and block_tables when running
in the full decode scenario. Before this PR, the number of input tokens with
padding might exceed the corresponding seq_lens. For example, when running
in the full decode scenario:

```
input_ids : [1, 3, 0, 0]
seq_lens: [2, 1]
query_start_loc: [0, 1, 2]
```
Here, `input_ids` is padded by 2 tokens while
`seq_lens`/`query_start_loc` are not. The mismatch between `input_ids`
and `seq_lens`/`query_start_loc` might cause bugs. This
PR changes it to:

```
input_ids : [1, 3, 0, 0]
seq_lens: [2, 1, 1, 1]
query_start_loc: [0, 1, 2, 3, 4]
```
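
The padding rule shown in the two listings can be sketched in plain Python. This is an illustrative sketch only, assuming each padded token is treated as its own dummy request with `seq_len` 1; the function name `pad_decode_metadata` is hypothetical, and `block_tables` padding is left out because the description does not show its layout.

```python
from typing import List, Tuple


def pad_decode_metadata(
    seq_lens: List[int],
    query_start_loc: List[int],
    num_padded_tokens: int,
) -> Tuple[List[int], List[int]]:
    """Extend seq_lens and query_start_loc so they cover every padded token.

    Hypothetical helper: each padding token becomes a dummy request
    with seq_len 1 and a single query token.
    """
    num_actual_tokens = query_start_loc[-1]
    num_pad = num_padded_tokens - num_actual_tokens
    if num_pad < 0:
        raise ValueError("num_padded_tokens is smaller than the actual token count")

    padded_seq_lens = list(seq_lens) + [1] * num_pad
    # Each dummy request contributes exactly one token, so the cumulative
    # start offsets keep increasing by one.
    padded_query_start_loc = list(query_start_loc) + [
        num_actual_tokens + i for i in range(1, num_pad + 1)
    ]
    return padded_seq_lens, padded_query_start_loc


# Reproduces the example above: input_ids padded to 4 tokens.
print(pad_decode_metadata([2, 1], [0, 1, 2], 4))
# → ([2, 1, 1, 1], [0, 1, 2, 3, 4])
```

After padding, `len(seq_lens)` equals the number of padded tokens and `query_start_loc[-1]` equals `len(input_ids)`, which is the invariant the mismatch violated.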

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-10-17 21:56:01 +08:00
realliujiaxu
b154a8e22c [Bugfix] fix logging and d2h bug for flash comm1 (#3505)
### What this PR does / why we need it?

Fix 3 bugs in flash comm1 of Allgather
EP (https://github.com/vllm-project/vllm-ascend/pull/3334):
1. Calling `enable_sp()` with the argument `vllm_config` triggers a lot of
warning logs; this PR caches its return value.
2. `num_tokens_after_padding` should be a CPU tensor, as it will be used as
`num_tokens_across_dp_cpu` in `DPMetadata`. Otherwise it causes many d2h
copies when running the model.
3. In PD, the model runner will execute `kv_connector_no_forward`, where
`num_tokens` is None.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-10-17 21:13:41 +08:00
anon189Ty
248ee7fa11 [Feat]Make full graph mode compatible with MTP (#3276)
### What this PR does / why we need it?
Make Full Graph mode work with MTP.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-10-17 20:19:56 +08:00
anon189Ty
46e62efd44 [Feat]mtp aclgraph support (#3244)
### What this PR does / why we need it?
Currently, the MTP model in DeepSeek cannot be captured in ACLGraph. This PR
allows MTP to be captured in ACLGraph mode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-10-17 18:14:49 +08:00
lilinsiman
1b424fb7f1 ACLgraph enable: Test cases revisions for all features (#3388)
### What this PR does / why we need it?
This PR revises the test cases for various features in the repository to
enable aclgraph in them.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-10-17 17:15:19 +08:00