Commit Graph

1559 Commits

Author SHA1 Message Date
Slightwind
aa56a0f4b7 [Bugfix] PCP adaptation for VLLM v0.11.2 modifications (#4604)
To adapt to the vLLM v0.11.2 image, the method for obtaining PCP size
and DCP size has been modified.
___
- vLLM version: v0.11.2

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-12-01 19:20:32 +08:00
wangxiyuan
0d14f635b4 upgrade torch npu version (#4433)
vLLM graph feature now rely on torch >=2.8. To make graph mode work, we
need upgrade torch version as well. For long term support, upgrade torch
to a newer one is good to go as well.

Related vLLM change: https://github.com/vllm-project/vllm/pull/25110


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
2025-12-01 19:01:55 +08:00
fluctlux
f1f6370ed9 [Feature] Integrate Suffix Spec Decoding (#4045)
### What this PR does / why we need it?
This PR integrate suffix decoding (https://arxiv.org/abs/2411.04975)
from vllm (https://github.com/vllm-project/vllm/pull/25784)

#
Suffix Decoding is a dynamic n-gram matching method that:

1. Uses suffix trees to generate speculative tokens quickly using branch
frequency counts.
2. Can keep a history of prior model responses, which tends to work very
well with repetitive agentic use cases.
3. Can be dynamically updated with newly generated tokens, and FIFO
eviction of older requests.
#
### Does this PR introduce _any_ user-facing change?
This feature should be implemented as opt-in and remain seamless for
users who do not require suffix speculative decoding.

For users who wish to enable it, they must first install
arctic-inference:
`pip install arctic-inference
`

After installation, the suffix speculative decoding feature can be
enabled using the following speculative config:
`--speculative_config '{"method": "suffix", "num_speculative_tokens":
5}'
`

### How was this patch tested?
This PR is currently being tested on vLLM
main:83f478bb19
 with PR https://github.com/vllm-project/vllm/pull/25784

In our previous testing, suffix decoding achieved a 13%-30% throughput
improvement over n-gram on the sonnet dataset, tested on vllm-ascend
v0.9.1 with concurrency ranging from 2 to 40.

- vLLM version: v0.11.2

---------

Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
2025-12-01 18:41:42 +08:00
zzzzwwjj
3023e15e23 add _cann_ops_custom gitignore (#4605)
### What this PR does / why we need it?
add _cann_ops_custom dir to .gitignore

- vLLM version: v0.11.2

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-12-01 18:37:32 +08:00
MidnightSun
f4871c6ab9 [Kernel] add triton kernels for sampling (#4550)
### What this PR does / why we need it?
Replace pyorch implement of sampling with triton kernels

### Does this PR introduce _any_ user-facing change?
No


- vLLM version: v0.11.2

---------

Signed-off-by: Lord_of_Ironhill <suiweiyi@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Co-authored-by: Lord_of_Ironhill <suiweiyi@huawei.com>
Co-authored-by: whx-sjtu <2952154980@qq.com>
2025-12-01 17:41:58 +08:00
zzhxxx
2b82320b66 [Bugfix] Fix bug with establishing the flashcomm2 and pp communication domains. (#4458)
### What this PR does / why we need it?
The previous implementation of the flashcomm2 communication domain did
not consider pp(pipeline parallel), which caused problems when enabling
pp and flashcomm2. This PR fixes this issue.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-01 15:56:22 +08:00
dependabot[bot]
8c65009d62 Bump actions/setup-python from 6.0.0 to 6.1.0 (#4591)
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 6.0.0 to 6.1.0.

- vLLM version: v0.11.2

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-01 14:32:08 +08:00
Jade Zheng
51c8f60eb0 [Bugfix] Resolve MTP > 1 issue when lm head tp > 1 (#4254)
### What this PR does / why we need it?

Previously, the dummy run executed compute_logits only once, regardless
of num_speculative_tokens. This caused execute_model to hang on
compute_logits when lm head tensor parallelism exceeded 1. The fix
ensures compute_logits executes correctly during dummy run, matching
num_speculative_tokens.

I set the `non_blocking` argument to False when moving
`exceeds_max_model_len` to the CPU. From what I understand, using
`non_blocking=True` and immediately accessing the tensor on the CPU can
cause accuracy problems. However, this issue doesn't happen when
transferring data to a device. ref:
https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/18

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-01 10:22:36 +08:00
Ting FU
e8e20c0bbf [BugFix] Fix Qwen2.5_Omni vision customized op attr err (#4568)
Qwen2.5_Omni vision tower use AscendRMSNorm, which conatins a property
function. It would be override by set_forward_context(), patch
Qwen2_5OmniThinkerForConditionalGeneration func with customized
_process_image_input() and _process_video_input() to fix it.

### What this PR does / why we need it?

Fix Qwen2.5_Omni model infer image/video issue

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: Ting FU <futing10@huawei.com>
2025-12-01 09:18:55 +08:00
Wang Yixuan
c68ddc11ce [OPS] add bmm_transpose ops (#3990)
### What this PR does / why we need it?
Add a new fusion ops to custom_op, which can cobime the torch.bmm() and
transpsose to achieve better peformance. This ops is used in mla_v1 to
replace the bmm and transpose

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.11.2

---------

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-12-01 09:09:51 +08:00
欧派果奶我还要
bc67696a02 [EPLB][Ops] Integerate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB (#4216)
### What this PR does / why we need it?
Integerate grouped_matmul_swiglu_quant_weight_nz_tensor_list into
dynamic EPLB to support list-type parameters
This PR also modify the logic of loading model in dynamic-eplb scenario.
The operator is based on this pr:
https://github.com/vllm-project/vllm-ascend/pull/3804

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

```
vllm serve /home/weight/DeepSeek-V3.1_w8a8mix_mtp \
    --max_num_seqs 8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 16384 \
    --tensor-parallel-size 8 \
    --data-parallel-size 2 \
    --enable-expert-parallel \
    --served-model-name ds_r1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --no-enable-prefix-caching \
    --port 8999 \
    --quantization "ascend" \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --compilation_config '{"cudagraph_capture_sizes":[1,2,4,8,16,32]}' \
    --additional-config='{"dynamic_eplb":true, "num_iterations_eplb_update":100, "num_wait_worker_iterations":100}'
 
```
input&output: 2k 2k
This PR:
<img width="1318" height="695" alt="fusion"
src="https://github.com/user-attachments/assets/f8657813-0c02-42f4-8396-d99e730f48cd"
/>

Baseline:
<img width="1323" height="690" alt="baseline"
src="https://github.com/user-attachments/assets/e1323a78-af26-4523-820c-e20e5642a38e"
/>


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <845473182@qq.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
2025-11-30 22:52:05 +08:00
Slightwind
18eefc23c3 [feature] Support W8A8 PD-Mix Quantization (#4235)
In PD-separated deployment scenarios:

* MoE layers use dynamic quantization exclusively.
* For the Attention module, Prefill (P) nodes use **dynamic**
quantization, while Decode (D) nodes use **static** quantization.

In PD-mixed deployment scenarios:
* **All components fall back to dynamic quantization**, as it is
difficult to distinguish between Prefill and Decode tokens.
___

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: Slightwind <slightwindsec@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-30 11:57:26 +08:00
Chao Lei
ff7061317f [Bugfix] Fix kvpool precision synchronization (#4574)
### What this PR does / why we need it?
Fix kvpool precision synchronization
Issue https://github.com/vllm-project/vllm-ascend/issues/4412


- vLLM version: v0.11.2

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-11-30 09:39:07 +08:00
weijinqian0
2b3bfe432e [bugfix] Repair the problem of moe model accuracy caused by version upgrade. (#4562)
Repair the problem of moe model accuracy caused by version upgrade.

Reason:
The new version adds the "reduce_output" operation after "forward_impl".

Then we have fully taken over the implementation of the FusedMoe module.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-11-30 06:12:39 +08:00
Mengqing Cao
c84efeae25 [CI] Skip test_ngram_correctness as the oom issue block CI (#4578)
### What this PR does / why we need it?
Skip test_ngram_correctness as the oom issue block CI
related CI failure:
https://github.com/vllm-project/vllm-ascend/actions/runs/19780591780/job/56680823606

### Does this PR introduce _any_ user-facing change?
N/A

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-11-30 01:34:50 +08:00
Mengqing Cao
517fd9272d Revert "drop ascend scheduler" (#4580)
Reverts vllm-project/vllm-ascend#4498
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
2025-11-29 22:20:48 +08:00
DreamerLeader
4dbe4fd123 [feature]Pooling Features and PCP Adaptation (#4143)
This PR let pooling kv connector support pcp feature

- vLLM version: v0.11.2

---------

Signed-off-by: fjw <2270923832@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
2025-11-29 22:07:45 +08:00
wangxiyuan
1eb5295a1b remove qwen3-next model file (#4573)
Let's remove qwen3-next model filecurrently. We'll support it later by
using vLLM origin model file

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 18:37:26 +08:00
Nengjun Ma
a3041cd78c [Bugfix] fix dp parallel + tp > 1 offline inference port conflict (#4539)
### What this PR does / why we need it?
fix dp parallel + tp > 1 offline inference port conflict

issue import PR:https://github.com/vllm-project/vllm-ascend/pull/429


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-29 18:37:11 +08:00
wangxiyuan
1874265074 Move mla to ops module (#4575)
Move mla custom op to correct module
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 18:36:55 +08:00
Shanshan Shen
2a19215e5f [MM][Model] Remove Qwen2-VL modeling files (#4534)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-29 18:07:01 +08:00
wangxiyuan
6664a4e5ce improve soc version (#4522)
Make SOC_VERSION be readable for users. Now users can set simply
"910b"、“910c”、“310p”


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 17:42:16 +08:00
wangxiyuan
f10acddb78 drop ascend scheduler (#4498)
Ascend scheduler was added for non chunk prefill case before, since that
the npu ops didn't work well with chunked prefill.

Now the ops with chunked prefill work better, it's time to remove the
ascend scheduler to use vLLM default scheduler.

- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 16:18:34 +08:00
liziyu
53a52d6614 [P/D] [bugfix] add get_kv_connector_handshake_metadata func for 0.11.2 (#4567)
### What this PR does / why we need it?
add get_kv_connector_handshake_metadata func for 0.11.2


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-11-29 16:09:45 +08:00
LI SHENGYONG
0151022ab8 [bugfix] dep ineffective (#4417)
### What this PR does / why we need it?
The expert mapping table and weights of the dynamic EPLB were not
updated, causing the accuracy to be correct but not effective. This bug
has now been fixed.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-29 15:18:29 +08:00
wangxiyuan
8ebbf13c1a Update triton package name (#4563)
Add `aarch64` suffix to make sure the package name is OK


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 15:00:40 +08:00
Ting FU
b747c95cfa [Doc] Add single NPU tutorial for Qwen2.5-Omni-7B (#4446)
### What this PR does / why we need it?
Add single NPU tutorial for Qwen2.5-Omni-7B

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-29 11:57:29 +08:00
Ting FU
9af34755ff [Bugfix] Fix model run _npu_flash_attention hang issue (#4410)
Fix model run _npu_flash_attention in _forward_prefill_no_cache hang
issue, it was caused by wrong attention mask dtype.
### How was this patch tested?
Yes, tesed on Qwen2.5-VL and Qwen2.5-Omni

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Ting FU <futing10@huawei.com>
2025-11-29 09:20:22 +08:00
wangxiyuan
048d350f9e update triton package url (#4552)
Triton package url is not correct. This PR fix it

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-28 21:00:49 +08:00
shiyuan680
1c4a0468ee 【OPS】qwen3-next support triton chunk_gated_delta_rule ops (#4070)
### What this PR does / why we need it?
qwen3-next suppot  triton chunk_gated_delta_rule ops

### co-owners
@OsirisDuan

- vLLM version: v0.11.2

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-11-28 20:55:43 +08:00
fems14
5447a039b9 [Feature][main]reconstruction kvpool connector to ascend connector (#4438)
### What this PR does / why we need it?
1.In short, we renamed the existing MooncakeStoreConnector to
AscendStoreConnector and extracted the storage engine interaction logic
into a new Backend class.
Associated RFC:https://github.com/vllm-project/vllm-ascend/issues/4329
2.Fixed the issue where the number of input parameters for the connector
was incorrect, introduced in vllm 0.11.2
### Does this PR introduce _any_ user-facing change?
change MooncakeStoreConnector to AscendStoreConnector
### How was this patch tested?

- vLLM version: v0.11.2

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-11-28 18:08:37 +08:00
Chenxi Qian
554f16ae1f [Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804)
### What this PR does / why we need it?

This PR introduces support for adding custom CANN `aclnn` ops to
`vllm-ascend`, allowing users to define and use their own custom
operators.

Key changes include:
- Building and installing custom ops into the `vllm-ascend`-specified
directory
- Binding the `aclnn` op interface to the `torch.ops._C_ascend` module
- Enabling invocation of these ops within `vllm-ascend`

This PR includes a sample custom op:
`aclnnGroupedMatmulSwigluQuantWeightNzTensorList`, which is adapted from
the CANN operator
[`aclnnGroupedMatmulSwigluQuantWeightNZ`](https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/API/aolapi/context/aclnnGroupedMatmulSwigluQuantWeightNZ.md).
Its input parameters `weight` and `weight_scale` now accept
`list[torch.Tensor]` (i.e., `at::TensorList`).

### Does this PR introduce _any_ user-facing change?

No.


- vLLM version: v0.11.2

---------

Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
2025-11-28 18:06:39 +08:00
herizhen
3199fe8350 [Doc]Delete equals sign (#4537)
### What this PR does / why we need it?
Delete equals sign in doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-11-28 17:09:26 +08:00
wangxiaoteng888
366d2d95e8 [P/D] Add readme for PD separation (#4182)
### What this PR does / why we need it?
Add readme for PD separation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-11-28 15:17:59 +08:00
Shanshan Shen
e52ebf8674 [MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention (#4349)
### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at https://github.com/vllm-project/vllm/pull/29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for https://github.com/vllm-project/vllm/pull/28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark


- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-28 14:23:00 +08:00
LHXuuu
bdc66972db [Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
in vllm.
2. Support CompressedTensorsW8A8 static weight.
- weight: per-channel, int8, symmetric; activation: per-tensor, int8,
symmetric.
4. Support CompressedTensorsW8A8Dynamic weight.
- weight: per-channel, int8, symmetric; activation: per-token, int8,
symmetric, dynamic.
5. Modify the override_quantization_method in AscendQuantConfig.

Co-authored-by: taoqun110 taoqun@huawei.com
Co-authored-by: chenxi-hh chen464822955@163.com

- vLLM version: v0.11.2

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
2025-11-28 14:09:39 +08:00
SILONG ZENG
ab37a7d5ae [main]Upgrade cann to 8.3rc2 (#4350)
### What this PR does / why we need it?
Upgrade cann to 8.3rc2

### Does this PR introduce _any_ user-facing change?
Yes, docker image will use 8.3.RC2


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-11-28 14:06:01 +08:00
Zhu Yi Lin
755b635844 [TEST] Add eagle proposer ut (#4447)
### What this PR does / why we need it?
Add eagle proposer ut

- vLLM version: v0.11.2

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-27 21:59:31 +08:00
Slightwind
9fdabb7b60 [feature] Add Custom Op grouped_matmul_swiglu_quant (#4431)
This PR introduces the `EXEC_NPU_CMD` macro, serving as an adapter layer
to simplify the invocation of `aclnn` operators on Ascend NPUs.

**Key Changes:**
* **Adapter Layer:** Added `EXEC_NPU_CMD` macro and related dependencies
to standardize `aclnn` calls.
* **Operator Support:** Integrated `grouped_matmul_swiglu_quant` as a
reference implementation to demonstrate the usage of the new macro.

---


- vLLM version: v0.11.2

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-11-27 21:56:18 +08:00
Nengjun Ma
89a1a65300 [bugfix] fix ray start failed: local_world_size cannot little than visible device count error (#4457)
### What this PR does / why we need it?
Fix the ray start failed bug: local_world_size cannot little than
visible device count error
detail see issue #4456.

This fix code is copied from vllm fixing modify, PR:
[#28873](https://github.com/vllm-project/vllm/pull/28873)


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-27 21:18:32 +08:00
drslark
1cae3e4a49 [BugFix] Adapted Qwen3-Next eager mode to v0.11.2 (#4477)
### What this PR does / why we need it?

Adapted Qwen3-Next eager mode to `v0.11.2`.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-27 17:44:59 +08:00
Li Wang
b220de33e8 [CI][Nightly] Support local debugging for multi-node CI test cases (#4489)
### What this PR does / why we need it?
 This patch mainly doing the following things:
1. Make k8s/lws optional for multi-node testing, allowing developers to
run multi-node tests locally by actively passing in the IP addresses of
all nodes.
2. Allows passing a custom proxy script path in the config file to load
the proxy.

- vLLM version: v0.11.2

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-11-27 17:20:29 +08:00
zzzzwwjj
1fd56b1106 chip type judgement code optimization (#4485)
### What this PR does / why we need it?
| | cpu envir | npu envir |
|---|---|---|
| set `SOC_VERSION` | check if `SOC_VERSION` is in dict `soc_to_device`,
if not, raise an error that can not support current chip type. | print a
warning log when `SOC_VERSION` is not equal to chip type from `npu-smi`,
same as left for others. |
| not set `SOC_VERSION` | raise an error that `SOC_VERSION` is necessary
when compiling in a cpu envir. | use chip type from `npu-smi` to compile
vllm-ascend. |

### Does this PR introduce _any_ user-facing change?

Now we must set env `SOC_VERSION` when compiling in cpu envir. 

### How was this patch tested?


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-27 17:18:49 +08:00
zhangxinyuehfad
84d7f5a10d [UT] Fix ut test (#4472)
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-11-26 21:37:47 +08:00
herizhen
d252e36ae8 Change comment location (#4432)
### What this PR does / why we need it?
When running 'python example.py',connection issues often occur.The
solution is to comment out the first line the code.
Complete the specific names of machines A2 and A3.
Standardize document format,a space should be added after the colon.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM version: v0.11.2

---------

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-11-26 16:13:31 +08:00
zzzzwwjj
136ea9ff56 [refact] unified soc_version code (#4359)
### What this PR does / why we need it?

Currently, there are two paths to judge the chip type in code,
`get_ascend_soc_version` use `get_soc_version` api in torch_npu, and
`is_310p` `use _build_info.__soc_version__`, which generate when
install. We need to unify the two paths.

We need to unify these codes based on the following points:

1. We need to ensure consistency in chip type judgment between compiling
and running states;
2. In compiling state, we need chip type to complete op's compilation,
but in running state, we only need device
type(910B/910_93/310P/910_95/etc) to make code branch judgement;
3. In compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's api.

Based on the above points, we have made the following changes:

1. When user set env `SOC_VERSION`, use it; when not set, query
soc_version by `npu-smi`;
2. generate device_type based on soc_version when compiling, and write
`__device_type__` instead of `__soc_version__` in `_build_info.py`;
3. In running state, use `__device_type__` to judge code branch.

### Does this PR introduce _any_ user-facing change?

When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default,
we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in
the list `soc_to_device` in `setup.py`.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-26 14:28:55 +08:00
wangxiyuan
a91e76cd84 [CI] clean up ci (#4452)
1. Run 4-card test only when single and 2-card test passed
2. rename file to make it more clear
3. remove useless pd workflow, it has been managed by nightly test
already.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-26 14:07:56 +08:00
wangxiyuan
bc69d7cfe1 upgrade to vllm 0.11.2 (#4400)
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
https://github.com/vllm-project/vllm/pull/26866
2. get_mrope_input_positions is broken by
https://github.com/vllm-project/vllm/pull/28399
3. graph mode is broken by
https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
https://github.com/vllm-project/vllm/pull/27583
5. `get_attn_backend_cls` and attention backend is broken are broken by
https://github.com/vllm-project/vllm/pull/28534
6. spec decode is broken by
https://github.com/vllm-project/vllm/pull/28771
7. sp feature is broken by
https://github.com/vllm-project/vllm/pull/27126
8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922
9. lora is broken by https://github.com/vllm-project/vllm/pull/21068
10. execute_model is broken by
https://github.com/vllm-project/vllm/pull/26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
https://github.com/vllm-project/vllm/pull/28159
12. kv cahe is broken by https://github.com/vllm-project/vllm/pull/27753
13. dp is broken by https://github.com/vllm-project/vllm/pull/25110

 
What's broken and changed by ourself:
1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend
scheudler later.
4. qwen3-next is broken by
https://github.com/vllm-project/vllm/pull/28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764.
We'll remove model files in the future

Known issue:
1. ray doesn't work 
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache+ ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: shen-shanshan <467638484@qq.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-11-26 11:48:58 +08:00
shiyuan680
d5f77f14d0 mkdir triton package and move triton files (#4420)
### What this PR does / why we need it?
mkdir triton package and move triton files

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-11-26 11:06:12 +08:00
Zhu Yi Lin
1b137d6b1b [TEST] Delete Comment (#4427)
### What this PR does / why we need it?
Delete useless comments.
### Does this PR introduce _any_ user-facing change?
No

- vLLM main:
2918c1b49c

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-11-25 21:39:04 +08:00