205 Commits

Author SHA1 Message Date
fluctlux
f1f6370ed9 [Feature] Integrate Suffix Spec Decoding (#4045)
### What this PR does / why we need it?
This PR integrate suffix decoding (https://arxiv.org/abs/2411.04975)
from vllm (https://github.com/vllm-project/vllm/pull/25784)

#
Suffix Decoding is a dynamic n-gram matching method that:

1. Uses suffix trees to generate speculative tokens quickly using branch
frequency counts.
2. Can keep a history of prior model responses, which tends to work very
well with repetitive agentic use cases.
3. Can be dynamically updated with newly generated tokens, and FIFO
eviction of older requests.
#
### Does this PR introduce _any_ user-facing change?
This feature should be implemented as opt-in and remain seamless for
users who do not require suffix speculative decoding.

For users who wish to enable it, they must first install
arctic-inference:
`pip install arctic-inference
`

After installation, the suffix speculative decoding feature can be
enabled using the following speculative config:
`--speculative_config '{"method": "suffix", "num_speculative_tokens":
5}'
`

### How was this patch tested?
This PR is currently being tested on vLLM
main:83f478bb19
 with PR https://github.com/vllm-project/vllm/pull/25784

In our previous testing, suffix decoding achieved a 13%-30% throughput
improvement over n-gram on the sonnet dataset, tested on vllm-ascend
v0.9.1 with concurrency ranging from 2 to 40.

- vLLM version: v0.11.2

---------

Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
2025-12-01 18:41:42 +08:00
Ting FU
e8e20c0bbf [BugFix] Fix Qwen2.5_Omni vision customized op attr err (#4568)
Qwen2.5_Omni vision tower use AscendRMSNorm, which conatins a property
function. It would be override by set_forward_context(), patch
Qwen2_5OmniThinkerForConditionalGeneration func with customized
_process_image_input() and _process_video_input() to fix it.

### What this PR does / why we need it?

Fix Qwen2.5_Omni model infer image/video issue

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: Ting FU <futing10@huawei.com>
2025-12-01 09:18:55 +08:00
Nengjun Ma
a3041cd78c [Bugfix] fix dp parallel + tp > 1 offline inference port conflict (#4539)
### What this PR does / why we need it?
fix dp parallel + tp > 1 offline inference port conflict

issue import PR:https://github.com/vllm-project/vllm-ascend/pull/429


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-11-29 18:37:11 +08:00
Shanshan Shen
2a19215e5f [MM][Model] Remove Qwen2-VL modeling files (#4534)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-29 18:07:01 +08:00
shiyuan680
1c4a0468ee 【OPS】qwen3-next support triton chunk_gated_delta_rule ops (#4070)
### What this PR does / why we need it?
qwen3-next suppot  triton chunk_gated_delta_rule ops

### co-owners
@OsirisDuan

- vLLM version: v0.11.2

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-11-28 20:55:43 +08:00
Shanshan Shen
e52ebf8674 [MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention (#4349)
### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at https://github.com/vllm-project/vllm/pull/29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for https://github.com/vllm-project/vllm/pull/28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark


- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-11-28 14:23:00 +08:00
zzzzwwjj
136ea9ff56 [refact] unified soc_version code (#4359)
### What this PR does / why we need it?

Currently, there are two paths to judge the chip type in code,
`get_ascend_soc_version` use `get_soc_version` api in torch_npu, and
`is_310p` `use _build_info.__soc_version__`, which generate when
install. We need to unify the two paths.

We need to unify these codes based on the following points:

1. We need to ensure consistency in chip type judgment between compiling
and running states;
2. In compiling state, we need chip type to complete op's compilation,
but in running state, we only need device
type(910B/910_93/310P/910_95/etc) to make code branch judgement;
3. In compiling state, torch_npu may not have been installed yet, so we
can't use torch_npu's api.

Based on the above points, we have made the following changes:

1. When user set env `SOC_VERSION`, use it; when not set, query
soc_version by `npu-smi`;
2. generate device_type based on soc_version when compiling, and write
`__device_type__` instead of `__soc_version__` in `_build_info.py`;
3. In running state, use `__device_type__` to judge code branch.

### Does this PR introduce _any_ user-facing change?

When not set env `SOC_VERSION`, it will not be `ASCEND910B1` by default,
we will query soc_version by `npu-smi`. And env `SOC_VERSION` must be in
the list `soc_to_device` in `setup.py`.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-11-26 14:28:55 +08:00
wangxiyuan
bc69d7cfe1 upgrade to vllm 0.11.2 (#4400)
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
https://github.com/vllm-project/vllm/pull/26866
2. get_mrope_input_positions is broken by
https://github.com/vllm-project/vllm/pull/28399
3. graph mode is broken by
https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
https://github.com/vllm-project/vllm/pull/27583
5. `get_attn_backend_cls` and attention backend is broken are broken by
https://github.com/vllm-project/vllm/pull/28534
6. spec decode is broken by
https://github.com/vllm-project/vllm/pull/28771
7. sp feature is broken by
https://github.com/vllm-project/vllm/pull/27126
8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922
9. lora is broken by https://github.com/vllm-project/vllm/pull/21068
10. execute_model is broken by
https://github.com/vllm-project/vllm/pull/26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
https://github.com/vllm-project/vllm/pull/28159
12. kv cahe is broken by https://github.com/vllm-project/vllm/pull/27753
13. dp is broken by https://github.com/vllm-project/vllm/pull/25110

 
What's broken and changed by ourself:
1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend
scheudler later.
4. qwen3-next is broken by
https://github.com/vllm-project/vllm/pull/28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764.
We'll remove model files in the future

Known issue:
1. ray doesn't work 
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache+ ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: shen-shanshan <467638484@qq.com>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2025-11-26 11:48:58 +08:00
shiyuan680
d5f77f14d0 mkdir triton package and move triton files (#4420)
### What this PR does / why we need it?
mkdir triton package and move triton files

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: shiyuan680 <917935075@qq.com>
2025-11-26 11:06:12 +08:00
wangxiyuan
98031653df [misc] Remove useless patch_logits (#4252)
Torch-npu 2.7.1 has fixed the device check bug. This patch can be
removed now.

- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-25 21:25:54 +08:00
欧派果奶我还要
31a2c09e79 [Bugfix] fix patch typo (#4351)
### What this PR does / why we need it?
Fix a bug caused by this pr:
https://github.com/vllm-project/vllm-ascend/pull/4223
The bug makes
vllm-ascend/vllm_ascend/patch/platform/patch_multiproc_executor.py patch
in a wrong way

### How was this patch tested?
Tested in a single node. When the environment DYNAMIC_EPLB is set to
true, the patch works correctly. When it's set to false, the patch do
not patch
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
2025-11-25 15:13:06 +08:00
wangxiyuan
a1f142b7ad Drop 0.11.0 support (#4377)
There is a lot hack code for v0.11.0, which makes the code hard to
upgrade to newer vLLM version. Since v0.11.0 will release soon. Let's
drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-24 17:08:20 +08:00
anon189Ty
5c9f4a40c6 [Feat] Support MTP to running in full graph mode (#3892)
### What this PR does / why we need it?
Currently, the MTP model still runs in eager in full graph mode. This PR
adapts the MTP with the full graph capture and execution. When the graph
mode is set to "FULL_DECODE_ONLY", the MTP will run in full-graph to
improve the performance.

The change in both disable_padded_drafter_batch is True and False case
include:

1. Add _mtp_graph_params in acl_graph.py to isolate the data of main
model and the data of MTP.
2. Padding some metadata in mla_v1.py when in fullgraph mode.
3. Fixed the essential data address that will be used in model.forward.
4. Adapted according to the aclgraph capture framwork:
    1). Rebuild MTP model with ACLGraphWrapper.
    2). Add common attn metadata when start capture in MTP dummy_run.
    3). Add common attn metadata update in MTP.
    4). Addapted data update when num_speculative_tokens > 1.
5. Add a patch of MTP to adapt vllm v0.11.0.

Existing Issues:
1. When disable_padded_drafter_batch=True and running in FullGraph mode,
the data of the first-round requests in MTP is abnormal. We need to
identify the cause subsequently.
2. When disable_padded_drafter_batch=False and running in FullGraph
mode, the acceptance rate of the second and third tokens will decrease
(For example, if we set the num_speculative_tokens=3, the acceptance
rate of first token is 90%, the second is only 50% lower than 60%, the
third is only 20% lower than 30%). The reason is that the data processed
after the model runs does not match. This is a problem from another PR.
It works fine in eager and PIECEWISE mode, but has problem in FullGraph
mode. Once we have a solution, we will submit a bugfix.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-11-20 20:34:54 +08:00
欧派果奶我还要
c848da0687 [Bugfix] fix nightly multi-node EPLB tests' "DYNAMIC_EPLB=true" environment not working (#4223)
### What this PR does / why we need it?
fix nightly multi-node EPLB tests by adjusting
vllm_ascend\eplb\core\eplb_utils.py dynamic_eplb gate checking
### Does this PR introduce _any_ user-facing change?
no

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 21:31:58 +08:00
weiguihua2
01195e860c [Bugfix] fix cannot import name get_mp_context (#4174)
### What this PR does / why we need it?
fix bug: cannot import vllm package

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-11-14 09:09:14 +08:00
22dimensions
c272747d13 Upgrade to 0.11.1 newest vllm commit (#3982)
### What this PR does / why we need it?
adapt vllm-ascend main branch with vllm releases/v0.11.1

fix `forward context not set` in test_vlm.py caused by:
https://github.com/vllm-project/vllm/pull/23207

fix import `cdiv round` failed caused by:
https://github.com/vllm-project/vllm/pull/27188

fix import `init_cached_hf_modules` failed caused by:
https://github.com/vllm-project/vllm/pull/27567

adapt triton kernel `fused_recurrent_gated_delta_rule_fwd_kernel` caused
by: https://github.com/vllm-project/vllm/pull/27654
- remove unused code in sigmoid_gating.py
- `class FusedRecurrentFunction` , `fused_recurrent_gated_delta_rule`,
`fused_recurrent_gated_delta_rule_fwd`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI 


- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-11-12 23:01:19 +08:00
wangxiyuan
f811a24bf0 Remove VLLM_USE_V1 (#4086)
Drop VLLM_USE_V1 usage.  This env has been removed from vLLM already.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 15:43:39 +08:00
drslark
1804b60ec8 [BugFix][main] Adapted to torch_npu.npu_fused_infer_attention_score (#4025)
### What this PR does / why we need it?

Fixes a compatible bug with `torch_npu.npu_fused_infer_attention_score`
which is discribed in
https://github.com/vllm-project/vllm-ascend/issues/4020.
@momo609 tells us this solution.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

The environment is same with this issue,
https://github.com/vllm-project/vllm-ascend/issues/4020.

We modify the code according to
https://github.com/vllm-project/vllm-ascend/pull/3918.

And run below codes:

```python
# run with Qwen3-next-mtp

prompts = [
    "Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          speculative_config={
              "method": "qwen3_next_mtp",
              "num_speculative_tokens": 1,
          },
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Outputs:

```text
Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'
```

Now, `torch_npu.npu_fused_infer_attention_score` is compatible with
Qwen3-Next.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: drslark <slarksblood@qq.com>
2025-11-06 22:00:24 +08:00
whx
f6149f3894 [Model][3/N] Refactor sfa into mla and remove deepseek_v3_2.py (#3769)
This is the follow-up PR to PR #3189, which continues to refactor sfa
into mla and finally remove deepseek_v3_2.py. This is the last PR of
deepseek modeling refactoring. After this, all deepseek-related model
codes are removed from vllm_ascend.

FurtherMore, after this PR deepseek v3.2 can run chunk-prefill with
correct accuracy.

- vLLM version: v0.11.0rc3
- vLLM main:
83f478bb19

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-30 17:06:38 +08:00
Icey
d9cdc65854 Upgrade to new vllm commit (#3719)
### What this PR does / why we need it?
Upgrade to new vllm commit:
c9461e05a4

- Fix many imports, caused by
https://github.com/vllm-project/vllm/pull/26908
- Fix import ```sha256```, caused by
https://github.com/vllm-project/vllm/pull/27169
- Remove ```SchedulerConfig.send_delta_data```, caused by
https://github.com/vllm-project/vllm/pull/27142
- Fix ```FusedMoE``` because of dual stream execution, caused by
https://github.com/vllm-project/vllm/pull/26440

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.


- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-10-25 15:36:32 +08:00
Mengqing Cao
cea0755b07 [1/N][Refactor] Refactor code to adapt with vllm main (#3612)
### What this PR does / why we need it?
This is the step 1 of refactoring code to adapt with vllm main, and this
pr aligned with
17c540a993

1. refactor deepseek to the latest code arch as of
17c540a993
 
2. bunches of fixes due to vllm changes
- Fix `AscendScheduler` `__post_init__`, caused by
https://github.com/vllm-project/vllm/pull/25075
- Fix `AscendScheduler` init got an unexpected arg `block_size`, caused
by https://github.com/vllm-project/vllm/pull/26296
- Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by
https://github.com/vllm-project/vllm/pull/23485
- Fix `MLAAttention` import,caused by
https://github.com/vllm-project/vllm/pull/25103
- Fix `SharedFusedMoE` import, caused by
https://github.com/vllm-project/vllm/pull/26145
- Fix `LazyLoader` improt, caused by
https://github.com/vllm-project/vllm/pull/27022
- Fix `vllm.utils.swap_dict_values` improt, caused by
https://github.com/vllm-project/vllm/pull/26990
- Fix `Backend` enum import, caused by
https://github.com/vllm-project/vllm/pull/25893
- Fix `CompilationLevel` renaming to `CompilationMode` issue introduced
by https://github.com/vllm-project/vllm/pull/26355
- Fix fused_moe ops, caused by
https://github.com/vllm-project/vllm/pull/24097
- Fix bert model because of `inputs_embeds`, caused by
https://github.com/vllm-project/vllm/pull/25922
- Fix MRope because of `get_input_positions_tensor` to
`get_mrope_input_positions`, caused by
https://github.com/vllm-project/vllm/pull/24172
- Fix `splitting_ops` changes introduced by
https://github.com/vllm-project/vllm/pull/25845
- Fix multi-modality changes introduced by
https://github.com/vllm-project/vllm/issues/16229
- Fix lora bias dropping issue introduced by
https://github.com/vllm-project/vllm/pull/25807
- Fix structured ouput break introduced by
https://github.com/vllm-project/vllm/issues/26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-10-24 16:55:08 +08:00
fems14
2bcadcb9d5 【main】patch sched_yield (#3648)
### What this PR does / why we need it?
On Arm systems, os.sched_yield() does not take effect, causing the GIL
(Global Interpreter Lock) to remain unrelinquished and resulting in CPU
bound issues. This PR applies a patch to sched_yield in vLLM, making the
process execute time.sleep(0) instead to release the GIL.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-24 00:06:45 +08:00
whx
72695c97d0 [BugFix][main] Fix quantization related mtp bug with patch (#3620)
vLLM 0.11.0 didn't bring PR
(https://github.com/vllm-project/vllm/pull/25805) thus missing the
prefix of mtp's SharedHead. This PR fixes this bug with a patch to
vllm's deepseek_mtp. main also need this bugfix to support vllm's
v0.11.0

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-23 09:54:31 +08:00
offline893
e916265b2b [CI]Add EPLB CI. (#3568)
### What this PR does / why we need it?
1.Add eplb ci to check the change of eplb feature.
2.Add param checking of eplb params. 
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Qwen in A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-21 22:58:02 +08:00
wangxiyuan
13e8e75143 [Refactor] refactor patch module (#3555)
### What this PR does / why we need it?
we notice that `patch_main` is never used. Usually the patch is for all
version. And if it's for specified version, we can use `vllm_version_is`
instead. So let's remove the useless sub folder in patch module to make
it clear.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-21 20:19:46 +08:00
Wang Kunpeng
4b3bd4f397 [main][bugfix] bugfix for minicpm models (#3527)
### What this PR does / why we need it?
bugfix for minicpm-2b and minicpm3-4b

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-10-19 11:00:55 +08:00
offline893
6c9909c861 [Patch]patch of v1 executor when enable eplb. (#3511)
### What this PR does / why we need it?
when using dynamic eplb, patch v1 executor to avoid create child process
failed.

### How was this patch tested?
deepseek in v3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-19 10:54:26 +08:00
Mengqing Cao
8abe517870 [Refactor] Adapt deepseek-v3.2 to vllm 0.11.0 (#3432)
### What this PR does / why we need it?
Adapt deepseek-v3.2 to vllm 0.11.0, removing the useless patch.

The final goal is to remove all the patches and align the code arch to
vllm, thus we need to do the following work in next prs.
TODO:
- [x] remove patch on attention spec
- [ ] refactor the kvcache creation logic

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
1. CI passed with existing test.
2. Test pass with deepseek-v3.2-exp


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-15 17:48:58 +08:00
xuyexiong
02c26dcfc7 [Feat] Supports Aclgraph for bge-m3 (#3171)
### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_embedding.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
to start an online server with bs 10, each batch's seq length=8192, we
set --max-num-batched-tokens=8192*10 to ensure encoder is not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For bs10, each batch's seq length=8192, QPS is improved from 85 to 104,
which is a 22% improvement, lots of host bound is reduced.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Co-authored-by: wangyongjun <1104133197@qq.com>
2025-10-14 23:07:45 +08:00
Mengqing Cao
223cc34085 [KVCache] Refactor KVCache as page_size_bytes is ineffective (#3438)
### What this PR does / why we need it?
Refactor KVCache as page_size_bytes is ineffective.

1. Currently the `AttentionSpec` is patched, but the `page_size_bytes`
is still using that in vLLM in runtime, thus the patch is not working
actually. Thus this pr removes the patch on `AttentionSpec`, and will do
the final fix in vLLM.
2. Use `MLAAttentionSpec` instead of `FullAttentionSpec` to reduce
`page_size_bytes` of spec, so that num_blocks in spec could double

### How was this patch tested?
Test pass with Qwen3-Next and DeepSeek-V3.2-Exp

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-14 21:28:41 +08:00
linfeng-yuan
e4acb2dfc7 [feat] support customized and separated hccl_buffer_size for process group initialization (#3073)
### What this PR does / why we need it?
Currently, users have to set `HCCL_BUFFSIZE` to 512~1024 to perform mc2
operators (dispatch and combine) while running moe models with large
`ep_size` and `batch_size`. This environmental variable not only affects
allocated VRAM for mc2 group, but also increases VRAM allocation for dp,
tp & ep groups, leading to significant kvcache and free_memory drops.
This PR supports to automatically calculate and set `hccl_buffer_size`
for each process group **(except mc2 group)** separately when users set
`HCCL_BUFFSIZE` for mc2 group. This can significantly reduce wasted
buffer_size set for dp, tp & ep groups.

Note that current mc2 operators can only perform communication space
partitioning based on `HCCL_BUFFSIZE` configuration. Once they support
`hccl_buffer_size` configuration with `pg_options` while initializing
process group, we'll caculate the required buffer size and users would
avoid set `HCCL_BUFFSIZE` themselves.

### Does this PR introduce _any_ user-facing change?
No. 

### How was this patch tested?
We performed E2E serving with deepseek_r1 initializing DP/TP/EP/MC2
process group and observed significant kv_cache and free_memory
increase!


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-11 15:55:22 +08:00
Peipei
8c1a4dedf3 [Bugfix]modify the enable range of _merge_multimodal_embeddings patch (#3360)
### What this PR does / why we need it?
Modify the enable range of _merge_multimodal_embeddings patch. The
current patch is only enabled for offline inference on the platform. For
online serviceization, due to the addition of the worker sub-process, it
is not enabled within the sub-process.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: booker123456 <945658361@qq.com>
2025-10-11 08:37:07 +08:00
wangxiyuan
1c5b302f0d [Misc] Clean up useless patch (#3320)
### What this PR does / why we need it?
1. clean up v0.10.2 support in ut and e2e test
2. remove v0.11.0 period job, we're at v0.11.0 now.
3. remove uesless patch for deepseek v3.2. They have been done in vLLM
already.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 14:07:26 +08:00
wangxiyuan
f12f76d7ba Drop 0.10.2 (#3284)
Drop v0.10.2 support, we support vLLM 0.11.0rc3 now.
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-09 10:28:38 +08:00
wangxiyuan
81bd6e4c99 Add DeepSeek V3.2 support (#3270)
### What this PR does / why we need it?

This PR added the initial DeepSeek V3.2 support with [vLLM
v0.11.0](https://github.com/vllm-project/vllm/tree/releases/v0.11.0)
(not released yet). We will complete vLLM adaptation as soon as
possible. This feature will be ready in recent 1-2 days.

Related doc: https://github.com/vllm-project/vllm-ascend/pull/3223 .

### Does this PR introduce _any_ user-facing change?
Yes!

### How was this patch tested?
CI passed and Run deepseek doc soon.


- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-30 03:25:58 +08:00
wangxiyuan
2930e4a6bd [CI] Upgrade vllm to newest commit (#3182)
### What this PR does / why we need it?
Upgrade vLLM to newest commit

- Fix the aclgraph doesn't work problem, caused by
24fab45d96
- Fix PoolerOutput import error, caused by
755ed7b05b
- Fix the aclgraph weight load error to keep the same with torchair fix.
4492e3a554

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All test should pass


- vLLM version: v0.10.2
- vLLM main:
52d0cb8458

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-26 06:18:15 +08:00
wangxiyuan
a055183821 [CI] Upgrade vLLM version (#3139)
Upgrade vLLM version to the newest commit.
- Fix the break change introduced by
969b4da3a6
- Add a patch to quick fix torhcair
de94289a98
- fix the ut error introduced by
de94289a98

Close: https://github.com/vllm-project/vllm-ascend/issues/3138


- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-25 07:36:51 +08:00
Icey
e7618d9414 [2/N][Refactor][Qwen3-Next] remove redundant methods and patch methods in Qwen3NextGatedDeltaNet (#3082)
### What this PR does / why we need it?
remove redundant methods and patch methods in Qwen3NextGatedDeltaNet
involved causal_conv1d_fn, causal_conv1d_update_npu, fused_gdn_gating,
fused_reccrrent_gated_delta_rule, torch_chunk_gated_delta_rule,
RMSNormGated

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
              tensor_parallel_size=4,
              enforce_eager=True,
              trust_remote_code=True,
              max_model_len=256,
              gpu_memory_utilization=0.7,
              block_size=64,
              )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

CI passed with new added/existing test.


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-24 11:25:42 +08:00
Peipei
c4b976af1a [Model][VLM][Patch]Modify ascend affinity _merge_multimodal_embeddings (#3071)
### What this PR does / why we need it?

This PR aims to address the incompatibility of the `.masked_scatter_`
operation in the current `_merge_multimodal_embeddings` function on
Ascend. For now, it reverts to the previous version of the CPU
operation, which can be executed asynchronously on the device side to
enhance performance.

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: booker123456 <945658361@qq.com>
2025-09-24 10:25:28 +08:00
Li Wang
12bcbd02bb [CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907)
### What this PR does / why we need it?
1. This pr bump vllm commit to
6d8246aaff
2. fix upstream changes https://github.com/vllm-project/vllm/pull/24548
abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable
3. fix metadata_builder changes introduced by
https://github.com/vllm-project/vllm/pull/23693
4. fix `structured_outputs_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22772
5. fix `moe_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22537

Co-authored-by:  MengqingCao <cmq0113@163.com>
Co-authored-by:  Yikun Jiang <yikunkero@gmail.com>


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-20 17:37:57 +08:00
Shanshan Shen
8326f15ecf [CustomOp] Register AscendSharedFusedMoE custom op (#2980)
### What this PR does / why we need it?
Register `AscendSharedFusedMoE` custom op.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`DeepSeek-V2-Lite` is a MoE model with shared experts.

Test:

```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:

```bash
中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内
```


- vLLM version: v0.10.2
- vLLM main:
486c5599e3

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-09-19 19:05:01 +08:00
linfeng-yuan
1c5900327b [refactor] refactor deepseek-related files (#2849)
### What this PR does / why we need it?
This PR deletes ~2K lines of code about deepseek modeling. It falls back
CustomDeepseekV2 modules to original vllm implementations and adapts
some modifications in vllm about deepseek and moe.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E  vllm serving with torchair graph mode and eager mode.

- vLLM version: v0.10.2
- vLLM main:
759ef49b15

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-16 14:13:07 +08:00
wangxiyuan
c556038ef0 [New model] Qwen3-next support (#2917)
### What this PR does / why we need it?
Add Qwen3-next support.

### Does this PR introduce _any_ user-facing change?
Yes, users can use Qwen3 next.
Related doc: https://github.com/vllm-project/vllm-ascend/pull/2916 the
tutorial will be ready in
[here](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_qwen3_next.html)

### How was this patch tested?
Doc CI passed

Related: https://github.com/vllm-project/vllm-ascend/issues/2884

Co-Authored-By: Angazenn <supperccell@163.com>
Co-Authored-By: zzzzwwjj <1183291235@qq.com>
Co-Authored-By: MengqingCao <cmq0113@163.com>
Co-Authored-By: linfeng-yuan <1102311262@qq.com>
Co-Authored-By: hust17yixuan <303660421@qq.com>
Co-Authored-By: SunnyLee219 <3294305115@qq.com>
Co-Authored-By: maoxx241 <maoxx241@umn.edu>


- vLLM version: v0.10.2
- vLLM main:
b834b4cbf1

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Your Name <you@example.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: hust17yixuan <303660421@qq.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: hust17yixuan <303660421@qq.com>
2025-09-16 01:17:42 +08:00
yiz-liu
e13c4ddb42 [Fix] Fix SharedFusedMoE (#2817)
### What this PR does / why we need it?
Really strange that `register_oot` doesn't work with `SharedFusedMoE`,
so we have to add this patch, for now.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
This PR won't have any effect in DeepSeek since we currently still stick
with the old `CustomDeepseekV2`.

- vLLM version: v0.10.1.1
- vLLM main:
0cdd213641

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-09 18:19:56 +08:00
wangxiyuan
7d6d9449a8 [Misc] Move lora patch file into lora module (#2797)
Cleanup useless file in patch module. Update the lora support list is OK
in vLLM Ascend, no need to patch vLLM


- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-08 21:42:12 +08:00
lidenghui1110
5a7181569c [feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167)
### What this PR does / why we need it?
This PR introduces Oproj matrix tensor model parallel to achieve
decreasing of memory consumption. It only support graph mode in pure DP
scenario.

In deepseek r1 w8a8 PD disagregated Decode instance, using pure DP, with
oproj_tensor_parallel_size = 8, we have 1 ms TPOT increasing, saved 5.8
GB NPU memory per RANK. We got best performance when
oproj_tensor_parallel_size=4 without TPOT increasing.

performance data:
<img width="1442" height="442" alt="image"
src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d"
/>

### Does this PR introduce _any_ user-facing change?
This PR introduces one new config in `additional_config`.
| Name | Effect | Required | Type | Constraints |
| :---------------------------- |
:--------------------------------------- | :------- | :--- |
:----------------- |
| oproj_tensor_parallel_size | Split the o_proj matrix along the row
dimension (head num * head dim) into oproj_tensor_parallel_size pieces.
| No | int | default value is None, once this value is set, the feature
will be enabled, head num * head dim must be divisible by this value. |

example

`--additional_config={"oproj_tensor_parallel_size": 8}`

### How was this patch tested?


- vLLM version: v0.10.1.1
- vLLM main:
eddaafc1c7

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zzh <zzh_201018@outlook.com>
2025-09-07 10:31:32 +08:00
yupeng
9f1e054fe3 [Bugfix][LoRA][Operator] Fix LoRA custom operators accuracy issue (#2672)
### What this PR does / why we need it?
Fix the LoRA accuracy issue that introduced by custom AscendC operator
"bgmv_shrink, sgmv_shrink, bgmv_expand, sgmv_epand".

The bug details are: 
- In the kernel function, if you want to call GlobalTensor.GetSize
method, you have to pass the second parameter of bufferSize when you
call GlobalTensor.SetGlobalBuffer first.
- Or GlobalTensor.GetSize method will return a random value.
- You can refer to [this
doc](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha002/apiref/ascendcopapi/atlasascendc_api_07_00024.html).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.1.1
- vLLM main:
a344a5aa0a

---------

Signed-off-by: paulyu12 <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu12 <paulyu0307@gmail.com>
2025-09-02 11:46:59 +08:00
yupeng
cf96366a39 [Bugfix][LoRA][Patch] Fix the LoRA inference bug after upstream vLLM codebase changed (#2560)
### What this PR does / why we need it?
The mergence of the upstream
https://github.com/vllm-project/vllm/pull/22592 caused a vllm-ascend
LoRA inference bug. The details are following:

According to
[torch_npu/npu/_stream_check.py](863b9071cb/torch_npu/npu/_stream_check.py (L74)),
NPU device type tensors have attributes is_cuda=True and is_npu=True.
This causes that vLLM's apply_repetition_penalties function will run
into the branch of "if logits.is_cuda and logits.is_contiguous()" and
call the custom op implemented in CUDA, which is not compatible with
NPU.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.1.1
- vLLM main:
fe8d7b6f03

---------

Signed-off-by: paulyu12 <paulyu0307@gmail.com>
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu12 <paulyu0307@gmail.com>
2025-08-28 10:40:51 +08:00
Mengqing Cao
61866b8ac6 [Quickfix] update CachedRequestState as NewRequestData changed (#2367)
### What this PR does / why we need it?
1. update `CachedRequestState` as `NewRequestData` changed in
https://github.com/vllm-project/vllm/pull/22570
2. drop maintenance of vllm v0.10.0 in the branch main

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.10.0
- vLLM main:
92ff41abea

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-08-15 07:35:27 +08:00
Shanshan Shen
103654ccd6 [Misc] Remove redundant imported envs, using envs_ascend instead (#2193)
### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.

```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```

- vLLM version: v0.10.0
- vLLM main:
71683ca6f6

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-08-14 09:33:39 +08:00