Commit Graph

55 Commits

Author SHA1 Message Date
Ronald
e20813f441 [Feature] implement eagle spec decoding for model runner v2 (#5840)
### What this PR does / why we need it?
This PR implements EAGLE spec decoding for model runner v2. See the
RFC: https://github.com/vllm-project/vllm-ascend/issues/5208

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: v0.13.0

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-01-14 09:18:05 +08:00
Icey
b94fc13d3f [BugFix][Fusion] Fix graph fusion failure problem (#5676)
Currently, the vllm pull request
(https://github.com/vllm-project/vllm/pull/24252) is causing operator
fusion to fail. This issue was previously fixed by patching the backend.
The root cause has been identified, and the problem can be resolved with
this pull request.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-07 18:42:55 +08:00
Fager10086
77a029979e Revert "[BugFix][Fusion] Fix graph fusion failure problem (#5253)" (#5667)
### What this PR does / why we need it?

Revert PR 5253 to fix the smoke test failure.

### Does this PR introduce _any_ user-facing change?

Does not.

### How was this patch tested?

It was tested in the failure case.

Signed-off-by: Rifa <865071616@qq.com>
2026-01-06 21:55:47 +08:00
wangxiyuan
cd1162e25a [Misc] Remove useless weight loader patch (#5619)
The patch for weight loader is useless now. Let's remove it

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-06 20:17:32 +08:00
Icey
e7b623b363 [BugFix][Fusion] Fix graph fusion failure problem (#5253)
Currently, the vllm pull request
(https://github.com/vllm-project/vllm/pull/24252) is causing operator
fusion to fail. This issue was previously fixed by patching the backend.
The root cause has been identified, and the problem can be resolved with
this pull request.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-05 17:49:09 +08:00
Icey
9b2a7d8866 [BugFix][Fusion] Patch compile backend to make fusion available (#5308)
Currently, the vllm pr: https://github.com/vllm-project/vllm/pull/24252
is causing operator fusion to fail, which can be mitigated by patching
the backend. Once the problem is completely resolved, I will submit a
new pull request to remove the patch.

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
2025-12-26 09:18:16 +08:00
Shanshan Shen
6c478531f8 [CustomOp] Register AscendApplyRotaryEmb CustomOp and remove related patch (#4667)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm/pull/29873, register
`AscendApplyRotaryEmb` CustomOp and remove related patch.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

####  Test Qwen2.5-VL

Run:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384
```

Output:

```
{"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d
```

####  Test Qwen3-VL

Run:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--max_model_len 16384
```

Output:

```
{"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-12-23 10:04:37 +08:00
Zhu Yi Lin
3d04ae8e7d [Main] [Patch] support balance scheduling patch (#5212)
### Motivation.

**Limitations of the current vLLM v1 scheduling strategy**
vLLM v1 scheduling currently enables chunked prefill by default, which
processes prefill and decode requests together in a single
scheduling step. This can reduce overall system throughput and
performance in some scenarios.

Balance scheduling addresses this issue by synchronizing the number of
requests in the running queues across all schedulers to delay the
scheduling of new requests, thereby improving the overall system's
steady-state decoding time. This achieves:
- Adding `balance_gather` to the scheduler synchronizes the number of
requests in the running queues between DPs.
- Balance scheduling improves the decode steady-state time, thereby
increasing the overall output throughput of the inference system.


### Proposed Change.

 **1.Feature Overview**

In the vLLM scheduler, running requests (i.e., requests whose prefill
computation has already started) have the highest priority, followed
by waiting requests (i.e., requests that have not yet been computed).


As shown in the diagram above, when the inference system leaves its
steady state, the scheduler schedules a batch of new requests
for prefill and then synchronizes them across the data-parallel (DP)
ranks. This can force DP ranks that are running decode only to
synchronize to the number of prefill tokens. Frequent
prefill scheduling by some DP ranks can therefore degrade
the overall output throughput of the system.

Balance scheduling synchronizes the number of running queue requests
across different DPs, and only schedules new requests for prefill
when every scheduler has fewer than max_num_request running requests.
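The gating rule described above can be sketched as follows. This is a minimal illustration, not the scheduler code from the PR: `can_schedule_prefill`, `running_counts`, and `max_num_request` are hypothetical names, and the list of per-DP running-queue sizes stands in for whatever `balance_gather` collects.

```python
from typing import Sequence


def can_schedule_prefill(running_counts: Sequence[int], max_num_request: int) -> bool:
    """Return True only if every DP scheduler still has room in its running queue.

    running_counts: hypothetical list holding the running-queue size of each
    DP rank, as it might look after a gather across ranks.
    """
    return all(count < max_num_request for count in running_counts)


# One rank is already full, so no rank admits new prefill requests this step.
print(can_schedule_prefill([128, 96, 128, 110], max_num_request=128))
# Every rank has spare capacity, so new requests may be scheduled.
print(can_schedule_prefill([100, 96, 120, 110], max_num_request=128))
```

Because the decision depends on all ranks, a single saturated rank delays prefill everywhere, which is what keeps the decode-only ranks in their steady state.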

 **2.Implementation Design**

 **3.Experiment Results**
- Fixed-length input scenario: In the performance test scenario with
3.5K fixed-length input and 1.5K fixed-length output, the throughput
performance was improved by approximately **18%** after adding balance
scheduling.

| Method | Model | Input Len | Request Count | Output Len | BatchSize | Average TTFT | Average TPOT | e2e duration | Input Token Throughput | Output Token Throughput | Request Throughput |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 6600 | 86.85 | 591.9s | 3030.5 | 1297.3 | 0.86 |
| Balance scheduling | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 7012 | 70.63 | 501.7s | 3575.7 | 1530.7 | 1.02 |
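As a quick check of the reported improvement, the relative changes can be recomputed from the table values. The snippet is only illustrative arithmetic; the helper name `relative_change` is not from the PR.

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` against `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values copied from the table (Baseline vs. Balance scheduling).
output_tput = relative_change(1297.3, 1530.7)  # output token throughput
tpot = relative_change(86.85, 70.63)           # average TPOT
ttft = relative_change(6600, 7012)             # average TTFT

print(f"output throughput: {output_tput:+.1f}%")
print(f"average TPOT:      {tpot:+.1f}%")
print(f"average TTFT:      {ttft:+.1f}%")
```

Output token throughput rises by roughly 18% and average TPOT falls, while average TTFT increases, which fits a scheme that delays the admission of new requests.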

**4.Demo PR**

[#29721 ](https://github.com/vllm-project/vllm/pull/29721)

---------

Signed-off-by: GDzhu01 <809721801@qq.com>
2025-12-23 09:04:38 +08:00
Ascendyh
b2c121637f [task] Add fused gdn gating triton kernel (#4304)
### What this PR does / why we need it?
This commit introduces a Triton-based fused GDN gating kernel for Ascend
NPU, aimed at improving performance in the Gated Delta Net workflow.
### Does this PR introduce _any_ user-facing change?
It only adds and refactors internal Triton kernels and wrappers for
Ascend. These are backend implementation details. There are no new APIs,
flags, CLI options, or behavior changes visible to end users.
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Ascendyh <hw7osiris@outlook.com>
2025-12-22 14:09:19 +08:00
XiaoxinWang
0cc3fc357f [pref] qwen3_next add triton ops : fused_sigmoid_gating_delta_rule_update (#4818)
### What this PR does / why we need it?
Add the fused_sigmoid_gating_delta_rule_update op for qwen3_next, which
fuses fused_gdn_gating and fused_recurrent_gated_delta_rule.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2025-12-19 16:34:11 +08:00
ZT-AIA
39fb9e7c83 qwen3_next add triton ops : fused_qkvzba_split_reshape (#4788)
### What this PR does / why we need it?
add triton ops fused_qkvzba_split_reshape_cat for qwen3_next
GatedDeltaNet
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT 
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
2025-12-18 11:31:04 +08:00
realliujiaxu
9e24bdd44c [Feat] Refactor rejection sampler (#4975)
### What this PR does / why we need it?

Currently, we use `AscendRejectionSampler`, which extends
`RejectionSampler`, in spec decoding. `AscendRejectionSampler` overrides
`forward` of `RejectionSampler` only to replace the `rejection_sample`
function. As a result, much of the `RejectionSampler` code cannot be
reused, for example:
- https://github.com/vllm-project/vllm/pull/19482
- https://github.com/vllm-project/vllm/pull/26060
- https://github.com/vllm-project/vllm/pull/29223

#### Proposed Change:
- Delete `AscendRejectionSampler` and use `RejectionSampler` directly in
model runner.
- Patch `RejectionSampler.expand_batch_to_tokens` and
`RejectionSampler.rejection_sample`; a better approach may be to make them
custom ops.
- Modify `NPUModelRunner` following
https://github.com/vllm-project/vllm/pull/26060
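The difference between overriding `forward` in a subclass and patching individual methods can be sketched as below. The class and function bodies are stand-ins invented for illustration; they are not vLLM's actual sampler implementation.

```python
class Sampler:
    """Stand-in for an upstream class whose `forward` we want to keep reusing."""

    def rejection_sample(self, draft, target):
        return [d for d, t in zip(draft, target) if d == t]

    def forward(self, draft, target):
        # Upstream logic around the core call stays untouched and reusable.
        accepted = self.rejection_sample(draft, target)
        return {"accepted": accepted, "num_accepted": len(accepted)}


def device_rejection_sample(self, draft, target):
    """Hypothetical device-specific replacement for the core function.

    Accepts draft tokens only up to the first mismatch with the target.
    """
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted.append(d)
    return accepted


# Patch only the method that differs; `forward` is used unchanged.
Sampler.rejection_sample = device_rejection_sample

result = Sampler().forward([1, 2, 3, 4], [1, 2, 9, 4])
print(result)
```

With this approach, any upstream change to `forward` is picked up automatically, since only the replaced method is maintained separately.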

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async scheduling (test with
https://github.com/vllm-project/vllm-ascend/pull/4893/)


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
2025-12-16 11:32:26 +08:00
drslark
8fb0ef5ffa [main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#4932)
### What this PR does / why we need it?
Fixes an accuracy bug in Qwen3-next-MTP during batched inference.
It is described in
https://github.com/vllm-project/vllm-ascend/issues/4930.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: drslark <slarksblood@qq.com>
2025-12-15 13:22:30 +08:00
wangxiyuan
3362be7f86 Update patch doc (#4869)
Update patch doc. After this PR is merged, all the new patch PR should
update this doc as well.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-10 23:27:45 +08:00
drslark
0fb1dc43a1 [BugFix][main] Adapted Qwen3-Next-MTP to chunked prefill (#4770)
### What this PR does / why we need it?
The pad `-1` modification is from
https://github.com/vllm-project/vllm/pull/25743.

It still has bugs for batched chunked prefill.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: drslark <slarksblood@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-10 22:54:24 +08:00
lianyibo
e32014ac1d [Model] Support pooling models (#3122)
### What this PR does / why we need it?

Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend. This
PR covers the three embed model types (cls_token, mean_token,
lasttoken).

After this
[commit](17373dcd93),
vllm has provided support for adapting pooling models on the v1 engine.
This PR includes corresponding adaptations on the vllm-ascend side.

Fixes #1960

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-12-10 11:37:57 +08:00
wangxiyuan
98031653df [misc] Remove useless patch_logits (#4252)
Torch-npu 2.7.1 has fixed the device check bug. This patch can be
removed now.

- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-25 21:25:54 +08:00
whx
72695c97d0 [BugFix][main] Fix quantization related mtp bug with patch (#3620)
vLLM 0.11.0 does not include PR
(https://github.com/vllm-project/vllm/pull/25805), so the
prefix of mtp's SharedHead is missing. This PR fixes the bug with a patch to
vllm's deepseek_mtp. The main branch also needs this bugfix to support vllm
v0.11.0.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-23 09:54:31 +08:00
wangxiyuan
13e8e75143 [Refactor] refactor patch module (#3555)
### What this PR does / why we need it?
We noticed that `patch_main` is never used. A patch usually applies to all
versions, and when it targets a specific version we can use `vllm_version_is`
instead. So let's remove the unused subfolder in the patch module to make
it clearer.
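A version-gated patch in this style might look like the sketch below. The `vllm_version_is` function here is a local stand-in written for the example (its real signature in vllm-ascend is not shown in this log), and `INSTALLED_VERSION`, `apply_patches`, and the patch names are hypothetical.

```python
INSTALLED_VERSION = "0.11.0"  # hypothetical; normally read from the installed package


def vllm_version_is(target: str) -> bool:
    """Local stand-in: True when the installed version equals `target`."""
    return INSTALLED_VERSION == target


def apply_patches() -> list[str]:
    """Collect the patches to apply: common ones always, gated ones by version."""
    applied = ["common_patch"]  # applies to every supported version
    if vllm_version_is("0.11.0"):
        applied.append("patch_only_for_0.11.0")
    return applied


print(apply_patches())
```

Keeping the version check next to the patch it guards avoids a separate per-version folder layout.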


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-21 20:19:46 +08:00
xuyexiong
02c26dcfc7 [Feat] Supports Aclgraph for bge-m3 (#3171)
### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_embedding.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
To start an online server with batch size 10 and a sequence length of 8192
per batch, we set --max-num-batched-tokens=8192*10 to ensure the encoder is
not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For batch size 10 with a sequence length of 8192 per batch, QPS improves
from 85 to 104, a 22% improvement, because much of the host-bound overhead
is removed.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Co-authored-by: wangyongjun <1104133197@qq.com>
2025-10-14 23:07:45 +08:00
linfeng-yuan
e4acb2dfc7 [feat] support customized and separated hccl_buffer_size for process group initialization (#3073)
### What this PR does / why we need it?
Currently, users have to set `HCCL_BUFFSIZE` to 512~1024 to run mc2
operators (dispatch and combine) when serving moe models with large
`ep_size` and `batch_size`. This environment variable not only affects
the VRAM allocated for the mc2 group, but also increases VRAM allocation for
the dp, tp & ep groups, leading to significant kvcache and free_memory drops.
This PR automatically calculates and sets `hccl_buffer_size`
for each process group **(except mc2 group)** separately when users set
`HCCL_BUFFSIZE` for the mc2 group. This can significantly reduce the
buffer size wasted on the dp, tp & ep groups.

Note that current mc2 operators can only partition communication space
based on the `HCCL_BUFFSIZE` configuration. Once they support
`hccl_buffer_size` configuration with `pg_options` while initializing
process group, we'll calculate the required buffer size so that users no
longer need to set `HCCL_BUFFSIZE` themselves.

### Does this PR introduce _any_ user-facing change?
No. 

### How was this patch tested?
We performed E2E serving with deepseek_r1 initializing DP/TP/EP/MC2
process group and observed significant kv_cache and free_memory
increase!


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-11 15:55:22 +08:00
Peipei
8c1a4dedf3 [Bugfix]modify the enable range of _merge_multimodal_embeddings patch (#3360)
### What this PR does / why we need it?
Modify the enable range of the _merge_multimodal_embeddings patch. The
current patch is only enabled for offline inference on the platform. For
online serving, a worker sub-process is added, and the patch is not
enabled within that sub-process.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: booker123456 <945658361@qq.com>
2025-10-11 08:37:07 +08:00
Peipei
c4b976af1a [Model][VLM][Patch]Modify ascend affinity _merge_multimodal_embeddings (#3071)
### What this PR does / why we need it?

This PR aims to address the incompatibility of the `.masked_scatter_`
operation in the current `_merge_multimodal_embeddings` function on
Ascend. For now, it reverts to the previous version of the CPU
operation, which can be executed asynchronously on the device side to
enhance performance.

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: booker123456 <945658361@qq.com>
2025-09-24 10:25:28 +08:00
wangxiyuan
7d6d9449a8 [Misc] Move lora patch file into lora module (#2797)
Clean up the unneeded file in the patch module. Updating the lora support
list in vLLM Ascend is sufficient, so there is no need to patch vLLM.


- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-08 21:42:12 +08:00
leo-pony
807f0895b2 Bump torch version to 2.7.1 (#1562)
### What this PR does / why we need it?
Bump torch version to 2.7.1 and clean up the infer schema patch
https://github.com/vllm-project/vllm-ascend/commit/857f489
(https://github.com/vllm-project/vllm-ascend/pull/837). This patch
also depends on https://github.com/vllm-project/vllm-ascend/pull/1974

### Does this PR introduce any user-facing change?
No

#### How was this patch tested?
CI passed

torch-npu 2.7.1rc1 install guide:
https://gitee.com/ascend/pytorch/tree/v2.7.1/
Install dependencies:
```
pip3 install pyyaml
pip3 install setuptools
```
install torch-npu:

Closes: https://github.com/vllm-project/vllm-ascend/issues/1866
Closes: https://github.com/vllm-project/vllm-ascend/issues/1390


- vLLM version: v0.10.0
- vLLM main:
9af654cc38

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-08-05 08:43:24 +08:00
wangxiyuan
9b67c87b14 [Refactor]Refactor sampler (#2050)
Refactor the Sampler implementation from patching to inheriting from the
vLLM Sampler interface.

Next step: Make the op `TopKTopPSampler` in vLLM support the custom op
registration mechanism

- vLLM version: v0.10.0
- vLLM main:
61a6905ab0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-30 08:47:22 +08:00
Ronald1995
32a9c5f694 [Feature]: implement the fusion of allreduce and matmul in prefill phase when tp is enabled (#1926)
### What this PR does / why we need it?
vLLM's RowParallelLinear forward function executes matmul and allreduce
separately. This PR uses torch_npu.npu_mm_all_reduce_base to
execute them in a single fused kernel, which yields a 20%
performance gain in eager mode.
### Does this PR introduce _any_ user-facing change?
This PR introduces a new environment variable,
`VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE`, to control whether the feature is
enabled.
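Gating a code path on this variable could look like the sketch below. It is an assumption that the value "1" enables the feature and that it is off by default; `matmul_allreduce_enabled` and `pick_linear_impl` are hypothetical helpers, and the returned strings are placeholders for the fused and default paths.

```python
import os


def matmul_allreduce_enabled() -> bool:
    """Read the switch from the environment; treating "1" as enabled is an assumption."""
    return os.getenv("VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE", "0") == "1"


def pick_linear_impl() -> str:
    # The two names are placeholders for the fused and the default code path.
    return "fused_matmul_allreduce" if matmul_allreduce_enabled() else "default"


os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"
print(pick_linear_impl())
```

Reading the variable at call time, as here, lets a test toggle the behavior without restarting the process.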

### How was this patch tested?
The patch is covered by a new unit test file, `test_patch_linear.py`.


- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-07-28 15:13:37 +08:00
Yikun Jiang
17a430f7b8 Upgrade vLLM to v0.10.0 (#1927)
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed

- vLLM version: v0.9.2
- vLLM main:
7728dd77bb

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-26 15:43:29 +08:00
Mengqing Cao
8cfd257992 [Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681)
### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this because there are no
relevant scenarios that use ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where experts need to be sliced.

This is a part of #1422 backport.

Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154

### Does this PR introduce _any_ user-facing change?
We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in
vllm instead.

### How was this patch tested?
CI passed with new added and existing test.


- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-21 09:08:04 +08:00
Shanshan Shen
f9e2e9bb31 [Misc][V0 Deprecation] Remove Draft Model Runner Used for V0 Spec Decode (#1810)
### What this PR does / why we need it?
Remove draft model runner used for V0 spec decode.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
34cda778a0

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 10:51:23 +08:00
Shanshan Shen
a929699e98 [Misc][V0 Deprecation] Remove multi-step worker (#1809)
### What this PR does / why we need it?
Remove multi-step worker

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:48:47 +08:00
Pr0Wh1teGivee
d13fb0766e [Perf] add patch to optimize apply_topk_topp (#1732)
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
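For reference, the semantics of top-k followed by top-p filtering can be sketched in plain Python. This is a simplified illustration, not the optimized kernel from this PR: it works on a probability list rather than logits, applies top-p to the original probabilities without renormalizing after top-k, and the name `top_k_top_p_filter` is hypothetical.

```python
def top_k_top_p_filter(probs, k, p):
    """Return the token ids that survive top-k then top-p filtering.

    probs: list of probabilities indexed by token id (assumed to sum to 1).
    """
    # Rank token ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= p:
            break
    return kept


probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(top_k_top_p_filter(probs, k=3, p=0.7))
```

Sampling then draws only from the surviving token ids; everything else is masked out.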
### Does this PR introduce _any_ user-facing change?
Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature
### How was this patch tested?
e2e & ut

- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-07-11 15:32:02 +08:00
wangxiyuan
830332ebfc Clean up v0.9.1 code (#1672)
vllm has released 0.9.2. This PR drops 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
yuancaoyaoHW
e7efc7e7e7 [BugFix] Remove not using patch_eagle.py for CI. (#1385)
### What this PR does / why we need it?
This PR aims to address a long-standing **CI bug** and remove unused
code. The specific changes include:

1. **Fixing CI Bug**: Resolves the root cause of CI test failures or
instability. This often stems from incorrect environment configurations,
dependency version conflicts, or flawed test script logic. This fix
ensures the reliability and consistency of the CI pipeline.
2. **Removing `patch_eagle.py`**: Deletes the `patch_eagle.py` file,
which is no longer utilized by the project. This file was likely legacy
code, experimental code, or its functionality has since been replaced by
other modules. Its removal helps reduce codebase complexity, improves
maintainability, and prevents potential confusion.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.

### How was this patch tested?
CI passed. Specifically:

1. **Existing CI Pipelines Passed**: After fixing the CI bug, all
existing CI tests and pipelines were verified to run correctly and pass
successfully.
2. **Code Cleanup Verified**: Following the removal of `patch_eagle.py`,
it was ensured that any related functional modules (if applicable)
continue to work as expected, without introducing new regressions. This
was typically verified by running the project's main test suite.

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-25 20:36:05 +08:00
wangxiyuan
9cbce423ce [MISC] Remove useless patch (#1366)
### What this PR does / why we need it?
`stateless_init_dp_group` in vllm works with non-cuda platform now.
Remove this useless patch.

Which was introduced in vllm-ascend by
e74331a1ed
(v0.8.4rc2)
vLLM upstream merged:
3e472d882a
(v0.8.0)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-24 10:05:59 +08:00
wemaster
339d6894f6 [CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) tried
to fix this problem, but the fix broke the ngram function. This PR
fixes the ngram function. **PS**: Q: Why was the ngram problem not
found when PR1109 was merged? A: The newly introduced
problem only appears when tp>1, and the CI test cases all use tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
keep CI from taking too long, including the eagle speculative UTs, which
left the eagle function uncovered by CI. This PR adds
`test_eagle_correctness.py` back.
3. For the reason mentioned in 2, the current version of Eagle
has a problem, which this PR locates and fixes. It occurred because vllm's
`draft_model_runner.py` was changed and vllm-ascend was not synchronized
in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. This PR splits them into two directories: spec_decode_v0 and
spec_decode_v1.
5. `vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vllm, so they are removed in this PR.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
Mengqing Cao
96fa7ff63b [DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix rank set in DP scenario. The new poc version of torch-npu supports
setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, so we can use the
rank set in `DPEngineCoreProc` directly instead of calculating the local
rank across dp by hand in the patched `_init_data_parallel`

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232


### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
wangxiyuan
4f5964420e [CI] Upgrade vllm to 0.9.1 (#1165)
1. upgrade vllm to 0.9.1. 0.9.0 is not supported for main branch now.
keep doc to 0.9.0 until we release the first 0.9.1 release.
2. disable V0 test for PR
3. move actionlint check to lint job

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-11 16:33:11 +08:00
wangxiyuan
95414bae70 [CI] Run e2e after pre check pass (#1132)
Make sure the lint test passes before starting the e2e test, to save
compute resources.

Updated the patch doc to make sure the CI works as expected.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-10 17:18:09 +08:00
sherie
908a851a77 optimize the funtion of computing topk and topp in sampler. (#970)
### What this PR does / why we need it?
Optimize the performance of calculation logic in sampler and deepseekv2.

### Does this PR introduce _any_ user-facing change?
Added VLLM_ENABLE_TOPK_OPTIMZE config in sampler

### How was this patch tested?
pytest test_sampler.py

Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Co-authored-by: ZhengWG <zwg0606@gmail.com>
2025-06-05 16:42:18 +08:00
wangxiyuan
f6e5decc10 [CI] upgrade to vllm 0.9.0 (#959)
Upgrade to vllm 0.9.0.
0.8.5 will not be supported any more.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-28 21:18:41 +08:00
jiangpeng
df58fb80ee Spec decode support for V1 Engine (#874)
### What this PR does / why we need it?
Add spec decode support for the V1 Engine.
- Currently, Ascend does not support the triton kernel, so PyTorch is used
to rewrite the `rejection_sampler.py` triton kernel. However, the PyTorch
version does not perform as well as Triton, so Ascend C will be used to
implement the function in the future.
- Currently, spec decode supports only the ngram algorithm. The eagle
algorithm needs to be further adapted.
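The idea behind ngram-based drafting can be sketched as follows. This is a generic illustration of the technique, not the proposer used in vLLM or vllm-ascend; `propose_ngram_draft` and its parameters are hypothetical names.

```python
def propose_ngram_draft(tokens, n, num_draft):
    """Propose draft tokens by matching the last `n` tokens earlier in the context.

    Returns the tokens that followed the most recent earlier occurrence of the
    trailing n-gram, or an empty list when there is no match.
    """
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards over earlier start positions, skipping the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + num_draft]
    return []


context = [5, 6, 7, 8, 9, 5, 6]
print(propose_ngram_draft(context, n=2, num_draft=3))
```

The target model then verifies the proposed tokens, and rejection sampling decides how many of them are accepted.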
### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Tested by `tests/singlecard/spec_decode/e2e/test_v1_spec_decode.py` and
`tests/sample/test_rejection_sampler.py`, which cover the basic functions of
the rejection sampler and the e2e behavior of spec decode.

Signed-off-by: ponix-j <657511300@qq.com>
2025-05-23 14:25:46 +08:00
Yikun Jiang
afe1767c17 [Core] Cleanup triton patch which has been fixed in vllm (#764)
### What this PR does / why we need it?
- Revert "Re-patch TritonPlaceholder on main to make CI happy (#753)"
because upstream main CI already merged:
https://github.com/vllm-project/vllm/pull/17446
- Keep 0.8.5.post1 compatible

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-06 18:52:15 +08:00
Yikun Jiang
d7e1110c8e Re-patch TritonPlaceholder on main to make CI happy (#753)
### What this PR does / why we need it?
Re-patch TritonPlaceholder on main to make CI happy
- Add the triton patch back until
https://github.com/vllm-project/vllm/pull/17446 is resolved
- Move patch_main before patch_common to resolve the minicpm triton import
issue
- Add `0.8.5` and `0.8.5.post1` so the patch works on all 0.8.5 versions
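
The ordering and version matching described above can be sketched roughly as follows. This is a minimal illustration, not the actual vllm-ascend patch loader: `select_patches`, `SUPPORTED_RELEASES`, and the `patch_0_8_5` group name are hypothetical, and treating every other version as main is an assumption.

```python
from typing import List

# Release versions that share one version-specific patch group (from the list above).
SUPPORTED_RELEASES = ("0.8.5", "0.8.5.post1")


def select_patches(vllm_version: str) -> List[str]:
    """Return patch groups in the order they should be applied.

    The version-specific group always comes before patch_common, mirroring
    the "patch_main before patch_common" ordering described above.
    """
    if vllm_version in SUPPORTED_RELEASES:
        # Hypothetical name for the release-specific patch group.
        return ["patch_0_8_5", "patch_common"]
    # Assumption: any other version is treated as main.
    return ["patch_main", "patch_common"]
```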

Related:
- https://github.com/vllm-project/vllm-ascend/pull/704
- https://github.com/vllm-project/vllm-ascend/pull/690

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
All CI passed, including main

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-05 23:22:24 +08:00
wangxiyuan
f8350569e6 [CI] upgrade vllm to 0.8.5 (#715)
1. Upgrade vllm to 0.8.5
2. Drop 0.8.4 support
3. Keep the docs at 0.8.4rc2 until we release 0.8.5

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-30 09:15:50 +08:00
wangxiyuan
95e7aa4736 [Platform] format platform to make it more clear (#610)
Platform should only contain functions that are derived from vllm. This PR
moves the unrelated functions to the right place to make platform
clearer.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-04-30 09:03:10 +08:00
wemaster
54c0e63df7 [MTP] follow custom deepseek modeling changes to support graph mode (#636)
### What this PR does / why we need it?

The custom deepseek modeling was changed to support graph mode in
https://github.com/vllm-project/vllm-ascend/pull/585, so this PR makes the
corresponding changes to the custom deepseek_mtp modeling.

Some modifications for k>1 were not carried over by
https://github.com/vllm-project/vllm-ascend/pull/429; this PR adds them.

To better cover the MTP feature in the vllm-ascend repository, I added
test cases for graph mode (torchair), but they are skipped for now because
torchair cannot correctly clean up memory in vllmrunner.

I also added test cases for MTP quantization weights, but the test weights
are not ready yet, so these cases are skipped and will be enabled once the
quantized test weights are available.

https://github.com/vllm-project/vllm-ascend/pull/648 did not completely
fix the sample change issue
(https://github.com/vllm-project/vllm-ascend/issues/660), so this PR adds
the relevant changes.

### Does this PR introduce _any_ user-facing change?
You can now use MTP with DeepSeek V3/R1 float or quantized weights in
eager mode as follows:
```python
from vllm import LLM

llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    enforce_eager=True,
    trust_remote_code=True,
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

Or use MTP with DeepSeek V3/R1 float or quantized weights in graph mode
(torchair):
```python
from vllm import LLM

llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    additional_config={
        'enable_graph_mode': True,
    },
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

Additional notes:
1. k>1 is now supported, so you can set num_speculative_tokens > 1 if
there is sufficient spare compute;
2. MTP is not supported in V1; we will support it once vLLM does in
https://github.com/vllm-project/vllm/issues/13500.
3. If MTP fails with a `segmentation fault`, you can follow the v0.7.3
patch https://github.com/vllm-project/vllm-ascend/pull/236: see the
method `__npu_async_metrics_collector_init__` in the file
`vllm_ascend/patch/patch_metrics.py`.
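
To try k>1 in either mode without repeating the full argument list, the two configurations from this description can be generated by a small helper. This is only a convenience sketch: `build_mtp_kwargs` is a hypothetical name, and the values are copied from the two `LLM(...)` configurations in this PR description.

```python
from typing import Any, Dict


def build_mtp_kwargs(num_speculative_tokens: int = 1,
                     graph_mode: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for LLM(...) matching the examples in this PR.

    Hypothetical helper: eager mode sets enforce_eager, graph mode (torchair)
    sets additional_config instead.
    """
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be >= 1")
    kwargs: Dict[str, Any] = {
        "model": "wemaster/deepseek_mtp_main_random_bf16",
        "tensor_parallel_size": 2,
        "speculative_config": {
            "num_speculative_tokens": num_speculative_tokens,
        },
        "trust_remote_code": True,
        "disable_log_stats": False,
        "gpu_memory_utilization": 0.8,
        "max_model_len": 64,
    }
    if graph_mode:
        kwargs["additional_config"] = {"enable_graph_mode": True}
    else:
        kwargs["enforce_eager"] = True
    return kwargs
```

With vLLM installed, this would be used as `LLM(**build_mtp_kwargs(num_speculative_tokens=2))` after `from vllm import LLM`.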

### How was this patch tested?
Passed local testing and CI.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-04-28 21:18:53 +08:00
Mengqing Cao
ba3d8aae94 [Model][MiniCPM] support MiniCPM (#645)
### What this PR does / why we need it?
This PR adds MiniCPM support on the main branch. See
https://github.com/vllm-project/vllm-ascend/pull/164


### How was this patch tested?
Tested locally with MiniCPM

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-04-27 11:27:24 +08:00
Pleaplusone
e74331a1ed Add dp initialize patch with hccl backend (#626)
### What this PR does / why we need it?
Add a DP stateless process group initialization path with the HCCL backend
as a vllm-ascend patch.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-04-23 15:47:51 +08:00
Shanshan Shen
4a0ce3660e [Misc] Remove some parts of metrics patch (#603)
### What this PR does / why we need it?
Remove some parts of the metrics patch, since the hard-coded `cuda` has
been fixed by https://github.com/vllm-project/vllm/pull/14411.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-04-22 18:45:21 +08:00