Commit Graph

126 Commits

Author SHA1 Message Date
Li Wang
2f50304c19 [Bugfix] Add get_supported_tasks interface to fix broken CI (#2023)
### What this PR does / why we need it?
Added `get_supported_tasks` interface to adapt to vllm
[changes](46d81d6951 (diff-80ee7e2a62f9dcfbb8a312dc4e3948557e97ef187290daebbcae1e28596bda29))
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main:
5ac3168ee3

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-26 08:20:21 +08:00
wangxiyuan
846555cdb5 [Misc] Clean up uesless code in attention (#1933)
Before do attention module refactor, we can do some code cleanup to make
the next step easier.

What this PR does:

1. remove uesless `common_prefix_len` for attention builder
2. remove uesless `is_only_prefill` and `num_input_tokens` in attention
metadata.
3. remove `CommonAttentionMetadata` and ues `query_start_loc` instead,
`CommonAttentionMetadata` is over designed and uesless
4. update the attention backend input parameters to keep the same as
vLLM.
5. Rename attention name to the same style with `ASCEND` prefix

- vLLM version: v0.9.2
- vLLM main:
107111a859

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-24 10:23:34 +08:00
Mengqing Cao
3aa3b46bfe [V1][PP] Support pp with ray backend in V1 (#1800)
### What this PR does / why we need it?
Support pipeline parallel with ray backend in V1Engine.

Fixes #1751

### Does this PR introduce _any_ user-facing change?
Users could specify ray as distributed backend when inferencing with pp

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.9.2
- vLLM main:
32142b3c62

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-23 14:52:52 +08:00
Li Wang
33e1ea4d1a [CI] Fix broken CI (#1915)
### What this PR does / why we need it?
Fix [#21227](https://github.com/vllm-project/vllm/pull/21227) to make ci
happy

- vLLM version: v0.9.2
- vLLM main:
6b46c4b653

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-22 08:38:30 +08:00
wangxiyuan
7265dc090d [2/4][Refactor] Refactor torchair utils (#1892)
There is a lot torchair specified logic in common code. It results hard
code maintenance. We will create a new torchair module to launch
torchair related logic there. I plan to add 4 PR.

1. Refactor worker
2. Refactor utils (this PR)
- simple change that move all torchair related util function to torchair
module
3. Refactor model_runner
4. Refactor attention

- vLLM version: v0.9.2
- vLLM main:
8188196a1c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-21 19:43:30 +08:00
Shanshan Shen
957b0b611f [Misc][V0 Deprecation] Remove V0 Model Runner (#1823)
### What this PR does / why we need it?
Remove V0 model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
7ba34b1241

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-21 16:35:50 +08:00
wangxiyuan
af56ae3ed1 [1/4][Refactor] Refactor torchair worker (#1885)
There is a lot torchair specified logic in common code. It results hard
code maintenance. We will create a new torchair module to launch
torchair related logic there. I plan to add 4 PR.

1. Refactor worker (this PR)
- create torchair module and move torchair related code in worker to the
new module
3. Refactor utils
4. Refactor model_runner
5. Refactor attention


- vLLM version: v0.9.2
- vLLM main:
8188196a1c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-21 11:50:46 +08:00
Mengqing Cao
8cfd257992 [Dist][EP] Remove ETP/EP maintained in vllm-ascend (#1681)
### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this as there is no
relevant scenarios to use ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the expert is needed to be sliced

This is a part of #1422 backport.

Fixes https://github.com/vllm-project/vllm-ascend/issues/1396
https://github.com/vllm-project/vllm-ascend/issues/1154

### Does this PR introduce _any_ user-facing change?
We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in
vllm instead.

### How was this patch tested?
CI passed with new added and existing test.


- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-21 09:08:04 +08:00
wangxiyuan
2b726d8f90 [CI] Fix broken CI (#1889)
1. vLLM commit
45badd05d0
changed the pooling check logic which broken vLLM Ascend.
2. vLLM commit
3e04107d97
requires higher version of transformers. The transformers version bug
has been fixed by
e936e401de.
We can safe to remove the version limit now.
3. vLLM commit
217937221b
added a new input `enable_eplb` for FusedMoe Ops

This PR fix the broken CI.


- vLLM version: v0.9.2
- vLLM main:
6a971ed692

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-20 02:11:57 +08:00
Shanshan Shen
ab68d31a24 [Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)
### What this PR does / why we need it?
This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
5895afd780

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-19 09:42:32 +08:00
Li Wang
f9dfde02fd [Bugfix] Fix broken CI (#1848)
### What this PR does / why we need it?
- Fix broken commit by
[#20927](https://github.com/vllm-project/vllm/pull/20927)
- Fix broken commit by
[#20466](https://github.com/vllm-project/vllm/pull/20466)
- TODO: more fully adapt to the upstream reconstruction, let's first
make CI happy

- vLLM version: v0.9.2
- vLLM main:
11dfdf21bf

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-17 20:10:12 +08:00
Shanshan Shen
c66b0827a7 [Misc][V0 Deprecation] Remove Pooling Model Runner (#1824)
### What this PR does / why we need it?
Remove pooling model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
d31a647124

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 17:48:21 +08:00
Shanshan Shen
06655002c5 [Misc][V0 Deprecation] Remove V0 Worker (#1821)
### What this PR does / why we need it?
Remove V0 worker.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
6cbc4d4bea

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 14:07:17 +08:00
Shanshan Shen
b005def0a5 [Misc][V0 Deprecation] Remove Multi-Step Model Runner (#1820)
### What this PR does / why we need it?
Remove multi-step model runner.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.



- vLLM version: v0.9.2
- vLLM main:
34cda778a0

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 14:06:49 +08:00
Shanshan Shen
f9e2e9bb31 [Misc][V0 Deprecation] Remove Draft Model Runner Used for V0 Spec Decode (#1810)
### What this PR does / why we need it?
Remove draft model runner used for V0 spec decode.

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
34cda778a0

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-16 10:51:23 +08:00
Shanshan Shen
a929699e98 [Misc][V0 Deprecation] Remove multi-step worker (#1809)
### What this PR does / why we need it?
Remove multi-step worker

This PR is a part of
https://github.com/vllm-project/vllm-ascend/issues/1620.

- vLLM version: v0.9.2
- vLLM main:
235bfd5dfe

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-07-15 19:48:47 +08:00
wangxiyuan
494b0f474f [CI]Fix broken CI (#1773)
This PR fixed the broken CI. It require
https://github.com/vllm-project/vllm/pull/20900 merged first.

- vLLM version: v0.9.2
- vLLM main:
e8cc53af5e

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 00:54:20 +08:00
weiguihua2
aa4240c67f Support pipeline parallel in V1 Engine (#1700)
### What this PR does / why we need it?
This patch supports pipeline parallel in V1 Engine

### Does this PR introduce _any_ user-facing change?
Yes, users can run PP in V1

### How was this patch tested?
Manully test














- vLLM version: v0.9.2
- vLLM main:
31d5c1797f

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-07-11 15:30:51 +08:00
Li Wang
c7446438a9 [1/N][CI] Move linting system to pre-commits hooks (#1256)
### What this PR does / why we need it?

Follow vllm-project/vllm lint way:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml

Enable pre-commit to avoid some low level error  AMAP.

This pr is one step of #1241, The purpose is make linting system more
clear and convenient, on this step, Mainly did the following things:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.

TODO: 
- clang-format(check for csrc with google style)
need clean code, disable for now 
- pymarkdown
need clean code, disable for now 
- shellcheck
need clean code, disable for now 

### Does this PR introduce _any_ user-facing change?

Only developer UX change:

https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally

```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```

### How was this patch tested?

CI passed with new added/existing test.

Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 14:17:15 +08:00
wangxiyuan
b979ee353d [Misc] Code clean up (#1679)
Make model_runner_v1 more readable

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 14:33:40 +08:00
wangxiyuan
392fd7239b [Misc] Add attention mask (#1673)
Move attention mark from V0 to common place.
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 09:12:03 +08:00
wangxiyuan
cc1588be50 [Misc] Code clean up (#1674)
Remove useless function
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:54:12 +08:00
wangxiyuan
830332ebfc Clean up v0.9.1 code (#1672)
vllm has released 0.9.2. This PR drop 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
NeverRaR
71de52d3a9 feat: add kv cache memory cache and skip dynamo guard (#1549)
### What this PR does / why we need it?

1、Sometimes loading torchair cache will fail because of the floating of
npu memory, so this pr add a new cache to save the old kv cache bytes to
avoid the possible crash while loading the torchair graph cache.
2、When caching is enabled and does not exist, the first compilation
introduces the overhead of Dynamo Gurad. So in this case, we will
compile them directly twice to skip them (This will bring 3-4 ms of tpot
optimization)

### Does this PR introduce _any_ user-facing change?
Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to
control kv cache floating tolerance

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:37:14 +08:00
Vincent Yuan
eb390545ec [Performance] Disable JIT and nd2nz to improve performance for Altlas 300I series (#1591)
### What this PR does / why we need it?

Since running on Altlas 300I Duo was initial supported after #1333 ,
this PR will disable the JIT compiler for the 310P and changed the data
format to NZ for the weight in the vocabulary embedding and QKV
projection layers, which help improving performance.

See #1563 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Test manually:
https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
2025-07-05 16:29:21 +08:00
Mengqing Cao
dd22ac38b2 [CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136)
### What this PR does / why we need it?
1. run deepseek acc ut per pr --- multicard CI time increased by 9 min
2. run spec decode e2e test on v1 per pr --- singlecard CI time
increased by 3 min (partly is disabled due to not work now)
~~3. align the output of whether dbo is enabled or not~~
    The generated results with and without dbo cannot be aligned.

https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136
4. skip V0 mtp test due to failure in
https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816
5. fix some version conflicts
### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-04 18:05:45 +08:00
Angazenn
a5f33590d3 [CORE]initial support for torchair with non-mla backend (#1506)
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM-Ascend only allows deepseek to use torchair. Now we
can also use it with pangu. Besides, we add a support model list to
control which type of models that can use torchair.

### How was this patch tested?
We have test it with PanguProMoE on both 800IA2 and 300I Duo platforms,
and model generates answer normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
Angazenn
9fbd8017c0 [Quantization]300I Duo support w8a8 quantization (#1560)
### What this PR does / why we need it?
This pr supports w8a8 on 300I Duo platform. The main change is to use
`npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
offline inference on 310p runs normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:12:46 +08:00
wangxiyuan
a45dfde283 [CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602)
Make CI happy

1.
c1909e7e8c
changed moeConfig init way
2.
48fb076cbc
changed input batch logic.

This PR address these change to vllm-ascend.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1600

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-03 18:36:17 +08:00
Li Wang
30bf7014d0 [Bugfix] Add func swap_states to fix MLA attention (#1580)
### What this PR does / why we need it?
mla attention still using the gpu_input_batch's attr:`swap_states`, which will lead to
an error `AttributeError: 'InputBatch' object has no attribute 'swap_states'`

This PR fixed the mla input patch error
### How was this patch tested?
will be tested by #1136

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 17:42:53 +08:00
Zhu Yi Lin
6b80c5acba Fix W8A8 fused moe bug (#1529)
### What this PR does / why we need it?
1. drop some useless code for w8a8 fusedmoe
2. Add in8 kv cache check
3. Add more ut.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: zhuyilin <809721801@qq.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-02 16:40:51 +08:00
wangxiyuan
641a4e6092 [CI] Cache sampled token ids in model runner to fix CI error (#1573)
### What this PR does / why we need it?
vllm change
7f280d69c9
break vllm-ascend.

This PR Fix the broken CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/1572

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-02 12:11:14 +08:00
Pleaplusone
0e43813120 [ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546)
### What this PR does / why we need it?

This PR (adapted from
2863befce3)
updates the CachedRequestData definition to use a single instance shared
across all requests in a batch, instead of creating a new instance per
request.

Found ci boken by the vllm's model_runner change: `ERROR 07-01 09:53:53
[core.py:521] TypeError: 'CachedRequestData' object is not iterable`,
Modify the model_runner to fix it.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
pass ci will verify this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 06:05:21 +08:00
Shanshan Shen
8013634e9c [Structured Output] Remove redundant check for grammar_bitmask (#1459)
### What this PR does / why we need it?
Remove redundant check since we have check this at
https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:39:19 +08:00
Li Wang
5f8241c25c [V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
Change as little existing code as possible to add v1 pooling task's
support, notice that i move down the `vllm.v1.worker.gpu_input_batch` to
vllm-ascend, Considering the frequent changes in upstream interfaces, in
order to decouple, so i move it here
### How was this patch tested?
CI passed with new added/existing test, and I have a simple test was
first conducted locally which is adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, just like
bellow:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
Zhu Yi Lin
b308a7a258 support pangumoe w8a8c8 and docs (#1477)
### What this PR does / why we need it?
support pangu moe w8a8c8

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

Signed-off-by: zhuyilin <809721801@qq.com>
2025-06-28 18:51:07 +08:00
Mengqing Cao
5f4391652f [PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483)
### What this PR does / why we need it?
Support prompt logprobs in V1. This also enable lm_eval to test accuracy
on V1

### Does this PR introduce _any_ user-facing change?
support prompt logprobs output

### How was this patch tested?
CI passed with accuracy test.

Using lm_eval, which use prompt logprobs as output to test accuracy, to
test:
```python
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```
After this pr, the accuracy test results of `Qwen/Qwen2.5-7B-Instruct`
on V1 is:
```bash
|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                            |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1043

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 09:38:52 +08:00
wangxiyuan
5968dff4e0 [Build] Add build info (#1386)
Add static build_info py file to show soc and sleep mode info. It helps
to make the code clean and the error info will be more friendly for
users

This PR also added the unit test for vllm_ascend/utils.py

This PR also added the base test class for all ut in tests/ut/base.py

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:14:43 +08:00
yiz-liu
2690697caa [Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416)
### What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds
asserts in the `GatherV3` operator.

Currently, in
[`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124),
the `position` tensor may contain values that exceed the dimensions of
the attention mask, triggering a `GatherV3` boundary check failure.
These invalid indices originate from stale “dirty” entries left over in
`position` due to padding logic in the ACL graph. Specifically, in
[`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989),
the variable `num_input_tokens` is always greater than or equal to
`total_num_scheduled_tokens`, so any positions not explicitly cleared
from a previous batch will persist and cause this sporadic error.

BTW, in the original vLLM implementation, masks are constructed
internally using other args, so these lingering values do not surface.
However, on the Ascend platform—where split-fuse attention requires
externally supplied masks—these residual indices become critical and
lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset or zero out all unused entries in the
`position` tensor before passing it to `GatherV3`, ensuring that every
index lies within the valid range of the attention mask.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

### Does this PR introduce _any_ user-facing change?
No


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-26 09:27:43 +08:00
wangxiyuan
ca884ef86d [Misc] Clean up uesless code for LLM initialize (#1373)
This PR aims to clean up the useless code for LLM setup. It helps to
make the code more clear.
1. remove useless `self.xxx` property
2. change `set_random_seed` to `seed_everything`
3. remove `set_custom_all_reduce`, it's only used for cuda

This is just a code clean. no change for any code logic.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 16:20:14 +08:00
Li Wang
5f5800ba42 [Bugfix] Sync MRotaryEmbedding interface change to recover CI (#1399)
### What this PR does / why we need it?

Sync MRotaryEmbedding interface change to recover main CI
(https://github.com/vllm-project/vllm/pull/19939)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-24 22:56:39 +08:00
wemaster
339d6894f6 [CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) wanted
to fix this problem. Unfortunately, the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why is there a problem
when ngram is not found when pr1109 is merged? A: The newly introduced
problem will only appear when tp>1, and the use cases on CI are all tp=1
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
avoid CI taking too long, including eagle speculative UTs, which made CI
unable to take care of the eagle function. I added
it(`test_eagle_correctness.py`) back in this PR
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed this problem. It was because vllm's
`draft_model_runner.py` was changed and vllm-ascend was not synchronized
in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. i found
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vllm, so i remove it in this pr.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
Yikun Jiang
097e7149f7 [Platform] Add initial experimental support for Altlas 300I series (#1333)
### What this PR does / why we need it?
Add initial experimental support for Ascend 310P, this patch squash
below PR into one to help validation:

- https://github.com/vllm-project/vllm-ascend/pull/914
- https://github.com/vllm-project/vllm-ascend/pull/1318
- https://github.com/vllm-project/vllm-ascend/pull/1327


### Does this PR introduce _any_ user-facing change?
User can run vLLM on Altlas 300I DUO series

### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit test missing because need a real 310P image to have the test,
will add in a separate PR later.
- Manually e2e test:
- Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B:
https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322
  - Pangu MGoE 72B


The patch has been tested locally on Ascend 310P hardware to ensure that
the changes do not break existing functionality and that the new
features work as intended.

#### ENV information

CANN, NNAL version: 8.1.RC1
> [!IMPORTANT]  
> PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 to support NZ
format and calling NNAL operators on 310P

#### Code example

##### Build vllm-ascend from source code

```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```

##### Run offline inference

```python
from vllm import LLM, SamplingParams
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

```

---------

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Vincent Yuan <farawayboat@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-21 09:00:16 +08:00
yuancaoyaoHW
00ae250f3c [V1][eagle3] Support eagle3 proposer for v1 (#1032)
### What this PR does / why we need it?
This PR implements the Eagle Pososer feature for vLLM v1, which enables
more efficient speculative decoding by using a draft model to predict
potential future tokens.
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising on the quality of generated text.

### Does this PR introduce any user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use Eagle Pososer by setting appropriate flags
in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.

### How was this patch tested?
CI passed with new unit tests added for the Eagle Pososer functionality.
- Benchmark tests were conducted comparing generation speed and quality
with and without Eagle Pososer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done using different prompt scenarios to verify
output quality remains consistent.
- we test accept rate on one Ascend 910B npu, The acceptance rate
results are basically consistent with those shown here:
https://github.com/vllm-project/vllm/pull/16937
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient GPU memory and operator
computation errors may occur. We will address this in subsequent
updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1004

---------

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-20 17:19:54 +08:00
yiz-liu
2c7dd85fd8 [Fix] Fix the token-wise padding mechanism (#1300)
### What this PR does / why we need it?
Fix the token-wise padding mechanism.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-20 14:46:17 +08:00
wangxiyuan
b350edae9a [UT] refactor test_expert_load_balancer and fix broken CI (#1293)
refactor test_expert_load_balancer to keep the ut code style

This PR also fixed the break change from
https://github.com/vllm-project/vllm/pull/16188/files#diff-e2942ece30a5c580437694ffb964bfc664b510c59244c08e5921b8f5cefb4280

This is just a quick fix. We'll support embedding on V1 later

Closes: https://github.com/vllm-project/vllm-ascend/issues/1299

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 01:02:52 +08:00
zzzzwwjj
23ca68d0c8 [refactor] Refactoring AscendFusedMoE (#1229)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
This PR is used for resolved [issue
1147](https://github.com/vllm-project/vllm-ascend/issues/1147)
1. Move fused_moe code into one file `fused_moe.py`.
2. Integrate branch conditions into function `get_fused_moe_state`.
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
1. This PR has removed the env `VLLM_ENABLE_MC2`, because I think this
env is useless, we can make judgments based on the current scenario
without this env, it will only increase complexity.
2. This PR has removed the env `USING_LCCL_COM`, because this env has
already expired.
3. `additional_config.expert_tensor_parallel_size` has already expired,
and now we also use parameter `enable_expert_parallel`, consistent with
the vLLM.
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-06-17 17:49:03 +08:00
Mengqing Cao
96fa7ff63b [DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 (#1235)
### What this PR does / why we need it?
1. Fix rank set in DP scenario. The new poc version of torch-npu support
setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, thus we could use the
rank set in `DPEngineCoreProc` directly instead of calculating local
rank across dp by hand in the patched `_init_data_parallel`

Closes: https://github.com/vllm-project/vllm-ascend/issues/1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: https://github.com/vllm-project/vllm-ascend/pull/1242
Closes: https://github.com/vllm-project/vllm-ascend/issues/1232


### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-06-16 23:09:53 +08:00
zhuo97
f5404dc650 Fix the device error when using ray as vllm-acend backend (#884)
1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C

Signed-off-by: zhuo97 <1103045176@qq.com>
2025-06-16 21:03:16 +08:00
fems14
ab5d110fcc vllm-ascend support chunked prefill (#1172)
### What this PR does / why we need it?
vllm-ascend support chunked prefill for MLA


---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-06-14 22:31:16 +08:00