Commit Graph

442 Commits

Author SHA1 Message Date
Angazenn
5c53cbaf2a [BugFix] Fix bugs when initializing communication groups with dp on 300I Duo (#1478)
### What this PR does / why we need it?
This PR fixes a bug where broadcast was used with the cpu_group when
running DP.
The `broadcast310p` patch takes effect for both the cpu_group and the
device group, but we only need it for the device group. Hence a wrapper
is added that lets the cpu_group use the native torch broadcast, which
fixes the bug.
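
A minimal sketch of the wrapper idea (names below are illustrative, not the exact vllm-ascend code): dispatch on the process-group backend so CPU (gloo) groups keep the native broadcast while device groups get the 310P patch.

```python
import torch.distributed as dist

_native_broadcast = dist.broadcast

def _broadcast_310p(tensor, src, group=None, async_op=False):
    # Stand-in for the 310P-specific broadcast patch applied to
    # device (hccl) groups; the real patch lives in vllm-ascend.
    return _native_broadcast(tensor, src, group=group, async_op=async_op)

def broadcast_wrapper(tensor, src, group=None, async_op=False):
    # cpu_group is built on the gloo backend, so only non-gloo
    # (device) groups need the 310P workaround.
    if dist.get_backend(group) == "gloo":
        return _native_broadcast(tensor, src, group=group, async_op=async_op)
    return _broadcast_310p(tensor, src, group=group, async_op=async_op)
```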

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With this PR, DP on 310p runs normally and generates reasonable answers.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:07:52 +08:00
Mengqing Cao
2cf9c4c3a2 [CI/Build] Fix version conflict on transformers (#1490)
### What this PR does / why we need it?
Fix version conflict on transformers:
`pip._vendor.pkg_resources.ContextualVersionConflict: (transformers
4.53.0 (/usr/local/python3.10.17/lib/python3.10/site-packages),
Requirement.parse('transformers<4.53.0'), {'vllm-ascend'})`
Fix
https://github.com/vllm-project/vllm-ascend/actions/runs/15933263325/job/44947231642

### Does this PR introduce _any_ user-facing change?
Fix broken build

### How was this patch tested?
CI passed with the existing tests.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 15:11:04 +08:00
Mengqing Cao
5f4391652f [PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483)
### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to test
accuracy on V1.

### Does this PR introduce _any_ user-facing change?
Support prompt logprobs output.

### How was this patch tested?
CI passed with accuracy test.

Tested using lm_eval, which uses prompt logprobs to compute accuracy:
```python
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```
After this PR, the accuracy test result of `Qwen/Qwen2.5-7B-Instruct`
on V1 is:
```bash
|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                            |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1043

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 09:38:52 +08:00
Shanshan Shen
99e685532d [Doc] Add Qwen2.5-VL eager mode doc (#1394)
### What this PR does / why we need it?
Add Qwen2.5-VL eager mode doc.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-28 09:08:51 +08:00
Mengqing Cao
d59e7fa095 [CI] Pin transformers<4.53.0 and fix EPLB load_weights to make CI passed (#1482)
### What this PR does / why we need it?

- Fix vLLM EPLB break
e9fd658a73
by temporarily reverting load_weights to the [v0.9.1
version](07b8fae219).

- Fix transformers>=4.53.0 image processor break
Related: https://github.com/vllm-project/vllm-ascend/issues/1470

- Mirror torch_npu requirements to pyproject.toml

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-28 00:12:43 +08:00
Shanshan Shen
3687676fa7 [Doc] Add guidance on how to implement and register new models (#1426)
### What this PR does / why we need it?
Add guidance on how to implement and register new models.

Modified based on PR
https://github.com/vllm-project/vllm-ascend/pull/1126, thanks for the
contribution of @linfeng-yuan.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-27 16:46:49 +08:00
wangxiyuan
5571fb7118 [Misc] Add release checklist issue template (#1447)
Add the release checklist issue template.

Every release manager should create and follow the checklist to do the
release step by step.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:15:36 +08:00
wangxiyuan
5968dff4e0 [Build] Add build info (#1386)
Add a static build_info py file to show SoC and sleep mode info. It
helps keep the code clean and makes error messages friendlier for
users.
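
For illustration, such a generated file might look like the following (field names here are hypothetical; the real file is written at build time):

```python
# vllm_ascend/_build_info.py -- generated at build time (illustrative)
__soc_version__ = "ASCEND910B1"   # SoC the package was built for
__sleep_mode_enabled__ = True     # whether sleep-mode support was built in
```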

This PR also adds unit tests for vllm_ascend/utils.py.

This PR also adds the base test class for all UTs in tests/ut/base.py.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:14:43 +08:00
Li Wang
c563a08f0a [CI] Fix nightly benchmark (#1453)
### What this PR does / why we need it?
The performance benchmark workflow sometimes fails. This adds a warning
when a run fails and avoids uploading the dirty data from the failed
run.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-26 19:39:18 +08:00
Zesheng Zong
192dbbcc6e Optimize Patch developer guide (#1452)
### What this PR does / why we need it?
Fix some terms in the user guide.


Signed-off-by: zeshengzong <zesheng.zong@outlook.com>
2025-06-26 19:10:16 +08:00
wangyanhui-cmss
e5eea64b66 [CI/UT] Add ut for parallel_state.py (#1460)
### What this PR does / why we need it?
 Add ut for parallel_state.py

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
 python -m unittest  test_parallel_state.py

---------

Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-06-26 19:03:27 +08:00
Shanshan Shen
4e2daf5ab7 [Doc] Add qwen2-audio eager mode tutorial (#1371)
### What this PR does / why we need it?
Add qwen2-audio eager mode tutorial.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-26 16:56:05 +08:00
leo-pony
1025344912 Doc Enhancement: Single NPU (Qwen3-8B) aclgraph mode + eager mode (#1374)
### What this PR does / why we need it?
Doc enhancement: Single NPU (Qwen3-8B) aclgraph mode + eager mode.
Related RFC: https://github.com/vllm-project/vllm-ascend/issues/1248

### Does this PR introduce _any_ user-facing change?
No changes.


### How was this patch tested?
Preview

Signed-off-by: leo-pony <nengjunma@outlook.com>
2025-06-26 16:52:54 +08:00
sdmyzlp
53c2d58ae1 Handle with_prefill_across_dp for multistream mla (#1322)
### What this PR does / why we need it?
After #1094, decode might execute in non-compiled mode even when
`torchair_graph_config.enabled` is set, causing multistream MLA to fail,
since it assumes torchair-compiled mode for decode whenever
`torchair_graph_config.enabled == True`.
This PR relaxes that assumption to fix the issue.
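A hedged sketch of the relaxed assumption (function and flag names are illustrative):

```python
def decode_runs_in_compiled_graph(torchair_graph_enabled: bool,
                                  with_prefill_across_dp: bool) -> bool:
    # Before the fix this was effectively just `torchair_graph_enabled`;
    # decode may fall back to non-compiled mode whenever any DP rank
    # still has prefill work, so multistream MLA must account for that.
    return torchair_graph_enabled and not with_prefill_across_dp
```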

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested both offline and with the graph-mode MLA e2e test case.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-26 09:32:07 +08:00
yiz-liu
2690697caa [Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416)
### What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds
asserts in the `GatherV3` operator.

Currently, in
[`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124),
the `position` tensor may contain values that exceed the dimensions of
the attention mask, triggering a `GatherV3` boundary check failure.
These invalid indices originate from stale “dirty” entries left over in
`position` due to padding logic in the ACL graph. Specifically, in
[`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989),
the variable `num_input_tokens` is always greater than or equal to
`total_num_scheduled_tokens`, so any positions not explicitly cleared
from a previous batch will persist and cause this sporadic error.

BTW, in the original vLLM implementation, masks are constructed
internally using other args, so these lingering values do not surface.
However, on the Ascend platform—where split-fuse attention requires
externally supplied masks—these residual indices become critical and
lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset or zero out all unused entries in the
`position` tensor before passing it to `GatherV3`, ensuring that every
index lies within the valid range of the attention mask.
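
A minimal sketch of the fix, assuming a persistent `positions` buffer and the runner's scheduled-token count (names are illustrative):

```python
import torch

def clear_stale_positions(positions: torch.Tensor,
                          total_num_scheduled_tokens: int) -> None:
    # Slots past the scheduled tokens may still hold positions from a
    # previous, larger batch; zero them so every index handed to
    # GatherV3 stays within the attention-mask bounds.
    positions[total_num_scheduled_tokens:].zero_()
```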

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

### Does this PR introduce _any_ user-facing change?
No


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-26 09:27:43 +08:00
zhangxinyuehfad
06ccce1ddf [FOLLOWUP] fix name and format in accuracy test (#1288) (#1435)
### What this PR does / why we need it?
Fix the accuracy test:
1. Fix the accuracy report, e.g.
https://vllm-ascend--1429.org.readthedocs.build/en/1429/developer_guide/evaluation/accuracy_report/Qwen2.5-7B-Instruct-V0.html
2. Fix PR creation for the report

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-26 00:26:54 +08:00
Pr0Wh1teGivee
2fda60464c [Perf] Use fused ops npu_top_k_top_p (#1308)
### What this PR does / why we need it?
Use the fused op torch_npu.npu_top_k_top_p(logits, p, k) when p and k
are both not None; otherwise fall back to the original implementation.
The replacement takes place automatically when
`VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1`.

This patch uses `npu_top_k_top_p`, which requires
torch_npu >= 2.5.1.post1.dev20250619.
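
A simplified sketch of the dispatch (vLLM's real sampler handles per-request tensor `p`/`k`; the eager fallback here is a scalar toy version):

```python
import torch

def _unfused_top_k_top_p(logits: torch.Tensor, p, k) -> torch.Tensor:
    # Eager fallback: mask logits outside top-k, then outside the
    # top-p nucleus, per row.
    if k is not None:
        kth = torch.topk(logits, k, dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if p is not None:
        sorted_logits, sorted_idx = logits.sort(descending=True, dim=-1)
        probs = sorted_logits.softmax(dim=-1)
        cum_probs = probs.cumsum(dim=-1)
        # Drop tokens whose cumulative probability before them already
        # exceeds p (always keeps at least the top token).
        mask = (cum_probs - probs) > p
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(
            -1, sorted_idx, sorted_logits)
    return logits

def apply_top_k_top_p(logits, p, k):
    # Fused NPU kernel only when both p and k are given.
    if p is not None and k is not None:
        import torch_npu  # requires torch_npu >= 2.5.1.post1.dev20250619
        return torch_npu.npu_top_k_top_p(logits, p, k)
    return _unfused_top_k_top_p(logits, p, k)
```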

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested with DeepSeek R1; UTs passed.

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-06-25 20:59:06 +08:00
yuancaoyaoHW
e7efc7e7e7 [BugFix] Remove unused patch_eagle.py for CI. (#1385)
### What this PR does / why we need it?
This PR addresses a long-standing **CI bug** and removes unused code.
The specific changes include:

1. **Fixing the CI bug**: Resolves the root cause of the CI test
failures and instability, ensuring the reliability and consistency of
the CI pipeline.
2. **Removing `patch_eagle.py`**: Deletes the `patch_eagle.py` file,
which is no longer used by the project. Its removal reduces codebase
complexity, improves maintainability, and prevents confusion.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.

### How was this patch tested?
CI passed. Specifically:

1. **Existing CI pipelines passed**: After the fix, all existing CI
tests and pipelines run correctly.
2. **Code cleanup verified**: After removing `patch_eagle.py`, related
functional modules continue to work as expected with no new
regressions, verified by running the project's main test suite.

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-25 20:36:05 +08:00
sharonyunyun
941269a6c5 Adjust the communication method in graph mode (#1194)
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with
reduce_scatter + all_gather in the MLA layer's TP group, removing the
strided slice and all_gather in the MoE layer.
When tp > 1, this is enabled during the decode phase of graph mode when
enable_multistream_moe, MLA, use_v1, and MC2 are all in use.
According to end-to-end RL inference test results, this PR brings a 3%
gain in the decode stage.
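
A hedged sketch of the replacement (standalone helper for illustration; the actual change is wired into the MLA/MoE layers):

```python
import torch
import torch.distributed as dist

def all_reduce_via_rs_ag(x: torch.Tensor, group=None) -> torch.Tensor:
    # reduce_scatter + all_gather produces the same result as all_reduce
    # when dim 0 divides evenly by the TP world size, and the scattered
    # shard can feed per-rank work (e.g. MoE) without a strided slice.
    world_size = dist.get_world_size(group)
    shard = torch.empty(x.shape[0] // world_size, *x.shape[1:],
                        dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(shard, x, group=group)
    out = torch.empty_like(x)
    dist.all_gather_into_tensor(out, shard, group=group)
    return out
```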

**Before Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003)
Evaluation

![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7)

![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057)

**After Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e)
Evaluation

![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0)

![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4)

### Does this PR introduce _any_ user-facing change?
Users need to configure `enable_multistream_moe=True`.

### How was this patch tested?
Add e2e test cases to cover code logic

Signed-off-by: sharonyunyun <zhangying134@huawei.com>
2025-06-25 19:56:49 +08:00
wangxiyuan
205cb85a1e [Doc] Fix doc typo (#1424)
1. Fix a typo
2. Fix a 404 URL
3. Update the graph mode and additional config user guides

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 19:28:26 +08:00
wangxiyuan
ca884ef86d [Misc] Clean up useless code for LLM initialization (#1373)
This PR cleans up unused code in LLM setup to make the code clearer.
1. Remove unused `self.xxx` properties
2. Change `set_random_seed` to `seed_everything`
3. Remove `set_custom_all_reduce`; it is only used for CUDA

This is just a code cleanup; no code logic changes.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 16:20:14 +08:00
zhangxinyuehfad
0060886a37 [CI]Update accuracy report test (#1288)
### What this PR does / why we need it?
Update the accuracy report test:
1. Record commit hashes and GitHub links for both vllm and vllm-ascend
in accuracy reports
2. Add accuracy result verification checks to ensure output correctness
3. Create the PR via a forked repository workflow

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
dense-accuracy-test:
https://github.com/vllm-project/vllm-ascend/actions/runs/15745619485
create pr via forked repository workflow:
https://github.com/zhangxinyuehfad/vllm-ascend/actions/runs/15747013719/job/44385134080
accuracy report pr:
https://github.com/vllm-project/vllm-ascend/pull/1292

The current accuracy report is outdated; once this PR is merged and the
tests rerun, a new report will be generated and #1292 can be closed.


Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-06-25 14:10:34 +08:00
Li Wang
15df8be937 [Doc] Add sleep mode doc (#1295)
### What this PR does / why we need it?
Add sleep related doc and example

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-25 14:07:14 +08:00
wangxiyuan
e4e0b7af05 [Doc] Add patch doc (#1414)
1. Format the developer guide  content to make it more clear
2. Add the patch doc for developer guide

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 12:00:45 +08:00
Mengqing Cao
52317f92cb [DP] Tiny fix of dp and update example (#1273)
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix DP.

This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added the
arg `max_num_tokens_across_dp` when dp_size > 1.
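Illustratively, the metadata change looks like this (a reduced sketch, not the full AscendMetadata class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AscendMetadata:
    num_actual_tokens: int  # illustrative existing field
    # New arg: the max token count across all DP ranks, so every rank
    # pads its batch to the same length when dp_size > 1.
    max_num_tokens_across_dp: Optional[int] = None
```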

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-25 11:03:04 +08:00
Mengqing Cao
c1c5d56255 [Doc] Update FAQ and add test guidance (#1360)
### What this PR does / why we need it?
- Add test guidance
- Add reduce layer guidance
- Update FAQ on deterministic calculation

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-25 09:59:23 +08:00
Li Wang
5f5800ba42 [Bugfix] Sync MRotaryEmbedding interface change to recover CI (#1399)
### What this PR does / why we need it?

Sync MRotaryEmbedding interface change to recover main CI
(https://github.com/vllm-project/vllm/pull/19939)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-24 22:56:39 +08:00
liziyu
6ed3f00427 [Doc] remove environment variable VLLM_ENABLE_MC2 (#1406)
### What this PR does / why we need it?
Remove the unused environment variable VLLM_ENABLE_MC2 from the docs.


Signed-off-by: liziyu <liziyu16@huawei.com>
2025-06-24 21:18:10 +08:00
Mengqing Cao
20767a043c [CI/UT] Fix disaggregated prefill ci (#1313)
### What this PR does / why we need it?
Use eager mode to run disaggregated prefill ci

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with the existing tests.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-24 17:11:00 +08:00
wangxiyuan
9cbce423ce [MISC] Remove useless patch (#1366)
### What this PR does / why we need it?
`stateless_init_dp_group` in vllm now works on non-CUDA platforms, so
this patch is useless and can be removed.

The patch was introduced in vllm-ascend by
e74331a1ed
(v0.8.4rc2); vLLM upstream merged the fix in
3e472d882a
(v0.8.0).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-24 10:05:59 +08:00
lyj-jjj
5177bef87a support fused_moe_allgather_ep (#1335)
### What this PR does / why we need it?
Support fused_moe_allgather_ep.

### How was this patch tested?
It was tested by UT.

Signed-off-by: lyj-jjj <liuyingjun5@huawei.com>
2025-06-23 22:03:38 +08:00
Yikun Jiang
917c6b71af [TEST][DOC] Fix doctest and add system package installation (#1375)
### What this PR does / why we need it?
- Fix
[doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- Add system package installation
- Add doc on how to run doctests
- Clean up all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change the schedule job interval from 4 to 12 hours

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-23 20:50:33 +08:00
Icey
08cfc7cb4b Modify installation.md for adding pip extra index of torch-npu (#1272)
### What this PR does / why we need it?
Modify installation.md to add the pip extra index for torch-npu.

### How was this patch tested?
Not needed; doc-only change.

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-06-23 15:37:50 +08:00
weiguihua2
e1123172d1 [Doc] Add reinstall instructions doc (#1303)
Add a new FAQ: if users reinstall vllm-ascend with pip, the `build`
folder should be removed first.

---------

Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weiguihua <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-06-23 14:06:27 +08:00
linfeng-yuan
15592c0d48 [bugfix] fix accuracy problem for deepseek V3/R1 models with torchair graph in long sequence predictions (#1331)
### What this PR does / why we need it?
Fix the issue of insufficient cached cosine and sine length in MLA's
TorchAir graph mode, which causes accuracy deviation during
long-sequence inference.
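
A hedged sketch of the sizing fix: the rotary cos/sin tables must cover the model's maximum sequence length, otherwise long-sequence decode reads wrong values (a standard RoPE cache is shown; the actual MLA/TorchAir code differs):

```python
import torch

def build_cos_sin_cache(rotary_dim: int, max_seq_len: int,
                        base: float = 10000.0):
    # Cache sized to max_seq_len; sizing it shorter was the bug.
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2,
                                            dtype=torch.float32) / rotary_dim))
    t = torch.arange(max_seq_len, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)
    return freqs.cos(), freqs.sin()
```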

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We tested the accuracy of this patch with DeepSeek R1 e2e benchmark
serving and got a score of 83.33 on the AIME2024 dataset with a
DP4TP4EP16 setting.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-06-23 09:52:27 +08:00
zxdukki
f04c6763d8 [Bugfix] fix env variable in dbo (#1284)
### What this PR does / why we need it?
Fix the env variable in DBO to enable DBO for the DeepSeek-V3 model. We
have also fixed a known issue in deepseek-dbo.


### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
This patch can be tested with newly added e2e tests:
[tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e).
It can be verified with pytest.

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
2025-06-23 09:07:57 +08:00
Shanshan Shen
21fb68a03a [CI] Update guided decoding ut (#1312)
### What this PR does / why we need it?
Update guided decoding ut.

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-23 09:06:20 +08:00
wemaster
339d6894f6 [CI/UT][bugfix] fix v0 spec decode (#1321)
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) tried
to fix this problem, but unfortunately the fix broke the ngram
function. I fixed the ngram function in this PR. **PS**: Q: Why wasn't
the problem found when PR1109 was merged? A: The newly introduced
problem only appears when tp > 1, and the CI cases all use tp = 1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
avoid CI taking too long, including the eagle speculative UTs, which
left CI unable to cover the eagle function. I added them
(`test_eagle_correctness.py`) back in this PR.
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed it: vllm's `draft_model_runner.py`
was changed and vllm-ascend was not synchronized in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. I found that
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vllm, so I removed them in this PR.

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
tested by CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-06-23 09:05:13 +08:00
Pleaplusone
7e6efbf2a9 update torch-npu to 2.5.1.post1.dev20250619 (#1347)
### What this PR does / why we need it?
This PR update the torch_npu to newest release version
2.5.1.post1.dev20250619 .

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI testing will verify the update.

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
2025-06-23 09:02:09 +08:00
xleoken
4447e53d7a [Doc] Change not to no in faqs.md (#1357)
### What this PR does / why we need it?

Change not to no in faqs.md.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local Test

Signed-off-by: xleoken <xleoken@163.com>
2025-06-23 09:01:00 +08:00
Yikun Jiang
a95afc011e [CI] Enable merge trigger unit test and accuracy test schedule job (#1345)
### What this PR does / why we need it?
- Enable merge trigger unit test and accuracy test schedule job
- Pin lm-eval==0.4.8 to resolve Qwen3 8B accuracy
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 17:21:57 +08:00
Yikun Jiang
2e5f312530 Cleanup unused doc (#1352)
### What this PR does / why we need it?
Clean up unused doc for the MoGE model; we will add it back when the
MoGE model is ready.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-22 15:05:30 +08:00
Yikun Jiang
c30ddb8331 Bump v0.9.1rc1 release (#1349)
### What this PR does / why we need it?
Bump v0.9.1rc1 release

Closes: https://github.com/vllm-project/vllm-ascend/pull/1341
Closes: https://github.com/vllm-project/vllm-ascend/pull/1334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-22 13:15:36 +08:00
Yikun Jiang
097e7149f7 [Platform] Add initial experimental support for Atlas 300I series (#1333)
### What this PR does / why we need it?
Add initial experimental support for Ascend 310P. This patch squashes
the PRs below into one to help validation:

- https://github.com/vllm-project/vllm-ascend/pull/914
- https://github.com/vllm-project/vllm-ascend/pull/1318
- https://github.com/vllm-project/vllm-ascend/pull/1327


### Does this PR introduce _any_ user-facing change?
Users can run vLLM on the Atlas 300I Duo series

### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit tests are missing because a real 310P image is needed to run
them; they will be added in a separate PR later.
- Manually e2e test:
- Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B:
https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322
  - Pangu MGoE 72B


The patch has been tested locally on Ascend 310P hardware to ensure that
the changes do not break existing functionality and that the new
features work as intended.

#### ENV information

CANN, NNAL version: 8.1.RC1
> [!IMPORTANT]  
> PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 is required to
support the NZ format and calling NNAL operators on 310P

#### Code example

##### Build vllm-ascend from source code

```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```

##### Run offline inference

```python
from vllm import LLM, SamplingParams
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16", # IMPORTANT cause some ATB ops cannot support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops":['none', "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

```

---------

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Vincent Yuan <farawayboat@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
2025-06-21 09:00:16 +08:00
Yikun Jiang
2009fdb8da [Test] Enable code cov for V1 and enable push trigger (#1164)
### What this PR does / why we need it?
- Enable code cov for V1
- Enable push triggered job

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-21 00:01:05 +08:00
Angazenn
2f1266d451 Support Pangu Pro MoE model (#1204)
### What this PR does / why we need it?
Support Pangu Pro MoE model (https://arxiv.org/abs/2505.21411)

### Does this PR introduce _any_ user-facing change?
Yes, new model supported

### How was this patch tested?
Tested locally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-20 23:59:59 +08:00
yuancaoyaoHW
00ae250f3c [V1][eagle3] Support eagle3 proposer for v1 (#1032)
### What this PR does / why we need it?
This PR implements the Eagle Proposer feature for vLLM v1, which
enables more efficient speculative decoding by using a draft model to
predict potential future tokens (see the configuration sketch after
this list).
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising the quality of generated text.
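
A hedged configuration sketch for enabling the EAGLE3 proposer on V1 (exact config keys depend on the vLLM version; model paths are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Meta-Llama-3.1-8B-Instruct",          # placeholder target model
    speculative_config={
        "method": "eagle3",
        "model": "EAGLE3-LLaMA3.1-Instruct-8B",  # placeholder draft model
        "num_speculative_tokens": 2,             # num_spec_tokens <= 2 today
    },
)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=32))
```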

### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use the Eagle Proposer by setting appropriate
flags in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.

### How was this patch tested?
CI passed with new unit tests added for the Eagle Proposer
functionality.
- Benchmark tests were conducted comparing generation speed and quality
with and without the Eagle Proposer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done using different prompt scenarios to verify
output quality remains consistent.
- We tested the acceptance rate on one Ascend 910B NPU; the acceptance
rate results are basically consistent with those shown here:
https://github.com/vllm-project/vllm/pull/16937
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient GPU memory and
operator computation errors may occur. We will address this in
subsequent updates.
- We will add support for Eagle v1 in future updates.

### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B

CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
    --dataset $DATASET \
    --num_spec_tokens 2 \
    --max_num_seqs 1 \
    --model_dir $MODEL \
    --eagle_dir $DRAFT \
    --tp 1 \
    --num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1004

---------

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-20 17:19:54 +08:00
wangxiyuan
45be1aac0c [CI] Add codespell check for doc (#1314)
Add a codespell check for doc-only PRs.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 16:48:14 +08:00
22dimensions
761bd3d9d7 Add user guide for quantization (#1206)
### What this PR does / why we need it?

Add user guide for quantization

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-06-20 15:53:25 +08:00
yiz-liu
2c7dd85fd8 [Fix] Fix the token-wise padding mechanism (#1300)
### What this PR does / why we need it?
Fix the token-wise padding mechanism.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-20 14:46:17 +08:00