### What this PR does / why we need it?
After #1094, decode may run in non-compiled mode even when
`torchair_graph_config.enabled` is set, which breaks multistream MLA:
it assumed that `torchair_graph_config.enabled == True` implies TorchAir
compiled mode for decode. This PR augments that check to fix the failure.
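A minimal sketch of the augmented gating, with illustrative names (the real check lives in vllm-ascend's MLA attention code):
```python
# Hedged sketch: gate multistream MLA on the actual execution mode, not on
# the config flag alone (function and parameter names are illustrative).
def use_multistream_mla(torchair_graph_enabled: bool, running_in_graph: bool) -> bool:
    # Before this fix, `torchair_graph_enabled` alone was assumed to imply
    # compiled (graph-mode) decode; after #1094 decode may run eagerly even
    # with the flag set, so both conditions must hold.
    return torchair_graph_enabled and running_in_graph
```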
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested both offline and via the graph-mode MLA e2e test case.
---------
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds
asserts in the `GatherV3` operator.
Currently, in
[`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124),
the `position` tensor may contain values that exceed the dimensions of
the attention mask, triggering a `GatherV3` boundary check failure.
These invalid indices originate from stale “dirty” entries left over in
`position` due to padding logic in the ACL graph. Specifically, in
[`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989),
the variable `num_input_tokens` is always greater than or equal to
`total_num_scheduled_tokens`, so any positions not explicitly cleared
from a previous batch will persist and cause this sporadic error.
Note that in the original vLLM implementation, masks are constructed
internally from other arguments, so these lingering values never surface.
On the Ascend platform, however, split-fuse attention requires externally
supplied masks, so the residual indices become critical and lead to this
elusive, hard-to-reproduce failure.
The fix is to explicitly reset or zero out all unused entries in the
`position` tensor before passing it to `GatherV3`, ensuring that every
index lies within the valid range of the attention mask.
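A minimal sketch of the fix, assuming a persistent `positions` buffer padded to `num_input_tokens` (shapes and values are illustrative, not the exact `NPUModelRunner` code):
```python
import torch

num_input_tokens = 8                        # padded size chosen by the ACL graph
total_num_scheduled_tokens = 5              # tokens actually scheduled this step

positions = torch.empty(num_input_tokens, dtype=torch.int64)
positions[:total_num_scheduled_tokens] = torch.tensor([3, 4, 5, 0, 1])
positions[total_num_scheduled_tokens:] = 0  # reset stale "dirty" tail entries

# Every index now lies within the attention mask bounds, so GatherV3's
# boundary check can no longer trip on leftovers from a previous batch.
```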
Closes: https://github.com/vllm-project/vllm-ascend/issues/1038
### Does this PR introduce _any_ user-facing change?
No
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Use the fused op `torch_npu.npu_top_k_top_p(logits, p, k)` when both `p`
and `k` are not None; otherwise, fall back to the original implementation.
The replacement takes place automatically when
`VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1` is set.
This patch uses `npu_top_k_top_p`, which requires
torch_npu >= 2.5.1.post1.dev20250619.
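A hedged sketch of the dispatch; the fused op name and signature come from this PR, while the fallback helper is hypothetical:
```python
import torch
import torch_npu  # requires torch_npu >= 2.5.1.post1.dev20250619


def _original_top_k_top_p(logits: torch.Tensor, p, k) -> torch.Tensor:
    # placeholder for vLLM's default top-k/top-p path (hypothetical helper)
    raise NotImplementedError


def apply_top_k_top_p(logits: torch.Tensor, p, k) -> torch.Tensor:
    # Use the fused single-kernel path only when both p and k are present.
    if p is not None and k is not None:
        return torch_npu.npu_top_k_top_p(logits, p, k)
    return _original_top_k_top_p(logits, p, k)
```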
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested with DeepSeek R1; unit tests passed.
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
This PR addresses a long-standing **CI bug** and removes unused code:
1. **Fix CI bug**: Resolves the root cause of the CI test failures,
restoring the reliability and consistency of the CI pipeline.
2. **Remove `patch_eagle.py`**: Deletes `patch_eagle.py`, which is no
longer used by the project; removing it reduces codebase complexity,
improves maintainability, and prevents confusion.
### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.
### How was this patch tested?
CI passed. Specifically:
1. **Existing CI pipelines passed**: After the CI bug fix, all existing
CI tests and pipelines run and pass.
2. **Code cleanup verified**: After removing `patch_eagle.py`, the
project's main test suite was run to confirm that related modules still
work and no new regressions were introduced.
Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
This PR cleans up unused code in the LLM setup to make the code clearer.
1. Remove unused `self.xxx` properties.
2. Change `set_random_seed` to `seed_everything`.
3. Remove `set_custom_all_reduce`; it is only used for CUDA.
This is purely a code cleanup; no code logic changes.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
1. Format the developer guide content to make it clearer
2. Add the patch doc to the developer guide
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to `AscendMetadata` to fix data parallelism.
This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added an
argument `max_num_tokens_across_dp` that is required when dp_size > 1.
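An illustrative sketch of the metadata change; only `max_num_tokens_across_dp` comes from this PR, while the surrounding class layout and the other field are assumptions:
```python
from dataclasses import dataclass


@dataclass
class AscendMetadata:
    num_actual_tokens: int             # assumed existing field
    max_num_tokens_across_dp: int = 0  # new: max token count over DP ranks,
                                       # so every rank pads to the same size
```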
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Sync MRotaryEmbedding interface change to recover main CI
(https://github.com/vllm-project/vllm/pull/19939)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Use eager mode to run the disaggregated prefill CI.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing tests.
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
`stateless_init_dp_group` in vLLM now works with non-CUDA platforms, so
this patch is no longer needed; remove it.
The patch was introduced in vllm-ascend by
e74331a1ed
(v0.8.4rc2); the corresponding vLLM upstream change was merged in
3e472d882a
(v0.8.0).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Support `fused_moe_allgather_ep`.
### How was this patch tested?
Tested by unit tests.
Signed-off-by: lyj-jjj <liuyingjun5@huawei.com>
### What this PR does / why we need it?
- Fix
[doctest](https://github.com/vllm-project/vllm-ascend/actions/workflows/vllm_ascend_doctest.yaml?query=event%3Aschedule)
- Add system package installation
- Add a doc on running doctests
- Clean up all extra steps in .github/workflows/vllm_ascend_doctest.yaml
- Change the schedule job interval from 4 to 12 hours
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- doctest CI passed
- Local test with
`/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`.
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Update installation.md to add the pip extra index for torch-npu.
### How was this patch tested?
No need
---------
Signed-off-by: Icey <1790571317@qq.com>
Add a new FAQ: if users re-install vllm-ascend with pip, the `build`
folder should be removed first.
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: weiguihua <weiguihua2@huawei.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Fix insufficient cached cosine and sine length in MLA's TorchAir graph
mode, which caused accuracy deviation during long-sequence inference.
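A minimal sketch of the underlying constraint, assuming a standard rotary cos/sin cache (helper name and signature are illustrative):
```python
import torch


def build_cos_sin_cache(rotary_dim: int, max_seq_len: int, base: float = 10000.0):
    # The cache must cover the full max_seq_len; a shorter cache is exactly
    # the failure mode this PR fixes for long-sequence decode.
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    t = torch.arange(max_seq_len).float()  # length == max_seq_len, not less
    freqs = torch.outer(t, inv_freq)
    return freqs.cos(), freqs.sin()
```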
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We tested the accuracy of this patch with DeepSeek R1 e2e benchmark
serving and got a score of 83.33 on the AIME2024 dataset with a
DP4TP4EP16 setup.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
1. [PR913](https://github.com/vllm-project/vllm-ascend/pull/913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](https://github.com/vllm-project/vllm-ascend/pull/1109) attempted
to fix this problem, but unfortunately that fix broke the ngram function.
I fix the ngram function in this PR. **PS**: Q: Why was the ngram problem
not caught when PR1109 was merged? A: The newly introduced problem only
appears when tp > 1, and the CI test cases all run with tp = 1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
keep CI time down, including the eagle speculative UTs, which left CI
unable to cover the eagle function. I add them
(`test_eagle_correctness.py`) back in this PR.
3. Because of the gap described in 2, the current version of Eagle was
broken. I located and fixed the problem: vLLM's `draft_model_runner.py`
had changed and vllm-ascend was not synchronized in time.
4. Currently, the UTs of v0 and v1 are mixed in the `spec_decode`
directory. I split them into two directories: `spec_decode_v0` and
`spec_decode_v1`.
5. I found that
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vLLM, so I remove them in this PR.
### Does this PR introduce _any_ user-facing change?
This PR fixes the ngram and eagle spec decode functions in the V0
engine.
### How was this patch tested?
Tested by CI.
Signed-off-by: mengwei805 <mengwei25@huawei.com>
### What this PR does / why we need it?
This PR updates torch_npu to the newest release version,
2.5.1.post1.dev20250619.
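One quick way to confirm the installed wheel matches (illustrative; assumes the wheel is distributed as `torch-npu`):
```python
from importlib.metadata import version

print(version("torch-npu"))  # expect: 2.5.1.post1.dev20250619 after this PR
```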
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI testing will guarantee the update.
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
Change "not" to "no" in faqs.md.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local Test
Signed-off-by: xleoken <xleoken@163.com>
### What this PR does / why we need it?
- Enable the merge-triggered unit test and accuracy test schedule jobs
- Pin lm-eval==0.4.8 to resolve the Qwen3 8B accuracy issue
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Clean up unused docs for the MoGE model; we will add them back when the
MoGE model is ready.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add initial experimental support for Ascend 310P. This patch squashes
the PRs below into one to ease validation:
- https://github.com/vllm-project/vllm-ascend/pull/914
- https://github.com/vllm-project/vllm-ascend/pull/1318
- https://github.com/vllm-project/vllm-ascend/pull/1327
### Does this PR introduce _any_ user-facing change?
Users can run vLLM on the Atlas 300I Duo series
### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit tests are missing because a real 310P image is needed to run
them; they will be added in a separate PR later.
- Manual e2e tests:
- Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B:
https://github.com/vllm-project/vllm-ascend/pull/914#issuecomment-2942989322
- Pangu MGoE 72B
The patch has been tested locally on Ascend 310P hardware to ensure that
the changes do not break existing functionality and that the new
features work as intended.
#### ENV information
CANN, NNAL version: 8.1.RC1
> [!IMPORTANT]
> PTA 2.5.1 must be >= torch_npu-2.5.1.post1.dev20250528 to support the
NZ format and calling NNAL operators on 310P
#### Code example
##### Build vllm-ascend from source code
```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```
##### Run offline inference
```python
from vllm import LLM, SamplingParams

# Yes/no prompts in Chinese, e.g. "Is the boiling point of water 100°C?
# Please answer yes or no."
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops": ["none", "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
---------
Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Vincent Yuan <farawayboat@gmail.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
- Enable code cov for V1
- Enable push triggered job
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Support Pangu Pro MoE model (https://arxiv.org/abs/2505.21411)
### Does this PR introduce _any_ user-facing change?
Yes, new model supported
### How was this patch tested?
Tested locally
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
This PR implements the Eagle Proposer feature for vLLM v1, which enables
more efficient speculative decoding by using a draft model to predict
potential future tokens.
- The implementation includes the core Eagle algorithm integration with
vLLM's existing architecture, allowing for faster inference while
maintaining output quality.
- This is needed to significantly improve the generation speed of large
language models without compromising on the quality of generated text.
### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces a new speculative decoding mode that can be
enabled via configuration.
- Users can now choose to use the Eagle Proposer by setting the
appropriate flags in the inference configuration.
- The API remains backward compatible, with the new functionality being
opt-in.
### How was this patch tested?
CI passed with new unit tests added for the Eagle Proposer functionality.
- Benchmark tests compared generation speed and quality with and without
the Eagle Proposer.
- Integration tests were performed with various model architectures to
ensure compatibility.
- Manual testing was done with different prompt scenarios to verify that
output quality remains consistent.
- We tested the acceptance rate on one Ascend 910B NPU; the results are
basically consistent with those shown in
https://github.com/vllm-project/vllm/pull/16937.
- Currently, we support scenarios where num_spec_tokens <= 2. When
num_spec_tokens > 2, issues such as insufficient memory and operator
computation errors may occur. We will address this in subsequent
updates.
- We will add support for Eagle v1 in future updates.
### Acceptance Test Script
```bash
SCRIPT="/offline/eagle.py"
DATASET="ShareGpt"
MODEL=Meta-Llama-3.1-8B-Instruct
DRAFT=EAGLE3-LLaMA3.1-Instruct-8B
CUDA_VISIBLE_DEVICES="0" VLLM_USE_V1=1 $PYTHON $SCRIPT \
--dataset $DATASET \
--num_spec_tokens 2 \
--max_num_seqs 1 \
--model_dir $MODEL \
--eagle_dir $DRAFT \
--tp 1 \
--num_prompts 80
```
### Acceptance Test Results
```bash
██████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [21:22<00:00, 16.03s/it, est. speed input: 4.72 toks/s, output: 13.56 toks/s]
-------------------------------------------------------------------------------------
mean acceptance length: 1.63
-------------------------------------------------------------------------------------
total_counts: 8062
acceptance at token 0: 1.00 (8062 times)
acceptance at token 1: 0.70 (5612 times)
acceptance at token 2: 0.47 (3765 times)
```
Closes: https://github.com/vllm-project/vllm-ascend/issues/1004
---------
Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
### What this PR does / why we need it?
Add user guide for quantization
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
1. Add a static EPLB unit test.
2. Fix a bug: a Tensor cannot be used directly as an `if` condition (see
the sketch below).
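A small self-contained reproduction of the bug class fixed here (the mask below is illustrative, not EPLB's actual data):
```python
import torch

mask = torch.tensor([True, False, True])

# if mask:   # RuntimeError: Boolean value of Tensor with more than
#     ...    # one element is ambiguous

if mask.any():  # explicit reduction over all elements
    print("condition holds for at least one element")
```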
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Run the unit test.
---------
Signed-off-by: songshanhu07 <1763685535@qq.com>
### What this PR does / why we need it?
Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode.
See **mengwei805**'s comment in
https://github.com/vllm-project/vllm-ascend/pull/1123 for more details.
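A hedged sketch of the gating; the env var name comes from this PR, while the detection flag and helper name are assumptions:
```python
import os


def maybe_enable_acl_op_init_mode(is_v0_spec_decode: bool) -> None:
    # Applied transparently on the V0 spec decode path only, so users never
    # have to set the variable themselves.
    if is_v0_spec_decode:
        os.environ["ACL_OP_INIT_MODE"] = "1"
```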
### Does this PR introduce _any_ user-facing change?
Users will not be aware of `VLLM_ASCEND_ACL_OP_INIT_MODE`
(`ACL_OP_INIT_MODE`); it is applied transparently.
### How was this patch tested?
Test scripts:
```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Results:
```
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 76.70it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.33it/s, est. speed input: 6.64 toks/s, output: 21.26 toks/s]
Prompt: 'The future of AI is', Generated text: ' bright\n\n04/15/2020\n\nBy: James'
```
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
- For the NPU environment, we should use `PYTORCH_NPU_ALLOC_CONF` rather
than `PYTORCH_CUDA_ALLOC_CONF`; see the sketch after this list
- Add `PIP_EXTRA_INDEX_URL` to make nightly_benchmarks happy
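An illustrative example of the NPU-specific knob (the value shown is an assumption, not taken from this PR):
```python
import os

# The NPU allocator reads PYTORCH_NPU_ALLOC_CONF; the CUDA-named variable
# is ignored on Ascend.
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "max_split_size_mb:128"
```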
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR resolves [issue
1147](https://github.com/vllm-project/vllm-ascend/issues/1147):
1. Move the fused_moe code into one file, `fused_moe.py`.
2. Integrate branch conditions into the function `get_fused_moe_state`,
as sketched below.
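An illustrative sketch of the consolidation; the enum members and the selection rule are assumptions, not copied from `fused_moe.py`:
```python
from enum import Enum


class FusedMoEState(Enum):
    AllGather = 0
    MC2 = 1


def get_fused_moe_state(ep_size: int, with_prefill: bool) -> FusedMoEState:
    # Expert-parallel decode can take the MC2 path; everything else falls
    # back to the all-gather implementation.
    if ep_size > 1 and not with_prefill:
        return FusedMoEState.MC2
    return FusedMoEState.AllGather
```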
### Does this PR introduce _any_ user-facing change?
1. This PR removes the env `VLLM_ENABLE_MC2`: it is unnecessary, since
we can make the judgment from the current scenario without it, and
keeping it only adds complexity.
2. This PR removes the env `USING_LCCL_COM`, which has already expired.
3. `additional_config.expert_tensor_parallel_size` has already expired;
we now use the parameter `enable_expert_parallel` instead, consistent
with vLLM.
### How was this patch tested?
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR refactors the user stories page:
- Move it to community
- Add initial info of LLaMA-Factory, Huggingface/trl, MindIE Turbo,
GPUStack, verl
- Add a new page for LLaMA-Factory
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview locally
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>