xc-llm-ascend

Author	SHA1	Message	Date
realliujiaxu	29bd9235ed	[v0.11.0][Perf] Delete redundant operations in model_runner and forward_context (#3775) Cherry-pick of https://github.com/vllm-project/vllm-ascend/pull/3677. ### What this PR does / why we need it? Remove redundant operations from `model_runner` and `forward_context`. This optimization can significantly reduce the idle time (bubble) before decoding when running models with small parameter counts (e.g., Qwen/Qwen2.5-0.5B). In testing on 800I A2, the bubble is reduced from 3.8 ms to 2.8 ms. Before: <img width="1655" height="696" alt="image" src="https://github.com/user-attachments/assets/d7608e52-2438-46dd-8fc9-391fd6274495" /> After: <img width="1607" height="774" alt="image" src="https://github.com/user-attachments/assets/56daf081-2dba-4d2e-99d4-e055187d9806" /> ### Does this PR introduce _any_ user-facing change? No --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2025-10-29 15:58:53 +08:00
wangxiyuan	0794f64a18	Revert "[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)" (#3194) ### What this PR does / why we need it? This reverts commit `6995a7bc5b`. We'll add it back once the issue is fixed. Related issue: https://github.com/vllm-project/vllm-ascend/issues/3195 ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `52d0cb8458`	2025-09-26 06:17:36 +08:00
Song Zhixin	6995a7bc5b	[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788) ### What this PR does / why we need it? When copying the sampled valid token ids from device to host, avoid `tolist`, which triggers a stream sync when the source tensor is on the device. Use a non-blocking copy followed by an explicit event sync instead. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Bring up the vLLM server: ```bash VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-14B-Instruct --disable-log-requests -tp 8 --max-num-seqs 64 --no-enable-prefix-caching --max_num_batched_tokens=8000 ``` ## Before ![76218085a0cde9b2a73214e35fb7fc08](https://github.com/user-attachments/assets/38cbd02d-d380-47f8-a111-4bd859102eb1) ## After ![6c2111136673332244d3ce11060f4048](https://github.com/user-attachments/assets/957f9bf1-ec50-4f49-9318-f4876b3e3691) As shown in the figures, TTFT decreased. - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: jesse <szxfml@gmail.com>	2025-09-24 11:21:58 +08:00
weichen	37a0715eda	[Refactor] Adjustments to moe_comm_method selection process (#3001) ### What this PR does / why we need it? Fix issues mentioned in https://github.com/vllm-project/vllm-ascend/pull/2791 and do some minor refactoring. 1. Use Enum instead of string. 2. Avoid setting a new property on forward_context in AscendFusedMoE.forward(). 3. Enable TokenDispatcherWithMoge. 4. Remove redundant code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing: 1. Enable/Disable EP 2. Aclgraph & eager - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-22 19:12:58 +08:00
weichen	18ca7861f6	[Main] [Refactor] Enable MoECommMethod in Eager Mode (#2791) ### What this PR does / why we need it? 1. Replace the prepare/finalize operations in fused_moe.py with moe_comm_method.prepare()/finalize(). 2. Replace unified_fused_experts with moe_comm_method.fused_experts() in fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py. 3. Call _select_moe_comm_method in the spec-decode proposers. 4. w4a8_dynamic does not currently support gatherep; it uses all2allv instead. 5. Remove redundant code. ### Does this PR introduce _any_ user-facing change? The AllgatherEP switch is disabled in aclgraph/eager mode; selection follows the rules in modelrunner_v1._select_moe_comm_method(). ### How was this patch tested? e2e & ut - vLLM version: v0.10.2 - vLLM main: `7f6f2c1182` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-09-16 11:06:00 +08:00
wangxiyuan	382c29f3e1	[BugFix] Fix world size bug in model_runner (#2915 ) - Fix world size bug in model_runner to make sure ep>16 runs with MC2 - enable e2e test for vl Co-Authored-By: whx-sjtu <2952154980@qq.com> Co-Authored-By: Icey <1790571317@qq.com> - vLLM version: v0.10.2 - vLLM main: `3e903b6cb4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-14 12:20:25 +08:00
yiz-liu	83eb40a51c	[Fix][MoE] Refine MoE communication strategy (#2734 ) ### What this PR does / why we need it? Refactors the Mixture-of-Experts (MoE) communication method selection logic. The choice between all-gather, all-to-all, and mc2 is now determined by expert parallel configuration, SoC version (A2/A3), and token count for better performance. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Added. - vLLM version: v0.10.1.1 - vLLM main: `eafa8dcde6` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-05 09:04:04 +08:00

7 Commits
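
Several of the commits above (#2734, #2915, #3001) concern how the MoE communication method is chosen from the expert-parallel configuration, the SoC version, and the token count, and #3001 moves that choice from strings to an Enum. The following is a minimal sketch of that idea only, not the vllm-ascend implementation: the names `MoECommType`, `SocVersion`, and `select_moe_comm_method`, the `mc2_token_capacity` default, and the specific branching rules are all hypothetical and chosen for illustration.

```python
from enum import Enum


class MoECommType(Enum):
    """Hypothetical enum replacing string identifiers for the comm method."""
    ALLGATHER = "allgather"
    ALLTOALL = "alltoall"
    MC2 = "mc2"


class SocVersion(Enum):
    """Hypothetical enum for the two SoC families named in the commit log."""
    A2 = "A2"
    A3 = "A3"


def select_moe_comm_method(
    enable_expert_parallel: bool,
    soc_version: SocVersion,
    num_tokens: int,
    ep_size: int,
    mc2_token_capacity: int = 512,  # assumed threshold, illustration only
) -> MoECommType:
    """Pick a communication method from EP config, SoC version and token count.

    The rules below are illustrative assumptions, not the project's logic.
    """
    # Without expert parallelism there is nothing to dispatch across ranks.
    if not enable_expert_parallel:
        return MoECommType.ALLGATHER

    fits_mc2 = num_tokens <= mc2_token_capacity

    if soc_version is SocVersion.A2:
        # Assumed rule: on A2, use MC2 only for large EP groups.
        if fits_mc2 and ep_size >= 16:
            return MoECommType.MC2
        return MoECommType.ALLGATHER

    # Assumed rule: on A3, fall back to all-to-all for large batches.
    return MoECommType.MC2 if fits_mc2 else MoECommType.ALLTOALL
```

Returning an Enum member rather than a string means a misspelled method name fails at attribute lookup instead of silently falling through a string comparison, which is one plausible motivation for the "Use Enum instead of string" item in #3001.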