xc-llm-ascend

Author	SHA1	Message	Date
lianyibo	53d2ea3789	[Bugfix]Fix the performance gap between 0.9.2rc1 and 0.9.1 (#1811 ) ### What this PR does / why we need it? maybe fixes [#1728](https://github.com/vllm-project/vllm-ascend/issues/1728#issuecomment-3065083433) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test Qwen3-32B tp=4 with: ```bash vllm serve --port 1234 Qwen/Qwen3-32B \ --served-model-name Qwen3-32B \ --tensor-parallel-size 4 \ --swap-space 16 \ --max-model-len 6000 \ --load-format dummy \ --disable-log-stats \ --disable-log-requests \ ``` Request batch_size=128 input/output token=1024 In 0.9.2rc1 ```text ===================================================== Total TPS with prefill(tokens/s) : 785.1395 Total TPS without prefill : 846.6809 Mean TPS with prefill : 6.1339 Mean TPS without prefill : 6.6147 ===================================================== Mean TTFT(ms) : 10307.8123 Max TTFT(ms) : 21423.0733 Min TTFT(ms) : 362.3602 ===================================================== Mean TPOT(ms) : 151.3051 Max TPOT(ms) : 159.4649 Min TPOT(ms) : 140.899 ===================================================== Total Time(s) : 175.6032 Request Throughput(requests/s) : 0.7289 ===================================================== ``` Apply this PR ```text ===================================================== Total TPS with prefill(tokens/s) : 811.0014 Total TPS without prefill : 876.4423 Mean TPS with prefill : 6.3359 Mean TPS without prefill : 6.8472 ===================================================== Mean TTFT(ms) : 10263.8382 Max TTFT(ms) : 21151.2547 Min TTFT(ms) : 375.9136 ===================================================== Mean TPOT(ms) : 146.1686 Max TPOT(ms) : 154.0957 Min TPOT(ms) : 136.8879 ===================================================== Total Time(s) : 169.8579 Request Throughput(requests/s) : 0.7536 ===================================================== ``` The TPOT performance gap between these two sets of data is about 3%. 
- vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` Signed-off-by: lianyibo <lianyibo1@kunlunit.com>	2025-07-18 23:09:54 +08:00
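The "about 3%" figure quoted for #1811 can be re-derived from the two benchmark tables. This is a minimal sketch: the helper name `relative_change` is illustrative and not part of vllm-ascend, and the numbers are copied from the commit message above.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change going from `before` to `after`."""
    return (after - before) / before


# Values copied from the benchmark output in the commit message.
tpot_before_ms, tpot_after_ms = 151.3051, 146.1686  # Mean TPOT(ms)
tps_before, tps_after = 785.1395, 811.0014          # Total TPS with prefill

tpot_change = relative_change(tpot_before_ms, tpot_after_ms)
tps_change = relative_change(tps_before, tps_after)

print(f"Mean TPOT change: {tpot_change:+.2%}")
print(f"Total TPS change: {tps_change:+.2%}")
```

Mean TPOT drops by a little over 3% and total throughput rises by a similar amount, which matches the commit's summary.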
Mengqing Cao	574fe407eb	[1/N][CustomOp] Register activation customop instead of overwriting forward_oot (#1841 ) ### What this PR does / why we need it? We'll refactor `CustomOp` in vllm-ascend from this PR on. Use the function `CustomOp.register_oot` to register a custom op, taking `AscendQuickGELU` as an example: ```python from vllm_ascend.ops.activation import AscendQuickGELU CustomOp.register_oot(_decorated_op_cls=AscendQuickGELU, name="QuickGELU") ``` This is a quick adaptation to the `CustomOp.register_oot` mechanism from vllm 0.9.2. As a further step, we can stop inheriting from `QuickGELU` and write our own `QuickGELU` entirely. Part of https://github.com/vllm-project/vllm-ascend/pull/1647 - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-18 23:07:14 +08:00
Shanshan Shen	d08ff304cd	[Misc][V0 Deprecation] Remove V0 Attention (#1835 ) ### What this PR does / why we need it? This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `8dfb45ca33` Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-18 14:10:13 +08:00
Li Wang	f9dfde02fd	[Bugfix] Fix broken CI (#1848 ) ### What this PR does / why we need it? - Fix broken commit by [#20927](https://github.com/vllm-project/vllm/pull/20927) - Fix broken commit by [#20466](https://github.com/vllm-project/vllm/pull/20466) - TODO: more fully adapt to the upstream reconstruction, let's first make CI happy - vLLM version: v0.9.2 - vLLM main: `11dfdf21bf` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-17 20:10:12 +08:00
Icey	875a920d4a	[Platform] Add support for Atlas A3 series (#1794 ) ### What this PR does / why we need it? Add support for Ascend A3 and remove the latest tag ### Does this PR introduce _any_ user-facing change? Users can run vLLM on the Atlas A3 series ### How was this patch tested? CI passed with: - remove latest tag test: https://github.com/wxsIcey/wxs-vllm-ascend/actions/runs/16267635040/job/45926924765 - E2E image build for A3 - CI test on A3 with e2e test and longterm test - Unit tests are missing because they require real A3 hardware Closes: https://github.com/vllm-project/vllm-ascend/issues/1696 - vLLM version: v0.9.2 - vLLM main: `d0dc4cfca4` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-07-17 11:13:02 +08:00
Shanshan Shen	c66b0827a7	[Misc][V0 Deprecation] Remove Pooling Model Runner (#1824 ) ### What this PR does / why we need it? Remove pooling model runner. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `d31a647124` Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-16 17:48:21 +08:00
Shanshan Shen	06655002c5	[Misc][V0 Deprecation] Remove V0 Worker (#1821 ) ### What this PR does / why we need it? Remove V0 worker. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `6cbc4d4bea` Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-16 14:07:17 +08:00
Shanshan Shen	b005def0a5	[Misc][V0 Deprecation] Remove Multi-Step Model Runner (#1820 ) ### What this PR does / why we need it? Remove multi-step model runner. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `34cda778a0` Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-16 14:06:49 +08:00
Shanshan Shen	f9e2e9bb31	[Misc][V0 Deprecation] Remove Draft Model Runner Used for V0 Spec Decode (#1810 ) ### What this PR does / why we need it? Remove draft model runner used for V0 spec decode. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `34cda778a0` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-16 10:51:23 +08:00
Shanshan Shen	f96100fad5	[Misc][V0 Deprecation] Remove V0 related codes of test, example, platform (#1805 ) ### What this PR does / why we need it? Remove V0 related codes of test, example, platform. This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `235bfd5dfe` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-15 19:58:55 +08:00
Shanshan Shen	a929699e98	[Misc][V0 Deprecation] Remove multi-step worker (#1809 ) ### What this PR does / why we need it? Remove multi-step worker This PR is a part of https://github.com/vllm-project/vllm-ascend/issues/1620. - vLLM version: v0.9.2 - vLLM main: `235bfd5dfe` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-07-15 19:48:47 +08:00
wangxiyuan	7bdada58eb	[Misc] Remove VLLM_USE_V1 usage in code (#1764 ) We plan to remove V0 code from this version. The first step is to delete v0 usage. Related: https://github.com/vllm-project/vllm-ascend/issues/1620 - vLLM version: v0.9.2 - vLLM main: `61e20828da` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 11:52:16 +08:00
wangxiyuan	494b0f474f	[CI]Fix broken CI (#1773 ) This PR fixes the broken CI. It requires https://github.com/vllm-project/vllm/pull/20900 to be merged first. - vLLM version: v0.9.2 - vLLM main: `e8cc53af5e` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 00:54:20 +08:00
Pr0Wh1teGivee	d13fb0766e	[Perf] add patch to optimize apply_topk_topp (#1732 ) ### What this PR does / why we need it? Performance optimization for apply_top_k_top_p ### Does this PR introduce _any_ user-facing change? Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature ### How was this patch tested? e2e & ut - vLLM version: v0.9.2 - vLLM main: `6a9e6b2abf` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-07-11 15:32:02 +08:00
weiguihua2	aa4240c67f	Support pipeline parallel in V1 Engine (#1700 ) ### What this PR does / why we need it? This patch supports pipeline parallel in V1 Engine ### Does this PR introduce _any_ user-facing change? Yes, users can run PP in V1 ### How was this patch tested? Manually tested - vLLM version: v0.9.2 - vLLM main: `31d5c1797f` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-07-11 15:30:51 +08:00
ttanzhiqiang	ee40d3d850	use npu_moe_gating_top_k_softmax (#1355 ) ### What this PR does / why we need it? The optimization for non-deepseek select_experts is to replace softmax+topk+to with gating_topk_softmax, which reduces the time from 37us to 14us on bf16/fp16 of qwen3-235b - vLLM version: v0.9.2 - vLLM main: `1a4f35e2ea` --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-07-11 08:55:06 +08:00
ttanzhiqiang	9d16c9982e	rm router logits Improve TTOP 3ms (#1407 ) ### What this PR does / why we need it? The previous code is `router_logits, _ = self.gate(hidden_states)` `hidden_states = get_dp_group().all_gather(hidden_states, 0)` `router_logits = get_dp_group().all_gather(router_logits, 0)` This PR merges the two all_gathers into one, removing one all_gather communication, so the code becomes `hidden_states = get_dp_group().all_gather(hidden_states, 0)` `router_logits, _ = self.gate(hidden_states)` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? bash examples/run_dp_attention_etp16.sh bash examples/run_dp_attention_etp16_benmark.sh gsm8k accuracy verification <img width="1809" alt="截屏2025-06-24 21 53 24" src="https://github.com/user-attachments/assets/47eace3b-a86b-41b4-9de8-773f57fea33b" /> - vLLM version: v0.9.2 - vLLM main: `77f77a951e` --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-07-11 08:53:17 +08:00
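The reordering in #1407 relies on the gate acting on each token independently, so gating after the gather yields the same logits as gathering the gated results. The following is a toy sketch of that equivalence in plain Python: `fake_all_gather` and `gate` are hypothetical stand-ins (list concatenation and a small dot-product router), not the vllm or vllm-ascend APIs.

```python
from typing import List

Token = List[float]


def fake_all_gather(per_rank: List[list]) -> list:
    """Stand-in for a DP all_gather along dim 0: concatenate every rank's rows."""
    return [row for rank_rows in per_rank for row in rank_rows]


def gate(tokens: List[Token], weight: List[Token]) -> List[List[float]]:
    """Toy router: one logit per expert, computed independently for each token."""
    return [[sum(x * w for x, w in zip(tok, expert)) for expert in weight]
            for tok in tokens]


# Two DP ranks holding different numbers of 2-dimensional hidden states.
hidden_per_rank = [[[1.0, 2.0]], [[0.5, -1.0], [3.0, 0.0]]]
gate_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 experts

# Before: gate locally, then gather hidden_states and router_logits (2 collectives).
logits_before = fake_all_gather([gate(h, gate_weight) for h in hidden_per_rank])
hidden_before = fake_all_gather(hidden_per_rank)

# After: gather hidden_states once, then gate the gathered tokens (1 collective).
hidden_after = fake_all_gather(hidden_per_rank)
logits_after = gate(hidden_after, gate_weight)

assert hidden_before == hidden_after
assert logits_before == logits_after
```

In this sketch the second ordering runs the gate over all gathered tokens on every rank, trading a small amount of extra compute for one fewer collective.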
ApsarasX	0fc9b56d40	[Perf] Improve MLA multistream performance (#1353 ) ### What this PR does / why we need it? > Need to merge after PR #1322 According to benchmark results, this PR brings approximately 1% performance gain. #### Before Improvement Profiling <img width="1147" alt="截屏2025-06-22 14 54 47" src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c" /> Evaluation ``` # server launch command python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \ --quantization ascend \ --served-model-name auto \ --trust-remote-code \ --distributed-executor-backend=mp \ --port 8006 \ -tp=16 \ --max-num-seqs 24 \ --max-model-len 32768 \ --max-num-batched-tokens 8192 \ --block-size 128 \ --no-enable-prefix-caching \ --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \ --gpu-memory-utilization 0.96 # client benchmark command python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \ --random-input-len 4096 \ --random-output-len 1536 \ --num-prompts 200 \ --ignore-eos \ --model auto \ --tokenizer /DeepSeek-R1-W8A8 \ --port 8006 \ --request-rate 1 \ --max-concurrency 24 \ --save-result \ --skip-initial-test \ --metric-percentiles "50,90,99" ``` ``` ============ Serving Benchmark Result ============ Successful requests: 200 Benchmark duration (s): 958.59 Total input tokens: 819200 Total generated tokens: 307200 Request throughput (req/s): 0.2086 Output token throughput (tok/s): 320.47 Total Token throughput (tok/s): 1175.05 ---------------Time to First Token---------------- Mean TTFT (ms): 942.70 Median TTFT (ms): 713.87 P50 TTFT (ms): 713.87 P90 TTFT (ms): 1363.88 P99 TTFT (ms): 2008.73 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 68.96 Median TPOT (ms): 69.49 P50 TPOT (ms): 69.49 P90 TPOT (ms): 70.42 P99 TPOT (ms): 70.72 ---------------Inter-token Latency---------------- Mean ITL (ms): 68.96 Median ITL (ms): 59.88 P50 ITL (ms): 59.88 P90 ITL (ms): 61.59 P99 ITL (ms): 68.82 ================================================== ``` #### After Improvement Profiling <img width="1200" alt="截屏2025-06-22 14 55 42" src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f" /> Evaluation ``` ============ Serving Benchmark Result ============ Successful requests: 200 Benchmark duration (s): 948.08 Total input tokens: 819200 Total generated tokens: 307200 Request throughput (req/s): 0.2110 Output token throughput (tok/s): 324.02 Total Token throughput (tok/s): 1188.08 ---------------Time to First Token---------------- Mean TTFT (ms): 1019.25 Median TTFT (ms): 714.63 P50 TTFT (ms): 714.63 P90 TTFT (ms): 1367.31 P99 TTFT (ms): 2661.52 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 68.14 Median TPOT (ms): 68.68 P50 TPOT (ms): 68.68 P90 TPOT (ms): 69.33 P99 TPOT (ms): 70.30 ---------------Inter-token Latency---------------- Mean ITL (ms): 68.14 Median ITL (ms): 59.04 P50 ITL (ms): 59.04 P90 ITL (ms): 60.93 P99 ITL (ms): 66.89 ================================================== ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `65393ee064` Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-11 08:51:17 +08:00
Mengqing Cao	cc210f46e6	[AscendScheduler][Bugfix] Remove num_draft_tokens while allocating slots (#1718 ) ### What this PR does / why we need it? Now there is no need to calculate `num_draft_tokens` when allocating slots. This PR follows the changes in vllm: https://github.com/vllm-project/vllm/pull/20701 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test - vLLM version: v0.9.2 - vLLM main: `cc876d0f29` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-10 18:47:45 +08:00
Li Wang	c7446438a9	[1/N][CI] Move linting system to pre-commit hooks (#1256 ) ### What this PR does / why we need it? Follow the vllm-project/vllm linting approach: https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml Enable pre-commit to catch low-level errors as much as possible. This PR is one step of #1241. The purpose is to make the linting system clearer and more convenient. This step mainly covers the following: yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit, enforce-import-regex-instead-of-re. TODO: - clang-format (check csrc with Google style) needs code cleanup, disabled for now - pymarkdown needs code cleanup, disabled for now - shellcheck needs code cleanup, disabled for now ### Does this PR introduce _any_ user-facing change? Only developer UX change: https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally ``` pip install -r requirements-lint.txt && pre-commit install bash format.sh ``` ### How was this patch tested? CI passed with newly added and existing tests. Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com) Co-authored-by: wangli [wangli858794774@gmail.com](mailto:wangli858794774@gmail.com) - vLLM version: v0.9.1 - vLLM main: `5358cce5ff` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-10 14:17:15 +08:00
ApsarasX	643e6f5486	[Bugfix] Fix accuracy problem caused by mask pollution (#1678 ) ### What this PR does / why we need it? If a small batch of short requests is sent first, forming a chunk with a length <128, it will corrupt the `attn_mask_cache`, causing subsequent requests that do not form a chunk to have accuracy issues. The root cause of this problem is the use of in-place multiplication. Modifying it to use out-of-place multiplication will resolve the accuracy problem. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Yes. - vLLM version: v0.9.2 - vLLM main: `ad6c2e1a0b` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-10 14:06:49 +08:00
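The mask pollution fixed in #1678 comes from in-place multiplication mutating a shared cache. Below is a small pure-Python sketch of that failure mode and the out-of-place fix. `MaskCache`, its methods, and the scale factor are hypothetical illustrations; the real `attn_mask_cache` holds tensors and has a different interface.

```python
class MaskCache:
    """Hypothetical stand-in for a cached attention mask (not the vllm-ascend class)."""

    def __init__(self, size: int) -> None:
        # Causal mask: 1.0 marks a masked (future) position, 0.0 a visible one.
        self.mask = [[1.0 if j > i else 0.0 for j in range(size)]
                     for i in range(size)]

    def scaled_in_place(self, factor: float):
        # Buggy pattern: mutates the cached rows, so later callers see scaled values.
        for row in self.mask:
            for j in range(len(row)):
                row[j] *= factor
        return self.mask

    def scaled_out_of_place(self, factor: float):
        # Fixed pattern: builds a new mask and leaves the cache untouched.
        return [[value * factor for value in row] for row in self.mask]


SCALE = -10000.0  # assumed large negative bias, chosen only for illustration

buggy = MaskCache(3)
buggy.scaled_in_place(SCALE)                 # first request pollutes the cache
buggy_second = buggy.scaled_in_place(SCALE)  # second request scales it again

fixed = MaskCache(3)
fixed.scaled_out_of_place(SCALE)
fixed_second = fixed.scaled_out_of_place(SCALE)

print(buggy_second[0][1], fixed_second[0][1], fixed.mask[0][1])
```

With the in-place variant the second caller receives a mask scaled twice, while the out-of-place variant returns the same result on every call because the cached values never change.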
ttanzhiqiang	60519c71bd	shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395 ) ### What this PR does / why we need it? When all_reduce_merge is enabled, shared_experts skips its own all_reduce in the MLP and waits until both shared_experts and router_experts have completed before performing a single all_reduce. In both prefill and decode, merging the all_reduce for shared_experts+router_experts brings a benefit. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? bash examples/run_dp_attention_etp16.sh bash examples/run_dp_attention_etp16_benmark.sh - vLLM version: v0.9.1 - vLLM main: `977180c912` --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-07-10 12:07:05 +08:00
ApsarasX	89c1a0f006	[Bugfix] Fix memory-leak caused by dist._functional_collectives.reduce_scatter_tensor (#1380 ) ### What this PR does / why we need it? In some cases, `dist._functional_collectives.reduce_scatter_tensor` can cause its input tensor not to be released immediately after the current layer ends. Instead, it will only be released when the GPU memory usage of the current process reaches a certain threshold (approximately every 15 layers each time). Before Fix <img width="1441" alt="截屏2025-06-24 01 26 13" src="https://github.com/user-attachments/assets/72d5dbb3-c8c8-4778-bf64-8db7bab8aff0" /> After Fix <img width="1475" alt="截屏2025-06-24 01 23 43" src="https://github.com/user-attachments/assets/6c69cfcd-a469-4ee5-b8c6-210aeb3a5bdf" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `9ff2af6d2b` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-10 10:57:24 +08:00
wangxiyuan	b979ee353d	[Misc] Code clean up (#1679 ) Make model_runner_v1 more readable - vLLM version: v0.9.2 - vLLM main: `baed180aa0` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 14:33:40 +08:00
wangxiyuan	392fd7239b	[Misc] Add attention mask (#1673 ) Move the attention mask from V0 to a common place. - vLLM version: v0.9.2 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 09:12:03 +08:00
wangxiyuan	cc1588be50	[Misc] Code clean up (#1674 ) Remove useless function - vLLM version: v0.9.2 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 08:54:12 +08:00
wangxiyuan	830332ebfc	Clean up v0.9.1 code (#1672 ) vllm has released 0.9.2. This PR drops 0.9.1 support. - vLLM version: v0.9.1 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 08:52:24 +08:00
NeverRaR	71de52d3a9	feat: add kv cache memory cache and skip dynamo guard (#1549 ) ### What this PR does / why we need it? 1. Loading the torchair cache sometimes fails because NPU memory fluctuates, so this PR adds a new cache that saves the old KV cache bytes to avoid a possible crash while loading the torchair graph cache. 2. When caching is enabled but no cache exists yet, the first compilation incurs the overhead of Dynamo guards. In this case we compile twice directly to skip them (this brings a 3-4 ms TPOT improvement). ### Does this PR introduce _any_ user-facing change? Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to control the KV cache fluctuation tolerance ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `1fd471e957` Signed-off-by: boying <897013703@qq.com>	2025-07-07 22:37:14 +08:00
NeverRaR	df84cceca8	perf: use multicast to avoid padding decode request to prefill size (#1555 ) ### What this PR does / why we need it? perf: use multicast to avoid padding decode request to prefill size ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `1fd471e957` Signed-off-by: boying <897013703@qq.com>	2025-07-07 22:36:03 +08:00
wm901115nwpu	f08c4f15a2	fix spelling errors (#1654 ) Fix spelling errors in the code - vLLM version: v0.9.1 - vLLM main: `923147b5e8` Signed-off-by: unicorn <unicorn@unicorns-MacBook-Pro.local> Co-authored-by: unicorn <unicorn@unicorns-MacBook-Pro.local>	2025-07-07 20:24:42 +08:00
Angazenn	18495f44b2	[BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636 ) ### What this PR does / why we need it? This PR fixes a bug in the max_num_tokens_across_dp calculation. In earlier versions, it was computed as graph_pad_size plus the actual max_num_tokens, which yields different max_num_tokens_across_dp values across DP ranks. When padding is required, this can produce wrong padding. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed normally. Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-07-07 20:03:02 +08:00
ApsarasX	c58accc15e	[Bugfix] Support Qwen3-MOE on aclgraph mode (#1381 ) ### What this PR does / why we need it? Fix the shape of the `npu_moe_init_routing` input parameters to support aclgraph mode on qwen3-moe In addition to this PR, resolving the `gatherv3` error might be necessary. See related PR https://github.com/vllm-project/vllm-ascend/pull/1297 https://github.com/vllm-project/vllm-ascend/pull/1446 Thanks to @yiz-liu for providing the idea ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on Qwen3-30B-A3B Closes: https://github.com/vllm-project/vllm-ascend/issues/1368 --------- Signed-off-by: ApsarasX <apsarax@outlook.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-06 15:29:36 +08:00
Vincent Yuan	eb390545ec	[Performance] Disable JIT and nd2nz to improve performance for Atlas 300I series (#1591 ) ### What this PR does / why we need it? Since running on Atlas 300I Duo was initially supported in #1333, this PR disables the JIT compiler for the 310P and changes the data format to NZ for the weights in the vocabulary embedding and QKV projection layers, which helps improve performance. See #1563 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested manually: https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339 Signed-off-by: Vincent Yuan <farawayboat@gmail.com>	2025-07-05 16:29:21 +08:00
Mengqing Cao	dd22ac38b2	[CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136 ) ### What this PR does / why we need it? 1. Run the DeepSeek accuracy UT per PR --- multicard CI time increased by 9 min 2. Run the spec decode e2e test on V1 per PR --- singlecard CI time increased by 3 min (partly disabled because it does not work yet) ~~3. align the output of whether dbo is enabled or not~~ The generated results with and without dbo cannot be aligned. https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136 4. Skip the V0 MTP test due to a failure in https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816 5. Fix some version conflicts ### How was this patch tested? CI passed with newly added tests. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-04 18:05:45 +08:00
wangxiyuan	343955c7ac	[CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625 ) The vllm commit `78fe77534b` reverted the change to FusedMoEParallelConfig. This PR does the same to fix the CI error. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-04 17:54:33 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention-related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allowed deepseek to use torchair. Now it can also be used with pangu. In addition, a supported-model list is added to control which types of models can use torchair. ### How was this patch tested? We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
Angazenn	9fbd8017c0	[Quantization]300I Duo support w8a8 quantization (#1560 ) ### What this PR does / why we need it? This pr supports w8a8 on 300I Duo platform. The main change is to use `npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? offline inference on 310p runs normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:12:46 +08:00
wangxiyuan	a45dfde283	[CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602 ) Make CI happy 1. `c1909e7e8c` changed how moeConfig is initialized 2. `48fb076cbc` changed the input batch logic. This PR adapts vllm-ascend to these changes. Closes: https://github.com/vllm-project/vllm-ascend/issues/1600 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-03 18:36:17 +08:00
Li Wang	30bf7014d0	[Bugfix] Add func `swap_states` to fix MLA attention (#1580 ) ### What this PR does / why we need it? MLA attention still uses the `swap_states` attribute of gpu_input_batch, which leads to the error `AttributeError: 'InputBatch' object has no attribute 'swap_states'`. This PR fixes the MLA input batch error ### How was this patch tested? will be tested by #1136 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-02 17:42:53 +08:00
Zhu Yi Lin	6b80c5acba	Fix W8A8 fused moe bug (#1529 ) ### What this PR does / why we need it? 1. Drop some useless code for w8a8 fusedmoe 2. Add int8 KV cache check 3. Add more UT. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with newly added tests. --------- Signed-off-by: zhuyilin <809721801@qq.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-02 16:40:51 +08:00
wangxiyuan	641a4e6092	[CI] Cache sampled token ids in model runner to fix CI error (#1573 ) ### What this PR does / why we need it? The vllm change `7f280d69c9` broke vllm-ascend. This PR fixes the broken CI ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? passed Closes: https://github.com/vllm-project/vllm-ascend/issues/1572 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-02 12:11:14 +08:00
Pleaplusone	0e43813120	[ModelRunner] Use shared CachedRequestData across requests to fix CI (#1546 ) ### What this PR does / why we need it? This PR (adapted from `2863befce3`) updates the CachedRequestData definition to use a single instance shared across all requests in a batch, instead of creating a new instance per request. CI was broken by vllm's model_runner change: `ERROR 07-01 09:53:53 [core.py:521] TypeError: 'CachedRequestData' object is not iterable`. This PR modifies the model_runner to fix it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passing CI will verify this. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-02 06:05:21 +08:00
Shanshan Shen	8013634e9c	[Structured Output] Remove redundant check for `grammar_bitmask` (#1459 ) ### What this PR does / why we need it? Remove the redundant check, since this is already checked at https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-30 17:39:19 +08:00
whx	f286265791	[BugFix] Address PrefillCacheHit state to fix prefix cache accuracy bug (#1498 ) When using AscendScheduler with prefix-cache enabled and chunk-prefill disabled, there is an accuracy problem because mla_v1 has no branch to handle this scenario. This PR fixes it. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-06-30 16:51:20 +08:00
Li Wang	5f8241c25c	[V1][ModelRunner] Support pooling model for v1 engine (#1359 ) ### What this PR does / why we need it? Add support for V1 pooling tasks while changing as little existing code as possible. Note that `vllm.v1.worker.gpu_input_batch` is moved down into vllm-ascend: upstream interfaces change frequently, so it is moved here to decouple from them ### How was this patch tested? CI passed with newly added and existing tests. A simple test, adapted from https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, was first run locally, as shown below: ```python import os import torch from vllm import LLM os.environ["VLLM_USE_MODELSCOPE"]="True" def get_detailed_instruct(task_description: str, query: str) -> str: return f'Instruct: {task_description}\nQuery:{query}' # Each query must come with a one-sentence instruction that describes the task task = 'Given a web search query, retrieve relevant passages that answer the query' queries = [ get_detailed_instruct(task, 'What is the capital of China?'), get_detailed_instruct(task, 'Explain gravity') ] # No need to add instruction for retrieval documents documents = [ "The capital of China is Beijing.", "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun." ] input_texts = queries + documents model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed") outputs = model.embed(input_texts) embeddings = torch.tensor([o.outputs.embedding for o in outputs]) scores = (embeddings[:2] @ embeddings[2:].T) print(scores.tolist()) # [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]] ``` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wangli <858794774@qq.com> Co-authored-by: wangli <858794774@qq.com>	2025-06-30 16:31:12 +08:00
yiz-liu	75d05ee200	[Core] Fix block table shape to make Prefix cache work with Ascend scheduler (#1446 ) ### What this PR does / why we need it? This fixes the shape of block_table, which was changed by the hybrid kv groups work several weeks ago. An error is raised when prefix-cache (eager or not) and Ascend Scheduler are enabled at the same time; sending two identical requests reproduces it. v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1297 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested manually Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-06-30 11:25:19 +08:00
Zhu Yi Lin	b308a7a258	support pangumoe w8a8c8 and docs (#1477 ) ### What this PR does / why we need it? support pangu moe w8a8c8 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added test. Signed-off-by: zhuyilin <809721801@qq.com>	2025-06-28 18:51:07 +08:00
Angazenn	c59d69d9e6	[PERF]support MERRouter (#1421 ) ### What this PR does / why we need it? This PR introduces an expert rearrangement algorithm for the PanguProMoE model. Unlike the original grouped topk, it filters out the top experts that are allocated more tokens, so fewer experts need to be loaded when calculating gmm. We have tested this algorithm for PanguProMoE-72B on the 300I Duo and 800I A2 platforms. On 300I Duo, setting `num_voted_experts` to 5 achieves both good performance and accuracy, while on 800I A2 it is still set to 8 to use the original pangu grouped topk. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-06-28 16:14:49 +08:00
Angazenn	8fa188111d	[PERF]support H2P communication optimization for PanguProMoe (#1463 ) ### What this PR does / why we need it? In this PR, we support H2P communication optimization when running PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather` to replace `all_reduce` to improve performance: original layer: input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm --> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce now: input_layernorm --> tp all_gather --> attn --> tp reduce_scatter --> post_attention_layernorm --> all_rank all_gather --> moe/mlp --> all_rank reduce_scatter In addition, because `reduce_scatter` requires num_tokens to be divisible by the group size, we need to pad the sequences based on `max_tokens_across_dp`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR has been tested with both offline and online inference using PanguProMoE-72B. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-06-28 16:10:27 +08:00
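The padding requirement mentioned in #1463 can be sketched as rounding the largest per-rank token count up to a multiple of the communication group size. The helper `pad_to_multiple` and the sample numbers below are hypothetical; the actual padding logic in vllm-ascend may differ.

```python
from typing import List


def pad_to_multiple(num_tokens: int, group_size: int) -> int:
    """Round `num_tokens` up to the nearest multiple of `group_size`."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # Ceiling division written with floor division on negated operands.
    return -(-num_tokens // group_size) * group_size


# Assumed example: token counts on three DP ranks and a group of 4 ranks.
tokens_per_dp_rank: List[int] = [5, 9, 7]
group_size = 4

max_tokens_across_dp = max(tokens_per_dp_rank)
padded_len = pad_to_multiple(max_tokens_across_dp, group_size)

# Every rank pads to the same length so reduce_scatter can split evenly.
padding_per_rank = [padded_len - n for n in tokens_per_dp_rank]

print(padded_len, padding_per_rank)
```

Padding every rank to one shared length keeps the token dimension divisible by the group size on all ranks.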
Angazenn	5c53cbaf2a	[BugFix]Fix bugs when initializing communication groups with dp on 300I Duo (#1478 ) ### What this PR does / why we need it? This PR fixes a bug caused by using broadcast with cpu_group when running DP. The `broadcast310p` patch takes effect for both cpu_group and the device group, but it is only needed for the device group. A wrapper is therefore added so that cpu_group uses the native torch broadcast, which resolves the bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? With this PR, DP on 310p runs normally and generates reasonable answers. Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-06-28 16:07:52 +08:00


280 Commits