xc-llm-ascend

Author	SHA1	Message	Date
Pr0Wh1teGivee	d13fb0766e	[Perf] add patch to optimize apply_topk_topp (#1732 ) ### What this PR does / why we need it? Performance optimization for apply_top_k_top_p ### Does this PR introduce _any_ user-facing change? Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature ### How was this patch tested? e2e & ut - vLLM version: v0.9.2 - vLLM main: `6a9e6b2abf` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-07-11 15:32:02 +08:00
weiguihua2	aa4240c67f	Support pipeline parallel in V1 Engine (#1700 ) ### What this PR does / why we need it? This patch supports pipeline parallel in V1 Engine ### Does this PR introduce _any_ user-facing change? Yes, users can run PP in V1 ### How was this patch tested? Manually tested - vLLM version: v0.9.2 - vLLM main: `31d5c1797f` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2025-07-11 15:30:51 +08:00
ttanzhiqiang	ee40d3d850	use npu_moe_gating_top_k_softmax (#1355 ) ### What this PR does / why we need it? For non-deepseek select_experts, the optimization replaces softmax+topk+to with the fused gating_topk_softmax op, which reduces the time from 37us to 14us on bf16/fp16 for qwen3-235b - vLLM version: v0.9.2 - vLLM main: `1a4f35e2ea` --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-07-11 08:55:06 +08:00
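The unfused softmax+topk+to sequence named in this commit can be sketched in plain Python as a reference. This is only an illustration under assumptions: `softmax` and `select_experts` here are hypothetical helpers, not the vllm-ascend functions, and "to" is assumed to mean a cast of the expert ids to integers.

```python
import math
from typing import List, Tuple


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def select_experts(router_logits: List[List[float]],
                   top_k: int) -> Tuple[List[List[float]], List[List[int]]]:
    """Hypothetical reference: softmax, then top-k, then integer cast of ids."""
    topk_weights, topk_ids = [], []
    for row in router_logits:
        probs = softmax(row)
        # Indices of the top_k largest probabilities (ties keep lower index first).
        order = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
        topk_weights.append([probs[e] for e in order])
        topk_ids.append([int(e) for e in order])
    return topk_weights, topk_ids


# Two tokens, three experts, two experts selected per token.
weights, ids = select_experts([[0.0, 2.0, 1.0], [3.0, 0.0, 0.0]], top_k=2)
```

A fused kernel computes the same three steps in one launch, which is where a latency saving like the one reported would come from.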
ttanzhiqiang	9d16c9982e	rm router logits Improve TTOP 3ms (#1407 ) ### What this PR does / why we need it? The previous code was router_logits, _ = self.gate(hidden_states) hidden_states = get_dp_group().all_gather(hidden_states, 0) router_logits = get_dp_group().all_gather(router_logits, 0) This PR merges the two all_gathers into one, saving one all_gather communication, so the code becomes hidden_states = get_dp_group().all_gather(hidden_states, 0) router_logits, _ = self.gate(hidden_states) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? bash examples/run_dp_attention_etp16.sh bash examples/run_dp_attention_etp16_benmark.sh gsm8k accuracy verification <img width="1809" alt="截屏2025-06-24 21 53 24" src="https://github.com/user-attachments/assets/47eace3b-a86b-41b4-9de8-773f57fea33b" /> - vLLM version: v0.9.2 - vLLM main: `77f77a951e` --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-07-11 08:53:17 +08:00
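Why the reordering in this commit is safe can be shown with a small simulation. The sketch assumes the gate is a row-wise linear layer, so gating gathered tokens gives the same logits as gathering gated tokens. `gate` and `all_gather` below are hypothetical stand-ins for `self.gate` and `get_dp_group().all_gather(x, 0)`, using plain lists instead of tensors.

```python
from typing import List

Matrix = List[List[float]]


def gate(hidden: Matrix, weight: Matrix) -> Matrix:
    """Row-wise linear gate: logits[i][e] = sum_k hidden[i][k] * weight[k][e]."""
    return [[sum(h * w for h, w in zip(row, col)) for col in zip(*weight)]
            for row in hidden]


def all_gather(per_rank: List[Matrix]) -> Matrix:
    """Stand-in for an all_gather along dim 0: concatenate every rank's rows."""
    return [row for shard in per_rank for row in shard]


weight = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]]  # hidden_size=2, num_experts=3
hidden_per_rank = [
    [[1.0, 2.0]],              # DP rank 0 holds one token
    [[3.0, 4.0], [5.0, 6.0]],  # DP rank 1 holds two tokens
]

# Before: gate locally, then gather hidden_states and router_logits (two collectives).
logits_before = all_gather([gate(h, weight) for h in hidden_per_rank])

# After: gather hidden_states once, then gate the gathered tensor (one collective).
hidden_all = all_gather(hidden_per_rank)
logits_after = gate(hidden_all, weight)
```

The trade-off is that each rank now runs the gate over all gathered tokens rather than only its own, in exchange for one fewer collective.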
ApsarasX	0fc9b56d40	[Perf] Improve MLA multistream performance (#1353 ) ### What this PR does / why we need it? > Need to merge after PR #1322 According to benchmark results, this PR brings approximately 1% performance gain. #### Before Improvement Profiling <img width="1147" alt="截屏2025-06-22 14 54 47" src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c" /> Evaluation ``` # server launch command python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \ --quantization ascend \ --served-model-name auto \ --trust-remote-code \ --distributed-executor-backend=mp \ --port 8006 \ -tp=16 \ --max-num-seqs 24 \ --max-model-len 32768 \ --max-num-batched-tokens 8192 \ --block-size 128 \ --no-enable-prefix-caching \ --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \ --gpu-memory-utilization 0.96 # client benchmark command python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \ --random-input-len 4096 \ --random-output-len 1536 \ --num-prompts 200 \ --ignore-eos \ --model auto \ --tokenizer /DeepSeek-R1-W8A8 \ --port 8006 \ --request-rate 1 \ --max-concurrency 24 \ --save-result \ --skip-initial-test \ --metric-percentiles "50,90,99" ``` ``` ============ Serving Benchmark Result ============ Successful requests: 200 Benchmark duration (s): 958.59 Total input tokens: 819200 Total generated tokens: 307200 Request throughput (req/s): 0.2086 Output token throughput (tok/s): 320.47 Total Token throughput (tok/s): 1175.05 ---------------Time to First Token---------------- Mean TTFT (ms): 942.70 Median TTFT (ms): 713.87 P50 TTFT (ms): 713.87 P90 TTFT (ms): 1363.88 P99 TTFT (ms): 2008.73 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 68.96 Median TPOT (ms): 69.49 P50 TPOT (ms): 69.49 P90 TPOT (ms): 70.42 P99 TPOT (ms): 70.72 ---------------Inter-token Latency---------------- Mean ITL (ms): 68.96 Median ITL (ms): 59.88 P50 ITL (ms): 59.88 P90 ITL (ms): 61.59 P99 ITL (ms): 68.82 ================================================== ``` #### After Improvement Profiling <img width="1200" alt="截屏2025-06-22 14 55 42" src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f" /> Evaluation ``` ============ Serving Benchmark Result ============ Successful requests: 200 Benchmark duration (s): 948.08 Total input tokens: 819200 Total generated tokens: 307200 Request throughput (req/s): 0.2110 Output token throughput (tok/s): 324.02 Total Token throughput (tok/s): 1188.08 ---------------Time to First Token---------------- Mean TTFT (ms): 1019.25 Median TTFT (ms): 714.63 P50 TTFT (ms): 714.63 P90 TTFT (ms): 1367.31 P99 TTFT (ms): 2661.52 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 68.14 Median TPOT (ms): 68.68 P50 TPOT (ms): 68.68 P90 TPOT (ms): 69.33 P99 TPOT (ms): 70.30 ---------------Inter-token Latency---------------- Mean ITL (ms): 68.14 Median ITL (ms): 59.04 P50 ITL (ms): 59.04 P90 ITL (ms): 60.93 P99 ITL (ms): 66.89 ================================================== ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `65393ee064` Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-11 08:51:17 +08:00
Mengqing Cao	cc210f46e6	[AscendScheduler][Bugfix] Remove num_draft_tokens while allocating slots (#1718 ) ### What this PR does / why we need it? Now there is no need to calculate `num_draft_tokens` when allocating slots. This PR follows the changes in vllm: https://github.com/vllm-project/vllm/pull/20701 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test - vLLM version: v0.9.2 - vLLM main: `cc876d0f29` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-10 18:47:45 +08:00
Li Wang	c7446438a9	[1/N][CI] Move linting system to pre-commits hooks (#1256 ) ### What this PR does / why we need it? Follow the vllm-project/vllm lint approach: https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml Enable pre-commit to catch low-level errors as much as possible. This PR is one step of #1241, whose purpose is to make the linting system clearer and more convenient. This step enables the following hooks: yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit, enforce-import-regex-instead-of-re. TODO: - clang-format (checks csrc with google style) needs code cleanup, disabled for now - pymarkdown needs code cleanup, disabled for now - shellcheck needs code cleanup, disabled for now ### Does this PR introduce _any_ user-facing change? Only developer UX change: https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally ``` pip install -r requirements-lint.txt && pre-commit install bash format.sh ``` ### How was this patch tested? CI passed with newly added/existing tests. Co-authored-by: Yikun <yikunkero@gmail.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.9.1 - vLLM main: `5358cce5ff` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-10 14:17:15 +08:00
ApsarasX	643e6f5486	[Bugfix] Fix accuracy problem caused by mask pollution (#1678 ) ### What this PR does / why we need it? If a small batch of short requests is sent first, forming a chunk with a length <128, it will corrupt the `attn_mask_cache`, causing subsequent requests that do not form a chunk to have accuracy issues. The root cause of this problem is the use of in-place multiplication. Modifying it to use out-of-place multiplication will resolve the accuracy problem. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Yes. - vLLM version: v0.9.2 - vLLM main: `ad6c2e1a0b` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-10 14:06:49 +08:00
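The failure mode described in this commit, an in-place multiply corrupting a cached mask, can be reproduced in miniature. The sketch makes several assumptions: nested Python lists stand in for tensors (slicing the outer list shares the row objects, much like a tensor view shares storage), `MASK_VALUE` is a made-up scaling factor, and none of the helper names come from vllm-ascend.

```python
from typing import List

Mask = List[List[float]]

MASK_VALUE = -10000.0  # hypothetical factor applied to masked positions


def build_mask(size: int) -> Mask:
    """Causal mask: 1.0 marks a masked (future) position, 0.0 a visible one."""
    return [[1.0 if col > row else 0.0 for col in range(size)]
            for row in range(size)]


def scale_in_place(view: Mask, factor: float) -> Mask:
    """Buggy pattern: mutates the rows that are shared with the cache."""
    for row in view:
        for col in range(len(row)):
            row[col] *= factor
    return view


def scale_out_of_place(view: Mask, factor: float) -> Mask:
    """Fixed pattern: builds new rows and leaves the cache untouched."""
    return [[value * factor for value in row] for row in view]


# A short request only touches the first two rows of the cached mask.
polluted_cache = build_mask(4)
scale_in_place(polluted_cache[:2], MASK_VALUE)

clean_cache = build_mask(4)
scaled = scale_out_of_place(clean_cache[:2], MASK_VALUE)
```

After the in-place call, later requests that read `polluted_cache` see already-scaled rows, which matches the kind of accuracy issue the commit describes.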
ttanzhiqiang	60519c71bd	shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395 ) ### What this PR does / why we need it? When all_reduce_merge is in effect, shared_experts does not do all_reduce in mlp, but waits until both shared_experts and router_experts are completed before doing a single all_reduce. In both prefill and decode, merging the all_reduce of shared_experts+router_experts brings a benefit. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? bash examples/run_dp_attention_etp16.sh bash examples/run_dp_attention_etp16_benmark.sh - vLLM version: v0.9.1 - vLLM main: `977180c912` --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>	2025-07-10 12:07:05 +08:00
ApsarasX	89c1a0f006	[Bugfix] Fix memory-leak caused by dist._functional_collectives.reduce_scatter_tensor (#1380 ) ### What this PR does / why we need it? In some cases, `dist._functional_collectives.reduce_scatter_tensor` can cause its input tensor not to be released immediately after the current layer ends. Instead, it will only be released when the GPU memory usage of the current process reaches a certain threshold (approximately every 15 layers each time). Before Fix <img width="1441" alt="截屏2025-06-24 01 26 13" src="https://github.com/user-attachments/assets/72d5dbb3-c8c8-4778-bf64-8db7bab8aff0" /> After Fix <img width="1475" alt="截屏2025-06-24 01 23 43" src="https://github.com/user-attachments/assets/6c69cfcd-a469-4ee5-b8c6-210aeb3a5bdf" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `9ff2af6d2b` --------- Signed-off-by: ApsarasX <apsarax@outlook.com>	2025-07-10 10:57:24 +08:00
wangxiyuan	b979ee353d	[Misc] Code clean up (#1679 ) Make model_runner_v1 more readable - vLLM version: v0.9.2 - vLLM main: `baed180aa0` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 14:33:40 +08:00
wangxiyuan	392fd7239b	[Misc] Add attention mask (#1673 ) Move attention mask from V0 to common place. - vLLM version: v0.9.2 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 09:12:03 +08:00
wangxiyuan	cc1588be50	[Misc] Code clean up (#1674 ) Remove useless function - vLLM version: v0.9.2 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 08:54:12 +08:00
wangxiyuan	830332ebfc	Clean up v0.9.1 code (#1672 ) vllm has released 0.9.2. This PR drops 0.9.1 support. - vLLM version: v0.9.1 - vLLM main: `b942c094e3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-09 08:52:24 +08:00
NeverRaR	71de52d3a9	feat: add kv cache memory cache and skip dynamo guard (#1549 ) ### What this PR does / why we need it? 1. Sometimes loading the torchair cache fails because NPU memory fluctuates, so this PR adds a new cache that saves the old kv cache bytes to avoid a possible crash while loading the torchair graph cache. 2. When caching is enabled but the cache does not exist yet, the first compilation introduces the overhead of Dynamo Guard. In this case we compile directly twice to skip the guard (this brings a 3-4 ms TPOT improvement). ### Does this PR introduce _any_ user-facing change? Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to control kv cache floating tolerance ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `1fd471e957` Signed-off-by: boying <897013703@qq.com>	2025-07-07 22:37:14 +08:00
NeverRaR	df84cceca8	perf: use multicast to avoid padding decode request to prefill size (#1555 ) ### What this PR does / why we need it? perf: use multicast to avoid padding decode request to prefill size ### How was this patch tested? - vLLM version: v0.9.1 - vLLM main: `1fd471e957` Signed-off-by: boying <897013703@qq.com>	2025-07-07 22:36:03 +08:00
wm901115nwpu	f08c4f15a2	fix spell error (#1654 ) Fix spelling errors in code - vLLM version: v0.9.1 - vLLM main: `923147b5e8` Signed-off-by: unicorn <unicorn@unicorns-MacBook-Pro.local> Co-authored-by: unicorn <unicorn@unicorns-MacBook-Pro.local>	2025-07-07 20:24:42 +08:00
Angazenn	18495f44b2	[BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636 ) ### What this PR does / why we need it? This PR fixes a bug caused by the max_num_tokens_across_dp calculation. In earlier versions, we computed this as graph_pad_size plus the actual max_num_tokens. This results in different max_num_tokens_across_dp values across dp ranks. If padding is required, this might cause wrong padding. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed normally. Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-07-07 20:03:02 +08:00
ApsarasX	c58accc15e	[Bugfix] Support Qwen3-MOE on aclgraph mode (#1381 ) ### What this PR does / why we need it? Fix the shape of the `npu_moe_init_routing` input parameters to support aclgraph mode on qwen3-moe In addition to this PR, resolving the `gatherv3` error might be necessary. See related PR https://github.com/vllm-project/vllm-ascend/pull/1297 https://github.com/vllm-project/vllm-ascend/pull/1446 Thanks to @yiz-liu for providing the idea ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on Qwen3-30B-A3B Closes: https://github.com/vllm-project/vllm-ascend/issues/1368 --------- Signed-off-by: ApsarasX <apsarax@outlook.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-06 15:29:36 +08:00
Vincent Yuan	eb390545ec	[Performance] Disable JIT and nd2nz to improve performance for Atlas 300I series (#1591 ) ### What this PR does / why we need it? Since running on Atlas 300I Duo was initially supported after #1333 , this PR disables the JIT compiler for the 310P and changes the data format to NZ for the weights in the vocabulary embedding and QKV projection layers, which helps improve performance. See #1563 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test manually: https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339 Signed-off-by: Vincent Yuan <farawayboat@gmail.com>	2025-07-05 16:29:21 +08:00
Mengqing Cao	dd22ac38b2	[CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136 ) ### What this PR does / why we need it? 1. run deepseek acc ut per pr --- multicard CI time increased by 9 min 2. run spec decode e2e test on v1 per pr --- singlecard CI time increased by 3 min (partly disabled because it does not work now) ~~3. align the output of whether dbo is enabled or not~~ The generated results with and without dbo cannot be aligned. https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136 4. skip V0 mtp test due to failure in https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816 5. fix some version conflicts ### How was this patch tested? CI passed with newly added tests. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-07-04 18:05:45 +08:00
wangxiyuan	343955c7ac	[CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625 ) This commit `78fe77534b` from vllm reverted the change for FusedMoEParallelConfig. This PR does the same to fix the CI error Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-04 17:54:33 +08:00
Angazenn	a5f33590d3	[CORE]initial support for torchair with non-mla backend (#1506 ) ### What this PR does / why we need it? This PR supports torchair graph mode with non-mla backend on both 800IA2 and 300I Duo platforms. The main change is to add `attention_v1_torchair.py` to support specific attention related operations that are required by torchair. ### Does this PR introduce _any_ user-facing change? Before this PR, vLLM-Ascend only allowed deepseek to use torchair. Now we can also use it with pangu. Besides, we add a supported model list to control which types of models can use torchair. ### How was this patch tested? We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms, and the model generates answers normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:21:42 +08:00
Angazenn	9fbd8017c0	[Quantization]300I Duo support w8a8 quantization (#1560 ) ### What this PR does / why we need it? This pr supports w8a8 on 300I Duo platform. The main change is to use `npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? offline inference on 310p runs normally. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-03 22:12:46 +08:00
wangxiyuan	a45dfde283	[CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602 ) Make CI happy 1. `c1909e7e8c` changed how moeConfig is initialized 2. `48fb076cbc` changed input batch logic. This PR addresses these changes in vllm-ascend. Closes: https://github.com/vllm-project/vllm-ascend/issues/1600 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-03 18:36:17 +08:00
Li Wang	30bf7014d0	[Bugfix] Add func `swap_states` to fix MLA attention (#1580 ) ### What this PR does / why we need it? mla attention is still using the gpu_input_batch's attr `swap_states`, which leads to the error `AttributeError: 'InputBatch' object has no attribute 'swap_states'`. This PR fixes the mla input batch error ### How was this patch tested? will be tested by #1136 --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-02 17:42:53 +08:00
Zhu Yi Lin	6b80c5acba	Fix W8A8 fused moe bug (#1529 ) ### What this PR does / why we need it? 1. drop some useless code for w8a8 fusedmoe 2. Add int8 kv cache check 3. Add more ut. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with newly added tests. --------- Signed-off-by: zhuyilin <809721801@qq.com> Signed-off-by: tianyitang <tangtianyi4@huawei.com> Co-authored-by: tianyitang <tangtianyi4@huawei.com>	2025-07-02 16:40:51 +08:00
wangxiyuan	641a4e6092	[CI] Cache sampled token ids in model runner to fix CI error (#1573 ) ### What this PR does / why we need it? vllm change `7f280d69c9` breaks vllm-ascend. This PR fixes the broken CI ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? passed Closes: https://github.com/vllm-project/vllm-ascend/issues/1572 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-02 12:11:14 +08:00
Pleaplusone	0e43813120	[ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546 ) ### What this PR does / why we need it? This PR (adapted from `2863befce3`) updates the CachedRequestData definition to use a single instance shared across all requests in a batch, instead of creating a new instance per request. CI was found broken by vllm's model_runner change: `ERROR 07-01 09:53:53 [core.py:521] TypeError: 'CachedRequestData' object is not iterable`. This PR modifies the model_runner to fix it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passing CI will verify this. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-07-02 06:05:21 +08:00
Shanshan Shen	8013634e9c	[Structured Output] Remove redundant check for `grammar_bitmask` (#1459 ) ### What this PR does / why we need it? Remove the redundant check since we already check this at https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-06-30 17:39:19 +08:00
whx	f286265791	[BugFix] Address PrefillCacheHit state to fix prefix cache accuracy bug (#1498 ) When using AscendScheduler with prefix-cache enabled and chunk-prefill disabled, there is an accuracy problem because mla_v1 has no branch to handle this scenario. This PR fixes it. Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-06-30 16:51:20 +08:00
Li Wang	5f8241c25c	[V1][ModelRunner] Support pooling model for v1 engine (#1359 ) ### What this PR does / why we need it? Change as little existing code as possible to add support for v1 pooling tasks. Note that `vllm.v1.worker.gpu_input_batch` is moved down into vllm-ascend: upstream interfaces change frequently, so keeping a local copy decouples us from them. ### How was this patch tested? CI passed with newly added/existing tests. A simple test was also first run locally, adapted from https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, as shown below: ```python import os import torch from vllm import LLM os.environ["VLLM_USE_MODELSCOPE"]="True" def get_detailed_instruct(task_description: str, query: str) -> str: return f'Instruct: {task_description}\nQuery:{query}' # Each query must come with a one-sentence instruction that describes the task task = 'Given a web search query, retrieve relevant passages that answer the query' queries = [ get_detailed_instruct(task, 'What is the capital of China?'), get_detailed_instruct(task, 'Explain gravity') ] # No need to add instruction for retrieval documents documents = [ "The capital of China is Beijing.", "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun." ] input_texts = queries + documents model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed") outputs = model.embed(input_texts) embeddings = torch.tensor([o.outputs.embedding for o in outputs]) scores = (embeddings[:2] @ embeddings[2:].T) print(scores.tolist()) # [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]] ``` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wangli <858794774@qq.com> Co-authored-by: wangli <858794774@qq.com>	2025-06-30 16:31:12 +08:00
yiz-liu	75d05ee200	[Core] Fix block table shape to make Prefix cache work with Ascend scheduler (#1446 ) ### What this PR does / why we need it? This fixes the block_table shape issue that was introduced by hybrid kv groups several weeks ago. An error is raised when prefix-cache (eager or not) and Ascend Scheduler are enabled at the same time; sending two identical requests reproduces it. v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1297 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test manually Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-06-30 11:25:19 +08:00
Zhu Yi Lin	b308a7a258	support pangumoe w8a8c8 and docs (#1477 ) ### What this PR does / why we need it? support pangu moe w8a8c8 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added test. Signed-off-by: zhuyilin <809721801@qq.com>	2025-06-28 18:51:07 +08:00
Angazenn	c59d69d9e6	[PERF]support MERRouter (#1421 ) ### What this PR does / why we need it? This PR introduces an expert rearrangement algorithm for the PanguProMoE model. Unlike the original grouped topk, it filters out the top experts that are allocated more tokens. Therefore, we can load fewer experts when calculating gmm. We have tested this algorithm for PanguProMoE-72B on the 300I Duo and 800I A2 platforms. On the 300I Duo platform, we find that setting `num_voted_experts` to 5 achieves both good performance and accuracy, while on 800I A2 we still set it to 8 to use the original pangu grouped topk. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-06-28 16:14:49 +08:00
Angazenn	8fa188111d	[PERF]support H2P communication optimization for PanguProMoe (#1463 ) ### What this PR does / why we need it? In this PR, we support H2P communication optimization when running PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather` to replace `all_reduce` to improve performance: original layer: input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm --> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce now: input_layernorm --> tp all_gather --> attn --> tp reduce_scatter --> post_attention_layernorm --> all_rank all_gather --> moe/mlp --> all_rank reduce_scatter Besides, because `reduce_scatter` requires num_tokens to be divisible by the group size, we need to pad the seqs based on `max_tokens_across_dp`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR has been tested with both offline and online inference using PanguProMoE-72B. --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-06-28 16:10:27 +08:00
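The identity behind this commit, that `reduce_scatter` followed by `all_gather` reproduces `all_reduce`, and the reason padding is needed, can be checked with a small simulation. This is a sketch under assumptions: lists stand in for per-rank tensors, all four helpers are hypothetical, and the padding here rounds up to a multiple of the group size rather than using `max_tokens_across_dp` as the PR does.

```python
from typing import List

Vector = List[float]


def all_reduce(per_rank: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank: List[Vector]) -> List[Vector]:
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world_size = len(per_rank)
    length = len(per_rank[0])
    assert length % world_size == 0, "length must be divisible by group size"
    chunk = length // world_size
    total = [sum(values) for values in zip(*per_rank)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all shards."""
    full = [value for shard in shards for value in shard]
    return [list(full) for _ in shards]


def pad_to_multiple(values: Vector, multiple: int) -> Vector:
    """Zero-pad so the length is divisible by `multiple`."""
    remainder = len(values) % multiple
    if remainder == 0:
        return list(values)
    return list(values) + [0.0] * (multiple - remainder)


# Two ranks with three tokens each: 3 is not divisible by 2, so pad to 4.
inputs = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
padded = [pad_to_multiple(v, len(inputs)) for v in inputs]
shards = reduce_scatter(padded)
gathered = [full[:3] for full in all_gather(shards)]  # drop the padding again
reference = all_reduce(inputs)
```

Splitting the collective this way lets the work between the two halves (layernorm in the PR's pipeline) run on a shard instead of the full tensor.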
Angazenn	5c53cbaf2a	[BugFix]Fix bugs when initializing communication groups with dp on 300I Duo (#1478 ) ### What this PR does / why we need it? This PR fixes a bug caused by using broadcast with cpu_group when running dp. The `broadcast310p` patch takes effect for both cpu_group and device group, but we only need it for the device group. Hence a wrapper is added to let cpu_group use native torch broadcast, which solves the bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? With this PR, DP on 310p runs normally and generates reasonable answers. Signed-off-by: angazenn <zengyanjia@huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com>	2025-06-28 16:07:52 +08:00
Mengqing Cao	5f4391652f	[PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483 ) ### What this PR does / why we need it? Support prompt logprobs in V1. This also enables lm_eval to test accuracy on V1 ### Does this PR introduce _any_ user-facing change? support prompt logprobs output ### How was this patch tested? CI passed with accuracy test. Tested with lm_eval, which uses prompt logprobs as output to measure accuracy: ```bash VLLM_USE_V1=1 lm_eval \ --model vllm \ --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \ --tasks ceval-valid_computer_network \ --batch_size 8 ``` After this PR, the accuracy test results of `Qwen/Qwen2.5-7B-Instruct` on V1 are: ``` | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr| |----------------------------|------:|------|-----:|--------|---|-----:|---|-----:| |ceval-valid_computer_network| 2|none | 0|acc |↑ |0.7368|± |0.1038| | | |none | 0|acc_norm|↑ |0.7368|± |0.1038| ``` Closes: https://github.com/vllm-project/vllm-ascend/issues/1043 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-28 09:38:52 +08:00
Mengqing Cao	d59e7fa095	[CI] Pin transformers<4.53.0 and fix EPLB load_weights to make CI passed (#1482 ) ### What this PR does / why we need it? - Fix vLLM EPLB break `e9fd658a73` by recovering load_weights back to [v0.9.1 version](`07b8fae219`) temporarily. - Fix transformers>=4.53.0 image processor break Related: https://github.com/vllm-project/vllm-ascend/issues/1470 - Mirror torch_npu requirements to pyproject.toml ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-06-28 00:12:43 +08:00
wangxiyuan	5968dff4e0	[Build] Add build info (#1386 ) Add static build_info py file to show soc and sleep mode info. It helps to make the code clean and the error info will be more friendly for users This PR also added the unit test for vllm_ascend/utils.py This PR also added the base test class for all ut in tests/ut/base.py Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-27 09:14:43 +08:00
sdmyzlp	53c2d58ae1	Handle with_prefill_across_dp for multistream mla (#1322 ) ### What this PR does / why we need it? After #1094, decode might be executed in non-compiled mode regardless of `torchair_graph_config.enabled`, causing multistream mla to fail, since it assumes torchair compiled mode for decode when `torchair_graph_config.enabled == True`. Augment that assumption to fix this. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested both offline, and by graph mode mla e2e testcase. --------- Signed-off-by: sdmyzlp <lrwei2@petalmail.com>	2025-06-26 09:32:07 +08:00
yiz-liu	2690697caa	[Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416 ) ### What this PR does / why we need it? Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds asserts in the `GatherV3` operator. Currently, in [`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124), the `position` tensor may contain values that exceed the dimensions of the attention mask, triggering a `GatherV3` boundary check failure. These invalid indices originate from stale “dirty” entries left over in `position` due to padding logic in the ACL graph. Specifically, in [`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989), the variable `num_input_tokens` is always greater than or equal to `total_num_scheduled_tokens`, so any positions not explicitly cleared from a previous batch will persist and cause this sporadic error. BTW, in the original vLLM implementation, masks are constructed internally using other args, so these lingering values do not surface. However, on the Ascend platform—where split-fuse attention requires externally supplied masks—these residual indices become critical and lead to this elusive, hard-to-reproduce failure. The fix is to explicitly reset or zero out all unused entries in the `position` tensor before passing it to `GatherV3`, ensuring that every index lies within the valid range of the attention mask. Closes: https://github.com/vllm-project/vllm-ascend/issues/1038 ### Does this PR introduce _any_ user-facing change? No Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-06-26 09:27:43 +08:00
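The stale-position problem described in this commit can be illustrated with a persistent buffer that is reused across batches. The sketch is hypothetical throughout: `fill_positions` and `gather_mask_rows` are invented names, a bounds-checked Python lookup stands in for the `GatherV3` operator, and the sizes are made up.

```python
from typing import List


def gather_mask_rows(mask: List[List[int]], positions: List[int]) -> List[List[int]]:
    """Stand-in for a bounds-checked gather: every index must address a mask row."""
    rows = []
    for position in positions:
        if not 0 <= position < len(mask):
            raise IndexError(f"position {position} is outside the mask")
        rows.append(mask[position])
    return rows


def fill_positions(buffer: List[int], new_positions: List[int],
                   num_input_tokens: int, reset_unused: bool) -> List[int]:
    """Write this batch's positions into the reused buffer, padded to num_input_tokens."""
    scheduled = len(new_positions)
    buffer[:scheduled] = new_positions
    if reset_unused:
        # The fix: zero the padded tail so no value from an earlier batch survives.
        buffer[scheduled:num_input_tokens] = [0] * (num_input_tokens - scheduled)
    return buffer[:num_input_tokens]


mask = [[1 if col > row else 0 for col in range(4)] for row in range(4)]
previous_batch = [5, 6, 7, 8, 9, 10]  # positions left behind by a longer request

# Two scheduled tokens, padded to four input tokens for the graph.
stale = fill_positions(list(previous_batch), [0, 1], 4, reset_unused=False)
clean = fill_positions(list(previous_batch), [0, 1], 4, reset_unused=True)

try:
    gather_mask_rows(mask, stale)
    stale_failed = False
except IndexError:
    stale_failed = True

clean_rows = gather_mask_rows(mask, clean)
```

Because the failure depends on what the previous batch happened to leave in the buffer, it only shows up for some request sequences, which fits the sporadic behaviour the commit reports.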
Pr0Wh1teGivee	2fda60464c	[Perf] Use fused ops npu_top_k_top_p (#1308 ) ### What this PR does / why we need it? Use fused ops torch_npu.npu_top_k_top_p(logits, p, k) when p and k are not None, otherwise fall back to the original one. The replacement takes place automatically when `VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1` . This patch uses `npu_top_k_top_p`, which requires torch_npu>=2.5.1.post1.dev20250619 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by DeepSeek R1 and UT passed Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-06-25 20:59:06 +08:00
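For readers unfamiliar with what a top-k/top-p filter computes, a plain-Python reference may help. It is an assumption-laden sketch: `apply_top_k_top_p` below is a hypothetical single-row version written for clarity (top-k first, then nucleus filtering over the renormalized survivors, assuming `k >= 1`), and it is not claimed to match `torch_npu.npu_top_k_top_p` or the vLLM implementation element for element.

```python
import math
from typing import List, Optional


def apply_top_k_top_p(logits: List[float], k: Optional[int],
                      p: Optional[float]) -> List[float]:
    """Mask logits outside the top-k / top-p set with -inf (single row)."""
    # Token indices ordered from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if k is not None:
        order = order[:k]
    if p is not None:
        # Softmax over the surviving tokens, then keep the shortest prefix
        # whose cumulative probability reaches p.
        peak = logits[order[0]]
        weights = [math.exp(logits[i] - peak) for i in order]
        total = sum(weights)
        kept, cumulative = [], 0.0
        for index, weight in zip(order, weights):
            kept.append(index)
            cumulative += weight / total
            if cumulative >= p:
                break
        order = kept
    keep = set(order)
    return [value if i in keep else -math.inf for i, value in enumerate(logits)]


logits = [2.0, 1.0, 0.0, -1.0]
both = apply_top_k_top_p(logits, k=3, p=0.8)
only_k = apply_top_k_top_p(logits, k=1, p=None)
untouched = apply_top_k_top_p(logits, k=None, p=None)
```

With `k=3` and `p=0.8`, the two most likely tokens already cover more than 80% of the renormalized mass, so only they survive.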
yuancaoyaoHW	e7efc7e7e7	[BugFix] Remove not using patch_eagle.py for CI. (#1385 ) ### What this PR does / why we need it? This PR aims to address a long-standing CI bug and remove unused code. The specific changes include: 1. Fixing CI Bug: Resolves the root cause of CI test failures or instability. This often stems from incorrect environment configurations, dependency version conflicts, or flawed test script logic. This fix ensures the reliability and consistency of the CI pipeline. 2. Removing `patch_eagle.py`: Deletes the `patch_eagle.py` file, which is no longer utilized by the project. This file was likely legacy code, experimental code, or its functionality has since been replaced by other modules. Its removal helps reduce codebase complexity, improves maintainability, and prevents potential confusion. ### Does this PR introduce _any_ user-facing change? No, this PR primarily focuses on internal CI stability maintenance and code cleanup. It does not introduce any user-visible changes to APIs, interfaces, or other behaviors. ### How was this patch tested? CI passed. Specifically: 1. Existing CI Pipelines Passed: After fixing the CI bug, all existing CI tests and pipelines were verified to run correctly and pass successfully. 2. Code Cleanup Verified: Following the removal of `patch_eagle.py`, it was ensured that any related functional modules (if applicable) continue to work as expected, without introducing new regressions. This was typically verified by running the project's main test suite. Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>	2025-06-25 20:36:05 +08:00
sharonyunyun	941269a6c5	adjusting the communication method in graph mode (#1194 ) ### What this PR does / why we need it? Communication performance optimization: replace allreduce with reduce_scatter + all_gather in the MLA layer's TP group, to remove the strided slice and all_gather in the MoE layer. When tp > 1, it is enabled during the decode phase of graph mode when enable_multistream_moe, MLA, use_v1, and MC2 are used. According to the end-to-end RL inference test results, this PR can bring a 3% gain in the decode stage. Before Improvement Profiling kernel_details ![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003) Evaluation ![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7) ![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057) After Improvement Profiling kernel_details ![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e) Evaluation ![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0) ![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4) ### Does this PR introduce _any_ user-facing change? Users need to configure enable_multistream_moe=True ### How was this patch tested? Add e2e test cases to cover code logic Signed-off-by: sharonyunyun <zhangying134@huawei.com>	2025-06-25 19:56:49 +08:00
wangxiyuan	ca884ef86d	[Misc] Clean up useless code for LLM initialize (#1373 ) This PR aims to clean up the useless code for LLM setup. It helps to make the code clearer. 1. remove useless `self.xxx` property 2. change `set_random_seed` to `seed_everything` 3. remove `set_custom_all_reduce`, it's only used for cuda This is just a code cleanup; there is no change to any code logic. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-25 16:20:14 +08:00
Mengqing Cao	52317f92cb	[DP] Tiny fix of dp and update example (#1273 ) ### What this PR does / why we need it? Add `max_num_tokens_across_dp` to AscendMetadata to fix DP. This PR fixes the bug introduced by https://github.com/vllm-project/vllm-ascend/pull/1229, which added an arg `max_num_tokens_across_dp` when dp_size > 1. Signed-off-by: MengqingCao <cmq0113@163.com>	2025-06-25 11:03:04 +08:00
Li Wang	5f5800ba42	[Bugfix] Sync MRotaryEmbedding interface change to recover CI (#1399 ) ### What this PR does / why we need it? Sync MRotaryEmbedding interface change to recover main CI (https://github.com/vllm-project/vllm/pull/19939) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-06-24 22:56:39 +08:00
wangxiyuan	9cbce423ce	[MISC] Remove useless patch (#1366 ) ### What this PR does / why we need it? `stateless_init_dp_group` in vllm works with non-cuda platforms now. Remove this useless patch, which was introduced in vllm-ascend by `e74331a1ed` (v0.8.4rc2). vLLM upstream merged: `3e472d882a` (v0.8.0) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-06-24 10:05:59 +08:00
lyj-jjj	5177bef87a	support fused_moe_allgather_ep (#1335 ) ### What this PR does / why we need it? support fused_moe_allgather_ep ### How was this patch tested? It was tested by UT. Signed-off-by: lyj-jjj <liuyingjun5@huawei.com>	2025-06-23 22:03:38 +08:00

267 Commits