xc-llm-ascend

Author	SHA1	Message	Date
Icey	18221c0e1d	[Fusion] normalize fusion naming and enable e2e test (#4693 ) ### What this PR does / why we need it? This PR standardizes the fusion naming, changing `enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e testing. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-11 17:53:43 +08:00
zhangyiming	66b0781840	[E2E] Refactor the e2e testcases. (#4789 ) ### What this PR does / why we need it? Refactor the e2e testcases. - tests/e2e/multicard/test_weight_loader.py: Remove the unused code. - tests/e2e/singlecard/multi-modal/test_internvl.py: Move to accuracy test. - tests/e2e/singlecard/test_aclgraph.py: Rename the file. - tests/e2e/singlecard/test_embedding_aclgraph.py : Combine with tests/e2e/singlecard/test_bge_model.py - tests/e2e/singlecard/test_completion_with_prompt_embeds.py: Delete eager mode and modify model to Qwen3-0.6B - tests/e2e/singlecard/test_quantization.py: Modify model to Qwen3-0.6B-W8A8 - tests/e2e/singlecard/test_vlm.py: Modify model to Qwen3-VL-8B - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: menogrey <1299267905@qq.com>	2025-12-11 10:15:00 +08:00
lianyibo	e32014ac1d	[Model] Support pooling models (#3122 ) ### What this PR does / why we need it? Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend, this pr covered the three model types of embed (cls_token, mean_token, lasttoken). After this [commit](`17373dcd93`), vllm has provided support for adapting pooling models on the v1 engine. This PR includes corresponding adaptations on the vllm-ascend side. Fixes #1960 - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lianyibo <lianyibo1@kunlunit.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-12-10 11:37:57 +08:00
Ronald	1a7a34c5ec	add e2e test for mtp async_scheduling (#4826 ) ### What this PR does / why we need it? add e2e test for mtp async scheduling ### Does this PR introduce _any_ user-facing change? no - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-10 11:30:22 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814 ) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
Li Wang	1c70f5c922	[CI] Skip `test_suffix_correctness` (#4820 ) ### What this PR does / why we need it? Currently, suffix decoding has a known correctness issue; see https://github.com/vllm-project/vllm-ascend/actions/runs/20033509824/job/57457565620?pr=4781 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-09 11:48:13 +08:00
LuLina	afe00505de	[Fix] skip xlite e2e test (#4786 ) ### What this PR does / why we need it? Because xlite and eager modes use different operators and a different execution order, their outputs show slight precision discrepancies. This patch skips the xlite e2e tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.12.0 vLLM main: `ad32e3e19c` Signed-off-by: lulina <lina.lulina@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 16:48:15 +08:00
wangxiyuan	0b65ac6c4b	remove useless patch (#4699 ) patch_config is useless now. Let's remove it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-08 11:02:42 +08:00
fluctlux	9fbcfa36af	[CI] Fix ngram & suffix test oom (#4755 ) ### What this PR does / why we need it? Avoid oom during CI by using `with VllmRunner` instead of `LLM()`, and enable `test_ngram_correctness` ### How was this patch tested? CI passed. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 09:26:29 +08:00
LuLina	2be0fe2691	[Feat] Add Euler xlite graph wrapper support (#4526 ) ### What this PR does / why we need it? This patch adds support for the xlite graph wrapper to vllm_ascend. Xlite provides operator implementations of the transformer network on Ascend hardware. For details about xlite, please refer to the following link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md The latest performance comparison data between xlite and the default aclgraph mode is as follows: ## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`c4a71fc6`) - xlite-full: main(`c4a71fc6`) + xlite-full - xlite-decode-only: main(`c4a71fc6`) + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph ### Does this PR introduce _any_ user-facing change? Enable the xlite graph mode by setting xlite_graph_config: --additional-config='{"xlite_graph_config": {"enabled": true}}' # Enabled for decode only --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' # Enabled for prefill and decode - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lulina <lina.lulina@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 08:27:46 +08:00
zhaomingyu13	cb42564942	[BugFix] Fix eagle3 accuracy problem when enforce_eager=True (#4521 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3 }, enforce_eager=True, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:31:26 +08:00
Ronald	3480094d7c	support async mtp (#4511 ) ### What this PR does / why we need it? This PR adds async_scheduling support for mtp, following vLLM PR https://github.com/vllm-project/vllm/pull/24799. It also fixes some synchronization problems in vllm-ascend. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-06 17:15:57 +08:00
Icey	178ca1607e	Adopt inductor fusion and define quantization fusion pass (#4168 ) ### What this PR does / why we need it? The main goal of this PR is to alleviate the high maintenance burden caused by model duplication when we optimize models. Some of our optimized models diverge only a little from vLLM's modeling but still need to rewrite several parts of the original, which brings a non-negligible maintenance burden to vllm-ascend. To solve that, we propose to leverage `torch.compile` and the `inductor pattern matcher` to automatically fuse the patterns we want to merge. For more details, refer to the RFC https://github.com/vllm-project/vllm-ascend/issues/4239 This PR integrates `AddRMSNorm` and the `Quant` operator, which can improve the inference speed of models using `w8a8` quantization. ### Does this PR introduce _any_ user-facing change? Yes, add new additional_config ### How was this patch tested? ```python def main(): prompts = [ "The president of the United States is Mr.", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( model="/root/.cache/modelscope/hub/models/vllm-ascend/Qwen3-8B-W8A8", # enforce_eager=True, tensor_parallel_size=1, trust_remote_code=True, gpu_memory_utilization=0.7, quantization="ascend", ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` ```text Prompt: 'The president of the United States is Mr.', Generated text: ' Trump. The president of the United States is Mr. Biden. Which of the following statements is correct? \n\nA. Mr. Trump is Mr. Biden. \nB. Mr. Trump is not Mr. Biden. \nC. The president of the United States is not Mr. Trump. \nD. The president of the United States is not Mr. Biden.\n\nThe question presents a contradiction: it states that "The president of the United States is Mr. Trump" and "The president of' ``` - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-04 10:29:48 +08:00
wangxiyuan	400af665e6	[CI] Drop ascend scheduler from test (#4613 ) Drop ascend scheduler from test - vLLM version: v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-02 13:18:17 +08:00
wangxiyuan	981a14f8d5	[CI]enable chunked prefill by default (#4569 ) set `enable_chunked_prefill` to True for e2e test by default to keep the same behavior with vLLM - vLLM version: v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-02 08:54:34 +08:00
wangxiyuan	27b09ca9b9	[CI] drop ascend scheduler test (#4582 ) Let's drop the ascend scheduler test first to ensure all functions work without it. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-01 20:33:50 +08:00
wangxiyuan	0d14f635b4	upgrade torch npu version (#4433 ) The vLLM graph feature now relies on torch >=2.8. To make graph mode work, we need to upgrade the torch version as well. For long-term support, upgrading torch to a newer version is also worthwhile. Related vLLM change: https://github.com/vllm-project/vllm/pull/25110 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2	2025-12-01 19:01:55 +08:00
fluctlux	f1f6370ed9	[Feature] Integrate Suffix Spec Decoding (#4045 ) ### What this PR does / why we need it? This PR integrate suffix decoding (https://arxiv.org/abs/2411.04975) from vllm (https://github.com/vllm-project/vllm/pull/25784) # Suffix Decoding is a dynamic n-gram matching method that: 1. Uses suffix trees to generate speculative tokens quickly using branch frequency counts. 2. Can keep a history of prior model responses, which tends to work very well with repetitive agentic use cases. 3. Can be dynamically updated with newly generated tokens, and FIFO eviction of older requests. # ### Does this PR introduce _any_ user-facing change? This feature should be implemented as opt-in and remain seamless for users who do not require suffix speculative decoding. For users who wish to enable it, they must first install arctic-inference: `pip install arctic-inference ` After installation, the suffix speculative decoding feature can be enabled using the following speculative config: `--speculative_config '{"method": "suffix", "num_speculative_tokens": 5}' ` ### How was this patch tested? This PR is currently being tested on vLLM main:`83f478bb19` with PR https://github.com/vllm-project/vllm/pull/25784 In our previous testing, suffix decoding achieved a 13%-30% throughput improvement over n-gram on the sonnet dataset, tested on vllm-ascend v0.9.1 with concurrency ranging from 2 to 40. - vLLM version: v0.11.2 --------- Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>	2025-12-01 18:41:42 +08:00
Mengqing Cao	c84efeae25	[CI] Skip test_ngram_correctness as the oom issue block CI (#4578 ) ### What this PR does / why we need it? Skip test_ngram_correctness as the oom issue block CI related CI failure: https://github.com/vllm-project/vllm-ascend/actions/runs/19780591780/job/56680823606 ### Does this PR introduce _any_ user-facing change? N/A - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 Signed-off-by: MengqingCao <cmq0113@163.com>	2025-11-30 01:34:50 +08:00
Mengqing Cao	517fd9272d	Revert "drop ascend scheduler" (#4580 ) Reverts vllm-project/vllm-ascend#4498 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2	2025-11-29 22:20:48 +08:00
wangxiyuan	f10acddb78	drop ascend scheduler (#4498 ) Ascend scheduler was added for non chunk prefill case before, since that the npu ops didn't work well with chunked prefill. Now the ops with chunked prefill work better, it's time to remove the ascend scheduler to use vLLM default scheduler. - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-29 16:18:34 +08:00
SILONG ZENG	ab37a7d5ae	[main]Upgrade cann to 8.3rc2 (#4350 ) ### What this PR does / why we need it? Upgrade cann to 8.3rc2 ### Does this PR introduce _any_ user-facing change? Yes, docker image will use 8.3.RC2 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2025-11-28 14:06:01 +08:00
wangxiyuan	bc69d7cfe1	upgrade to vllm 0.11.2 (#4400 ) Bump vLLM version to v0.11.2 What's broken and changed by vLLM: 1. structured_output is broken by https://github.com/vllm-project/vllm/pull/26866 2. get_mrope_input_positions is broken by https://github.com/vllm-project/vllm/pull/28399 3. graph mode is broken by https://github.com/vllm-project/vllm/pull/25110 we'll upgrade torch to 2.8 to fix the problem later 4. embedding is broken by https://github.com/vllm-project/vllm/pull/27583 5. `get_attn_backend_cls` and attention backend are broken by https://github.com/vllm-project/vllm/pull/28534 6. spec decode is broken by https://github.com/vllm-project/vllm/pull/28771 7. sp feature is broken by https://github.com/vllm-project/vllm/pull/27126 8. mtp is broken by https://github.com/vllm-project/vllm/pull/27922 9. lora is broken by https://github.com/vllm-project/vllm/pull/21068 10. execute_model is broken by https://github.com/vllm-project/vllm/pull/26866 11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by https://github.com/vllm-project/vllm/pull/28159 12. kv cache is broken by https://github.com/vllm-project/vllm/pull/27753 13. dp is broken by https://github.com/vllm-project/vllm/pull/25110 What's broken and changed by ourselves: 1. qwen vl is broken by https://github.com/vllm-project/vllm/pull/28455 We'll remove model files in the future to avoid this kind of error 2. Engine core is broken by https://github.com/vllm-project/vllm/pull/23691 We'll remove the patch file in the future. 3. Ascend scheduler is broken by https://github.com/vllm-project/vllm/pull/28733 We'll remove ascend scheduler later. 4. qwen3-next is broken by https://github.com/vllm-project/vllm/pull/28083 We'll remove model files in the future to avoid this kind of error 5. qwen vl is broken by https://github.com/vllm-project/vllm/pull/27764. We'll remove model files in the future Known issue: 1. ray doesn't work 2. the accuracy of qwen3-next is not correct 3. qwen3-vl is broken 4. prefix cache + ascend scheduler + deepseek v2 lite is broken. Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: shen-shanshan <467638484@qq.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-11-26 11:48:58 +08:00
XiaoxinWang	5508a602ed	[Fix] fix aclgraph e2e test. (#4131 ) ### What this PR does / why we need it? Due to the inconsistency between the attention operators used in eager mode and graph mode, the accumulation order of the operator cannot be guaranteed to be deterministic. Therefore, we modify the test to compare with given outputs. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-24 17:22:03 +08:00
wangxiyuan	a1f142b7ad	Drop 0.11.0 support (#4377 ) There is a lot of hack code for v0.11.0, which makes it hard to upgrade the code to a newer vLLM version. Since v0.11.0 will release soon, let's drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-24 17:08:20 +08:00
realliujiaxu	1cdf9ffa73	[Bugfix] fix hang in async scheduling (#4233 ) ### What this PR does / why we need it? After https://github.com/vllm-project/vllm-ascend/pull/4113, there is no synchronization between steps. However, in async scheduling with aclgraph, it is possible that the CPU's record event for the current iteration completes before the previous iteration's graph execution has finished. If the CPU is fast enough, the device will hang on event_wait in iteration i+1 (assuming that event_record is executed immediately on the update stream of the device): <img width="1812" height="489" alt="image" src="https://github.com/user-attachments/assets/373fe655-afe5-4d7d-807e-b0aacf24a543" /> After adding synchronization, record is launched after graph replay: <img width="1803" height="466" alt="image" src="https://github.com/user-attachments/assets/a8a68053-bd7d-49f5-a79c-9a26ef1285cc" /> The bubble time caused by synchronization is about 85 us on G8600: <img width="1491" height="804" alt="image" src="https://github.com/user-attachments/assets/968611ee-f39a-4329-8150-1c4adba25dd1" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Co-authored-by: hwhaokun <haokun0405@163.com>	2025-11-19 14:47:19 +08:00
XiaoxinWang	e38ef2c434	support FULL graph mode for GQA (#3970 ) ### What this PR does / why we need it? The current library only supports the FullDecodeOnly graph mode, which enables full graph execution during the decode. This PR extends support to allow full graph execution in both the prefill and decode, referred to as FULL graph mode. - vLLM version: v0.11.0 - vLLM main: `2918c1b49c` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-17 10:50:35 +08:00
XiaoxinWang	1b4ce63ec9	fix fullgraph in ds. (#4016 ) ### What this PR does / why we need it? DS doesn't have an 'AscendAttentionMetadataBuilder' class, so it fails in fullgraph. We resolved the issue by modifying the code to only check for 'GDNAttentionMetadataBuilder', while all other attention cases follow the default branch. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-11-12 10:11:43 +08:00
wangxiyuan	f811a24bf0	Remove VLLM_USE_V1 (#4086 ) Drop VLLM_USE_V1 usage. This env has been removed from vLLM already. - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-11 15:43:39 +08:00
zhaomingyu13	7ffbe73d54	[main][Bugfix] Fix ngram precision issue and open e2e ngram test (#4090 ) ### What this PR does / why we need it? Fix ngram precision issue and open e2e ngram test - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: Icey <1790571317@qq.com>	2025-11-11 09:06:24 +08:00
Levi	0a62e671fb	[Feat] flashcomm_v2 optim solution (#3232 ) ### What this PR does / why we need it? Supports generalized FlashComm2 optimization, which reduces communication overhead, decreases RmsNorm computation, and saves one AllGather step by replacing Allreduce operations in the Attention module with pre-AlltoAll and post-AllGather operations (used in combination with FlashComm1). This feature is enabled during the Prefill phase and is recommended to be used together with FlashComm1, delivering broad performance improvements, especially in long sequence scenarios with large tensor parallelism (TP) configurations. Benchmark tests show that under TP16DP1 configuration, it can improve the prefill performance of the DeepSeek model by 8% on top of FlashComm1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: zzhxx <2783294813@qq.com> Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: zzhxx <2783294813@qq.com>	2025-11-10 11:01:45 +08:00
wangx700	24d6314718	[Bugfix] fix sleepmode level2 e2e test (#4019 ) ### What this PR does / why we need it? enable sleepmode level2 e2e test and add the check logic to ensure the nz is not enabled. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? use e2e tests - vLLM version: v0.11.0 - vLLM main: `83f478bb19` Signed-off-by: wangx700 <wangxin700@huawei.com>	2025-11-08 14:11:55 +08:00
Canlin Guo	f99762eb25	[E2E][MM] Add e2e tests for InternVL model (#3796 ) ### What this PR does / why we need it? As a validation for #3664, add end-to-end tests to monitor the InternVL model and ensure its continuous proper operation. This PR is only for single-card. So the models that have more parameters than 8B like 78B are needed to test using multi-cards. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? `pytest -sv tests/e2e/singlecard/multi-modal/test_internvl.py` - vLLM version: v0.11.0 - vLLM main: `83f478bb19` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2025-10-31 15:42:47 +08:00
lilinsiman	35a913cf1e	add new e2e tests case for aclgraph memory (#3879 ) ### What this PR does / why we need it? add new e2e tests case for aclgraph memory ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-31 09:16:52 +08:00
Song Zhixin	216fc0e8e4	[feature] Prompt Embeddings Support for v1 Engine (#3026 ) ### What this PR does / why we need it? this PR based on [19746](https://github.com/vllm-project/vllm/issues/19746), support Prompt Embeddings for v1 engine on NPU ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ```python python examples/prompt_embed_inference.py ``` - vLLM version: v0.11.0 - vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.1 --------- Signed-off-by: jesse <szxfml@gmail.com>	2025-10-30 17:15:57 +08:00
xuyexiong	eff3e5fc6f	[FEAT] Refactor spec decode to support efficient padded speculation (#3528 ) ### What this PR does / why we need it? 1. Refactor the file `mtp_proposer.py` and split torchair-related code into `mtp_torchair_proposer.py` 2. Following https://github.com/vllm-project/vllm/pull/24539, implement padded speculative decoding as described in https://github.com/vllm-project/vllm/issues/21984. ### Does this PR introduce _any_ user-facing change? Users can set `disable_padded_drafter_batch` to disable/enable padded speculation; the default is `False`. Offline example: ``` speculative_config={"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": False} ``` ### How was this patch tested? - [x] eager with pad/unpad - [x] aclgraph with pad/unpad - [x] torchair with pad/unpad Performance test of deepseek-r1 with tp16, dp1: aclgraph with pad ITL: 168ms aclgraph with unpad ITL: 169ms original: 178ms - vLLM version: v0.11.0rc3 - vLLM main: `83f478bb19` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-10-30 16:53:05 +08:00
Li Wang	60ee4af6d0	[CI] Add custom op to nightly (#3765 ) ### What this PR does / why we need it? 1. Add custom op to nightly tests, fix https://github.com/vllm-project/vllm-ascend/pull/3665 2. Correctly pass github secrets when using workflow_call, see https://docs.github.com/en/actions/how-tos/reuse-automations/reuse-workflows 3. Fix the single node mutual cancellation issue - vLLM version: v0.11.0rc3 - vLLM main: `c9461e05a4` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-10-27 14:07:03 +08:00
Icey	d9cdc65854	Upgrade to new vllm commit (#3719 ) ### What this PR does / why we need it? Upgrade to new vllm commit: `c9461e05a4` - Fix many imports, caused by https://github.com/vllm-project/vllm/pull/26908 - Fix import ```sha256```, caused by https://github.com/vllm-project/vllm/pull/27169 - Remove ```SchedulerConfig.send_delta_data```, caused by https://github.com/vllm-project/vllm/pull/27142 - Fix ```FusedMoE``` because of dual stream execution, caused by https://github.com/vllm-project/vllm/pull/26440 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: MengqingCao <cmq0113@163.com>	2025-10-25 15:36:32 +08:00
weichen	63c363d3de	[Refactor] [MoE] Rename moe-related classes & files (#3646 ) ### What this PR does / why we need it? 1. Rename common_fused_moe.py to fused_moe.py. 2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize to prepare_finalize.py / PrepareAndFinalize. 3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe. 4. Move vllm_ascend/ops/fused_moe.py to vllm_ascend/ops/fused_moe/fused_moe.py ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-25 11:22:03 +08:00
whx	d5609e2c48	[BugFix] Comment out newly added vlm e2e. (#3736 ) This PR comments out the newly added vlm e2e test for the ascend scheduler scenario because it gets stuck when running in multi-batch. The test needs to be added back after this issue is resolved. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-25 10:34:59 +08:00
whx	e33751ef8b	[BugFix][Core] Fix a bug running multi-modal with ascend_scheduler (#3675 ) This PR fixes a bug in running multi-modal models with AscendScheduler. The bug was introduced by PR #2372, which used the same parameter names as vLLM but with different default values. For now, the fix changes the default values of these two parameters to align with vLLM. - vLLM version: v0.11.0rc3 - vLLM main: `17c540a993` Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>	2025-10-25 09:41:33 +08:00
Mengqing Cao	cea0755b07	[1/N][Refactor] Refactor code to adapt with vllm main (#3612 ) ### What this PR does / why we need it? This is step 1 of refactoring the code to adapt to vllm main; this PR is aligned with `17c540a993` 1. refactor deepseek to the latest code arch as of `17c540a993` 2. bunches of fixes due to vllm changes - Fix `AscendScheduler` `__post_init__`, caused by https://github.com/vllm-project/vllm/pull/25075 - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by https://github.com/vllm-project/vllm/pull/26296 - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by https://github.com/vllm-project/vllm/pull/23485 - Fix `MLAAttention` import, caused by https://github.com/vllm-project/vllm/pull/25103 - Fix `SharedFusedMoE` import, caused by https://github.com/vllm-project/vllm/pull/26145 - Fix `LazyLoader` import, caused by https://github.com/vllm-project/vllm/pull/27022 - Fix `vllm.utils.swap_dict_values` import, caused by https://github.com/vllm-project/vllm/pull/26990 - Fix `Backend` enum import, caused by https://github.com/vllm-project/vllm/pull/25893 - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by https://github.com/vllm-project/vllm/pull/26355 - Fix fused_moe ops, caused by https://github.com/vllm-project/vllm/pull/24097 - Fix bert model because of `inputs_embeds`, caused by https://github.com/vllm-project/vllm/pull/25922 - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by https://github.com/vllm-project/vllm/pull/24172 - Fix `splitting_ops` changes introduced by https://github.com/vllm-project/vllm/pull/25845 - Fix multi-modality changes introduced by https://github.com/vllm-project/vllm/issues/16229 - Fix lora bias dropping issue introduced by https://github.com/vllm-project/vllm/pull/25807 - Fix structured output break introduced by https://github.com/vllm-project/vllm/issues/26737 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? CI passed with existing test. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: Icey <1790571317@qq.com>	2025-10-24 16:55:08 +08:00
whx	1b270a64bd	[MoE][Multistream] Avoid performing communication in extra stream. (#3582 ) This PR moves the communication operation of shared experts out of extra stream because I found that this might cause rtMemcpy related errors when running shared experts multistream with aclgraph. Furthermore, I utilize a global variable as extra stream object to avoid allocating streams for each layer in full-graph mode. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-10-24 10:44:38 +08:00
Yizhou	b13d22bf5a	[Fix] Fixes attribute error in MLA implementation (#3618 ) ### What this PR does / why we need it? Corrects the attribute access for retrieving the device from `q_a_proj` to `q_proj`. This prevents an `AttributeError` as `q_a_proj` does not exist on the class instance. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Need MLAPO tests. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-23 09:12:50 +08:00
KyrieWang	60e2be1b36	[Feat] Dynamic Batch Feature (#3490 ) [RFC](https://github.com/vllm-project/vllm-ascend/issues/3328) for more details. Add dynamic batch feature in chunked prefilling strategy, the token budget can be refined to achieve better effective throughput and TPOT. !!! NOTE: only 910B3 is supported till now, we are working on further improvements. Additional file for lookup table is required. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: Cheng Wang <wangchengkyrie@outlook.com>	2025-10-22 14:13:32 +08:00
weichen	2f1b9a7a64	Reapply "[MoE] [Refactor] Remove manual memory cleanup (#3365 )" (#3483 ) (#3512 ) ### What this PR does / why we need it? 1. Replace manual memory cleanup with passing parameter. 2. FusedMoEPrepareAndFinalizeWithMC2 inherits All2All avoid duplicated code. 3. Fix MC2 bug introduced in https://github.com/vllm-project/vllm-ascend/pull/3365 4. Unify aclgraph & eager in W8A8_dynamic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-10-22 11:41:30 +08:00
Chen Chen	6b290acfe1	remove redundant params in mla_preprocess kernel (#3530 ) ### What this PR does / why we need it? This pull request removes the redundant parameters `gamma1` and `beta1` (also named `gamma0`/`beta0` in some places) from the `mla_preprocess` kernel and its calling hierarchy. The changes are consistent across C++ kernel code, bindings, and Python call sites. The parameters were unused in the lower-level functions, so their removal is a good cleanup. ### Does this PR introduce _any_ user-facing change? The python interface of the kernel is affected, and the params of `gamma0` and `beta0` are not needed. ### How was this patch tested? The unit-test of the kernel is adapted accordingly. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: mojave2 <chenchen145@huawei.com>	2025-10-21 19:20:13 +08:00
Yizhou	ec1d2b5c04	[Test] Temporarily skip flaky ACL graph test (#3577 ) ### What this PR does / why we need it? Disables `FULL_DECODE_ONLY` end-to-end test that fails intermittently. This prevents CI blockages while the root cause of the flakiness is investigated. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? None needed. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-10-21 17:16:15 +08:00
lilinsiman	70bef33f13	add new accuracy test case for aclgraph (#3390 ) ### What this PR does / why we need it? Add new accuracy test case Deepseek-V2-Lite-W8A8 for aclgraph ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2025-10-20 20:04:04 +08:00
anon189Ty	248ee7fa11	[Feat]Make full graph mode compatible with MTP (#3276 ) ### What this PR does / why we need it? Make Full Graph mode able to run with MTP. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2025-10-17 20:19:56 +08:00
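
Several commits in this log (#4526 for xlite, #4045 for suffix decoding) quote JSON values that are passed on the command line via `--additional-config` and `--speculative_config`. The sketch below only shows one way to build those JSON strings without hand-quoting them. The helper names `xlite_additional_config` and `suffix_speculative_config` are hypothetical and not part of vllm-ascend; the keys and values are copied from the commit messages and are not validated against any vLLM version here.

```python
import json


def xlite_additional_config(full_mode: bool = False) -> str:
    """Hypothetical helper: JSON value for --additional-config, per #4526.

    Per the commit message, {"enabled": true} enables xlite for decode only,
    and adding "full_mode": true enables it for prefill and decode.
    """
    xlite_cfg = {"enabled": True}
    if full_mode:
        xlite_cfg["full_mode"] = True
    return json.dumps({"xlite_graph_config": xlite_cfg})


def suffix_speculative_config(num_speculative_tokens: int = 5) -> str:
    """Hypothetical helper: JSON value for --speculative_config, per #4045."""
    return json.dumps(
        {"method": "suffix", "num_speculative_tokens": num_speculative_tokens}
    )


if __name__ == "__main__":
    # Print the strings so they can be pasted after the respective flags.
    print("--additional-config", repr(xlite_additional_config()))
    print("--additional-config", repr(xlite_additional_config(full_mode=True)))
    print("--speculative_config", repr(suffix_speculative_config()))
```

Building the value with `json.dumps` keeps booleans as lowercase `true`, matching the form quoted in the commit messages; whether a given release still accepts these keys should be checked against that release's documentation.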

227 Commits