xc-llm-ascend

Author	SHA1	Message	Date
Yikun Jiang	a58b43b72c	Remove git .extraheader and fetch all commits in /vllm-workspace/vllm-ascend (#2746 ) ### What this PR does / why we need it? Remove git .extraheader and fetch all commits in /vllm-workspace/vllm-ascend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Closes: https://github.com/vllm-project/vllm-ascend/issues/2735 - vLLM version: v0.10.1.1 - vLLM main: `51d5e9be7d` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-09-05 09:45:11 +08:00
henryxuxu0716	51a2aec115	Delete redundant codes related to communication (#2717 ) ### What this PR does / why we need it? Delete redundant codes related to communication ### Does this PR introduce _any_ user-facing change? not involve ### How was this patch tested? not involve - vLLM version: v0.10.1.1 - vLLM main: `6c7af8110a` --------- Signed-off-by: 刘哲续 <liuzhexu1@huawei.com> Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>	2025-09-05 09:39:39 +08:00
1092626063	5b3646ab21	[FEATURE][MTP] Support MTP > 1 (#2708 ) ### What this PR does / why we need it? [RFC: Support MTP > 1 for DeepSeek](https://github.com/vllm-project/vllm-ascend/issues/2745) - [x] dp1 tp16 - [x] dp4 tp4 - [x] dp2 tp8 - [x] torchair graph - vLLM version: v0.10.1.1 - vLLM main: `c9f7081f9c` Signed-off-by: 1092626063 <1092626063@qq.com>	2025-09-05 09:11:22 +08:00
yiz-liu	83eb40a51c	[Fix][MoE] Refine MoE communication strategy (#2734 ) ### What this PR does / why we need it? Refactors the Mixture-of-Experts (MoE) communication method selection logic. The choice between all-gather, all-to-all, and mc2 is now determined by expert parallel configuration, SoC version (A2/A3), and token count for better performance. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Added. - vLLM version: v0.10.1.1 - vLLM main: `eafa8dcde6` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-09-05 09:04:04 +08:00
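The selection described in #2734 can be pictured roughly as below. This is only a sketch under stated assumptions, not the code from the PR: the function name `select_moe_comm_method`, the enum names, and both numeric cutoffs are hypothetical placeholders, chosen only to show how expert parallel configuration, SoC version, and token count could feed one decision.

```python
from enum import Enum


class SocVersion(Enum):
    A2 = "A2"
    A3 = "A3"


class MoECommMethod(Enum):
    ALL_GATHER = "allgather"
    ALL_TO_ALL = "alltoall"
    MC2 = "mc2"


# Hypothetical cutoffs for illustration only; the real values are in the PR.
ASSUMED_MC2_TOKEN_CAPACITY = 512
ASSUMED_MIN_EP_SIZE_FOR_MC2_ON_A2 = 16


def select_moe_comm_method(enable_expert_parallel: bool, ep_size: int,
                           soc: SocVersion, num_tokens: int) -> MoECommMethod:
    """Pick a method from EP config, SoC version, and token count (sketch)."""
    # Without expert parallelism there is nothing to dispatch across ranks.
    if not enable_expert_parallel:
        return MoECommMethod.ALL_GATHER

    small_batch = num_tokens <= ASSUMED_MC2_TOKEN_CAPACITY

    if soc is SocVersion.A2:
        # Assumed rule: mc2 on A2 needs a small batch and a large enough EP group.
        if small_batch and ep_size >= ASSUMED_MIN_EP_SIZE_FOR_MC2_ON_A2:
            return MoECommMethod.MC2
        return MoECommMethod.ALL_GATHER

    # Assumed rule for A3: mc2 for small batches, all-to-all for large ones.
    return MoECommMethod.MC2 if small_batch else MoECommMethod.ALL_TO_ALL


if __name__ == "__main__":
    print(select_moe_comm_method(True, 16, SocVersion.A3, 4096).value)
```

Keeping the decision in one pure function makes each branch easy to unit-test without NPU hardware; the actual branch order and thresholds should be taken from the PR itself.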
liziyu	4c90fa79ca	[Misc] Remove useless PD check in deepseek (#2739 ) ### What this PR does / why we need it? Remove useless PD check in deepseek - vLLM version: v0.10.1.1 - vLLM main: `6c7af8110a` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2025-09-04 22:22:19 +08:00
vllm-ascend-ci	3a2a7d88db	[Doc] Update accuracy reports for v0.10.1rc1 (#2755 ) The accuracy results running on NPU Atlas A2 have changed, updating reports for: All models (Qwen3-30B-A3B, Qwen2.5-VL-7B-Instruct, Qwen3-8B-Base, DeepSeek-V2-Lite) - [Workflow run][1] [1]: https://github.com/vllm-project/vllm-ascend/actions/runs/17459225764 - vLLM version: v0.10.1.1 - vLLM main: `2b30afa442` Signed-off-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com> Co-authored-by: vllm-ascend-ci <vllm-ascend-ci@users.noreply.github.com>	2025-09-04 22:17:17 +08:00
sherie	f86596a66c	allgather use fusedop. (#2689 ) ### What this PR does / why we need it? Use `npu_moe_init_routing_v2` and `npu_moe_token_unpermute` to replace `npu_moe_init_routing`, `npu_moe_compute_expert_tokens`, and `npu_moe_finalize_routing` to optimize performance ### Does this PR introduce _any_ user-facing change?

| branch | tps | TTFT | TPOT |
| --- | --- | --- | --- |
| main | 733.98 | 280.05 | 34.30 |
| main+fusedop | 740.33 | 273.34 | 33.99 |

### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com> Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-09-04 11:56:29 +08:00
无脸男	7d47d8f4f6	[Fix] fix resources limit error when apply speculative decoding and aclgraph (#2472 ) ### What this PR does / why we need it? When both speculative decoding and aclgraph are applied, and cudagraph_capture_sizes uses the default value, it will report that the stream resources are insufficient. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `9c99e4871f` Signed-off-by: withHades <244036962@qq.com>	2025-09-04 11:50:43 +08:00
无脸男	0c0789be74	[Feat] allow using aclgraph in ray backend (#2589 ) ### What this PR does / why we need it? Allow using aclgraph in ray backend, for tp + pp + aclgraph in multi machine ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `4ba0c587ba` Signed-off-by: withHades <244036962@qq.com>	2025-09-04 11:45:56 +08:00
Ruri	aff5189c87	[main] Fuse GroupedMatmul, Swiglu and DynamicQuant in `W8A8_DYNAMIC` quantized MoE layers (#2275 ) ### What this PR does / why we need it? Fuse `GroupedMatmul`, `Swiglu` and `DynamicQuant` into one fusion operation `GroupedMatmulSwigluQuant`. 1. extract common functions in `w4a8_dynamic.py` and `w8a8_dynamic.py` 2. if in supported occasion, use fusion operation `npu_grouped_matmul_swiglu_quant` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested on W8A8 quantized Qwen3-235B-A22B model with `bs=16` 1. `tp=8`, `dp=1`, `moe_tp=8`, `moe_ep=1`, TPOP increased 21.54%, Output Token Throughput increased 27.35% <img width="3443" height="211" alt="image" src="https://github.com/user-attachments/assets/a1a9c14d-2310-41be-9a03-36125dabae6e" /> 3. `tp=8`, `dp=1`, `moe_tp=1`, `moe_ep=8`, TPOP increased 17.38%, Output Token Throughput increased 6.86% <img width="3443" height="211" alt="image" src="https://github.com/user-attachments/assets/1ce92e92-720d-40c0-8b4d-c493e5cb10a6" /> - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: Ruri <33858552+zhoux77899@users.noreply.github.com> Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>	2025-09-04 11:37:32 +08:00
22dimensions	37f5a29cd4	[1/N][Refactor][Quantization] remove redundant quantizer class (#2680 ) ### What this PR does / why we need it? The AscendQuantizer/LLMQuantizer classes are used to select the quant method based on the quant config and some other arguments, but it is simpler and cleaner to replace these classes with a map, so this PR removes them. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT and e2e tests - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-09-04 11:35:14 +08:00
Icey	d4370ebc42	[Refactor] Refactor Spec Decode (#2668 ) ### What this PR does / why we need it? Refactor spec decode ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Icey <1790571317@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-04 11:34:47 +08:00
Mengqing Cao	7e16b4a7cd	[ReleaseNote] Add Release Note for v0.10.1rc1 (#2635 ) Add Release Note for v0.10.1rc1 - vLLM version: v0.10.1.1 - vLLM main: `b5ee1e3261` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-04 11:26:47 +08:00
Angazenn	e7409e95ee	[1/N][Draft][Refactor]torchair pangu_moe modeling refactor (#2437 ) ### What this PR does / why we need it? 1. Similar to #2384 , this PR add a torchair-specific modeling for pangu. 2. Fixes a bug introduced by routed_scaling_factor in #2675 . 3. remove eager test case for pangu since there has already been a torchair test case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: zengyanjia <z00883269@china.huawei.com> Signed-off-by: Angazenn <supperccell@163.com> Co-authored-by: zengyanjia <z00883269@china.huawei.com>	2025-09-04 10:39:21 +08:00
whx	a58013440a	[BugFix][MLA] Fix attn_mask bug for ring mla (#2704 ) This PR fixes a bug related to the attention mask used in ring MLA. Ring MLA now supports a compressed mask, so we can directly use a 512 * 512 attention mask. - vLLM version: v0.10.1.1 - vLLM main: `b5ee1e3261` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-04 10:22:46 +08:00
wangxiyuan	e11a1bbfc1	[Doc] Update news (#2736 ) Refresh the news. Add meetup and official release info - vLLM version: v0.10.1.1 - vLLM main: `b5ee1e3261` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-04 10:10:24 +08:00
Mengqing Cao	984bd7c13a	[Bugfix][APC] Fix accuracy issue on prefix caching with AscendScheduler (#2714 ) ### What this PR does / why we need it? Fix accuracy issue on prefix caching with AscendScheduler ### How was this patch tested? CI passed with `test_prefix_cache_with_ascend_scheduler` - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-04 08:22:46 +08:00
baxingpiaochong	df88a2ecc8	[P/D]mooncake_connector adapted to 0.10.1 (#2664 ) ### What this PR does / why we need it? In vllm version 0.10.1, a new KVOutputAggregator was added to the executor, moving aggregation to the executor(https://github.com/vllm-project/vllm/pull/19555). This caused mooncake_connector to break. This change aims to fix this bug and also adds a policy to forcibly release the KV cache when the prefill node times out. This PR is currently linked to a PR in vllm (https://github.com/vllm-project/vllm/pull/23917). The vllm PR aims to modify the finish and send count confirmation in heterogeneous TP situations. The reason for deleting many UTs is that a lot of communication codes have been deleted, so the UT as a whole will appear more concise. - vLLM version: v0.10.1.1 - vLLM main: `fa4311d85f` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2025-09-04 08:22:10 +08:00
zhiyuanzhang	07d44ade19	bugfix: fix initialization error for mooncake in k8s (#2541 ) ### What this PR does / why we need it? The details are clarified in this issue: https://github.com/vllm-project/vllm-ascend/issues/2557 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Easy to test because we just need to echo the variable - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: zzy-ContiLearn <1831242919@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LCAIZJ <leichao139636@163.com>	2025-09-03 22:25:08 +08:00
wangxiyuan	41b028aa5f	[Doc] add v0.9.1 release note (#2646 ) Add release note for 0.9.1 - vLLM version: v0.10.1.1 - vLLM main: `8bd5844989` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-03 18:04:27 +08:00
linfeng-yuan	90a75a90a9	[bugfix] fix torchair runtime error caused by configuration mismatches and file missing (#2532 ) ### What this PR does / why we need it? This PR ports #2312 #2506 #2531 to the main branch. The original implementation of torchair caching forces users to prepare everything, fix all the configuration, and enable `use_cached_npu_graph`, which can cause problems that are confusing for users to understand and tackle. It is better to compile the graph twice instead of reusing the old kvcaches and cached torchair graph, and the extra time is acceptable. Additionally, this PR fixes a recompilation problem in torchair graph mode caused by the `running_in_graph` variable in `AscendMLATorchairImpl`. ### Does this PR introduce _any_ user-facing change? If users want to enable torchair.cache_compile with high compilation speed, it is recommended to enable both `use_cached_kv_cache_bytes` and `use_cached_graph` in `torchair_graph_config`. Without `use_cached_kv_cache_bytes`, the torchair computation graph is compiled twice to avoid runtime errors caused by configuration mismatches (the second compilation is much faster). Additionally, the way the TORCHAIR_CACHE_HOME environment variable is used has changed: a suffix directory is added to enhance safety and prevent accidental file deletion. ### How was this patch tested? CI and e2e vllm serving pass. - vLLM version: v0.10.1.1 - vLLM main: `70549c1245` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-03 17:56:12 +08:00
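The recommendation in #2532 amounts to setting two flags inside `torchair_graph_config`. A minimal sketch follows; the two `use_cached_*` keys come from the PR description, while the `enabled` key, the outer `additional_config` wrapper, and the `--additional-config` flag shown in the printed string are assumptions about how the config reaches the engine and should be checked against the vllm-ascend documentation.

```python
import json

# Keys taken from the PR description: use_cached_graph, use_cached_kv_cache_bytes.
# "enabled" is an assumed switch for turning torchair graph mode on at all.
torchair_graph_config = {
    "enabled": True,
    "use_cached_graph": True,
    "use_cached_kv_cache_bytes": True,
}

# Assumed wrapper: the config is nested under a "torchair_graph_config" key.
additional_config = {"torchair_graph_config": torchair_graph_config}

# Serialize once so the same value can be reused in scripts or on a command line.
cli_value = json.dumps(additional_config)

if __name__ == "__main__":
    print(f"--additional-config='{cli_value}'")
```

Per the PR text, dropping `use_cached_kv_cache_bytes` from this dict would make the graph compile twice, with the second compilation being much faster.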
liziyu	5889fa1b1c	[bugfix] ascend schedule encountered an incorrect req block length in the check_watermark_for_prefill function (#2508 ) ### What this PR does / why we need it? bugfix ascend schedule encountered an incorrect req block length in the check_watermark_for_prefill function ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `426cc8629f` Signed-off-by: liziyu <liziyu16@huawei.com>	2025-09-03 16:54:39 +08:00
whx	59d23c39eb	[DP] External dp server starter (#2685 ) This PR re-implements external-dp starter based on vllm's support for external dp. - vLLM version: v0.10.1.1 - vLLM main: `f38035c123` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-03 16:30:26 +08:00
wangxiyuan	c03321781a	[CI] skip unstable UT (#2716 ) See #2687 we notice that test_platform and test_vocab_parallel_embedding is unstable, let's skip them first. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-03 15:53:50 +08:00
Li Wang	3584306387	[Bugfix] Fix qwen2.5-vl-without-padding (#2623 ) ### What this PR does / why we need it? Correct `AscendQwen2_5_VLForConditionalGeneration_Without_Padding` override methods ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `42dc59dbac` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-03 14:38:55 +08:00
Li Wang	bece793be6	[CI] Disable per-PR triggering for A3 (#2710 ) ### What this PR does / why we need it? Disable per-PR triggering for A3 for now, we trigger the dist test in the label `dist-test` rather than ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `136d853e65` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-03 11:52:34 +08:00
zhanghw0354	eaeb2efb20	[Main][Feat]Set the Profiler parameters through environment variables consistent with vLLM (#2608 ) ### What this PR does / why we need it? Currently, when performing profiling in vLLM-Ascend, if you need to obtain the Python call stack, you have to manually modify the code. The code location is: [worker_v1.py#L337](`6c973361fc/vllm_ascend/worker/worker_v1.py (L337)`) where you set with_stack to true. Now, in vLLM, you can set whether to obtain the Python call stack through an environment variable. The relevant PR is: [#21803](https://github.com/vllm-project/vllm/pull/21803) and the documentation is: [profiling](https://docs.vllm.ai/en/latest/contributing/profiling.html?h=vllm_torch_profiler_with_stack#profile-with-pytorch-profiler) This PR sets the profiler initialization parameters by using the same environment variable as vLLM, eliminating the need for manual code modification. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `0235103cbb` --------- Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com> Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>	2025-09-03 10:58:08 +08:00
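The idea in #2608, reading profiler options from the environment instead of editing `worker_v1.py`, can be sketched as below. The variable name `VLLM_TORCH_PROFILER_WITH_STACK` is inferred from the anchor in the linked vLLM profiling docs; the helper names `env_flag` and `profiler_kwargs_from_env`, the accepted truthy strings, and the default value are hypothetical and may differ from what vLLM or vLLM-Ascend actually do.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean (assumed parsing rules)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")


def profiler_kwargs_from_env() -> dict:
    """Build profiler keyword arguments from the environment (sketch only)."""
    return {
        # Controls whether the Python call stack is recorded in the trace.
        "with_stack": env_flag("VLLM_TORCH_PROFILER_WITH_STACK"),
    }


if __name__ == "__main__":
    os.environ["VLLM_TORCH_PROFILER_WITH_STACK"] = "1"
    print(profiler_kwargs_from_env())
```

The resulting dict would then be passed to the profiler constructor, so toggling the call stack only requires exporting the variable before launching the server.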
Shanshan Shen	93754d8061	[Bugfix] Fix long context seq accuracy problem for `GLM4.5` (#2601 ) ### What this PR does / why we need it? Fix long context seq accuracy problem for `GLM4.5`. When `max_tokens=1000`, there is a cyclic output problem like:

```bash
00 00 00 00 00 00 00 00 00 00 00 00 00 00
```

### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested?

```python
import os

os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=1000, temperature=0.0)
    # Create an LLM.
    llm = LLM(model="/root/.cache/modelscope/hub/models/ZhipuAI/GLM-4___5",
              tensor_parallel_size=8,
              enforce_eager=True,
              trust_remote_code=True,
              max_model_len=1024)
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```

- vLLM version: v0.10.1.1 - vLLM main: `0235103cbb` --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com> Signed-off-by: shen-shanshan <467638484@qq.com>	2025-09-03 09:18:44 +08:00
Angazenn	b84465c525	[Perf]Enable npu_moe_gating_top_k_softmax on quantized scenarios (#2633 ) ### What this PR does / why we need it? This PR enables `npu_moe_gating_top_k_softmax` when running quantized MoE (such as W8A8). This op in fact makes no distinction between quantized and non-quantized scenarios. Introducing this op reduces 3~4ms for TPOT. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `ce30dca5c4` Signed-off-by: Angazenn <supperccell@163.com>	2025-09-03 09:14:17 +08:00
wangxiyuan	24d4dad7b2	[CI] Enable MTP torchair e2e test (#2705 ) enable MTP torchair e2e test - vLLM version: v0.10.1.1 - vLLM main: `ce30dca5c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-03 08:57:43 +08:00
Icey	af62af3cc5	[Image] Upgrade openEuler to 24.03 (#2631 ) ### What this PR does / why we need it? Upgrade openEuler to 24.03 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.1.1 - vLLM main: `4071c76cf3` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-09-02 20:09:09 +08:00
wangxiyuan	0829b4873f	[CI] recover e2e test (#2688 ) 1. recover the skipped test. 2. remove pangu eager mode test, it's tested by torchair mode already. 3. skip pangu test until the bug is fixed. - vLLM version: v0.10.1.1 - vLLM main: `56d04089ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-02 18:49:17 +08:00
wangxiyuan	f023bd52bf	[CI] Make test_platform UT stable (#2696 ) Make test_platform stable - vLLM version: v0.10.1.1 - vLLM main: `56d04089ef` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-02 18:34:04 +08:00
wangxiyuan	c1e607b7b7	[Misc] Clean up useless code in rotary_embedding (#2663 ) Clean up useless code which is only used for torchair in rotary_embedding - vLLM version: v0.10.1.1 - vLLM main: `a344a5aa0a` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-02 17:25:33 +08:00
Wang Yixuan	253b01b9a5	[7/N][refactor]fix torchair rope ops (#2683 ) ### What this PR does / why we need it? Due to the registration mechanism, torchair ops cannot take effect, so the Ascend ops have to be patched to adapt to torchair ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: main - vLLM main: `7ea22e42d5` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-09-02 17:21:56 +08:00
yupeng	9f1e054fe3	[Bugfix][LoRA][Operator] Fix LoRA custom operators accuracy issue (#2672 ) ### What this PR does / why we need it? Fix the LoRA accuracy issue introduced by the custom AscendC operators `bgmv_shrink`, `sgmv_shrink`, `bgmv_expand`, and `sgmv_expand`. The bug details are: - In the kernel function, if you want to call the GlobalTensor.GetSize method, you must first pass the second parameter, bufferSize, when calling GlobalTensor.SetGlobalBuffer. - Otherwise, the GlobalTensor.GetSize method returns a random value. - You can refer to [this doc](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/81RC1alpha002/apiref/ascendcopapi/atlasascendc_api_07_00024.html). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest -sv tests/e2e/singlecard/test_ilama_lora.py pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py - vLLM version: v0.10.1.1 - vLLM main: `a344a5aa0a` --------- Signed-off-by: paulyu12 <paulyu0307@gmail.com> Signed-off-by: paulyu12 <507435917@qq.com> Co-authored-by: paulyu12 <paulyu0307@gmail.com>	2025-09-02 11:46:59 +08:00
xuyexiong	214b32a346	[V1][BUGFIX][0.10.1] FIX mtp on main branch (#2632 ) ### What this PR does / why we need it? Fix MTP torchair bug caused by torchair refactor and moe refactor Depends on PRs: fused moe fix: https://github.com/vllm-project/vllm-ascend/pull/2627 torchair multi DP fix: https://github.com/vllm-project/vllm-ascend/pull/2626 ### Does this PR introduce _any_ user-facing change? when dp is enabled, to run mtp online server, need to disable server log due to the current metrics does not support multi dp `--disable-log-stats` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7c8271cd1e` Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-09-02 11:12:41 +08:00
wangxiyuan	fef18b60bc	Refactor e2e CI (#2276 ) Refactor E2E CI to make it clearer and faster 1. remove some useless e2e tests 2. remove some useless functions 3. Make sure all tests run with VLLMRunner to avoid oom errors 4. Make sure all ops tests end with torch.empty_cache to avoid oom errors 5. run the tests one by one to avoid resource limit errors - vLLM version: v0.10.1.1 - vLLM main: `a344a5aa0a` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-09-02 09:02:22 +08:00
leo-pony	0df059f41a	[CI] Fix CI Break: upstream adds routed_scaling_factor in forward_oot interface (#2675 ) ### What this PR does / why we need it? Fix CI Break: upstream adds routed_scaling_factor in forward_oot interface, vllm-ascend needs to adapt ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? E2E and UT - vLLM version: v0.10.1.1 - vLLM main: `3e330fcb21` Signed-off-by: leo-pony <nengjunma@outlook.com>	2025-09-01 19:02:50 +08:00
panchao-hub	ea53f9076e	support torchair mode (#2641 ) ### What this PR does / why we need it? support torchair mode ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `5438967fbc` Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com> Signed-off-by: p00465316 <panchao13@huawei.com> Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>	2025-09-01 15:49:07 +08:00
LeeWenquan	b72e34013f	Add ut for mla (#2637 ) ### What this PR does / why we need it? Update UT for MLA case - vLLM version: v0.10.1.1 - vLLM main: `14b4326b94` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2025-09-01 14:07:57 +08:00
Wang Yixuan	ad13964c71	[6/N][refactor]delete torchair in rotary ops (#2581 ) ### What this PR does / why we need it? After moved torchair related rope ops into torchair_ops, split the torchair from the origin rope ops to make the code clean. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: main vLLM main: `ab9f2cfd19` - vLLM version: v0.10.1.1 - vLLM main: `81eea3d348` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-09-01 09:10:15 +08:00
Wang Yixuan	c2c97f3079	[5/N][refactor]add torchair rotary ops (#2559 ) ### What this PR does / why we need it? Move torchair related rotary ops into torchair dir to make the code clear. Next step we'll remove all torchair related code outside of torchair rotary ops. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: main vLLM main: `ab9f2cfd19` - vLLM version: v0.10.1.1 - vLLM main: `81eea3d348` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-09-01 09:09:21 +08:00
weichen	3a5fc5ee01	[Refactor][MoE] remove redundant code after refactoring fused_moe (#2612 ) ### What this PR does / why we need it? There are a lot of redundant codes related to moe here, and the structure is not very clear. We did the following things： we have placed the relatively independent code related to apply_mlp into a separate file; removed the environment variables of alltoall_buffer and alltoall_seq. Remove the code related to alltoall_buffer and alltoall_seq, and retain the sole TokenDispatcher inheritance class. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e&ut - vLLM version: v0.10.1.1 - vLLM main: `4071c76cf3` --------- Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com> Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>	2025-08-30 22:28:50 +08:00
panchao-hub	20ae71291d	[torchair]remove aicpu op (#2640 ) ### What this PR does / why we need it? remove aicpu op for torchair mode ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.10.1.1 vLLM main: `05d839c19e` - vLLM version: v0.10.1.1 - vLLM main: `67c14906aa` Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com> Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>	2025-08-30 15:51:12 +08:00
panchao-hub	7215454de6	bugfix for torchair graph (#2639 ) ### What this PR does / why we need it? bugfix for torchair graph ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `67c14906aa` Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com> Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>	2025-08-30 15:49:48 +08:00
weijinqian0	6f1047d5fd	[CI] fix UT error. (#2644 ) `69f46359dd` changed the vl input usage, this PR fix the related UT failure. - vLLM version: v0.10.1.1 - vLLM main: `d660c98c1b` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-08-30 12:04:01 +08:00
yiz-liu	d3c93fba5c	[3/N][Feat][Graph] Support `all-to-all` and quantized models with ACL Graph (#2614 ) ### What this PR does / why we need it? * Unify execution paths: Consolidates the quantized and non-quantized execution paths into a single `fused_experts` function, removing duplicated logic and making the control flow clearer and easier to maintain. * W8A8 dynamic quantization: Adds support for W8A8 dynamic quantization inside the unified MoE kernel. Communication routines are updated to correctly handle dynamic quantization scales for activations. * Weight pre-processing: Pre-transpose the `w13` and `w2` weight matrices (as implemented in PR #2025) so that quantized and non-quantized models follow the same code path for the MoE gating, up-projection, and down-projection operations. * All-to-all communication: Adds an `all-to-all` collective communication pattern. For large token counts on modern hardware, `all-to-all` is more efficient than the previous `all-gather` strategy. However, `all-to-all` cannot really be captured and replayed, because its multiple D2H operations trigger synchronization and thus raise errors during graph capture. We only use `all-to-all` when falling back to `compiled_graph_for_general_shape`. * Dynamic communication selection: The model runner now selects the optimal MoE communication method (`mc2`, `allgather`, or `alltoall`) at runtime based on token count and the Ascend SoC version. * Limitation: `all-gather` is not yet supported for quantized models, which means there is still something left to do on A2. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? No further test cases needed. - vLLM version: v0.10.1.1 - vLLM main: `d660c98c1b` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-30 11:00:35 +08:00
Mengqing Cao	91c35d765a	[Bugfix] Fix mc2 operator error in aclgraph + ep<16 scenario (#2609 ) ### What this PR does / why we need it? 1. quick fix for the mc2 operator error in the aclgraph + ep<16 scenario to recover CI, will be refactored in the future 2. disable aclgraph when testing w8a8 ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.1.1 - vLLM main: `95089607fa` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-29 21:59:16 +08:00
wangxiaoteng666	ee6d141dd4	[MAIN][BUGFIX] BugFix: Resolve the issue of waiting queue accumulation when requests are canceled. (#2426 ) ### What this PR does / why we need it? Resolve the issue of waiting queue accumulation when requests are canceled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.10.1.1 - vLLM main: `006477e60b` --------- Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>	2025-08-29 17:19:23 +08:00


828 Commits