### What this PR does / why we need it?
On the 310P device, when running ACLGraph together with the n-gram
speculative decoding algorithm, both graph capture and graph replay
require `uniform_decode_query_len` and do not depend on
`attention_state`. This leads to a rather interesting and unexpected
issue on 310P: during decode-only, execution does **not** enter the
graph, while in the split-fuse state (that is, the chunked prefill
state), it instead enters graph execution directly.
The issue can be resolved by forcibly setting `uniform_decode_query_len`
to `1`, so that 310P captures only the decode-only graph, and replay is
then controlled through `attention_state`.
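A minimal sketch of the fix described above; `is_310p` and `num_spec_tokens` are illustrative names, not the exact vllm-ascend code:
```python
# Hypothetical sketch: on 310P, capture only the decode-only graph by pinning
# uniform_decode_query_len to 1; graph replay is then gated on attention_state
# instead of on the speculative query length.
def resolve_uniform_decode_query_len(is_310p: bool, num_spec_tokens: int) -> int:
    if is_310p:
        # 310P: force decode-only capture regardless of n-gram speculation.
        return 1
    # Other devices: one target token plus the speculative draft tokens.
    return 1 + num_spec_tokens
```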
### Does this PR introduce _any_ user-facing change?
NO
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
Fix the issue where no exception is thrown when graph capture fails.
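A minimal illustration of surfacing the failure instead of silently continuing; `capture_model` stands in for the actual capture entry point and is an assumption:
```python
# Sketch only: make graph-capture failures fatal instead of letting the
# runner silently fall back to eager execution.
def capture_graphs(runner) -> None:
    try:
        runner.capture_model()  # illustrative capture entry point
    except Exception as err:
        raise RuntimeError("ACL graph capture failed") from err
```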
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
Currently when tp>1, we have extra processes on the tp rank-0 device which
consume extra HBM memory. This is caused by `import torch_npu._inductor`
running before set_device, which introduces an extra device initialization.
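A minimal sketch of the ordering fix, assuming `torch.npu.set_device` as the device-binding call; only the import placement matters here:
```python
import torch
import torch_npu  # registers the NPU backend, but not the inductor plugin

def init_device(local_rank: int) -> None:
    # Bind this worker to its own NPU first ...
    torch.npu.set_device(f"npu:{local_rank}")
    # ... and only then import the inductor plugin, which initializes the
    # current device on import. Importing it before set_device initializes
    # device 0 in every TP worker and wastes HBM on the rank-0 card.
    import torch_npu._inductor  # noqa: F401
```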
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
All ci passed.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
# Feature: FlashLB algorithm
## Purpose
This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel load balancing algorithm: FlashLB.
1. The default algorithm adopts two separate sub-procedures to optimize
expert replication and placement independently:
a. **Expert Replica Allotment Sub-procedure** : Determines the number of
replicas for all experts. At each step, it greedily adds one more
replica to the expert with the highest per-replica load, aiming to
minimize load skew at the expert replica granularity (Min Max Replica,
MMR).
b. **Expert Replica Placement Sub-procedure** : Distributes all replicas
across devices. First, it sorts the generated replicas in descending
order of hotness, then iteratively places the currently hottest replica
onto the device with the lowest cumulative load and available slots.
However, this simplistic combination of two separate procedures lacks
synergy and often leads to sub-optimal load balancing. For example, in
the simple scenario illustrated below: Given 8 logical experts with
hotness values [600, 560, 120, 120, 20, 10, 10, 10], and 2 replicas
allocated per device across 8 devices, the default EPLB algorithm
results in a maximum per-device hotness of 232 (peak-average load ratio
1.28), while our proposed FlashLB algorithm reduces this value to 205
(peak-average load ratio 1.13); a sketch of the baseline procedure is shown
after this list.
<figure><img
src="https://github.com/user-attachments/assets/b9b10fab-651e-4524-9942-adbca8d044a4"
width="90%" /></figure>
2. The default algorithm simply aggregates hotness measurements across
the entire profiling window. While this provides a coarse approximation
of the hotness distribution, it fails to capture the time-phased
variations and temporal correlations in expert hotness (both within and
between experts) across iterations—phenomena that have been observed in
real-world scenarios. Such single-point hotness estimation degrades the
solution quality of the load balancing algorithm.
3. The default algorithm regularly recalculates updated expert placement
results for all layers without discrimination. Considering that
excessive expert updates can impact Service Level Objectives (SLOs),
such full-scale redeployment leads to excessively high adjustment
overhead, which negatively affects end-to-end performance.
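To make the baseline's behavior concrete, here is a minimal, illustrative sketch of the two default sub-procedures from item 1 (greedy MMR allotment followed by sorted greedy placement); the function name and tie-breaking are illustrative, not the actual EPLB implementation. For the hotness values above it reproduces the 232 peak per-device load quoted in the example:
```python
def default_eplb(hotness, num_devices, slots_per_device):
    num_experts = len(hotness)
    num_slots = num_devices * slots_per_device

    # (a) Replica allotment (MMR): repeatedly add one replica to the expert
    #     whose per-replica load is currently the highest.
    replicas = [1] * num_experts
    for _ in range(num_slots - num_experts):
        worst = max(range(num_experts), key=lambda e: hotness[e] / replicas[e])
        replicas[worst] += 1

    # (b) Placement: sort all replicas by hotness (descending) and place each
    #     on the least-loaded device that still has a free slot.
    replica_list = sorted(
        ((hotness[e] / replicas[e], e)
         for e in range(num_experts) for _ in range(replicas[e])),
        reverse=True,
    )
    device_load = [0.0] * num_devices
    free_slots = [slots_per_device] * num_devices
    placement = [[] for _ in range(num_devices)]
    for load, expert in replica_list:
        dev = min((d for d in range(num_devices) if free_slots[d] > 0),
                  key=lambda d: device_load[d])
        placement[dev].append(expert)
        device_load[dev] += load
        free_slots[dev] -= 1
    return placement, device_load


# The scenario from the description: 8 experts, 8 devices, 2 slots per device.
_, loads = default_eplb([600, 560, 120, 120, 20, 10, 10, 10], 8, 2)
print(max(loads))  # 232.0, the baseline peak per-device hotness quoted above
```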
## FlashLB Algorithm Principle
### 1. Joint Optimization of Replica Allotment and Placement
FlashLB achieves joint optimization of replica allotment and placement
through a novel tree search approach, combined with carefully designed
efficient pruning and lightweight look-ahead estimation. We partition
all experts into several subsets, and for each subset, hierarchically
determine the optimal replica count and placement. Leveraging efficient
pruning and lightweight look-ahead estimation, the process consistently
aims to optimize the globally expected inter-device load balance degree
(considering both deployed and unexplored experts) while ensuring
sufficient computational efficiency. Additionally, precompilation
techniques are employed for acceleration, delivering load balancing that
is both high-quality and practically efficient.
### 2. Multi-Episode Enhancement
Instead of performing full-duration averaging like the default
algorithm, FlashLB partitions each profiling interval (e.g., 1024
iterations) into multiple consecutive smaller episodes (e.g., 16
iterations). This preserves hotness fluctuation and correlation
information. It then constructs a multi-objective optimization problem
to co-optimize these episodes simultaneously, enabling adaptability to
interleaved hotness patterns and improving statistical robustness.
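A minimal sketch of the episode partitioning described above (array shapes and the episode length are illustrative):
```python
import numpy as np

def build_episode_hotness(per_step_hotness: np.ndarray, episode_len: int = 16) -> np.ndarray:
    """Aggregate per-step expert hotness into consecutive episodes.

    per_step_hotness: [num_steps, num_experts] counts collected over one
    profiling interval (e.g. 1024 iterations). Returns one hotness vector per
    episode, so placement can be co-optimized across episodes instead of
    collapsing the whole window into a single average.
    """
    num_steps, num_experts = per_step_hotness.shape
    num_episodes = num_steps // episode_len
    trimmed = per_step_hotness[: num_episodes * episode_len]
    return trimmed.reshape(num_episodes, episode_len, num_experts).sum(axis=1)
```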
### 3. Layer-wise Cherry-Picking Redeployment
To reduce the overhead of frequent expert redeployment, FlashLB
introduces a cherry-picking redeployment scheme. During each algorithmic
decision cycle, it tracks the load-balance degree of all layers in real time
and triggers expert placement updates only for those layers whose
peak-average ratio exceeds a predefined threshold. This avoids
unnecessary redeployment for stable layers, significantly reducing
adjustment overhead and thereby improving end-to-end performance gains.
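A minimal sketch of the threshold-based layer selection (the threshold value and names are illustrative):
```python
import numpy as np

def layers_to_redeploy(per_layer_device_load: np.ndarray, threshold: float = 1.1) -> list[int]:
    """Pick only layers whose peak/average device load exceeds the threshold.

    per_layer_device_load: [num_layers, num_devices] observed load per layer.
    Layers below the threshold keep their current expert placement.
    """
    peak = per_layer_device_load.max(axis=1)
    mean = np.maximum(per_layer_device_load.mean(axis=1), 1e-9)
    return [layer for layer, ratio in enumerate(peak / mean) if ratio > threshold]
```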
## Co-author
Co-authored-by: Skywalker-EP <173723846@qq.com>
This PR mainly introduces two key optimizations for load balancing
scheduling:
1. **Add per-step heat collection function**:
Support real-time collection of per-step heat information during model
inference. This enables more fine-grained load balancing decisions by
taking per-step heat as the optimization target, improving scheduling
accuracy for dynamic and fluctuating workloads.
2. **Update FlashLB algorithm**:
Upgrade the FlashLB scheduling logic to better adapt to multi-stage heat
distribution scenarios. The improved algorithm can comprehensively
perceive and utilize multi-stage heat characteristics, achieving more
stable and efficient load balancing under complex expert deployment and
dynamic traffic patterns.
---------
Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
Signed-off-by: xuzewei28 <xuzewei2@h-partners.com>
Co-authored-by: xuzewei28 <xuzewei2@h-partners.com>
…the quantization layer name
### What this PR does / why we need it?
This PR modifies the loading logic for layer name prefixes in quantized
models. The goal is to reduce or eliminate the need for point-to-point
(hardcoded) modifications by leveraging the built-in mapper mechanism
already provided in vLLM's model code. For models that do not yet have a
corresponding mapper, the original point-to-point modification approach
has been retained to ensure backward compatibility.
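A minimal sketch of the intended lookup order, assuming the per-model `hf_to_vllm_mapper` (a vLLM `WeightsMapper`) as the built-in mechanism; the fallback table is hypothetical:
```python
# Sketch only: prefer the mapper a model already defines over hardcoded
# prefix rewrites. HARDCODED_PREFIX_MAP is a hypothetical fallback table.
HARDCODED_PREFIX_MAP = {"language_model.model.": "model."}

def map_quant_layer_prefix(model, name: str) -> str:
    mapper = getattr(model, "hf_to_vllm_mapper", None)
    if mapper is not None:
        # Use the model's own prefix table, so no per-model hardcoding is
        # needed here (simplified; WeightsMapper also handles substrings).
        for old, new in mapper.orig_to_new_prefix.items():
            if name.startswith(old):
                return new + name[len(old):]
        return name
    # Models without a mapper keep the old point-to-point rules.
    for old, new in HARDCODED_PREFIX_MAP.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name
```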
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The changes were validated using an offline deployment script to launch
and verify multiple multimodal models. Testing confirmed that the
updated loading logic correctly handles layer name prefixes across
different model architectures, with no regression in model
initialization or inference behavior.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
### What this PR does / why we need it?
Fixed the error in speculative decoding in FULL mode when `num_spec + 1` is
not in `cudagraph_capture_sizes`.
Now, we can run speculative decoding in FULL mode, but with drafter as
eager.
It depends on https://github.com/vllm-project/vllm-ascend/pull/7144 .
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Test code is shown as below:
```python
prompts = [
"1.Who are you?",
"2. Who are you?",
]
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=200)
llm = LLM(
model="/home/some-model/Meta-Llama-3.1-8B-Instruct",
tensor_parallel_size=1,
max_num_seqs=32,
# enforce_eager=True,
disable_log_stats=False,
distributed_executor_backend="mp",
gpu_memory_utilization=0.7,
async_scheduling=True,
speculative_config={
"enforce_eager": True,
"model": "/home/some-model/EAGLE3-LLaMA3.1-Instruct-8B",
"disable_padded_drafter_batch": False,
"method": "eagle3",
"num_speculative_tokens": 2,
},
compilation_config={
"cudagraph_mode": "FULL",
"cudagraph_num_of_warmups": 1,
},
max_model_len=4096,
enable_prefix_caching=False,
)
outputs = llm.generate(prompts, sampling_params)
```
The result before:
```text
File "/vllm-workspace/vllm/vllm/v1/cudagraph_dispatcher.py", line 140, in _create_padded_batch_descriptor
assert num_tokens_padded % uniform_decode_query_len == 0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
```
The result after:
```text
--------------------------------------------------
total_num_output_tokens: 400
num_drafts: 249
num_draft_tokens: 498
num_accepted_tokens: 149
mean acceptance length: 1.60
--------------------------------------------------
acceptance at token 0: 0.43
acceptance at token 1: 0.17
```
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Related to vLLM PR #34043, the function `relax_for_mixed_batch_cudagraphs` was
deleted. As a result, `num_reqs` no longer equals the actual number of
requests; since the FIA operator requires that `query_start_loc[-1]` equal the
total number of computed tokens, this deletion causes the IFA error.
In full graph mode, set `num_reqs_padded = num_reqs` to fix the error.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
We should only display this log message when layer_sharding is enabled.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
This PR adds support for the Ascend950 chip. This includes:
- Updating build scripts (`CMakeLists.txt` and `setup.py`) to recognize
the Ascend950 chip and set appropriate compilation flags.
- Disabling a set of custom operators that are not yet supported on the
Ascend950 hardware target.
- Performing a codebase-wide refactoring of `pipe_barrier()` calls to
the namespaced `AscendC::PipeBarrier<>()` for improved code consistency
and adherence to the latest API standards.
Ascend950DT e2e passed (Qwen3-32B-MXFP8) and CI passed
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Fix the penalty ops for the new version, achieving a ~10% performance
improvement.
### How was this patch tested?
pytest
tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shiyuan680 <917935075@qq.com>
### What this PR does / why we need it?
Fix the issue #6143.
### Does this PR introduce _any_ user-facing change?
Allow to start the server with "--enable-lora && --fully-sharded-loras
&& --tensor_parallel_size 2".
### How was this patch tested?
pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Currently, we call lookup_client to look up token hits in the KV pool.
However, when the token length is smaller than the block size, the key will be
empty, so there is no point in querying the KV pool backend since there can
never be a hit.
Hence, add an early return in `get_num_new_matched_tokens` when `token_len`
< `block_size`.
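A minimal sketch of the early return, with method and attribute names taken from this description rather than the exact connector code:
```python
# Sketch of the connector-side shortcut: requests shorter than one block can
# never produce a KV-pool hit, so skip the remote lookup entirely.
def get_num_new_matched_tokens(self, request, num_computed_tokens: int):
    token_len = len(request.prompt_token_ids)
    if token_len < self.block_size:
        return 0, False  # no external tokens, nothing to load asynchronously
    return self._lookup_in_kv_pool(request, num_computed_tokens)  # hypothetical helper
```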
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Fix the wrong usage of `model_type`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
Initial version to support minimax-m2.5 on vllm-ascend.
This commit converts the original FP8 weights to quantized BF16 to support
Minimax-m2.5 on NPU.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
### Test Report
Self-tested precision summary; the official AIME2025 precision score is 86.3.
<img width="426" height="84" alt="image"
src="https://github.com/user-attachments/assets/a3ce2452-92fa-4713-962e-862248e0b61a"
/>
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
### What this PR does / why we need it?
Mooncake Layerwise Connector supports hybrid attention manager with
multiple kvcache groups.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
The op `torch_npu.npu_recurrent_gated_delta_rule` currently does not support
`ssm_state` inputs in float32 format, so we temporarily retain the
Triton-based `_forward_core` implementation for Qwen3_5.
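A minimal sketch of the dtype-gated fallback; argument lists are placeholders, not the real operator signature, and `_forward_core_triton` stands in for the retained Triton path:
```python
import torch

def _forward_core_triton(ssm_state: torch.Tensor, *args):
    """Placeholder for the retained Triton _forward_core implementation."""
    raise NotImplementedError

def recurrent_gated_delta_rule(ssm_state: torch.Tensor, *args):
    # The NPU op does not yet accept a float32 ssm_state, so route that case
    # to the Triton path and use the fused op otherwise (sketch only).
    if ssm_state.dtype == torch.float32:
        return _forward_core_triton(ssm_state, *args)
    import torch_npu
    return torch_npu.npu_recurrent_gated_delta_rule(ssm_state, *args)
```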
---------
Signed-off-by: pppeng <zepengliu912@qq.com>
Signed-off-by: pppeng <60355449+ppppeng@users.noreply.github.com>
### What this PR does / why we need it?
This is a bug fix to resolve the issue where the MoE model fails to load
quantized weights in w4a8 format when EP is not enabled. The parameters
`["weight_scale_second", "weight_offset_second", "scale_bias"]` shall be
parsed in per-group mode, regardless of other conditions.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
### What this PR does / why we need it?
Fix acceptance-rate and high-concurrency bugs when eagle3 and CP are enabled.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
tests and ut
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
New quantization method: introduced support for the W8A8SC static linear
quantization scheme, specifically for 310P hardware, enabling more efficient
model compression.
Refactored `save_sharded_state_310.py` to avoid a multi-process issue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
W8A8SC quant E2E test.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pu-zhe <zpuaa@outlook.com>
### What this PR does / why we need it?
Fix the LoRA e2e test accuracy issue that was introduced by the upstream PR
https://github.com/vllm-project/vllm/pull/32005
### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_llama32_lora.py
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: paulyu12 <507435917@qq.com>
Signed-off-by: yupeng <507435917@qq.com>
### What this PR does / why we need it?
This pull request introduces support for the Qwen3.5 MoE model on Ascend
devices. The key changes are:
* **Quantization Configuration for Qwen3.5 MoE**: Adds necessary prefix
mappings and packed module definitions for `qwen3_5_moe` in
`vllm_ascend/quantization/modelslim_config.py` to enable ModelSlim
quantization.
* **Triton Kernel Fix**: Corrects a bug in the `fused_gdn_gating` Triton
kernel. The calculation for `BLK_BATCHES` had an operator precedence issue,
which is now resolved. The calculation has also been made more robust with
added clamping to prevent potential out-of-bounds memory access in the
unified buffer; a generic illustration of this bug class is shown below.
These changes enable the correct and efficient execution of Qwen3.5 MoE
models on Ascend hardware.
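For context, a hypothetical illustration of the precedence-plus-clamping bug class described in the kernel fix above; the real kernel's constants and expression differ:
```python
# Integer division binds tighter than addition/subtraction, so a missing pair
# of parentheses silently changes a ceil-division into something else.
num_batches, block = 37, 8

buggy = num_batches + block - 1 // block    # evaluates 1 // block first -> 45
fixed = (num_batches + block - 1) // block  # intended ceil-division -> 5

# Clamp to a hard upper bound so the kernel never indexes past the
# unified-buffer capacity even for unexpected launch shapes.
MAX_BLK_BATCHES = 16
blk_batches = min(fixed, MAX_BLK_BATCHES)
```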
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI should be used to verify the correctness of these changes. It is
recommended to run tests with the Qwen3.5 MoE model to ensure the new
configurations and the kernel fix work as expected.
Signed-off-by: xmpp777 <yangming2@huawei.com>
### What this PR does / why we need it?
The previous implementation of the Triton rope_siso kernel was missing the
storage of the second half of the RoPE results, which resulted in:
1. an accuracy problem in the neox-style scenario
2. a UB overflow in the non-neox-style scenario
This PR fixes it and supplements a nightly test case for it.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Support chunked prefill for Qwen3Next with PCP&DCP
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
If expert_map is on the device, there may be occasional repeated answers
in long-output scenarios.
Tested with dsv3.2-exp-w8a8: no garbled characters are displayed in the
output.
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2025 | ef2f4f | accuracy | gen | 60.00 |
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
## Summary
- Fix incorrect layer count calculation for MTP (Multi-Token Prediction)
models in `update_aclgraph_sizes()` function
- For MTP models, the draft model's layer count is stored in
`num_nextn_predict_layers` or `mtp_num_hidden_layers` (for Qwen3.5), not
in the standard `num_hidden_layers` field
- Directly accessing `draft.hf_config.num_hidden_layers` returns the
main model's layer count instead of the MTP draft model's layer count
## Bug Description
In `vllm_ascend/utils.py`, the `update_aclgraph_sizes()` function
calculates `resources_per_graph` for speculative decoding scenarios.
When calculating the resources needed for the draft model, the original
code directly accessed:
```python
resources_per_graph += draft.hf_config.num_hidden_layers + 1
```
This works correctly for standard draft models, but **fails for MTP
models** (like DeepSeek-V3's MTP or Qwen3.5's MTP) because:
1. MTP models store their layer count in model-specific fields:
- `num_nextn_predict_layers` (DeepSeek-V3 MTP)
- `mtp_num_hidden_layers` (Qwen3.5 MTP)
2. The `num_hidden_layers` field in these models contains the **main
model's** layer count, not the MTP layer count
3. This leads to **grossly overestimating** the `resources_per_graph`,
which in turn causes the calculated `max_batch_sizes` to be
unnecessarily small
## Fix
Use `draft.get_total_num_hidden_layers()` instead of directly accessing
`draft.hf_config.num_hidden_layers`. This method correctly handles
different model types through the `model_arch_config_convertor`
infrastructure, returning the appropriate layer count for:
- Standard draft models → `num_hidden_layers`
- DeepSeek-V3 MTP → `num_nextn_predict_layers`
- Qwen3.5 MTP → `mtp_num_hidden_layers`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What this PR does / why we need it?
We need to support quant config in glm46v.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We used the 'Ascend/msit' quantization method to test the w8a8 weights,
which ran successfully on NPU using vllm-ascend.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>
Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>
### What this PR does / why we need it?
This PR fixes a bug in the `_merge_multimodal_embeddings` function where
the parameter order was incorrect. The `multimodal_embeddings` and
`is_multimodal` parameters were swapped, which would lead to runtime
errors when the function is called with positional arguments.
This change corrects the function signature to align with its expected
usage, ensuring that multimodal embeddings are correctly merged.
### Does this PR introduce _any_ user-facing change?
No. This is a bug fix for an internal utility function and has no
user-facing impact.
### How was this patch tested?
The correctness of this fix is validated by existing tests for
multimodal functionality. With the incorrect function signature, these
tests would fail due to argument type mismatches. CI passing confirms
the fix is effective.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
## Summary
- Move `_update_states_after_model_execute` call from after main model
sampling to after draft model execution
- This reordering reduces pipeline bubbles between main model and draft
model execution
- No accuracy impact - the state update operation is independent of
draft token proposal
## Performance Impact
Reduces idle time between main model and draft model execution stages,
improving overall MTP (Multi-Token Prediction) performance.
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
### What this PR does / why we need it?
Supports a contiguous-tensor hybrid-attention KV cache for full-attention/Mamba
hybrid models such as Qwen3Next and Qwen3.5.
Due to the restrictions of Ascend operators, all KV tensors, conv tensors, and
SSM tensors must be contiguous. Therefore, this PR uses the following layout to
generate the KV cache:
- tensor1: `[(kv_padding), conv, ...]`
- tensor2: `[k, ssm, ...]`
- tensor3: `[v, (mamba_padding), ...]`
Under this scheme, although some space may be wasted, the tensors of all
caches are guaranteed to be contiguous.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
The index-select operation `mrope_positions.gpu[:,
:total_num_scheduled_tokens].copy_(...)` triggers a CPU-NPU
synchronization, which blocks subsequent operator dispatch and causes
bubbles visible in Profiling.
This PR changes to full tensor copy
(`mrope_positions.gpu.copy_(mrope_positions.cpu)`) to eliminate the sync
point. The trade-off is a negligible increase in memory usage since
`mrope_positions.cpu` is a small tensor.
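A minimal before/after sketch of the copy change; the buffer shapes are illustrative stand-ins for the runner's persistent mrope buffers:
```python
import torch

cpu_buf = torch.zeros(3, 4096, dtype=torch.int64)  # mrope_positions.cpu stand-in
gpu_buf = torch.zeros(3, 4096, dtype=torch.int64)  # lives on the NPU in practice
total_num_scheduled_tokens = 128

# Before: the sliced (index-select style) copy triggered a CPU-NPU sync,
# stalling subsequent operator dispatch.
gpu_buf[:, :total_num_scheduled_tokens].copy_(
    cpu_buf[:, :total_num_scheduled_tokens], non_blocking=True)

# After: copy the whole (small) buffer; downstream code still reads only the
# first total_num_scheduled_tokens columns, so correctness is unchanged.
gpu_buf.copy_(cpu_buf, non_blocking=True)
```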
**Result:** ~2-3% TPOT improvement with the profiling bubbles
eliminated.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Verified via Profiling that the CPU sync bubble is eliminated and TPOT
is reduced by 2-3%.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>
### What this PR does / why we need it?
Change recurrent_gated_delta_rule ops from triton to ascend C version
for better performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This pull request addresses a bug related to the fused mc2 functionality
within the EPLB (Expert Parallelism Load Balancing) system, specifically
impacting quantization and MoE communication.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
Signed-off-by: Spicy-Stick <873805887@qq.com>
Signed-off-by: root <root@localhost.localdomain>
### What this PR does / why we need it?
This PR fixes incorrect slot mapping in qwen35 caused by a mismatched block
size. In qwen35, we should use `kernel_block_size` so that the slot mapping is
computed correctly; it is obtained in `load_model`, where we have a chance to
grab `draft_attn_layers`.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
The community has added a cleaning mechanism for the metadata after the
main model finishes running. The MTP layer should not clean the
metadata, and a new condition has been added to avoid cleaning it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
This PR creates and registers `ascend_multi_connector`, which allows the
`mooncake_layerwise_connector` to use the kv_pooling feature.
We unregister vLLM's original `MultiConnector` and replace it with
`AscendMultiConnector` when registering the connectors.
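A minimal sketch of the registration swap, assuming vLLM's `KVConnectorFactory.register_connector` API; the module and class paths are illustrative:
```python
from vllm.distributed.kv_transfer.kv_connector.factory import KVConnectorFactory

# Re-point the "MultiConnector" name at the Ascend implementation, so configs
# that request MultiConnector transparently get AscendMultiConnector (the
# removal of the stock registration is elided here).
KVConnectorFactory.register_connector(
    "MultiConnector",
    "vllm_ascend.distributed.multi_connector",  # illustrative module path
    "AscendMultiConnector",                     # illustrative class name
)
```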
### Does this PR introduce _any_ user-facing change?
No. User can use `MultiConnector` to initialize `AscendMultiConnector`.
### How was this patch tested?
By CI.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
If an `eagle3` model without embed_tokens is used with a `quarot` target
model, the acceptance rate drops.
This PR solves that issue.
The relative vllm pr is https://github.com/vllm-project/vllm/pull/36225.
- vLLM main:
4034c3d32e
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
When eagle and cp are enabled at the same time, there is an error in
pcp_allgather due to hidden_states. This PR fixes this issue.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
[bugfix] Qwen-Omni quantization bugfix: fix the Qwen-Omni quantization weight
mapping to float weights.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
### What this PR does / why we need it?
Support FlashComm1 for Qwen3-Next. Fix some padding problems in Sequence
Parallel (SP) and resolve precision problems in shared_out when both
FlashComm1 and SP are enabled.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
Co-authored-by: zhaojiangjiang <zhaojiangjiang1@h-partners.com>
### What this PR does / why we need it?
**NOTE: This PR is a re-pull of #7016, since CI mistakenly marked the
unfinished PR as having passed.**
This PR aims to delete mtp_proposer. After fixing a bug in both dsv32 and
glm5, it is now safe to remove mtp_proposer. The bug was unnecessary slicing
of `slot_mapping`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This PR adds a split_qkv_rmsnorm_mrope kernel with interleaved mRoPE support
for qwen3.5 and qwen3-vl to improve performance.
### Does this PR introduce _any_ user-facing change?
Does not.
### How to use?
```python
real_q, real_k, real_v, real_gate = torch.ops.vllm.triton_split_qkv_rmsnorm_mrope(
qkv=qkv,
q_weight=q_weight,
k_weight=k_weight,
cos_sin=cos_sin,
num_q_heads=num_q_heads,
num_kv_heads=num_kv_heads,
head_size=head_size,
eps=eps,
mrope_section=mrope_section,
is_interleaved=is_interleaved,
rope_dim=rope_dim,
has_gate=has_gate,
)
```
### How was this patch tested?
- vLLM version: v0.16.0
- Accuracy test script:
```shell
pytest tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_split_qkv_rmsnorm_mrope.py
```
---------
Signed-off-by: Fager <865071616@qq.com>
Signed-off-by: Fager10086 <77871921+Fager10086@users.noreply.github.com>
Signed-off-by: fager <865071616@qq.com>
### What this PR does / why we need it?
Adapt the graph mode (piecewise and full_decode_only) of PCP and DCP for
DeepSeek v3.2.
### How was this patch tested?
Test output:
{"object":"text_completion","model":"deepeek_v3","choices":[{"index":0,"text":"
the head of state and head of government of the United States,
indirectly elected to a four-year term by the American people through
the Electoral College. The officeholder leads the executive branch of
the federal government and is the commander-in-chief of the United
States","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":1,"text":"
Paris. This is the largest city in France and its main political,
cultural and commercial center. The modern location of the city is the
north of the central part of the country, on the banks of the Seine
River Seine River Seine in
3\n\n","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":2,"text":"
now\n\n# AI future is now\n\nThe world is changing at a rapid pace, and
artificial intelligence (AI) is at the forefront of this transformation.
From self-driving cars to virtual assistants, AI is already making a
significant impact on our daily
lives","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null},{"index":3,"text":"
a 3rd year student at the University of Lincoln studying Media
Production. This blog is about my work throughout my final year on the
course.\n\n## Tuesday 3 May 2016\n### Final Major Project -
Evaluation\n\nFor my final project
I","logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"prompt_logprobs":null,"prompt_token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":27,"total_tokens":227,"completion_tokens":200,"prompt_tokens_details":null},"kv_transfer_params":null}
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: xiaocongtou6 <2066962956@qq.com>
Signed-off-by: xiaocongtou6 <105542647+xiaocongtou6@users.noreply.github.com>
### What this PR does / why we need it?
Currently, we are using
e2b31243c0/vllm/model_executor/layers/conv.py (L219-L232)
for convolution computation, which is used in patch embedding for VL
models.
After profiling, we find that this linear method takes about **6.87
ms**, which is much slower than just using `F.conv3d()`. `F.conv3d()` calls
the optimized aclnn `BatchMatMulV2` on Ascend NPU, which takes only about
**2.50 ms** and is **2.7x faster** than the linear method.
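For reference, a minimal sketch of the faster path; the tensor shapes and patch size are illustrative, not the exact model configuration:
```python
import torch
import torch.nn.functional as F

# Patch embedding for a VL model expressed directly as a 3D convolution.
pixels = torch.randn(1, 3, 2, 224, 224)   # (batch, channels, T, H, W)
weight = torch.randn(1152, 3, 2, 14, 14)  # (hidden, in_ch, t, ph, pw)

# F.conv3d dispatches to the optimized aclnn BatchMatMulV2 path on Ascend NPU,
# which profiled much faster here than the unfold-plus-linear formulation.
patch_embeds = F.conv3d(pixels, weight, stride=(2, 14, 14))
print(patch_embeds.shape)  # torch.Size([1, 1152, 1, 16, 16])
```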
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: shen-shanshan <467638484@qq.com>