xc-llm-ascend

Author	SHA1	Message	Date
Nengjun Ma	ab676413e6	Default enable MLAPO (#5952 ) ### What this PR does / why we need it? 1) Default enable MLAPO for deepseek MLA Attention W8A8 models on PD disagregation D Instance, for example: DeepSeekV3-W8A8, DeepSeek-R1-W8A8. 2) Default enable MLAPO for DeepSeek SFA Attention W8A8 models, currently is DeepSeek-V3.2-W8A8. ### Does this PR introduce _any_ user-facing change? Don't need use manully to VLLM_ASCEND_ENABLE_MLAPO=1, to enable MLAPO feature for deepseek w8a8 model The effect of enabling MLAPO SFA model deployed on a single A3 Node: Test with:tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py dataset: gsm8k-lite，without set MTP, FULL GRAPH, has 19% promote：未默认开启 MLAPO 时： ├─────────────────────────┤ │ TTFT │ 14055.8836 ms │ ├─────────────────────────┤ │ ITL │ 66.8171 ms. │ ├─────────────────────────┤ │ Output Token Throughput │ 104.9105 token/s │ ├─────────────────────────┤ 默认开启 MLAPO 时： ├─────────────────────────┤ │ TTFT │ 3753.1547 ms │ ├─────────────────────────┤ │ ITL. │ 61.4236 ms. │ ├─────────────────────────┤ │ Output Token Throughput │ 125.2075 token/s│ ├─────────────────────────┤ - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-22 09:26:39 +08:00
MengLong Chen	a15a5f6aa5	[Doc] Supplement PD separation parameters of DeepSeek V3.1 (#6053 ) ### What this PR does / why we need it? Supplement PD separation parameters of DeepSeek V3.1 The recommended parameter configuration for DeepSeek V3.1 in the EP32 scenario after PD separation has been adjusted, and the core parameters have been described in detail. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2026-01-22 08:53:44 +08:00
ZCG12345	8900e3398b	[Ascend] perf: optimize rope embedding with triton kernel for huge performance gain (#5918 ) ### What this PR does / why we need it? 1. Implement a high-performance Triton custom kernel for the rotary position embedding (RoPE) operator on Ascend NPU platform 2. Fix critical bugs in the Triton RoPE kernel registration and invocation process: including incorrect fake impl function name matching, wrong torch ops namespace for kernel call, missing self parameter in cos/sin slice fetching, and syntax errors in function type annotations. 3. Achieve extreme performance optimization for the core RoPE operator: the single inference latency is reduced from 57.1 μs to 9 μs, with 6.34x performance improvement and 84.24% latency reduction. 4. The RoPE operator is a hot path that is executed in every transformer layer during LLM inference, the optimization will directly reduce the overall inference latency and improve the throughput of LLM serving on Ascend NPU. 5. Keep full backward compatibility: the Triton kernel is enabled only when `HAS_TRITON=True`, and automatically fall back to the original Ascend NPU native implementation if Triton is not available, no functional regression. ### Does this PR introduce _any_ user-facing change? NO - No changes to any public APIs, interfaces or inference behaviors of vLLM. - No impact on the text generation quality and correctness of the large model. - The optimization is transparent to end users, only the inference speed (latency/throughput) is improved without any functional change. ### How was this patch tested? 1. Environment Validation: Tested on Ascend NPU platform with vLLM-Ascend framework, Triton library installed and enabled (`HAS_TRITON=True`). 2. Kernel Registration Test: Verified the Triton RoPE kernel (`rope_forward_triton`) is successfully registered to `torch.ops._C_ascend` namespace without any `ValueError/NameError/SyntaxError`. 3. Functional Correctness Test: Run large model (GLM4/MoE) inference on the Ascend NPU platform, the generated text content is completely correct (no garbled text, no logical errors), consistent with the original implementation. 4. Performance Benchmark Test: Measure the single execution latency of the RoPE operator before/after optimization, confirm the latency is stably reduced from 57.1 μs to 9 μs, the performance gain is valid and stable. 5. Fallback Mechanism Test: Manually disable Triton (`HAS_TRITON=False`), verify the code correctly falls back to the original Ascend NPU native RoPE implementation, no service crash and normal inference. 6. Compatibility Test: Test with different tensor shapes/sizes of query/key, all cases work correctly with the Triton kernel, no shape mismatch error. - operator supply by Hexiang Wang - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: ZCG12345 <2097562023@qq.com>	2026-01-21 22:01:22 +08:00
LeeWenquan	2a618d2454	[Ops] update causal_conv1d_update (#5984 ) ### What this PR does / why we need it? Update causal_conv1d_update ops for better perf. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-01-21 16:33:52 +08:00
meihanc	53bfb38192	[CI]Update triton ascend version in 3.2.0 (#6067 ) ### What this PR does / why we need it? update triton ascend version in 3.2.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-21 16:02:23 +08:00
Qiu	58ff465821	[bugfix] fix the complex and potentially problematic generate_kv_idx. (#5957 ) ### What this PR does / why we need it? In long-sequence scenarios, the chunked-prefill component may encounter dimension misalignment issues, which previously occurred during precision testing on the code_generate_lite dataset. This PR removes redundant computations and instead derives the value using existing results and straightforward calculations. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-21 14:21:02 +08:00
LICO67373	12a668b1d9	[Refactor] AttentionBuilder inherit from base class in vllm (#5916 ) ### What this PR does / why we need it? This PR makes `AscendMLAMetadataBuilder` and `AscendSFAMetadataBuilder` properly inherit from the base class `MLACommonMetadataBuilder` in vllm by adding `super().__init__()` calls. Changes: - Add `super().__init__()` call in `AscendMLAMetadataBuilder.__init__()` - Add `super().__init__()` call in `AscendSFAMetadataBuilder.__init__()` - Extract `ascend_chunked_prefill_workspace_size()` to `vllm_ascend/attention/utils.py` to avoid code duplication - Override `determine_chunked_prefill_workspace_size()` to support Ascend-specific 128k tokens workspace size (vs 64k in parent class) - Update unit tests to mock parent class `__init__` for proper isolation Why we need it: - Follow proper Python inheritance patterns by calling `super().__init__()` - Reduce code duplication by reusing parent class initialization logic - Better maintainability as parent class changes will be automatically inherited Part of issue #5463 item 10 ### Does this PR introduce _any_ user-facing change? No, this is an internal refactoring that does not change any user-facing behavior. Signed-off-by: lico67373 <918688502@qq.com>	2026-01-21 10:45:45 +08:00
Li Wang	839e03cbc9	[Nightly] Use Qwen repo for qwen3-next (#6064 ) ### What this PR does / why we need it? Use Qwen repo for qwen3-next to make nightly test happy. see https://github.com/vllm-project/vllm-ascend/actions/runs/21179025996/job/60915871441 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-21 10:39:12 +08:00
guanguan0308	1ed9524763	add dispath_ffn_combine_bf16 (#5866 ) ### What this PR does / why we need it? add dispath_ffn_combine_bf16 - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: guanguan0308 <1546542263@qq.com>	2026-01-21 09:30:30 +08:00
wangqiankun13	bec8641876	[BugFix] Fix input parameter bug of dispatch_gmm_combine_decode[RFC: issue 5476] (#5932 ) ### What this PR does / why we need it? In [PR 5040](https://github.com/vllm-project/vllm-ascend/pull/5040), the `dispatch_gmm_combine_decode` operator was configured with an incorrect global_bs parameter. This PR is to fix the bug. The global_bs provided as input should have the same meaning as in the `moe_distributed_dispatch` operator, specifically: (the maximum batch size across all cards) * (expert parallel world size). However, the implementation incorrectly used the variable max_num_tokens, which does not account for tensor parallelism. This error likely resulted in an unnecessarily large (overestimated) value. More info about this operator, please refer to RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 80.00 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-21 09:26:40 +08:00
Magnus	5b129cf0a1	[1/N][Feat] Xlite Qwen3 MoE Support (#5951 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE model in Xlite. For more details about Xlite, please refer to the following link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. Qwen3-MoE TODO List: - [ ] Qwen3-235B-A22B support - [ ] Qwen3-MoE weights NZ support - [ ] Qwen3-MoE data parallel support ## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`69b170b8b5`) - xlite-full: main + xlite-full - xlite-decode-only: main + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph \| maxconcurrency \| item \| TTFT(ms) \| \| TPOT(ms) \| \| QPS (req/s) \| OutputSpeed (token/s) \| \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| \| \| \| Avg \| P99 \| Avg \| P99 \| \| \| \| 1 \| baseline-aclgraph \| 205.07 \| 287.29 \| 12.34 \| 12.65 \| 0.14 \| 78.81 \| \| 1 \| xlite-full \| 66.40 \| 113.69 \| 11.71 \| 12.40 \| 0.15 \| 84.73 \| \| 1 \| xlite-decode-only \| 221.15 \| 316.40 \| 12.16 \| 12.91 \| 0.14 \| 79.70 \| \| 1 \| diff1 \| -67.62% \| -60.43% \| -5.11% \| -1.98% \| 7.14% \| 7.51% \| \| 1 \| diff2 \| 7.84% \| 10.13% \| -1.46% \| 2.06% \| 0.00% \| 1.13% \| \| \| \| \| \| \| \| \| \| \| 16 \| baseline-aclgraph \| 1892.16 \| 13916.86 \| 22.78 \| 39.28 \| 1.15 \| 589.89 \| \| 16 \| xlite-full \| 1355.40 \| 8907.45 \| 15.96 \| 25.15 \| 1.65 \| 850.21 \| \| 16 \| xlite-decode-only \| 1519.42 \| 8711.64 \| 19.23 \| 29.73 \| 1.38 \| 711.60 \| \| 16 \| diff1 \| -28.37% \| -36.00% \| -29.94% \| -35.97% \| 43.48% \| 44.13% \| \| 16 \| diff2 \| -19.70% \| -37.40% \| -15.58% \| -24.31% \| 20.00% \| 20.63% \| \| \| \| \| \| \| \| \| \| \| 32 \| baseline-aclgraph \| 673.80 \| 3914.90 \| 32.20 \| 37.95 \| 1.80 \| 928.54 \| \| 32 \| xlite-full \| 481.65 \| 2710.50 \| 19.95 \| 25.35 \| 2.91 \| 1506.67 \| \| 32 \| xlite-decode-only \| 372.22 \| 1095.25 \| 25.19 \| 28.47 \| 2.33 \| 1202.82 \| \| 32 \| diff1 \| -28.52% \| -30.76% \| -38.04% \| -33.20% \| 61.67% \| 62.26% \| \| 32 \| diff2 \| -44.76% \| -72.02% \| -21.77% \| -24.98% \| 29.44% \| 29.54% \| \| \| \| \| \| \| \| \| \| \| 48 \| baseline-aclgraph \| 583.18 \| 3277.65 \| 41.02 \| 46.05 \| 2.17 \| 1115.08 \| \| 48 \| xlite-full \| 973.42 \| 8237.33 \| 23.29 \| 30.50 \| 3.71 \| 1908.09 \| \| 48 \| xlite-decode-only \| 480.79 \| 2026.98 \| 31.48 \| 35.41 \| 2.83 \| 1453.75 \| \| 48 \| diff1 \| 66.92% \| 151.32% \| -43.22% \| -33.77% \| 70.97% \| 71.12% \| \| 48 \| diff2 \| -17.56% \| -38.16% \| -23.26% \| -23.11% \| 30.41% \| 30.37% \| \| \| \| \| \| \| \| \| \| \| 64 \| baseline-aclgraph \| 742.74 \| 5953.39 \| 47.79 \| 53.15 \| 2.48 \| 1272.37 \| \| 64 \| xlite-full \| 545.22 \| 3941.34 \| 25.09 \| 30.41 \| 4.64 \| 2376.44 \| \| 64 \| xlite-decode-only \| 752.40 \| 4534.29 \| 38.67 \| 43.28 \| 3.06 \| 1567.94 \| \| 64 \| diff1 \| -26.59% \| -33.80% \| -47.50% \| -42.78% \| 87.10% \| 86.77% \| \| 64 \| diff2 \| 1.30% \| -23.84% \| -19.08% \| -18.57% \| 23.39% \| 23.23% \| \| \| \| \| \| \| \| \| \| \| 100 \| baseline-aclgraph \| 565.52 \| 1716.81 \| 60.89 \| 68.69 \| 3.08 \| 1580.64 \| \| 100 \| xlite-full \| 398.14 \| 2328.88 \| 30.70 \| 32.45 \| 6.01 \| 3086.42 \| \| 100 \| xlite-decode-only \| 712.53 \| 4875.94 \| 52.71 \| 60.78 \| 3.53 \| 1813.58 \| \| 100 \| diff1 \| -29.60% \| 35.65% \| -49.58% \| -52.76% \| 95.13% \| 95.26% \| \| 100 \| diff2 \| 26.00% \| 184.01% \| -13.43% \| -11.52% \| 14.61% \| 14.74% \| \| \| \| \| \| \| \| \| \| \| 150 \| baseline-aclgraph \| 842.42 \| 5175.01 \| 73.60 \| 88.18 \| 3.80 \| 1952.26 \| \| 150 \| xlite-full \| 568.52 \| 4204.33 \| 37.90 \| 40.01 \| 7.27 \| 3734.72 \| \| 150 \| xlite-decode-only \| 654.43 \| 2504.06 \| 67.40 \| 77.00 \| 4.18 \| 2145.11 \| \| 150 \| diff1 \| -32.51% \| -18.76% \| -48.51% \| -54.63% \| 91.32% \| 91.30% \| \| 150 \| diff2 \| -22.32% \| -51.61% \| -8.42% \| -12.68% \| 10.00% \| 9.88% \| \| \| \| \| \| \| \| \| \| \| 200 \| baseline-aclgraph \| 750.63 \| 3049.91 \| 88.26 \| 101.95 \| 4.28 \| 2189.72 \| \| 200 \| xlite-full \| 558.48 \| 3791.98 \| 45.54 \| 49.04 \| 8.17 \| 4175.52 \| \| 200 \| xlite-decode-only \| 807.09 \| 4254.95 \| 85.18 \| 101.79 \| 4.44 \| 2271.52 \| \| 200 \| diff1 \| -25.60% \| 24.33% \| -48.40% \| -51.90% \| 90.89% \| 90.69% \| \| 200 \| diff2 \| 7.52% \| 39.51% \| -3.49% \| -0.16% \| 3.74% \| 3.74% \| \| \| \| \| \| \| \| \| \| ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: changdawei1 <changdawei3@huawei.com> Co-authored-by: LVYANGGUO <275926687@qq.com> Co-authored-by: lulina <lina.lulina@huawei.com>	2026-01-21 09:26:03 +08:00
Zetong Li	1ab6cd4935	[Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (#5945 ) ### What this PR does / why we need it? This PR aims to fix setting of `speculative_config.enforce_eager` in deepseek v3.2 mtp. The point is that, vllm sets `speculative_config.enforce_eager` as True if using deepseek_v32 with mtp. Since we support graph mode, we simply ignore it here. However, this fix will also implicitly ignore user setting of `speculative_config.enforce_eager`, we need to take care and remove it once vllm supports this feature. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-21 09:24:33 +08:00
kx	936d81a258	[bugfix][mm] change get_num_encoder_tokens to get_num_encoder_embeds in recompute_schedule.py (#5132 ) ### What this PR does / why we need it? adapt to: https://github.com/vllm-project/vllm/pull/30475. just change get_num_encoder_tokens() to get_num_encoder_embeds() in recompute_schedule.py, which seems that it is currently not in use. The get_num_encoder_tokens() function in VLLM no longer exists. - vLLM version: v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: 01267596 <xiongkai123@cmbchina.com> Co-authored-by: 01267596 <xiongkai123@cmbchina.com>	2026-01-21 09:13:52 +08:00
weiguihua2	b399117e89	[Bugfix] fix pcp qwen full graph FIA bug (#6037 ) ### What this PR does / why we need it? In the pcp full graph Qwen model scenario, the inconsistency between the Q shape and actual q len of the FIA operator is fixed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-21 08:49:05 +08:00
DreamerLeader	b6d55fc48e	[Bugfix]Fixed precision issues caused by pooled request pooling (#6049 ) ### What this PR does / why we need it? Fixed precision issues caused by pooled request pooling ### Does this PR introduce _any_ user-facing change? pr6045 ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local> Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>	2026-01-20 23:51:31 +08:00
fems14	8b98d7a4e8	【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (#6045 ) ### What this PR does / why we need it? Resolved a double-free memory vulnerability in the pooling layer under re-computation scenarios. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: fems14 <1804143737@qq.com>	2026-01-20 22:56:04 +08:00
drslark	b2475099a0	[main][Bugfix] Fixed an problem related to embeddings sharing (#5967 ) ### What this PR does / why we need it? Cancel the embeddings sharing when the embeddings of main model and the embeddings of eagle model are different. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Cause i don't have `Meta-Llama-3.1-8B-Instruc`t locally, i commented it and run: ```shell pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance ``` The output is fine: ```text . ======================================================================================================================== warnings summary ========================================================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ====================================================================================================== 3 passed, 1 skipped, 2 warnings in 196.19s (0:03:16) ======================================================================================================= ``` - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-20 21:34:28 +08:00
ChenCangtao	6c30f8bf87	[Feature]refactor the npugraph_ex config, support online-infer with static kernel (#5775 ) ### What this PR does / why we need it? This is a part of https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762 1. refactor the npugraph_ex config，modified the default configuration of the static kernel, new default value of static kernel is false 2. support online-infer with static kernel 3. fixed the issue where manually modifying FX graphs caused an abnormal model return type, and removed the related redundant code. ### Does this PR introduce _any_ user-facing change? yes，the new config of npugraph_ex is as follow: ``` additional_config={ "npugraph_ex_config": { "enable": True, "enable_static_kernel": False } } ``` ### How was this patch tested? ``` vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \ --host 0.0.0.0 \ --port 8004 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --quantization ascend \ --seed 1024 \ --served-model-name deepseek_v3 \ --enable-expert-parallel \ --max-num-seqs 48 \ --max-model-len 40000 \ --async-scheduling \ --max-num-batched-tokens 9000 \ --trust-remote-code \ --no-enable-prefix-caching \ --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \ --gpu-memory-utilization 0.9 \ --compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --additional-config \ '{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}' ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-20 21:31:38 +08:00
Li Wang	0c0514579f	[CI][Lint] Show lint diff on failure (#5956 ) ### What this PR does / why we need it? Currently, some of lint checks default automatic code correction but only shows which files were modified (without specifying the changes); in a CI environment, we can make a small optimization to show which lines were modified to give the developers some specifying hint. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-20 21:07:01 +08:00
Li Wang	8cf1e8d8a7	[CI] Add wait logic for each individual case (#6036 ) ### What this PR does / why we need it? Wait until the NPU memory is clean ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2026-01-20 21:05:44 +08:00
zhangxinyuehfad	750c06c78a	[CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#4633 ) ### What this PR does / why we need it? Add DeepSeek-V3.2-W8A8 nightly ci test： DeepSeek-V3.2-W8A8 1node DP2+TP8 :tests/e2e/nightly/models/test_deepseek_v3_2_w8a8.py ### Does this PR introduce _any_ user-facing change - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-20 21:05:15 +08:00
shiyuan680	cea48c2a34	model runner v2 support triton of penalty (#5854 ) ### What this PR does / why we need it? Optimized operator performance and add ut test ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? test in qwen2.5 7b vl, ops time approved 90% - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` this pr is for # https://github.com/vllm-project/vllm-ascend/issues/5208 Signed-off-by: shiyuan680 <917935075@qq.com>	2026-01-20 12:26:05 +00:00
Canlin Guo	afabb49f00	[Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker (#6034 ) ### What this PR does / why we need it? Add docs for Qwen3-VL-Embedding & Qwen3-VL-Reranker. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-20 17:36:31 +08:00
Icey	402872050a	[Tests] move qwen3 performance test from nightly to e2e (#5980 ) ### What this PR does / why we need it? Move the qwen3 performance test from nightly to e2e to intercept performance degradation. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-20 17:08:43 +08:00
weiguihua2	5892455f43	[Bugfix] fix bug of pcp+mtp+async scheduler (#5994 ) ### What this PR does / why we need it? Fixed the issue where the PCP and MTP services could not be started due to asynchronous scheduling. After the pcp, mtp, and asynchronous scheduling functions are enabled, the service is suspended because of a shape mismatch after a curl request is sent. This PR resolves this issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-20 15:24:05 +08:00
meihanc	ea57e3e7a4	[Main2Main] Upgrade vllm commit to releases/v0.14.0 (#5988 ) ### What this PR does / why we need it? Upgrade vllm commit to releases/v0.14.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-20 15:10:40 +08:00
LeeWenquan	55b20ac63b	[Ops] Add layernorm for qwen3Next (#5765 ) ### What this PR does / why we need it? Add layernormFn triton op for qwen3Next model for better performance. <img width="248" height="526" alt="image" src="https://github.com/user-attachments/assets/27b47157-5df5-4db1-aa88-1dae799b2bf6" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-01-20 14:43:14 +08:00
starmountain1997	0664c6e67a	[Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (#5921 ) ### What this PR does / why we need it? #### Documentation Improvements New Configuration: Added the layer_sharding parameter to the DeepSeek-V3.2-W8A8 deployment tutorial. This guides users to include `["q_b_proj", "o_proj"]` in their prefill node setup for better resource utilization. #### CI and Testing Updates Test Config Update: Updated the multi-node E2E test configuration file: tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml. including disable `FLASHCOMM` and enable `FULL_DECODE_ONLY` and update performance baseline. ### Does this PR introduce any user-facing change? Yes. The documentation now recommends a more optimized startup command for DeepSeek-V3.2-W8A8. Users following the updated tutorial will see improved performance in multi-node PD disaggregation environments. ### How was this patch tested? CI Validation: The updated E2E test configuration has been verified through the nightly CI pipeline. Environment: * vLLM version: v0.13.0 Base Commit: [11b6af5](`11b6af5280`) Hardware: Ascend A3/A2 multi-node cluster. --------- Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-01-20 12:40:54 +08:00
zhangxinyuehfad	a5b099c73d	[Bugfix] Reset incompatible config (#6005 ) ### What this PR does / why we need it? This PR introduces compatibility fixes for running vLLM on Ascend NPU hardware. The changes ensure that GPU-specific parameters are automatically detected and reset to Ascend-compatible values with appropriate warnings logged. \| Module \| Parameter \| Default Value \| \|--------\|-----------\|---------------\| \| Model Config \| `disable_cascade_attn` \| `False` \| \| Parallel Config \| `all2all_backend` \| `"allgather_reducescatter"` \| \| Cache Config \| `cpu_kvcache_space_bytes` \| `None` \| \| MultiModal Config \| `mm_encoder_attn_backend` \| `None` \| \| Observability Config \| `enable_layerwise_nvtx_tracing` \| `False` \| \| Scheduler Config \| `max_num_partial_prefills` \| `1` \| \| Speculative Config \| `quantization` \| `None` \| \| KV Transfer Config \| `kv_buffer_size` \| `1e9` \| \| KV Transfer Config \| `enable_permute_local_kv` \| `False` \| \| Attention Config \| `use_prefill_decode_attention` \| `False` \| \| Attention Config \| `use_cudnn_prefill` \| `False` \| \| Attention Config \| `use_trtllm_ragged_deepseek_prefill` \| `False` \| \| Attention Config \| `use_trtllm_attention` \| `False` \| \| Attention Config \| `disable_flashinfer_prefill` \| `False` \| \| Attention Config \| `disable_flashinfer_q_quantization` \| `False` \| \| Attention Config \| `flash_attn_version` \| `None` \| \| Attention Config \| `backend` \| `None` \| \| Attention Config \| `flash_attn_max_num_splits_for_cuda_graph` \| `32` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-20 11:02:38 +08:00
lilinsiman	a8576ec610	[Refactor][EAGLE] 5/N Update attn_metadata by common_attn_metadata (#5869 ) ### What this PR does / why we need it? 4/N EAGLE refactor plan devided into many parts, this PR is the first change, which modifies the attn_metadata update method by modifying common_metadata and then rebuilding the code. ### Does this PR introduce _any_ user-facing change? ut ### How was this patch tested? no - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com> Signed-off-by: Zetong Li <slippersss@126.com> Co-authored-by: Zetong Li <slippersss@126.com>	2026-01-20 10:06:00 +08:00
aipaes	f58e110afe	【feat】switch for fusion ops gmmswigluquant (#5992 ) ### What this PR does / why we need it? Set a additional config parameter to control whether the gmmswigluequant fuseion operator is enabled; it is enabled by True. / When enabled with a small number of GPUs, the gmmswigluquant fused operator can cause some performance degradation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` #### Perf test model: GLM 4.6(w8a8) - single A3 node(ep16, tp16), async-scheduling, mtp, FULL_DECODE_ONLY - bs=1, input_lens=32000, ouput_lens=1024 Without this PR: TPOT 32.22.ms With this PR: TPOT 30.23ms --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-01-19 13:19:25 +00:00
Qiu	38cfcd572a	[doc](cp) correct the prefill of GQA and adjust desc of block table. (#5697 ) ### What this PR does / why we need it? correct the seq length of KV for prefill of GQA and clarify the desc of block table distribution in developer guide. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-19 18:53:48 +08:00
Levi	f0d41199a6	[Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (#5936 ) ### What this PR does / why we need it? When enable VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1, we need index operation to reorganize the batch, because that we need ensure the correct batch-id for each rank after the reduce-scatter op in VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1. But we do not need it when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1, which dose not need reduce-scatter. Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2026-01-19 17:12:13 +08:00
wangxiaochao6	bc486d9530	[main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (#5960 ) ### What this PR does / why we need it? In PD disaggregation case, when P has multi nodes, mooncake fails to send data. Fix the issue in this PR. The details: If a P rank does not need to transfer kv cache to any one D rank, D node should send a message to P node to release the kv cache in P node. If P has multi nodes, D node should know the corresponding IP in each P node, then D node can send message to the right P node. Otherwise, send data error will happen. This PR fix this issue by providing P nodes IP to D node through Parameter `remote_port_send_num`. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangxiaochao <w00642655@china.huawei.com> Co-authored-by: wangxiaochao <w00642655@china.huawei.com>	2026-01-19 16:35:13 +08:00
wangqiankun13	ebb940691f	[Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (#5755 ) ### What this PR does / why we need it? [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. - Before: weight scale must be float32 - After: weight scale can be float32/float16 when x is float16, float32/bfloat16 when x is float32/bfloat16. And w1 scale can use different dtype with w2 scale. More info about this operator, please refer to RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Perf > When scale is of type fp16 or bf16, it will be cast to fp32 internally within the operator, while the subsequent computations remain unchanged. Therefore, this PR will introduce an additional cast operation but halve the memory copy operations for scale . Furthermore, since the scale data is only a few KB in size and participates in relatively few computations, its impact is almost negligible compared to major operations like matrix multiplication. Thus, the theoretical performance change should be minimal. test single operator cases from qwen3-235b, - single A3 node(ep16), 64 moe experts, 4 experts / die (like qwen3-235b ep32) - batch=18/32, token_hidden_size 4096, moe_intermediate_size 1536 The test was conducted for 100 rounds, and the average of the last 95 rounds was taken. \| \| bs18(us)\| bs32(us)\| \| -----\| -----\| -----\| \|Without this PR\|96.28\|108.83\| \|With this PR\|96.06\|107.90\| Note: Single-operator benchmarks represent an ideal scenario. They are usually only useful for referencing relative changes and may not fully align with performance data observed within the full model. #### Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-19 16:10:43 +08:00
LICO67373	687df88151	[Refactor] Move AttentionSpec initialization to Attention module (#5834 ) ### What this PR does / why we need it? This PR refactors `get_kv_cache_spec` method to delegate AttentionSpec creation to each attention module's own `get_kv_cache_spec()` method, aligning with the vllm source code structure. Changes: - Simplify `get_kv_cache_spec` in `model_runner_v1.py` and `cpu_offload_connector.py` - Remove manual `AttentionType` checks for `Attention` modules - Delegate spec creation to each attention module's `get_kv_cache_spec` method directly - Let `MambaBase` layers use their own `get_kv_cache_spec` method - Keep `use_sparse` hack for `MLAAttention` (DeepSeek DSA mode) as Ascend-specific handling This change follows RFC #5463 item 12: move AttentionSpec to Attention module. - Fixes #5463 (item 12) ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring that simplifies code structure without changing any external behavior. ### How was this patch tested? - Syntax validation passed via `python -m py_compile` - CI tests will verify the changes work correctly with existing test cases - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-19 14:22:18 +08:00
LI SHENGYONG	83de5385b4	[EPLB][Bugfix] policy_swift_balancer bugfix and renaming (#5897 ) ### What this PR does / why we need it? 1. Rename dynamic_ep to default_eplb. 2. Rename dynamic_ep_v2 to swift_balancer 3. Discard func compose_expert_update_info_bipartite. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 05:47:40 +00:00
SILONG ZENG	b27774dbd6	[CI]fix for lint CI (#5982 ) ### What this PR does / why we need it? fix lint CI - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-19 09:49:28 +08:00
Icey	c929bd1e8d	[Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (#5034 ) This PR add `MatmulAllreduceRmsnorm` operator and introduces a graph fusion pass for `matmul_allreduce_rmsnorm` operations. The implementation includes a new configuration flag, a pattern matching pass using `torch._inductor.pattern_matcher`. Co-authored-by: Trunrain [270250579@qq.com](mailto:270250579@qq.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: tongrunze <t00574058@china.huawei.com>	2026-01-19 09:28:07 +08:00
meihanc	9cad1a8349	[Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (#5928 ) ### What this PR does / why we need it? Migrate the torch profiler configuration from deprecated environment variables (`VLLM_TORCH_PROFILER_DIR`, `VLLM_TORCH_PROFILER_WITH_STACK`, `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`) to the explicit `ProfilerConfig` object, aligning with vLLM's configuration best practices. The profiler environment variable approach is deprecated in vLLM and will be removed in v0.14.0 or v1.0.0. ### Does this PR introduce _any_ user-facing change? yes, for deverlopers who want to fetch profiler, he should use `--profiler-config` instead of `VLLM_TORCH_PROFILER_DIR` ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-19 09:27:55 +08:00
LI SHENGYONG	bc1f6713e7	[EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (#5933 ) ### What this PR does / why we need it? 1. Move the logic of expert mapping forward to prevent shotgun changes 2. Disable the update of expert map. ### How was this patch tested? a2 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| GPQA_diamond \| 53064e \| accuracy \| gen \| 73.23 \| a3 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 09:24:25 +08:00
LI SHENGYONG	9fed2636cb	[EPLB][Nightly][Bugfix] Get expert from moe layer only (#5908 ) ### What this PR does / why we need it? 1. If the model has dense layers, the current code will attempt to obtain the routing experts of the dense layers, which will cause an error. This should be fixed by modifying the code to skip the dense layers when obtaining the routing experts. 2. The global_expert_map that the function directly outputs a affects the performance of dsv3.2. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? DeepSeek V3.1 conversation is normal. #### aime precision test (dsv3.1) baseline without eplb \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 66.67 \| eplb \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 70.00 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 09:23:28 +08:00
Shanshan Shen	ad3a1eaf70	[Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (#5855 ) ### What this PR does / why we need it? As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339, multi-modal inference on vllm-ascend may lead to OOM issues in some scenarios. After our analysis, this is due to the memory fragmentation caused by frequent dynamic memory size adjustments during runtime. During the inference, the figure for non-torch memory see a gradual increase from around 1G to over 5G until the OOM issue occurs. We find that this problem can be resolved by just directly setting `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue. Thus, we decide to set this value by default, except RL (sleep mode) scenarios. It's also worthy to note that this environment variable may have more than one key-value pairs. We should append `",expandable_segments:True"` to the current configs. For example: ```python PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True". ``` > [!NOTE] > `max_split_size_mb` or `garbage_collection_threshold` cannot be enabled together with `expandable_segments=True`. ### Does this PR introduce _any_ user-facing change? Users do not need to set `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually any more. ### How was this patch tested? I have build a dataset consisting of my own photographs, which can stably reproduce this OOM issue on Qwen3-VL serie models. After apply this PR, this problem has been resolved and the amount of non-torch memory will keep stable at around 1G throughout the whole inference. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-19 09:17:31 +08:00
herizhen	0eafed9bd6	[doc]Table split (#5929 ) ### What this PR does / why we need it? Added legend descriptions, and split redundant tables into core supported model tables and extended compatible model tables. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: herizhen <1270637059@qq.com>	2026-01-19 09:15:04 +08:00
Li Wang	c4fde5c064	[Doc] Upgrade outdated ut doc (#5937 ) ### What this PR does / why we need it? For cpu env, we should set `SOC_VERSION` to mock different NPU chips for different compilation paths ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-19 09:12:46 +08:00
SILONG ZENG	329961b375	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #2 ) (#5977 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/attention/attention_mask.py` \| \| `vllm_ascend/attention/attention_v1.py` \| \| `vllm_ascend/attention/context_parallel/attention_cp.py` \| \| `vllm_ascend/attention/context_parallel/common_cp.py` \| \| `vllm_ascend/attention/context_parallel/mla_cp.py` \| \| `vllm_ascend/attention/utils.py` \| \| `vllm_ascend/batch_invariant.py` \| \| `vllm_ascend/device/device_op.py` \| \| `vllm_ascend/device_allocator/camem.py` \| \| `vllm_ascend/envs.py` \| - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-19 08:59:46 +08:00
Song Zhixin	2b6dc100b5	Eagle3 mm support, enablement on qwen3vl (#4848 ) ### What this PR does / why we need it? follow pr [https://github.com/vllm-project/vllm/pull/20788](https://github.com/vllm-project/vllm/pull/20788) , Eagle3 mm support, enablement on qwen3vl target model [Qwen/Qwen3-VL-8B-Instruct]([https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct]) eagle3 [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv vLLM with eagle3 : ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{ "method": "eagle3", "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3", "num_speculative_tokens": 3 }' ``` vLLM without eagle3 : ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images ``` bench: ``` vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: jesse <szxfml@gmail.com>	2026-01-19 08:58:07 +08:00
zzhxxx	05e69b99e5	[Doc] Remove Chinese characters from the icons in the doc. (#5959 ) ### What this PR does / why we need it? Remove Chinese characters from the icons in the doc. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2026-01-18 07:22:57 +08:00
wangxiaoteng888	fff5df3efe	[P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (#5968 ) ### What this PR does / why we need it? The force-free secondary release request causes the node to crash. When requests are pulled too quickly, they should not be added to the delay-free queue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-17 18:49:27 +08:00
Jade Zheng	22f253142a	[Feature] Support fine-grained shared expert overlap (#5482 ) Fine-grained control over shared expert overlap to prevent resource contention. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2026-01-17 11:53:22 +08:00

1 2 3 4 5 ...

2146 Commits