xc-llm-ascend

Author	SHA1	Message	Date
iiiklw	a0315f6697	[npugraph_ex]enable npugraph_ex by default (#6664 ) ### What this PR does / why we need it? This pull request enables the `npugraph_ex` backend by default to improve performance on Ascend NPUs, as proposed in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/6214). ### Does this PR introduce _any_ user-facing change? Yes. `npugraph_ex` is now enabled by default. Users can disable it by setting `enable: false` in the `npugraph_ex_config` section of the `additional_config`. ### How was this patch tested? CI passed. The changes are covered by existing and new E2E tests (`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`) that have been updated to reflect the new default behavior. The tests verify correctness and consistency with `npugraph_ex` enabled and disabled, as well as with the new static kernel option. Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com> Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>	2026-02-12 08:44:06 +08:00
Angazenn	c0c2eb614e	[Main][Ops] Make triton rope support index_selecting from cos_sin_cache (#5450 ) ### What this PR does / why we need it? This PR extends original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as inputs. This fully aligns to vLLM RoPE api interface. Compared with earlier implementation for RoPE, the benefits are: 1. avoiding pre-computation of `cos` `sin` before model execution, which helps to remove redundant codes. 2. allowing eagle3 draft model to have different rope parameters with main model (see #6612 ). This help to recover accept rate && accuracy in that case. In addition, this kernel change only introduces very small performance degradation. Those `index_select` or `chunk` operations are now changed into simple memory access in triton kernel (For example, https://github.com/vllm-project/vllm-ascend/pull/5450/changes#diff-a4c2d3071530df193b98f9bf38553874bc4d47571336711f116c26d019cfbb6aR77-R81). Highlights - RoPE Cache Unification: Replaced separate _sin and _cos global tensors with a unified cos_sin_cache and explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling. - Triton Kernel Integration: Updated Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations. - Custom Operation Registration: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation. - Refactored RoPE Forward Pass: Modified the rope_forward_oot function to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `5326c89803` Additional test on Qwen3-235b accuracy: \| Aime2024 \| GSM8K \| Livecodebench \| \| -------- \| -------- \| -------- \| \| 83.33 \| 96.26 \| 70.23 \| --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-02-11 21:20:53 +08:00
SILONG ZENG	6bc44bf49b	[CI]fix nightly multi node test error for wait for pod ready (#6675 ) ### What this PR does / why we need it? Fixes the issue where nightly multi-node tests hang during the "wait for pod ready" stage due to strict shell mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-11 18:11:00 +08:00
whx	bb73478c00	[Test][BugFix] Fix torch.rand usage in triton penalty test (#6680 ) ### What this PR does / why we need it? This PR fixes a `TypeError` in `tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_penality.py` that was causing nightly test failures. The `torch.rand()` function was being called with the `device` string as a positional argument, which is incorrect. This has been corrected to use the `device` keyword argument. Fixes # ### Does this PR introduce _any_ user-facing change? No, this change only affects a test file. ### How was this patch tested? CI is expected to pass with this fix. - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-02-11 16:31:49 +08:00
pu-zhe	02886e2641	[Feat] 310p support MoE W8A8 quantizaition (#6641 ) ### What this PR does / why we need it? This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by: - Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`. - Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths. - Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations. - Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8 quantization in a multi-card setup. - Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-10 17:17:44 +08:00
jiangyunfan1	1eb07986bf	[TEST]add a qwen3-30b acc case with mooncake mempool (#6244 ) ### What this PR does / why we need it? This PR adds a case of qwen3-30b w8a8 with mooncake mempool, we need to test it regual ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2026-02-10 16:26:55 +08:00
wangxiyuan	2a826b5fad	[Misc] upgrade to vllm main (#6646 ) ### What this PR does / why we need it? This PR upgrades the core vLLM dependency to a newer version from the main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is necessary to keep our project up-to-date with the latest features and fixes from upstream vLLM. 1. `ac32e66cf9` pass file is moved. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wxsIcey <1790571317@qq.com>	2026-02-10 14:08:59 +08:00
yupeng	8d44ddacb0	[Test][LoRA] Add e2e test for base model inference (#6624 ) ### What this PR does / why we need it? This PR adds an end-to-end test case to verify the correctness of base model inference when LoRA is enabled. This is to ensure that after a LoRA base model request issue was fixed, the functionality remains correct and does not regress. The new test case calls `do_sample` with `lora_id=0` to target the base model and asserts the output against expected SQL queries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with the new test case. The test can be run with: ```bash pytest -sv tests/e2e/singlecard/test_llama32_lora.py Signed-off-by: paulyu12 <507435917@qq.com>	2026-02-09 21:06:49 +08:00
Qiu	cb7c419bc0	[Feat](sfa,dcp) support dcp for sfa (#6563 ) ### What this PR does / why we need it? This PR adds DCP support to the SFA backend. Please note that due to operator constraints, the current implementation has to all-gather the entire KV cache and modify the block table to satisfy the operator input requirements. This results in significantly increased communication overhead and peak memory usage. Therefore, this is only a temporary workaround and will be refactored once the operator provides proper support. Additionally, because of the above limitations, `cp_kv_cache_interleave_size` is currently required to be equal to `block_size`. This restriction will also be removed after the refactor. #### Test accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8 \| dataset \| version \| metric \| mode \| vllm-api-general-stream \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8kdataset \| - \| accuracy \| gen \| 96.35 \| - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-02-09 18:52:25 +08:00
lhp-deep	d060c797ed	[fix bug] fix tensor mismatch bug in sigmoid operate test case (#6619 ) ### What this PR does / why we need it? This PR fixes a bug in the `test_triton_fusion_ops` test case. The test compares a fused kernel (`fused_sigmoid_gating_delta_rule_update`) with a split implementation. Both paths use a recurrent state tensor. The bug was that the state tensor was being modified in-place by the fused kernel call, and this modified tensor was then reused for the split implementation path. This led to an incorrect comparison and test failure. This fix ensures that each path starts with an identical, clean initial state by creating separate tensors. It also changes the state initialization from `torch.randn` to `torch.ones` to make the test deterministic. ### Does this PR introduce _any_ user-facing change? No, this change only affects a test case and has no user-facing impact. ### How was this patch tested? The fix is applied directly to the test case. The CI passing for `test_fused_sigmoid_gating_delta_rule.py` will confirm that the fix is working as expected. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>	2026-02-09 16:43:27 +08:00
pu-zhe	1cc225711d	[Refactor]310p_e2e test case update (#6539 ) ### What this PR does / why we need it? This pull request significantly enhances the test suite by adding new end-to-end test cases for Qwen3 models on the 310P hardware platform. The primary goal is to ensure the stability and correctness of these models under diverse operational conditions, including various parallelism strategies, data types, and quantization methods. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:28:37 +08:00
wangxiyuan	6c49f95da2	[Ops][Refactor] Remove custom rotary_embedding operator (#6523 ) ### What this PR does / why we need it? This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests. The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring and does not introduce any user-facing changes. ### How was this patch tested? The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-07 09:24:05 +08:00
wangyu	c63b7a1188	[Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301 ) ### What this PR does / why we need it? This PR adds disaggregated encoder tests for Qwen2.5-VL-7B-Instruct ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running the test by running ci - vLLM version: release/v0.12.0 --------- Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>	2026-02-06 17:30:17 +08:00
Li Wang	d018aeb5fa	[Image] Bump mooncake version to v0.3.8.post1 (#6428 ) ### What this PR does / why we need it? This patch bump the mooncake version to the latest [release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? test is locally >>> from mooncake.engine import TransferEngine - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-02-06 10:54:03 +08:00
zhangxinyuehfad	81f3c09d6d	[CI] Change A2 runner (#6557 ) ### What this PR does / why we need it? This PR updates the CI runner from `linux-aarch64-a2-` to `linux-aarch64-a2b3-` in various test configuration files. This change is necessary to adapt to updates in the CI infrastructure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The changes are configuration updates for CI tests. The correctness will be verified by the CI pipeline. Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 23:43:57 +08:00
meihanc	922e5c163b	[main2main] upgrade vllm main 0202 (#6560 ) ### What this PR does / why we need it? 1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to https://github.com/vllm-project/vllm/pull/32567 2. Fix ` TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to https://github.com/vllm-project/vllm/pull/33035 3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and AttributeError: 'bool' object has no attribute 'process_weights_after_loading' due to https://github.com/vllm-project/vllm/pull/33284 4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'`due to https://github.com/vllm-project/vllm/pull/32790 5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to https://github.com/vllm-project/vllm/pull/32005 6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'` due to https://github.com/vllm-project/vllm/pull/27492 7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward, vllm.moe_forward_shared due to https://github.com/vllm-project/vllm/pull/33184 8. Add patch to fix the problem "got multiple values for keyword argument 'add_special_tokens'" due to https://github.com/vllm-project/vllm/pull/32863 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 19:31:17 +08:00
ChenCangtao	2c1608265b	[CI][npugraph_ex]Fix npugraph ex e2e test (#6553 ) ### What this PR does / why we need it? When running the Qwen3-0.6B model using the npugraph_ex backend, the last few characters of the generated results changed. We have modified the relevant test cases to ensure the CI runs smoothly. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-02-05 14:03:10 +08:00
Yizhou	2ee4f23f28	[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6475 ) ### What this PR does / why we need it? This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit `5b0a6bcfe9`)" and fixes a check in `model_runner_v1`. A key change is that we remove the strict assertion in the latest commit, as it turns out MLA + PIECEWISE will slice during computing, leaving our assertion uncalled for and will only cause false alarm. This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, which prevents kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes. We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases added. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-02-04 21:11:08 +08:00
starmountain1997	bfcc372f75	[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6499 ) ### What this PR does / why we need it? This PR enhances the test_deepseek3_2_w8a8_pruning_mtp_tp2_ep E2E test by adding both short and long prompt test cases: - Short test: Validates basic functionality with minimal input ("Hello ") - Long test: Validates the model can handle prompts near its maximum context length (~163K tokens, approaching the max_position_embeddings limit of 163,840) Additionally, explicitly sets max_model_len=163840 to ensure the test properly exercises the model's full context window capability. ### Does this PR introduce _any_ user-facing change? No. This change only affects internal E2E testing infrastructure. ### How was this patch tested? The modified test case will be executed as part of the E2E test suite and has been validated [here](https://github.com/vllm-project/vllm-ascend/actions/runs/21620195055/job/62308026205?pr=6499). - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-04 09:10:50 +08:00
Nengjun Ma	78fad4e348	[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442 ) ### What this PR does / why we need it? Refactor MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage. Environments VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE is removed, usage as following: --additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}' ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-04 09:08:18 +08:00
whx	4d6444d5fd	[Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (#6505 ) ### What this PR does / why we need it? Remove kv_cache nz test case for test_mla_preprocess_nq.py. This case is added by https://github.com/vllm-project/vllm-ascend/pull/3072 but has not been tested on bf16 scenario. Results show that this is not currently supported. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed with existing test. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-02-03 18:26:51 +08:00
Feng Liu	03a18ad6fd	[E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149 ) ### What this PR does / why we need it? Add E2E for Prefix Caching cp & Chunked Prefill cp ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2026-02-03 15:04:14 +08:00
LeeWenquan	b1de6cbb31	[Bugfix][CI]Add qwen3Next MTP+Full Decode (#6047 ) ### What this PR does / why we need it? Fix a bug in the repo and add a test case for MTP + Full Decode Only + Qwen3Next. The _build_dummy_attn_metadata function in NPUModelRunner seems losed a query_star_loc.copy_to_gpu operation, which will lead to difference between query_start_loc and query_start_loc_cpu, and they are required to be same in MTP + Full Decode Only + Qwen3Next case. Before this pr: `self.query_start_loc = [0, 0, 0, 0, ... , 0] self.query_start_loc_cpu = [0, 2, 4, 6, ... ,128]` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-02-03 14:26:21 +08:00
Shaoxu Cheng	39e77fb9e4	[Feat.]: support 310p w8a8 (#6454 ) ### What this PR does / why we need it? Introduced 310P W8A8 Quantization Support: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform. Platform-Specific Quantization Configuration Loading: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device. Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation. Extended AscendModelSlimConfig for 310P: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-02-03 14:13:06 +08:00
lidenghui1110	79803932e2	[Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366 ) ### What this PR does / why we need it? As #2947 describe, we need to transpose kv cache layout after GQA kv transfer when prefill and decode tensor parallel size are heterogeneous, in the previous implementation, we use `npu_paged_cache_load ` + `tranpose` + `_npu_reshape_and_cache` to do this work. But obviously, it is not an efficient plan, the ops above need to be called for each layer, which introduces 3 * layer_num kernel launch, and 6 * layer_num data movement between L1 Cache and HBM for one request on decode node. Usually, decode node uses graph mode, so these op kernels will be called between decode forward launched by an async thread in mooncacke connector, this kernels maybe last for several decode forward and TTFT will increase by 3~4 decode forward time. In this PR, we implement an AscendC fused op `transpose_kv_cache_by_block` to do this with only once kernel launch and move data between L1 Cache and HBM only once. After using this fused op, the time cost in transpose kv cacke layout can be decreased to 0.24ms from 7ms in UT on 910C, and in PD disaggregation scenario, TTFT can decrease about 90 ~ 110 ms in qwen3-235B. \| request_num \| original \| fused_op\| \|:----------------------:\|:---------------:\|:-------------------:\| \| 1 \| 643 ms \| 578 ms \| \| 128 \| 1480 ms \| 1368 ms \| ### Does this PR introduce _any_ user-facing change? Use fused op by default, incase the op has bug in any scenario, provide fallback choice using env to disable it. DISABLE fused op by add following env `export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0` ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-02-03 14:10:01 +08:00
guanguan0308	dffac6db73	[Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402 ) ### What this PR does / why we need it? Add New Output for Expert Token Count An additional output tensor expert_token_nums is added to both operators to meet the requirement of tracking token distribution among experts: Tensor Name: expert_token_nums Dimension: 1D tensor Shape: (local_expert_num,) Data Type: int32 Semantics: Represents the number of tokens actually received by each expert on the current card. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: guanguan0308 <1546542263@qq.com> Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>	2026-02-03 10:41:06 +08:00
starmountain1997	b6256e8bc9	Revert "[CI] fix DS3.2 single node cudagraph_sizes config (#6241 )" (#6497 ) # What this PR does / why we need it? This PR reverts commit `8134146ab6`, which modified the DeepSeek V3.2 (W8A8) single-node nightly test configuration. as there is no limit between tp_size and MTP. # Does this PR introduce any user-facing change? No. This PR only affects CI/CD test configurations and does not introduce any user-facing changes. # How was this patch tested? N/A for a revert PR. The changes restore the previously known working configuration. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-03 08:42:58 +08:00
meihanc	c08364f761	[Bugfix] Fix intermittent kv_port conflict with AscendDirectTransport (#6455 ) ### What this PR does / why we need it? When using Mooncake on Ascend NPU, AscendDirectTransport randomly allocates ports within range `[20000, 20000 + npu_per_node × 1000)`. Reference: [ascend_direct_transport.cpp#L554](https://github.com/kvcache-ai/Mooncake/blob/v0.3.7.post2/mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp#L475) If `kv_port` overlaps with this range, users may encounter intermittent startup failures: ```bash zmq.error.ZMQError: Address already in use (addr='tcp://x.x.x.x:30012') RuntimeError: KV Cache sending/receiving thread failed to start. ``` This pr fix intermittent kv_port conflict with AscendDirectTransport in `Qwen3-235B-W8A8-EPLB.yaml`, and add Added `kv_port Configuration Guide` section in `pd_disaggregation_mooncake_multi_node.md`. test Results(tests/e2e/nightly/multi_node/config/Qwen3-235B-W8A8-EPLB.yaml): https://github.com/vllm-project/vllm-ascend/actions/runs/21540138907/job/62073265259 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-02-02 17:31:21 +08:00
LHXuuu	45a573cff1	[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W4A8 dynamic weight. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: menogrey <1299267905@qq.com>	2026-02-02 16:39:32 +08:00
wangxiyuan	eeedf7c503	[Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470 ) ### What this PR does / why we need it? This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This involves: - Updating the `VLLM_TAG` in all `Dockerfile`. - Updating the vLLM version in `docs/source/conf.py`. - Removing conditional code paths specific to `v0.14.1` across the codebase, which simplifies maintenance. - Fix `TypeError: MMEncoderAttention.__init__() got an unexpected keyword argument 'multimodal_config'` due to https://github.com/vllm-project/vllm/pull/31972. - Fix `_shared_experts: 'NoneType' object is not callable` due to https://github.com/vllm-project/vllm/pull/32082 by https://github.com/vllm-project/vllm-ascend/pull/6335. - Fix `ReshapeAndCacheOperation setup failed!` due to https://github.com/vllm-project/vllm/pull/25954 by overriding attention metadata slots. This upgrade is necessary to keep the project aligned with the latest features, bug fixes, and API changes in the vLLM project. ### Does this PR introduce _any_ user-facing change? No, this is an internal dependency update and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with these changes, ensuring that all existing tests are successful with the new vLLM version. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` co-authored-by: shen-shanshan <467638484@qq.com> --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-02 15:57:55 +08:00
starmountain1997	8134146ab6	[CI] fix DS3.2 single node cudagraph_sizes config (#6241 ) # What this PR does / why we need it? This PR fixes the single-node nightly test for DeepSeek V3.2 (W8A8) model to ensure CI stability. The changes include: 1. Simplified nightly test matrix (nightly_test_a3.yaml): - Temporarily reduced to only run deepseek3_2-w8a8 test case for debugging - Changed trigger from schedule/workflow_dispatch to support push/pull_request for faster iteration 2. Updated DeepSeek V3.2 test configuration (test_deepseek_v3_2_w8a8.py): - Adjusted cudagraph_capture_sizes from [3, 6, 9, 12] to [8, 16, 24, 32] for better performance - Increased max-num-seqs from 4 to 8 - Increased gpu-memory-utilization from 0.92 to 0.98 - Increased num_speculative_tokens from 2 to 3 3. Added PR checkout step (_e2e_nightly_single_node.yaml): - Added ability to checkout a specific PR (#6241) for testing # Does this PR introduce any user-facing change? No. This PR only affects CI/CD test configurations and does not introduce any user-facing changes. # How was this patch tested? Mock nightly test has passed, see [here](https://github.com/vllm-project/vllm-ascend/actions/runs/21574655952/job/62159656622?pr=6241). <img width="1053" height="714" alt="a2f2ee359febb13e1f6330b1bd3c116b" src="https://github.com/user-attachments/assets/3262ad0f-adec-4c71-871f-d9cf2db06fbc" /> - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-02-02 11:47:32 +08:00
wangxiyuan	b4aafd4293	[Core][Misc] Clean up ProfileExecuteDuration (#6461 ) ### What this PR does / why we need it? This PR removes the custom `ProfileExecuteDuration` utility and its usages across the codebase. This utility was used for profiling execution duration of different stages in the inference process. It is replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`, which integrates with PyTorch's profiler. This change simplifies the code by removing a custom implementation in favor of an upstream utility, improving maintainability. Associated documentation and tests for `ProfileExecuteDuration` are also removed. ### Does this PR introduce _any_ user-facing change? `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now. ### How was this patch tested? CI passed. The changes are a cleanup and replacement with a standard utility. Existing tests cover the functionality. The removed feature had its own tests which are also removed. Related RFC: #5304 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:01 +08:00
Li Wang	5b0a6bcfe9	[ModelRunner] Revert "[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6459 ) This reverts commit `56f5d3bd49`. ### What this PR does / why we need it? The patch https://github.com/vllm-project/vllm-ascend/pull/6357 which break the functionality availability in the spec_decode scenario, let's revert and make CI happy first ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-31 16:33:34 +08:00
Qiu	638cae824d	[bugfix](CP) Fix and unify the PD request discrimination logic. (#5939 ) ### What this PR does / why we need it? Since the PR (https://github.com/vllm-project/vllm/pull/32118) has modified the criteria for judging Prefill and Decode requests in vLLM, PCPManager needs to synchronize with this standard. As PCPManager involves multiple calculations of PD request counts, this PR attempts to consolidate the related logic and update the PD request count once per batch. ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ``` - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-31 10:26:02 +08:00
Yizhou	56f5d3bd49	[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6357 ) ### What this PR does / why we need it? This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes. We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases added. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-30 16:41:44 +08:00
ChenCangtao	f2990f7741	[e2e Test][npugraph_ex]add static kernel e2e test case (#6320 ) ### What this PR does / why we need it? Added an E2E test case for the scenario of enabling a static kernel for npugraph_ex, monitoring its compilation and unloading process. Also fixed the previously existing spelling errors - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-30 16:24:48 +08:00
CodeCat	b2857de43f	[ST]Add e2e test for Npugraphex_pass (#6388 ) ### What this PR does / why we need it? We found the custom passes of NPUGraphEX have implemented fusion operator features, which still require E2E test case validation and guard. This PR implements E2E test cases for the AddRMSNormQuant and SplitQKVNormRope operator fusions under NPUGraphEX that are already in the codebase. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-30 09:14:07 +08:00
wjunLu	4970de4242	[CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195 ) ### What this PR does / why we need it? Enable the tests that were skipped due to an outdated driver version: - tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py - tests/e2e/multicard/4-cards/long_sequence/test_basic.py - tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py and some cases in - tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py - tests/e2e/multicard/2-cards/test_external_launcher.py - tests/e2e/multicard/2-cards/test_offline_weight_load.py - tests/e2e/multicard/2-cards/test_quantization.py - tests/e2e/multicard/4-cards/test_data_parallel_tp2.py TODO: - tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py - tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-29 22:41:41 +08:00
Qiu	50e0e87646	[bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344 ) ### What this PR does / why we need it? PR #5672 attempted to remove the -1 padding for duplicate tokens in the decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler slicing approach. However, in the single-ops logic and mixed PD batches, the decode slot_mapping did not eliminate the -1 and also shared the slicing method, resulting in incorrect slot_mapping. This PR resolves this issue, and the logic will be further consolidated in subsequent refactoring PRs. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-29 16:48:37 +08:00
InSec	86b6ecac4c	[CI][BugFix] Import error fix. (#6293 ) ### What this PR does / why we need it? Fix the import error of qwen3-next nightly test. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: InSec <1790766300@qq.com>	2026-01-28 22:07:47 +08:00
linfeng-yuan	e25ee65729	[Misc][Test] add e2e test for apply_top_k_top_p_custom kernel (#6348 ) ### What this PR does / why we need it? Add e2e test case for apply_top_k_top_p_custom kernel and eliminate chinese comments. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? pytest passed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-01-28 17:25:57 +08:00
dsxsteven	325cb16e3f	[BugFix][CI]Fix DeepSeek-R1-W8A8-longseq nightly CI (#6297 ) ### What this PR does / why we need it? The precision issue arose because the kv cache of the p-node had not been fetched for an extended period(>6min) and was forcibly freed. To avoid this problem, the batch size was reduced and the timeout period has also been extended. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: dsxsteven <dsxsteven@sina.com>	2026-01-28 16:36:24 +08:00
wangxiyuan	f8e76a49fa	[CI] Upgrade trasnformers version (#6307 ) Upgrade transformers to >=4.56.4 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-28 14:06:39 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors： https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors：https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
InSec	595b57c4d4	[CI][BugFix] Qwen3-Next nightly test fix. (#6247 ) ### What this PR does / why we need it? Qwen3-Next nightly test fix. Temporarily avoid the accuracy issue in the full graph mode. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: InSec <1790766300@qq.com>	2026-01-26 19:53:53 +08:00
huangning1995	ce11fd49f3	[Feature] Batch invariant torch.compile (#6107 ) ### What this PR does / why we need it? Building upon https://github.com/vllm-project/vllm-ascend/pull/5517 to enable batch-invariant in vllm-ascend, we observed that the performance of BI in eager mode remains suboptimal. This PR further integrates batch-invariant with torch.compile, which improves inference performance by 350% when tested with Qwen3-0.6B. ### Does this PR introduce _any_ user-facing change? Previously, enabling both aclgraph and Batch-Invariant would cause an "ub overflow" error. This occurred because transposed input tensors could produce incorrect stride() values. To fix this, we now call .contiguous() on the input tensors before passing them to Triton kernels. This ensures a contiguous memory layout and prevents transposed tensors from causing incorrect stride calculations. ### Test Plan pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_batch_invariant.py ### Test Result ``` ============================================================================ slowest durations ============================================================================ 87.37s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle 77.39s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN 74.04s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail 73.59s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation (8 durations < 0.005s hidden. Use -vv to show these durations.) ================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================ ``` ### Performance export VLLM_BATCH_INVARIANT=1 vllm serve /home/Qwen3-0.6B \ --served-model-name qwen \ --port 8000 \ --max-num-seqs 256 \ --tensor-parallel-size 1 \ --max-model-len 5500 \ --max-num-batched-tokens 5500 \ --reasoning-parser qwen3 \ --gpu-memory-utilization 0.9 \ --compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,32]}' \ --additional-config '{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}' vllm bench serve --served-model-name qwen --trust-remote-code --backend vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name random --random-input-len 512 --random-output-len 256 --num-prompts 800 --max-concurrency 8 torch.compile batch invariant performance: ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 477.21 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 1.68 Output token throughput (tok/s): 429.16 Peak output token throughput (tok/s): 472.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 1287.48 ---------------Time to First Token---------------- Mean TTFT (ms): 285.53 Median TTFT (ms): 312.70 P99 TTFT (ms): 324.22 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 17.59 Median TPOT (ms): 17.50 P99 TPOT (ms): 18.44 ---------------Inter-token Latency---------------- Mean ITL (ms): 17.59 Median ITL (ms): 17.45 P99 ITL (ms): 18.76 ================================================== ``` Eager ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 1694.70 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 0.47 Output token throughput (tok/s): 120.85 Peak output token throughput (tok/s): 136.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 362.54 ---------------Time to First Token---------------- Mean TTFT (ms): 164.29 Median TTFT (ms): 129.71 P99 TTFT (ms): 1961.66 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 65.81 Median TPOT (ms): 65.15 P99 TPOT (ms): 72.27 ---------------Inter-token Latency---------------- Mean ITL (ms): 65.81 Median ITL (ms): 64.64 P99 ITL (ms): 75.72 ================================================== ``` - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: huangning1995 <huangning12@huawei.com>	2026-01-26 09:15:06 +08:00
Li Wang	c38c838d03	[CI] Decrease Qwen3 dense model output throughput baseline to make ci happy (#6233 ) ### What this PR does / why we need it? As https://github.com/vllm-project/vllm-ascend/actions/runs/21327913593/job/61388195448 shows, I encountered two CI failures., The results consistently pointed to the reduced outcome 1600 -> 1514 - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:04:13 +08:00
Li Wang	63adbedb7a	[Worker] Implement update max_model_len interface for NPUWorker (#6193 ) ### What this PR does / why we need it? This patch purpose to add the `update_max_model_len` interface. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:03:33 +08:00
Li Wang	ca297eb57f	[CI] Migrate e2e test runner to hk (#5344 ) ### What this PR does / why we need it? This patch add new runner labels for the HK region, and e2e single-card testing has been migrated to this runner. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:00:51 +08:00
Angazenn	5b746f3e83	[Inductor]change pass to adapt to new addrmsnormBias operator (#6094 ) ### What this PR does / why we need it? #5790 changes default addrmsnormBias operator if custom ops is enabled. This PR modifies AddRmsNormQuant pass to align with addrmsnormBias. --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-24 20:16:44 +08:00

1 2 3 4 5 ...

624 Commits