xc-llm-ascend

Author	SHA1	Message	Date
SILONG ZENG	347eb36a59	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #9 ) (#6135 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \|`vllm_ascend/worker/model_runner_v1.py`\| \|`vllm_ascend/worker/pcp_utils.py`\| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-01 23:20:20 +08:00
wangxiyuan	b4aafd4293	[Core][Misc] Clean up ProfileExecuteDuration (#6461 ) ### What this PR does / why we need it? This PR removes the custom `ProfileExecuteDuration` utility and its usages across the codebase. This utility was used for profiling execution duration of different stages in the inference process. It is replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`, which integrates with PyTorch's profiler. This change simplifies the code by removing a custom implementation in favor of an upstream utility, improving maintainability. Associated documentation and tests for `ProfileExecuteDuration` are also removed. ### Does this PR introduce _any_ user-facing change? `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now. ### How was this patch tested? CI passed. The changes are a cleanup and replacement with a standard utility. Existing tests cover the functionality. The removed feature had its own tests which are also removed. Related RFC: #5304 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:01 +08:00
Li Wang	5b0a6bcfe9	[ModelRunner] Revert "[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6459 ) This reverts commit `56f5d3bd49`. ### What this PR does / why we need it? The patch https://github.com/vllm-project/vllm-ascend/pull/6357 breaks functionality in the spec_decode scenario, so let's revert it and make CI happy first. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-31 16:33:34 +08:00
Qiu	638cae824d	[bugfix](CP) Fix and unify the PD request discrimination logic. (#5939 ) ### What this PR does / why we need it? Since the PR (https://github.com/vllm-project/vllm/pull/32118) has modified the criteria for judging Prefill and Decode requests in vLLM, PCPManager needs to synchronize with this standard. As PCPManager involves multiple calculations of PD request counts, this PR attempts to consolidate the related logic and update the PD request count once per batch. ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ``` - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-31 10:26:02 +08:00
Yizhou	56f5d3bd49	[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6357 ) ### What this PR does / why we need it? This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes. We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Test cases added. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-30 16:41:44 +08:00
Wang Kunpeng	70cc5f7969	[bugfix]fix rope_forward_triton error (#6404 ) ### What this PR does / why we need it? The rope_forward_triton method reports an error. For example: ``` (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True) (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/triton/rope.py", line 155, in rope_forward_triton (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] cos = cos.view(num_tokens, -1) (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] ^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] RuntimeError: shape '[14, -1]' is invalid for input of size 768 ``` This is because an incorrect num_tokens_padded was passed in. Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 Co-authored-by: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-01-30 14:09:00 +08:00
Yizhou	ac963f1519	[Fix] Adds CUDA graph stats to execution state (#6331 ) ### What this PR does / why we need it? Adds a CUDA graph profiling stats field to the execution state and updates the NPU model runner to set, unpack, and forward those stats during execution. This preserves CUDA graph metrics across state transitions, improving observability for later use and diagnostics. ### Does this PR introduce _any_ user-facing change? Enable this by setting ```python llm = LLM( ... disable_log_stats=False, cudagraph_metrics=True, ... ) ``` or by passing `--cudagraph-metrics`, and make sure log stats are not disabled. After this, you should see output like the following, which is helpful for light debugging: ``` [loggers.py:257] Engine 000: Avg prompt throughput: 32.3 tokens/s, Avg generation throughput: 114.4 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0% [cuda_graph.py:117] CUDAGraph Config Settings: [cuda_graph.py:117] [cuda_graph.py:117] - Mode: FULL_DECODE_ONLY [cuda_graph.py:117] - Capture sizes: [1, 2, 4, 8, 16, 24, 32] [cuda_graph.py:117] [cuda_graph.py:117] CUDAGraph Stats: [cuda_graph.py:117] [cuda_graph.py:117] \| Unpadded Tokens \| Padded Tokens \| Num Paddings \| Runtime Mode \| Count \| [cuda_graph.py:117] \|-----------------\|---------------\|--------------\|--------------\|-------\| [cuda_graph.py:117] \| 4 \| 4 \| 0 \| FULL \| 18 \| [cuda_graph.py:117] \| 5 \| 5 \| 0 \| NONE \| 1 \| [cuda_graph.py:117] \| 1 \| 1 \| 0 \| FULL \| 1 \| [cuda_graph.py:117] \| 18 \| 18 \| 0 \| NONE \| 1 \| ``` ### How was this patch tested? None. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-28 16:34:20 +08:00
Wang Kunpeng	c498cea22d	[refactor] refactor excute_model and _dymmy_run method (#6043 ) ### What this PR does / why we need it? The structure of the `execute_model` and `_dummy_run` methods in NPUModelRunner differs greatly from that in GPUModelRunner. To align with GPUModelRunner, this PR: splits the `_prepare_inputs` method into `_prepare_inputs`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_preprocess`; renames `_generate_process_reqs_hidden_states` to `_model_forward`; and aligns the implementation of the `postprocess` phase. Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 Co-authored-by: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-27 22:27:01 +08:00
TMC	41eb71d665	[Refactor] profiler config optimze (#6141 ) ### What this PR does / why we need it? This PR optimizes the torch_npu profiler configuration to significantly reduce overhead and trace file size. The key changes include: Enable Data Simplification: Explicitly sets data_simplification=True in _ExperimentalConfig. This filters out unnecessary intermediate data during profiling, drastically reducing the memory footprint and I/O overhead. Use Lightweight Stack Tracing: Replaces with_stack with with_modules when torch_profiler_with_stack is enabled. In torch_npu, with_stack introduces heavy latency. with_modules provides equivalent semantic information with much lower overhead. Code Simplification: Removes redundant parameter configurations in _ExperimentalConfig by utilizing default values, making the codebase cleaner and easier to maintain. Test setup: max length = 50, profiler + stack enabled Before optimization: Profiler data size: 651 MB Generate time: 3 seconds After optimization: Profiler data size: 156 MB (≈76% reduction) Generate time: <1 second ### Does this PR introduce _any_ user-facing change? No API changes. Users profiling on Ascend will experience faster profiling execution and smaller trace files when stack tracing is enabled. ### How was this patch tested? Manually verified on Ascend NPU by running vLLM with the profiler enabled. Confirmed that trace files are generated correctly containing necessary stack/module info, while showing the reported reduction in size and time. - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: mengchengTang <745274877@qq.com>	2026-01-27 22:09:50 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors: https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅ Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅ Upgrade vllm commit to: 0120 (148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors: https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅ Upgrade vllm commit to: 0121 (f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 5. ✅ Upgrade vllm commit to: 0122 (8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 6. ✅ Upgrade vllm commit to: 0123 (dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
Jingchun Gao	b390e0ef78	[Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (#5416 ) - Fixed the computing of final hidden_states when enabling pipeline parallel and prefill context parallel at the same time. Only in the last PP rank, hidden_states are required and have right tensor type. - Fixed the shape of intermediate_tensors in the dummy_run when enabling pipeline parallel and flashcomm1. The intermediate_tensors should be divided by tp_size. Otherwise, the moe will raise issues. - Fixed the shape of self.intermediate_tensors for sufficient slice space - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>	2026-01-26 16:53:07 +08:00
ChenCangtao	1645546661	[bugfix][npugraph_ex]fix static kernel uninstall issue (#6128 ) ### What this PR does / why we need it? The static kernel in torch_npu is uninstalled through Python's atexit mechanism. However, in vllm-ascend, when inference ends or the service stops, the worker process is terminated. Terminating the process this way does not trigger the atexit mechanism, so the static kernel is never unloaded. When the npugraph_ex backend is used with the static kernel enabled, we register a signal handler to explicitly unload the static kernel. When there are many static kernels, unloading usually takes some time, whereas vllm kills the process directly after sending a terminate event. Therefore, we handle the unload by starting a new process. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-26 15:03:18 +08:00
Canlin Guo	65289676b4	[Refactor] Separate `_prepare_inputs` to `_prepare_inputs` and `_preprocess` (#6191 ) ### What this PR does / why we need it? Align with upstream vLLM. This PR helps downstream vLLM-Omni reduce the cost of maintaining `_prepare_inputs`. It also makes the vLLM-Ascend code more readable. In the future, we can follow vLLM more closely. The `preprocess` logic is the same as in GPUModelRunner, so we no longer need to maintain it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-26 14:05:23 +08:00
wangxiyuan	4e3919e965	Reapply "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) (#6231 ) This reverts commit `95649344aa`. The CI failure isn't related to this change. Let's reapply it. - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-26 09:04:54 +08:00
Li Wang	63adbedb7a	[Worker] Implement update max_model_len interface for NPUWorker (#6193 ) ### What this PR does / why we need it? This patch adds the `update_max_model_len` interface. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:03:33 +08:00
wangxiyuan	95649344aa	Revert "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) This reverts commit `8966a99710`. It breaks the test `tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-25 15:25:38 +08:00
LICO67373	8966a99710	[Refactor] Unify full-graph parameter update logic (#6041 ) ### What this PR does / why we need it? Refactor: Unify full-graph parameter update logic This PR consolidates the scattered full-graph parameter update logic into a unified approach, improving code architecture and eliminating duplication. Key improvements: 1. Unified interface - Create `update_full_graph_params` as the single entry point for all full-graph updates - Replace multiple scattered update calls with one unified function - Remove ~50 lines of duplicated if-else logic across `model_runner_v1.py` and `eagle_proposer.py` 2. Better architecture - Move update logic to respective Backend classes (`AscendAttentionBackend`, `AscendMLABackend`) - Each Backend manages its own parameter update logic internally - Simplify caller code to just dispatch to the appropriate Backend 3. Cleaner parameter handling - Remove unnecessary `pcp_size` and `dcp_size` parameter passing - Get parallel configuration directly from distributed groups - Consistent with how other parts of the codebase obtain these values Why we need it: - Maintainability: Future changes only need to be made in one place per Backend - Code quality: Follows DRY principle and Single Responsibility Principle - Readability: Cleaner, more intuitive code structure ### Does this PR introduce _any_ user-facing change? No. This is a pure refactoring with no functional changes - same behavior, cleaner code. ### How was this patch tested? - All existing unit tests pass with updated mocks - No new tests needed (pure refactoring, no behavior changes) - CI validates correctness --- - vLLM version: v0.13.0 Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-24 20:12:57 +08:00
Angazenn	019a2fe6e6	[Eagle3]enhance skipping dp allreduce and add it into eagle proposer (#6192 ) ### What this PR does / why we need it? This PR: 1. Enhances the logic of `_skip_all_reduce_across_dp_group` to skip all cpu dp allreduce for dense models. This also serves purpose 2. 2. Adds `_skip_all_reduce_across_dp_group` into eagle_proposer. Models like Qwen3-235b now support eagle3 spec decode. A typical setting for these moe models on pd disaggregation often introduces `dp_size > 1`. This requires `set_forward_context` to call a cpu dp allreduce to retrieve `num_tokens_across_dp` in all cases. Skipping this allreduce greatly improves performance. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-24 11:29:42 +08:00
LI SHENGYONG	8210a62a44	[EPLB][Bugfix]Reduce unnecessary video memory usage (#6020 ) ### What this PR does / why we need it? 1.Incorporate the warm up of the EPLB into the profile run. 2.Reusing the same gather buffer ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen3-235b aime baseline \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| eplb The OOM issue does not occur. \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-23 14:21:13 +08:00
simplzyu	f8d03d21f1	Add Medusa speculative decoding support for vllm_ascend (#5668 ) ### What this PR does / why we need it? `vllm_ascend` already supports several speculative decoding strategies such as MTP, EAGLE, N-gram, and suffix decoding. However, Medusa is not yet supported. Medusa is an efficient speculative decoding framework that leverages a lightweight draft model to propose multiple tokens in a single step, which can significantly improve decoding throughput and reduce latency. To enable Medusa-based speculative decoding on Ascend hardware and provide more decoding options for users, this PR adds Medusa support into the `vllm_ascend` speculative decoding pipeline. ### Does this PR introduce _any_ user-facing change? This PR introduces Medusa speculative decoding as an additional speculative decoding method: ✔ Adds `MedusaProposer` and integrates it into the speculative decoding registry ✔ Extends `SpecDcodeType` with a `MEDUSA` enum entry ✔ Updates `NPUModelRunner` to recognize and invoke Medusa during decoding ✔ Adds Medusa-specific handling in the draft token generation logic ✔ Ensures backward compatibility — Medusa is only used when explicitly enabled Key code changes include: * New file: `vllm_ascend/spec_decode/medusa_proposer.py` * Register Medusa in `get_spec_decode_method` * Extend proposer type hints to include `MedusaProposer` * Add a Medusa-specific branch in `generate_draft_token_ids` * Pass `sample_hidden_states` required by Medusa ### How was this patch tested? Medusa is implemented as a new proposer class (`MedusaProposer`) following the existing speculative decoding interface. The integration works as follows: 1. Users enable Medusa via the speculative decoding configuration. 2. `get_spec_decode_method()` returns a `MedusaProposer` instance when `method="medusa"`. 3. During decoding, `NPUModelRunner` detects that the active drafter is a `MedusaProposer`. 4. Instead of the generic speculative decoding path, the Medusa-specific `generate_token_ids()` method is invoked, which consumes: * `valid_sampled_token_ids` * `sampling_metadata` * `spec_decode_metadata` * `sample_hidden_states` 5. The proposed tokens are validated by the target model as usual. When Medusa is not enabled, the decoding pipeline behaves exactly as before, ensuring full backward compatibility. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: simplzyu <191163281@qq.com> Signed-off-by: simplzyu <zhenyuguo@cmbchina.com>	2026-01-23 14:14:23 +08:00
ZYang6263	418a43e2a2	[Bugfix] Fix seq_lens reset issue causing performance degradation (#6158 ) ### What this PR does / why we need it? Now `seq_lens` was not being reset correctly after each step due to missing code that clears the sequence lengths. As a result, when processing a smaller batch after a larger batch, the `seq_lens` from the larger batch was still carried over. This caused the attention operator to compute using an unnecessarily larger sequence length, leading to an increased computation load and performance degradation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: ZYang6263 <zy626375@gmail.com>	2026-01-23 11:29:54 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
Angazenn	1d3544c887	[BugFix]converting pa get_workspace back to capturing (#5833 ) ### What this PR does / why we need it? This fixes a bug in pa get_workspace. In the earlier implementation, we used `_npu_paged_attention_get_workspace` in `_update_pa_attn_params`. However, this might cause memory problems because the API dynamically allocates new memory for the workspace each time it is called. Therefore, we move this back to capturing and use a fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get the max workspace. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Angazenn <supperccell@163.com>	2026-01-22 15:49:22 +08:00
ChenCangtao	38edfd585a	[bugfix][npugraph_ex]fix the model output type issue caused by manually modify FX graph (#6015 ) ### What this PR does / why we need it? When using the full_decode_only mode, the vllm framework will still use the torch.fx.passes.split_module.split_module API to process the corresponding GraphModule of the model. However, the output of this API may cause the output of the fx graph to no longer be a tuple, and torch.compile enforces strict checks on this. Previously, we manually modified the fx graph, which introduced an abnormality in the model output type. In this PR, we switched to using PyTorch's native API to modify the FX graph, and removed the code that was previously added to handle output type anomalies. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-22 04:35:06 +00:00
zhaomingyu13	34fb628248	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#6097 ) According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, debugging showed that the tensor parallel size (tp) of the Eagle model is the same as that of the target model, so the tp setting for the draft model did not take effect as expected. Note: This feature has not been combined and tested with `sp` and `dp`. It will be adapted later. No ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, }, ) outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` Fixes vllm-project/vllm#31345 ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-22 11:36:23 +08:00
Qiu	58ff465821	[bugfix] fix the complex and potentially problematic generate_kv_idx. (#5957 ) ### What this PR does / why we need it? In long-sequence scenarios, the chunked-prefill component may encounter dimension misalignment issues, which previously occurred during precision testing on the code_generate_lite dataset. This PR removes redundant computations and instead derives the value using existing results and straightforward calculations. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-21 14:21:02 +08:00
shiyuan680	cea48c2a34	model runner v2 support triton of penalty (#5854 ) ### What this PR does / why we need it? Optimizes operator performance and adds unit tests. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested on qwen2.5 7b vl; op time improved by 90%. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` This PR is for https://github.com/vllm-project/vllm-ascend/issues/5208 Signed-off-by: shiyuan680 <917935075@qq.com>	2026-01-20 12:26:05 +00:00
weiguihua2	5892455f43	[Bugfix] fix bug of pcp+mtp+async scheduler (#5994 ) ### What this PR does / why we need it? Fixed the issue where the PCP and MTP services could not be started due to asynchronous scheduling. After the pcp, mtp, and asynchronous scheduling functions are enabled, the service is suspended because of a shape mismatch after a curl request is sent. This PR resolves this issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-20 15:24:05 +08:00
LICO67373	687df88151	[Refactor] Move AttentionSpec initialization to Attention module (#5834 ) ### What this PR does / why we need it? This PR refactors `get_kv_cache_spec` method to delegate AttentionSpec creation to each attention module's own `get_kv_cache_spec()` method, aligning with the vllm source code structure. Changes: - Simplify `get_kv_cache_spec` in `model_runner_v1.py` and `cpu_offload_connector.py` - Remove manual `AttentionType` checks for `Attention` modules - Delegate spec creation to each attention module's `get_kv_cache_spec` method directly - Let `MambaBase` layers use their own `get_kv_cache_spec` method - Keep `use_sparse` hack for `MLAAttention` (DeepSeek DSA mode) as Ascend-specific handling This change follows RFC #5463 item 12: move AttentionSpec to Attention module. - Fixes #5463 (item 12) ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring that simplifies code structure without changing any external behavior. ### How was this patch tested? - Syntax validation passed via `python -m py_compile` - CI tests will verify the changes work correctly with existing test cases - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-19 14:22:18 +08:00
meihanc	9cad1a8349	[Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (#5928 ) ### What this PR does / why we need it? Migrate the torch profiler configuration from deprecated environment variables (`VLLM_TORCH_PROFILER_DIR`, `VLLM_TORCH_PROFILER_WITH_STACK`, `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`) to the explicit `ProfilerConfig` object, aligning with vLLM's configuration best practices. The profiler environment variable approach is deprecated in vLLM and will be removed in v0.14.0 or v1.0.0. ### Does this PR introduce _any_ user-facing change? Yes. Developers who want to collect profiler data should use `--profiler-config` instead of `VLLM_TORCH_PROFILER_DIR`. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-19 09:27:55 +08:00
Song Zhixin	2b6dc100b5	Eagle3 mm support, enablement on qwen3vl (#4848 ) ### What this PR does / why we need it? Following PR [https://github.com/vllm-project/vllm/pull/20788](https://github.com/vllm-project/vllm/pull/20788), this adds Eagle3 mm support and enables it on qwen3vl. Target model: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct). Eagle3 model: [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv vLLM with eagle3: ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{ "method": "eagle3", "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3", "num_speculative_tokens": 3 }' ``` vLLM without eagle3: ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images ``` bench: ``` vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: jesse <szxfml@gmail.com>	2026-01-19 08:58:07 +08:00
Magnus	e8bbf72867	[Bugfix] Fix XliteModelRunner init failed when aclgraph is enabled (#5899 ) ### What this PR does / why we need it? Fix XliteModelRunner init failed when aclgraph is enabled. Ensure function graph_capture of vllm.v1.worker.gpu_model_runner is replaced. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: changdawei1 <changdawei3@huawei.com>	2026-01-15 15:40:28 +08:00
LI SHENGYONG	da958ee386	[EPLB]Eplb Config Renaming (#5533 ) ### What this PR does / why we need it? 1. Rename num_iterations_eplb_update to expert_heat_collection_interval. 2. Rename num_wait_worker_iterations to algorithm_execution_interval. 3. Rename init_redundancy_expert to num_redundant_experts because the variable with the same meaning in vLLM is named this way. 4. Delete gate_eplb because we don't need this feature. 5. Move eplb config into a dict in additional config. 6. Depend on pr5817 ### Does this PR introduce _any_ user-facing change? before this pr： `--additional-config '{"dynamic_eplb":true, "num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150, "init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'` after this pr: `--additional-config '{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000, "algorithm_execution_interval":150,"num_redundant_experts": 16, "expert_map_path": "xxx.json"}}'` ### How was this patch tested? #### test qwen3-235b eplb num_redundant_experts=16 without pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| with pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-15 10:26:44 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are broken by https://github.com/vllm-project/vllm/pull/32148 where vllm itself also fails https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
zhaomingyu13	01805fbd7d	Revert "[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 )"(#5902 ) This reverts commit `d886b81971`. it breaks pd function - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-01-14 20:55:10 +08:00
LICO67373	2a6d95c389	[Cleanup] Remove dead code make_attention_mask function (#5818 ) ### What this PR does / why we need it? This PR removes the unused `make_attention_mask` function from `vllm_ascend/worker/v2/attn_utils.py`. Why it's dead code: - After PR #4870 (attention mask unification refactor), attention mask generation has been centralized in the `AttentionMaskBuilder` singleton class - The mask is now generated directly by metadata builders when needed (e.g., `AscendAttentionMetadataBuilder`, `AscendMLAMetadataBuilder`) - The `make_attention_mask` function is no longer called anywhere in the codebase - The function's parameters (including `attn_mask` and `spec_attn_mask`) were also removed from `build_attn_metadata` in the same refactor Changes: - Remove `make_attention_mask` function (24 lines) from `vllm_ascend/worker/v2/attn_utils.py` ### Does this PR introduce _any_ user-facing change? No. This is a code cleanup that removes dead code. No user-facing behavior changes. ### How was this patch tested? - Verified that `make_attention_mask` is not called anywhere in the codebase (via `grep`) - CI tests pass to ensure no regressions - The function has been unused since PR #4870 was merged - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-14 16:52:51 +08:00
Ronald	e20813f441	[Feature] implement eagle spec decoding for model runner v2 (#5840 ) ### What this PR does / why we need it? This PR implements eagle spec decoding for model runner v2, please see RFC https://github.com/vllm-project/vllm-ascend/issues/5208 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.13.0 --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-01-14 09:18:05 +08:00
zhangxinyuehfad	f7b904641e	[Main2Main] Upgrade vllm commit to 0109 (#5752 ) ### What this PR does / why we need it? Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df) 1. remove `init_cached_hf_modules` due to https://github.com/vllm-project/vllm/pull/31786 2. fix spec_decode e2e test broken by https://github.com/vllm-project/vllm/pull/29821 3. fix `vllm.v1.attention.backends.utils` due to https://github.com/vllm-project/vllm/pull/31891 4. fix `self.seq_lens - query_lens` on same device due to https://github.com/vllm-project/vllm/pull/31773 5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-13 19:14:43 +08:00
Rozwel-dx	8d571286dd	[Refactor] Modify the binding logic to allocate CPU cores for each NPU card (#5555 ) [Refactor] Modify the binding logic to allocate CPU cores for each NPU card ### What this PR does / why we need it? Modify the binding logic to allocate CPU cores for each NPU card based on NUMA affinity, while isolating acl_thread/release_thread and other processes to prevent mutual interference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `c85cc045f8` Signed-off-by: rowzwel_dx <1392851715@qq.com> - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: Rozwel-dx <1392851715@qq.com>	2026-01-13 09:21:28 +08:00
zhaomingyu13	d886b81971	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 ) ### What this PR does / why we need it? According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, based on actual debugging, it was found that the tensor parallel size (tp) of the Eagle model is consistent with that of the target model. The setting of tp for the draft model did not take effect as expected. Note: This feature has not yet been tested in combination with `sp` and `dp`. It will be adapted later ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, }, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Fixes vllm-project/vllm#31345 Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-13 09:14:30 +08:00
LiuYi-Up	dde547e900	[Bugfix] bugfix for the order of dummy run pad and sync (#5777 ) ### What this PR does / why we need it? This PR addresses an issue in piecewise graph mode when Multi-Token Prediction (MTP) is enabled. Specifically, the original dummy run sequence performs the following steps in order: 1. Sync DP (input length = 1 + k) 2. Dispatch (input length = 1 + k, with padding==graph size) However, in the model execution phase, the sequence differs, resulting in: 1. Padding (input length = 1, with padding) 2. Sync DP (input length = 1 + k) 3. Dispatch (input length 1 + k != graph size 1 + k, with padding) This discrepancy leads to a mismatch between the input sizes used in the model execution and those expected by the dispatch graph, causing an inconsistency in graph size. This PR ensures that the dispatch graph size aligns correctly by modifying the sequence of operations during model execution to match the dummy run sequence, resolving the mismatch issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: LiuYi-UP <1150854440@qq.com>	2026-01-13 08:44:10 +08:00
gh924	6880c1b383	[Feature] Support for cross-attention and whisper model (#5592 ) ### What this PR does / why we need it? To solve the problem of the issue：https://github.com/vllm-project/vllm-ascend/issues/2262 - support for cross-attention when the model is encoder-decoder - support for whisper model - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: gh924 <guihao2@huawei.com> Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>	2026-01-11 11:38:45 +08:00
lilinsiman	c5744e2350	[main][bugfix] Fix fullgraph padding bug in mtp eagle refactor (#5692 ) ### What this PR does / why we need it? The condition for determining padding in the fullgraph overlay with MTP and PCP has been modified to accommodate corner cases where the shape capture size is manually specified. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut and tests - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lilinsiman <lilinsiman@gmail.com>	2026-01-10 23:07:48 +08:00
zhenwenqi2024	97f6be8108	[feature]dcp&pcp support mlapo (#5672 ) ### What this PR does / why we need it? mlapo in deepseek is a huge performance improvement in decode; this PR supports pcp & dcp with mlapo ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-08 23:49:23 +08:00
Yizhou	f4605c2b3c	[Fix] Fixes speculative decode indexing and unpad condition for attention metadata (#5626 ) ### What this PR does / why we need it? This addresses the issue brought up by #5356 and #4963, and we believe the unnecessary conditions are the root cause. Change the unpad trigger to be driven by actual size mismatches (num_reqs vs base_num_reqs or scheduled vs input token counts) rather than specific speculative-method flags. Then remove brittle workarounds that forced request counts and sliced query start locations. This prevents incorrect indexing and length mismatches during speculative decoding and makes metadata unpadding more robust across scheduling modes. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? Tested by existing cases. - vLLM version: v0.13.0 - vLLM main: `8be6432bda` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2026-01-08 19:41:08 +08:00
drslark	ccbc5e2ba1	[Feat][Bugfix][main] Adapted SP to eagle3 (#5562 ) ### What this PR does / why we need it? Adapted sp to eagle3. There may still be some problems, e.g., accuracy in some scenes, `sp`+`dp`... We will fix them later. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We tested it mainly in a new `e2e`. ```shell pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance ``` ```text . =============================== warnings summary =============================== <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============= 3 passed, 1 skipped, 2 warnings in 142.05s (0:02:22) ============= ``` It passed. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-08 15:33:52 +08:00
Icey	b94fc13d3f	[BugFix][Fusion] Fix graph fusion failure problem (#5676 ) Currently, the vllm pull request (https://github.com/vllm-project/vllm/pull/24252) is causing operator fusion to fail. This issue was previously fixed by patching the backend. The root cause has been identified, and the problem can be resolved with this pull request. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-07 18:42:55 +08:00
Mengqing Cao	3f4f2b4ae6	[Refactor] Import global var from vllm instead of overwriting it (#5469 ) ### What this PR does / why we need it? Import global var from vllm instead of overwriting it, so that we could use the correct global variable value - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2026-01-07 18:41:45 +08:00
LICO67373	380f089fbf	[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 ) ## What this PR does / why we need it? This PR fixes the `AttentionMaskBuilder` singleton initialization issue introduced in PR #4779 and removes the unused `pcp_prefill_mask` field. ### Background After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton` decorator, the class constructor now requires a `device` parameter. However, two initialization sites were still using the old parameterless constructor, causing failures. ### Changes 1. Fix singleton initialization - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendMLAMetadataBuilder.__init__()` - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendAttentionMetadataBuilder.__init__()` 2. Remove unused field - Removed `pcp_prefill_mask` field from `AscendPrefillContextParallelMetadata` (never used in codebase) - Updated related test assertions ### Related - Issue #5463 - PR #4779 (Unify all mask generation methods) - PR #5389 (Make AttentionMaskBuilder singleton) ## Does this PR introduce _any_ user-facing change? No. This is an internal refactoring. ## How was this patch tested? - ✅ Local testing: No linter errors - ✅ Unit tests for attention modules verified - ⏳ CI pipeline Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-07 17:09:52 +08:00
无脸男	1140789e83	[Bugfix] Fix the graph capture failure issue in the eagle3+full scenario. (#5553 ) ### What this PR does / why we need it? When launching the service in the scenario where the cudagraph_mode is set to FULL and Eagle3 acceleration is enabled for inference, an error in fia will cause graph capture to fail. This PR fixes the issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: WithHades <244036962@qq.com>	2026-01-07 15:57:16 +08:00

570 Commits