xc-llm-ascend

Author	SHA1	Message	Date
Angazenn	5e34c70ffc	[Misc] Removes unnecessary graph size re-initialization (#6280 ) ### What this PR does / why we need it? This PR removes `update_default_aclgraph_sizes`. In earlier versions, we add this function to change default `cudagraph_capture_sizes` because `_npu_paged_attention` degrades significantly on certain shapes (which is included in default `cudagraph_capture_sizes` of VLLM). Now since we use FIA as default attention op (which does not contain such performance degradation), there is no need to add this default change. Otherwise, it could cause some conflicts if we set a small `cudagraph_capture_sizes` that < 20 now. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-27 14:38:07 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors： https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors：https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
yuxinshan	0bb1f91c2c	[Feature] Mooncake connector get remote ptp size (#5822 ) ### What this PR does / why we need it? To support elastic scaling when using mooncake connector, we should support to configure different tp sizes for different nodes. As a result, we transfer the prefill node information, such as tp size, through the request's kv_transfer_params. The decode nodes get the prefill tp size through the request's kv_transfer_params, instead of getting it from the configuration of the mooncake connector . - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com>	2026-01-26 14:28:33 +08:00
LI SHENGYONG	611e223b7d	[EPLB][Bugfix] EPLB support fp/bf16 (#5531 ) ### What this PR does / why we need it? EPLB support dtype of fp/bf16. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? w8a8_dynamic Baseline: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| w8a8_dynamic eplb: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| The fp16 conversation is normal. The fp16 test is in progress. Baseline fp16 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| eplb fp16 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-26 14:28:16 +08:00
wangxiyuan	4e3919e965	Reapply "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) (#6231 ) This reverts commit `95649344aa`. The CI failure doesn't related to this change. Let's reapply it. - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-26 09:04:54 +08:00
wangxiyuan	95649344aa	Revert "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) This reverts commit `8966a99710`. It breaks the test `tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-25 15:25:38 +08:00
Shaoxu Cheng	fbae41697e	[310P]: refactoring for 310p kvcache and some ops class (#6117 ) ### What this PR does / why we need it? * Refactor the LayerNorm and activation operator classes to decouple the 310P device implementation from the main branch. * Refactor `mm_encoder_attention` on 310P to use the `torch_npu._npu_flash_attention_unpad` operator. * Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P so they are no longer padded to 16× alignment. * Refactor `model_runner` on 310P to align the KV-cache initialization logic with the mainline implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? use the e2e tests. - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-24 20:34:29 +08:00
LICO67373	8966a99710	[Refactor] Unify full-graph parameter update logic (#6041 ) ### What this PR does / why we need it? Refactor: Unify full-graph parameter update logic This PR consolidates the scattered full-graph parameter update logic into a unified approach, improving code architecture and eliminating duplication. Key improvements: 1. Unified interface - Create `update_full_graph_params` as the single entry point for all full-graph updates - Replace multiple scattered update calls with one unified function - Remove ~50 lines of duplicated if-else logic across `model_runner_v1.py` and `eagle_proposer.py` 2. Better architecture - Move update logic to respective Backend classes (`AscendAttentionBackend`, `AscendMLABackend`) - Each Backend manages its own parameter update logic internally - Simplify caller code to just dispatch to the appropriate Backend 3. Cleaner parameter handling - Remove unnecessary `pcp_size` and `dcp_size` parameter passing - Get parallel configuration directly from distributed groups - Consistent with how other parts of the codebase obtain these values Why we need it: - Maintainability: Future changes only need to be made in one place per Backend - Code quality: Follows DRY principle and Single Responsibility Principle - Readability: Cleaner, more intuitive code structure ### Does this PR introduce _any_ user-facing change? No. This is a pure refactoring with no functional changes - same behavior, cleaner code. ### How was this patch tested? - All existing unit tests pass with updated mocks - No new tests needed (pure refactoring, no behavior changes) - CI validates correctness --- - vLLM version: v0.13.0 Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-24 20:12:57 +08:00
Angazenn	019a2fe6e6	[Eagle3]enhance skipping dp allreduce and add it into eagle proposer (#6192 ) ### What this PR does / why we need it? This PR： 1. Enhances the logic of `_skip_all_reduce_across_dp_group` to skip all cpu dp allreduce for dense models. This is also for purpose 2. 2. Adds `_skip_all_reduce_across_dp_group` into eagle_proposer. Now models like Qwen3-235b supports eagle3 spec decode. A typical setting for these moe models on pd disaggregation often introduce `dp_size > 1`. This requires `set_forward_context` to call a cpu dp allreduce to retrieve `num_tokens_across_dp` on all cases. Skipping this allreduce greatly improves performance. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-24 11:29:42 +08:00
yjmyl	e90b14140b	[feature] add_rms_norm support bias (#5790 ) ### What this PR does / why we need it? This PR is to replace addRmsNorm and Add With addRmsNormBias. This way can lead to a more effecient result. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Full Test Pass - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com> Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>	2026-01-23 21:09:54 +08:00
Qiu	749e24f81e	[bugfix] align max_num_batched_tokens with tppcp when using FLASHCOMM1 (#6000 ) ### What this PR does / why we need it? Align max_num_batched_tokens with tppcp when using FLASHCOMM1 to avoid assert error in `NPUModelRunner._dummy_run`. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-23 14:19:49 +08:00
Cao Yi	a69ef10c3a	[Refactor] Quantization Module Refactor (#5738 ) ### Summary This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes. ### Key Changes #### 1. Modular Directory Structure \| Before \| After \| \|--------\|-------\| \| Flat file structure with mixed responsibilities \| Organized into `methods/` subpackage for schemes \| \| Single `quant_config.py` (600+ lines) \| Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` \| \| `utils.py` with scheme lookup logic \| `methods/registry.py` with decorator-based registration \| #### 2. Registry-Based Scheme Discovery Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern: ```python # Before: Manual dictionary mapping ASCEND_QUANTIZATION_METHOD_MAP = { "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...}, ... } # After: Decorator-based registration @register_scheme("W8A8_DYNAMIC", "linear") class AscendW8A8DynamicLinearMethod(AscendLinearScheme): ... ``` #### 3. Abstract Base Classes Introduced three abstract base classes in `methods/base.py`: - `AscendLinearScheme` - Base for linear layer quantization - `AscendMoEScheme` - Base for MoE layer quantization - `AscendAttentionScheme` - Base for attention layer quantization #### 4. Separated Config and Wrapper Classes - Config classes (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection - Wrapper classes (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes #### 5. Cleaner Public API ```python # New clean module interface from vllm_ascend.quantization import ( AscendModelSlimConfig, AscendCompressedTensorsConfig, ) from vllm_ascend.quantization.methods import get_scheme_class ``` ### Architecture Diagram ```mermaid classDiagram direction TB class QuantizationConfig { <<vLLM Interface>> +get_quant_method() } class AscendModelSlimConfig { +quant_description +get_quant_method() -create_scheme_for_layer() } class AscendCompressedTensorsConfig { +target_scheme_map +get_quant_method() -_get_scheme_from_parts() } class AscendLinearMethod { <<Wrapper>> +quant_method: AscendLinearScheme +create_weights() +apply() } class AscendFusedMoEMethod { <<Wrapper>> +quant_method: AscendMoEScheme +create_weights() +apply() } class AscendLinearScheme { <<Abstract>> +get_weight()* +apply()* +get_pertensor_param() +get_perchannel_param() } class AscendMoEScheme { <<Abstract>> +get_weight()* +get_dynamic_quant_param()* +apply()* } class W8A8DynamicLinear { +get_weight() +apply() } class W8A8DynamicMoE { +get_weight() +apply() } QuantizationConfig <\|-- AscendModelSlimConfig QuantizationConfig <\|-- AscendCompressedTensorsConfig AscendModelSlimConfig ..> AscendLinearMethod : creates AscendModelSlimConfig ..> AscendFusedMoEMethod : creates AscendCompressedTensorsConfig ..> AscendLinearMethod : creates AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates AscendLinearMethod o-- AscendLinearScheme : delegates to AscendFusedMoEMethod o-- AscendMoEScheme : delegates to AscendLinearScheme <\|-- W8A8DynamicLinear AscendMoEScheme <\|-- W8A8DynamicMoE ``` ### Scheme Registration Flow ```mermaid sequenceDiagram participant Module as Scheme Module participant Registry as _SCHEME_REGISTRY participant Config as QuantConfig participant Wrapper as Wrapper Class Note over Module: At import time Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear") Registry->>Registry: Store (quant_type, layer_type) -> Class Note over Config: At runtime Config->>Config: Determine quant_type from description Config->>Registry: get_scheme_class(quant_type, layer_type) Registry-->>Config: Return scheme class Config->>Config: scheme = scheme_cls() Config->>Wrapper: Create wrapper with scheme Wrapper-->>Config: Return wrapper instance ``` ### File Changes Summary \| Original Files \| Refactored Files \| \|----------------\|------------------\| \| `__init__.py` (empty) \| `__init__.py` (exports public API) \| \| `quant_config.py` \| `modelslim_config.py` + `wrappers.py` \| \| `compressed_tensors/` \| `compressed_tensors_config.py` \| \| `utils.py` \| `methods/registry.py` \| \| `w8a8_dynamic.py` \| `methods/w8a8_dynamic.py` \| \| `w8a8.py` \| `methods/w8a8_static.py` \| \| `w4a4_flatquant_dynamic.py` \| `methods/w4a4_flatquant.py` \| \| ... \| `methods/base.py` (new) \| ### Benefits 1. Extensibility: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator 2. Maintainability: Clear separation between config parsing, wrapper logic, and scheme implementation 3. Testability: Abstract base classes enable easier unit testing and mocking 4. Discoverability: Registry pattern makes it easy to list all supported schemes 5. Reduced Coupling: Config classes no longer need to know about all scheme implementations ___ - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-23 14:13:47 +08:00
dsxsteven	8378bc28b0	[Misc] Remove CP Redundant Variables after FIA operator enables for CANN 8.5 (#6013 ) ### What this PR does / why we need it? PCP/DCP splits the kv-cache onto different cards. After introducing the parameter cp-kv-cache-interleave-size, the first size tokens will be cached at Card 0, and so on. However, if there are too few tokens, some cards will not store the key-value pairs, resulting in values of 0, corrupted values, and precision issues. Currently, additional operations are introduced to avoid this precision problem. After we integrate FIA operator in mla_cp._forward_decode and CANN updates to 8.5.0, we now can remove these additional operations. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? passed all CI by CANN 8.5.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: dsxsteven <dsxsteven@sina.com> Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2026-01-23 14:13:12 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
anon189Ty	7725314b26	[Feat] Merge the multi eagle graphs to one graph (#5940 ) ### What this PR does / why we need it? This PR merge all steps of draft model in fullgraph mode, to avoid the synchronize between each graph, reduce the bubble time. #### Key ideas: - The "model forward" of the step 0 (first step) and remaining steps are captured together as a "Callable", rather than capturing each model individually. - "update_attn_params" is moved outside the entire graph, meaning that all "attn_metadata" required by all steps are constructed before "replay", and the "attn_params" of all steps are updated at once. - Remove synchronization between the main model graph and draft model graph. #### Key params/functions: - params: draft_attn_metadatas, attn_metadata_multi_steps, slot_mapping_group - functions: _run_merged_draft, attn_update_stack_num_spec_norm, update_attn_params, _propose, dummy_run ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2026-01-23 08:37:02 +08:00
Bai Yongbin	7f91ac2649	[CP&SP] Integrate FIA operator in mla_cp._forward_decode (#5641 ) ### What this PR does / why we need it? Replace the npu_multi_head_latent_attention with FIA operator in mla_cp.py _forward_decode. Adjust mla_attn_dpc_pcp in acl_graph.py ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Signed-off-by: tongyuzhou <t00886357@china.huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: tongyuzhou <t00886357@china.huawei.com>	2026-01-22 20:02:30 +08:00
CodeCat	1402cf6874	[Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (#5721 ) ### What this PR does / why we need it? This PR builds upon PR https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present—ensuring the fusion pattern is correctly registered and matched into the computation graph. For validation, we switched to the Qwen3-235B-A22B-W8A8 model for QKVNormRopeWithBias and Qwen3-32B model for QKVNormRope . Benchmark results show that, compared to the unfused baseline, enabling this fusion pass significantly improves inference throughput for W8A8 quantized models. For more details can refer to the RFC:https://github.com/vllm-project/vllm-ascend/issues/4715 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` llm = LLM( model=model, tensor_parallel_size=GPUs_per_dp_rank, enforce_eager=False, enable_expert_parallel=enable_expert_parallel, trust_remote_code=trust_remote_code, gpu_memory_utilization=0.98, max_num_batched_tokens=512, # load_format="dummy", max_model_len=2048, max_num_seqs=16, quantization="ascend", additional_config={ "refresh": True, "enable_npugraph_ex": True }, compilation_config={ "cudagraph_capture_sizes": [8, 16], "cudagraph_mode": "FULL_DECODE_ONLY", }, ) if profile_dir: llm.start_profile() outputs = llm.generate(prompts, sampling_params) if profile_dir: llm.stop_profile() for i, output in enumerate(outputs): if i >= 5: break prompt = output.prompt generated_text = output.outputs[0].text print( f"DP rank {global_dp_rank}, Prompt: {prompt!r}, " f"Generated text: {generated_text!r}" ) ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-22 17:22:41 +08:00
zhaomingyu13	34fb628248	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#6097 ) According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, based on actual debugging, it was found that the number of tensor parallelisms (tp) of the Eagle model is consistent with that of the target model. The setting of tp for the draft model did not take effect as expected. Note: This feature has not been superimposed and tested with `sp` and `dp`. It will be adapted later No ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B" "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, }, ) outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` Fixes vllm-project/vllm#31345 ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-22 11:36:23 +08:00
LICO67373	12a668b1d9	[Refactor] AttentionBuilder inherit from base class in vllm (#5916 ) ### What this PR does / why we need it? This PR makes `AscendMLAMetadataBuilder` and `AscendSFAMetadataBuilder` properly inherit from the base class `MLACommonMetadataBuilder` in vllm by adding `super().__init__()` calls. Changes: - Add `super().__init__()` call in `AscendMLAMetadataBuilder.__init__()` - Add `super().__init__()` call in `AscendSFAMetadataBuilder.__init__()` - Extract `ascend_chunked_prefill_workspace_size()` to `vllm_ascend/attention/utils.py` to avoid code duplication - Override `determine_chunked_prefill_workspace_size()` to support Ascend-specific 128k tokens workspace size (vs 64k in parent class) - Update unit tests to mock parent class `__init__` for proper isolation Why we need it: - Follow proper Python inheritance patterns by calling `super().__init__()` - Reduce code duplication by reusing parent class initialization logic - Better maintainability as parent class changes will be automatically inherited Part of issue #5463 item 10 ### Does this PR introduce _any_ user-facing change? No, this is an internal refactoring that does not change any user-facing behavior. Signed-off-by: lico67373 <918688502@qq.com>	2026-01-21 10:45:45 +08:00
drslark	b2475099a0	[main][Bugfix] Fixed an problem related to embeddings sharing (#5967 ) ### What this PR does / why we need it? Cancel the embeddings sharing when the embeddings of main model and the embeddings of eagle model are different. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Cause i don't have `Meta-Llama-3.1-8B-Instruc`t locally, i commented it and run: ```shell pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance ``` The output is fine: ```text . ======================================================================================================================== warnings summary ========================================================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ====================================================================================================== 3 passed, 1 skipped, 2 warnings in 196.19s (0:03:16) ======================================================================================================= ``` - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-20 21:34:28 +08:00
ChenCangtao	6c30f8bf87	[Feature]refactor the npugraph_ex config, support online-infer with static kernel (#5775 ) ### What this PR does / why we need it? This is a part of https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762 1. refactor the npugraph_ex config，modified the default configuration of the static kernel, new default value of static kernel is false 2. support online-infer with static kernel 3. fixed the issue where manually modifying FX graphs caused an abnormal model return type, and removed the related redundant code. ### Does this PR introduce _any_ user-facing change? yes，the new config of npugraph_ex is as follow: ``` additional_config={ "npugraph_ex_config": { "enable": True, "enable_static_kernel": False } } ``` ### How was this patch tested? ``` vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \ --host 0.0.0.0 \ --port 8004 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --quantization ascend \ --seed 1024 \ --served-model-name deepseek_v3 \ --enable-expert-parallel \ --max-num-seqs 48 \ --max-model-len 40000 \ --async-scheduling \ --max-num-batched-tokens 9000 \ --trust-remote-code \ --no-enable-prefix-caching \ --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \ --gpu-memory-utilization 0.9 \ --compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --additional-config \ '{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}' ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-20 21:31:38 +08:00
weiguihua2	5892455f43	[Bugfix] fix bug of pcp+mtp+async scheduler (#5994 ) ### What this PR does / why we need it? Fixed the issue where the PCP and MTP services could not be started due to asynchronous scheduling. After the pcp, mtp, and asynchronous scheduling functions are enabled, the service is suspended because of a shape mismatch after a curl request is sent. This PR resolves this issue. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-20 15:24:05 +08:00
zhangxinyuehfad	a5b099c73d	[Bugfix] Reset incompatible config (#6005 ) ### What this PR does / why we need it? This PR introduces compatibility fixes for running vLLM on Ascend NPU hardware. The changes ensure that GPU-specific parameters are automatically detected and reset to Ascend-compatible values with appropriate warnings logged. \| Module \| Parameter \| Default Value \| \|--------\|-----------\|---------------\| \| Model Config \| `disable_cascade_attn` \| `False` \| \| Parallel Config \| `all2all_backend` \| `"allgather_reducescatter"` \| \| Cache Config \| `cpu_kvcache_space_bytes` \| `None` \| \| MultiModal Config \| `mm_encoder_attn_backend` \| `None` \| \| Observability Config \| `enable_layerwise_nvtx_tracing` \| `False` \| \| Scheduler Config \| `max_num_partial_prefills` \| `1` \| \| Speculative Config \| `quantization` \| `None` \| \| KV Transfer Config \| `kv_buffer_size` \| `1e9` \| \| KV Transfer Config \| `enable_permute_local_kv` \| `False` \| \| Attention Config \| `use_prefill_decode_attention` \| `False` \| \| Attention Config \| `use_cudnn_prefill` \| `False` \| \| Attention Config \| `use_trtllm_ragged_deepseek_prefill` \| `False` \| \| Attention Config \| `use_trtllm_attention` \| `False` \| \| Attention Config \| `disable_flashinfer_prefill` \| `False` \| \| Attention Config \| `disable_flashinfer_q_quantization` \| `False` \| \| Attention Config \| `flash_attn_version` \| `None` \| \| Attention Config \| `backend` \| `None` \| \| Attention Config \| `flash_attn_max_num_splits_for_cuda_graph` \| `32` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-20 11:02:38 +08:00
aipaes	f58e110afe	【feat】switch for fusion ops gmmswigluquant (#5992 ) ### What this PR does / why we need it? Set a additional config parameter to control whether the gmmswigluequant fuseion operator is enabled; it is enabled by True. / When enabled with a small number of GPUs, the gmmswigluquant fused operator can cause some performance degradation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` #### Perf test model: GLM 4.6(w8a8) - single A3 node(ep16, tp16), async-scheduling, mtp, FULL_DECODE_ONLY - bs=1, input_lens=32000, ouput_lens=1024 Without this PR: TPOT 32.22.ms With this PR: TPOT 30.23ms --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-01-19 13:19:25 +00:00
LI SHENGYONG	83de5385b4	[EPLB][Bugfix] policy_swift_balancer bugfix and renaming (#5897 ) ### What this PR does / why we need it? 1. Rename dynamic_ep to default_eplb. 2. Rename dynamic_ep_v2 to swift_balancer 3. Discard func compose_expert_update_info_bipartite. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 05:47:40 +00:00
meihanc	9cad1a8349	[Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (#5928 ) ### What this PR does / why we need it? Migrate the torch profiler configuration from deprecated environment variables (`VLLM_TORCH_PROFILER_DIR`, `VLLM_TORCH_PROFILER_WITH_STACK`, `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`) to the explicit `ProfilerConfig` object, aligning with vLLM's configuration best practices. The profiler environment variable approach is deprecated in vLLM and will be removed in v0.14.0 or v1.0.0. ### Does this PR introduce _any_ user-facing change? yes, for deverlopers who want to fetch profiler, he should use `--profiler-config` instead of `VLLM_TORCH_PROFILER_DIR` ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-19 09:27:55 +08:00
LI SHENGYONG	bc1f6713e7	[EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (#5933 ) ### What this PR does / why we need it? 1. Move the logic of expert mapping forward to prevent shotgun changes 2. Disable the update of expert map. ### How was this patch tested? a2 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| GPQA_diamond \| 53064e \| accuracy \| gen \| 73.23 \| a3 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 09:24:25 +08:00
LI SHENGYONG	9fed2636cb	[EPLB][Nightly][Bugfix] Get expert from moe layer only (#5908 ) ### What this PR does / why we need it? 1. If the model has dense layers, the current code will attempt to obtain the routing experts of the dense layers, which will cause an error. This should be fixed by modifying the code to skip the dense layers when obtaining the routing experts. 2. The global_expert_map that the function directly outputs a affects the performance of dsv3.2. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? DeepSeek V3.1 conversation is normal. #### aime precision test (dsv3.1) baseline without eplb \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 66.67 \| eplb \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 70.00 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 09:23:28 +08:00
Song Zhixin	2b6dc100b5	Eagle3 mm support, enablement on qwen3vl (#4848 ) ### What this PR does / why we need it? follow pr [https://github.com/vllm-project/vllm/pull/20788](https://github.com/vllm-project/vllm/pull/20788) , Eagle3 mm support, enablement on qwen3vl target model [Qwen/Qwen3-VL-8B-Instruct]([https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct]) eagle3 [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv vLLM with eagle3 : ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{ "method": "eagle3", "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3", "num_speculative_tokens": 3 }' ``` vLLM without eagle3 : ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images ``` bench: ``` vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: jesse <szxfml@gmail.com>	2026-01-19 08:58:07 +08:00
Jade Zheng	22f253142a	[Feature] Support fine-grained shared expert overlap (#5482 ) Fine-grained control over shared expert overlap to prevent resource contention. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2026-01-17 11:53:22 +08:00
rjg-lyh	3af91e5ac4	[Bugfix] Fix the input constraints checks for the mlapo and bmm_transpose operators (#5764 ) ### What this PR does / why we need it? This PR fix the input constraints checks for the mlapo and bmm_transpose operators. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` ### Perf 64K/3K，1P1D，bs=32 before this pr: TPOT 29ms, TTFT 47s，TPS 606 token/s after this pr: TPOT 29ms, TTFT 48s，TPS 636 token/s Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-01-16 09:52:48 +00:00
LI SHENGYONG	da958ee386	[EPLB]Eplb Config Renaming (#5533 ) ### What this PR does / why we need it? 1. Rename num_iterations_eplb_update to expert_heat_collection_interval. 2. Rename num_wait_worker_iterations to algorithm_execution_interval. 3. Rename init_redundancy_expert to num_redundant_experts because the variable with the same meaning in vLLM is named this way. 4. Delete gate_eplb because we don't need this feature. 5. Move eplb config into a dict in additional config. 6. Depend on pr5817 ### Does this PR introduce _any_ user-facing change? before this pr： `--additional-config '{"dynamic_eplb":true, "num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150, "init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'` after this pr: `--additional-config '{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000, "algorithm_execution_interval":150,"num_redundant_experts": 16, "expert_map_path": "xxx.json"}}'` ### How was this patch tested? #### test qwen3-235b eplb num_redundant_experts=16 without pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| with pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-15 10:26:44 +08:00
Zetong Li	ea01aeaab7	[Refactor][EAGLE] 4/N extract common methods from eagle and mtp (#5870 ) ### What this PR does / why we need it? This PR aims to extract common methods from eagle_proposer and mtp_proposer. This is a small step towards merging eagle and mtp. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-15 10:24:35 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are caused by https://github.com/vllm-project/vllm/pull/32148 where vllm is also failed https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reachs https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed, moving all kv_transfer realtaed codes into a dedicated folder, which has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
zhaomingyu13	01805fbd7d	Revert "[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 )"(#5902 ) This reverts commit `d886b81971`. it breaks pd function - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-01-14 20:55:10 +08:00
LI SHENGYONG	ecf2fa482e	[EPLB][Bugfix] Get expert map from layers (#5817 ) ### What this PR does / why we need it? The initialization method of expert_map used by the eplb module is different from that used by the fused_moe module. This PR deletes the expert_map initialization method used by the eplb module to make the initialization methods consistent. #### before bugfix self._expert_map=tensor([64, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,62, 63], device='npu:1', dtype=torch.int32) self.shared_dict["expert_maps"][0]=tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]], dtype=torch.int32) ### How was this patch tested? #### qwen3-235B-w8a8 aime \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-14 09:16:51 +08:00
drslark	48ec97821a	[Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816 ) ### What this PR does / why we need it? Fixed an accuracy problem when using eagle3 with sp. The problem is described in https://github.com/vllm-project/vllm-ascend/issues/5825. It also adds a much more precise way to determine whether drafter should use `sp` or not. Also, it changes the `eager` of drafter to be a real `eager` in frontend to avoid a `fx-graph` problem. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? For simpilicity, we test it as in https://github.com/vllm-project/vllm-ascend/issues/5825. And we get the same result of `eagle3` with `sp` disabled. ```text -------------------------------------------------- total_num_output_tokens: 1000 num_drafts: 437 num_draft_tokens: 1311 num_accepted_tokens: 564 mean acceptance length: 2.29 -------------------------------------------------- acceptance at token 0: 0.62 acceptance at token 1: 0.40 acceptance at token 2: 0.27 acceptance at token 3: 0.00 acceptance at token 4: 0.00 acceptance at token 5: 0.00 ``` * vLLM version: v0.13.0 * vLLM main: `2f4e6548ef` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-14 09:00:37 +08:00
zhangxinyuehfad	f7b904641e	[Main2Main] Upgrade vllm commit to 0109 (#5752 ) ### What this PR does / why we need it? Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df) 1. remove `init_cached_hf_modules ` due to https://github.com/vllm-project/vllm/pull/31786 2. fix spec_decode e2e test due to https://github.com/vllm-project/vllm/pull/29821 break 3. fix `vllm.v1.attention.backends.utils` duo to https://github.com/vllm-project/vllm/pull/31891 4. fix `self.seq_lens - query_lens` on same device due to https://github.com/vllm-project/vllm/pull/31773 5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-13 19:14:43 +08:00
Rozwel-dx	8d571286dd	[Refactor] Modify the binding logic to allocate CPU cores for each NPU card (#5555 ) [Refactor] Modify the binding logic to allocate CPU cores for each NPU card ### What this PR does / why we need it? Modify the binding logic to allocate CPU cores for each NPU card based on NUMA affinity, while isolating acl_thread/release_thread and other processes to prevent mutual interference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `c85cc045f8` Signed-off-by: rowzwel_dx <1392851715@qq.com> - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: Rozwel-dx <1392851715@qq.com>	2026-01-13 09:21:28 +08:00
zhaomingyu13	d886b81971	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 ) ### What this PR does / why we need it? According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, based on actual debugging, it was found that the number of tensor parallelisms (tp) of the Eagle model is consistent with that of the target model. The setting of tp for the draft model did not take effect as expected. Note: This feature has not been superimposed and tested with `sp` and `dp`. It will be adapted later ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B" "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, }, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Fixes vllm-project/vllm#31345 Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-13 09:14:30 +08:00
gh924	6880c1b383	[Feature] Support for cross-attention and whisper model (#5592 ) ### What this PR does / why we need it? To solve the problem of the issue：https://github.com/vllm-project/vllm-ascend/issues/2262 - support for cross-attention when the model is encoder-decoder - support for whisper model - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: gh924 <guihao2@huawei.com> Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>	2026-01-11 11:38:45 +08:00
zxr2333	78b554dda9	[P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722 ) ### What this PR does / why we need it? Add new function to mooncake layerwise connector, including: 1. supports sparse attention, for DeepSeek-V3.2 2. Distribute transfer tasks to redundant kv_head cards This PR is related to [[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache Layerwise Push Support](https://github.com/vllm-project/vllm-ascend/issues/4842) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-01-10 23:04:16 +08:00
wangxiaoteng888	aa987ffe87	[P/D][bugfix]Fix the PCP port mapping error issue (#5706 ) ### What this PR does / why we need it? Fix the PCP port mapping error issue.In a multi-node PD separation scenario, when the PCP feature is enabled, there is an issue with the ZMQ transmission port. Specifically, the IP and port received by Side D do not match. The cause of this issue is an error in the port mapping update strategy logic. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-10 22:43:52 +08:00
fems14	ff4c1a47b3	[bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751 ) ### What this PR does / why we need it? 1.Fixed memory retention on certain GPUs caused by missing PUT operations. 2.Fixed performance degradation resulting from architectural incompatibilities in the underlying refactor. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: fems14 <1804143737@qq.com>	2026-01-09 17:46:23 +08:00
zzhxxx	64d29875f9	[Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698 ) ### What this PR does / why we need it? Based on the Sharded-CP feature PR:https://github.com/vllm-project/vllm-ascend/pull/4702; RFC:https://github.com/vllm-project/vllm/issues/30055 This PR officially integrates Deepseek V3.2's DSA-CP support on the basis of https://github.com/vllm-project/vllm-ascend/pull/4702, improving inference efficiency and scalability under mixed prefill-decode workloads. The main improvements include: - Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for TP=1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-09 15:58:40 +08:00
Chenxi Qian	40eb3e1836	[OP] Enable custom op aclnnMoeInitRoutingCustom (#5332 ) ### What this PR does / why we need it? This PR enables custom op `aclnnMoeInitRoutingCustom` introduced in PR #5251 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com> Signed-off-by: zzzzwwjj <1183291235@qq.com> Co-authored-by: zzzzwwjj <1183291235@qq.com>	2026-01-09 09:35:18 +08:00
zhenwenqi2024	97f6be8108	[feature]dcp&pcp support mlapo (#5672 ) ### What this PR does / why we need it? mlapo in deepseek is a huge performance improvement in decode, this pr support pcp & dcp with mlapo ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-08 23:49:23 +08:00
drslark	ccbc5e2ba1	[Feat][Bugfix][main] Adapted SP to eagle3 (#5562 ) ### What this PR does / why we need it? Adapted sp to eagle3. There may still be some problems, e.g., accuracy in some scenes, `sp`+`dp`... We will fix them later. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We tested it mainly in a new `e2e`. ```shell pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance ``` ```text . =============================== warnings summary =============================== <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ============= 3 passed, 1 skipped, 2 warnings in 142.05s (0:02:22) ============= ``` It passed. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-08 15:33:52 +08:00
zzhxxx	f7db812ed7	[refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181 ) ### What this PR does / why we need it? - Delete the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED` - Introduce layer_sharding as a configurable feature in additional_config - Revise the term "shared weight" to "shard weight." Configuration : The feature is opt-in via the additional_config argument: ``` --additional-config '{ "layer_sharding": ["o_proj", "q_b_proj"] }' ``` This is orthogonal to standard tensor parallelism and weight replication strategies. It is treated as a separate, explicit feature.It can be used in any scenario, combined with the flashcomm2https://github.com/vllm-project/vllm-ascend/pull/3232 feature or the ShardedCP #4702 feature, to achieve significant performance. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-08 09:05:02 +08:00

1 2 3 4 5 ...

604 Commits