xc-llm-ascend

Author	SHA1	Message	Date
zhangyiming	56d8f088dd	[Doc] Update DeepSeek-V3.2 tutorial, add single-node and multi-node deployment (#6196 ) ### What this PR does / why we need it? [Doc] Update DeepSeek-V3.2 tutorial, add single-node and multi-node deployment - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: menogrey <1299267905@qq.com>	2026-01-24 11:29:07 +08:00
zhaomingyu13	2dd68652bc	[Doc] Add the setting description of cudagraph_capture_sizes in speculative decoding user guide (#5637 ) ### What this PR does / why we need it? Add the setting description of cudagraph_capture_sizes, guide users to avoid the common mistakes frequently made when using the EAGLE overlay fullgraph. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No need for testing - vLLM version: v0.13.0 - vLLM main: `8be6432bda` --------- Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-01-23 23:22:44 +08:00
UnifiedCacheManager	a2f022f9b6	[UCMConnector]Add has_connector_metadata (#6172 ) ### What this PR does / why we need it? ucm_connector adds the `has_connector_metadata` interface to adapt to the latest KV connector in vLLM. ### Does this PR introduce _any_ user-facing change? This PR doesn't introduce _any_ user-facing change. ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>	2026-01-23 21:16:48 +08:00
lhchg	717d299ae5	[BugFix]bug fix for dispatch_ffn_combine (#6156 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Some synchronization logic of the fusion operator copies EP * expertPerRank int32 values. This part of data contains synchronization signals and data. The 512B DataBlock of Ascend A3 writes all data in the same block atomically to the HBM. For the DeepSeek model, when expertPerRank per device is 16, the 512B alignment is met in both 16-device single-node and 32-device two-node scenarios. Therefore, we check the first position of each 512B data. If the value is not 0, it indicates that the current 512B data has been sent. However, for other cases where expertPerRank per device is not 16, EP * expertPerRank does not meet the 512B alignment. If the above logic is used for checking, there will be problems. Therefore, here we will pad the EP * expertPerRank data length to the length aligned to 512B. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: lhchg <lhao_cheng@163.com> Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>	2026-01-23 21:14:18 +08:00
drslark	44a4ff6960	[main][BugFix] Avoided a bug of `torch_npu.npu_mm_reduce_scatter_base` when sp size >= 16 (#6168 ) ### What this PR does / why we need it? If `sp` is enabled and `tp_size` >= 16, `torch_npu.npu_mm_reduce_scatter_base` raises an exception. After consulting with the operator developer, we learned that the operator does not work when `tp` = 16. So, we disable the operator when `tp` = 16. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We started a server with `sp` enabled and `tp` = 16. It started successfully. ```text (APIServer pid=1855938) INFO: Started server process [1855938] (APIServer pid=1855938) INFO: Waiting for application startup. (APIServer pid=1855938) INFO: Application startup complete. ``` - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-23 21:12:23 +08:00
yjmyl	e90b14140b	[feature] add_rms_norm support bias (#5790 ) ### What this PR does / why we need it? This PR replaces addRmsNorm and Add with addRmsNormBias, which is more efficient. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Full Test Pass - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com> Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>	2026-01-23 21:09:54 +08:00
starmountain1997	6c73b88dd6	[CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115 ) ### What this PR does / why we need it? This PR enables FLASHCOMM1 communication optimization with layer sharding for DeepSeek-V3.2 W8A8 model testing to validate PR #5702. The changes include: 1. Enable FLASHCOMM1: Set VLLM_ASCEND_ENABLE_FLASHCOMM1=1 to improve performance for distributed inference 2. Add layer sharding: Configure layer_sharding: ["q_b_proj", "o_proj"] 3. Update baselines: Adjust performance baselines to reflect the improvements from FLASHCOMM1 and layer sharding ### Does this PR introduce _any_ user-facing change? No. This is a CI/test-only change that enables new communication optimization features for testing purposes. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-01-23 19:48:37 +08:00
baxingpiaochong	8786412f5c	[Bugfix]KV pool rank 0 consumes more HBM (#6113 ) ### What this PR does / why we need it? before add_set_device <img width="2354" height="674" alt="image" src="https://github.com/user-attachments/assets/8b81ab5f-b9ba-4fd2-8546-8f36ac15d32b" /> after <img width="1044" height="156" alt="image" src="https://github.com/user-attachments/assets/996d845a-8abd-4aae-b894-4a9832b1f742" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: baxingpiaochong <771405853@qq.com>	2026-01-23 19:47:33 +08:00
jiangyunfan1	bdf65e6bd3	[TEST]Add mooncake common method for tests (#6194 ) ### What this PR does / why we need it? This PR adds mooncake common method to conftest, we need it to add more test cases later ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? by running a test - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>	2026-01-23 17:14:15 +08:00
Angazenn	1e116829ac	[doc]update --max-num-seqs in Qwen3-235b tutorial (#6197 ) ### What this PR does / why we need it? This PR updates --max-num-seqs in the Qwen3-235b single-node-deployment tutorial to ensure that graph mode is entered correctly. - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: Angazenn <supperccell@163.com>	2026-01-23 17:11:10 +08:00
Li Wang	af4dbb6b26	[CI] Use nginx for package cache to speed up CI (#6170 ) ### What this PR does / why we need it? Use nginx for package cache to speed up CI - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-23 16:56:16 +08:00
weiguihua2	4173255c0c	[main][Bugfix] fix kv pcp+pooling+pd separation bug (#6153 ) ### What this PR does / why we need it? Fix a bug in the scenario that combines pcp, pd separation, and kv pooling. In the pooling scenario, multi_nodes_meta_mapping is empty. As a result, an error is reported when the remote_host information is obtained through the get_remote_port_send_num method. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-23 16:15:04 +08:00
zhaomingyu13	ff63626874	[Bugfix] Fix the issue of the acceptance rate decline for Qwen3-30B-A3B-EAGLE3 (#6138 ) ### What this PR does / why we need it? Due to the long-term lack of synchronization with the upstream code, a problem that led to a decrease in the acceptance rate of the Qwen3-30B-A3B-EAGLE3 draft model was introduced when fixing the bug (#5967). This PR synchronizes with upstream and fixes the bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) # Create an LLM. llm = LLM( model="Qwen/Qwen3-30B-A3B", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "AngelSlim/Qwen3-a3B_eagle3", "num_speculative_tokens": 3, }, ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarkblood@qq.com>	2026-01-23 16:12:56 +08:00
wjunLu	a3079cd253	[Tests] Skip unstable eagle cases to keep CI success (#6180 ) ### What this PR does / why we need it? The test case `tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance` fails occasionally; its results appear unstable with the `eagle` method, for example: [tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance](https://github.com/vllm-project/vllm-ascend/actions/runs/21249578476/job/61147453980?pr=6151) This PR skips the `eagle` tests to keep CI passing - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-23 15:33:53 +08:00
SILONG ZENG	78af0c30a3	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #12 ) (#6177 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/ops/triton/activation/swiglu_quant.py` \| \| `vllm_ascend/ops/triton/batch_invariant/matmul.py` \| \| `vllm_ascend/ops/triton/batch_invariant/mean.py` \| \| `vllm_ascend/ops/triton/batch_invariant/rmsnorm.py` \| \| `vllm_ascend/ops/triton/fla/chunk.py` \| \| `vllm_ascend/ops/triton/fla/chunk_delta_h.py` \| \| `vllm_ascend/ops/triton/fla/chunk_o.py` \| \| `vllm_ascend/ops/triton/fla/chunk_scaled_dot_kkt.py` \| \| `vllm_ascend/ops/triton/fla/cumsum.py` \| \| `vllm_ascend/ops/triton/fla/fused_qkvzba_split_reshape.py` \| \| `vllm_ascend/ops/triton/fla/l2norm.py` \| \| `vllm_ascend/ops/triton/fla/layernorm_guard.py` \| \| `vllm_ascend/ops/triton/fla/sigmoid_gating.py` \| \| `vllm_ascend/ops/triton/fla/solve_tril.py` \| \| `vllm_ascend/ops/triton/fla/utils.py` \| \| `vllm_ascend/ops/triton/fla/wy_fast.py` \| \| `vllm_ascend/ops/triton/fused_gdn_gating.py` \| \| `vllm_ascend/ops/triton/layernorm_gated.py` \| \| `vllm_ascend/ops/triton/linearnorm/split_qkv_rmsnorm_rope.py` \| \| `vllm_ascend/ops/triton/mamba/causal_conv1d.py` \| \| `vllm_ascend/ops/triton/reject_sample.py` \| \| `vllm_ascend/ops/triton/rope.py` \| \| `vllm_ascend/ops/triton/spec_decode/utils.py` \| \| `vllm_ascend/ops/triton/triton_utils.py` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-23 14:59:19 +08:00
zhangxinyuehfad	193acc2c19	[CI] Add nightly ci test for deepseek v3.1 (#5386 ) ### What this PR does / why we need it? Add nightly ci test for deepseek v3.1 - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 14:36:49 +08:00
LI SHENGYONG	8210a62a44	[EPLB][Bugfix]Reduce unnecessary video memory usage (#6020 ) ### What this PR does / why we need it? 1. Incorporate the EPLB warm-up into the profile run. 2. Reuse the same gather buffer. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen3-235b aime baseline \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| eplb: The OOM issue does not occur. \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-23 14:21:13 +08:00
Qiu	749e24f81e	[bugfix] align max_num_batched_tokens with tppcp when using FLASHCOMM1 (#6000 ) ### What this PR does / why we need it? Align max_num_batched_tokens with tppcp when using FLASHCOMM1 to avoid assert error in `NPUModelRunner._dummy_run`. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-23 14:19:49 +08:00
simplzyu	f8d03d21f1	Add Medusa speculative decoding support for vllm_ascend (#5668 ) ### What this PR does / why we need it? `vllm_ascend` already supports several speculative decoding strategies such as MTP, EAGLE, N-gram, and suffix decoding. However, Medusa is not yet supported. Medusa is an efficient speculative decoding framework that leverages a lightweight draft model to propose multiple tokens in a single step, which can significantly improve decoding throughput and reduce latency. To enable Medusa-based speculative decoding on Ascend hardware and provide more decoding options for users, this PR adds Medusa support into the `vllm_ascend` speculative decoding pipeline. ### Does this PR introduce _any_ user-facing change? This PR introduces Medusa speculative decoding as an additional speculative decoding method: ✔ Adds `MedusaProposer` and integrates it into the speculative decoding registry ✔ Extends `SpecDcodeType` with a `MEDUSA` enum entry ✔ Updates `NPUModelRunner` to recognize and invoke Medusa during decoding ✔ Adds Medusa-specific handling in the draft token generation logic ✔ Ensures backward compatibility — Medusa is only used when explicitly enabled Key code changes include: * New file: `vllm_ascend/spec_decode/medusa_proposer.py` * Register Medusa in `get_spec_decode_method` * Extend proposer type hints to include `MedusaProposer` * Add a Medusa-specific branch in `generate_draft_token_ids` * Pass `sample_hidden_states` required by Medusa ### How was this patch tested? Medusa is implemented as a new proposer class (`MedusaProposer`) following the existing speculative decoding interface. The integration works as follows: 1. Users enable Medusa via the speculative decoding configuration. 2. `get_spec_decode_method()` returns a `MedusaProposer` instance when `method="medusa"`. 3. During decoding, `NPUModelRunner` detects that the active drafter is a `MedusaProposer`. 4. Instead of the generic speculative decoding path, the Medusa-specific `generate_token_ids()` method is invoked, which consumes: * `valid_sampled_token_ids` * `sampling_metadata` * `spec_decode_metadata` * `sample_hidden_states` 5. The proposed tokens are validated by the target model as usual. When Medusa is not enabled, the decoding pipeline behaves exactly as before, ensuring full backward compatibility. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: simplzyu <191163281@qq.com> Signed-off-by: simplzyu <zhenyuguo@cmbchina.com>	2026-01-23 14:14:23 +08:00
Cao Yi	a69ef10c3a	[Refactor] Quantization Module Refactor (#5738 ) ### Summary This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes. ### Key Changes #### 1. Modular Directory Structure \| Before \| After \| \|--------\|-------\| \| Flat file structure with mixed responsibilities \| Organized into `methods/` subpackage for schemes \| \| Single `quant_config.py` (600+ lines) \| Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` \| \| `utils.py` with scheme lookup logic \| `methods/registry.py` with decorator-based registration \| #### 2. Registry-Based Scheme Discovery Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern: ```python # Before: Manual dictionary mapping ASCEND_QUANTIZATION_METHOD_MAP = { "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...}, ... } # After: Decorator-based registration @register_scheme("W8A8_DYNAMIC", "linear") class AscendW8A8DynamicLinearMethod(AscendLinearScheme): ... ``` #### 3. Abstract Base Classes Introduced three abstract base classes in `methods/base.py`: - `AscendLinearScheme` - Base for linear layer quantization - `AscendMoEScheme` - Base for MoE layer quantization - `AscendAttentionScheme` - Base for attention layer quantization #### 4. Separated Config and Wrapper Classes - Config classes (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection - Wrapper classes (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes #### 5. Cleaner Public API ```python # New clean module interface from vllm_ascend.quantization import ( AscendModelSlimConfig, AscendCompressedTensorsConfig, ) from vllm_ascend.quantization.methods import get_scheme_class ``` ### Architecture Diagram ```mermaid classDiagram direction TB class QuantizationConfig { <<vLLM Interface>> +get_quant_method() } class AscendModelSlimConfig { +quant_description +get_quant_method() -create_scheme_for_layer() } class AscendCompressedTensorsConfig { +target_scheme_map +get_quant_method() -_get_scheme_from_parts() } class AscendLinearMethod { <<Wrapper>> +quant_method: AscendLinearScheme +create_weights() +apply() } class AscendFusedMoEMethod { <<Wrapper>> +quant_method: AscendMoEScheme +create_weights() +apply() } class AscendLinearScheme { <<Abstract>> +get_weight()* +apply()* +get_pertensor_param() +get_perchannel_param() } class AscendMoEScheme { <<Abstract>> +get_weight()* +get_dynamic_quant_param()* +apply()* } class W8A8DynamicLinear { +get_weight() +apply() } class W8A8DynamicMoE { +get_weight() +apply() } QuantizationConfig <\|-- AscendModelSlimConfig QuantizationConfig <\|-- AscendCompressedTensorsConfig AscendModelSlimConfig ..> AscendLinearMethod : creates AscendModelSlimConfig ..> AscendFusedMoEMethod : creates AscendCompressedTensorsConfig ..> AscendLinearMethod : creates AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates AscendLinearMethod o-- AscendLinearScheme : delegates to AscendFusedMoEMethod o-- AscendMoEScheme : delegates to AscendLinearScheme <\|-- W8A8DynamicLinear AscendMoEScheme <\|-- W8A8DynamicMoE ``` ### Scheme Registration Flow ```mermaid sequenceDiagram participant Module as Scheme Module participant Registry as _SCHEME_REGISTRY participant Config as QuantConfig participant Wrapper as Wrapper Class Note over Module: At import time Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear") Registry->>Registry: Store (quant_type, layer_type) -> Class Note over Config: At runtime Config->>Config: Determine quant_type from description Config->>Registry: get_scheme_class(quant_type, layer_type) Registry-->>Config: Return scheme class Config->>Config: scheme = scheme_cls() Config->>Wrapper: Create wrapper with scheme Wrapper-->>Config: Return wrapper instance ``` ### File Changes Summary \| Original Files \| Refactored Files \| \|----------------\|------------------\| \| `__init__.py` (empty) \| `__init__.py` (exports public API) \| \| `quant_config.py` \| `modelslim_config.py` + `wrappers.py` \| \| `compressed_tensors/` \| `compressed_tensors_config.py` \| \| `utils.py` \| `methods/registry.py` \| \| `w8a8_dynamic.py` \| `methods/w8a8_dynamic.py` \| \| `w8a8.py` \| `methods/w8a8_static.py` \| \| `w4a4_flatquant_dynamic.py` \| `methods/w4a4_flatquant.py` \| \| ... \| `methods/base.py` (new) \| ### Benefits 1. Extensibility: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator 2. Maintainability: Clear separation between config parsing, wrapper logic, and scheme implementation 3. Testability: Abstract base classes enable easier unit testing and mocking 4. Discoverability: Registry pattern makes it easy to list all supported schemes 5. Reduced Coupling: Config classes no longer need to know about all scheme implementations ___ - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-23 14:13:47 +08:00
dsxsteven	8378bc28b0	[Misc] Remove CP Redundant Variables after FIA operator enables for CANN 8.5 (#6013 ) ### What this PR does / why we need it? PCP/DCP splits the kv-cache onto different cards. After introducing the parameter cp-kv-cache-interleave-size, the first size tokens will be cached at Card 0, and so on. However, if there are too few tokens, some cards will not store the key-value pairs, resulting in values of 0, corrupted values, and precision issues. Currently, additional operations are introduced to avoid this precision problem. After we integrate FIA operator in mla_cp._forward_decode and CANN updates to 8.5.0, we now can remove these additional operations. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? passed all CI by CANN 8.5.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: dsxsteven <dsxsteven@sina.com> Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2026-01-23 14:13:12 +08:00
ZYang6263	418a43e2a2	[Bugfix] Fix seq_lens reset issue causing performance degradation (#6158 ) ### What this PR does / why we need it? Previously, `seq_lens` was not reset correctly after each step because the code that clears the sequence lengths was missing. As a result, when processing a smaller batch after a larger batch, the `seq_lens` from the larger batch was still carried over. This caused the attention operator to compute using an unnecessarily large sequence length, leading to an increased computation load and performance degradation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: ZYang6263 <zy626375@gmail.com>	2026-01-23 11:29:54 +08:00
wangxiaoteng888	82a2b3bcc7	[P/D]Add ssl cert for metaserver proxy (#5875 ) ### What this PR does / why we need it? When the P node accesses the proxy metaserver, add the SSL certificate and the CA certificate path to improve security. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-23 11:11:44 +08:00
wjunLu	f4a361fcc3	[CI] Re-open skipped cases due to PTA upgrading and update the golden results (#6144 ) ### What this PR does / why we need it? Re-open `tests/e2e/singlecard/test_aclgraph_accuracy.py` and update its golden results to match PTA 2.9.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-23 10:46:31 +08:00
Li Wang	4d780a8b01	[Misc] Revert "[Misc] Bump mooncake version to v0.3.8.post1 (#6110 )" (#6164 ) ### What this PR does / why we need it? The new version of mooncake leads to an image build failure; see https://github.com/vllm-project/vllm-ascend/actions/runs/21236469259/job/61105443733. We should revert it first. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-23 09:53:32 +08:00
wjunLu	72ffc00b86	[Bugfix] Fix structured outputs errors: `TypeError: apply_token_bitmask_inplace_cpu()` (#6151 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/5524 - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-23 09:52:55 +08:00
zhangxinyuehfad	08a45e6053	[Doc] update supported features (#6165 ) ### What this PR does / why we need it? update supported features - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:50:11 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
lhchg	27a513b672	[BugFix]hccl bufferSize check for dispatch_ffn_combine (#6130 ) ### What this PR does / why we need it? dispatch_ffn_combine uses the hccl buffer as a shared buffer. If the hccl buffer is not large enough, the operator fails with "MTE out of range". This PR adds a check on the hccl buffer size; if it is not large enough, the message "hccl buffer is too small" is reported along with the expected size. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: lhchg <lhao_cheng@163.com>	2026-01-23 08:41:40 +08:00
anon189Ty	7725314b26	[Feat] Merge the multi eagle graphs to one graph (#5940 ) ### What this PR does / why we need it? This PR merges all steps of the draft model in fullgraph mode to avoid synchronization between graphs and reduce bubble time. #### Key ideas: - The "model forward" of step 0 (the first step) and of the remaining steps are captured together as a "Callable", rather than capturing each model individually. - "update_attn_params" is moved outside the entire graph, meaning that all "attn_metadata" required by all steps are constructed before "replay", and the "attn_params" of all steps are updated at once. - Remove synchronization between the main model graph and draft model graph. #### Key params/functions: - params: draft_attn_metadatas, attn_metadata_multi_steps, slot_mapping_group - functions: _run_merged_draft, attn_update_stack_num_spec_norm, update_attn_params, _propose, dummy_run ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>	2026-01-23 08:37:02 +08:00
Zetong Li	63d3921208	[Bugfix] Remove `use_aclgraph` in mtp_proposer and use `use_cuda_graph` (#6032 ) ### What this PR does / why we need it? This PR removes `use_aclgraph` from mtp_proposer and uses `use_cuda_graph` instead, the same as eagle_proposer. The reasons for these changes are described below. There is a scenario in which `use_aclgraph=True` while `use_cuda_graph=False`, e.g. when enabling `async_scheduling=True`. When using deepseek v3.2, `common_attn_metadata.num_input_tokens` is important and it should be the same as the `num_input_tokens` entering the model. In the above scenario, `use_aclgraph` accidentally pads `num_tokens` to `num_input_tokens`, coinciding with `common_attn_metadata.num_input_tokens`. But eager mode is triggered later, and padding is actually not needed. That means the code logic is incorrect even though the running output looks fine. However, `common_attn_metadata.num_input_tokens` should mean the `num_input_tokens` entering the model. So we should update `common_attn_metadata.num_input_tokens = num_input_tokens` after padding. Therefore, we can safely use the normal `use_cuda_graph` instead of the problematic `use_aclgraph`. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-22 21:08:07 +08:00
shaopeng-666	176bfc36bc	[BugFix] fix 3vl dense model load quant weight (#6100 ) ### What this PR does / why we need it? Fix the weight-loading error for Qwen3VL dense quantized models. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? The Qwen3VL quantized model service initialized successfully. Inference requests are processed correctly, and valid responses are returned. - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>	2026-01-22 20:05:25 +08:00
Bai Yongbin	7f91ac2649	[CP&SP] Integrate FIA operator in mla_cp._forward_decode (#5641 ) ### What this PR does / why we need it? Replace the npu_multi_head_latent_attention with FIA operator in mla_cp.py _forward_decode. Adjust mla_attn_dpc_pcp in acl_graph.py ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: 白永斌 <baiyongbin3@h-partners.com> Signed-off-by: Bai Yongbin <845473182@qq.com> Signed-off-by: tongyuzhou <t00886357@china.huawei.com> Co-authored-by: 白永斌 <baiyongbin3@h-partners.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: tongyuzhou <t00886357@china.huawei.com>	2026-01-22 20:02:30 +08:00
wjunLu	88632cf976	[CI][Doc] Upgrade wheel building's CANN to 8.5.0 and update the Docs (#6145 ) ### What this PR does / why we need it? Upgrade wheel building's CANN to 8.5.0 and update the Docs - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-22 19:50:54 +08:00
meihanc	e54d294df3	[CI]Install clang in Dockerfile for triton ascend (#4409 ) ### What this PR does / why we need it? Install clang in Dockerfile for triton ascend - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-22 19:01:28 +08:00
wjunLu	a7d781f135	[Main] Upgrade PTA to 2.9.0 (#6112 ) ### What this PR does / why we need it? Upgrade PTA to 2.9.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-22 17:59:06 +08:00
CodeCat	1402cf6874	[Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (#5721 ) ### What this PR does / why we need it? This PR builds upon PR https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present, ensuring the fusion pattern is correctly registered and matched into the computation graph. For validation, we switched to the Qwen3-235B-A22B-W8A8 model for QKVNormRopeWithBias and the Qwen3-32B model for QKVNormRope. Benchmark results show that, compared to the unfused baseline, enabling this fusion pass significantly improves inference throughput for W8A8 quantized models. For more details, refer to the RFC: https://github.com/vllm-project/vllm-ascend/issues/4715 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` llm = LLM( model=model, tensor_parallel_size=GPUs_per_dp_rank, enforce_eager=False, enable_expert_parallel=enable_expert_parallel, trust_remote_code=trust_remote_code, gpu_memory_utilization=0.98, max_num_batched_tokens=512, # load_format="dummy", max_model_len=2048, max_num_seqs=16, quantization="ascend", additional_config={ "refresh": True, "enable_npugraph_ex": True }, compilation_config={ "cudagraph_capture_sizes": [8, 16], "cudagraph_mode": "FULL_DECODE_ONLY", }, ) if profile_dir: llm.start_profile() outputs = llm.generate(prompts, sampling_params) if profile_dir: llm.stop_profile() for i, output in enumerate(outputs): if i >= 5: break prompt = output.prompt generated_text = output.outputs[0].text print( f"DP rank {global_dp_rank}, Prompt: {prompt!r}, " f"Generated text: {generated_text!r}" ) ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-22 17:22:41 +08:00
wangxiaoteng888	f2c0ced06d	[P/D][PCP]bugfix pcp force free twice caused logger error (#6124 ) ### What this PR does / why we need it? The issue of the D node mistakenly sending the pull-end signal twice, leading to the P node printing logger errors abnormally, has been resolved. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-22 16:24:33 +08:00
Angazenn	1d3544c887	[BugFix]converting pa get_workspace back to capturing (#5833 ) ### What this PR does / why we need it? This fixes a bug in pa get_workspace. In the earlier implementation, we used `_npu_paged_attention_get_workspace` in `_update_pa_attn_params`. However, this might cause memory problems, because calling this API dynamically allocates new memory for the workspace. Therefore, we move this back to capturing, and use a fixed `SEQ_LEN_WITH_MAX_PA_WORKSPACE` to get the max workspace. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Angazenn <supperccell@163.com>	2026-01-22 15:49:22 +08:00
Li Wang	484e7c59dc	[CI] optimize lint term (#5986 ) ### What this PR does / why we need it? This patch optimizes the lint check term. The main idea is to reduce unnecessary installation time. 1. Installing vllm is not required; appending the path of the vllm source to `PYTHONPATH` is sufficient. 2. Installing `requirements-dev.txt` is not required; we have a pre-built image `quay.io/ascend-ci/vllm-ascend:lint` with all the requirements installed in advance. NOTE: the conditions for triggering image builds are: 1) Daily scheduled build; 2) Build when requirements are modified; 3) Manual build. This ensures that the dependencies in our image are up-to-date to the greatest extent possible. 3. `mypy` was separated from the `pre-commit` hook for performance reasons; we found that integrating `mypy` into the `pre-commit` hook resulted in poor performance. 4. Reduce the CPU core consumption from 16 -> 8 ### Does this PR introduce _any_ user-facing change? The end-to-end lint time was reduced from 20 min per PR to 8 min per PR ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 15:46:59 +08:00
zhangxinyuehfad	9bba0a2a68	[Bugfix] Fix Triton operator usage for multimodal models based on the `mrope_interleaved` parameter (#6042 ) ### What this PR does / why we need it? When running the Qwen2.5-Omni-7B model on Ascend NPU, the engine fails during the profiling/warmup stage with the following error: `AclNN_Runtime_Error(EZ9903): rtKernelLaunchWithHandleV2 failed: 507035. The vector core execution is abnormal.` error log: https://github.com/vllm-project/vllm-ascend/actions/runs/21144534911/job/60806765393#step:17:6412 This error is specifically triggered by the `triton_mrope` kernel when handling the unique `mrope_section` configurations of the Omni model. Other multimodal models with standard sections (e.g., [16, 24, 24]) or standard LLMs work correctly with Triton. This PR modifies vllm_ascend/ops/rotary_embedding.py to add a conditional check before calling forward_triton. 1. For standard LLMs (mrope_interleaved = True), it continues to use Triton for acceleration. 2. For complex configurations (like Qwen2.5-Omni, mrope_interleaved = False), it now falls back to the native super().forward_oot() path, which uses the stable torch_npu or PyTorch implementation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-22 15:46:05 +08:00
ChenCangtao	38edfd585a	[bugfix][npugraph_ex]fix the model output type issue caused by manually modify FX graph (#6015 ) ### What this PR does / why we need it? When using the full_decode_only mode, the vllm framework will still use the torch.fx.passes.split_module.split_module API to process the corresponding GraphModule of the model. However, the output of this API may cause the output of the fx graph to no longer be a tuple, and torch.compile enforces strict checks on this. Previously, we manually modified the fx graph, which introduced an abnormality in the model output type. In this PR, we switched to using PyTorch's native API to modify the FX graph, and removed the code that was previously added to handle output type anomalies. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-22 04:35:06 +00:00
zhaomingyu13	34fb628248	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#6097 ) According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, debugging showed that the tensor parallel size (tp) of the Eagle model always matched that of the target model, so the tp setting for the draft model did not take effect as expected. Note: This feature has not yet been combined and tested with `sp` and `dp`. It will be adapted later No ```python from vllm import LLM, SamplingParams def main(): prompts = [ "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM( model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4, gpu_memory_utilization=0.9, enforce_eager=True, speculative_config={ "method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 3, }, ) outputs = llm.generate(prompts, sampling_params) print(f"Outputs: {outputs}") for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` Fixes vllm-project/vllm#31345 ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-22 11:36:23 +08:00
Li Wang	37a9cf818a	[Misc] Bump mooncake version to v0.3.8.post1 (#6110 ) ### What this PR does / why we need it? Since the mooncake has the newer [release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1), we pin the tag to latest release ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 11:03:16 +08:00
wangqiankun13	08d7014874	[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] (#5758 ) ### What this PR does / why we need it? Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with Operator `Dispatch`/`Combine`. > for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. A few days ago, I therefore implemented an interception that unconditionally disables Operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3`. [PR: 5293](https://github.com/vllm-project/vllm-ascend/pull/5293) However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator `DispatchGmmCombineDecode` will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. For more information about this operator, refer to the RFC issue: https://github.com/vllm-project/vllm-ascend/issues/5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested?
Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode ```shell nic_name="xxxx" local_ip="xxx.xxx.xxx.xxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export VLLM_ASCEND_ENABLE_FUSED_MC2=2 echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}" export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_BUFFSIZE=512 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \ --served-model-name "qwen" \ --host 0.0.0.0 \ --port 8004 \ --async-scheduling \ --tensor-parallel-size 4 \ --data-parallel-size 4 \ --max-num-seqs 64 \ --max-model-len 40960 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --quantization "ascend" \ --trust-remote-code \ --speculative_config \ '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ 2>&1 | tee qwen3_235b_eagle3.log ``` | dataset | version | metric | mode | vllm-api-stream-chat | | ----- | ----- | ----- | ----- | ----- | | aime2024 | 604a78 | accuracy | gen | 80.00 | - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-22 10:51:02 +08:00
JiangWeixiang	cef04b3555	[bugfix] adapt_remote_request_id (#6051 ) This PR addresses a request ID mismatch issue in the PD (Prefill-Decoding) separation deployment scenario for vllm-ascend. Upstream vLLM recently mitigated request ID collisions by appending a random suffix to each request_id (e.g., req-123 → req-123-abc), refer to [PR-27987](https://github.com/vllm-project/vllm/pull/27987 ) & [PR-29665](https://github.com/vllm-project/vllm/pull/29665). While this works in single-node deployments, it breaks compatibility in PD-separated setups: the Producer (Prefill node) and Consumer (Decoding node) end up with different request_id values, preventing the Consumer from correctly retrieving the KV cache generated by the Producer. To resolve this, this PR introduces a new field remote_request_id in the metadata passed via mooncake_connector. The Producer preserves and forwards the original (unmodified) request_id as remote_request_id. The Consumer then uses this remote_request_id—instead of its locally generated suffixed ID—to fetch the correct KV cache from the Prefill node. This ensures consistent request identification across PD nodes while maintaining compatibility with upstream vLLM’s request ID deduplication mechanism. <img width="1279" height="781" alt="image" src="https://github.com/user-attachments/assets/274238c1-dab6-4d3a-9ee4-6e578679b762" /> - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: ghphotoframe <854746559@qq.com> Co-authored-by: jiangweixiang <jwx02384838@antgroup.com>	2026-01-22 10:48:40 +08:00
maxmgrdv	ef9d8367f5	[Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method (#5143 ) Introduce W4A4 LAOS Quantization for better model compression and inference efficiency on Ascend devices. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-22 10:34:58 +08:00
zzhxxx	dd8571860d	[Feature] Support DSA-CP for Hybrid scenario (#5702 ) Signed-off-by: zzhx1 <zzh_201018@outlook.com> ### What this PR does / why we need it? > Extracted from PR #5513 Based on the Sharded-CP feature PR:#4702; RFC:https://github.com/vllm-project/vllm/issues/30055 ### Support FULL_DECODE_ONLY Mode under PD-Mixed Scenario: Extends DSA-CP to handle the FULL_DECODE_ONLY execution mode when running in a prefill-decode mixed (PD-mixed) serving environment, improving throughput and resource utilization for decode-intensive workloads. In pure prefill nodes: - Both q_proj and o_proj are sharded across world ranks, using broadcast for weights distribution. In PD-mixed nodes (supporting both prefill and decode): - q_proj is fully replicated (not sharded) to avoid communication overhead during decoding. - o_proj Using the original TP `RowParallelLinear` method to store weights During prefill execution: - o_proj forwards through all_gather to collect weights, reconstructing the complete o_proj weights on each card. During decode (graph replay phase): - Additional all_to_all (before o_proj) and reduce_scatter (after o_proj) are introduced to enable sequence-parallel output aggregation while maintaining correctness under SFA CP. ### benchmark: - TTFT increased by 527% - TPOT increased by 180% <img width="1550" height="938" alt="image" src="https://github.com/user-attachments/assets/9b7a03d8-a3db-4a99-8923-6e5bfcfecf72" /> ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Co-authored-by: clrs97 <524936896@qq.com>	2026-01-22 10:12:09 +08:00
wangxiyuan	69740039b7	[CI] Upgrade CANN to 8.5.0 (#6070 ) ### What this PR does / why we need it? 1. Upgrade CANN to 8.5.0 2. Move triton-ascend 3.2.0 to requirements. Note: we skipped the two failing e2e tests; see https://github.com/vllm-project/vllm-ascend/issues/6076 for more detail. We'll fix them soon. ### How was this patch tested? Closes: https://github.com/vllm-project/vllm-ascend/issues/5494 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-22 09:29:50 +08:00
Nengjun Ma	ab676413e6	Default enable MLAPO (#5952 ) ### What this PR does / why we need it? 1) Enable MLAPO by default for DeepSeek MLA Attention W8A8 models on the PD disaggregation D instance, for example: DeepSeekV3-W8A8, DeepSeek-R1-W8A8. 2) Enable MLAPO by default for DeepSeek SFA Attention W8A8 models, currently DeepSeek-V3.2-W8A8. ### Does this PR introduce _any_ user-facing change? Users no longer need to manually set VLLM_ASCEND_ENABLE_MLAPO=1 to enable the MLAPO feature for DeepSeek W8A8 models. The effect of enabling MLAPO for an SFA model deployed on a single A3 node: Test with: tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py, dataset: gsm8k-lite, without MTP, FULL GRAPH, shows a 19% improvement. Without MLAPO enabled by default: TTFT 14055.8836 ms; ITL 66.8171 ms; Output Token Throughput 104.9105 token/s. With MLAPO enabled by default: TTFT 3753.1547 ms; ITL 61.4236 ms; Output Token Throughput 125.2075 token/s. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-22 09:26:39 +08:00
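
As a rough cross-check of the figures quoted in the MLAPO commit (ab676413e6), the relative changes can be recomputed from the two sets of numbers. This is only an arithmetic sketch: the dictionary keys and the `relative_change` helper are illustrative names, not part of vllm-ascend, and the values are copied from the commit message as-is.

```python
# Benchmark figures copied from the MLAPO commit message
# (single A3 node, gsm8k-lite, no MTP, FULL GRAPH).
# Key names are illustrative, not taken from the test suite.
baseline = {"ttft_ms": 14055.8836, "itl_ms": 66.8171, "output_tok_per_s": 104.9105}
with_mlapo = {"ttft_ms": 3753.1547, "itl_ms": 61.4236, "output_tok_per_s": 125.2075}


def relative_change(before: float, after: float) -> float:
    """Return the signed percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


# Negative values mean the metric went down (good for latency),
# positive values mean it went up (good for throughput).
changes = {
    name: relative_change(baseline[name], with_mlapo[name]) for name in baseline
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

On these numbers, the "19%" in the commit message lines up with the output token throughput gain; TTFT and ITL both decrease.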

2545 Commits