xc-llm-ascend

Author	SHA1	Message	Date
Wangbei25	4f259d4fd8	[Performance]Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder (#7737 ) ### What this PR does / why we need it? Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder and add doc for DeepSeekOCR2.md ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vllm 0.18.0 - vllm-ascend main 1. _create_custom_4d_mask during 141ms49us620ns --> _create_npu_optimized_mask during 1ms227us780ns 2. convd2d : 27ms --> matmul <1ms 3. relposattention：sdpa->prompt_flash_attention --------- Signed-off-by: Wangbei25 <wangbei41@huawie.com> Signed-off-by: Wangbei25 <wangbei41@huawei.com> Co-authored-by: Wangbei25 <wangbei41@huawie.com>	2026-03-31 14:49:29 +08:00
zzzzwwjj	a40eee2ba1	[feat] support dispatch_v2/combine_v2 hierarchy communication (#7698 ) ### What this PR does / why we need it? This PR adds support for hierarchical communication for `dispatch_v2` and `combine_v2` MoE operations. This is achieved by introducing a new configuration `enable_mc2_hierarchy_comm`. When enabled, the communication algorithm is set to "hierarchy", which support mc2 op comm between two super pod. The changes include: - Adding `enable_mc2_hierarchy_comm` to `AscendConfig`. - Modifying `TokenDispatcherWithMC2` to pass `comm_alg: "hierarchy"` to the underlying `torch_npu` ops when the new config is enabled. - Adding validation to ensure that this feature is only used with compatible PTA/CANN versions and is not used with the conflicting `fused_mc2` op. - Updating `is_hierarchical_communication_enabled` to respect the new configuration flag. ### Does this PR introduce _any_ user-facing change? Yes, this PR introduces a new user-facing configuration option `enable_mc2_hierarchy_comm` in `additional_config` to enable hierarchical communication for MoE. ### How was this patch tested? - vLLM version: v0.18.0 Signed-off-by: zzzzwwjj <1183291235@qq.com>	2026-03-27 09:20:16 +08:00
Wang Kunpeng	0bab629f90	[v0.18.0][bugfix]fixed block_size incorrect setting issue in dsv3.2 (#7630 ) (#7652 ) ### What this PR does / why we need it? https://github.com/vllm-project/vllm/pull/35122 This PR in the vllm community refactors the update mode of block_size. As a result, when the user does not specify `--block-size`, dsv3.2 obtains an incorrect block_size. The root cause of the problem is analyzed from the block_size update process as follows: 1. In NPUPlatform, `check_and_update_config` calls `refresh_block_size` to set block_size to 128. 2. During Modelrunner initialization, the `self.block_size` parameter is generated. At this time, block_size is still 128. This parameter will be used for operations such as kvcache initialization. 3. `update_block_size_for_backend` updates block_size to the size set in attn_backend. The reason why the DSV3.2 is faulty is that it has an additional attn_backend `DeepseekV32IndexerBackend`, and this backend is not rewritten. The block_size obtained from attn_backend is 64. In this case, only `vllm_config.cache_config.block_size` is updated, and other parts are not modified. As a result, the block_size on the entire network is inconsistent. Modification solution: Skip `update_block_size_for_backend` and modify block_size only in the `check_and_update_config` method. In the future, the block_size update logic can be migrated to the `update_block_size_for_backend` method. Ensure that all block_size values on the entire network are updated. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.18.0 - vLLM main: `ed359c497a` --------- <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-03-26 22:38:28 +08:00
wangbj127	2ad0ca52a6	Qwen3.5 MoE supports flashcomm v1 (#7644 ) cherry pick from https://github.com/vllm-project/vllm-ascend/pull/7486 <!-- Thanks for sending a pull request! BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html --> ### What this PR does / why we need it? <!-- - Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. - Please clarify why the changes are needed. For instance, the use case and bug description. - Fixes # --> Multimodal models like Qwen3.5 MoE does embedding in model_runner, so when flash comm is enabled, the first AllGather operation should be skipped. ### Does this PR introduce _any_ user-facing change? <!-- Note that it means any user-facing change including all aspects such as API, interface or other behavior changes. Documentation-only updates are not considered user-facing changes. --> No. ### How was this patch tested? <!-- CI passed with new added/existing test. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> - vLLM version: v0.18.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Wangbingjie <wangbj1207@126.com> Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>	2026-03-25 23:09:33 +08:00
SILONG ZENG	1e3c1e76bf	[Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511 ) ### What this PR does / why we need it? This PR introduces several upstream `vllm`-aligned lint hooks into `vllm-ascend` and makes them part of the actual `pre-commit` flow. Main changes in this PR: - add `check-boolean-context-manager` to catch boolean expressions in `with` statements - add `check-forbidden-imports` to forbid direct `re` imports and disallowed direct `triton` imports - enable shell script linting through `tools/shellcheck.sh` - add root `.clang-format` aligned with upstream `vllm`, enable `clang-format` in `pre-commit`, temporarily exclude all `csrc/` from `clang-format` to avoid bringing a large native code reformat into this PR This PR focuses on landing the smaller and immediately useful lint alignment first, without mixing in the larger requirements-management migration. ### Does this PR introduce _any_ user-facing change? No. This PR only updates repository lint configuration, static checks, and internal import/style enforcement. It does not change runtime behavior or public interfaces. ### How was this patch tested? Tested locally in the project virtual environment. Commands used: ```bash bash format.sh ``` Verified checks passed: ``` bash ruff check...............................................................Passed ruff format..............................................................Passed codespell................................................................Passed typos....................................................................Passed clang-format.............................................................Passed Lint GitHub Actions workflow files.......................................Passed Lint shell scripts.......................................................Passed Lint PNG exports from excalidraw.........................................Passed Check for spaces in all filenames........................................Passed Enforce __init__.py in Python packages...................................Passed Check for forbidden imports..............................................Passed Check for boolean ops in with-statements.................................Passed Suggestion...............................................................Passed - hook id: suggestion - duration: 0s To bypass pre-commit hooks, add --no-verify to git commit. ``` note: clang-format is enabled but currently excludes all csrc/ - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-03-24 20:03:01 +08:00
realliujiaxu	5d12446573	[Feat][SP] Suport SP for VL MoE models (#7044 ) ### What this PR does / why we need it? 2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712, extend SP to VL MoE models. ### Does this PR introduce _any_ user-facing change? remove `sp_threshold` in additional config and reuse `sp_min_token_num` from vLLM. ### How was this patch tested? - Model: Qwen3-VL-30B-A3B, - TP4 DP2 - 100 reqs - max concurrency 1 \| Seq length \| Mean TTFT (ms) main \| Mean TTFT (ms) this PR \| \|------------\|---------------------\|------------------------\| \| 4k \| 429.40 \| 323.3 \| \| 16k \| 1297.01 \| 911.74 \| - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-24 17:16:00 +08:00
Shaoxu Cheng	83bd77c983	[310p]: add rmsnorm gated fallback and unit test (#7424 ) ### What this PR does / why we need it? RFC #7394 310P cannot use the fused `rmsnormgated` operator and must fall back to the native implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? ut - vLLM version: v0.17.0 - vLLM main: `4497431df6` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-24 09:00:11 +08:00
Shaoxu Cheng	5b60b530d6	[Bugfix][310p] the new A5 mmencoder op donot support 310p (#7518 ) ### What this PR does / why we need it? Because the new A5 MMEncoder operator was merged, the 310P can no longer run any VL models. This PR fixes that issue. details at #7046 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e - vLLM version: v0.17.0 - vLLM main: `8b6325758c` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-23 15:40:34 +08:00
Cao Yi	9e2965bae2	[Feature] Support Flash Comm V1 for VL models (with MLA) (#7390 ) ## Summary Flash Comm V1 (flashcomm1) was previously blocked for all VL models. Root cause: For VL models, `inputs_embeds` at layer 0 originates from the vision encoder as a full `[N, H]` tensor — it has not been reduce-scattered across TP ranks. The original MLA forward path assumed inputs were already scattered, producing wrong output shapes under TP > 1. Fix: - Detect at init time (statically, not via runtime shape checks) whether a layer is the first layer of a VL model (`is_vl_first_layer`) so dynamo treats the branch as a constant. - In `AscendMultiHeadLatentAttention.forward`, when `flashcomm1 + TP > 1 + is_vl_first_layer`, set `need_gather_q_kv=False` and pre-allocate output as `[N//tp_size, H]`. - Remove the platform-level assertion that prevented VL models from enabling Flash Comm V1. Other improvements: - `is_vl_model()` now uses vllm's canonical detection (`hf_config is not hf_text_config`) instead of fragile key-name checks, with the old checks kept as fallback. - Added `parse_layer_idx(prefix)` utility. - Added `maybe_chunk_residual` call in `AscendRMSNorm` before the add-rms-norm op. - Removed unnecessary CPU/fp32 round-trip in `AscendLearnable2DInterpPosEmbDivided_fixed.forward()`. - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: LoganJane <loganJane73@hotmail.com>	2026-03-22 21:05:28 +08:00
LI SHENGYONG	1e05c4908f	[EPLB] Reduce the memory used for batch_isend_irecv (#7344 ) ### What this PR does / why we need it? #6729 seems to reduce the NPU memory usage of eplb, but actually moves the buffer allocation of dist.all_gather_into_tensor to dist.batch_isend_irecv. Therefore, the overall NPU memory usage is not reduced. This PR completely reduces the memory usage in this part. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Remaining memory of each rank before the repair. <img width="649" height="99" alt="image" src="https://github.com/user-attachments/assets/52a67592-e0e8-4f9a-b194-b84cb848c598" /> Remaining memory of each rank after the repair. <img width="641" height="99" alt="image" src="https://github.com/user-attachments/assets/0bc2e67c-f328-4dea-98af-d7a459fb4876" /> Close EPLB. <img width="543" height="45" alt="image" src="https://github.com/user-attachments/assets/6dcba19d-4401-44b8-a6d3-c7b35ee983c7" /> Memory of weights for each rank. <img width="648" height="46" alt="image" src="https://github.com/user-attachments/assets/4db2fd04-98a0-4d26-a026-2e8287102b99" /> Estimated memory for EPLB: 15.68 / 48 (layer_num) + 2 * 0.02 = 0.35 GB - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-03-20 12:25:58 +08:00
pichangping	3f39ac9c8d	[Feature]Supports DSv3.1 PD separation and C8 quantization (#7222 ) Co-authored-by: kunpengW-code <1289706727@qq.com> Co-authored-by: linsheng1 <1950916997@qq.com> ### What this PR does / why we need it? Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD separation scenario. C8 refers to quantizing the KV cache to int8, which aims to reduce the GPU memory usage of the KV cache and improve the inference throughput. Constraints: 1. Only the PD separation mode can be used and MooncakeLayerwiseConnector can be used to run the model. 2. Currently, only the activation value supports dynamic quantization, and the KV cache supports static quantization. C8 quantization with MTP is not supported. You can use ModelSlim for quantization. The quantization procedure is as follows: pip install transformers==4.48.2 git clone https://gitcode.com/Ascend/msmodelslim.git cd msmodelslim bash install.sh cd example/DeepSeek/ python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> --anti_dataset../common/deepseek_anti_prompt_50_v3_1.json --calib_dataset../common/deepseek_calib_prompt_50_v3_1.json --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.17.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: pichangping <1337510399@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-03-16 22:49:05 +08:00
linfeng-yuan	5f3826b093	[Build] Add support for Ascend950 chip (#7151 ) ### What this PR does / why we need it? This PR adds support for the Ascend950 chip. This includes: - Updating build scripts (`CMakeLists.txt` and `setup.py`) to recognize the Ascend950 chip and set appropriate compilation flags. - Disabling a set of custom operators that are not yet supported on the Ascend950 hardware target. - Performing a codebase-wide refactoring of `pipe_barrier()` calls to the namespaced `AscendC::PipeBarrier<>()` for improved code consistency and adherence to the latest API standards. Ascend950DT e2e passed (Qwen3-32B-MXFP8) and CI passed - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2026-03-12 10:25:51 +08:00
zxr2333	239683c7a6	[P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (#7022 ) ### What this PR does / why we need it? Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By CI. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-03-10 23:59:20 +08:00
wanghuanjun2113	dec04ec8d8	[Bugfix] Fix incorrect layer count for MTP models in update_aclgraph_sizes (#7064 ) ## Summary - Fix incorrect layer count calculation for MTP (Multi-Token Prediction) models in `update_aclgraph_sizes()` function - For MTP models, the draft model's layer count is stored in `num_nextn_predict_layers` or `mtp_num_hidden_layers` (for Qwen3.5), not in the standard `num_hidden_layers` field - Directly accessing `draft.hf_config.num_hidden_layers` returns the main model's layer count instead of the MTP draft model's layer count ## Bug Description In `vllm_ascend/utils.py`, the `update_aclgraph_sizes()` function calculates `resources_per_graph` for speculative decoding scenarios. When calculating the resources needed for the draft model, the original code directly accessed: ```python resources_per_graph += draft.hf_config.num_hidden_layers + 1 ``` This works correctly for standard draft models, but fails for MTP models (like DeepSeek-V3's MTP or Qwen3.5's MTP) because: 1. MTP models store their layer count in model-specific fields: - `num_nextn_predict_layers` (DeepSeek-V3 MTP) - `mtp_num_hidden_layers` (Qwen3.5 MTP) 2. The `num_hidden_layers` field in these models contains the main model's layer count, not the MTP layer count 3. This leads to grossly overestimating the `resources_per_graph`, which in turn causes the calculated `max_batch_sizes` to be unnecessarily small ## Fix Use `draft.get_total_num_hidden_layers()` instead of directly accessing `draft.hf_config.num_hidden_layers`. This method correctly handles different model types through the `model_arch_config_convertor` infrastructure, returning the appropriate layer count for: - Standard draft models → `num_hidden_layers` - DeepSeek-V3 MTP → `num_nextn_predict_layers` - Qwen3.5 MTP → `mtp_num_hidden_layers` 🤖 Generated with [Claude Code](https://claude.com/claude-code) - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` Signed-off-by: wanghuanjun2113 <wanghuanjun2113@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 16:14:51 +08:00
aipaes	1c0ecf806a	[bugfix] fix pass bug: pass really rope dim for npu_rotary_embedding (#6880 ) ### What this PR does / why we need it? pass really rope dim for npu_rotary_embedding before： q_rope, k_rope = torch.ops.vllm.npu_rotary_embedding( positions, q_flat, k_flat, cos_sin_cache, self.head_dim, self.head_dim, True ) after： q_rope, k_rope = torch.ops.vllm.npu_rotary_embedding( positions, q_flat, k_flat, cos_sin_cache, self.head_dim, self.rope_dim, True ) ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Signed-off-by: aipaes <82140963+aipaes@users.noreply.github.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-03-06 19:35:17 +08:00
Shanshan Shen	a813eadd2d	[MM][Perf] Enable 2.7x faster for convolution computation with aclnn BatchMatMulV2 (#7017 ) ### What this PR does / why we need it? Currently, we are using `e2b31243c0/vllm/model_executor/layers/conv.py (L219-L232)` for convolution computation, which is used in patch embedding for VL models. After profiling, we find that this linear method will take about 6.87 ms, which is much slower than just using `F.conv3d()`. In `F.conv3d()`, it will call aclnn `BatchMatMulV2` with optimization on Ascend NPU, which only take about 2.50 ms and is 2.7x faster than linear method. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-03-06 14:26:37 +08:00
rjg-lyh	2bd9c35788	[perf][refactor] Refactor and optimize sfa_v1.py for dsv3.2/glm5 (#6874 ) ### What this PR does / why we need it? This PR refactors sfa_v1.py to improve code readability and usability, fixes a code bug, and enhances performance through the replacement of certain operators. ### changes - improve code readability: Optimizes parts of the code structure in sfa_v1.py, supplementary comments for key code blocks, removes some unused variables, and improves the naming of certain functions and variables. - resolved a duplicated double write to k_cache: Fixed redundant double writes of k_cache in the indexer_select module (in both the `forward` function and `indexer_select_post_process`), improving performance to some extent. - replace `scatter` ops with `reshape_and_cache`: This optimization replaces two separate cache storage operations on `k_nope` and `k_pe` with a single call to the `reshape_and_cache` operator, improving performance. The original `scatter` operator involves reordering slot_mapping for generality, introducing significant scalar computations. In contrast, the `reshape_and_cache` operator eliminates this redundant reordering step, thus reducing unnecessary computation time and enhancing the operator's performance. ### performance comparison 4A3, 1P1D, P dp2tp16, D dp8tp4, input/output: 64K/3K origin: TTFT: 28s, TPOT: 26ms, TPS: 820 token/s* fixed redundant double writes of k_cache: TTFT: 24s, TPOT: 26ms, TPS: 840 token/s replace scatter ops with reshape_and_cache: TTFT: 24s, TPOT: 26ms, TPS: 850 token/s ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-03-05 14:27:11 +08:00
Ronald	77e009d9fc	[Feature] Add docs of batch invariance and make some extra operators patch (#6910 ) ### What this PR does / why we need it? This PR add docs of batch invariance and make some extra operators according to validation result. please see https://github.com/vllm-project/vllm-ascend/issues/5487 to track progress. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-05 09:12:40 +08:00
Cao Yi	a0a904a3d4	[BugFix] Improve GDN layer detection for multimodal models (#6941 ) ## Summary - Enhanced `check_gdn_layer()` function to properly detect GDN layers in multimodal models - Added support for checking `text_config.layer_types` in addition to root-level `layer_types` - Fixed potential None reference errors when `layer_types` attribute is missing ## Changes - Modified `vllm_ascend/utils.py`: - Replaced `hasattr()` check with safer `getattr()` approach - Added fallback to empty list when `layer_types` is None - Added secondary check for `text_config.layer_types` to support models like Qwen-Omni ## Motivation Previous implementation only checked `layer_types` at the root config level, which failed to detect GDN layers in multimodal models where this information is nested under `text_config`. Additionally, it could raise errors when `layer_types` was None. --- Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code) - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: wanghuanjun2113 <wanghuanjun2113@gmail.com>	2026-03-03 20:08:39 +08:00
Shaoxu Cheng	2064afe380	[300I][Bugfix] fix unquant model weight nd2nz error (#6851 ) ### What this PR does / why we need it? - This PR fixes an issue with weight format conversion for unquantized models running on Ascend 310P devices. - The changes refactor the logic for converting weights to the FRACTAL_NZ format. Previously, this was handled in a 310P-specific linear layer implementation (`AscendUnquantizedLinearMethod310`). This implementation has been removed, and the logic is now centralized in the `maybe_trans_nz` utility function. This function now checks if the device is a 310P and applies the NZ format cast accordingly for `float16`/`bfloat16` weights. - This refactoring simplifies the code by removing platform-specific duplication and ensures correct weight handling for unquantized models on 310P. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ut and local test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-03 15:57:26 +08:00
Shaoxu Cheng	ddc78dbade	[300I] support decode-only aclgraph mode (#6849 ) ### What this PR does / why we need it? 310p aclgraph mode, but has some problems: - the event-id hardware limit, the num of graph will be limited. - the cann version support this feature cannot be get from external of huawei. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? local test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-03-02 14:15:14 +08:00
realliujiaxu	5e24b26a54	[Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883 ) ### What this PR does / why we need it? PR #5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? #### 1. start server ``` bash vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}' ``` #### 2. curl ```bash curl -s http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "auto", "messages": [ {"role": "user", "content": "Hello. I have a question. Who are you?"} ], "max_tokens": 10, "temperature": 0.0, "ignore_eos_token": true }' ``` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-01 20:22:50 +08:00
drslark	5666ce03f5	[bugfix] Fixed an accuracy problem of gdn layer in graph (#6822 ) ### What this PR does / why we need it? There will be random ouputs if we run model with GDN attention in graph mode: ```python prompts = [ "1. Who are you?", ] sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32) sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5) llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4, distributed_executor_backend="mp", gpu_memory_utilization=0.7, speculative_config={ "method": "qwen3_next_mtp", "num_speculative_tokens": 3, }, compilation_config={ "cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [8], }, max_model_len=4096, enable_prefix_caching=False) outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"{output.prompt_token_ids=}") print(f"{output.outputs[0].token_ids=}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` Before appling this change, the outputs was: ```text output.prompt_token_ids=[16, 13, 10479, 525, 498, 30] output.outputs[0].token_ids=[3555, 323, 279, 1112, 279] Prompt: '1. Who are you?', Generated text: ' What and the... the' ``` After applying this change, the output is: ```text output.prompt_token_ids=[16, 13, 10479, 525, 498, 30] output.outputs[0].token_ids=[3555, 374, 697, 829, 30] Prompt: '1. Who are you?', Generated text: ' What is your name?' ``` Why does this change sovle the problem? Now, `query_start_loc` is padded because of `fia`. But, for `gdn-attention`, padded version of `query_start_loc` will cause accuracy problem. So, we need an unpadded version of `query_start_loc` named `gdn_query_start_loc` and use it in `gdn-attention`, it works fine. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? As described aboved. - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` Signed-off-by: drslark <slarksblood@qq.com>	2026-02-28 08:57:53 +08:00
Canlin Guo	e4458b2d2b	[Main2Main] Upgrade vLLM to 0226 (#6813 ) ### What this PR does / why we need it? Breaking: 1. https://github.com/vllm-project/vllm/pull/33452 2. https://github.com/vllm-project/vllm/pull/33451 3. https://github.com/vllm-project/vllm/pull/32567 4. https://github.com/vllm-project/vllm/pull/32344 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-02-27 16:05:21 +08:00
realliujiaxu	5def28dcd3	[Feat]support sequence parallelism by pass for VL models (#5632 )	2026-02-27 08:27:41 +08:00
Icey	ee59429015	upgrade main to 0212 (#6712 ) ### What this PR does / why we need it? Fixes `transformers_utils/processors/__init__` import error, due to https://github.com/vllm-project/vllm/pull/33247 Fixes Fused MoE break introduced by `MoERunner abstraction,` due to https://github.com/vllm-project/vllm/pull/32344 > delete AscendMoERunnere when https://github.com/vllm-project/vllm/pull/35178 is merged Fixes `Make Qwen3VL compatible with Transformers v5`, due to https://github.com/vllm-project/vllm/pull/34262 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-02-25 09:17:29 +08:00
LI SHENGYONG	0331f16a50	[EPLB] Reduce the memory used for heat aggregation (#6729 ) ### What this PR does / why we need it? If dist.all_gather is used directly, 2 x HCCL_BUFFSIZE memory will be consumed, but the actual memory required for hotspot aggregation is less than 1 MB. Therefore, a separate small communication domain is created for it. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Original： ![1](https://github.com/user-attachments/assets/8880b461-c26f-497c-9a05-2ca60cc46aa4) Current： ![2](https://github.com/user-attachments/assets/c9da32b5-9200-4fa2-aff9-d8c4978ac602) - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-24 18:02:24 +08:00
Shaoxu Cheng	b6bc3d2f9d	[Feat.][310P]: weightNZ feature with quant or unquant. (#6705 ) NZ Format Support for Linear Layers: Implemented support for the NZ (N-dimensional Z-order) format for linear layer weights on Ascend 310P, enhancing performance for both quantized and unquantized layers. Unquantized Linear Method for Ascend 310P: Introduced AscendUnquantizedLinearMethod310 to specifically handle and apply NZ format casting to unquantized linear layer weights during the loading process. MRotaryEmbedding Integration: Extended Rotary Embedding support by adding AscendMRotaryEmbedding310 to provide an Ascend-specific implementation for MRotaryEmbedding. Quantization Method Updates: Updated the w8a8_static quantization method to directly transpose weights and apply NZ format casting, ensuring consistency with the new format. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-02-13 15:41:02 +08:00
Shaoxu Cheng	f40256b697	[Feat.][310P] addrmsnorm for 300I DUO (#6704 ) ### What this PR does / why we need it? This PR integrates the `npu_add_rms_norm` fused kernel for RMSNorm operations with residual connections on 310P devices. This change optimizes the computation by replacing a two-step process (manual residual addition followed by RMSNorm) with a single, more efficient fused operation. This is needed to improve the performance of models utilizing RMSNorm with residual connections on the 310P architecture. Fixes # ### Does this PR introduce _any_ user-facing change? No, this PR introduces an internal optimization and does not change any user-facing APIs or behaviors. ### How was this patch tested? This patch was tested with updated unit tests (`test_RMSNorm_forward_310p`) that mock the `npu_add_rms_norm` operation to verify the correctness of the fused kernel integration. --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-02-13 15:40:49 +08:00
pu-zhe	85e33941e8	[Feat.]: 310p support MOE models (#6530 ) ### What this PR does / why we need it? This pull request integrates comprehensive support for Mixture of Experts (MoE) models on the Ascend 310P device within the vllm-ascend framework. It achieves this by introducing specialized modules for expert selection, fused MoE layers, and optimized all-gather communication. The changes also refine existing NPU operations, making them more consistent and efficient for 310P, ultimately enhancing the performance and compatibility of MoE models on this hardware. Highlights 310P MoE Support: Introduces dedicated implementations for Mixture of Experts (MoE) models on Ascend 310P devices, including new modules for expert selection, fused MoE layers, and communication. All-Gather Communication: Enforces the use of ALLGATHER communication for MoE operations on 310P, optimizing data transfer and leveraging NPU-specific token dispatching. Simplified NPU Operations: Removes conditional type casting for npu_swiglu and enables custom rotary embedding kernels unconditionally, suggesting improved native support for 310P. New MoE Classes Registered: Registers AscendFusedMoE310 and AscendSharedFusedMoE310 to integrate 310P-specific MoE layers into the system's custom operation registry. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? offline test and server test, with qwen3-30b-a3b,tp/ep 4 on 310p - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-06 10:30:56 +08:00
Shaoxu Cheng	460ea88276	[Refact.]: Refactor some leftover implementations of 300I DUO in the main branch. (#6425 ) ### What this PR does / why we need it? - Replace the RoPE operator implementation. - Refactor some leftover implementations of 300I DUO in the main branch. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-02-02 16:12:04 +08:00
wangxiyuan	b4aafd4293	[Core][Misc] Clean up ProfileExecuteDuration (#6461 ) ### What this PR does / why we need it? This PR removes the custom `ProfileExecuteDuration` utility and its usages across the codebase. This utility was used for profiling execution duration of different stages in the inference process. It is replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`, which integrates with PyTorch's profiler. This change simplifies the code by removing a custom implementation in favor of an upstream utility, improving maintainability. Associated documentation and tests for `ProfileExecuteDuration` are also removed. ### Does this PR introduce _any_ user-facing change? `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now. ### How was this patch tested? CI passed. The changes are a cleanup and replacement with a standard utility. Existing tests cover the functionality. The removed feature had its own tests which are also removed. Related RFC: #5304 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:01 +08:00
Angazenn	5e34c70ffc	[Misc] Removes unnecessary graph size re-initialization (#6280 ) ### What this PR does / why we need it? This PR removes `update_default_aclgraph_sizes`. In earlier versions, we add this function to change default `cudagraph_capture_sizes` because `_npu_paged_attention` degrades significantly on certain shapes (which is included in default `cudagraph_capture_sizes` of VLLM). Now since we use FIA as default attention op (which does not contain such performance degradation), there is no need to add this default change. Otherwise, it could cause some conflicts if we set a small `cudagraph_capture_sizes` that < 20 now. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-27 14:38:07 +08:00
Shaoxu Cheng	fbae41697e	[310P]: refactoring for 310p kvcache and some ops class (#6117 ) ### What this PR does / why we need it? * Refactor the LayerNorm and activation operator classes to decouple the 310P device implementation from the main branch. * Refactor `mm_encoder_attention` on 310P to use the `torch_npu._npu_flash_attention_unpad` operator. * Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P so they are no longer padded to 16× alignment. * Refactor `model_runner` on 310P to align the KV-cache initialization logic with the mainline implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? use the e2e tests. - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-24 20:34:29 +08:00
wangqiankun13	08d7014874	[Feature]Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] (#5758 ) ### What this PR does / why we need it? Operator `DispatchGmmCombineDecode` does not support non-W8A8 scenarios and cannot share the same communication domain with Operator `Dispatch`/`Combine`. > for instance, when the draft model uses a non-W8A8 MOE architecture while the main model employs a W8A8 MOE architecture. Therefore days ago, I implemented an interception that unconditionally disables Operator `DispatchGmmCombineDecode` whenever the speculative mode is `EAGLE` or `EAGLE-3`. [PR: 5293](https://github.com/vllm-project/vllm-ascend/pull/5293) However, this approach was not precise enough. This PR further refines the logic by specifically identifying the draft model's configuration: Operator `DispatchGmmCombineDecode` will now be disabled only when the draft model uses an MOE architecture and is non-W8A8. More info about this operator, please refer to RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode ```shell nic_name="xxxx" local_ip="xxx.xxx.xxx.xxx" export HCCL_IF_IP=$local_ip export GLOO_SOCKET_IFNAME=$nic_name export TP_SOCKET_IFNAME=$nic_name export HCCL_SOCKET_IFNAME=$nic_name export VLLM_ASCEND_ENABLE_FUSED_MC2=2 echo "VLLM_ASCEND_ENABLE_FUSED_MC2=${VLLM_ASCEND_ENABLE_FUSED_MC2}" export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_BUFFSIZE=512 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export OMP_PROC_BIND=false export OMP_NUM_THREADS=10 export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD vllm serve /dataset/Qwen3-235B-A22B-Instruct-2507-w8a8-QuaRot/ \ --served-model-name "qwen" \ --host 0.0.0.0 \ --port 8004 \ --async-scheduling \ --tensor-parallel-size 4 \ --data-parallel-size 4 \ --max-num-seqs 64 \ --max-model-len 40960 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --no-enable-prefix-caching \ --quantization "ascend" \ --trust-remote-code \ --speculative_config \ '{ "method": "eagle3", "model": "/dataset/Qwen3-235B-A22B-Instruct-2507-speculator-eagle3/", "num_speculative_tokens": 2 }' \ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \ 2>&1 \| tee qwen3_235b_eagle3.log ``` \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 80.00 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-22 10:51:02 +08:00
LeeWenquan	55b20ac63b	[Ops] Add layernorm for qwen3Next (#5765 ) ### What this PR does / why we need it? Add layernormFn triton op for qwen3Next model for better performance. <img width="248" height="526" alt="image" src="https://github.com/user-attachments/assets/27b47157-5df5-4db1-aa88-1dae799b2bf6" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-01-20 14:43:14 +08:00
Shaoxu Cheng	1ffca8673f	[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 ) ### What this PR does / why we need it? Add basic 310p support. Only dense models work with eager mode now. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-01-17 11:49:18 +08:00
SILONG ZENG	52086394ae	[Lint]Style: Convert `vllm-ascend/compilation` to ruff format (#5912 ) ### What this PR does / why we need it? Convert `vllm-ascend/compilation` to ruff format. ### Does this PR introduce _any_ user-facing change? During this migration, we encountered some errors in our CI and testing environments, such as: ``` vllm_ascend/utils.py:653: in <module> def register_ascend_customop(vllm_config: VllmConfig \| None = None): ^^^^^^^^^^^^^^^^^ E TypeError: unsupported operand type(s) for \|: 'NoneType' and 'NoneType' ``` 1. Root Cause Analysis: The project uses a common pattern to break circular dependencies: ```python if TYPE_CHECKING: from vllm.config import VllmConfig else: VllmConfig = None # Placeholder assigned at runtime ``` When Python parses the function definition `def register_ascend_customop(vllm_config: VllmConfig \| None)`, it attempts to evaluate the expression `VllmConfig \| None`. Since `VllmConfig` is assigned `None` at runtime, the expression effectively becomes `None \| None`. In Python, `None` is an instance of `NoneType`. While the `\|` operator is implemented for Type objects (classes), it is not supported for `NoneType` instances, leading to the `TypeError` shown above. 2. Solution: To maintain the modern `\|` syntax required by our new linting standards while preserving our dependency management strategy, I have introduced: ```python from __future__ import annotations ``` at the top of the affected files. This enables Postponed Evaluation of Annotations (PEP 563). 3. Impact and Benefits: - By enabling `annotations`, Python no longer executes the `VllmConfig \| None` operation during module load. Instead, it stores the annotation as a string literal, completely avoiding the `None \| None` calculation. - We can keep the `VllmConfig = None` placeholders. This ensures that other modules can still import these symbols without triggering an `ImportError`, maintaining a stable dependency graph. - IDEs and static type checkers (MyPy/Pyright) continue to resolve the types correctly. This allows us to use modern syntax without sacrificing type safety or runtime stability. - The only side effect is that `__annotations__` will now return strings instead of type objects. Since this module does not use runtime type enforcement or reflection, this change has zero negative impact on existing functionality. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-16 20:57:46 +08:00
drslark	48ec97821a	[Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816 ) ### What this PR does / why we need it? Fixed an accuracy problem when using eagle3 with sp. The problem is described in https://github.com/vllm-project/vllm-ascend/issues/5825. It also adds a much more precise way to determine whether drafter should use `sp` or not. Also, it changes the `eager` of drafter to be a real `eager` in frontend to avoid a `fx-graph` problem. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? For simpilicity, we test it as in https://github.com/vllm-project/vllm-ascend/issues/5825. And we get the same result of `eagle3` with `sp` disabled. ```text -------------------------------------------------- total_num_output_tokens: 1000 num_drafts: 437 num_draft_tokens: 1311 num_accepted_tokens: 564 mean acceptance length: 2.29 -------------------------------------------------- acceptance at token 0: 0.62 acceptance at token 1: 0.40 acceptance at token 2: 0.27 acceptance at token 3: 0.00 acceptance at token 4: 0.00 acceptance at token 5: 0.00 ``` * vLLM version: v0.13.0 * vLLM main: `2f4e6548ef` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-14 09:00:37 +08:00
zzhxxx	db12c1e2c8	[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701 ) ### What this PR does / why we need it? > Extracted from PR #5513 Based on the Sharded-CP feature PR:#4702; RFC:https://github.com/vllm-project/vllm/issues/30055 ### All-gather KV Cache for Communication Overlap: - This PR adjusts the calculation order in the SFA. - split `index_select` into `indexer_select_pre_process` and `indexer_select_post_process`. - Combine `nope`, `rope` and `index-k` into a tensor to perform asynchronous all-gather. ### benchmark: input=40k && num_batch_token=20k - before: ``` Mean TTFT (ms): 2614.52 Median TTFT (ms): 3148.03 P50 TTFT (ms): 3148.03 P90 TTFT (ms): 3163.48 P99 TTFT (ms): 3170.20 ``` - after: ``` Mean TTFT (ms): 2529.92 Median TTFT (ms): 3051.69 P50 TTFT (ms): 3051.69 P90 TTFT (ms): 3067.31 P99 TTFT (ms): 3072.15 ``` ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2026-01-11 09:47:27 +08:00
Levi	ecd4232698	[Feat] flashcomm2+oshard Generalized (#4723 ) ### What this PR does / why we need it? [FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf) introduces redundant storage of the o_proj matrix, which imposes pressure on GPU memory. We propose the FlashComm2+Oshard approach by integrating the shared linear layer feature (#2931). This approach distributes weights layer-by-layer to each GPU and accesses the o_proj of each layer via asynchronous broadcast operations, thereby alleviating memory pressure while achieving nearly lossless performance compared to the original FlashComm2. This PR implements a generalized FlashComm2+Oshard solution. Using following env to support flashcomm2 with oshard ```shell export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 --additional-config '{ "layer_sharding": ["o_proj"] }' ``` ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2026-01-10 22:57:57 +08:00
zzhxxx	64d29875f9	[Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698 ) ### What this PR does / why we need it? Based on the Sharded-CP feature PR:https://github.com/vllm-project/vllm-ascend/pull/4702; RFC:https://github.com/vllm-project/vllm/issues/30055 This PR officially integrates Deepseek V3.2's DSA-CP support on the basis of https://github.com/vllm-project/vllm-ascend/pull/4702, improving inference efficiency and scalability under mixed prefill-decode workloads. The main improvements include: - Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for TP=1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Signed-off-by: Kurumi5210 <jaychou1620@gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-09 15:58:40 +08:00
zzhxxx	f7db812ed7	[refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181 ) ### What this PR does / why we need it? - Delete the environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED` - Introduce layer_sharding as a configurable feature in additional_config - Revise the term "shared weight" to "shard weight." Configuration : The feature is opt-in via the additional_config argument: ``` --additional-config '{ "layer_sharding": ["o_proj", "q_b_proj"] }' ``` This is orthogonal to standard tensor parallelism and weight replication strategies. It is treated as a separate, explicit feature.It can be used in any scenario, combined with the flashcomm2https://github.com/vllm-project/vllm-ascend/pull/3232 feature or the ShardedCP #4702 feature, to achieve significant performance. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn> Signed-off-by: chenxiao <Jaychou1620@Gmail.com> Co-authored-by: clrs97 <524936896@qq.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: chenxiao <Jaychou1620@Gmail.com>	2026-01-08 09:05:02 +08:00
LICO67373	380f089fbf	[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870 ) ## What this PR does / why we need it? This PR fixes the `AttentionMaskBuilder` singleton initialization issue introduced in PR #4779 and removes the unused `pcp_prefill_mask` field. ### Background After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton` decorator, the class constructor now requires a `device` parameter. However, two initialization sites were still using the old parameterless constructor, causing failures. ### Changes 1. Fix singleton initialization - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendMLAMetadataBuilder.__init__()` - Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)` in `AscendAttentionMetadataBuilder.__init__()` 2. Remove unused field - Removed `pcp_prefill_mask` field from `AscendPrefillContextParallelMetadata` (never used in codebase) - Updated related test assertions ### Related - Issue #5463 - PR #4779 (Unify all mask generation methods) - PR #5389 (Make AttentionMaskBuilder singleton) ## Does this PR introduce _any_ user-facing change? No. This is an internal refactoring. ## How was this patch tested? - ✅ Local testing: No linter errors - ✅ Unit tests for attention modules verified - ⏳ CI pipeline Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-07 17:09:52 +08:00
weiguihua2	2b8a9ce8bd	[Bugfix] fix resource are insufficient when pcp and piecewise (#5377 ) ### What this PR does / why we need it? Resolving the issue of insufficient resources during service operation when PCP is enabled in a piecewise scenario. When enabling PCP and executing in piecewise mode, the curl request fails due to insufficient resources, resulting in the error message "The resources are insufficient." Through profiling analysis, it was found that the PCP communication domain also occupies streams and consumes resources. Therefore, when updating aclgraph sizes, the PCP communication domain needs to be taken into account. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-07 15:39:52 +08:00
wangxiyuan	1112208052	[Refactor] Cleanup platform (#5566 ) ### What this PR does / why we need it? 1. add `COMPILATION_PASS_KEY` constant 2. clean up useless platform interface `empty_cache`, `synchronize`, `mem_get_info`, `clear_npu_memory` 3. rename `CUSTOM_OP_REGISTERED` to `_CUSTOM_OP_REGISTERED` 4. remove uesless env `VLLM_ENABLE_CUDAGRAPH_GC` NPUPlatform is the interface called by vLLM. Do not call it inner vllm-ascend. ### Does this PR introduce _any_ user-facing change? This PR is just a cleanup. All CI should pass. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 09:25:55 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
ZixuanWang	8eae949d11	Revert "[Feat] enable hierarchical mc2 ops on A2 by default (#5545 )" (#5611 ) This reverts commit `fb9fdcdbe4`. ### What this PR does / why we need it? this pr breaks the smoke test because of that leads the error of aclnnNeScalar:Kernel Run failed. opType: 25, NotEqual launch failed for NotEqual, errno:361001 <img width="1149" height="166" alt="A6C9453D-4F0B-4256-DD80-A9C181DAB2D9" src="https://github.com/user-attachments/assets/cab9c4b8-3fd1-4c6b-b424-474b46042726" /> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: zxwang <1476209578@qq.com>	2026-01-05 22:39:05 +08:00
wangqiankun13	350b95efcf	[BugFix]Disable dispatch_gmm_combine_decode operator when mtp drafter model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series (#5293 ) …w8a8 while main model uses w8a8 ### What this PR does / why we need it? Disable dispatch_gmm_combine_decode operator when mtp drafter model uses non-w8a8 while main model uses w8a8, or drafter model is eagle series. More info about this operator, please refer to RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-04 17:51:28 +08:00
Qiu	7c210225a2	[Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382 ) ### What this PR does / why we need it? This PR adds multi-stream for GQA to enable computation-communication overlap. For chunked prefill, we reduce TTFT by approximately 4%. ### Does this PR introduce _any_ user-facing change? No - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-04 16:33:18 +08:00

1 2 3 4

187 Commits