xc-llm-ascend

Author	SHA1	Message	Date
pu-zhe	e76b69b9ef	[BugFix] [310p] Fix attention accuracy issue (#6803 ) ### What this PR does / why we need it? This pull request resolves an attention accuracy issue by enhancing the AttentionMaskBuilder310 to correctly handle the maximum model length. The change ensures that the attention mask generation process is properly parameterized by the model's configuration, rather than relying on a fixed internal value. This leads to more accurate attention mask creation, which is crucial for the correct functioning of the attention mechanism. Update fused_moe to main branch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qwen3 dense mode & moe model e2e test - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-26 14:30:39 +08:00
Cao Yi	3953dcf784	[Feature][Quant] Auto-detect quantization format from model files (#6645 ) ## Summary - Add automatic quantization format detection, eliminating the need to manually specify `--quantization` when serving quantized models. - The detection inspects only lightweight JSON files (`quant_model_description.json` and `config.json`) at engine initialization time, with no `.safetensors` reads. - User-explicit `--quantization` flags are always respected; auto-detection only applies when the flag is omitted. ## Details Detection priority: 1. `quant_model_description.json` exists → `quantization="ascend"` (ModelSlim) 2. `config.json` contains `"quant_method": "compressed-tensors"` → `quantization="compressed-tensors"` (LLM-Compressor) 3. Neither → default float behavior Technical approach: Hooked into `NPUPlatform.check_and_update_config()` to run detection after `VllmConfig.__post_init__`. Since `quant_config` is already `None` at that point, we explicitly recreate it via `VllmConfig._get_quantization_config()` to trigger the full quantization initialization pipeline. ## Files Changed \| File \| Description \| \|------\|-------------\| \| `vllm_ascend/quantization/utils.py` \| Added `detect_quantization_method()` and `maybe_auto_detect_quantization()` \| \| `vllm_ascend/platform.py` \| Integrated auto-detection in `check_and_update_config()` \| \| `vllm_ascend/quantization/modelslim_config.py` \| Improved error handling for weight loading \| - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-02-26 10:59:25 +08:00
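The detection priority described in this commit can be sketched as follows. This is a minimal illustration, not the code in `vllm_ascend/quantization/utils.py`: the function names `sketch_detect_quantization` and `resolve_quantization` are hypothetical, and reading `quant_method` from a `quantization_config` object inside `config.json` is an assumption about the file layout.

```python
import json
from pathlib import Path
from typing import Optional


def sketch_detect_quantization(model_dir: str) -> Optional[str]:
    """Hypothetical stand-in that mirrors the priority order in the commit message."""
    root = Path(model_dir)

    # Priority 1: a ModelSlim description file marks the "ascend" format.
    if (root / "quant_model_description.json").is_file():
        return "ascend"

    # Priority 2: config.json declares the compressed-tensors method.
    config_path = root / "config.json"
    if config_path.is_file():
        try:
            config = json.loads(config_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            return None
        # Assumption: quant_method is nested under "quantization_config".
        quant_cfg = config.get("quantization_config") or {}
        if isinstance(quant_cfg, dict) and quant_cfg.get("quant_method") == "compressed-tensors":
            return "compressed-tensors"

    # Priority 3: nothing found, keep the default float behavior.
    return None


def resolve_quantization(user_value: Optional[str], model_dir: str) -> Optional[str]:
    """An explicit user value always wins; detection runs only when it is omitted."""
    if user_value is not None:
        return user_value
    return sketch_detect_quantization(model_dir)
```

Only small JSON files are opened here, which matches the stated goal of avoiding `.safetensors` reads at engine initialization.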
Li-Yongwen	2870f7c8ad	[Feat] Support routing replay (#6696 ) ### What this PR does / why we need it? [Feat] Support routing replay same as https://github.com/vllm-project/vllm-ascend/pull/6666 resubmit because of DOC failure ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: liyongwen <1310439159@qq.com> Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-26 10:22:47 +08:00
Rozwel-dx	a9cca0c5c4	[Refactor] Modify the binding logic, added memory migration and interrupt core binding functions. (#6785 ) ### What this PR does / why we need it? Controls the use of memory on a closer NUMA node to achieve lower memory access latency, while binding interrupts to different CPU cores to prevent them from interrupting the inference process. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `b8eaaa073b` Signed-off-by: rowzwel_dx <1392851715@qq.com> Signed-off-by: Rozwel-dx <1392851715@qq.com> - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: Rozwel-dx <1392851715@qq.com>	2026-02-26 08:49:50 +08:00
Frank Chen	3da2ba22eb	[Platform] Enable ARM-only CPU binding with NUMA-balanced A3 policy and update docs/tests (#6686 ) ### What this PR does / why we need it? - Keeps enable_cpu_binding default on, but skips binding on non‑ARM CPUs inside bind_cpus, with a clear log. - Uses a table-driven binding policy: A3 uses NUMA‑balanced binding; other device types use NUMA‑affinity binding. - Updates docs to reflect the exact behavior and adds/updates unit tests for the new logic. ### Does this PR introduce _any_ user-facing change? - Yes. CPU binding is now enabled by default via additional_config, and documented in the user guide. - CPU binding behavior differs by device type (A3 vs. others). ### How was this patch tested? Added/updated unit tests: test_cpu_binding.py 1. test_binding_mode_table covers A2 vs A3 binding mode mapping. 2. test_build_cpu_pools_fallback_to_numa_balanced covers fallback when affinity info is missing. 3. TestBindingSwitch.test_is_arm_cpu covers ARM/x86/unknown arch detection. 4. test_bind_cpus_skip_non_arm covers non‑ARM skip path in bind_cpus. test_worker_v1.py 1. Updated mocks for enable_cpu_binding default True to align with new config default. - vLLM version: v0.14.1 - vLLM main: d7de043 --------- Signed-off-by: chenchuw886 <chenchuw@huawei.com> Co-authored-by: chenchuw886 <chenchuw@huawei.com>	2026-02-25 11:15:14 +08:00
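The table-driven policy in this commit might look roughly like the sketch below. The mode strings, the table name `BINDING_MODE_BY_DEVICE`, and the helper names are all hypothetical; only the behavior (skip on non-ARM, NUMA-balanced for A3, NUMA-affinity otherwise, fall back to NUMA-balanced when affinity info is missing) is taken from the commit message.

```python
import platform
from typing import Optional

# Hypothetical policy table: device types not listed use the default mode.
BINDING_MODE_BY_DEVICE = {"A3": "numa_balanced"}
DEFAULT_BINDING_MODE = "numa_affinity"


def is_arm_cpu(machine: Optional[str] = None) -> bool:
    """Treat aarch64/arm* machine strings as ARM; anything else is not."""
    name = (machine or platform.machine()).lower()
    return name.startswith(("aarch64", "arm"))


def choose_binding_mode(
    device_type: str,
    machine: Optional[str] = None,
    has_affinity_info: bool = True,
) -> Optional[str]:
    """Return the binding mode to apply, or None when binding is skipped."""
    if not is_arm_cpu(machine):
        # Binding stays enabled in the config but is skipped on non-ARM hosts.
        return None
    mode = BINDING_MODE_BY_DEVICE.get(device_type, DEFAULT_BINDING_MODE)
    if mode == "numa_affinity" and not has_affinity_info:
        # Fall back when the device-to-NUMA affinity cannot be determined.
        return "numa_balanced"
    return mode
```

Keeping the policy in a table means a new device type only needs a new entry rather than another branch.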
Icey	ee59429015	upgrade main to 0212 (#6712 ) ### What this PR does / why we need it? Fixes `transformers_utils/processors/__init__` import error, due to https://github.com/vllm-project/vllm/pull/33247 Fixes Fused MoE break introduced by `MoERunner abstraction,` due to https://github.com/vllm-project/vllm/pull/32344 > delete AscendMoERunnere when https://github.com/vllm-project/vllm/pull/35178 is merged Fixes `Make Qwen3VL compatible with Transformers v5`, due to https://github.com/vllm-project/vllm/pull/34262 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-02-25 09:17:29 +08:00
pu-zhe	a8e951e6f5	[Feat] 310p supports PrefillCacheHit State (#6756 ) ### What this PR does / why we need it? This PR extends the Ascend 310P attention backend to support the `PrefillCacheHit` state. Previously, only `PrefillNoCache`, `DecodeOnly`, and `ChunkedPrefill` were supported. This PR handles this state by routing it to the existing `forward_chunked_prefill_310` implementation, which is suitable for this scenario. The changes also include refactoring the main `forward_impl` dispatch method for better clarity and updating unit tests to cover the new state and ensure correctness. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Accuracy test when chunked prefill is disabled. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-24 16:48:05 +08:00
LI SHENGYONG	ff29e029de	[EPLB][Bugfix] Bugfix for ineffective dynamic eplb (#6653 ) ### What this PR does / why we need it? #6043 deleted the forward_before phase of the dynamic eplb. Currently, the end-to-end precision is monitored in the UT, and the log is not printed in the key place. As a result, the eplb does not take effect and is not intercepted. 1. The forward_before function is added back. 2. Delete unnecessary logs and add key logs. 3. Warm-up of algorithm 3 is added. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ![Snipaste_2026-02-10_15-57-31](https://github.com/user-attachments/assets/03813e5f-3d19-42d8-8118-76223afe8298) #### The conversation is normal. Okay, the user is asking, \"What is deep learning?\" I need to explain this in a clear and concise way. Let me start by recalling what I know about deep learning. It's a subset of machine learning, right? So first, I should mention that it's part of machine learning, which itself is a branch of AI. Then, the key aspect of deep learning is the use of neural networks with multiple layers. These are called deep neural networks.\n\nWait, I should define neural networks first. Maybe start with the basics. A neural network is inspired by the human brain, with layers of nodes (neurons) that process data. But deep learning specifically refers to networks with many layers—hence \"deep.\" So the term \"deep\" comes from the number of layers. \n\nI should explain how deep learning works. It involves training these networks on large datasets, allowing them to automatically learn features from the data. Unlike traditional machine learning, where you might have to manually extract features, deep learning models can do this automatically. That's a key point. For example, in image recognition, a deep learning model can learn to detect edges, shapes, and then more complex patterns without human intervention.\n\nApplications are important too. 
The user might want to know where deep learning is used. Common examples include image and speech recognition, natural language processing, autonomous vehicles, and recommendation systems. Maybe mention specific technologies like self-driving cars using computer vision or virtual assistants like Siri or Alexa - vLLM version: v0.15.0 - vLLM main: `13397841ab` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-24 14:43:04 +08:00
SILONG ZENG	e2237819a9	[CI]Fixed the spell check function in `typos.toml` (#6753 ) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-14 11:57:26 +08:00
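The difference between the character class and the alternation group can be checked with a short script. This sketch uses Python's `re` module purely for illustration (the spell checker itself is a separate tool with its own regex engine, but bracket and group semantics are the same for these patterns), and it assumes the intended fixed form is the plain group `(UE4M3|ue4m3)`.

```python
import re

# Character class: matches ANY single character from the set U, E, 4, M, 3, |, u, e, m.
buggy = re.compile(r".[UE4M3|ue4m3].")
# Alternation group: matches only the literal tokens "UE4M3" or "ue4m3".
fixed = re.compile(r".(UE4M3|ue4m3).")

words = ["theme", "output", "xUE4M3y", "a_ue4m3_b"]
for word in words:
    print(word, bool(buggy.search(word)), bool(fixed.search(word)))
```

With the bracket form, ordinary words such as `theme` or `output` are ignored by the checker because they contain one of the listed characters between two other characters; the group form only ignores identifiers that contain the full token.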
Shaoxu Cheng	b6bc3d2f9d	[Feat.][310P]: weightNZ feature with quant or unquant. (#6705 ) NZ Format Support for Linear Layers: Implemented support for the NZ (N-dimensional Z-order) format for linear layer weights on Ascend 310P, enhancing performance for both quantized and unquantized layers. Unquantized Linear Method for Ascend 310P: Introduced AscendUnquantizedLinearMethod310 to specifically handle and apply NZ format casting to unquantized linear layer weights during the loading process. MRotaryEmbedding Integration: Extended Rotary Embedding support by adding AscendMRotaryEmbedding310 to provide an Ascend-specific implementation for MRotaryEmbedding. Quantization Method Updates: Updated the w8a8_static quantization method to directly transpose weights and apply NZ format casting, ensuring consistency with the new format. - vLLM version: v0.15.0 - vLLM main: `9562912cea` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-02-13 15:41:02 +08:00
Shaoxu Cheng	f40256b697	[Feat.][310P] addrmsnorm for 300I DUO (#6704 ) ### What this PR does / why we need it? This PR integrates the `npu_add_rms_norm` fused kernel for RMSNorm operations with residual connections on 310P devices. This change optimizes the computation by replacing a two-step process (manual residual addition followed by RMSNorm) with a single, more efficient fused operation. This is needed to improve the performance of models utilizing RMSNorm with residual connections on the 310P architecture. Fixes # ### Does this PR introduce _any_ user-facing change? No, this PR introduces an internal optimization and does not change any user-facing APIs or behaviors. ### How was this patch tested? This patch was tested with updated unit tests (`test_RMSNorm_forward_310p`) that mock the `npu_add_rms_norm` operation to verify the correctness of the fused kernel integration. --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-02-13 15:40:49 +08:00
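For reference, the two-step computation that the fused kernel replaces can be written in plain Python. This is only a numerical sketch of "residual add followed by RMSNorm" on a single vector; it does not call `npu_add_rms_norm`, and the function names and the `eps` default are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def add_then_rms_norm(
    x: List[float], residual: List[float], weight: List[float], eps: float = 1e-6
) -> Tuple[List[float], List[float]]:
    """Unfused reference: step 1 adds the residual, step 2 normalizes the sum.

    Returns the normalized output and the summed tensor, which becomes the
    residual for the next layer.
    """
    summed = [a + b for a, b in zip(x, residual)]
    return rms_norm(summed, weight, eps), summed
```

A fused operator produces the same two results in one kernel launch, which avoids writing the intermediate sum to memory and reading it back.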
Icey	7164990904	[Graph][Fusion] Integrating inductor pass and npugraph ex pass (#6354 ) ### What this PR does / why we need it? Integrating inductor pass and npugraph ex pass, see RFC: https://github.com/vllm-project/vllm-ascend/issues/6347 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? all tests passed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-02-13 15:34:55 +08:00
wangxiaoteng888	b881fab416	[P/D][PCP] mooncake layerwise support pcp function (#6627 ) ### What this PR does / why we need it? mooncake layerwise support pcp function PCP (Prefill Context Parallelism) Support: Introduced explicit support for Prefill Context Parallelism (PCP) and Decode Context Parallelism (DCP) in the Mooncake layerwise KV cache transfer mechanism, allowing for more granular control and awareness of parallel configurations during data transfer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By ci - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>	2026-02-12 11:02:25 +08:00
iiiklw	a0315f6697	[npugraph_ex]enable npugraph_ex by default (#6664 ) ### What this PR does / why we need it? This pull request enables the `npugraph_ex` backend by default to improve performance on Ascend NPUs, as proposed in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/6214). ### Does this PR introduce _any_ user-facing change? Yes. `npugraph_ex` is now enabled by default. Users can disable it by setting `enable: false` in the `npugraph_ex_config` section of the `additional_config`. ### How was this patch tested? CI passed. The changes are covered by existing and new E2E tests (`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`) that have been updated to reflect the new default behavior. The tests verify correctness and consistency with `npugraph_ex` enabled and disabled, as well as with the new static kernel option. Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com> Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>	2026-02-12 08:44:06 +08:00
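A small sketch of how the opt-out described in this commit could be expressed. The key names `npugraph_ex_config` and `enable` come from the commit message; building the value in Python and serializing it to JSON for an `--additional-config` style argument is an assumption about how the setting is passed.

```python
import json

# Disable npugraph_ex, which this commit turns on by default.
additional_config = {"npugraph_ex_config": {"enable": False}}

# JSON form, suitable for passing as a single command-line argument value.
additional_config_json = json.dumps(additional_config)
print(additional_config_json)
```

Note that Python's `False` serializes to JSON `false`, matching the `enable: false` wording above.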
pu-zhe	02886e2641	[Feat] 310p support MoE W8A8 quantization (#6641 ) ### What this PR does / why we need it? This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by: - Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`. - Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths. - Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations. - Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8 quantization in a multi-card setup. - Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-10 17:17:44 +08:00
LI SHENGYONG	7cf285a77a	[MOE Refactor] Remove QuantType in prepare_finalize.py (#6534 ) ### What this PR does / why we need it? To prevent confusion between different QuantType classes, we remove QuantType in prepare_finalize.py - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-10 15:59:58 +08:00
LI SHENGYONG	34eecacace	[EPLB] Avoiding eplb's dependency on a specified model (#6528 ) ### What this PR does / why we need it? 1. Currently, eplb registers different attributes for different models, but these attributes are not actually used. Now, these attributes are directly deleted. 2. Add some log about eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Deepseek v3.1 chat Of course! Here is a comprehensive explanation of deep learning, broken down for clarity.\n\n### The Simple Analogy: A Child Learning to Recognize a Cat\n\nImagine teaching a child what a cat is. You don't give them a rulebook with instructions like \"has pointy ears, whiskers, and a tail.\" Instead, you show them many pictures, saying \"this is a cat\" or \"this is not a cat.\" The child's brain gradually learns to identify the complex patterns—the combination of shapes, colors, and textures—that define \"cat-ness.\"\n\nDeep learning is essentially this, but for computers. It's a method for teaching computers to learn from examples and recognize patterns directly from data (like images, sound, or text) without being explicitly programmed with rigid rules.\n\n---\n\n### The Technical Definition\n\nDeep Learning is a subfield of machine learning, which itself is a subfield of artificial intelligence (AI). It uses artificial neural networks with many layers (\"deep\" networks) to model and understand complex patterns in data.\n\nHere are the key concepts in that definition:\n\n1. Artificial Intelligence (AI): The broad science of making machines smart and capable of performing tasks that typically require human intelligence.\n2. Machine Learning (ML): A subset of AI that gives computers the ability to learn from data without being explicitly programmed for every single rule.\n3. 
Deep Learning (DL): A specific, powerful - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-02-10 15:58:44 +08:00
Ronald	77305df398	implement batch invariant with ascendc (#6590 ) ### What this PR does / why we need it? Batch-invariant ops are implemented in both Triton and AscendC; this PR selects which kind of ops is used to enable batch invariance. #5487 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-02-10 14:15:26 +08:00
Nengjun Ma	66b60c9440	[Refact]Refact MLA/SFA weight prefetch to be consistent with moe weight prefetch (#6629 ) ### What this PR does / why we need it? 1. [Refact] Refactor MLA/SFA weight prefetch to be consistent with MoE weight prefetch 2. Remove duplicated o_proj weight prefetch in forward for MLA/SFA ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? 1) Performance result: Perf test data: MLA: \| \| 1st test \| 2nd test \| Output Token Throughput(Avg) \| Performance improvement percentage \| \| --- \| --- \| --- \| --- \| --- \| \| o_proj duplicate prefetch \| 11.9669 token/s \| 12.0287 token/s \| 11.9978 \| \| \| o_proj no duplicate prefetch \| 12.5594 token/s \| 12.6216 token/s \| 12.5905 \| 4.94% \| Single-layer performance improvement: 5%~8%. SFA: \| \| 1st test \| 2nd test \| Output Token Throughput(Avg) \| Performance improvement percentage \| \| --- \| --- \| --- \| --- \| --- \| \| o_proj duplicate prefetch \| 13.0523 token/s \| 13.1084 token/s \| 13.08035 \| \| \| o_proj no duplicate prefetch \| 13.9844 token/s \| 14.1678 token/s \| 14.0761 \| 7.6% \| - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-10 14:14:37 +08:00
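The improvement percentages reported in this commit follow from the per-run throughputs. The sketch below recomputes them; the helper names and the dictionary layout are illustrative, while the numbers are copied from the commit message.

```python
from typing import Dict, List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def improvement_pct(baseline: float, candidate: float) -> float:
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Output token throughput (token/s) for the two test runs of each variant.
runs: Dict[str, Dict[str, List[float]]] = {
    "MLA": {"duplicate": [11.9669, 12.0287], "no_duplicate": [12.5594, 12.6216]},
    "SFA": {"duplicate": [13.0523, 13.1084], "no_duplicate": [13.9844, 14.1678]},
}

for backend, data in runs.items():
    base = mean(data["duplicate"])
    new = mean(data["no_duplicate"])
    print(f"{backend}: {base:.5f} -> {new:.5f} token/s, +{improvement_pct(base, new):.2f}%")
```

The averages reproduce the table values (11.9978 and 12.5905 for MLA, 13.08035 and 14.0761 for SFA), and the gains round to the reported 4.94% and 7.6%.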
wangxiyuan	2a826b5fad	[Misc] upgrade to vllm main (#6646 ) ### What this PR does / why we need it? This PR upgrades the core vLLM dependency to a newer version from the main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is necessary to keep our project up-to-date with the latest features and fixes from upstream vLLM. 1. `ac32e66cf9` pass file is moved. - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wxsIcey <1790571317@qq.com>	2026-02-10 14:08:59 +08:00
Qiu	cb7c419bc0	[Feat](sfa,dcp) support dcp for sfa (#6563 ) ### What this PR does / why we need it? This PR adds DCP support to the SFA backend. Please note that due to operator constraints, the current implementation has to all-gather the entire KV cache and modify the block table to satisfy the operator input requirements. This results in significantly increased communication overhead and peak memory usage. Therefore, this is only a temporary workaround and will be refactored once the operator provides proper support. Additionally, because of the above limitations, `cp_kv_cache_interleave_size` is currently required to be equal to `block_size`. This restriction will also be removed after the refactor. #### Test accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8 \| dataset \| version \| metric \| mode \| vllm-api-general-stream \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8kdataset \| - \| accuracy \| gen \| 96.35 \| - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-02-09 18:52:25 +08:00
Canlin Guo	b7aa511daa	[Patch] Remove the patch of MiniCPM (#5975 ) ### What this PR does / why we need it? Part of #5304. After https://github.com/vllm-project/vllm/pull/32523 merge, we could remove the patch of `MiniCPMAttention`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Test it locally. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-02-09 14:07:44 +08:00
pu-zhe	4f33e25046	[Refactor]refactor 310p attention impl and add ut (#6579 ) ### What this PR does / why we need it? This pull request significantly refactors the attention mechanism for the Ascend 310P hardware, enhancing its architecture by separating mask generation concerns from the core attention implementation. It introduces a dedicated mask builder class capable of handling various mask types, including causal, splitfuse, and sliding window attention masks, all optimized for the NPU's fractal data format. This change not only cleans up the codebase but also lays the groundwork for more robust and feature-rich attention operations on Ascend devices, backed by new, extensive unit tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3 and qwen3-moe - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:26:26 +08:00
pu-zhe	23524f2ca4	[Refactor]refactor 310p ops and add ut (#6591 ) ### What this PR does / why we need it? This pull request focuses on a significant refactoring effort within the vllm-ascend project, specifically targeting operations optimized for the Ascend 310P hardware. The changes aim to streamline the implementation of core components like quantization and multi-head attention, making the codebase more maintainable and robust. Concurrently, new unit tests have been introduced to ensure the correctness and reliability of these refactored modules. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test with qwen3-32b w8a8 - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-02-07 09:25:17 +08:00
wangxiyuan	6c49f95da2	[Ops][Refactor] Remove custom rotary_embedding operator (#6523 ) ### What this PR does / why we need it? This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests. The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain. ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring and does not introduce any user-facing changes. ### How was this patch tested? The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings. - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-07 09:24:05 +08:00
Nengjun Ma	11339eb48a	[CI] Update UT CANN version to 8.5.0 for main branch (#6564 ) ### What this PR does / why we need it? Update UT CANN version to 8.5.0 ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-06 10:28:42 +08:00
meihanc	922e5c163b	[main2main] upgrade vllm main 0202 (#6560 ) ### What this PR does / why we need it? 1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to https://github.com/vllm-project/vllm/pull/32567 2. Fix ` TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to https://github.com/vllm-project/vllm/pull/33035 3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and AttributeError: 'bool' object has no attribute 'process_weights_after_loading' due to https://github.com/vllm-project/vllm/pull/33284 4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'`due to https://github.com/vllm-project/vllm/pull/32790 5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to https://github.com/vllm-project/vllm/pull/32005 6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'` due to https://github.com/vllm-project/vllm/pull/27492 7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward, vllm.moe_forward_shared due to https://github.com/vllm-project/vllm/pull/33184 8. Add patch to fix the problem "got multiple values for keyword argument 'add_special_tokens'" due to https://github.com/vllm-project/vllm/pull/32863 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-02-05 19:31:17 +08:00
Wang Kunpeng	13c4a9c78b	[bugfix]Fix accuracy issue in PCP/DCP with speculative decoding (#6491 ) ### What this PR does / why we need it? This PR fixes an accuracy issue that occurs when using Prefill/Decode Context Parallelism (PCP/DCP) in conjunction with speculative decoding (MTP). The issue is caused by an irregular attention mask shape when both features are enabled. The fix involves flattening the `block_table` for speculative decoding requests under PCP/DCP to ensure a regular attention mask. This PR also introduces a `use_cp` property for cleaner code and updates dummy runs to handle this scenario correctly. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix that improves accuracy and should not have user-facing API changes. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2026-02-05 10:06:14 +08:00
Nengjun Ma	78fad4e348	[Refactor] MLP weight prefetch for consistency with MoE Model's prefetching in terms of code and usage (#6442 ) ### What this PR does / why we need it? Refactor MLP weight prefetch for consistency with the MoE model's prefetching in terms of code and usage. The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP, VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; usage is now as follows: --additional-config '{"weight_prefetch_config": { "enabled": true, "prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}' ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-02-04 09:08:18 +08:00
wangxiyuan	eeedf7c503	[Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470 ) ### What this PR does / why we need it? This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This involves: - Updating the `VLLM_TAG` in all `Dockerfile`. - Updating the vLLM version in `docs/source/conf.py`. - Removing conditional code paths specific to `v0.14.1` across the codebase, which simplifies maintenance. - Fix `TypeError: MMEncoderAttention.__init__() got an unexpected keyword argument 'multimodal_config'` due to https://github.com/vllm-project/vllm/pull/31972. - Fix `_shared_experts: 'NoneType' object is not callable` due to https://github.com/vllm-project/vllm/pull/32082 by https://github.com/vllm-project/vllm-ascend/pull/6335. - Fix `ReshapeAndCacheOperation setup failed!` due to https://github.com/vllm-project/vllm/pull/25954 by overriding attention metadata slots. This upgrade is necessary to keep the project aligned with the latest features, bug fixes, and API changes in the vLLM project. ### Does this PR introduce _any_ user-facing change? No, this is an internal dependency update and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with these changes, ensuring that all existing tests are successful with the new vLLM version. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` co-authored-by: shen-shanshan <467638484@qq.com> --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-02 15:57:55 +08:00
wangxiyuan	b4aafd4293	[Core][Misc] Clean up ProfileExecuteDuration (#6461 ) ### What this PR does / why we need it? This PR removes the custom `ProfileExecuteDuration` utility and its usages across the codebase. This utility was used for profiling execution duration of different stages in the inference process. It is replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`, which integrates with PyTorch's profiler. This change simplifies the code by removing a custom implementation in favor of an upstream utility, improving maintainability. Associated documentation and tests for `ProfileExecuteDuration` are also removed. ### Does this PR introduce _any_ user-facing change? `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now. ### How was this patch tested? CI passed. The changes are a cleanup and replacement with a standard utility. Existing tests cover the functionality. The removed feature had its own tests which are also removed. Related RFC: #5304 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:01 +08:00
Qiu	638cae824d	[bugfix](CP) Fix and unify the PD request discrimination logic. (#5939 ) ### What this PR does / why we need it? Since the PR (https://github.com/vllm-project/vllm/pull/32118) has modified the criteria for judging Prefill and Decode requests in vLLM, PCPManager needs to synchronize with this standard. As PCPManager involves multiple calculations of PD request counts, this PR attempts to consolidate the related logic and update the PD request count once per batch. ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ``` - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-31 10:26:02 +08:00
liziyu	d252e4f5ec	[P/D] Using the cache load operator to replace the index select operator. (#6295 ) ### What this PR does / why we need it? Using the cache load operator to replace the index select operator. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-30 14:27:53 +08:00
zxr2333	14bd55f30c	[P/D][BugFix] Fix layerwise P/D request_id error (#6360 ) ### What this PR does / why we need it? Fix layerwise Connector P/D request_id error, due to vllm pr: https://github.com/vllm-project/vllm/pull/27987, which will add a random suffix to request_id in EngineCore. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>	2026-01-29 20:19:05 +08:00
Wang Kunpeng	c498cea22d	[refactor] refactor execute_model and _dummy_run methods (#6043 ) ### What this PR does / why we need it? The structure of the `execute_model` and `_dummy_run` methods in NPUModelRunner differs greatly from that in GPUModelRunner. To align with GPUModelRunner, this PR makes the following changes: Split the `_prepare_inputs` method into `_prepare_inputs`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_preprocess`. Modify `_generate_process_reqs_hidden_states` to `_model_forward`. Align the implementation of the `postprocess` phase. Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449 Co-authored-by: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>	2026-01-27 22:27:01 +08:00
pu-zhe	21b6779a33	[UT]: refactoring 310p ops ut (#6296 ) ### What this PR does / why we need it? Refactor the swiglu and rms_norm unit test cases for 310P and 910B. Apply attention_v1 get_kv_cache_shape and build metadata on all platforms. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? CI UT test - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: pu-zhe <zpuaa@outlook.com>	2026-01-27 16:31:51 +08:00
Angazenn	5e34c70ffc	[Misc] Removes unnecessary graph size re-initialization (#6280 ) ### What this PR does / why we need it? This PR removes `update_default_aclgraph_sizes`. In earlier versions, we added this function to change the default `cudagraph_capture_sizes` because `_npu_paged_attention` degrades significantly on certain shapes (which are included in vLLM's default `cudagraph_capture_sizes`). Since FIA is now the default attention op and does not show this performance degradation, the override is no longer needed. Keeping it could also cause conflicts when a small `cudagraph_capture_sizes` (< 20) is set. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-27 14:38:07 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors: https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅ Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅ Upgrade vllm commit to: 0120 (148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors: https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅ Upgrade vllm commit to: 0121 (f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅ Upgrade vllm commit to: 0122 (8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅ Upgrade vllm commit to: 0123 (dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
yuxinshan	0bb1f91c2c	[Feature] Mooncake connector get remote ptp size (#5822 ) ### What this PR does / why we need it? To support elastic scaling with the Mooncake connector, different nodes must be able to use different TP sizes. This PR therefore transfers prefill node information, such as the TP size, through the request's kv_transfer_params. Decode nodes read the prefill TP size from the request's kv_transfer_params instead of from the Mooncake connector configuration. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: yuxinshan <syx_ctyg@126.com> Signed-off-by: CalvinXKY <kyxiezju@163.com>	2026-01-26 14:28:33 +08:00
LI SHENGYONG	611e223b7d	[EPLB][Bugfix] EPLB support fp/bf16 (#5531 ) ### What this PR does / why we need it? EPLB support dtype of fp/bf16. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? w8a8_dynamic Baseline: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| w8a8_dynamic eplb: \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| The fp16 conversation is normal. The fp16 test is in progress. Baseline fp16 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| eplb fp16 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-26 14:28:16 +08:00
wangxiyuan	4e3919e965	Reapply "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) (#6231 ) This reverts commit `95649344aa`. The CI failure isn't related to this change. Let's reapply it. - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-26 09:04:54 +08:00
wangxiyuan	95649344aa	Revert "[Refactor] Unify full-graph parameter update logic (#6041 )" (#6227 ) This reverts commit `8966a99710`. It breaks the test `tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]` - vLLM version: v0.14.0 - vLLM main: `d68209402d`	2026-01-25 15:25:38 +08:00
Shaoxu Cheng	fbae41697e	[310P]: refactoring for 310p kvcache and some ops class (#6117 ) ### What this PR does / why we need it? * Refactor the LayerNorm and activation operator classes to decouple the 310P device implementation from the main branch. * Refactor `mm_encoder_attention` on 310P to use the `torch_npu._npu_flash_attention_unpad` operator. * Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P so they are no longer padded to 16× alignment. * Refactor `model_runner` on 310P to align the KV-cache initialization logic with the mainline implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? use the e2e tests. - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-24 20:34:29 +08:00
LICO67373	8966a99710	[Refactor] Unify full-graph parameter update logic (#6041 ) ### What this PR does / why we need it? Refactor: Unify full-graph parameter update logic This PR consolidates the scattered full-graph parameter update logic into a unified approach, improving code architecture and eliminating duplication. Key improvements: 1. Unified interface - Create `update_full_graph_params` as the single entry point for all full-graph updates - Replace multiple scattered update calls with one unified function - Remove ~50 lines of duplicated if-else logic across `model_runner_v1.py` and `eagle_proposer.py` 2. Better architecture - Move update logic to respective Backend classes (`AscendAttentionBackend`, `AscendMLABackend`) - Each Backend manages its own parameter update logic internally - Simplify caller code to just dispatch to the appropriate Backend 3. Cleaner parameter handling - Remove unnecessary `pcp_size` and `dcp_size` parameter passing - Get parallel configuration directly from distributed groups - Consistent with how other parts of the codebase obtain these values Why we need it: - Maintainability: Future changes only need to be made in one place per Backend - Code quality: Follows DRY principle and Single Responsibility Principle - Readability: Cleaner, more intuitive code structure ### Does this PR introduce _any_ user-facing change? No. This is a pure refactoring with no functional changes - same behavior, cleaner code. ### How was this patch tested? - All existing unit tests pass with updated mocks - No new tests needed (pure refactoring, no behavior changes) - CI validates correctness --- - vLLM version: v0.13.0 Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: drslark <slarksblood@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-24 20:12:57 +08:00
Angazenn	019a2fe6e6	[Eagle3]enhance skipping dp allreduce and add it into eagle proposer (#6192 ) ### What this PR does / why we need it? This PR: 1. Enhances the logic of `_skip_all_reduce_across_dp_group` to skip all CPU DP allreduce for dense models. This also serves purpose 2. 2. Adds `_skip_all_reduce_across_dp_group` to eagle_proposer. Models like Qwen3-235b now support eagle3 spec decode. A typical PD-disaggregation setup for these MoE models often uses `dp_size > 1`, which requires `set_forward_context` to call a CPU DP allreduce to retrieve `num_tokens_across_dp` in all cases. Skipping this allreduce greatly improves performance. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-24 11:29:42 +08:00
yjmyl	e90b14140b	[feature] add_rms_norm support bias (#5790 ) ### What this PR does / why we need it? This PR replaces addRmsNorm and Add with addRmsNormBias, which is more efficient. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Full Test Pass - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: Chen_HaoWen <chenhaowen12@huawei.com> Co-authored-by: Chen_HaoWen <chenhaowen12@huawei.com>	2026-01-23 21:09:54 +08:00
Qiu	749e24f81e	[bugfix] align max_num_batched_tokens with tppcp when using FLASHCOMM1 (#6000 ) ### What this PR does / why we need it? Align max_num_batched_tokens with tppcp when using FLASHCOMM1 to avoid assert error in `NPUModelRunner._dummy_run`. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-23 14:19:49 +08:00
Cao Yi	a69ef10c3a	[Refactor] Quantization Module Refactor (#5738 ) ### Summary This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes. ### Key Changes #### 1. Modular Directory Structure \| Before \| After \| \|--------\|-------\| \| Flat file structure with mixed responsibilities \| Organized into `methods/` subpackage for schemes \| \| Single `quant_config.py` (600+ lines) \| Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` \| \| `utils.py` with scheme lookup logic \| `methods/registry.py` with decorator-based registration \| #### 2. Registry-Based Scheme Discovery Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern: ```python # Before: Manual dictionary mapping ASCEND_QUANTIZATION_METHOD_MAP = { "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...}, ... } # After: Decorator-based registration @register_scheme("W8A8_DYNAMIC", "linear") class AscendW8A8DynamicLinearMethod(AscendLinearScheme): ... ``` #### 3. Abstract Base Classes Introduced three abstract base classes in `methods/base.py`: - `AscendLinearScheme` - Base for linear layer quantization - `AscendMoEScheme` - Base for MoE layer quantization - `AscendAttentionScheme` - Base for attention layer quantization #### 4. Separated Config and Wrapper Classes - Config classes (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection - Wrapper classes (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes #### 5. 
Cleaner Public API ```python # New clean module interface from vllm_ascend.quantization import ( AscendModelSlimConfig, AscendCompressedTensorsConfig, ) from vllm_ascend.quantization.methods import get_scheme_class ``` ### Architecture Diagram ```mermaid classDiagram direction TB class QuantizationConfig { <<vLLM Interface>> +get_quant_method() } class AscendModelSlimConfig { +quant_description +get_quant_method() -create_scheme_for_layer() } class AscendCompressedTensorsConfig { +target_scheme_map +get_quant_method() -_get_scheme_from_parts() } class AscendLinearMethod { <<Wrapper>> +quant_method: AscendLinearScheme +create_weights() +apply() } class AscendFusedMoEMethod { <<Wrapper>> +quant_method: AscendMoEScheme +create_weights() +apply() } class AscendLinearScheme { <<Abstract>> +get_weight()* +apply()* +get_pertensor_param() +get_perchannel_param() } class AscendMoEScheme { <<Abstract>> +get_weight()* +get_dynamic_quant_param()* +apply()* } class W8A8DynamicLinear { +get_weight() +apply() } class W8A8DynamicMoE { +get_weight() +apply() } QuantizationConfig <\|-- AscendModelSlimConfig QuantizationConfig <\|-- AscendCompressedTensorsConfig AscendModelSlimConfig ..> AscendLinearMethod : creates AscendModelSlimConfig ..> AscendFusedMoEMethod : creates AscendCompressedTensorsConfig ..> AscendLinearMethod : creates AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates AscendLinearMethod o-- AscendLinearScheme : delegates to AscendFusedMoEMethod o-- AscendMoEScheme : delegates to AscendLinearScheme <\|-- W8A8DynamicLinear AscendMoEScheme <\|-- W8A8DynamicMoE ``` ### Scheme Registration Flow ```mermaid sequenceDiagram participant Module as Scheme Module participant Registry as _SCHEME_REGISTRY participant Config as QuantConfig participant Wrapper as Wrapper Class Note over Module: At import time Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear") Registry->>Registry: Store (quant_type, layer_type) -> Class Note over Config: At runtime 
Config->>Config: Determine quant_type from description Config->>Registry: get_scheme_class(quant_type, layer_type) Registry-->>Config: Return scheme class Config->>Config: scheme = scheme_cls() Config->>Wrapper: Create wrapper with scheme Wrapper-->>Config: Return wrapper instance ``` ### File Changes Summary \| Original Files \| Refactored Files \| \|----------------\|------------------\| \| `__init__.py` (empty) \| `__init__.py` (exports public API) \| \| `quant_config.py` \| `modelslim_config.py` + `wrappers.py` \| \| `compressed_tensors/` \| `compressed_tensors_config.py` \| \| `utils.py` \| `methods/registry.py` \| \| `w8a8_dynamic.py` \| `methods/w8a8_dynamic.py` \| \| `w8a8.py` \| `methods/w8a8_static.py` \| \| `w4a4_flatquant_dynamic.py` \| `methods/w4a4_flatquant.py` \| \| ... \| `methods/base.py` (new) \| ### Benefits 1. Extensibility: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator 2. Maintainability: Clear separation between config parsing, wrapper logic, and scheme implementation 3. Testability: Abstract base classes enable easier unit testing and mocking 4. Discoverability: Registry pattern makes it easy to list all supported schemes 5. Reduced Coupling: Config classes no longer need to know about all scheme implementations ___ - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-23 14:13:47 +08:00
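The decorator-based registration described in the commit above can be illustrated with a minimal, self-contained sketch. It reuses the names `register_scheme`, `get_scheme_class`, `_SCHEME_REGISTRY`, `AscendLinearScheme`, and `AscendW8A8DynamicLinearMethod` from the commit message, but the bodies, the duplicate-registration check, and the error message are assumptions made for illustration; this is not the actual code in `methods/registry.py`.

```python
from typing import Callable, Dict, Tuple

# Illustrative stand-in for the registry: (quant_type, layer_type) -> scheme class.
_SCHEME_REGISTRY: Dict[Tuple[str, str], type] = {}


def register_scheme(quant_type: str, layer_type: str) -> Callable[[type], type]:
    """Class decorator that stores the class under (quant_type, layer_type)."""

    def decorator(cls: type) -> type:
        key = (quant_type, layer_type)
        # Assumption: duplicate registrations are rejected rather than overwritten.
        if key in _SCHEME_REGISTRY:
            raise ValueError(f"scheme already registered for {key}")
        _SCHEME_REGISTRY[key] = cls
        return cls  # return the class unchanged so it stays usable directly

    return decorator


def get_scheme_class(quant_type: str, layer_type: str) -> type:
    """Look up the scheme class registered for (quant_type, layer_type)."""
    try:
        return _SCHEME_REGISTRY[(quant_type, layer_type)]
    except KeyError:
        raise KeyError(
            f"no scheme registered for {(quant_type, layer_type)}"
        ) from None


class AscendLinearScheme:
    """Placeholder base class; the real one defines more abstract methods."""

    def apply(self, x):
        raise NotImplementedError


@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    def apply(self, x):
        return x  # placeholder body, no real quantization happens here


# Runtime lookup, mirroring the "Scheme Registration Flow" in the commit message.
scheme_cls = get_scheme_class("W8A8_DYNAMIC", "linear")
scheme = scheme_cls()
```

Registration happens as a side effect of importing the module that defines the scheme, so a config class only needs the two string keys and never imports individual scheme classes.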
dsxsteven	8378bc28b0	[Misc] Remove CP Redundant Variables after FIA operator enables for CANN 8.5 (#6013 ) ### What this PR does / why we need it? PCP/DCP splits the KV cache across different cards. With the parameter cp-kv-cache-interleave-size, the first cp-kv-cache-interleave-size tokens are cached on card 0, and so on. However, if there are too few tokens, some cards store no key-value pairs, which results in zero or corrupted values and precision issues. Additional operations are currently used to avoid this precision problem. Now that the FIA operator is integrated in mla_cp._forward_decode and CANN has been updated to 8.5.0, these additional operations can be removed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Passed all CI with CANN 8.5.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: dsxsteven <dsxsteven@sina.com> Signed-off-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>	2026-01-23 14:13:12 +08:00
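The empty-card situation described in the commit above can be made concrete with a small sketch. It assumes that "and so on" means round-robin placement in blocks of `interleave_size` tokens; the commit message does not confirm that, and the function names here are hypothetical rather than taken from vllm-ascend.

```python
from typing import List


def card_for_token(token_idx: int, interleave_size: int, cp_size: int) -> int:
    """Card index for a token, assuming round-robin blocks of interleave_size."""
    return (token_idx // interleave_size) % cp_size


def tokens_per_card(num_tokens: int, interleave_size: int, cp_size: int) -> List[int]:
    """Count how many tokens each card would cache under that assumption."""
    counts = [0] * cp_size
    for token_idx in range(num_tokens):
        counts[card_for_token(token_idx, interleave_size, cp_size)] += 1
    return counts


# A short sequence leaves some cards with no KV entries at all.
short = tokens_per_card(num_tokens=10, interleave_size=8, cp_size=4)
# A long enough sequence spreads evenly across the cards.
full = tokens_per_card(num_tokens=64, interleave_size=8, cp_size=4)
print(short, full)
```

With 10 tokens, an interleave size of 8, and 4 cards, the counts are `[8, 2, 0, 0]`: cards 2 and 3 hold no key-value pairs, which is the case the removed workaround guarded against.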
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00


590 Commits