xc-llm-ascend

Author	SHA1	Message	Date
wangqiankun13	ebb940691f	[Feature] Adapt DispathGmmCombineDecode operator to align with weight scale dtype of small operators. [RFC: issue 5476] (#5755 ) ### What this PR does / why we need it? [Feature] Adapt DispathGmmCombineDecode operator to align with weight scale dtype of small operators. - Before: weight scale must be float32 - After: weight scale can be float32/float16 when x is float16, float32/bfloat16 when x is float32/bfloat16. And w1 scale can use a different dtype from w2 scale. For more info about this operator, please refer to RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Perf > When scale is of type fp16 or bf16, it will be cast to fp32 internally within the operator, while the subsequent computations remain unchanged. Therefore, this PR will introduce an additional cast operation but halve the memory copy operations for scale. Furthermore, since the scale data is only a few KB in size and participates in relatively few computations, its impact is almost negligible compared to major operations like matrix multiplication. Thus, the theoretical performance change should be minimal. Test single-operator cases from qwen3-235b: - single A3 node(ep16), 64 moe experts, 4 experts / die (like qwen3-235b ep32) - batch=18/32, token_hidden_size 4096, moe_intermediate_size 1536 The test was conducted for 100 rounds, and the average of the last 95 rounds was taken. \| \| bs18(us)\| bs32(us)\| \| -----\| -----\| -----\| \|Without this PR\|96.28\|108.83\| \|With this PR\|96.06\|107.90\| Note: Single-operator benchmarks represent an ideal scenario. They are usually only useful for referencing relative changes and may not fully align with performance data observed within the full model. 
#### Acc test qwen3-235b eplb on a single A3 node(ep16), with dispatch_gmm_combine_decode \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-19 16:10:43 +08:00
LICO67373	687df88151	[Refactor] Move AttentionSpec initialization to Attention module (#5834 ) ### What this PR does / why we need it? This PR refactors `get_kv_cache_spec` method to delegate AttentionSpec creation to each attention module's own `get_kv_cache_spec()` method, aligning with the vllm source code structure. Changes: - Simplify `get_kv_cache_spec` in `model_runner_v1.py` and `cpu_offload_connector.py` - Remove manual `AttentionType` checks for `Attention` modules - Delegate spec creation to each attention module's `get_kv_cache_spec` method directly - Let `MambaBase` layers use their own `get_kv_cache_spec` method - Keep `use_sparse` hack for `MLAAttention` (DeepSeek DSA mode) as Ascend-specific handling This change follows RFC #5463 item 12: move AttentionSpec to Attention module. - Fixes #5463 (item 12) ### Does this PR introduce _any_ user-facing change? No. This is an internal refactoring that simplifies code structure without changing any external behavior. ### How was this patch tested? - Syntax validation passed via `python -m py_compile` - CI tests will verify the changes work correctly with existing test cases - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com>	2026-01-19 14:22:18 +08:00
LI SHENGYONG	83de5385b4	[EPLB][Bugfix] policy_swift_balancer bugfix and renaming (#5897 ) ### What this PR does / why we need it? 1. Rename dynamic_ep to default_eplb. 2. Rename dynamic_ep_v2 to swift_balancer 3. Discard func compose_expert_update_info_bipartite. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 05:47:40 +00:00
SILONG ZENG	b27774dbd6	[CI]fix for lint CI (#5982 ) ### What this PR does / why we need it? fix lint CI - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-19 09:49:28 +08:00
Icey	c929bd1e8d	[Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (#5034 ) This PR adds the `MatmulAllreduceRmsnorm` operator and introduces a graph fusion pass for `matmul_allreduce_rmsnorm` operations. The implementation includes a new configuration flag and a pattern matching pass using `torch._inductor.pattern_matcher`. Co-authored-by: Trunrain [270250579@qq.com](mailto:270250579@qq.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: tongrunze <t00574058@china.huawei.com>	2026-01-19 09:28:07 +08:00
meihanc	9cad1a8349	[Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (#5928 ) ### What this PR does / why we need it? Migrate the torch profiler configuration from deprecated environment variables (`VLLM_TORCH_PROFILER_DIR`, `VLLM_TORCH_PROFILER_WITH_STACK`, `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`) to the explicit `ProfilerConfig` object, aligning with vLLM's configuration best practices. The profiler environment variable approach is deprecated in vLLM and will be removed in v0.14.0 or v1.0.0. ### Does this PR introduce _any_ user-facing change? Yes. Developers who want to collect profiler data should use `--profiler-config` instead of `VLLM_TORCH_PROFILER_DIR`. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-19 09:27:55 +08:00
LI SHENGYONG	bc1f6713e7	[EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (#5933 ) ### What this PR does / why we need it? 1. Move the logic of expert mapping forward to prevent shotgun changes 2. Disable the update of expert map. ### How was this patch tested? a2 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| GPQA_diamond \| 53064e \| accuracy \| gen \| 73.23 \| a3 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 09:24:25 +08:00
LI SHENGYONG	9fed2636cb	[EPLB][Nightly][Bugfix] Get expert from moe layer only (#5908 ) ### What this PR does / why we need it? 1. If the model has dense layers, the current code will attempt to obtain the routing experts of the dense layers, which will cause an error. This should be fixed by modifying the code to skip the dense layers when obtaining the routing experts. 2. The global_expert_map that the function directly outputs affects the performance of dsv3.2. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? DeepSeek V3.1 conversation is normal. #### aime precision test (dsv3.1) baseline without eplb \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 66.67 \| eplb \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 70.00 \| - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 09:23:28 +08:00
Shanshan Shen	ad3a1eaf70	[Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (#5855 ) ### What this PR does / why we need it? As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339, multi-modal inference on vllm-ascend may lead to OOM issues in some scenarios. Our analysis shows that this is due to memory fragmentation caused by frequent dynamic memory size adjustments during runtime. During inference, non-torch memory gradually increases from around 1G to over 5G until the OOM issue occurs. We find that this problem can be resolved by directly setting `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue. Thus, we decided to set this value by default, except in RL (sleep mode) scenarios. It is also worth noting that this environment variable may have more than one key-value pair. We should append `",expandable_segments:True"` to the current configs. For example: ```python PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True" ``` > [!NOTE] > `max_split_size_mb` or `garbage_collection_threshold` cannot be enabled together with `expandable_segments=True`. ### Does this PR introduce _any_ user-facing change? Users do not need to set `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually any more. ### How was this patch tested? I have built a dataset consisting of my own photographs, which can stably reproduce this OOM issue on Qwen3-VL series models. After applying this PR, this problem is resolved and the amount of non-torch memory stays stable at around 1G throughout the whole inference. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-19 09:17:31 +08:00
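The append rule described in #5855 can be sketched as below. This is only an illustration under stated assumptions, not the code from the PR: the helper name `with_expandable_segments` is hypothetical, and leaving the value untouched when `max_split_size_mb` or `garbage_collection_threshold` is present is an assumption drawn from the note in the commit message.

```python
import os

# Keys that the commit message says cannot be combined with expandable_segments.
CONFLICTING_KEYS = ("max_split_size_mb", "garbage_collection_threshold")


def with_expandable_segments(current: str) -> str:
    """Return an allocator config string with expandable_segments:True appended.

    Hypothetical helper: the value is returned unchanged if the key is already
    set or if a conflicting option is present.
    """
    entries = [entry.strip() for entry in current.split(",") if entry.strip()]
    keys = [entry.split(":", 1)[0] for entry in entries]
    if "expandable_segments" in keys:
        return ",".join(entries)
    if any(key in CONFLICTING_KEYS for key in keys):
        return ",".join(entries)
    entries.append("expandable_segments:True")
    return ",".join(entries)


if __name__ == "__main__":
    # Read the current setting (empty if unset) and show the merged value.
    current = os.environ.get("PYTORCH_NPU_ALLOC_CONF", "")
    print(with_expandable_segments(current))
```

Splitting on commas and comparing keys, rather than doing a substring search, avoids appending a second `expandable_segments` entry when the user has already set one explicitly.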
SILONG ZENG	329961b375	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #2 ) (#5977 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| `vllm_ascend/attention/attention_mask.py` \| \| `vllm_ascend/attention/attention_v1.py` \| \| `vllm_ascend/attention/context_parallel/attention_cp.py` \| \| `vllm_ascend/attention/context_parallel/common_cp.py` \| \| `vllm_ascend/attention/context_parallel/mla_cp.py` \| \| `vllm_ascend/attention/utils.py` \| \| `vllm_ascend/batch_invariant.py` \| \| `vllm_ascend/device/device_op.py` \| \| `vllm_ascend/device_allocator/camem.py` \| \| `vllm_ascend/envs.py` \| - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-19 08:59:46 +08:00
Song Zhixin	2b6dc100b5	Eagle3 mm support, enablement on qwen3vl (#4848 ) ### What this PR does / why we need it? Following PR [https://github.com/vllm-project/vllm/pull/20788](https://github.com/vllm-project/vllm/pull/20788), this PR adds Eagle3 multi-modal support and enables it on qwen3vl. Target model: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct). Eagle3 model: [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv vLLM with eagle3 : ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{ "method": "eagle3", "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3", "num_speculative_tokens": 3 }' ``` vLLM without eagle3 : ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images ``` bench: ``` vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: jesse <szxfml@gmail.com>	2026-01-19 08:58:07 +08:00
wangxiaoteng888	fff5df3efe	[P/D] Fix the node crash caused by the force-free secondary release request. (#5968 ) ### What this PR does / why we need it? The force-free secondary release request causes the node to crash. When requests are pulled too quickly, they should not be added to the delay-free queue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By CI - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-17 18:49:27 +08:00
Jade Zheng	22f253142a	[Feature] Support fine-grained shared expert overlap (#5482 ) Fine-grained control over shared expert overlap to prevent resource contention. - vLLM version: v0.13.0 - vLLM main: `5326c89803` --------- Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>	2026-01-17 11:53:22 +08:00
lidenghui1110	48e10de8c9	[Bugfix] fix cpu offload hang with tp=1 (#5963 ) ### What this PR does / why we need it? As issue #5948 reported, when using cpu_offload_connector with TP=1, the server hangs on startup. We found several bugs to fix. 1. Some crashes were caused by code changes that came with vLLM version updates. Some of them can be fixed as in #5948, and this PR fixes all of them. 2. For the hang described in #5948, the direct cause is that in cpu_offload_connector the RPC client uses the same client id in the scheduler and the worker when tensor_parallel_size is 1. This PR forces the client ids to be different, which fixes the hang. - Why didn't we find this hang before? Because our old test cases used --distributed-executor-backend mp or tensor_parallel_size > 1, so the scheduler and workers were different processes and the client ids built by `worker-{os.getpid()}` were not the same. But with tensor_parallel_size=1, vLLM uses uniproc as the distributed-executor-backend by default, so the scheduler and worker are in the same process, the client ids are the same, and the server hangs. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: lidenghui <lidenghui1110@gmail.com>	2026-01-17 11:50:13 +08:00
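The client id collision behind #5963 can be reproduced in a few lines. The sketch below is an assumption-laden illustration: the commit message only says the ids are forced to be different, so the role prefix used here (and both function names) are hypothetical and not necessarily what the PR does.

```python
import os


def legacy_client_id() -> str:
    """Id scheme quoted in the commit message: depends only on the process id."""
    return f"worker-{os.getpid()}"


def role_scoped_client_id(role: str) -> str:
    """Hypothetical alternative: include the caller's role so that two clients
    living in the same process still get distinct ids."""
    return f"{role}-{os.getpid()}"


if __name__ == "__main__":
    # With the uniproc executor, scheduler and worker share one process,
    # so the legacy scheme yields the same id for both.
    print(legacy_client_id() == legacy_client_id())
    print(role_scoped_client_id("scheduler"), role_scoped_client_id("worker"))
```

Any scheme that adds a per-client component (role name, counter, or random suffix) would avoid the collision; the role prefix is just the simplest to read.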
Shaoxu Cheng	1ffca8673f	[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 ) ### What this PR does / why we need it? Add basic 310p support. Only dense models work with eager mode now. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-01-17 11:49:18 +08:00
Angazenn	7feb74590b	Revert "[bugfix]limit graph replay sync (#5761 )" (#5965 ) ### What this PR does / why we need it? reverts #5761 to fix accuracy issues when using piecewise graph mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: Angazenn <supperccell@163.com>	2026-01-16 23:29:35 +08:00
SILONG ZENG	52086394ae	[Lint]Style: Convert `vllm-ascend/compilation` to ruff format (#5912 ) ### What this PR does / why we need it? Convert `vllm-ascend/compilation` to ruff format. ### Does this PR introduce _any_ user-facing change? During this migration, we encountered some errors in our CI and testing environments, such as: ``` vllm_ascend/utils.py:653: in <module> def register_ascend_customop(vllm_config: VllmConfig \| None = None): ^^^^^^^^^^^^^^^^^ E TypeError: unsupported operand type(s) for \|: 'NoneType' and 'NoneType' ``` 1. Root Cause Analysis: The project uses a common pattern to break circular dependencies: ```python if TYPE_CHECKING: from vllm.config import VllmConfig else: VllmConfig = None # Placeholder assigned at runtime ``` When Python parses the function definition `def register_ascend_customop(vllm_config: VllmConfig \| None)`, it attempts to evaluate the expression `VllmConfig \| None`. Since `VllmConfig` is assigned `None` at runtime, the expression effectively becomes `None \| None`. In Python, `None` is an instance of `NoneType`. While the `\|` operator is implemented for Type objects (classes), it is not supported for `NoneType` instances, leading to the `TypeError` shown above. 2. Solution: To maintain the modern `\|` syntax required by our new linting standards while preserving our dependency management strategy, I have introduced: ```python from __future__ import annotations ``` at the top of the affected files. This enables Postponed Evaluation of Annotations (PEP 563). 3. Impact and Benefits: - By enabling `annotations`, Python no longer executes the `VllmConfig \| None` operation during module load. Instead, it stores the annotation as a string literal, completely avoiding the `None \| None` calculation. - We can keep the `VllmConfig = None` placeholders. This ensures that other modules can still import these symbols without triggering an `ImportError`, maintaining a stable dependency graph. 
- IDEs and static type checkers (MyPy/Pyright) continue to resolve the types correctly. This allows us to use modern syntax without sacrificing type safety or runtime stability. - The only side effect is that `__annotations__` will now return strings instead of type objects. Since this module does not use runtime type enforcement or reflection, this change has zero negative impact on existing functionality. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-16 20:57:46 +08:00
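The effect described in #5912 can be checked with a small stand-alone sketch. The module under test is kept in a source string only so that the `from __future__` import can sit at the top of its own compilation unit; the `load_namespace` helper and the trimmed function body are illustrative assumptions, not code from the repository.

```python
import textwrap

# A stand-in module that mirrors the pattern quoted in the commit message.
MODULE_SOURCE = textwrap.dedent(
    """
    from __future__ import annotations

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        from vllm.config import VllmConfig
    else:
        VllmConfig = None  # placeholder assigned at runtime


    def register_ascend_customop(vllm_config: VllmConfig | None = None):
        return vllm_config
    """
)


def load_namespace(source: str) -> dict:
    """Compile and execute module source, returning its globals."""
    namespace: dict = {}
    exec(compile(source, "<sketch>", "exec"), namespace)
    return namespace


def none_union_fails() -> bool:
    """Show that evaluating `None | None` eagerly raises TypeError."""
    try:
        None | None
    except TypeError:
        return True
    return False


namespace = load_namespace(MODULE_SOURCE)
func = namespace["register_ascend_customop"]

if __name__ == "__main__":
    print(none_union_fails())
    # With postponed evaluation the annotation is kept as a string.
    print(func.__annotations__["vllm_config"])
```

Code that needs real type objects at runtime would have to resolve the strings explicitly, for example with `typing.get_type_hints`, which would still fail here while `VllmConfig` is the `None` placeholder.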
rjg-lyh	3af91e5ac4	[Bugfix] Fix the input constraints checks for the mlapo and bmm_transpose operators (#5764 ) ### What this PR does / why we need it? This PR fixes the input constraints checks for the mlapo and bmm_transpose operators. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with newly added and existing tests. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` ### Perf 64K/3K, 1P1D, bs=32 before this PR: TPOT 29ms, TTFT 47s, TPS 606 token/s after this PR: TPOT 29ms, TTFT 48s, TPS 636 token/s Signed-off-by: rjg-lyh <1318825571@qq.com>	2026-01-16 09:52:48 +00:00
zhangxinyuehfad	4f446aec4c	[CI] Add DeepSeek-V3.2-W8A8-Pruning e2e test (#5922 ) ### What this PR does / why we need it? 1. Fix DeepSeek-V3.2-W8A8-Pruning mtp 2. Add DeepSeek-V3.2-W8A8-Pruning e2e test ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-16 15:49:57 +08:00
lty	3cb0af0bcf	[Refactor]Refactor of vllm_ascend/distributed module (#5910 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 16:26:53 +08:00
Magnus	e8bbf72867	[Bugfix] Fix XliteModelRunner init failed when aclgraph is enabled (#5899 ) ### What this PR does / why we need it? Fix XliteModelRunner init failed when aclgraph is enabled. Ensure function graph_capture of vllm.v1.worker.gpu_model_runner is replaced. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: changdawei1 <changdawei3@huawei.com>	2026-01-15 15:40:28 +08:00
LI SHENGYONG	da958ee386	[EPLB]Eplb Config Renaming (#5533 ) ### What this PR does / why we need it? 1. Rename num_iterations_eplb_update to expert_heat_collection_interval. 2. Rename num_wait_worker_iterations to algorithm_execution_interval. 3. Rename init_redundancy_expert to num_redundant_experts because the variable with the same meaning in vLLM is named this way. 4. Delete gate_eplb because we don't need this feature. 5. Move eplb config into a dict in additional config. 6. Depend on pr5817 ### Does this PR introduce _any_ user-facing change? before this pr： `--additional-config '{"dynamic_eplb":true, "num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150, "init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'` after this pr: `--additional-config '{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000, "algorithm_execution_interval":150,"num_redundant_experts": 16, "expert_map_path": "xxx.json"}}'` ### How was this patch tested? #### test qwen3-235b eplb num_redundant_experts=16 without pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| with pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-15 10:26:44 +08:00
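The before/after `--additional-config` shapes in #5533 imply a mechanical migration, sketched below. The function `migrate_eplb_config` is a hypothetical helper written for illustration (vllm-ascend is not known to ship one); the key names are taken from the commit message, and dropping `gate_eplb` silently is an assumption.

```python
import json

# Old flat key -> new key inside "eplb_config" (names from the commit message).
RENAMED_KEYS = {
    "num_iterations_eplb_update": "expert_heat_collection_interval",
    "num_wait_worker_iterations": "algorithm_execution_interval",
    "init_redundancy_expert": "num_redundant_experts",
}
# Keys that keep their name but move into "eplb_config".
MOVED_KEYS = ("dynamic_eplb", "expert_map_path")
# Keys removed by the PR; assumed here to be dropped without error.
REMOVED_KEYS = ("gate_eplb",)


def migrate_eplb_config(old: dict) -> dict:
    """Convert a pre-#5533 flat additional config into the nested layout."""
    eplb_related = set(RENAMED_KEYS) | set(MOVED_KEYS) | set(REMOVED_KEYS)
    # Unrelated options stay at the top level.
    new = {key: value for key, value in old.items() if key not in eplb_related}
    eplb_config = {}
    for key in MOVED_KEYS:
        if key in old:
            eplb_config[key] = old[key]
    for old_key, new_key in RENAMED_KEYS.items():
        if old_key in old:
            eplb_config[new_key] = old[old_key]
    if eplb_config:
        new["eplb_config"] = eplb_config
    return new


if __name__ == "__main__":
    old_config = {
        "dynamic_eplb": True,
        "num_iterations_eplb_update": 4000,
        "num_wait_worker_iterations": 150,
        "init_redundancy_expert": 16,
        "expert_map_path": "xxx.json",
    }
    # The printed JSON can be passed to --additional-config.
    print(json.dumps(migrate_eplb_config(old_config)))
```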
Zetong Li	ea01aeaab7	[Refactor][EAGLE] 4/N extract common methods from eagle and mtp (#5870 ) ### What this PR does / why we need it? This PR aims to extract common methods from eagle_proposer and mtp_proposer. This is a small step towards merging eagle and mtp. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-15 10:24:35 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are broken by https://github.com/vllm-project/vllm/pull/32148, where vLLM also fails https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will re-enable those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
wangqiankun13	d840f153f4	[Bugfix] Fix acc bug when enabling dispatch_gmm_combine_decode and eplb (#5806 ) ### What this PR does / why we need it? Fix acc bug when enabling dispatch_gmm_combine_decode and eplb. After eplb, the expert table may change, so a mapping is needed, but fused_mc2 misses the mapping. For more info about this operator, please refer to RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Without this PR, qwen3-235b eplb with dispatch_gmm_combine_decode gets acc 3.33% on aime2024. With this PR, test qwen3-235b eplb on a single A3 node(ep16) without dispatch_gmm_combine_decode \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| with dispatch_gmm_combine_decode \| dataset \| version \| metric \| mode \| vllm-api-stream-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wangqiankun <wangqiankun13@huawei.com>	2026-01-15 09:21:18 +08:00
Ronald	7078dff691	[Feature] implement set_additional_forward_context for model runner v2 (#5720 ) ### What this PR does / why we need it? We implement set_additional_forward_context in the platform. This is necessary to reuse the GPU code in model runner v2 by inheriting methods from the GPU model runner v2. Please see the model runner v2 plan in #5208 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-01-15 09:18:28 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed, moving all kv_transfer related code into a dedicated folder, which has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
cookieyyds	51415aaa2f	[bugfix]support dsv3.2 enable both mtp and full_decode_only (#5849 ) ### What this PR does / why we need it? Support enabling both mtp and full_decode_only for dsv3.2. In PR5626, the branch logic was modified to align with the community. Previously, dsv32 could not reach inside the branch. Now an additional unpadded step is required, which changes positions and num_input_tokens and therefore the cos and sin dimensions in sfa_v1.py. This, in turn, causes an illegal shape error when they are passed to the operator. 1. The unpadded function is introduced to align with the community, and in the community the function does not have the parameters num_input_tokens and positions. 2. The positions are split and num_input_tokens=num_actual_tokens is used to match the function name unpad, so that the padded positions and num_input_tokens are not output. In fact, attention_v1 does not use these two parameters. We did this because we are concerned that someone might use these parameters later and run into shape mismatch issues without being aware of this, so we performed the cropping. At the point where they are acquired, positions are not cropped, so there is actually no need to add unpad in this case. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>	2026-01-14 22:57:38 +08:00
Qiu	a88937f5cb	[bugfix](cp) replace None with zeros/inf tensor to avoid TypeError (#5837 ) ### What this PR does / why we need it? When there is no kv cache on some devices, the `_compute_prefill_context` function returns `None`, which is unexpected. This PR replaces None with full zeros/-inf tensors to avoid TypeError. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```bash pytest tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py -k test_models_chunked_prefill_with_empty_kvcache ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-14 20:57:48 +08:00
zhaomingyu13	01805fbd7d	Revert "[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 )"(#5902 ) This reverts commit `d886b81971`. it breaks pd function - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>	2026-01-14 20:55:10 +08:00
LICO67373	2a6d95c389	[Cleanup] Remove dead code make_attention_mask function (#5818 ) ### What this PR does / why we need it? This PR removes the unused `make_attention_mask` function from `vllm_ascend/worker/v2/attn_utils.py`. Why it's dead code: - After PR #4870 (attention mask unification refactor), attention mask generation has been centralized in the `AttentionMaskBuilder` singleton class - The mask is now generated directly by metadata builders when needed (e.g., `AscendAttentionMetadataBuilder`, `AscendMLAMetadataBuilder`) - The `make_attention_mask` function is no longer called anywhere in the codebase - The function's parameters (including `attn_mask` and `spec_attn_mask`) were also removed from `build_attn_metadata` in the same refactor Changes: - Remove `make_attention_mask` function (24 lines) from `vllm_ascend/worker/v2/attn_utils.py` ### Does this PR introduce _any_ user-facing change? No. This is a code cleanup that removes dead code. No user-facing behavior changes. ### How was this patch tested? - Verified that `make_attention_mask` is not called anywhere in the codebase (via `grep`) - CI tests pass to ensure no regressions - The function has been unused since PR #4870 was merged - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: lico67373 <918688502@qq.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2026-01-14 16:52:51 +08:00
Ronald	e20813f441	[Feature] implement eagle spec decoding for model runner v2 (#5840 ) ### What this PR does / why we need it? This PR implements eagle spec decoding for model runner v2. Please see RFC https://github.com/vllm-project/vllm-ascend/issues/5208 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.13.0 --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-01-14 09:18:05 +08:00
LHXuuu	0415e694cd	[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W8A8 Int8 dynamic weight. 2. Specify W4A16 quantization configuration. Co-authored-by: menogrey 1299267905@qq.com Co-authored-by: kunpengW-code 1289706727@qq.com ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: menogrey <1299267905@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-01-14 09:17:26 +08:00
LI SHENGYONG	ecf2fa482e	[EPLB][Bugfix] Get expert map from layers (#5817 ) ### What this PR does / why we need it? The initialization method of expert_map used by the eplb module is different from that used by the fused_moe module. This PR deletes the expert_map initialization method used by the eplb module to make the initialization methods consistent. #### before bugfix self._expert_map=tensor([64, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,62, 63], device='npu:1', dtype=torch.int32) self.shared_dict["expert_maps"][0]=tensor([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]], dtype=torch.int32) ### How was this patch tested? #### qwen3-235B-w8a8 aime \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-14 09:16:51 +08:00
drslark	48ec97821a	[Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816 ) ### What this PR does / why we need it? Fixed an accuracy problem when using eagle3 with sp. The problem is described in https://github.com/vllm-project/vllm-ascend/issues/5825. It also adds a much more precise way to determine whether the drafter should use `sp` or not. Also, it changes the `eager` of the drafter to be a real `eager` in the frontend to avoid a `fx-graph` problem. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? For simplicity, we test it as in https://github.com/vllm-project/vllm-ascend/issues/5825. And we get the same result as `eagle3` with `sp` disabled. ```text -------------------------------------------------- total_num_output_tokens: 1000 num_drafts: 437 num_draft_tokens: 1311 num_accepted_tokens: 564 mean acceptance length: 2.29 -------------------------------------------------- acceptance at token 0: 0.62 acceptance at token 1: 0.40 acceptance at token 2: 0.27 acceptance at token 3: 0.00 acceptance at token 4: 0.00 acceptance at token 5: 0.00 ``` * vLLM version: v0.13.0 * vLLM main: `2f4e6548ef` Signed-off-by: drslark <slarksblood@qq.com>	2026-01-14 09:00:37 +08:00
liziyu	e1bed43cff	[P/D] bugfix for p node force free request (#5431 ) ### What this PR does / why we need it? Fix the bug where the P-node's scheduler dies after it force-frees a request due to a timeout and then receives the notification that the D-node has finished pulling that request's KV cache. The fix adds a list that records all requests. - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-01-14 08:51:31 +08:00
zhangxinyuehfad	f7b904641e	[Main2Main] Upgrade vllm commit to 0109 (#5752 ) ### What this PR does / why we need it? Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df) 1. remove `init_cached_hf_modules` due to https://github.com/vllm-project/vllm/pull/31786 2. fix the spec_decode e2e test broken by https://github.com/vllm-project/vllm/pull/29821 3. fix `vllm.v1.attention.backends.utils` due to https://github.com/vllm-project/vllm/pull/31891 4. fix `self.seq_lens - query_lens` on same device due to https://github.com/vllm-project/vllm/pull/31773 5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-13 19:14:43 +08:00
liziyu	eed9e366a7	[Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (#5846 ) ### What this PR does / why we need it? Fix the layerwise connector for the case where the decoder tp size is greater than the number of KV heads. In this case the prefiller should push the KV cache to all decoder NPUs. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-13 17:30:33 +08:00
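To illustrate why the prefiller has to reach more than one decoder rank in the entry above: when the decoder tp size exceeds the number of KV heads, each KV head is replicated across several ranks. The sketch below assumes a contiguous replica layout (rank `r` holds head `r // replicas`); the function name is hypothetical and the connector's real mapping may differ.

```python
def decoder_ranks_for_kv_head(kv_head: int, decoder_tp_size: int, num_kv_heads: int) -> list[int]:
    """Decoder TP ranks that hold a replica of `kv_head`.

    Assumption: tp size is a multiple of num_kv_heads and replicas of one
    head sit on consecutive ranks.
    """
    if decoder_tp_size % num_kv_heads != 0:
        raise ValueError("decoder_tp_size must be a multiple of num_kv_heads")
    replicas = decoder_tp_size // num_kv_heads
    start = kv_head * replicas
    return list(range(start, start + replicas))


# With 4 KV heads and decoder tp size 8, every head lives on 2 ranks,
# so pushing a head's KV cache to a single rank would leave the other stale.
for head in range(4):
    print(head, decoder_ranks_for_kv_head(head, decoder_tp_size=8, num_kv_heads=4))
```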
Shanshan Shen	d350c2ada6	[CustomOp][Perf] Merge Q/K split to simplify AscendApplyRotaryEmb for better performance (#5799 ) ### What this PR does / why we need it? - Use the upstream utility functions (`_pre_process()` and `_post_process()`) to reduce redundant code. (Find more details at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/common.py#L184-L213) - Merge Q/K split to simplify the logic of calling `torch_npu.npu_rotary_mul()` for better performance (TPOT has been reduced by 6.22%). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? #### ✅ Functional test Launch the server: ```bash export VLLM_USE_MODELSCOPE=True vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --dtype bfloat16 \ --limit-mm-per-prompt '{"image": 1}' \ --max-model-len 16384 \ --max-num-batched-tokens 16384 ``` Query the server: ```bash curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": [ {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}}, {"type": "text", "text": "What is the text in the illustrate? How does it look?"} ]} ], "max_tokens": 100 }' ``` Output: ``` {"id":"chatcmpl-b2911ab6989ef098","object":"chat.completion","created":1768202780,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design, with \"TONGYI\" being more prominent and \"Qwen\" being slightly smaller and positioned below it. 
The font style is modern and clean, with \"TONGYI\" having a slightly bolder appearance compared to \"Qwen.\"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":178,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` #### ✅ Benchmark Run: ```bash export VLLM_USE_MODELSCOPE=False export HF_ENDPOINT="https://hf-mirror.com" vllm bench serve \ --model /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --hf-split train \ --dataset-path lmarena-ai/vision-arena-bench-v0.1 \ --num-prompts 10 \ --no-stream ``` Before this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 10 Failed requests: 0 Benchmark duration (s): 5.96 Total input tokens: 7191 Total generated tokens: 996 Request throughput (req/s): 1.68 Output token throughput (tok/s): 167.05 Peak output token throughput (tok/s): 261.00 Peak concurrent requests: 10.00 Total token throughput (tok/s): 1373.16 ---------------Time to First Token---------------- Mean TTFT (ms): 964.43 Median TTFT (ms): 858.48 P99 TTFT (ms): 1691.45 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 63.08 Median TPOT (ms): 40.86 P99 TPOT (ms): 241.30 ---------------Inter-token Latency---------------- Mean ITL (ms): 40.16 Median ITL (ms): 33.61 P99 ITL (ms): 250.30 ================================================== ``` After this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 10 Failed requests: 0 Benchmark duration (s): 5.71 Total input tokens: 7191 Total generated tokens: 996 Request throughput (req/s): 1.75 Output token throughput (tok/s): 174.45 Peak output token throughput (tok/s): 279.00 Peak concurrent requests: 10.00 Total token throughput (tok/s): 1433.95 ---------------Time to First Token---------------- Mean TTFT (ms): 992.14 Median TTFT (ms): 938.30 P99 TTFT (ms): 1728.71 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 59.16 Median TPOT (ms): 37.65 P99 TPOT (ms): 234.89 ---------------Inter-token Latency---------------- Mean ITL (ms): 36.55 Median ITL (ms): 30.73 P99 ITL (ms): 170.72 ================================================== ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-13 15:47:23 +08:00
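The before/after benchmark figures in the entry above can be turned into relative changes with a short calculation. This is only a sanity-check sketch over numbers copied from the two benchmark reports; the helper name is illustrative.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the two benchmark reports.
metrics = {
    "Mean TPOT (ms)": (63.08, 59.16),
    "Median TPOT (ms)": (40.86, 37.65),
    "Mean ITL (ms)": (40.16, 36.55),
    "Output token throughput (tok/s)": (167.05, 174.45),
    "Mean TTFT (ms)": (964.43, 992.14),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change_pct(before, after):+.2f}%")
```

Mean TPOT drops by roughly 6.2%, in line with the figure quoted in the PR description, while mean TTFT is slightly higher in this particular run. With only 10 prompts, differences of this size are best read as indicative rather than precise.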
lhchg	4b679984de	enable ep32 for dispatch_ffn_combine (#5787 ) ### What this PR does / why we need it? Enable ep32 support for dispatch_ffn_combine. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Tested as a single operator. --------- Signed-off-by: lhchg <lhao_cheng@163.com>	2026-01-13 14:35:52 +08:00
weijinqian0	1ccb9acd9a	[Refactor] Provide a framework to accommodate operators for different hardware devices (#5735 ) Originates from: https://github.com/vllm-project/vllm-ascend/issues/5463 Reason: As hardware versions iterate, the operators may also go through many revisions, which can cause short-term compatibility differences. An intermediate adaptation layer is therefore provided to absorb these short-term differences between operators. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Signed-off-by: weijinqian0 <1184188277@qq.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2026-01-13 09:53:26 +08:00
Rozwel-dx	8d571286dd	[Refactor] Modify the binding logic to allocate CPU cores for each NPU card (#5555 ) ### What this PR does / why we need it? Modify the binding logic to allocate CPU cores for each NPU card based on NUMA affinity, while isolating acl_thread/release_thread and other processes to prevent mutual interference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `c85cc045f8` Signed-off-by: rowzwel_dx <1392851715@qq.com> - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: Rozwel-dx <1392851715@qq.com>	2026-01-13 09:21:28 +08:00
zhaomingyu13	d886b81971	[BugFix] Support setting tp=1 for the Eagle draft model to take effect (#5519 ) ### What this PR does / why we need it? According to the official documentation, the parameter "draft_tensor_parallel_size": 1 is supposed to be applied to the Eagle3 model. However, debugging showed that the tensor parallel size (tp) of the Eagle model always matched that of the target model, so the tp setting for the draft model did not take effect as expected. Note: This feature has not yet been tested in combination with `sp` and `dp`. It will be adapted later. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested?
```python
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        tensor_parallel_size=4,
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        speculative_config={
            "method": "eagle3",
            "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
            "draft_tensor_parallel_size": 1,
            "num_speculative_tokens": 3,
        },
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```
- vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Fixes vllm-project/vllm#31345 Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Co-authored-by: drslark <slarksblood@qq.com>	2026-01-13 09:14:30 +08:00
shiyuan680	7af3b880c1	support triton of mrope (#5664 ) ### What this PR does / why we need it? This PR adds support for a Triton mrope implementation, like cuda_forward, whose performance is on par with the AscendC op. The Triton op requires CANN 8.5.0. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Tested on qwen3-vl-235b, TextVQA accuracy: native 81.82, NPU Triton 81.58, CUDA Triton 81.52. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: shiyuan680 <917935075@qq.com>	2026-01-13 09:13:51 +08:00
DreamerLeader	db7cf9b0ca	[bugfix] A2 Environment Pooling for Memcache Compatibility (#5601 ) ### What this PR does / why we need it? When running memcache in the A2 environment, the logic for registering memory needs to be added. Additionally, there is a link establishment conflict between memcache and HCCS during initialization in A2, so the link should be established in advance. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` --------- Signed-off-by: fangjianwei <f30058701@china.huawei.com> Co-authored-by: fangjianwei <f30058701@china.huawei.com>	2026-01-13 09:07:38 +08:00
LICO67373	c8a324ab73	[Refactor] Add comments for Metadata classes in attention module (#5789 ) ### What this PR does / why we need it? Add docstrings for Metadata and MetadataBuilder classes in the attention module to improve code readability. Related to #5463 (Item 11: Add some comments for CommonMetadata and others) Modified files: - `vllm_ascend/attention/context_parallel/common_cp.py`: Added comments for `AscendPCPMetadata`, `CPChunkedContextMetadata`, `AscendMetadataForPrefill`, `AscendMetadataForDecode` - `vllm_ascend/attention/utils.py`: Added comments for `AscendPrefillContextParallelMetadata` - `vllm_ascend/attention/mla_v1.py`: Added comments for `ChunkedContextMetadata`, `AscendMLADecodeMetadata` - `vllm_ascend/attention/attention_v1.py`: Added comments for `AscendMetadata`, `AscendAttentionMetadataBuilder` - `vllm_ascend/attention/context_parallel/attention_cp.py`: Added comments for `AscendAttentionCPMetadataBuilder` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation only, no functional changes. Signed-off-by: lico67373 <918688502@qq.com>	2026-01-13 08:46:50 +08:00
LiuYi-Up	dde547e900	[Bugfix] bugfix for the order of dummy run pad and sync (#5777 ) ### What this PR does / why we need it? This PR addresses an issue in piecewise graph mode when Multi-Token Prediction (MTP) is enabled. Specifically, the original dummy run sequence performs the following steps in order: 1. Sync DP (input length = 1 + k) 2. Dispatch (input length = 1 + k, with padding==graph size) However, in the model execution phase, the sequence differs, resulting in: 1. Padding (input length = 1, with padding) 2. Sync DP (input length = 1 + k) 3. Dispatch (input length 1 + k != graph size 1 + k, with padding) This discrepancy leads to a mismatch between the input sizes used in the model execution and those expected by the dispatch graph, causing an inconsistency in graph size. This PR ensures that the dispatch graph size aligns correctly by modifying the sequence of operations during model execution to match the dummy run sequence, resolving the mismatch issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: LiuYi-UP <1150854440@qq.com>	2026-01-13 08:44:10 +08:00
Qiu	5f4b13ab3d	[bugfix](cp) align max_context_chunk to cp_virtual_block_size (#5767 ) ### What this PR does / why we need it? In the chunked prefill scenario, CP needs to align the `max_context_chunk` to the `cp_virtual_block_size`, but the current implementation only aligns it to the `block_size`. For PD-disaggregation, `cp_kv_cache_interleave_size` is typically set equal to `block_size`, in which case `cp_virtual_block_size=block_size * dcp_size * pcp_size`. Under specific conditions, this can lead to misalignment of certain chunks, subsequently triggering assertion check errors. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-12 20:11:46 +08:00
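The alignment issue described in the entry above can be shown with small numbers. The sketch below assumes the chunk size is rounded down to a multiple of the alignment unit and uses the relation `cp_virtual_block_size = block_size * dcp_size * pcp_size` stated in the commit message; the helper names and the sample sizes are illustrative only.

```python
def align_down(value: int, alignment: int) -> int:
    """Largest multiple of `alignment` that does not exceed `value`."""
    return (value // alignment) * alignment


def cp_virtual_block_size(block_size: int, dcp_size: int, pcp_size: int) -> int:
    """Virtual block size when cp_kv_cache_interleave_size == block_size
    (relation taken from the commit message)."""
    return block_size * dcp_size * pcp_size


# Illustrative values, not taken from the PR.
block_size, dcp_size, pcp_size = 128, 2, 2
virtual = cp_virtual_block_size(block_size, dcp_size, pcp_size)

raw_chunk = 1000
old_chunk = align_down(raw_chunk, block_size)  # aligned to block_size only
new_chunk = align_down(raw_chunk, virtual)     # aligned to the virtual block

print(f"virtual block size: {virtual}")
print(f"block-aligned chunk: {old_chunk}, remainder {old_chunk % virtual}")
print(f"virtual-block-aligned chunk: {new_chunk}, remainder {new_chunk % virtual}")
```

With these values the block-aligned chunk still leaves a remainder against the virtual block size, which is the kind of misalignment the commit message says can trip the assertion.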
wangyongjun	4453c60262	[bugfix]limit graph replay sync (#5761 ) ### What this PR does / why we need it? When the graph mode is piecewise, synchronizing on replay hurts performance; each sync costs almost 250 us. ![123](https://github.com/user-attachments/assets/04d2a1f3-1f57-4dbb-85ce-b250f2ee7ff0) ### Does this PR introduce _any_ user-facing change? Only sync when the graph mode includes full mode. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: wangyongjun <wangyongjun7@huawei.com>	2026-01-12 16:46:21 +08:00
gh924	6880c1b383	[Feature] Support for cross-attention and whisper model (#5592 ) ### What this PR does / why we need it? Resolves the issue https://github.com/vllm-project/vllm-ascend/issues/2262: - support for cross-attention when the model is encoder-decoder - support for whisper model - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: gh924 <guihao2@huawei.com> Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>	2026-01-11 11:38:45 +08:00


1693 Commits