xc-llm-ascend

Author	SHA1	Message	Date
wangxiyuan	b4aafd4293	[Core][Misc] Clean up ProfileExecuteDuration (#6461 ) ### What this PR does / why we need it? This PR removes the custom `ProfileExecuteDuration` utility and its usages across the codebase. This utility was used for profiling execution duration of different stages in the inference process. It is replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`, which integrates with PyTorch's profiler. This change simplifies the code by removing a custom implementation in favor of an upstream utility, improving maintainability. Associated documentation and tests for `ProfileExecuteDuration` are also removed. ### Does this PR introduce _any_ user-facing change? `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now. ### How was this patch tested? CI passed. The changes are a cleanup and replacement with a standard utility. Existing tests cover the functionality. The removed feature had its own tests which are also removed. Related RFC: #5304 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:01 +08:00
ChenCangtao	46cee945b3	[doc][npugraph_ex]add npugraph_ex introduction doc (#6306 ) ### What this PR does / why we need it? As part of the preparation work for the [RFC](https://github.com/vllm-project/vllm-ascend/issues/6214) We have added a documentation about npugraph_ex, which mainly explains and introduces its usage and FX graph optimization. The introduction to FX graph optimization also includes specific explanations of the default passes, the implementation methods for custom fusion passes, and how to capture the FX graph during the optimization process through environment variable configuration. --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-30 11:21:37 +08:00
Nengjun Ma	597091be9f	[Doc] Reranker guide remove deprecated task option (#6385 ) ### What this PR does / why we need it? Reranker guide remove deprecated task option. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-29 16:00:26 +08:00
CodeCat	54e8389f8e	[Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (#6006 ) ### What this PR does / why we need it? This PR builds upon PR https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present—ensuring the fusion pattern is correctly registered and matched into the computation graph. This time, we performed the operator fusion of MatmulAllReduceAddRMSNorm and added corresponding ST test cases for regression monitoring. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-27 16:41:48 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors： https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors：https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. 
Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
wangxiyuan	d9979f4d13	[Doc] quick fix for vllm-ascend version (#6278 ) Correct vllm-ascend version name in doc - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-26 19:33:18 +08:00
wangxiyuan	cb553f8eee	[Community] Nominate whx-sjtu as maintainer (#6268 ) Now that the first releases of 2026, v0.13.0rc2 and v0.14.0rc1, are out, we are considering refreshing the maintainer team. I nominate whx-sjtu as the new maintainer. - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-26 19:22:26 +08:00
Nengjun Ma	f910cebe04	[Doc] 310P Documents update (#6246 ) ### What this PR does / why we need it? Update the 310P support guides, since 310P is now supported on the main branch. --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-26 14:33:21 +08:00
wangxiyuan	52d4acfa51	[Doc] add release note for v0.14.0rc1 (#6225 ) Add release note for v0.14.0rc1 - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-26 14:22:40 +08:00
Li Wang	c26ad78f86	[CI][lint] Add rule `codespell` back (#6236 ) ### What this PR does / why we need it? After removing codespell for a while, we discovered that typo had problems correctly recognizing certain misspelled words, so I suggest adding it back. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 14:12:33 +08:00
Shanshan Shen	e3eefdecbd	[Doc] Update `max_tokens` to `max_completion_tokens` in all docs (#6248 ) ### What this PR does / why we need it? Fix: ``` DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field. ``` - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-26 11:57:40 +08:00
wangxiyuan	99bdd7363c	[CI] update vLLM to 0.14.1 (#6222 ) Upgrade vLLM to 0.14.1 - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-25 17:52:16 +08:00
Icey	7799c4ca3b	[Fusion] change fusion env variable (#6201 ) ### What this PR does / why we need it? Since CI has integrated Triton, `fuse_qknorm_rope` is enabled by default. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-24 22:49:33 +08:00
wangxiyuan	21833a4321	[Doc] Add release note for 0.13.0rc2 (#6207 ) Add release note for 0.13.0rc2 - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-24 12:51:47 +08:00
liziyu	14bef9af6f	[P/D] Remove restrictions on mooncake for IPv6 (#5946 ) ### What this PR does / why we need it? Remove restrictions on mooncake for IPv6. Dependencies: CANN 8.5, mooncake v0.3.8.post1 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: liziyu <liziyu16@huawei.com>	2026-01-24 11:30:22 +08:00
zhangyiming	56d8f088dd	[Doc] Update DeepSeek-V3.2 tutorial, add single-node and multi-node deployment (#6196 ) ### What this PR does / why we need it? [Doc] Update DeepSeek-V3.2 tutorial, add single-node and multi-node deployment - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: menogrey <1299267905@qq.com>	2026-01-24 11:29:07 +08:00
zhaomingyu13	2dd68652bc	[Doc] Add the setting description of cudagraph_capture_sizes in speculative decoding user guide (#5637 ) ### What this PR does / why we need it? Add the setting description of cudagraph_capture_sizes, guide users to avoid the common mistakes frequently made when using the EAGLE overlay fullgraph. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No need for testing - vLLM version: v0.13.0 - vLLM main: `8be6432bda` --------- Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com> Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-01-23 23:22:44 +08:00
Angazenn	1e116829ac	[doc]update --max-num-seqs in Qwen3-235b tutorial (#6197 ) ### What this PR does / why we need it? This PR updates `--max-num-seqs` in the Qwen3-235b single-node deployment tutorial so that the model runs in graph mode correctly. - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: Angazenn <supperccell@163.com>	2026-01-23 17:11:10 +08:00
Cao Yi	a69ef10c3a	[Refactor] Quantization Module Refactor (#5738 ) ### Summary This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes. ### Key Changes #### 1. Modular Directory Structure \| Before \| After \| \|--------\|-------\| \| Flat file structure with mixed responsibilities \| Organized into `methods/` subpackage for schemes \| \| Single `quant_config.py` (600+ lines) \| Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` \| \| `utils.py` with scheme lookup logic \| `methods/registry.py` with decorator-based registration \| #### 2. Registry-Based Scheme Discovery Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern: ```python # Before: Manual dictionary mapping ASCEND_QUANTIZATION_METHOD_MAP = { "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...}, ... } # After: Decorator-based registration @register_scheme("W8A8_DYNAMIC", "linear") class AscendW8A8DynamicLinearMethod(AscendLinearScheme): ... ``` #### 3. Abstract Base Classes Introduced three abstract base classes in `methods/base.py`: - `AscendLinearScheme` - Base for linear layer quantization - `AscendMoEScheme` - Base for MoE layer quantization - `AscendAttentionScheme` - Base for attention layer quantization #### 4. Separated Config and Wrapper Classes - Config classes (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection - Wrapper classes (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes #### 5. 
Cleaner Public API ```python # New clean module interface from vllm_ascend.quantization import ( AscendModelSlimConfig, AscendCompressedTensorsConfig, ) from vllm_ascend.quantization.methods import get_scheme_class ``` ### Architecture Diagram ```mermaid classDiagram direction TB class QuantizationConfig { <<vLLM Interface>> +get_quant_method() } class AscendModelSlimConfig { +quant_description +get_quant_method() -create_scheme_for_layer() } class AscendCompressedTensorsConfig { +target_scheme_map +get_quant_method() -_get_scheme_from_parts() } class AscendLinearMethod { <<Wrapper>> +quant_method: AscendLinearScheme +create_weights() +apply() } class AscendFusedMoEMethod { <<Wrapper>> +quant_method: AscendMoEScheme +create_weights() +apply() } class AscendLinearScheme { <<Abstract>> +get_weight()* +apply()* +get_pertensor_param() +get_perchannel_param() } class AscendMoEScheme { <<Abstract>> +get_weight()* +get_dynamic_quant_param()* +apply()* } class W8A8DynamicLinear { +get_weight() +apply() } class W8A8DynamicMoE { +get_weight() +apply() } QuantizationConfig <\|-- AscendModelSlimConfig QuantizationConfig <\|-- AscendCompressedTensorsConfig AscendModelSlimConfig ..> AscendLinearMethod : creates AscendModelSlimConfig ..> AscendFusedMoEMethod : creates AscendCompressedTensorsConfig ..> AscendLinearMethod : creates AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates AscendLinearMethod o-- AscendLinearScheme : delegates to AscendFusedMoEMethod o-- AscendMoEScheme : delegates to AscendLinearScheme <\|-- W8A8DynamicLinear AscendMoEScheme <\|-- W8A8DynamicMoE ``` ### Scheme Registration Flow ```mermaid sequenceDiagram participant Module as Scheme Module participant Registry as _SCHEME_REGISTRY participant Config as QuantConfig participant Wrapper as Wrapper Class Note over Module: At import time Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear") Registry->>Registry: Store (quant_type, layer_type) -> Class Note over Config: At runtime 
Config->>Config: Determine quant_type from description Config->>Registry: get_scheme_class(quant_type, layer_type) Registry-->>Config: Return scheme class Config->>Config: scheme = scheme_cls() Config->>Wrapper: Create wrapper with scheme Wrapper-->>Config: Return wrapper instance ``` ### File Changes Summary \| Original Files \| Refactored Files \| \|----------------\|------------------\| \| `__init__.py` (empty) \| `__init__.py` (exports public API) \| \| `quant_config.py` \| `modelslim_config.py` + `wrappers.py` \| \| `compressed_tensors/` \| `compressed_tensors_config.py` \| \| `utils.py` \| `methods/registry.py` \| \| `w8a8_dynamic.py` \| `methods/w8a8_dynamic.py` \| \| `w8a8.py` \| `methods/w8a8_static.py` \| \| `w4a4_flatquant_dynamic.py` \| `methods/w4a4_flatquant.py` \| \| ... \| `methods/base.py` (new) \| ### Benefits 1. Extensibility: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator 2. Maintainability: Clear separation between config parsing, wrapper logic, and scheme implementation 3. Testability: Abstract base classes enable easier unit testing and mocking 4. Discoverability: Registry pattern makes it easy to list all supported schemes 5. Reduced Coupling: Config classes no longer need to know about all scheme implementations ___ - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-23 14:13:47 +08:00
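The decorator-based registration described in the quantization refactor above can be sketched as follows. This is a minimal, self-contained illustration and not the code from `methods/registry.py`: the names `register_scheme`, `get_scheme_class`, `_SCHEME_REGISTRY`, `AscendLinearScheme`, and `AscendW8A8DynamicLinearMethod` are borrowed from the commit message, while the method signatures, the duplicate-key check, and the placeholder bodies are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional, Tuple

# Simplified stand-in for the registry: (quant_type, layer_type) -> scheme class.
_SCHEME_REGISTRY: Dict[Tuple[str, str], type] = {}


def register_scheme(quant_type: str, layer_type: str) -> Callable[[type], type]:
    """Class decorator that records a scheme class at import time."""

    def decorator(cls: type) -> type:
        key = (quant_type, layer_type)
        # Assumption: duplicate registrations are treated as an error here.
        if key in _SCHEME_REGISTRY:
            raise ValueError(f"scheme already registered for {key}")
        _SCHEME_REGISTRY[key] = cls
        return cls

    return decorator


def get_scheme_class(quant_type: str, layer_type: str) -> Optional[type]:
    """Look up a scheme class at runtime; returns None when nothing is registered."""
    return _SCHEME_REGISTRY.get((quant_type, layer_type))


class AscendLinearScheme(ABC):
    """Abstract base for linear-layer schemes (signatures are illustrative)."""

    @abstractmethod
    def get_weight(self, input_size: int, output_size: int) -> dict:
        ...

    @abstractmethod
    def apply(self, x):
        ...


@register_scheme("W8A8_DYNAMIC", "linear")
class AscendW8A8DynamicLinearMethod(AscendLinearScheme):
    def get_weight(self, input_size: int, output_size: int) -> dict:
        # Placeholder: only describes the weight shape instead of allocating tensors.
        return {"weight_shape": (output_size, input_size)}

    def apply(self, x):
        # Placeholder: a real scheme would run the quantized matmul here.
        return x


# Runtime side: the config resolves the class by key and instantiates it.
scheme_cls = get_scheme_class("W8A8_DYNAMIC", "linear")
scheme = scheme_cls()
```

With this shape, adding a scheme means writing one subclass and one decorator line; the config code only needs the `(quant_type, layer_type)` key and never imports individual scheme modules by name.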
Li Wang	4d780a8b01	[Misc] Revert "[Misc] Bump mooncake version to v0.3.8.post1 (#6110 )" (#6164 ) ### What this PR does / why we need it? The new version of mooncake leads to an image build failure, see https://github.com/vllm-project/vllm-ascend/actions/runs/21236469259/job/61105443733, so we should revert it first. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-23 09:53:32 +08:00
zhangxinyuehfad	08a45e6053	[Doc] update supported features (#6165 ) ### What this PR does / why we need it? update supported features - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:50:11 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
wjunLu	88632cf976	[CI][Doc] Upgrade wheel building's CANN to 8.5.0 and update the Docs (#6145 ) ### What this PR does / why we need it? Upgrade wheel building's CANN to 8.5.0 and update the Docs - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-22 19:50:54 +08:00
meihanc	e54d294df3	[CI]Install clang in Dockerfile for triton ascend (#4409 ) ### What this PR does / why we need it? Install clang in Dockerfile for triton ascend - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-22 19:01:28 +08:00
wjunLu	a7d781f135	[Main] Upgrade PTA to 2.9.0 (#6112 ) ### What this PR does / why we need it? Upgrade PTA to 2.9.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-22 17:59:06 +08:00
Li Wang	37a9cf818a	[Misc] Bump mooncake version to v0.3.8.post1 (#6110 ) ### What this PR does / why we need it? Since the mooncake has the newer [release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1), we pin the tag to latest release ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 11:03:16 +08:00
wangxiyuan	69740039b7	[CI] Upgrade CANN to 8.5.0 (#6070 ) ### What this PR does / why we need it? 1. Upgrade CANN to 8.5.0 2. move triton-ascend 3.2.0 to requirements note: we skipped the two failed e2e test, see https://github.com/vllm-project/vllm-ascend/issues/6076 for more detail. We'll fix it soon. ### How was this patch tested? Closes: https://github.com/vllm-project/vllm-ascend/issues/5494 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-22 09:29:50 +08:00
Nengjun Ma	ab676413e6	Default enable MLAPO (#5952 ) ### What this PR does / why we need it? 1) Enable MLAPO by default for DeepSeek MLA Attention W8A8 models on the D instance in PD disaggregation, for example DeepSeekV3-W8A8 and DeepSeek-R1-W8A8. 2) Enable MLAPO by default for DeepSeek SFA Attention W8A8 models, currently DeepSeek-V3.2-W8A8. ### Does this PR introduce _any_ user-facing change? Users no longer need to set VLLM_ASCEND_ENABLE_MLAPO=1 manually to enable the MLAPO feature for DeepSeek W8A8 models. The effect of enabling MLAPO for an SFA model deployed on a single A3 node: Test with: tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py, dataset: gsm8k-lite, without MTP, FULL GRAPH; output token throughput improves by about 19%. When MLAPO is not enabled by default: TTFT: 14055.8836 ms; ITL: 66.8171 ms; Output Token Throughput: 104.9105 token/s. When MLAPO is enabled by default: TTFT: 3753.1547 ms; ITL: 61.4236 ms; Output Token Throughput: 125.2075 token/s. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-22 09:26:39 +08:00
MengLong Chen	a15a5f6aa5	[Doc] Supplement PD separation parameters of DeepSeek V3.1 (#6053 ) ### What this PR does / why we need it? Supplement PD separation parameters of DeepSeek V3.1 The recommended parameter configuration for DeepSeek V3.1 in the EP32 scenario after PD separation has been adjusted, and the core parameters have been described in detail. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: chenmenglong <chenmenglong1@huawei.com>	2026-01-22 08:53:44 +08:00
meihanc	53bfb38192	[CI]Update triton ascend version in 3.2.0 (#6067 ) ### What this PR does / why we need it? update triton ascend version in 3.2.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-21 16:02:23 +08:00
Magnus	5b129cf0a1	[1/N][Feat] Xlite Qwen3 MoE Support (#5951 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE model in Xlite. For more details about Xlite, please refer to the following link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. Qwen3-MoE TODO List: - [ ] Qwen3-235B-A22B support - [ ] Qwen3-MoE weights NZ support - [ ] Qwen3-MoE data parallel support ## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`69b170b8b5`) - xlite-full: main + xlite-full - xlite-decode-only: main + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph \| maxconcurrency \| item \| TTFT(ms) \| \| TPOT(ms) \| \| QPS (req/s) \| OutputSpeed (token/s) \| \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| \| \| \| Avg \| P99 \| Avg \| P99 \| \| \| \| 1 \| baseline-aclgraph \| 205.07 \| 287.29 \| 12.34 \| 12.65 \| 0.14 \| 78.81 \| \| 1 \| xlite-full \| 66.40 \| 113.69 \| 11.71 \| 12.40 \| 0.15 \| 84.73 \| \| 1 \| xlite-decode-only \| 221.15 \| 316.40 \| 12.16 \| 12.91 \| 0.14 \| 79.70 \| \| 1 \| diff1 \| -67.62% \| -60.43% \| -5.11% \| -1.98% \| 7.14% \| 7.51% \| \| 1 \| diff2 \| 7.84% \| 10.13% \| -1.46% \| 2.06% \| 0.00% \| 1.13% \| \| \| \| \| \| \| \| \| \| \| 16 \| baseline-aclgraph \| 1892.16 \| 13916.86 \| 22.78 \| 39.28 \| 1.15 \| 589.89 \| \| 16 \| xlite-full \| 1355.40 \| 8907.45 \| 15.96 \| 25.15 \| 1.65 \| 850.21 \| \| 16 \| xlite-decode-only \| 1519.42 \| 8711.64 \| 19.23 \| 29.73 \| 1.38 \| 711.60 \| \| 16 \| diff1 \| -28.37% \| -36.00% \| -29.94% \| -35.97% \| 43.48% \| 44.13% \| \| 16 \| diff2 \| -19.70% \| -37.40% \| -15.58% \| -24.31% \| 20.00% \| 20.63% \| \| \| \| \| \| \| \| \| \| \| 32 \| baseline-aclgraph \| 673.80 \| 3914.90 \| 32.20 \| 37.95 \| 1.80 \| 928.54 \| \| 32 \| xlite-full \| 481.65 \| 2710.50 \| 19.95 \| 25.35 \| 2.91 \| 1506.67 \| \| 32 \| 
xlite-decode-only \| 372.22 \| 1095.25 \| 25.19 \| 28.47 \| 2.33 \| 1202.82 \| \| 32 \| diff1 \| -28.52% \| -30.76% \| -38.04% \| -33.20% \| 61.67% \| 62.26% \| \| 32 \| diff2 \| -44.76% \| -72.02% \| -21.77% \| -24.98% \| 29.44% \| 29.54% \| \| \| \| \| \| \| \| \| \| \| 48 \| baseline-aclgraph \| 583.18 \| 3277.65 \| 41.02 \| 46.05 \| 2.17 \| 1115.08 \| \| 48 \| xlite-full \| 973.42 \| 8237.33 \| 23.29 \| 30.50 \| 3.71 \| 1908.09 \| \| 48 \| xlite-decode-only \| 480.79 \| 2026.98 \| 31.48 \| 35.41 \| 2.83 \| 1453.75 \| \| 48 \| diff1 \| 66.92% \| 151.32% \| -43.22% \| -33.77% \| 70.97% \| 71.12% \| \| 48 \| diff2 \| -17.56% \| -38.16% \| -23.26% \| -23.11% \| 30.41% \| 30.37% \| \| \| \| \| \| \| \| \| \| \| 64 \| baseline-aclgraph \| 742.74 \| 5953.39 \| 47.79 \| 53.15 \| 2.48 \| 1272.37 \| \| 64 \| xlite-full \| 545.22 \| 3941.34 \| 25.09 \| 30.41 \| 4.64 \| 2376.44 \| \| 64 \| xlite-decode-only \| 752.40 \| 4534.29 \| 38.67 \| 43.28 \| 3.06 \| 1567.94 \| \| 64 \| diff1 \| -26.59% \| -33.80% \| -47.50% \| -42.78% \| 87.10% \| 86.77% \| \| 64 \| diff2 \| 1.30% \| -23.84% \| -19.08% \| -18.57% \| 23.39% \| 23.23% \| \| \| \| \| \| \| \| \| \| \| 100 \| baseline-aclgraph \| 565.52 \| 1716.81 \| 60.89 \| 68.69 \| 3.08 \| 1580.64 \| \| 100 \| xlite-full \| 398.14 \| 2328.88 \| 30.70 \| 32.45 \| 6.01 \| 3086.42 \| \| 100 \| xlite-decode-only \| 712.53 \| 4875.94 \| 52.71 \| 60.78 \| 3.53 \| 1813.58 \| \| 100 \| diff1 \| -29.60% \| 35.65% \| -49.58% \| -52.76% \| 95.13% \| 95.26% \| \| 100 \| diff2 \| 26.00% \| 184.01% \| -13.43% \| -11.52% \| 14.61% \| 14.74% \| \| \| \| \| \| \| \| \| \| \| 150 \| baseline-aclgraph \| 842.42 \| 5175.01 \| 73.60 \| 88.18 \| 3.80 \| 1952.26 \| \| 150 \| xlite-full \| 568.52 \| 4204.33 \| 37.90 \| 40.01 \| 7.27 \| 3734.72 \| \| 150 \| xlite-decode-only \| 654.43 \| 2504.06 \| 67.40 \| 77.00 \| 4.18 \| 2145.11 \| \| 150 \| diff1 \| -32.51% \| -18.76% \| -48.51% \| -54.63% \| 91.32% \| 91.30% \| \| 150 \| diff2 \| -22.32% \| -51.61% \| 
-8.42% \| -12.68% \| 10.00% \| 9.88% \| \| \| \| \| \| \| \| \| \| \| 200 \| baseline-aclgraph \| 750.63 \| 3049.91 \| 88.26 \| 101.95 \| 4.28 \| 2189.72 \| \| 200 \| xlite-full \| 558.48 \| 3791.98 \| 45.54 \| 49.04 \| 8.17 \| 4175.52 \| \| 200 \| xlite-decode-only \| 807.09 \| 4254.95 \| 85.18 \| 101.79 \| 4.44 \| 2271.52 \| \| 200 \| diff1 \| -25.60% \| 24.33% \| -48.40% \| -51.90% \| 90.89% \| 90.69% \| \| 200 \| diff2 \| 7.52% \| 39.51% \| -3.49% \| -0.16% \| 3.74% \| 3.74% \| \| \| \| \| \| \| \| \| \| ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: changdawei1 <changdawei3@huawei.com> Co-authored-by: LVYANGGUO <275926687@qq.com> Co-authored-by: lulina <lina.lulina@huawei.com>	2026-01-21 09:26:03 +08:00
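The `diff1` and `diff2` rows in the Xlite table above appear to be relative changes against the `baseline-aclgraph` row. The commit message does not state the formula, so the function below is an assumption; it reproduces the published concurrency-1 `diff1` values for average TTFT and OutputSpeed.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Values copied from the maxconcurrency = 1 rows of the table.
baseline_ttft_avg = 205.07   # baseline-aclgraph, TTFT Avg (ms)
xlite_full_ttft_avg = 66.40  # xlite-full, TTFT Avg (ms)
baseline_speed = 78.81       # baseline-aclgraph, OutputSpeed (token/s)
xlite_full_speed = 84.73     # xlite-full, OutputSpeed (token/s)

ttft_diff = round(percent_change(baseline_ttft_avg, xlite_full_ttft_avg), 2)
speed_diff = round(percent_change(baseline_speed, xlite_full_speed), 2)

print(f"TTFT avg diff1: {ttft_diff:.2f}%")       # table reports -67.62%
print(f"OutputSpeed diff1: {speed_diff:.2f}%")   # table reports 7.51%
```

Under this reading, a negative value is an improvement for latency columns (TTFT, TPOT) and a positive value is an improvement for throughput columns (QPS, OutputSpeed).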
ChenCangtao	6c30f8bf87	[Feature]refactor the npugraph_ex config, support online-infer with static kernel (#5775 ) ### What this PR does / why we need it? This is a part of https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762 1. Refactor the npugraph_ex config and change the default configuration of the static kernel; the new default value of static kernel is false. 2. Support online inference with static kernel. 3. Fix the issue where manually modifying FX graphs caused an abnormal model return type, and remove the related redundant code. ### Does this PR introduce _any_ user-facing change? Yes, the new config of npugraph_ex is as follows: ``` additional_config={ "npugraph_ex_config": { "enable": True, "enable_static_kernel": False } } ``` ### How was this patch tested? ``` vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \ --host 0.0.0.0 \ --port 8004 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --quantization ascend \ --seed 1024 \ --served-model-name deepseek_v3 \ --enable-expert-parallel \ --max-num-seqs 48 \ --max-model-len 40000 \ --async-scheduling \ --max-num-batched-tokens 9000 \ --trust-remote-code \ --no-enable-prefix-caching \ --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \ --gpu-memory-utilization 0.9 \ --compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --additional-config \ '{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}' ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-20 21:31:38 +08:00
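The commit above shows the npugraph_ex setting twice: as a Python dict (`True`/`False`) for offline use and as a JSON string (`true`/`false`) for `--additional-config`. A small sketch of how one form maps to the other, using only the two keys named in the commit message; the helper name `build_npugraph_ex_config` is hypothetical and exists only for this example.

```python
import json


def build_npugraph_ex_config(enable: bool = True,
                             enable_static_kernel: bool = False) -> dict:
    """Build the nested additional_config dict shown in the commit message.

    The default for enable_static_kernel follows the commit's statement that
    the static kernel now defaults to false.
    """
    return {
        "npugraph_ex_config": {
            "enable": enable,
            "enable_static_kernel": enable_static_kernel,
        }
    }


additional_config = build_npugraph_ex_config()

# Serialized form suitable for passing as the value of --additional-config.
cli_value = json.dumps(additional_config)
print(cli_value)
```

Generating the string with `json.dumps` avoids the common slip of pasting Python literals such as `True` into the command line, where the value has to be valid JSON.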
Canlin Guo	afabb49f00	[Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker (#6034 ) ### What this PR does / why we need it? Add docs for Qwen3-VL-Embedding & Qwen3-VL-Reranker. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-20 17:36:31 +08:00
meihanc	ea57e3e7a4	[Main2Main] Upgrade vllm commit to releases/v0.14.0 (#5988 ) ### What this PR does / why we need it? Upgrade vllm commit to releases/v0.14.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-20 15:10:40 +08:00
starmountain1997	0664c6e67a	[Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (#5921 ) ### What this PR does / why we need it? #### Documentation Improvements New Configuration: Added the layer_sharding parameter to the DeepSeek-V3.2-W8A8 deployment tutorial. This guides users to include `["q_b_proj", "o_proj"]` in their prefill node setup for better resource utilization. #### CI and Testing Updates Test Config Update: Updated the multi-node E2E test configuration file: tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml. including disable `FLASHCOMM` and enable `FULL_DECODE_ONLY` and update performance baseline. ### Does this PR introduce any user-facing change? Yes. The documentation now recommends a more optimized startup command for DeepSeek-V3.2-W8A8. Users following the updated tutorial will see improved performance in multi-node PD disaggregation environments. ### How was this patch tested? CI Validation: The updated E2E test configuration has been verified through the nightly CI pipeline. Environment: * vLLM version: v0.13.0 Base Commit: [11b6af5](`11b6af5280`) Hardware: Ascend A3/A2 multi-node cluster. --------- Signed-off-by: guozr <guozr1997@hotmail.com> Co-authored-by: guozr <guozr1997@hotmail.com>	2026-01-20 12:40:54 +08:00
Qiu	38cfcd572a	[doc](cp) correct the prefill of GQA and adjust desc of block table. (#5697 ) ### What this PR does / why we need it? correct the seq length of KV for prefill of GQA and clarify the desc of block table distribution in developer guide. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-01-19 18:53:48 +08:00
LI SHENGYONG	83de5385b4	[EPLB][Bugfix] policy_swift_balancer bugfix and renaming (#5897 ) ### What this PR does / why we need it? 1. Rename dynamic_ep to default_eplb. 2. Rename dynamic_ep_v2 to swift_balancer 3. Discard func compose_expert_update_info_bipartite. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-19 05:47:40 +00:00
meihanc	9cad1a8349	[Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (#5928 ) ### What this PR does / why we need it? Migrate the torch profiler configuration from deprecated environment variables (`VLLM_TORCH_PROFILER_DIR`, `VLLM_TORCH_PROFILER_WITH_STACK`, `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`) to the explicit `ProfilerConfig` object, aligning with vLLM's configuration best practices. The profiler environment variable approach is deprecated in vLLM and will be removed in v0.14.0 or v1.0.0. ### Does this PR introduce _any_ user-facing change? Yes, developers who want to collect profiler output should use `--profiler-config` instead of `VLLM_TORCH_PROFILER_DIR`. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-19 09:27:55 +08:00
herizhen	0eafed9bd6	[doc]Table split (#5929 ) ### What this PR does / why we need it? Added legend descriptions, and split redundant tables into core supported model tables and extended compatible model tables. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: herizhen <1270637059@qq.com>	2026-01-19 09:15:04 +08:00
Li Wang	c4fde5c064	[Doc] Upgrade outdated ut doc (#5937 ) ### What this PR does / why we need it? For cpu env, we should set `SOC_VERSION` to mock different NPU chips for different compilation paths ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-19 09:12:46 +08:00
zzhxxx	05e69b99e5	[Doc] Remove Chinese characters from the icons in the doc. (#5959 ) ### What this PR does / why we need it? Remove Chinese characters from the icons in the doc. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: zzhx1 <zzh_201018@outlook.com>	2026-01-18 07:22:57 +08:00
wjunLu	73a3f822c7	[Main2Main] Upgrade vllm commit to releases/v0.14.0 (#5911 ) ### What this PR does / why we need it? Upgrade vllm commit to releases/v0.14.0 - Re-open cases in `tests/e2e/singlecard/pooling/test_scoring.py`, since the errors before have been fixed by https://github.com/vllm-project/vllm/pull/32243 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-15 23:22:43 +08:00
Shanshan Shen	efa0f64f22	[Doc] Add tutorials for Qwen3-VL-30B-A3B-Instruct (#5331 ) ### What this PR does / why we need it? Add tutorials for `Qwen3-VL-30B-A3B-Instruct`. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-15 10:56:19 +08:00
LI SHENGYONG	da958ee386	[EPLB]Eplb Config Renaming (#5533 ) ### What this PR does / why we need it? 1. Rename num_iterations_eplb_update to expert_heat_collection_interval. 2. Rename num_wait_worker_iterations to algorithm_execution_interval. 3. Rename init_redundancy_expert to num_redundant_experts because the variable with the same meaning in vLLM is named this way. 4. Delete gate_eplb because we don't need this feature. 5. Move eplb config into a dict in additional config. 6. Depend on pr5817 ### Does this PR introduce _any_ user-facing change? before this pr: `--additional-config '{"dynamic_eplb":true, "num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150, "init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'` after this pr: `--additional-config '{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000, "algorithm_execution_interval":150,"num_redundant_experts": 16, "expert_map_path": "xxx.json"}}'` ### How was this patch tested? #### test qwen3-235b eplb num_redundant_experts=16 without pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 83.33 \| with pr5817 \| dataset \| version \| metric \| mode \| vllm-api-general-chat \| \|----- \| ----- \| ----- \| ----- \| -----\| \| aime2024 \| 604a78 \| accuracy \| gen \| 86.67 \| - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2026-01-15 10:26:44 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are broken by https://github.com/vllm-project/vllm/pull/32148 and also fail in vLLM itself https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reaches https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
SILONG ZENG	4811ba62e0	[Lint]Style: reformat markdown files via markdownlint (#5884 ) ### What this PR does / why we need it? reformat markdown files via markdownlint - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain> Signed-off-by: MrZ20 <2609716663@qq.com> Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>	2026-01-15 09:06:01 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed, moving all kv_transfer-related code into a dedicated folder, as has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
herizhen	d31170496b	[doc]index display by category (#5852 ) ### What this PR does / why we need it? upgrade tutorial doc index display by category ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ut - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: herizhen <1270637059@qq.com> Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-01-14 16:50:49 +08:00
LHXuuu	0415e694cd	[Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the vLLM community to generate quantized weights, the vLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support MoE model W8A8 Int8 dynamic weight. 2. Specify W4A16 quantization configuration. Co-authored-by: menogrey 1299267905@qq.com Co-authored-by: kunpengW-code 1289706727@qq.com ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Signed-off-by: Wang Kunpeng <1289706727@qq.com> Co-authored-by: menogrey <1299267905@qq.com> Co-authored-by: Wang Kunpeng <1289706727@qq.com>	2026-01-14 09:17:26 +08:00
zhangxinyuehfad	f7b904641e	[Main2Main] Upgrade vllm commit to 0109 (#5752 ) ### What this PR does / why we need it? Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df) 1. remove `init_cached_hf_modules` due to https://github.com/vllm-project/vllm/pull/31786 2. fix spec_decode e2e test broken by https://github.com/vllm-project/vllm/pull/29821 3. fix `vllm.v1.attention.backends.utils` due to https://github.com/vllm-project/vllm/pull/31891 4. fix `self.seq_lens - query_lens` on same device due to https://github.com/vllm-project/vllm/pull/31773 5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-13 19:14:43 +08:00
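
Note on commit da958ee386 (#5533): the EPLB renames listed in that message amount to a mechanical rewrite of the `--additional-config` JSON. The sketch below illustrates that mapping using only the key names quoted in the commit message. The helper `migrate_eplb_config` and its constant tables are hypothetical names introduced here for illustration and are not part of vllm-ascend. Dropping `gate_eplb` silently is an assumption based on the message saying the option was deleted.

```python
import json

# Old flat key -> new key inside "eplb_config", as listed in the commit message.
RENAMED = {
    "num_iterations_eplb_update": "expert_heat_collection_interval",
    "num_wait_worker_iterations": "algorithm_execution_interval",
    "init_redundancy_expert": "num_redundant_experts",
}
# Keys that keep their name but move under "eplb_config".
MOVED = ("dynamic_eplb", "expert_map_path")
# Keys the commit message says were deleted (assumption: drop them silently).
REMOVED = ("gate_eplb",)


def migrate_eplb_config(old: dict) -> dict:
    """Hypothetical helper: convert a pre-#5533 additional-config dict
    into the nested layout shown in the commit message."""
    new = {}
    # Preserve any settings already in the nested form.
    eplb = dict(old.get("eplb_config", {}))
    for key, value in old.items():
        if key == "eplb_config" or key in REMOVED:
            continue
        if key in RENAMED:
            eplb[RENAMED[key]] = value
        elif key in MOVED:
            eplb[key] = value
        else:
            # Unrelated options stay at the top level.
            new[key] = value
    if eplb:
        new["eplb_config"] = eplb
    return new


if __name__ == "__main__":
    old = {
        "dynamic_eplb": True,
        "num_iterations_eplb_update": 4000,
        "num_wait_worker_iterations": 150,
        "init_redundancy_expert": 16,
        "expert_map_path": "xxx.json",
    }
    # Print a string suitable for passing to --additional-config.
    print(json.dumps(migrate_eplb_config(old)))
```

Options unrelated to EPLB are left at the top level, so the helper can be applied to a full additional-config dict without touching other settings.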

600 Commits