xc-llm-ascend

Author	SHA1	Message	Date
Cao Yi	5ec610e832	[Feature][Quant] Reapply auto-detect quantization format and support remote model ID (#7111 ) ### What this PR does / why we need it? Reapply the auto-detect quantization format feature (originally in #6645, reverted in #6873) and extend it to support remote model identifiers (e.g., `org/model-name`). Changes: - Reapply auto-detection of quantization method from model files (`quant_model_description.json` for ModelSlim, `config.json` for compressed-tensors) - Add `get_model_file()` utility to handle file retrieval from both local paths and remote repos (HuggingFace Hub / ModelScope) - Update `detect_quantization_method()` to accept remote repo IDs with optional `revision` parameter - Update `maybe_update_config()` to work with remote model identifiers - Add platform-level `auto_detect_quantization` support - Add unit tests and e2e tests for both local and remote model ID scenarios Closes #6836 ### Does this PR introduce _any_ user-facing change? Yes. When `--quantization` is not explicitly specified, vllm-ascend will now automatically detect the quantization format from the model files for both local directories and remote model IDs. - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-03-13 22:53:25 +08:00
Ronald	c980e68d40	[Feature] support aclgraph for model runner v2 (#7110 ) ### What this PR does / why we need it? This PR aims to support aclgraph for model runner v2, please see RFC #5208. The PR contains these modifications: - adapt to newest commit of vllm main branch. - supply a unified interface of extra forward context for both model runner v1 and model runner v2. - implement graph mode for main model. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-03-13 09:11:46 +08:00
Li Wang	33234aa0c5	Revert "[Feature][Quant] Auto-detect quantization format from model f… (#6873 ) This reverts commit `3953dcf784`. to keep the basic functions available --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-03-10 11:27:32 +08:00
王远	82fdd40d49	[Feat]Xlite Qwen3 MoE Support Data Parallel (#6715 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE data parallel in Xlite. For more details about Xlite, please refer to the following link:[https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md](https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md). online server config: ```shell port=$1 log=$2 export VLLM_USE_V1=1 export TASK_QUEUE_ENABLE=1 export HCCL_BUFFSIZE=512 export HCCL_OP_EXPANSION_MODE="AIV" export OMP_PROC_BIND=false export VLLM_ASCEND_ENABLE_NZ=0 sysctl -w vm.swappiness=0 sysctl -w kernel.numa_balancing=0 sysctl kernel.sched_migration_cost_ns=50000 ip=127.0.0.1 python -m vllm.entrypoints.openai.api_server \ --model /mnt/nvme1n1/wy/models/Qwen3-30B-A3B \ --tensor-parallel-size 2 \ --enable-expert-parallel \ --data-parallel-size 4 \ --gpu-memory-utilization 0.9 \ --max-num-batched-tokens 32768 \ --data-parallel-size-local 4 \ --max-num-seqs=200 \ --block-size 128 \ --max-model-len 6656 \ --trust-remote-code \ --disable-log-requests \ --served-model-name qwen \ --no-enable-prefix-caching \ --additional-config '{"xlite_graph_config": {"enabled": true, "full_mode": true}, "enable_cpu_binding": true}' \ --compilation-config '{"cudagraph_capture_sizes":[1, 16, 32, 48, 64, 100, 150, 200], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --async-scheduling \ --host ${ip} \ --port ${port} > ${log} 2>&1 & ``` test_config: ```shell vllm bench serve \ --max-concurrency ${maxconcurrency} \ --num-prompts ${num_prompts} \ --host ${HOST} \ --port ${PORT} \ --model ${MODEL_NAME} \ --dataset-name random \ --backend openai-chat \ --random-input-len 512 \ --random-output-len 512 \ --random-range-ratio 0.2 \ --temperature 0.6 \ --metric-percentiles "50,90,99" \ --tokenizer ${TOKENIZER_PATH} \ --endpoint /v1/chat/completions \ --ignore-eos ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 
- vLLM version: v0.16.0 - vLLM main: `c86cdcbcd2` Signed-off-by: uuzWY <Ethan.wangyuan@huawei.com> Co-authored-by: uuzWY <Ethan.wangyuan@huawei.com>	2026-03-09 17:53:35 +08:00
whx	16c879cdf7	[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 ) ### What this PR does / why we need it? Add the muls_add triton kernel with its related fusion pass. In addition, this PR refactors `AscendCompilationConfig` and deletes `NpugraphExConfig`. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with the newly added test. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-02 17:54:25 +08:00
realliujiaxu	5e24b26a54	[Bugfix] rename enable_flash_comm_v1 back to enable_sp (#6883 ) ### What this PR does / why we need it? PR #5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? #### 1. start server ``` bash vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}' ``` #### 2. curl ```bash curl -s http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "auto", "messages": [ {"role": "user", "content": "Hello. I have a question. Who are you?"} ], "max_tokens": 10, "temperature": 0.0, "ignore_eos_token": true }' ``` - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: realliujiaxu <realliujiaxu@163.com>	2026-03-01 20:22:50 +08:00
luomin2005	3cc8bf15da	Support platform.get_device_uuid function (#6777 ) ### What this PR does / why we need it? Support the platform.get_device_uuid function. Currently, pytorch.npu.get_device_properties returns an all-zero uuid. vllm-ascend implements the interface first, so that once pytorch.npu.get_device_properties returns the real uuid, vllm-ascend will support it without modification. For more details, see https://github.com/vllm-project/vllm-ascend/issues/6669 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` root@localhost:/workspace/l00614971/vllm_test# python vllm_test.py INFO 02-24 09:43:48 [__init__.py:43] Available plugins for group vllm.platform_plugins: INFO 02-24 09:43:48 [__init__.py:45] - ascend -> vllm_ascend:register INFO 02-24 09:43:48 [__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load. INFO 02-24 09:43:48 [__init__.py:217] Platform plugin ascend is activated device_uuid = 00000000-0000-0000-0000-000000000000 --------- Signed-off-by: liziyu <liziyu16@huawei.com> Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Signed-off-by: luomin2005 <luomin2005@huawei.com> Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2026-02-28 14:17:12 +08:00
Canlin Guo	e4458b2d2b	[Main2Main] Upgrade vLLM to 0226 (#6813 ) ### What this PR does / why we need it? Breaking: 1. https://github.com/vllm-project/vllm/pull/33452 2. https://github.com/vllm-project/vllm/pull/33451 3. https://github.com/vllm-project/vllm/pull/32567 4. https://github.com/vllm-project/vllm/pull/32344 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `83b47f67b1` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-02-27 16:05:21 +08:00
realliujiaxu	5def28dcd3	[Feat]support sequence parallelism by pass for VL models (#5632 )	2026-02-27 08:27:41 +08:00
Cao Yi	3953dcf784	[Feature][Quant] Auto-detect quantization format from model files (#6645 ) ## Summary - Add automatic quantization format detection, eliminating the need to manually specify `--quantization` when serving quantized models. - The detection inspects only lightweight JSON files (`quant_model_description.json` and `config.json`) at engine initialization time, with no `.safetensors` reads. - User-explicit `--quantization` flags are always respected; auto-detection only applies when the flag is omitted. ## Details Detection priority: 1. `quant_model_description.json` exists → `quantization="ascend"` (ModelSlim) 2. `config.json` contains `"quant_method": "compressed-tensors"` → `quantization="compressed-tensors"` (LLM-Compressor) 3. Neither → default float behavior Technical approach: Hooked into `NPUPlatform.check_and_update_config()` to run detection after `VllmConfig.__post_init__`. Since `quant_config` is already `None` at that point, we explicitly recreate it via `VllmConfig._get_quantization_config()` to trigger the full quantization initialization pipeline. ## Files Changed \| File \| Description \| \|------\|-------------\| \| `vllm_ascend/quantization/utils.py` \| Added `detect_quantization_method()` and `maybe_auto_detect_quantization()` \| \| `vllm_ascend/platform.py` \| Integrated auto-detection in `check_and_update_config()` \| \| `vllm_ascend/quantization/modelslim_config.py` \| Improved error handling for weight loading \| - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-02-26 10:59:25 +08:00
SILONG ZENG	e2237819a9	[CI]Fixed the spell check function in `typos.toml` (#6753 ) ### What this PR does / why we need it? The incorrect regular expression syntax `.[UE4M3\|ue4m3].` actually ignores all words containing any of the following characters: `u, e, 4, m, 3, \|` ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".[UE4M3\|ue4m3].", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ===fix===> ```yaml extend-ignore-identifiers-re = [".Unc.", "._thw", ".UE8M0.", ".(UE4M3\|ue4m3]).", ".eles.", ".fo.", ".ba.", ".ot.", ".[Tt]h[rR]."] ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `9562912cea` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-14 11:57:26 +08:00
Icey	7164990904	[Graph][Fusion] Integrating inductor pass and npugraph ex pass (#6354 ) ### What this PR does / why we need it? Integrating inductor pass and npugraph ex pass, see RFC: https://github.com/vllm-project/vllm-ascend/issues/6347 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? all tests passed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-02-13 15:34:55 +08:00
Ronald	f1ffb5fb19	[Feature] adapt to uva buffer and main2main (#6657 ) ### What this PR does / why we need it? vLLM model runner v2 uses a UVA buffer to prepare input data, but the NPU doesn't support UVA yet. This PR implements a uvawrapper class to mimic the GPU's UVA backend. In addition, this PR makes some modifications to adapt to the newer main branch. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM main: `13397841ab` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-02-12 10:36:31 +08:00
iiiklw	a0315f6697	[npugraph_ex]enable npugraph_ex by default (#6664 ) ### What this PR does / why we need it? This pull request enables the `npugraph_ex` backend by default to improve performance on Ascend NPUs, as proposed in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/6214). ### Does this PR introduce _any_ user-facing change? Yes. `npugraph_ex` is now enabled by default. Users can disable it by setting `enable: false` in the `npugraph_ex_config` section of the `additional_config`. ### How was this patch tested? CI passed. The changes are covered by existing and new E2E tests (`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`) that have been updated to reflect the new default behavior. The tests verify correctness and consistency with `npugraph_ex` enabled and disabled, as well as with the new static kernel option. Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com> Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>	2026-02-12 08:44:06 +08:00
Qiu	cb7c419bc0	[Feat](sfa,dcp) support dcp for sfa (#6563 ) ### What this PR does / why we need it? This PR adds DCP support to the SFA backend. Please note that due to operator constraints, the current implementation has to all-gather the entire KV cache and modify the block table to satisfy the operator input requirements. This results in significantly increased communication overhead and peak memory usage. Therefore, this is only a temporary workaround and will be refactored once the operator provides proper support. Additionally, because of the above limitations, `cp_kv_cache_interleave_size` is currently required to be equal to `block_size`. This restriction will also be removed after the refactor. #### Test accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8 \| dataset \| version \| metric \| mode \| vllm-api-general-stream \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8kdataset \| - \| accuracy \| gen \| 96.35 \| - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-02-09 18:52:25 +08:00
Shaoxu Cheng	39e77fb9e4	[Feat.]: support 310p w8a8 (#6454 ) ### What this PR does / why we need it? Introduced 310P W8A8 Quantization Support: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform. Platform-Specific Quantization Configuration Loading: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device. Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation. Extended AscendModelSlimConfig for 310P: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-02-03 14:13:06 +08:00
zhangxinyuehfad	05cc03d785	[Bugfix] fix hash conflict due to reset incompatible configurations (#6368 ) ### What this PR does / why we need it? [Bugfix] fix hash conflict due to reset incompatible configurations ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-02-03 10:32:02 +08:00
Angazenn	5e34c70ffc	[Misc] Removes unnecessary graph size re-initialization (#6280 ) ### What this PR does / why we need it? This PR removes `update_default_aclgraph_sizes`. In earlier versions, we added this function to change the default `cudagraph_capture_sizes` because `_npu_paged_attention` degrades significantly on certain shapes (which are included in the default `cudagraph_capture_sizes` of vLLM). Now that we use FIA as the default attention op (which does not have this performance degradation), this default change is no longer needed. Keeping it could also cause conflicts when a small `cudagraph_capture_sizes` (< 20) is set. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-27 14:38:07 +08:00
Canlin Guo	b45bd92c2b	[Bugfix] Add defensive check for multimodal_config (#6230 ) ### What this PR does / why we need it? In vLLM-Omni, an empty `ModelConfig` can exist, so we need to add a check before accessing the sub-fields of model_config. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Will be checked by CI. - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-25 17:39:19 +08:00
Cao Yi	a69ef10c3a	[Refactor] Quantization Module Refactor (#5738 ) ### Summary This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes. ### Key Changes #### 1. Modular Directory Structure \| Before \| After \| \|--------\|-------\| \| Flat file structure with mixed responsibilities \| Organized into `methods/` subpackage for schemes \| \| Single `quant_config.py` (600+ lines) \| Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` \| \| `utils.py` with scheme lookup logic \| `methods/registry.py` with decorator-based registration \| #### 2. Registry-Based Scheme Discovery Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern: ```python # Before: Manual dictionary mapping ASCEND_QUANTIZATION_METHOD_MAP = { "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...}, ... } # After: Decorator-based registration @register_scheme("W8A8_DYNAMIC", "linear") class AscendW8A8DynamicLinearMethod(AscendLinearScheme): ... ``` #### 3. Abstract Base Classes Introduced three abstract base classes in `methods/base.py`: - `AscendLinearScheme` - Base for linear layer quantization - `AscendMoEScheme` - Base for MoE layer quantization - `AscendAttentionScheme` - Base for attention layer quantization #### 4. Separated Config and Wrapper Classes - Config classes (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection - Wrapper classes (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes #### 5. 
Cleaner Public API ```python # New clean module interface from vllm_ascend.quantization import ( AscendModelSlimConfig, AscendCompressedTensorsConfig, ) from vllm_ascend.quantization.methods import get_scheme_class ``` ### Architecture Diagram ```mermaid classDiagram direction TB class QuantizationConfig { <<vLLM Interface>> +get_quant_method() } class AscendModelSlimConfig { +quant_description +get_quant_method() -create_scheme_for_layer() } class AscendCompressedTensorsConfig { +target_scheme_map +get_quant_method() -_get_scheme_from_parts() } class AscendLinearMethod { <<Wrapper>> +quant_method: AscendLinearScheme +create_weights() +apply() } class AscendFusedMoEMethod { <<Wrapper>> +quant_method: AscendMoEScheme +create_weights() +apply() } class AscendLinearScheme { <<Abstract>> +get_weight()* +apply()* +get_pertensor_param() +get_perchannel_param() } class AscendMoEScheme { <<Abstract>> +get_weight()* +get_dynamic_quant_param()* +apply()* } class W8A8DynamicLinear { +get_weight() +apply() } class W8A8DynamicMoE { +get_weight() +apply() } QuantizationConfig <\|-- AscendModelSlimConfig QuantizationConfig <\|-- AscendCompressedTensorsConfig AscendModelSlimConfig ..> AscendLinearMethod : creates AscendModelSlimConfig ..> AscendFusedMoEMethod : creates AscendCompressedTensorsConfig ..> AscendLinearMethod : creates AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates AscendLinearMethod o-- AscendLinearScheme : delegates to AscendFusedMoEMethod o-- AscendMoEScheme : delegates to AscendLinearScheme <\|-- W8A8DynamicLinear AscendMoEScheme <\|-- W8A8DynamicMoE ``` ### Scheme Registration Flow ```mermaid sequenceDiagram participant Module as Scheme Module participant Registry as _SCHEME_REGISTRY participant Config as QuantConfig participant Wrapper as Wrapper Class Note over Module: At import time Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear") Registry->>Registry: Store (quant_type, layer_type) -> Class Note over Config: At runtime 
Config->>Config: Determine quant_type from description Config->>Registry: get_scheme_class(quant_type, layer_type) Registry-->>Config: Return scheme class Config->>Config: scheme = scheme_cls() Config->>Wrapper: Create wrapper with scheme Wrapper-->>Config: Return wrapper instance ``` ### File Changes Summary \| Original Files \| Refactored Files \| \|----------------\|------------------\| \| `__init__.py` (empty) \| `__init__.py` (exports public API) \| \| `quant_config.py` \| `modelslim_config.py` + `wrappers.py` \| \| `compressed_tensors/` \| `compressed_tensors_config.py` \| \| `utils.py` \| `methods/registry.py` \| \| `w8a8_dynamic.py` \| `methods/w8a8_dynamic.py` \| \| `w8a8.py` \| `methods/w8a8_static.py` \| \| `w4a4_flatquant_dynamic.py` \| `methods/w4a4_flatquant.py` \| \| ... \| `methods/base.py` (new) \| ### Benefits 1. Extensibility: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator 2. Maintainability: Clear separation between config parsing, wrapper logic, and scheme implementation 3. Testability: Abstract base classes enable easier unit testing and mocking 4. Discoverability: Registry pattern makes it easy to list all supported schemes 5. Reduced Coupling: Config classes no longer need to know about all scheme implementations ___ - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-23 14:13:47 +08:00
CodeCat	1402cf6874	[Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (#5721 ) ### What this PR does / why we need it? This PR builds upon PR https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present, ensuring the fusion pattern is correctly registered and matched into the computation graph. For validation, we switched to the Qwen3-235B-A22B-W8A8 model for QKVNormRopeWithBias and the Qwen3-32B model for QKVNormRope. Benchmark results show that, compared to the unfused baseline, enabling this fusion pass significantly improves inference throughput for W8A8 quantized models. For more details, refer to the RFC: https://github.com/vllm-project/vllm-ascend/issues/4715 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` llm = LLM( model=model, tensor_parallel_size=GPUs_per_dp_rank, enforce_eager=False, enable_expert_parallel=enable_expert_parallel, trust_remote_code=trust_remote_code, gpu_memory_utilization=0.98, max_num_batched_tokens=512, # load_format="dummy", max_model_len=2048, max_num_seqs=16, quantization="ascend", additional_config={ "refresh": True, "enable_npugraph_ex": True }, compilation_config={ "cudagraph_capture_sizes": [8, 16], "cudagraph_mode": "FULL_DECODE_ONLY", }, ) if profile_dir: llm.start_profile() outputs = llm.generate(prompts, sampling_params) if profile_dir: llm.stop_profile() for i, output in enumerate(outputs): if i >= 5: break prompt = output.prompt generated_text = output.outputs[0].text print( f"DP rank {global_dp_rank}, Prompt: {prompt!r}, " f"Generated text: {generated_text!r}" ) ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-22 17:22:41 +08:00
Magnus	5b129cf0a1	[1/N][Feat] Xlite Qwen3 MoE Support (#5951 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE model in Xlite. For more details about Xlite, please refer to the following link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. Qwen3-MoE TODO List: - [ ] Qwen3-235B-A22B support - [ ] Qwen3-MoE weights NZ support - [ ] Qwen3-MoE data parallel support ## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`69b170b8b5`) - xlite-full: main + xlite-full - xlite-decode-only: main + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph \| maxconcurrency \| item \| TTFT(ms) \| \| TPOT(ms) \| \| QPS (req/s) \| OutputSpeed (token/s) \| \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| \| \| \| Avg \| P99 \| Avg \| P99 \| \| \| \| 1 \| baseline-aclgraph \| 205.07 \| 287.29 \| 12.34 \| 12.65 \| 0.14 \| 78.81 \| \| 1 \| xlite-full \| 66.40 \| 113.69 \| 11.71 \| 12.40 \| 0.15 \| 84.73 \| \| 1 \| xlite-decode-only \| 221.15 \| 316.40 \| 12.16 \| 12.91 \| 0.14 \| 79.70 \| \| 1 \| diff1 \| -67.62% \| -60.43% \| -5.11% \| -1.98% \| 7.14% \| 7.51% \| \| 1 \| diff2 \| 7.84% \| 10.13% \| -1.46% \| 2.06% \| 0.00% \| 1.13% \| \| \| \| \| \| \| \| \| \| \| 16 \| baseline-aclgraph \| 1892.16 \| 13916.86 \| 22.78 \| 39.28 \| 1.15 \| 589.89 \| \| 16 \| xlite-full \| 1355.40 \| 8907.45 \| 15.96 \| 25.15 \| 1.65 \| 850.21 \| \| 16 \| xlite-decode-only \| 1519.42 \| 8711.64 \| 19.23 \| 29.73 \| 1.38 \| 711.60 \| \| 16 \| diff1 \| -28.37% \| -36.00% \| -29.94% \| -35.97% \| 43.48% \| 44.13% \| \| 16 \| diff2 \| -19.70% \| -37.40% \| -15.58% \| -24.31% \| 20.00% \| 20.63% \| \| \| \| \| \| \| \| \| \| \| 32 \| baseline-aclgraph \| 673.80 \| 3914.90 \| 32.20 \| 37.95 \| 1.80 \| 928.54 \| \| 32 \| xlite-full \| 481.65 \| 2710.50 \| 19.95 \| 25.35 \| 2.91 \| 1506.67 \| \| 32 \| 
xlite-decode-only \| 372.22 \| 1095.25 \| 25.19 \| 28.47 \| 2.33 \| 1202.82 \| \| 32 \| diff1 \| -28.52% \| -30.76% \| -38.04% \| -33.20% \| 61.67% \| 62.26% \| \| 32 \| diff2 \| -44.76% \| -72.02% \| -21.77% \| -24.98% \| 29.44% \| 29.54% \| \| \| \| \| \| \| \| \| \| \| 48 \| baseline-aclgraph \| 583.18 \| 3277.65 \| 41.02 \| 46.05 \| 2.17 \| 1115.08 \| \| 48 \| xlite-full \| 973.42 \| 8237.33 \| 23.29 \| 30.50 \| 3.71 \| 1908.09 \| \| 48 \| xlite-decode-only \| 480.79 \| 2026.98 \| 31.48 \| 35.41 \| 2.83 \| 1453.75 \| \| 48 \| diff1 \| 66.92% \| 151.32% \| -43.22% \| -33.77% \| 70.97% \| 71.12% \| \| 48 \| diff2 \| -17.56% \| -38.16% \| -23.26% \| -23.11% \| 30.41% \| 30.37% \| \| \| \| \| \| \| \| \| \| \| 64 \| baseline-aclgraph \| 742.74 \| 5953.39 \| 47.79 \| 53.15 \| 2.48 \| 1272.37 \| \| 64 \| xlite-full \| 545.22 \| 3941.34 \| 25.09 \| 30.41 \| 4.64 \| 2376.44 \| \| 64 \| xlite-decode-only \| 752.40 \| 4534.29 \| 38.67 \| 43.28 \| 3.06 \| 1567.94 \| \| 64 \| diff1 \| -26.59% \| -33.80% \| -47.50% \| -42.78% \| 87.10% \| 86.77% \| \| 64 \| diff2 \| 1.30% \| -23.84% \| -19.08% \| -18.57% \| 23.39% \| 23.23% \| \| \| \| \| \| \| \| \| \| \| 100 \| baseline-aclgraph \| 565.52 \| 1716.81 \| 60.89 \| 68.69 \| 3.08 \| 1580.64 \| \| 100 \| xlite-full \| 398.14 \| 2328.88 \| 30.70 \| 32.45 \| 6.01 \| 3086.42 \| \| 100 \| xlite-decode-only \| 712.53 \| 4875.94 \| 52.71 \| 60.78 \| 3.53 \| 1813.58 \| \| 100 \| diff1 \| -29.60% \| 35.65% \| -49.58% \| -52.76% \| 95.13% \| 95.26% \| \| 100 \| diff2 \| 26.00% \| 184.01% \| -13.43% \| -11.52% \| 14.61% \| 14.74% \| \| \| \| \| \| \| \| \| \| \| 150 \| baseline-aclgraph \| 842.42 \| 5175.01 \| 73.60 \| 88.18 \| 3.80 \| 1952.26 \| \| 150 \| xlite-full \| 568.52 \| 4204.33 \| 37.90 \| 40.01 \| 7.27 \| 3734.72 \| \| 150 \| xlite-decode-only \| 654.43 \| 2504.06 \| 67.40 \| 77.00 \| 4.18 \| 2145.11 \| \| 150 \| diff1 \| -32.51% \| -18.76% \| -48.51% \| -54.63% \| 91.32% \| 91.30% \| \| 150 \| diff2 \| -22.32% \| -51.61% \| 
-8.42% \| -12.68% \| 10.00% \| 9.88% \| \| \| \| \| \| \| \| \| \| \| 200 \| baseline-aclgraph \| 750.63 \| 3049.91 \| 88.26 \| 101.95 \| 4.28 \| 2189.72 \| \| 200 \| xlite-full \| 558.48 \| 3791.98 \| 45.54 \| 49.04 \| 8.17 \| 4175.52 \| \| 200 \| xlite-decode-only \| 807.09 \| 4254.95 \| 85.18 \| 101.79 \| 4.44 \| 2271.52 \| \| 200 \| diff1 \| -25.60% \| 24.33% \| -48.40% \| -51.90% \| 90.89% \| 90.69% \| \| 200 \| diff2 \| 7.52% \| 39.51% \| -3.49% \| -0.16% \| 3.74% \| 3.74% \| \| \| \| \| \| \| \| \| \| ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: changdawei1 <changdawei3@huawei.com> Co-authored-by: LVYANGGUO <275926687@qq.com> Co-authored-by: lulina <lina.lulina@huawei.com>	2026-01-21 09:26:03 +08:00
Zetong Li	1ab6cd4935	[Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (#5945 ) ### What this PR does / why we need it? This PR aims to fix setting of `speculative_config.enforce_eager` in deepseek v3.2 mtp. The point is that, vllm sets `speculative_config.enforce_eager` as True if using deepseek_v32 with mtp. Since we support graph mode, we simply ignore it here. However, this fix will also implicitly ignore user setting of `speculative_config.enforce_eager`, we need to take care and remove it once vllm supports this feature. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-21 09:24:33 +08:00
ChenCangtao	6c30f8bf87	[Feature]refactor the npugraph_ex config, support online-infer with static kernel (#5775 ) ### What this PR does / why we need it? This is a part of https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762 1. refactor the npugraph_ex config and change the default configuration of the static kernel; the new default value of the static kernel is false 2. support online-infer with static kernel 3. fixed the issue where manually modifying FX graphs caused an abnormal model return type, and removed the related redundant code. ### Does this PR introduce _any_ user-facing change? Yes, the new config of npugraph_ex is as follows: ``` additional_config={ "npugraph_ex_config": { "enable": True, "enable_static_kernel": False } } ``` ### How was this patch tested? ``` vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \ --host 0.0.0.0 \ --port 8004 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --quantization ascend \ --seed 1024 \ --served-model-name deepseek_v3 \ --enable-expert-parallel \ --max-num-seqs 48 \ --max-model-len 40000 \ --async-scheduling \ --max-num-batched-tokens 9000 \ --trust-remote-code \ --no-enable-prefix-caching \ --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \ --gpu-memory-utilization 0.9 \ --compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --additional-config \ '{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}' ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-20 21:31:38 +08:00
zhangxinyuehfad	a5b099c73d	[Bugfix] Reset incompatible config (#6005 ) ### What this PR does / why we need it? This PR introduces compatibility fixes for running vLLM on Ascend NPU hardware. The changes ensure that GPU-specific parameters are automatically detected and reset to Ascend-compatible values with appropriate warnings logged. \| Module \| Parameter \| Default Value \| \|--------\|-----------\|---------------\| \| Model Config \| `disable_cascade_attn` \| `False` \| \| Parallel Config \| `all2all_backend` \| `"allgather_reducescatter"` \| \| Cache Config \| `cpu_kvcache_space_bytes` \| `None` \| \| MultiModal Config \| `mm_encoder_attn_backend` \| `None` \| \| Observability Config \| `enable_layerwise_nvtx_tracing` \| `False` \| \| Scheduler Config \| `max_num_partial_prefills` \| `1` \| \| Speculative Config \| `quantization` \| `None` \| \| KV Transfer Config \| `kv_buffer_size` \| `1e9` \| \| KV Transfer Config \| `enable_permute_local_kv` \| `False` \| \| Attention Config \| `use_prefill_decode_attention` \| `False` \| \| Attention Config \| `use_cudnn_prefill` \| `False` \| \| Attention Config \| `use_trtllm_ragged_deepseek_prefill` \| `False` \| \| Attention Config \| `use_trtllm_attention` \| `False` \| \| Attention Config \| `disable_flashinfer_prefill` \| `False` \| \| Attention Config \| `disable_flashinfer_q_quantization` \| `False` \| \| Attention Config \| `flash_attn_version` \| `None` \| \| Attention Config \| `backend` \| `None` \| \| Attention Config \| `flash_attn_max_num_splits_for_cuda_graph` \| `32` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-20 11:02:38 +08:00
aipaes	f58e110afe	【feat】switch for fusion ops gmmswigluquant (#5992 ) ### What this PR does / why we need it? Set an additional config parameter to control whether the gmmswigluquant fusion operator is enabled; it is enabled by True. / When enabled with a small number of GPUs, the gmmswigluquant fused operator can cause some performance degradation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` #### Perf test model: GLM 4.6(w8a8) - single A3 node(ep16, tp16), async-scheduling, mtp, FULL_DECODE_ONLY - bs=1, input_lens=32000, output_lens=1024 Without this PR: TPOT 32.22ms With this PR: TPOT 30.23ms --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-01-19 13:19:25 +00:00
SILONG ZENG	b27774dbd6	[CI]fix for lint CI (#5982 ) ### What this PR does / why we need it? fix lint CI - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-19 09:49:28 +08:00
Icey	c929bd1e8d	[Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (#5034 ) This PR adds the `MatmulAllreduceRmsnorm` operator and introduces a graph fusion pass for `matmul_allreduce_rmsnorm` operations. The implementation includes a new configuration flag and a pattern matching pass using `torch._inductor.pattern_matcher`. Co-authored-by: Trunrain [270250579@qq.com](mailto:270250579@qq.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: tongrunze <t00574058@china.huawei.com>	2026-01-19 09:28:07 +08:00
Shanshan Shen	ad3a1eaf70	[Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (#5855 ) ### What this PR does / why we need it? As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339, multi-modal inference on vllm-ascend may lead to OOM issues in some scenarios. Our analysis shows that this is due to memory fragmentation caused by frequent dynamic memory size adjustments at runtime. During inference, non-torch memory gradually increases from around 1G to over 5G until the OOM issue occurs. We find that this problem can be resolved by directly setting `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue. Thus, we decided to set this value by default, except in RL (sleep mode) scenarios. It is also worth noting that this environment variable may hold more than one key-value pair, so we should append `",expandable_segments:True"` to the current configs. For example: ```python PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True" ``` > [!NOTE] > `max_split_size_mb` or `garbage_collection_threshold` cannot be enabled together with `expandable_segments=True`. ### Does this PR introduce _any_ user-facing change? Users no longer need to set `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually. ### How was this patch tested? I have built a dataset of my own photographs that reliably reproduces this OOM issue on Qwen3-VL series models. After applying this PR, the problem is resolved and non-torch memory stays stable at around 1G throughout inference. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-19 09:17:31 +08:00
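The append rule described in the commit above (keep any existing key-value pairs and add `expandable_segments:True` at the end) can be sketched as follows. This is an illustrative sketch only, not the code from the PR: the helper name `with_expandable_segments` is hypothetical, and the sleep-mode exception and the incompatibility with `max_split_size_mb` / `garbage_collection_threshold` are deliberately not handled here.

```python
from typing import Optional


def with_expandable_segments(current: Optional[str]) -> str:
    """Return an allocator config string that includes expandable_segments.

    Hypothetical helper for illustration; the real logic lives in vllm-ascend.
    """
    option = "expandable_segments:True"
    if not current:
        # Nothing configured yet: the option is the whole config.
        return option
    entries = [e.strip() for e in current.split(",") if e.strip()]
    # Respect an explicit user setting instead of overriding it.
    if any(e.split(":", 1)[0] == "expandable_segments" for e in entries):
        return ",".join(entries)
    return ",".join(entries + [option])


print(with_expandable_segments("page_size:1g"))
```

Leaving an explicit user value for `expandable_segments` untouched is a design choice of this sketch; the PR may treat that case differently.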
Shaoxu Cheng	1ffca8673f	[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 ) ### What this PR does / why we need it? Add basic 310p support. Only dense models work with eager mode now. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-01-17 11:49:18 +08:00
SILONG ZENG	52086394ae	[Lint]Style: Convert `vllm-ascend/compilation` to ruff format (#5912 ) ### What this PR does / why we need it? Convert `vllm-ascend/compilation` to ruff format. ### Does this PR introduce _any_ user-facing change? During this migration, we encountered some errors in our CI and testing environments, such as: ``` vllm_ascend/utils.py:653: in <module> def register_ascend_customop(vllm_config: VllmConfig \| None = None): ^^^^^^^^^^^^^^^^^ E TypeError: unsupported operand type(s) for \|: 'NoneType' and 'NoneType' ``` 1. Root Cause Analysis: The project uses a common pattern to break circular dependencies: ```python if TYPE_CHECKING: from vllm.config import VllmConfig else: VllmConfig = None # Placeholder assigned at runtime ``` When Python parses the function definition `def register_ascend_customop(vllm_config: VllmConfig \| None)`, it attempts to evaluate the expression `VllmConfig \| None`. Since `VllmConfig` is assigned `None` at runtime, the expression effectively becomes `None \| None`. In Python, `None` is an instance of `NoneType`. While the `\|` operator is implemented for Type objects (classes), it is not supported for `NoneType` instances, leading to the `TypeError` shown above. 2. Solution: To maintain the modern `\|` syntax required by our new linting standards while preserving our dependency management strategy, I have introduced: ```python from __future__ import annotations ``` at the top of the affected files. This enables Postponed Evaluation of Annotations (PEP 563). 3. Impact and Benefits: - By enabling `annotations`, Python no longer executes the `VllmConfig \| None` operation during module load. Instead, it stores the annotation as a string literal, completely avoiding the `None \| None` calculation. - We can keep the `VllmConfig = None` placeholders. This ensures that other modules can still import these symbols without triggering an `ImportError`, maintaining a stable dependency graph. 
- IDEs and static type checkers (MyPy/Pyright) continue to resolve the types correctly. This allows us to use modern syntax without sacrificing type safety or runtime stability. - The only side effect is that `__annotations__` will now return strings instead of type objects. Since this module does not use runtime type enforcement or reflection, this change has zero negative impact on existing functionality. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-16 20:57:46 +08:00
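The failure mode and the fix described in the commit above can be reproduced without vLLM installed. The sketch below is an assumption-laden illustration: the function name `register` and the helper `try_define` are hypothetical, and the placeholder pattern is compiled from a string so the same source can be tried with and without the `__future__` import.

```python
# Source that mimics the placeholder pattern: VllmConfig is None at runtime.
SOURCE = """
VllmConfig = None  # runtime placeholder, as in the TYPE_CHECKING pattern

def register(vllm_config: VllmConfig | None = None): ...
"""


def try_define(source: str, postponed: bool):
    """Define `register` and read its annotations, optionally under PEP 563."""
    prefix = "from __future__ import annotations\n" if postponed else ""
    namespace = {}
    try:
        # dont_inherit=True keeps the caller's own __future__ flags out.
        code = compile(prefix + source, "<demo>", "exec", dont_inherit=True)
        exec(code, namespace)
        # Older interpreters evaluate `None | None` at definition time;
        # interpreters with lazy annotations evaluate it on this access.
        return namespace["register"].__annotations__
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_define(SOURCE, postponed=False))
print(try_define(SOURCE, postponed=True))
```

With postponed evaluation the annotation is stored as the string `'VllmConfig | None'`, which matches the side effect noted above: `__annotations__` holds strings rather than type objects.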
Ronald	7078dff691	[Feature] implenment set_additional_forward_context for model runner v2 (#5720 ) ### What this PR does / why we need it? We implement set_additional_forward_context in the platform. This is necessary for reusing the GPU code in model runner v2 by inheriting methods from the GPU model runner v2. Please see the model runner v2 plan in #5208. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-01-15 09:18:28 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604 This PR refactors vllm_ascend/distributed, moving all kv_transfer related code into a dedicated folder, as has already been done in vLLM. ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
gh924	6880c1b383	[Feature] Support for cross-attention and whisper model (#5592 ) ### What this PR does / why we need it? To solve the problem described in the issue: https://github.com/vllm-project/vllm-ascend/issues/2262 - support for cross-attention when the model is encoder-decoder - support for whisper model - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: gh924 <guihao2@huawei.com> Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>	2026-01-11 11:38:45 +08:00
Nengjun Ma	48811bc0b8	Optimize the print info format when deprecated code is used in vllm-ascend (#5696 ) ### What this PR does / why we need it? Optimize the format of the warning printed when use of deprecated code is detected in vllm-ascend. ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-08 09:26:49 +08:00
wangxiyuan	1112208052	[Refactor] Cleanup platform (#5566 ) ### What this PR does / why we need it? 1. add `COMPILATION_PASS_KEY` constant 2. clean up useless platform interfaces `empty_cache`, `synchronize`, `mem_get_info`, `clear_npu_memory` 3. rename `CUSTOM_OP_REGISTERED` to `_CUSTOM_OP_REGISTERED` 4. remove useless env `VLLM_ENABLE_CUDAGRAPH_GC` NPUPlatform is the interface called by vLLM. Do not call it inside vllm-ascend. ### Does this PR introduce _any_ user-facing change? This PR is just a cleanup. All CI should pass. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 09:25:55 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
weiguihua2	549be94397	[Bugfix] fix pcp + eplb error (#5561 ) ### What this PR does / why we need it? Fix the bug in the PCP overlay feature 1. Fix the bug related to PCP and EPLB overlap by including PCP size in the word_size calculation. 2. In the PCP pooling scenario, a prompt has been added for setting the cp_kv_cache_interleave_size. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-05 14:08:11 +08:00
wangxiyuan	1d7539ab3f	Cleanup pass config override (#5283 ) since we support self-defined pass manager now, it's no need to override the pass config. Let's clean up it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-04 11:52:12 +08:00
weiguihua2	15d73f248e	[refactor] refactor model runner capture model (#5230 ) ### What this PR does / why we need it? Refactor the `capture_model` method in model_runner to directly reuse the method from vLLM. Currently, most of the logic in the capture_model method is similar to that in the vLLM code. Directly using the vLLM method can reduce the maintenance cost of the vllm-ascend code. Modify as follows: 1. refactor the capture_model function to directly inherit the community method 2. refactor the initialize_aclgraph_capture function and move it to initialize_attn_backend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-30 08:32:14 +08:00
wangxiyuan	29d2fe653d	cleanup ascend config (#5296 ) 1. refresh additional config doc 2. move kv config logic to platform. 3. improve `dump_config` init logic and rename it to `dump_config_path` this change is user impacted. dump_config is changed from dict to string. 4. correct `enable_async_exponential` type 5. remove useless `chunked_prefill_for_mla` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-26 14:07:37 +08:00
Nengjun Ma	3b59f20a28	update to vllm 12-19 (#5223 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Fix vllm break: 1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement] (https://github.com/vllm-project/vllm/pull/29558) Fix Solution: Add the now-necessary `all2all_backend` parameter. The impact of this parameter on the original `set_splitting_ops_for_v1` implementation is only that graph mode is disabled in `vllm` if `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic. 2.[Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface ] (https://github.com/vllm-project/vllm/pull/30684) Fix Solution: The reason why the GPU does not need to convert qkv to 3D is that the GPU's flash_attention operator is compatible with 3D and 4D (b s h d and s b ( h d)), but the NPU's flash_attention_unpad operator only supports 3D (s b ( h d)). Therefore, we need to introduce the reshape_qkv_to_3d operation. 4.Skip Tencent-Hunyuan/HunyuanOCR test case, as it has following issue in upgrade vllm code: https://github.com/vllm-project/vllm-ascend/issues/5297 ### How was this patch tested? Co-authored-by: zxwang <1476209578@qq.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-23 23:52:11 +08:00
lvjunqi	55beac9c91	[Feat]Xlite Qwen3-vl Support (#5228 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-VL model in Xlite. For more details about Xlite, please refer to the following link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. The latest performance comparison data between xlite and the default aclgraph mode is as follows: ### Does this PR introduce _any_ user-facing change? XLite graph mode supports the Qwen3-VL model. ### How was this patch tested? vLLM version: v0.12.0 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: lvjunqi <lvjunqi1@huawei.com> Co-authored-by: lvjunqi <lvjunqi1@huawei.com>	2025-12-22 16:30:52 +08:00
wangxiyuan	758d81dcb1	Drop 0.12.0 support (#5146 ) We decided to release v0.13.0 soon. So no need to support 0.12.0 now. Let's drop it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-20 09:38:53 +08:00
weichen	ca6f631cba	[2/N][Pangu][MoE] Remove Pangu Related Code (#5130 ) ### What this PR does / why we need it? Remove Pangu Related Code ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weichen <calvin_zhu0210@outlook.com>	2025-12-19 09:00:07 +08:00
Ronald	b69b04d3a9	implement model runner v2 basic framework (#5051 ) ### What this PR does / why we need it? This PR aims to implement the basic framework of model runner v2 in vllm-ascend; end-to-end functionality is not guaranteed by this PR. ### Does this PR introduce _any_ user-facing change? Use envs.VLLM_USE_V2_MODEL_RUNNER to decide whether to choose model_runner_v2. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-18 15:51:54 +08:00
ZixuanWang	b1a853b0f6	Upgrade vllm commit hash to 1216 (#5053 ) ### What this PR does / why we need it? Upstream vLLM PR #30212 https://github.com/vllm-project/vllm/pull/30212 refactored the attention backend selection interface. This PR adapts vllm-ascend's get_attn_backend_cls to align with the new upstream standard, ensuring compatibility and reducing maintenance overhead. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? co-author:[leo-pony][nengjunma@outlook.com](mailto:nengjunma@outlook.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zxwang <1476209578@qq.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-12-17 08:48:36 +08:00
Li Wang	8d2998d0e4	[Misc] Upgrade vllm hash to 12_14 (#5000 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? 1. fix https://github.com/vllm-project/vllm/pull/27938 2. fix https://github.com/vllm-project/vllm/pull/27145 pooling models now support chunked prefill and prefix caching, 3. fix https://github.com/vllm-project/vllm/pull/30181 define the CPU fields in the field config where they really belong. 4. fix https://github.com/vllm-project/vllm/pull/28168 define the CPU fields in the field config where they really belong. 5. fix https://github.com/vllm-project/vllm/pull/30201 some module rename 6. fix https://github.com/vllm-project/vllm/pull/29067 fusedmoe module refactor 7. fix https://github.com/vllm-project/vllm/pull/29066 fusedmoe module refactor 8. fix https://github.com/vllm-project/vllm/pull/29624 ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-15 19:54:23 +08:00
linfeng-yuan	0fbe0831ec	[bugfix][refactor] fix recompute_scheduler break with vllm 0.12.0 & support async scheduling & refactor recompute_scheduler.py (#4895 ) ### What this PR does / why we need it? Currently, the initialization and fundamental functions of RecomputeScheduler are broken with `vLLM v0.12.0`. This PR fixes the conflicts in `RecomputeScheduler` and refactors its implementation by inheriting from the original `Scheduler` of vLLM. This PR also supports async scheduling with the recompute scheduler by implementing `AsyncRecomputeScheduler`, which simply inherits from both `AsyncScheduler` of vLLM and `RecomputeScheduler` of vLLM-Ascend and relies on the Python MRO. ### Does this PR introduce _any_ user-facing change? No. The switch naming is the same as v0.11.0 : `recompute_scheduler_enable` ### How was this patch tested? E2E serving with 2P1D dsv3.1 passed. The performance was the same as the original vLLM scheduler with `async_scheduling`, and preempted requests in D Nodes are successfully transferred to the Proxy and further to the P Node. This significantly improves the performance and robustness of PD disaggregation deployments. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-11 22:24:49 +08:00
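The MRO-based composition mentioned in the commit above can be illustrated with stand-in classes. Everything below is a toy sketch under stated assumptions: the classes only borrow the names from the commit message, their `schedule` bodies are invented for the demonstration, and the base-class order shown is an assumption rather than the order used in the PR.

```python
class Scheduler:
    """Stand-in for the base scheduler (not the real vLLM class)."""

    def schedule(self):
        return ["base"]


class AsyncScheduler(Scheduler):
    """Stand-in that adds async behaviour cooperatively via super()."""

    def schedule(self):
        return ["async"] + super().schedule()


class RecomputeScheduler(Scheduler):
    """Stand-in that adds recompute behaviour cooperatively via super()."""

    def schedule(self):
        return ["recompute"] + super().schedule()


class AsyncRecomputeScheduler(AsyncScheduler, RecomputeScheduler):
    """Combines both behaviours purely through multiple inheritance."""


# The MRO linearizes the diamond so each class runs exactly once.
print([cls.__name__ for cls in AsyncRecomputeScheduler.__mro__])
print(AsyncRecomputeScheduler().schedule())
```

Because each override calls `super()`, the combined class needs no body of its own; swapping the base-class order would change which behaviour runs first.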
Icey	18221c0e1d	[Fusion] normalize fusion naming and enable e2e test (#4693 ) ### What this PR does / why we need it? This PR standardizes the fusion naming, changing `enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e testing. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-11 17:53:43 +08:00

185 Commits