xc-llm-ascend

Author	SHA1	Message	Date
Ronald	f1ffb5fb19	[Feature] adapt to uva buffer and main2main (#6657 ) ### What this PR does / why we need it? vllm model runner v2 use uva buffer to prepare input data, but npu doesn't support uva yet, this pr implement a uvawrapper class to mimic gpu's uva backend. what's more, this pr make some modifications to adapt to the newer main branch. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM main: `13397841ab` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-02-12 10:36:31 +08:00
iiiklw	a0315f6697	[npugraph_ex]enable npugraph_ex by default (#6664 ) ### What this PR does / why we need it? This pull request enables the `npugraph_ex` backend by default to improve performance on Ascend NPUs, as proposed in the [RFC](https://github.com/vllm-project/vllm-ascend/issues/6214). ### Does this PR introduce _any_ user-facing change? Yes. `npugraph_ex` is now enabled by default. Users can disable it by setting `enable: false` in the `npugraph_ex_config` section of the `additional_config`. ### How was this patch tested? CI passed. The changes are covered by existing and new E2E tests (`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`) that have been updated to reflect the new default behavior. The tests verify correctness and consistency with `npugraph_ex` enabled and disabled, as well as with the new static kernel option. Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com> Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>	2026-02-12 08:44:06 +08:00
Qiu	cb7c419bc0	[Feat](sfa,dcp) support dcp for sfa (#6563 ) ### What this PR does / why we need it? This PR adds DCP support to the SFA backend. Please note that due to operator constraints, the current implementation has to all-gather the entire KV cache and modify the block table to satisfy the operator input requirements. This results in significantly increased communication overhead and peak memory usage. Therefore, this is only a temporary workaround and will be refactored once the operator provides proper support. Additionally, because of the above limitations, `cp_kv_cache_interleave_size` is currently required to be equal to `block_size`. This restriction will also be removed after the refactor. #### Test accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8 \| dataset \| version \| metric \| mode \| vllm-api-general-stream \| \|----- \| ----- \| ----- \| ----- \| -----\| \| gsm8kdataset \| - \| accuracy \| gen \| 96.35 \| - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 --------- Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>	2026-02-09 18:52:25 +08:00
Shaoxu Cheng	39e77fb9e4	[Feat.]: support 310p w8a8 (#6454 ) ### What this PR does / why we need it? Introduced 310P W8A8 Quantization Support: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform. Platform-Specific Quantization Configuration Loading: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device. Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation. Extended AscendModelSlimConfig for 310P: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-02-03 14:13:06 +08:00
zhangxinyuehfad	05cc03d785	[Bugfix] fix hash conflict due to reset incompatible configuations (#6368 ) ### What this PR does / why we need it? [Bugfix] fix hash conflict due to reset incompatible configuations ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-02-03 10:32:02 +08:00
Angazenn	5e34c70ffc	[Misc] Removes unnecessary graph size re-initialization (#6280 ) ### What this PR does / why we need it? This PR removes `update_default_aclgraph_sizes`. In earlier versions, we add this function to change default `cudagraph_capture_sizes` because `_npu_paged_attention` degrades significantly on certain shapes (which is included in default `cudagraph_capture_sizes` of VLLM). Now since we use FIA as default attention op (which does not contain such performance degradation), there is no need to add this default change. Otherwise, it could cause some conflicts if we set a small `cudagraph_capture_sizes` that < 20 now. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: Angazenn <supperccell@163.com>	2026-01-27 14:38:07 +08:00
Canlin Guo	b45bd92c2b	[Bugfix] Add defensive check for multimodal_config (#6230 ) ### What this PR does / why we need it? In vLLM-Omni, there exists the empty `ModelConfig`. We need to add a check before accessing the sub-field of model_config. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Will checked by CI. - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: gcanlin <canlinguosdu@gmail.com>	2026-01-25 17:39:19 +08:00
Cao Yi	a69ef10c3a	[Refactor] Quantization Module Refactor (#5738 ) ### Summary This PR refactors the `vllm_ascend/quantization` module to improve code organization, maintainability, and extensibility. The refactoring introduces a clear separation of concerns with a registry-based scheme discovery pattern, abstract base classes for quantization schemes, and dedicated wrapper classes. ### Key Changes #### 1. Modular Directory Structure \| Before \| After \| \|--------\|-------\| \| Flat file structure with mixed responsibilities \| Organized into `methods/` subpackage for schemes \| \| Single `quant_config.py` (600+ lines) \| Separate config files: `modelslim_config.py`, `compressed_tensors_config.py` \| \| `utils.py` with scheme lookup logic \| `methods/registry.py` with decorator-based registration \| #### 2. Registry-Based Scheme Discovery Replaced hardcoded `ASCEND_QUANTIZATION_METHOD_MAP` dictionary with a decorator-based registry pattern: ```python # Before: Manual dictionary mapping ASCEND_QUANTIZATION_METHOD_MAP = { "W8A8_DYNAMIC": {"linear": AscendW8A8DynamicLinearMethod, ...}, ... } # After: Decorator-based registration @register_scheme("W8A8_DYNAMIC", "linear") class AscendW8A8DynamicLinearMethod(AscendLinearScheme): ... ``` #### 3. Abstract Base Classes Introduced three abstract base classes in `methods/base.py`: - `AscendLinearScheme` - Base for linear layer quantization - `AscendMoEScheme` - Base for MoE layer quantization - `AscendAttentionScheme` - Base for attention layer quantization #### 4. Separated Config and Wrapper Classes - Config classes (`AscendModelSlimConfig`, `AscendCompressedTensorsConfig`): Handle config parsing and scheme selection - Wrapper classes (`AscendLinearMethod`, `AscendFusedMoEMethod`, etc.): Implement vLLM interfaces and delegate to schemes #### 5. Cleaner Public API ```python # New clean module interface from vllm_ascend.quantization import ( AscendModelSlimConfig, AscendCompressedTensorsConfig, ) from vllm_ascend.quantization.methods import get_scheme_class ``` ### Architecture Diagram ```mermaid classDiagram direction TB class QuantizationConfig { <<vLLM Interface>> +get_quant_method() } class AscendModelSlimConfig { +quant_description +get_quant_method() -create_scheme_for_layer() } class AscendCompressedTensorsConfig { +target_scheme_map +get_quant_method() -_get_scheme_from_parts() } class AscendLinearMethod { <<Wrapper>> +quant_method: AscendLinearScheme +create_weights() +apply() } class AscendFusedMoEMethod { <<Wrapper>> +quant_method: AscendMoEScheme +create_weights() +apply() } class AscendLinearScheme { <<Abstract>> +get_weight()* +apply()* +get_pertensor_param() +get_perchannel_param() } class AscendMoEScheme { <<Abstract>> +get_weight()* +get_dynamic_quant_param()* +apply()* } class W8A8DynamicLinear { +get_weight() +apply() } class W8A8DynamicMoE { +get_weight() +apply() } QuantizationConfig <\|-- AscendModelSlimConfig QuantizationConfig <\|-- AscendCompressedTensorsConfig AscendModelSlimConfig ..> AscendLinearMethod : creates AscendModelSlimConfig ..> AscendFusedMoEMethod : creates AscendCompressedTensorsConfig ..> AscendLinearMethod : creates AscendCompressedTensorsConfig ..> AscendFusedMoEMethod : creates AscendLinearMethod o-- AscendLinearScheme : delegates to AscendFusedMoEMethod o-- AscendMoEScheme : delegates to AscendLinearScheme <\|-- W8A8DynamicLinear AscendMoEScheme <\|-- W8A8DynamicMoE ``` ### Scheme Registration Flow ```mermaid sequenceDiagram participant Module as Scheme Module participant Registry as _SCHEME_REGISTRY participant Config as QuantConfig participant Wrapper as Wrapper Class Note over Module: At import time Module->>Registry: @register_scheme("W8A8_DYNAMIC", "linear") Registry->>Registry: Store (quant_type, layer_type) -> Class Note over Config: At runtime Config->>Config: Determine quant_type from description Config->>Registry: get_scheme_class(quant_type, layer_type) Registry-->>Config: Return scheme class Config->>Config: scheme = scheme_cls() Config->>Wrapper: Create wrapper with scheme Wrapper-->>Config: Return wrapper instance ``` ### File Changes Summary \| Original Files \| Refactored Files \| \|----------------\|------------------\| \| `__init__.py` (empty) \| `__init__.py` (exports public API) \| \| `quant_config.py` \| `modelslim_config.py` + `wrappers.py` \| \| `compressed_tensors/` \| `compressed_tensors_config.py` \| \| `utils.py` \| `methods/registry.py` \| \| `w8a8_dynamic.py` \| `methods/w8a8_dynamic.py` \| \| `w8a8.py` \| `methods/w8a8_static.py` \| \| `w4a4_flatquant_dynamic.py` \| `methods/w4a4_flatquant.py` \| \| ... \| `methods/base.py` (new) \| ### Benefits 1. Extensibility: Adding new quantization schemes only requires implementing the base class and adding `@register_scheme` decorator 2. Maintainability: Clear separation between config parsing, wrapper logic, and scheme implementation 3. Testability: Abstract base classes enable easier unit testing and mocking 4. Discoverability: Registry pattern makes it easy to list all supported schemes 5. Reduced Coupling: Config classes no longer need to know about all scheme implementations ___ - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2026-01-23 14:13:47 +08:00
CodeCat	1402cf6874	[Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (#5721 ) ### What this PR does / why we need it? This PR builds upon PR https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to further enhance the npu_graph_ex_passes module. Based on prior work, we have added graph optimization support for the add_rms_quant fused operator in scenarios where a bias term is present—ensuring the fusion pattern is correctly registered and matched into the computation graph. For validation, we switched to the Qwen3-235B-A22B-W8A8 model for QKVNormRopeWithBias and Qwen3-32B model for QKVNormRope . Benchmark results show that, compared to the unfused baseline, enabling this fusion pass significantly improves inference throughput for W8A8 quantized models. For more details can refer to the RFC:https://github.com/vllm-project/vllm-ascend/issues/4715 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` llm = LLM( model=model, tensor_parallel_size=GPUs_per_dp_rank, enforce_eager=False, enable_expert_parallel=enable_expert_parallel, trust_remote_code=trust_remote_code, gpu_memory_utilization=0.98, max_num_batched_tokens=512, # load_format="dummy", max_model_len=2048, max_num_seqs=16, quantization="ascend", additional_config={ "refresh": True, "enable_npugraph_ex": True }, compilation_config={ "cudagraph_capture_sizes": [8, 16], "cudagraph_mode": "FULL_DECODE_ONLY", }, ) if profile_dir: llm.start_profile() outputs = llm.generate(prompts, sampling_params) if profile_dir: llm.stop_profile() for i, output in enumerate(outputs): if i >= 5: break prompt = output.prompt generated_text = output.outputs[0].text print( f"DP rank {global_dp_rank}, Prompt: {prompt!r}, " f"Generated text: {generated_text!r}" ) ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-22 17:22:41 +08:00
Magnus	5b129cf0a1	[1/N][Feat] Xlite Qwen3 MoE Support (#5951 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-MoE model in Xlite. For more details about Xlite, please refer to the following link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. Qwen3-MoE TODO List: - [ ] Qwen3-235B-A22B support - [ ] Qwen3-MoE weights NZ support - [ ] Qwen3-MoE data parallel support ## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`69b170b8b5`) - xlite-full: main + xlite-full - xlite-decode-only: main + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph \| maxconcurrency \| item \| TTFT(ms) \| \| TPOT(ms) \| \| QPS (req/s) \| OutputSpeed (token/s) \| \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| --- \| \| \| \| Avg \| P99 \| Avg \| P99 \| \| \| \| 1 \| baseline-aclgraph \| 205.07 \| 287.29 \| 12.34 \| 12.65 \| 0.14 \| 78.81 \| \| 1 \| xlite-full \| 66.40 \| 113.69 \| 11.71 \| 12.40 \| 0.15 \| 84.73 \| \| 1 \| xlite-decode-only \| 221.15 \| 316.40 \| 12.16 \| 12.91 \| 0.14 \| 79.70 \| \| 1 \| diff1 \| -67.62% \| -60.43% \| -5.11% \| -1.98% \| 7.14% \| 7.51% \| \| 1 \| diff2 \| 7.84% \| 10.13% \| -1.46% \| 2.06% \| 0.00% \| 1.13% \| \| \| \| \| \| \| \| \| \| \| 16 \| baseline-aclgraph \| 1892.16 \| 13916.86 \| 22.78 \| 39.28 \| 1.15 \| 589.89 \| \| 16 \| xlite-full \| 1355.40 \| 8907.45 \| 15.96 \| 25.15 \| 1.65 \| 850.21 \| \| 16 \| xlite-decode-only \| 1519.42 \| 8711.64 \| 19.23 \| 29.73 \| 1.38 \| 711.60 \| \| 16 \| diff1 \| -28.37% \| -36.00% \| -29.94% \| -35.97% \| 43.48% \| 44.13% \| \| 16 \| diff2 \| -19.70% \| -37.40% \| -15.58% \| -24.31% \| 20.00% \| 20.63% \| \| \| \| \| \| \| \| \| \| \| 32 \| baseline-aclgraph \| 673.80 \| 3914.90 \| 32.20 \| 37.95 \| 1.80 \| 928.54 \| \| 32 \| xlite-full \| 481.65 \| 2710.50 \| 19.95 \| 25.35 \| 2.91 \| 1506.67 \| \| 32 \| xlite-decode-only \| 372.22 \| 1095.25 \| 25.19 \| 28.47 \| 2.33 \| 1202.82 \| \| 32 \| diff1 \| -28.52% \| -30.76% \| -38.04% \| -33.20% \| 61.67% \| 62.26% \| \| 32 \| diff2 \| -44.76% \| -72.02% \| -21.77% \| -24.98% \| 29.44% \| 29.54% \| \| \| \| \| \| \| \| \| \| \| 48 \| baseline-aclgraph \| 583.18 \| 3277.65 \| 41.02 \| 46.05 \| 2.17 \| 1115.08 \| \| 48 \| xlite-full \| 973.42 \| 8237.33 \| 23.29 \| 30.50 \| 3.71 \| 1908.09 \| \| 48 \| xlite-decode-only \| 480.79 \| 2026.98 \| 31.48 \| 35.41 \| 2.83 \| 1453.75 \| \| 48 \| diff1 \| 66.92% \| 151.32% \| -43.22% \| -33.77% \| 70.97% \| 71.12% \| \| 48 \| diff2 \| -17.56% \| -38.16% \| -23.26% \| -23.11% \| 30.41% \| 30.37% \| \| \| \| \| \| \| \| \| \| \| 64 \| baseline-aclgraph \| 742.74 \| 5953.39 \| 47.79 \| 53.15 \| 2.48 \| 1272.37 \| \| 64 \| xlite-full \| 545.22 \| 3941.34 \| 25.09 \| 30.41 \| 4.64 \| 2376.44 \| \| 64 \| xlite-decode-only \| 752.40 \| 4534.29 \| 38.67 \| 43.28 \| 3.06 \| 1567.94 \| \| 64 \| diff1 \| -26.59% \| -33.80% \| -47.50% \| -42.78% \| 87.10% \| 86.77% \| \| 64 \| diff2 \| 1.30% \| -23.84% \| -19.08% \| -18.57% \| 23.39% \| 23.23% \| \| \| \| \| \| \| \| \| \| \| 100 \| baseline-aclgraph \| 565.52 \| 1716.81 \| 60.89 \| 68.69 \| 3.08 \| 1580.64 \| \| 100 \| xlite-full \| 398.14 \| 2328.88 \| 30.70 \| 32.45 \| 6.01 \| 3086.42 \| \| 100 \| xlite-decode-only \| 712.53 \| 4875.94 \| 52.71 \| 60.78 \| 3.53 \| 1813.58 \| \| 100 \| diff1 \| -29.60% \| 35.65% \| -49.58% \| -52.76% \| 95.13% \| 95.26% \| \| 100 \| diff2 \| 26.00% \| 184.01% \| -13.43% \| -11.52% \| 14.61% \| 14.74% \| \| \| \| \| \| \| \| \| \| \| 150 \| baseline-aclgraph \| 842.42 \| 5175.01 \| 73.60 \| 88.18 \| 3.80 \| 1952.26 \| \| 150 \| xlite-full \| 568.52 \| 4204.33 \| 37.90 \| 40.01 \| 7.27 \| 3734.72 \| \| 150 \| xlite-decode-only \| 654.43 \| 2504.06 \| 67.40 \| 77.00 \| 4.18 \| 2145.11 \| \| 150 \| diff1 \| -32.51% \| -18.76% \| -48.51% \| -54.63% \| 91.32% \| 91.30% \| \| 150 \| diff2 \| -22.32% \| -51.61% \| -8.42% \| -12.68% \| 10.00% \| 9.88% \| \| \| \| \| \| \| \| \| \| \| 200 \| baseline-aclgraph \| 750.63 \| 3049.91 \| 88.26 \| 101.95 \| 4.28 \| 2189.72 \| \| 200 \| xlite-full \| 558.48 \| 3791.98 \| 45.54 \| 49.04 \| 8.17 \| 4175.52 \| \| 200 \| xlite-decode-only \| 807.09 \| 4254.95 \| 85.18 \| 101.79 \| 4.44 \| 2271.52 \| \| 200 \| diff1 \| -25.60% \| 24.33% \| -48.40% \| -51.90% \| 90.89% \| 90.69% \| \| 200 \| diff2 \| 7.52% \| 39.51% \| -3.49% \| -0.16% \| 3.74% \| 3.74% \| \| \| \| \| \| \| \| \| \| ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: changdawei1 <changdawei3@huawei.com> Co-authored-by: LVYANGGUO <275926687@qq.com> Co-authored-by: lulina <lina.lulina@huawei.com>	2026-01-21 09:26:03 +08:00
Zetong Li	1ab6cd4935	[Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (#5945 ) ### What this PR does / why we need it? This PR aims to fix setting of `speculative_config.enforce_eager` in deepseek v3.2 mtp. The point is that, vllm sets `speculative_config.enforce_eager` as True if using deepseek_v32 with mtp. Since we support graph mode, we simply ignore it here. However, this fix will also implicitly ignore user setting of `speculative_config.enforce_eager`, we need to take care and remove it once vllm supports this feature. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? by ci - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: Zetong Li <slippersss@126.com>	2026-01-21 09:24:33 +08:00
ChenCangtao	6c30f8bf87	[Feature]refactor the npugraph_ex config, support online-infer with static kernel (#5775 ) ### What this PR does / why we need it? This is a part of https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762 1. refactor the npugraph_ex config，modified the default configuration of the static kernel, new default value of static kernel is false 2. support online-infer with static kernel 3. fixed the issue where manually modifying FX graphs caused an abnormal model return type, and removed the related redundant code. ### Does this PR introduce _any_ user-facing change? yes，the new config of npugraph_ex is as follow: ``` additional_config={ "npugraph_ex_config": { "enable": True, "enable_static_kernel": False } } ``` ### How was this patch tested? ``` vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \ --host 0.0.0.0 \ --port 8004 \ --data-parallel-size 4 \ --tensor-parallel-size 4 \ --quantization ascend \ --seed 1024 \ --served-model-name deepseek_v3 \ --enable-expert-parallel \ --max-num-seqs 48 \ --max-model-len 40000 \ --async-scheduling \ --max-num-batched-tokens 9000 \ --trust-remote-code \ --no-enable-prefix-caching \ --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \ --gpu-memory-utilization 0.9 \ --compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \ --additional-config \ '{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}' ``` - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com>	2026-01-20 21:31:38 +08:00
zhangxinyuehfad	a5b099c73d	[Bugfix] Reset incompatible config (#6005 ) ### What this PR does / why we need it? This PR introduces compatibility fixes for running vLLM on Ascend NPU hardware. The changes ensure that GPU-specific parameters are automatically detected and reset to Ascend-compatible values with appropriate warnings logged. \| Module \| Parameter \| Default Value \| \|--------\|-----------\|---------------\| \| Model Config \| `disable_cascade_attn` \| `False` \| \| Parallel Config \| `all2all_backend` \| `"allgather_reducescatter"` \| \| Cache Config \| `cpu_kvcache_space_bytes` \| `None` \| \| MultiModal Config \| `mm_encoder_attn_backend` \| `None` \| \| Observability Config \| `enable_layerwise_nvtx_tracing` \| `False` \| \| Scheduler Config \| `max_num_partial_prefills` \| `1` \| \| Speculative Config \| `quantization` \| `None` \| \| KV Transfer Config \| `kv_buffer_size` \| `1e9` \| \| KV Transfer Config \| `enable_permute_local_kv` \| `False` \| \| Attention Config \| `use_prefill_decode_attention` \| `False` \| \| Attention Config \| `use_cudnn_prefill` \| `False` \| \| Attention Config \| `use_trtllm_ragged_deepseek_prefill` \| `False` \| \| Attention Config \| `use_trtllm_attention` \| `False` \| \| Attention Config \| `disable_flashinfer_prefill` \| `False` \| \| Attention Config \| `disable_flashinfer_q_quantization` \| `False` \| \| Attention Config \| `flash_attn_version` \| `None` \| \| Attention Config \| `backend` \| `None` \| \| Attention Config \| `flash_attn_max_num_splits_for_cuda_graph` \| `32` \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-20 11:02:38 +08:00
aipaes	f58e110afe	【feat】switch for fusion ops gmmswigluquant (#5992 ) ### What this PR does / why we need it? Set a additional config parameter to control whether the gmmswigluequant fuseion operator is enabled; it is enabled by True. / When enabled with a small number of GPUs, the gmmswigluquant fused operator can cause some performance degradation. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` #### Perf test model: GLM 4.6(w8a8) - single A3 node(ep16, tp16), async-scheduling, mtp, FULL_DECODE_ONLY - bs=1, input_lens=32000, ouput_lens=1024 Without this PR: TPOT 32.22.ms With this PR: TPOT 30.23ms --------- Signed-off-by: zjks98 <zhangjiakang4@huawei.com> Co-authored-by: zjks98 <zhangjiakang4@huawei.com>	2026-01-19 13:19:25 +00:00
SILONG ZENG	b27774dbd6	[CI]fix for lint CI (#5982 ) ### What this PR does / why we need it? fix lint CI - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-19 09:49:28 +08:00
Icey	c929bd1e8d	[Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (#5034 ) This PR add `MatmulAllreduceRmsnorm` operator and introduces a graph fusion pass for `matmul_allreduce_rmsnorm` operations. The implementation includes a new configuration flag, a pattern matching pass using `torch._inductor.pattern_matcher`. Co-authored-by: Trunrain [270250579@qq.com](mailto:270250579@qq.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: tongrunze <t00574058@china.huawei.com>	2026-01-19 09:28:07 +08:00
Shanshan Shen	ad3a1eaf70	[Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (#5855 ) ### What this PR does / why we need it? As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339, multi-modal inference on vllm-ascend may lead to OOM issues in some scenarios. After our analysis, this is due to the memory fragmentation caused by frequent dynamic memory size adjustments during runtime. During the inference, the figure for non-torch memory see a gradual increase from around 1G to over 5G until the OOM issue occurs. We find that this problem can be resolved by just directly setting `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue. Thus, we decide to set this value by default, except RL (sleep mode) scenarios. It's also worthy to note that this environment variable may have more than one key-value pairs. We should append `",expandable_segments:True"` to the current configs. For example: ```python PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True". ``` > [!NOTE] > `max_split_size_mb` or `garbage_collection_threshold` cannot be enabled together with `expandable_segments=True`. ### Does this PR introduce _any_ user-facing change? Users do not need to set `PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually any more. ### How was this patch tested? I have build a dataset consisting of my own photographs, which can stably reproduce this OOM issue on Qwen3-VL serie models. After apply this PR, this problem has been resolved and the amount of non-torch memory will keep stable at around 1G throughout the whole inference. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-19 09:17:31 +08:00
Shaoxu Cheng	1ffca8673f	[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (#5776 ) ### What this PR does / why we need it? Add basic 310p support. Only dense models work with eager mode now. - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-01-17 11:49:18 +08:00
SILONG ZENG	52086394ae	[Lint]Style: Convert `vllm-ascend/compilation` to ruff format (#5912 ) ### What this PR does / why we need it? Convert `vllm-ascend/compilation` to ruff format. ### Does this PR introduce _any_ user-facing change? During this migration, we encountered some errors in our CI and testing environments, such as: ``` vllm_ascend/utils.py:653: in <module> def register_ascend_customop(vllm_config: VllmConfig \| None = None): ^^^^^^^^^^^^^^^^^ E TypeError: unsupported operand type(s) for \|: 'NoneType' and 'NoneType' ``` 1. Root Cause Analysis: The project uses a common pattern to break circular dependencies: ```python if TYPE_CHECKING: from vllm.config import VllmConfig else: VllmConfig = None # Placeholder assigned at runtime ``` When Python parses the function definition `def register_ascend_customop(vllm_config: VllmConfig \| None)`, it attempts to evaluate the expression `VllmConfig \| None`. Since `VllmConfig` is assigned `None` at runtime, the expression effectively becomes `None \| None`. In Python, `None` is an instance of `NoneType`. While the `\|` operator is implemented for Type objects (classes), it is not supported for `NoneType` instances, leading to the `TypeError` shown above. 2. Solution: To maintain the modern `\|` syntax required by our new linting standards while preserving our dependency management strategy, I have introduced: ```python from __future__ import annotations ``` at the top of the affected files. This enables Postponed Evaluation of Annotations (PEP 563). 3. Impact and Benefits: - By enabling `annotations`, Python no longer executes the `VllmConfig \| None` operation during module load. Instead, it stores the annotation as a string literal, completely avoiding the `None \| None` calculation. - We can keep the `VllmConfig = None` placeholders. This ensures that other modules can still import these symbols without triggering an `ImportError`, maintaining a stable dependency graph. - IDEs and static type checkers (MyPy/Pyright) continue to resolve the types correctly. This allows us to use modern syntax without sacrificing type safety or runtime stability. - The only side effect is that `__annotations__` will now return strings instead of type objects. Since this module does not use runtime type enforcement or reflection, this change has zero negative impact on existing functionality. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-01-16 20:57:46 +08:00
Ronald	7078dff691	[Feature] implenment set_additional_forward_context for model runner v2 (#5720 ) ### What this PR does / why we need it? we implement set_additional_forward_context in platform, it's necessary to reuse code of gpu in model runner v2 by inheriting method in gpu model runer v2. please see model runner v2's plan #5208 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2026-01-15 09:18:28 +08:00
lty	295018ec0f	[Refactor]Refactor of vllm_ascend/distributed module (#5719 ) ### What this PR does / why we need it? Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604 This PR is a refactoring of vllm_ascend/distributed, moving all kv_transfer realtaed codes into a dedicated folder, which has already been done in vLLM ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: lty <linhebiwen@gmail.com>	2026-01-15 08:57:40 +08:00
gh924	6880c1b383	[Feature] Support for cross-attention and whisper model (#5592 ) ### What this PR does / why we need it? To solve the problem of the issue：https://github.com/vllm-project/vllm-ascend/issues/2262 - support for cross-attention when the model is encoder-decoder - support for whisper model - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: gh924 <guihao2@huawei.com> Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>	2026-01-11 11:38:45 +08:00
Nengjun Ma	48811bc0b8	Optimize the print info format when deprecated code is used in vllm-ascend (#5696 ) ### What this PR does / why we need it? Optimize the warning print information format when detects depredated code is used in vllm-ascend. ### Does this PR introduce _any_ user-facing change? NA - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: leo-pony <nengjunma@outlook.com>	2026-01-08 09:26:49 +08:00
wangxiyuan	1112208052	[Refactor] Cleanup platform (#5566 ) ### What this PR does / why we need it? 1. add `COMPILATION_PASS_KEY` constant 2. clean up useless platform interface `empty_cache`, `synchronize`, `mem_get_info`, `clear_npu_memory` 3. rename `CUSTOM_OP_REGISTERED` to `_CUSTOM_OP_REGISTERED` 4. remove uesless env `VLLM_ENABLE_CUDAGRAPH_GC` NPUPlatform is the interface called by vLLM. Do not call it inner vllm-ascend. ### Does this PR introduce _any_ user-facing change? This PR is just a cleanup. All CI should pass. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-07 09:25:55 +08:00
Shanshan Shen	b94d589769	[MM][Bugfix] Update `hf_config` to `hf_text_config` (#5319 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm-ascend/pull/5205, update `hf_config` to `hf_text_config`. Find more details at https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534 and https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `5fbfa8d9ef` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-06 16:41:39 +08:00
weiguihua2	549be94397	[Bugfix] fix pcp + eplb error (#5561 ) ### What this PR does / why we need it? Fix the bug in the PCP overlay feature 1、Fix the bug related to PCP and EPLB overlap by including PCP size in the word_size calculation. 2、In the PCP pooling scenario, a prompt has been added for setting the cp_kv_cache_interleave_size. - vLLM version: v0.13.0 - vLLM main: `7157596103` Signed-off-by: weiguihua2 <weiguihua2@huawei.com>	2026-01-05 14:08:11 +08:00
wangxiyuan	1d7539ab3f	Cleanup pass config override (#5283 ) since we support self-defined pass manager now, it's no need to override the pass config. Let's clean up it. - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-04 11:52:12 +08:00
weiguihua2	15d73f248e	[refactor] refactor model runner capture model (#5230 ) ### What this PR does / why we need it? Refactor the `capture_model` method in model_runner to directly reuse the method from vLLM. Currently, most of the logic in the capture_model method is similar to that in the vllm code. Directly using the vllm method can reduce the maintenance cost of the vllm-ascend code. Modify as follows: 1、refactor capture_model function, directly inheriting community methods 2、refactor initialize_aclgraph_capture function, move to initialize_attn_backend ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: weiguihua2 <weiguihua2@huawei.com> Co-authored-by: weijinqian0 <1184188277@qq.com>	2025-12-30 08:32:14 +08:00
wangxiyuan	29d2fe653d	cleanup ascend config (#5296 ) 1. refresh additional config doc 2. move kv config logic to platform. 3. improve `dump_config` init logic and rename it to `dump_config_path` this change is user impacted. dump_config is changed from dict to string. 4. correct `enable_async_exponential` type 5. remove useless `chunked_prefill_for_mla` - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-26 14:07:37 +08:00
Nengjun Ma	3b59f20a28	update to vllm 12-19 (#5223 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Fix vllm break: 1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement] (https://github.com/vllm-project/vllm/pull/29558) Fix Solution: Add the now-necessary `all2all_backend` parameter. The impact of this parameter on the original `set_splitting_ops_for_v1` implementation is only that graph mode is disabled in `vllm` if `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic. 2.[Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface ] (https://github.com/vllm-project/vllm/pull/30684) Fix Solution: The reason why the GPU does not need to convert qkv to 3D is that the GPU's flash_attention operator is compatible with 3D and 4D (b s h d and s b ( h d)), but the NPU's flash_attention_unpad operator only supports 3D (s b ( h d)). Therefore, we need to introduce the reshape_qkv_to_3d operation. 4.Skip Tencent-Hunyuan/HunyuanOCR test case, as it has following issue in upgrade vllm code: https://github.com/vllm-project/vllm-ascend/issues/5297 ### How was this patch tested? Co-authored-by: zxwang <1476209578@qq.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-23 23:52:11 +08:00
lvjunqi	55beac9c91	[Feat]Xlite Qwen3-vl Support (#5228 ) ### What this PR does / why we need it? This patch adds support for the Qwen3-VL model in Xlite. For more details about Xlite, please refer to the following link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md. The latest performance comparison data between xlite and the default aclgraph mode is as follows: ### Does this PR introduce _any_ user-facing change? XLite graph mode supports the Qwen3-VL model. ### How was this patch tested? vLLM version: v0.12.0 - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` Signed-off-by: lvjunqi <lvjunqi1@huawei.com> Co-authored-by: lvjunqi <lvjunqi1@huawei.com>	2025-12-22 16:30:52 +08:00
wangxiyuan	758d81dcb1	Drop 0.12.0 support (#5146 ) We decided to release v0.13.0 soon. So no need to support 0.12.0 now. Let's drop it. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-20 09:38:53 +08:00
weichen	ca6f631cba	[2/N][Pangu][MoE] Remove Pangu Related Code (#5130 ) ### What this PR does / why we need it? Remove Pangu Related Code ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: weichen <calvin_zhu0210@outlook.com>	2025-12-19 09:00:07 +08:00
Ronald	b69b04d3a9	implement model runner v2 basic framework (#5051 ) ### What this PR does / why we need it? This PR aim to implement model runner v2 basic framework in vllm-ascend, the e2e function is not guaranteed by this pr. ### Does this PR introduce _any_ user-facing change? use envs.VLLM_USE_V2_MODEL_RUNNER to decide if choose model_runenr_v2. ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-12-18 15:51:54 +08:00
ZixuanWang	b1a853b0f6	Upgrade vllm commit hash to 1216 (#5053 ) ### What this PR does / why we need it? Upstream vLLM PR #30212 https://github.com/vllm-project/vllm/pull/30212 refactored the attention backend selection interface, This PR adapts vllm-ascend's get_attn_backend_cls to align with the new upstream standard, ensuring compatibility and reducing maintenance overhead. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? co-author:[leo-pony][nengjunma@outlook.com](mailto:nengjunma@outlook.com) - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: zxwang <1476209578@qq.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: leo-pony <nengjunma@outlook.com>	2025-12-17 08:48:36 +08:00
Li Wang	8d2998d0e4	[Misc] Upgrade vllm hash to 12_14 (#5000 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? 1. fix https://github.com/vllm-project/vllm/pull/27938 2. fix https://github.com/vllm-project/vllm/pull/27145 pooling models now supports chunked prefill and prefix caching, 3. fix https://github.com/vllm-project/vllm/pull/30181 define the CPU fields in the field config where they really belong. 4. fix https://github.com/vllm-project/vllm/pull/28168 define the CPU fields in the field config where they really belong. 5. fix https://github.com/vllm-project/vllm/pull/30201 some moudle rename 6. fix https://github.com/vllm-project/vllm/pull/29067 fusedmoe moudle refactor 7. fix https://github.com/vllm-project/vllm/pull/29066 fusedmoe moudle refactor 8. fix https://github.com/vllm-project/vllm/pull/29624 ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-15 19:54:23 +08:00
linfeng-yuan	0fbe0831ec	[bugfix][refactor] fix recompute_scheduler break with vllm 0.12.0 & support async scheduling & refactor recompute_scheduler.py (#4895 ) ### What this PR does / why we need it? Currently, the initialization and fundamental functions of RecomputeScheduler are broken with `vLLM v0.12.0`. This PR fixes the conflicts of `RecomputeScheduler` and refactor its implementations by inheriting original `Scheduler` of vLLM. Meanwhile, this PR also supports async cheduling with recompute scheduler by implementing `AsyncRecomputeScheduler` which is simply inherited `AsncyScheduler` of vLLM and `RecomputeScheduler` of vLLM-Ascend with python MRO. ### Does this PR introduce _any_ user-facing change? No. The switch naming is the same as v0.11.0 : `recompute_scheduler_enable` ### How was this patch tested? E2E serving with 2P1D dsv3.1 passed. The performance was the same as original vllm scheduler with `async_scheduling` and preempted requests in D Nodes are successfully transfered to Proxy and further to P Node. This significantly improves the performance and robustness of PD disaggregation deployments. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-11 22:24:49 +08:00
Icey	18221c0e1d	[Fusion] normalize fusion naming and enable e2e test (#4693 ) ### What this PR does / why we need it? This PR standardizes the fusion naming, changing `enable_quantization_fusion` to `fuse_norm_quant`, and enables e2e testing. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-11 17:53:43 +08:00
wangxiyuan	bb76f7962c	cleanup useless torchair logic (#4856 ) This PR clean up useless torchair logic in model runner. The moge doc is only for torchair, it can be removed as well. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-11 11:21:13 +08:00
ChenCangtao	dd622aa6a6	[Feature] Support npuhraph_ex backend (#4700 ) ### What this PR does / why we need it? We introduced the npugraph_ex backend through the vllm's adaptor dispatch mechanism to accelerate aclgraph. This solution is based on torch.compile and uses torchair to optimize the fx.graph. The performance gains are mainly obtained from the static kernel. We conducted tests on Qwen3-30B and achieved over 5% performance optimization. ### Does this PR introduce _any_ user-facing change? Yes, we add a new switch named"enable_npugraph_ex" in additional_config, default is False. We also add an example to show how to register custom replacement pass ### More information about this PR This feature depends on the release of CANN and torch_npu in Q4. We tested it on a package that has not been publicly released yet and verified that the functionality works. This feature is still experimental at the moment; setting the config true will directly raise error. Merging into the main branch initially involves some preliminary commits to facilitate subsequent development and testing of the feature, as well as to avoid submitting an excessively large PR at once. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: chencangtao <chencangtao@huawei.com> Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com> Co-authored-by: chencangtao <chencangtao@huawei.com> Co-authored-by: panchao-hub <315134829@qq.com> Co-authored-by: wbigat <wbigat@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 20:48:05 +08:00
wangxiyuan	835b4c8f1d	Drop torchair (#4814 ) aclgraph is stable and fast now. Let's drop torchair graph mode now. TODO: some logic to adapt torchair should be cleaned up as well. We'll do it in the following PR. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-12-10 09:20:40 +08:00
LuLina	2be0fe2691	[Feat] Add Euler xlite graph wrapper support (#4526 ) ### What this PR does / why we need it? This patch adds support for the xlite graph wrapper to vllm_ascend. Xlite provides operator implementations of the transformer network on Ascend hardware. For details about xlite, please refer to the following link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md The latest performance comparison data between xlite and the default aclgraph mode is as follows: ## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison - aclgraph: main(`c4a71fc6`) - xlite-full: main(`c4a71fc6`) + xlite-full - xlite-decode-only: main(`c4a71fc6`) + xlite-decode-only - diff1: Performance comparison between xlite-full and aclgraph - diff2: Performance comparison between xlite-decode-only and aclgraph ### Does this PR introduce _any_ user-facing change? Enable the xlite graph mode by setting xlite_graph_config: --additional-config='{"xlite_graph_config": {"enabled": true}}' # Enabled for decode only --additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' # Enabled for prefill and decode - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: lulina <lina.lulina@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-08 08:27:46 +08:00
LookAround0301	b32ef53b3b	[long_seq] remove long_seq env (#4660 ) ### What this PR does / why we need it? remove env VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL - vLLM version: v0.12.0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: ZhangMingWei716 <2894054457@qq.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 10:31:49 +08:00
wangxiyuan	ea54388e19	Drop ascend scheduler (#4623 ) It's safe to drop ascend scheduler now. The related test and doc has been removed already - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-05 09:03:45 +08:00
Icey	178ca1607e	Adopt inductor fusion and define quantization fusion pass (#4168 ) ### What this PR does / why we need it? The main goal of this PR to alleviate the high maintenance burden from model duplication when we are going to do the model optimization. Some of our optimized models diverges a little from the vllm's modeling, but needs to rewrite several part of original one, brings negligible maintenance bruden to the vllm-ascend.In order to solve that, we propose to leverage `torch.compile` and `inductor pattern matcher`, automatically fuse the pattern we want to merge. For more details can refer to the RFC https://github.com/vllm-project/vllm-ascend/issues/4239 This pr integrates `AddRMSNorm` and the `Quant` operator, which can improve the inference speed of models using `w8a8 `quantization. ### Does this PR introduce _any_ user-facing change? Yes, add new additional_config ### How was this patch tested? ```python def main(): prompts = [ "The president of the United States is Mr.", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95) # Create an LLM. llm = LLM( model="/root/.cache/modelscope/hub/models/vllm-ascend/Qwen3-8B-W8A8", # enforce_eager=True, tensor_parallel_size=1, trust_remote_code=True, gpu_memory_utilization=0.7, quantization="ascend", ) # Generate texts from the prompts. outputs = llm.generate(prompts, sampling_params) for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` ```text Prompt: 'The president of the United States is Mr.', Generated text: ' Trump. The president of the United States is Mr. Biden. Which of the following statements is correct? \n\nA. Mr. Trump is Mr. Biden. \nB. Mr. Trump is not Mr. Biden. \nC. The president of the United States is not Mr. Trump. \nD. The president of the United States is not Mr. Biden.\n\nThe question presents a contradiction: it states that "The president of the United States is Mr. Trump" and "The president of' ``` - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: wxsIcey <1790571317@qq.com>	2025-12-04 10:29:48 +08:00
wangxiyuan	3f4c0ea0a0	upgrade vLLM to 0.12.0 tag (#4647 ) Upgrade vLLM to v0.12.0 tag - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-03 23:43:05 +08:00
wangxiyuan	6ece6660ec	fix custom ops env set error (#4675 ) Move Custom ops register to correct place to make CI happy - vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24 - vLLM main: `86e178f7c4` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-03 19:27:38 +08:00
wangxiyuan	7f2673ea2d	upgrade vLLM to main (#4608 ) 1. fix https://github.com/vllm-project/vllm/pull/28542 The model structure modifications we involved in are: - Qwen2.5-VL(still exist some patch) - Qwen2-VL - Qwen2 - DeepSeek series - Qwen-moe series 2. fix https://github.com/vllm-project/vllm/pull/29121 the output token now type changed from np to `list[list[int]]` 3. fix https://github.com/vllm-project/vllm/pull/29262 `xformers` backend for multimodal now has been deprecated 4. fix https://github.com/vllm-project/vllm/pull/29342 5. fix https://github.com/vllm-project/vllm/pull/28579 6. fix https://github.com/vllm-project/vllm/pull/28718 7. fix https://github.com/vllm-project/vllm/issues/28665 8. fix https://github.com/vllm-project/vllm/pull/26847 vllm introduced the `optimization-level`, some default config has been changed, and the param `--enforce-eager` has been deprecated 9. fix http://github.com/vllm-project/vllm/pull/29223 it retuns tuple for sampler. 10. fix https://github.com/vllm-project/vllm/pull/29471 we'll remove the related patch to avoid this kind of error. Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: wangli <wangli858794774@gmail.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2025-12-02 22:10:52 +08:00
Chenxi Qian	4588cdac02	[Bugfix] fix custom op GmmSwigluQuantWeightNzTensorList (#4593 ) ### What this PR does / why we need it? 1. Fixes the environment path used to locate custom op shared libraries. 2. Uses empty tensor initialization for op outputs instead of zero-initialization for better efficiency. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>	2025-12-02 22:02:04 +08:00
MengLong Chen	143e1f46d0	[Feat] shared expert dp for deepseek_mtp (#3811 ) ### What this PR does / why we need it? Support shared expert DP for deepseek_mtp feature. `shared_expert_dp` requires `SP==True`, with corresponding parameter restrictions. Previously, due to the coupling between `shared_expert_dp` and torchair, and the removal of `deepseek_mtp` in vllm_ascend, shared expert dp of deepseek_mtp was temporarily removed. Currently, by performing the `reduce_scatter` on the input of deepssek_mtp in `mtp_proposer.py`, we ensure that it matches the dimensions of `input_embedding`, and then perform the `all_gather` on the output of mtp. ### How was this patch tested? baseline: <img width="1184" height="692" alt="image" src="https://github.com/user-attachments/assets/9680d53a-7b1d-481a-accc-b8f3dae2b9e3" /> enable shared_expert_dp and multistream_overlap_shared_expert: <img width="1167" height="687" alt="image" src="https://github.com/user-attachments/assets/2531d06b-dfda-4e24-8628-6f4b0f677ddc" /> TPOT: 48ms -> 45.4ms Average TPS per rank: 117.6 -> 126.1 - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: chenmenglong <chenmenglong1@huawei.com> Signed-off-by: zengran <zengran2@huawei.com> Co-authored-by: zengran <zengran2@huawei.com>	2025-12-01 20:44:11 +08:00

1 2 3 4

173 Commits