### What this PR does / why we need it?
Current version will run into init error when user set max_num_seqs to
number not a multiple of tp size. The reason is that we will first find
out the valid size of sequence parallelism, and then remove numbers that
are not the multiple of tp size. This may cause an error when we set a
max_num_seqs above a multiple of 8 before a multiple of tp size, say
when the tp size is 16 and the max_num_seqs is 90. The system will just
drop the calculated max graph capture size 88 from the valid size list
but not reset the max_cudagraph_capture_size to the next valid number.
Thus, we will need to add the line to match them up.
Cherry-pick from main PR #7801
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Full CI passed with this PR.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
### What this PR does / why we need it?
This PR adds support for hierarchical communication for `dispatch_v2`
and `combine_v2` MoE operations. This is achieved by introducing a new
configuration `enable_mc2_hierarchy_comm`. When enabled, the
communication algorithm is set to "hierarchy", which support mc2 op comm
between two super pod.
The changes include:
- Adding `enable_mc2_hierarchy_comm` to `AscendConfig`.
- Modifying `TokenDispatcherWithMC2` to pass `comm_alg: "hierarchy"` to
the underlying `torch_npu` ops when the new config is enabled.
- Adding validation to ensure that this feature is only used with
compatible PTA/CANN versions and is not used with the conflicting
`fused_mc2` op.
- Updating `is_hierarchical_communication_enabled` to respect the new
configuration flag.
### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces a new user-facing configuration option
`enable_mc2_hierarchy_comm` in `additional_config` to enable
hierarchical communication for MoE.
### How was this patch tested?
- vLLM version: v0.18.0
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
https://github.com/vllm-project/vllm/pull/35122 This PR in the vllm
community refactors the update mode of block_size. As a result, when the
user does not specify `--block-size`, dsv3.2 obtains an incorrect
block_size.
**The root cause of the problem is analyzed from the block_size update
process as follows:**
1. In NPUPlatform, `check_and_update_config` calls `refresh_block_size`
to set block_size to 128.
2. During Modelrunner initialization, the `self.block_size` parameter is
generated. At this time, block_size is still 128. This parameter will be
used for operations such as kvcache initialization.
3. `update_block_size_for_backend` updates block_size to the size set in
attn_backend. The reason why the DSV3.2 is faulty is that it has an
additional attn_backend `DeepseekV32IndexerBackend`, and this backend is
not rewritten. The block_size obtained from attn_backend is 64. In this
case, only `vllm_config.cache_config.block_size` is updated, and other
parts are not modified. As a result, the block_size on the entire
network is inconsistent.
**Modification solution:**
Skip `update_block_size_for_backend` and modify block_size only in the
`check_and_update_config` method.
In the future, the block_size update logic can be migrated to the
`update_block_size_for_backend` method. Ensure that all block_size
values on the entire network are updated.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This pull request addresses a linting issue by reordering a specific
configuration assignment within the `apply_config_platform_defaults`
method in `vllm_ascend/platform.py`. This change ensures compliance with
code style guidelines without altering the functional behavior of the
system.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This PR lets NPU platform provide its own default
`max_cudagraph_capture_size` via
`NPUPlatform.apply_config_platform_defaults()`.
Previously, when cudagraph sizing was left unset, Ascend inherited
vLLM's upstream default heuristic in `_set_cudagraph_sizes()`, which
uses `max_num_seqs * decode_query_len * 2`. This PR changes Ascend's
default to `min(max_num_seqs * decode_query_len, 512)` while keeping the
rest of vLLM's cudagraph sizing logic unchanged.
### Does this PR introduce _any_ user-facing change?
Yes, but only for Ascend when users do not explicitly configure
cudagraph sizing.
If `max_cudagraph_capture_size` and `cudagraph_capture_sizes` are both
unset, we now uses `max_num_seqs * decode_query_len` (capped at `512`)
instead of the upstream `* 2` default. Explicit user settings are
unchanged.
### How was this patch tested?
Add unit tests to cover:
- default max injection via `apply_config_platform_defaults()`
- explicit `max_cudagraph_capture_size` is preserved
- explicit `cudagraph_capture_sizes` are preserved
- Ascend default max no longer uses the upstream `* 2`
- late `_set_cudagraph_sizes()` recomputation reuses the current max
input
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.
### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.
### How was this patch tested?
- Model: Qwen3-VL-30B-A3B,
- TP4 DP2
- 100 reqs
- max concurrency 1
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k | 429.40 | 323.3 |
| 16k | 1297.01 | 911.74 |
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Upgrade vllm commit to 2026.03.19.
1.Fix socket removed from StatelessProcessGroup. Upstream vLLM PR
[#36330](https://github.com/vllm-project/vllm/pull/36330) ("elastic_ep:
Fix stateless group port races") refactored StatelessProcessGroup and
removed the socket: socket.socket | None field. The socket ownership was
moved to a new create_tcp_store() helper instead of being stored as a
field on the dataclass.
2.fix `virtual_engine` parameter removed from `set_forward_context().
Upstream [V0 Deprecation] Deprecate virtual engine
[#37195](https://github.com/vllm-project/vllm/pull/37195)
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
## Summary
Flash Comm V1 (flashcomm1) was previously blocked for all VL models.
**Root cause:** For VL models, `inputs_embeds` at layer 0 originates
from the vision encoder as a full `[N, H]` tensor — it has **not** been
reduce-scattered across TP ranks. The original MLA forward path assumed
inputs were already scattered, producing wrong output shapes under TP >
1.
**Fix:**
- Detect at init time (statically, not via runtime shape checks) whether
a layer is the first layer of a VL model (`is_vl_first_layer`) so dynamo
treats the branch as a constant.
- In `AscendMultiHeadLatentAttention.forward`, when `flashcomm1 + TP > 1
+ is_vl_first_layer`, set `need_gather_q_kv=False` and pre-allocate
output as `[N//tp_size, H]`.
- Remove the platform-level assertion that prevented VL models from
enabling Flash Comm V1.
**Other improvements:**
- `is_vl_model()` now uses vllm's canonical detection (`hf_config is not
hf_text_config`) instead of fragile key-name checks, with the old checks
kept as fallback.
- Added `parse_layer_idx(prefix)` utility.
- Added `maybe_chunk_residual` call in `AscendRMSNorm` before the
add-rms-norm op.
- Removed unnecessary CPU/fp32 round-trip in
`AscendLearnable2DInterpPosEmbDivided_fixed.forward()`.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: LoganJane <loganJane73@hotmail.com>
### What this PR does / why we need it
This PR fixes a startup regression for Ascend hybrid attention + mamba
models after upgrading to vLLM `0.18.0`.
However, after the vLLM `0.18.0` upgrade, worker initialization still
calls the generic platform hook:
- `current_platform.update_block_size_for_backend(vllm_config)`
### How this PR fixes it
This PR keeps the fix strictly inside `vllm-ascend`.
It adds an Ascend override for
`NPUPlatform.update_block_size_for_backend()`:
- for hybrid models, do not run the generic upstream block-size fallback
- preserve the block size that was already computed by the hybrid
model-specific config logic
- for non-hybrid models, keep the original upstream behavior unchanged
- vLLM version: v0.18.0
- vLLM main:
8b6325758c
---------
Signed-off-by: maoxx241 <maomaoyu870@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Reapply the auto-detect quantization format feature (originally in
#6645, reverted in #6873) and extend it to support remote model
identifiers (e.g., `org/model-name`).
Changes:
- Reapply auto-detection of quantization method from model files
(`quant_model_description.json` for ModelSlim, `config.json` for
compressed-tensors)
- Add `get_model_file()` utility to handle file retrieval from both
local paths and remote repos (HuggingFace Hub / ModelScope)
- Update `detect_quantization_method()` to accept remote repo IDs with
optional `revision` parameter
- Update `maybe_update_config()` to work with remote model identifiers
- Add platform-level `auto_detect_quantization` support
- Add unit tests and e2e tests for both local and remote model ID
scenarios
Closes#6836
### Does this PR introduce _any_ user-facing change?
Yes. When `--quantization` is not explicitly specified, vllm-ascend will
now automatically detect the quantization format from the model files
for both local directories and remote model IDs.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This PR aims to support aclgraph for model runner v2, please see RFC
#5208. The PR contains these modifications:
- adapt to newest commit of vllm main branch.
- supply a unified interface of extra forward context for both model
runner v1 and model runner v2.
- implement graph mode for main model.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Add muls_add triton kernel with related fusion pass. What's more, this
PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI passed with new added test.
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
PR #5632 introduced a bug by replacing some branches gated by enable_sp
with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is
enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and
VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with
the previous logic and leads to accuracy issues. This PR restores the
original enable_sp-based branching to recover expected behavior and
accuracy.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
#### 1. start server
``` bash
vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \
--port 8001 \
--served-model-name auto \
--max-model-len 1024 \
--enforce-eager \
--tensor-parallel-size 2 \
--data-parallel-size 2 \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--additional-config '{"enable_shared_expert_dp": true}'
```
#### 2. curl
```bash
curl -s http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{"role": "user", "content": "Hello. I have a question. Who are you?"}
],
"max_tokens": 10,
"temperature": 0.0,
"ignore_eos_token": true
}'
```
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Support platform.get_device_uuid function.
currently, the pytorch.npu.get_device_properties return uuid as full
zero, vllm-ascend implement the interface at first, once the
pytorch.npu.get_device_properties return the real uuid, vllm-ascend will
support without modification.
more details see
https://github.com/vllm-project/vllm-ascend/issues/6669
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
root@localhost:/workspace/l00614971/vllm_test# python vllm_test.py
INFO 02-24 09:43:48 [__init__.py:43] Available plugins for group
vllm.platform_plugins:
INFO 02-24 09:43:48 [__init__.py:45] - ascend -> vllm_ascend:register
INFO 02-24 09:43:48 [__init__.py:48] All plugins in this group will be
loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 02-24 09:43:48 [__init__.py:217] Platform plugin ascend is
activated
device_uuid = 00000000-0000-0000-0000-000000000000
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
## Summary
- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.
## Details
**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior
**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.
## Files Changed
| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
The incorrect regular expression syntax `.*[UE4M3|ue4m3].*` actually
ignores all words containing any of the following characters: `u, e, 4,
m, 3, |`
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
".*UE8M0.*", ".*(UE4M3|ue4m3]).*", ".*eles.*", ".*fo.*", ".*ba.*",
".*ot.*", ".*[Tt]h[rR].*"]
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
9562912cea
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
Integrating inductor pass and npugraph ex pass, see RFC:
https://github.com/vllm-project/vllm-ascend/issues/6347
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
all tests passed.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
### What this PR does / why we need it?
vllm model runner v2 use uva buffer to prepare input data, but npu
doesn't support uva yet, this pr implement a uvawrapper class to mimic
gpu's uva backend. what's more, this pr make some modifications to adapt
to the newer main branch.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM main:
13397841ab
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
This pull request enables the `npugraph_ex` backend by default to
improve performance on Ascend NPUs, as proposed in the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214).
### Does this PR introduce _any_ user-facing change?
Yes. `npugraph_ex` is now enabled by default. Users can disable it by
setting `enable: false` in the `npugraph_ex_config` section of the
`additional_config`.
### How was this patch tested?
CI passed. The changes are covered by existing and new E2E tests
(`test_aclgraph_accuracy.py`) and unit tests (`test_ascend_config.py`)
that have been updated to reflect the new default behavior. The tests
verify correctness and consistency with `npugraph_ex` enabled and
disabled, as well as with the new static kernel option.
Signed-off-by: huyuanquan1 <huyuanquan1@huawei.com>
Co-authored-by: huyuanquan1 <huyuanquan1@huawei.com>
### What this PR does / why we need it?
This PR adds DCP support to the SFA backend.
Please note that due to operator constraints, the current implementation
has to all-gather the entire KV cache and modify the block table to
satisfy the operator input requirements. This results in significantly
increased communication overhead and peak memory usage. Therefore, this
is only a temporary workaround and will be refactored once the operator
provides proper support.
Additionally, because of the above limitations,
`cp_kv_cache_interleave_size` is currently required to be equal to
`block_size`. This restriction will also be removed after the refactor.
#### Test
accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8
| dataset | version | metric | mode | vllm-api-general-stream |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.35 |
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
Introduced 310P W8A8 Quantization Support: New modules and methods have
been added to enable W8A8 static quantization specifically for the
Ascend 310P platform.
Platform-Specific Quantization Configuration Loading: The system now
dynamically loads the appropriate quantization configurations
(AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether
the current hardware is an Ascend 310P device.
Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization
method for 310P is provided, handling the specifics of weight and
activation quantization, including input parameter broadcasting and
weight data manipulation.
Extended AscendModelSlimConfig for 310P: A specialized configuration
class for 310P integrates the new W8A8 linear method for both standard
linear layers and vocabulary parallel embeddings, ensuring proper
quantization application.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
### What this PR does / why we need it?
[Bugfix] fix hash conflict due to reset incompatible configuations
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR removes `update_default_aclgraph_sizes`. In earlier versions, we
add this function to change default `cudagraph_capture_sizes` because
`_npu_paged_attention` degrades significantly on certain shapes (which
is included in default `cudagraph_capture_sizes` of VLLM). Now since we
use FIA as default attention op (which does not contain such performance
degradation), there is no need to add this default change. Otherwise, it
could cause some conflicts if we set a small `cudagraph_capture_sizes`
that < 20 now.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
d68209402d
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
In vLLM-Omni, there exists the empty `ModelConfig`. We need to add a
check before accessing the sub-field of model_config.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Will checked by CI.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
This PR builds upon PR
https://github.com/vllm-project/vllm-ascend/pull/5011 and aims to
further enhance the npu_graph_ex_passes module. Based on prior work, we
have added graph optimization support for the add_rms_quant fused
operator in scenarios where a bias term is present—ensuring the fusion
pattern is correctly registered and matched into the computation graph.
For validation, we switched to the Qwen3-235B-A22B-W8A8 model for
QKVNormRopeWithBias and Qwen3-32B model for QKVNormRope . Benchmark
results show that, compared to the unfused baseline, enabling this
fusion pass significantly improves inference throughput for W8A8
quantized models.
For more details can refer to the
RFC:https://github.com/vllm-project/vllm-ascend/issues/4715
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```
llm = LLM(
model=model,
tensor_parallel_size=GPUs_per_dp_rank,
enforce_eager=False,
enable_expert_parallel=enable_expert_parallel,
trust_remote_code=trust_remote_code,
gpu_memory_utilization=0.98,
max_num_batched_tokens=512,
# load_format="dummy",
max_model_len=2048,
max_num_seqs=16,
quantization="ascend",
additional_config={
"refresh": True,
"enable_npugraph_ex": True
},
compilation_config={
"cudagraph_capture_sizes": [8, 16],
"cudagraph_mode": "FULL_DECODE_ONLY",
},
)
if profile_dir:
llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
if profile_dir:
llm.stop_profile()
for i, output in enumerate(outputs):
if i >= 5:
break
prompt = output.prompt
generated_text = output.outputs[0].text
print(
f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
f"Generated text: {generated_text!r}"
)
```
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: cjian <2318164299@qq.com>
### What this PR does / why we need it?
This PR aims to fix setting of `speculative_config.enforce_eager` in
deepseek v3.2 mtp. The point is that, vllm sets
`speculative_config.enforce_eager` as True if using deepseek_v32 with
mtp. Since we support graph mode, we simply ignore it here. However,
this fix will also implicitly ignore user setting of
`speculative_config.enforce_eager`, we need to take care and remove it
once vllm supports this feature.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This is a part of
https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762
1. refactor the npugraph_ex config,modified the default configuration of
the static kernel, new default value of static kernel is false
2. support online-infer with static kernel
3. fixed the issue where manually modifying FX graphs caused an abnormal
model return type, and removed the related redundant code.
### Does this PR introduce _any_ user-facing change?
yes,the new config of npugraph_ex is as follow:
```
additional_config={
"npugraph_ex_config": {
"enable": True,
"enable_static_kernel": False
}
}
```
### How was this patch tested?
```
vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--tensor-parallel-size 4 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 48 \
--max-model-len 40000 \
--async-scheduling \
--max-num-batched-tokens 9000 \
--trust-remote-code \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config \
'{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}'
```
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: chencangtao <chencangtao@huawei.com>
Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
### What this PR does / why we need it?
Set a additional config parameter to control whether the gmmswigluequant
fuseion operator is enabled; it is enabled by True. / When enabled with
a small number of GPUs, the gmmswigluquant fused operator can cause some
performance degradation.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
#### Perf
test model: GLM 4.6(w8a8)
- single A3 node(ep16, tp16), async-scheduling, mtp, FULL_DECODE_ONLY
- bs=1, input_lens=32000, ouput_lens=1024
Without this PR: TPOT 32.22.ms
With this PR: TPOT 30.23ms
---------
Signed-off-by: zjks98 <zhangjiakang4@huawei.com>
Co-authored-by: zjks98 <zhangjiakang4@huawei.com>
This PR add `MatmulAllreduceRmsnorm` operator and introduces a graph
fusion pass for `matmul_allreduce_rmsnorm` operations. The
implementation includes a new configuration flag, a pattern matching
pass using `torch._inductor.pattern_matcher`.
Co-authored-by: Trunrain [270250579@qq.com](mailto:270250579@qq.com)
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: tongrunze <t00574058@china.huawei.com>
### What this PR does / why we need it?
As mentioned in https://github.com/vllm-project/vllm-ascend/issues/5339,
multi-modal inference on vllm-ascend may lead to OOM issues in some
scenarios.
After our analysis, this is due to the memory fragmentation caused by
frequent dynamic memory size adjustments during runtime. During the
inference, the figure for non-torch memory see a gradual increase from
around 1G to over 5G until the OOM issue occurs.
We find that this problem can be resolved by just directly setting
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True`. Find more details at
https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-handle-the-out-of-memory-issue.
Thus, we decide to set this value by default, except RL (sleep mode)
scenarios.
It's also worthy to note that this environment variable may have more
than one key-value pairs. We should append `",expandable_segments:True"`
to the current configs.
For example:
```python
PYTORCH_NPU_ALLOC_CONF = "page_size:1g" + ",expandable_segments:True".
```
> [!NOTE]
> `max_split_size_mb` or `garbage_collection_threshold` cannot be
enabled together with `expandable_segments=True`.
### Does this PR introduce _any_ user-facing change?
Users do not need to set
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` manually any more.
### How was this patch tested?
I have build a dataset consisting of my own photographs, which can
stably reproduce this OOM issue on Qwen3-VL serie models.
After apply this PR, this problem has been resolved and the amount of
non-torch memory will keep stable at around 1G throughout the whole
inference.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Add basic 310p support. Only dense models work with eager mode now.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
### What this PR does / why we need it?
Convert `vllm-ascend/compilation` to ruff format.
### Does this PR introduce _any_ user-facing change?
During this migration, we encountered some **errors** in our CI and
testing environments, such as:
```
vllm_ascend/utils.py:653: in <module>
def register_ascend_customop(vllm_config: VllmConfig | None = None):
^^^^^^^^^^^^^^^^^
E TypeError: unsupported operand type(s) for |: 'NoneType' and 'NoneType'
```
**1. Root Cause Analysis:**
The project uses a common pattern to break circular dependencies:
```python
if TYPE_CHECKING:
from vllm.config import VllmConfig
else:
VllmConfig = None # Placeholder assigned at runtime
```
When Python parses the function definition `def
register_ascend_customop(vllm_config: VllmConfig | None)`, it attempts
to evaluate the expression `VllmConfig | None`.
Since `VllmConfig` is assigned `None` at runtime, the expression
effectively becomes `None | None`. In Python, `None` is an instance of
`NoneType`. While the `|` operator is implemented for Type objects
(classes), it is not supported for `NoneType` instances, leading to the
`TypeError` shown above.
**2. Solution:**
To maintain the modern `|` syntax required by our new linting standards
while preserving our dependency management strategy, I have introduced:
```python
from __future__ import annotations
```
at the top of the affected files. This enables **Postponed Evaluation of
Annotations (PEP 563)**.
**3. Impact and Benefits:**
- By enabling `annotations`, Python no longer executes the `VllmConfig |
None` operation during module load. Instead, it stores the annotation as
a string literal, completely avoiding the `None | None` calculation.
- We can keep the `VllmConfig = None` placeholders. This ensures that
other modules can still import these symbols without triggering an
`ImportError`, maintaining a stable dependency graph.
- IDEs and static type checkers (MyPy/Pyright) continue to resolve the
types correctly. This allows us to use modern syntax without sacrificing
type safety or runtime stability.
- The only side effect is that `__annotations__` will now return strings
instead of type objects. Since this module does not use runtime type
enforcement or reflection, this change has zero negative impact on
existing functionality.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
we implement set_additional_forward_context in platform, it's necessary
to reuse code of gpu in model runner v2 by inheriting method in gpu
model runer v2. please see model runner v2's plan #5208
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604
This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer realtaed codes into a dedicated folder, which has already
been done in vLLM
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lty <linhebiwen@gmail.com>
### What this PR does / why we need it?
Optimize the warning print information format when detects depredated
code is used in vllm-ascend.
### Does this PR introduce _any_ user-facing change?
NA
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
1. add `COMPILATION_PASS_KEY` constant
2. clean up useless platform interface `empty_cache`, `synchronize`,
`mem_get_info`, `clear_npu_memory`
3. rename `CUSTOM_OP_REGISTERED` to `_CUSTOM_OP_REGISTERED`
4. remove uesless env `VLLM_ENABLE_CUDAGRAPH_GC`
NPUPlatform is the interface called by vLLM. Do not call it inner
vllm-ascend.
### Does this PR introduce _any_ user-facing change?
This PR is just a cleanup. All CI should pass.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix the bug in the PCP overlay feature
1、Fix the bug related to PCP and EPLB overlap by including PCP size in
the word_size calculation.
2、In the PCP pooling scenario, a prompt has been added for setting the
cp_kv_cache_interleave_size.
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
since we support self-defined pass manager now, it's no need to override
the pass config. Let's clean up it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.
Currently, most of the logic in the capture_model method is similar to
that in the vllm code. Directly using the vllm method can reduce the
maintenance cost of the vllm-ascend code. Modify as follows:
1、refactor capture_model function, directly inheriting community methods
2、refactor initialize_aclgraph_capture function, move to
initialize_attn_backend
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>