### What this PR does / why we need it?
Register the connector in the plugin
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
- Refactor and integrate a unified `WeightPrefetchMethod`
- Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention
modules
- Prefetching these weights ahead of matmul-like operators improves
performance by reducing L2 cache transfer latency
### Does this PR introduce _any_ user-facing change?
Add a new config in `--additional-config` for configuration:
```json
{
  "weight_prefetch_config": {
    "enabled": false,
    "prefetch_ratio": {
      "attn": {
        "qkv": 1.0,
        "o": 1.0
      }
    }
  }
}
```
This feature is enabled by default and can be disabled through this
configuration.
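As a minimal sketch, the config above can be built and serialized in Python before passing it on the command line. The keys mirror the JSON example in this description; the variable names are illustrative only.

```python
import json

# Keys mirror the JSON example in this PR description; values are illustrative.
weight_prefetch_config = {
    "weight_prefetch_config": {
        "enabled": False,  # prefetching is on by default; False turns it off
        "prefetch_ratio": {"attn": {"qkv": 1.0, "o": 1.0}},
    }
}

# json.dumps emits strict JSON (no trailing commas), safe to paste into a shell.
additional_config = json.dumps(weight_prefetch_config)
print(f"--additional-config '{additional_config}'")
```

Serializing with `json.dumps` avoids hand-written JSON mistakes such as trailing commas.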
### How was this patch tested?
- vLLM version: v0.11.0
---------
Signed-off-by: yuzhup <15705211260@163.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Co-authored-by: yuzhup <15705211260@163.com>
### What this PR does / why we need it?
1. In quantization scenarios, Qwen3 MoE now uses the fused
`add_rms_norm_quant` op instead of separate `add_rms_norm` and quant ops.
2. The `torch_npu.add_rms_norm_quant` op fixes accuracy when model weights
are quantized with anti_method m4. m4 is an asymmetric outlier
suppression method that generates a non-zero norm bias, and the
`add_rms_norm_quant` op has been updated to take this parameter into
account.
### Does this PR introduce _any_ user-facing change?
Please use a torch_npu version >= torch_npu-2.7.1.dev20250919.
### How was this patch tested?
1. No special parameters or new environment variables need to be set.
2. Tested with quantized Qwen3 MoE models such as
Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, and
Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4).
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: huangdong2022 <huangdong51@huawei.com>
Signed-off-by: h30027576 <huangdong51@huawei.com>
### What this PR does / why we need it?
When mtp > 1, cos and sin need to be refreshed in each step.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
The multistream MoE in torchair has only been validated in decode and
cannot be applied to chunked prefill, so this PR adds checks to isolate
that scenario.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
1. Move additional functionalities from fused_moe.py to
common_fused_moe.py and remove fused_moe.py
2. Remove unnecessary custom classes from qwen3_moe.py, and it will be
completely removed after we release vllm-ascend v0.11.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:
1. Enable/Disable EP
3. Aclgraph & eager
4. SP
- vLLM version: v0.11.0
---------
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
Now that https://github.com/vllm-project/vllm-ascend/pull/3284 has merged,
we can discard some extra code that was previously added for version
compatibility.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Clean up v0.10.2 support in UT and e2e tests.
2. Remove the v0.11.0 periodic job; we're at v0.11.0 now.
3. Remove useless patches for DeepSeek v3.2. They have already been done
in vLLM.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
There are 3 steps to upgrade vllm-ascend to the newest vLLM. We'll create 3
PRs:
- [x] Upgrade vllm to v0.11.0 to make CI happy first.
- [ ] Move deepseek v3.2 to vllm way
- [ ] Then we'll add a new PR to add vllm main support.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
When running DP in an unbalanced scenario, meaning that some DP groups
are executing `dummy_run`, we need to make sure they run in the same mode
as the other DP groups, which improves performance in the DP scenario.
### How was this patch tested?
Tested by adding log in `_dummy_run`
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Currently we run an extra profile_run with `num_tokens ==
self.mc2_tokens_capacity`. However, when setting `max_num_batched_tokens
< self.mc2_tokens_capacity`, this will trigger an assertion error that
requires num_tokens in `_dummy_run` to be smaller than
`max_num_batched_tokens`. This PR skips this extra `profile_run` if
`self.max_num_tokens <= self.mc2_tokens_capacity` so as to avoid this
bug.
This PR also fixes a bug where `kernel_block_sizes` never equals
`[self.cache_config.block_size]`. `kernel_block_sizes` has type
`List[List[int]]`, so the condition should be `kernel_block_sizes !=
[[self.cache_config.block_size]]`. This also helps resolve an issue
where `cpu_offload_gb` cannot be enabled.
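The comparison bug can be reproduced with plain Python lists. In this small sketch, `block_size` is a stand-in for `self.cache_config.block_size`, and the value 128 is arbitrary.

```python
block_size = 128  # stand-in for self.cache_config.block_size (arbitrary value)

# kernel_block_sizes has type List[List[int]].
kernel_block_sizes = [[block_size]]

# Buggy check: a nested list never equals a flat list, so this is always True.
buggy_mismatch = kernel_block_sizes != [block_size]

# Fixed check: compare against a nested list of the same shape.
fixed_mismatch = kernel_block_sizes != [[block_size]]

print(buggy_mismatch, fixed_mismatch)  # True False
```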
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: Angazenn <supperccell@163.com>
- Fixes Qwen3-Next because of vllm #24982
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "窗前明月光,",
        "The president of the United States is Mr.",
        "The capital of France is",
        "The future of AI is",
        "感时花溅泪,",
        "家书抵万金啥意思?",
        "plz tell me a story: ",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
Add restriction conditions to the ApplyTopPTopK operator: 1 <= K <= 1024
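A range guard like the one described could look like the sketch below. The function and constant names are hypothetical and are not taken from the patch; only the bounds 1 and 1024 come from this description.

```python
# Bounds from the PR description; the names are hypothetical.
TOP_K_MIN = 1
TOP_K_MAX = 1024


def top_k_in_supported_range(k: int) -> bool:
    """Return True when k satisfies 1 <= k <= 1024."""
    return TOP_K_MIN <= k <= TOP_K_MAX
```

A caller would presumably fall back to a different sampling path when the check returns `False`, but that fallback is not specified here.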
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
Fix the error "cur batch_size is invalid" during profile_run in the
torchair scenario.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
Fix dp+ep+tp inplace copy error when sp chunked the `hidden_states`.
### How was this patch tested?
test locally with the following scripts
```bash
python examples/offline_data_parallel.py \
--model="Qwen/Qwen3-30B-A3B" \
--dp-size=2 \
--tp-size=2 \
--enable-expert-parallel
```
Signed-off-by: MengqingCao <cmq0113@163.com>
This PR fixes an accuracy problem of aclgraph on A2. The problem was
introduced by PR #2980, which exposed the `all_reduce` of shared_experts
to torch dynamo. This PR moves all the code into forward_impl
to shield it from torch dynamo.
- vLLM version: v0.10.2
- vLLM main:
17b4c6685c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Before the optimization, the rmsnorm time in one decoding is 531.5us.
After the optimization, it is 105us.
I closed the previous
PR (https://github.com/vllm-project/vllm-ascend/pull/2456) by mistake and
am resubmitting it now.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: socrahow <suzihao4@h-partners.com>
### What this PR does / why we need it?
Running a multimodal model with the Ascend scheduler may cause the assertion error
`assert (request.num_tokens - request.num_computed_tokens) == 1`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
17b4c6685c
---------
Signed-off-by: fan2956 <zhoufan53@huawei.com>
### What this PR does / why we need it?
- Fixes a bug where multiple calls (maybe >100) to eagle3-qwen3-8b often incur an "attn_mask index out of range" error
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
```
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --served-model-name Eagle3 --port 8000 --model Qwen/Qwen3-8B --seed 42 -tp 1 --speculative_config '{"model": "Tengyunw/qwen3_8b_eagle3", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'
```
Co-authored-by: liuruijin17 <ricklrj@outlook.com>
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
1. Solved the issue where sizes capture failed for the Qwen3-32b-int8
model when aclgraph, dp1, and tp4 were enabled.
2. Added the exception thrown when sizes capture fails and provided a
solution
3. Add this common problem to the FAQ doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Relying on #3044, this PR aims to further fix:
1. The forward error that occurred when `LogitsProcessorWithLoRA` calls
`AscendLogitsProcessor.forward`. Since `LogitsProcessorWithLoRA`
bypasses the MRO to call it, `super().forward(...)` in
`AscendLogitsProcessor.forward` will raise an error. This PR fixes it by
directly invoking `LogitsProcessor.forward(self, ...)`;
2. The shape mismatch in `add_lora_logits` in punica_npu.py. The
`lora_a_stacked` and `lora_b_stacked` are organized as [num_loras, 1,
lora_rank, hidden_size] and [num_loras, 1, vocab_size, lora_rank] shapes
respectively, but they were misinterpreted in #1583: the last two
dimensions were assumed to be in reverse order, which causes errors in
`bgmv_shrink` and `bgmv_expand`. This PR fixes it by reverting to the
previous version to align with the implementation in punica_cpu.py in
vllm.
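The first fix can be illustrated with toy classes. This is a simplified sketch, not the real vLLM classes: the class names are reused only for readability, and the method names `forward_with_super`, `forward_explicit` and `call_unbound` are hypothetical. It shows that zero-argument `super()` fails when a method is invoked with a `self` that is not an instance of the defining class, while naming the base class explicitly works.

```python
class LogitsProcessor:
    def forward(self, x):
        return x


class AscendLogitsProcessor(LogitsProcessor):
    def forward_with_super(self, x):
        # Zero-argument super() requires self to be an AscendLogitsProcessor.
        return super().forward(x)

    def forward_explicit(self, x):
        # Naming the base class works for any object passed as self.
        return LogitsProcessor.forward(self, x)


class LogitsProcessorWithLoRA:
    """Toy wrapper that is NOT a subclass of AscendLogitsProcessor."""

    def call_unbound(self, method, x):
        # Bypass the MRO: call the plain function with this wrapper as self.
        return method(self, x)


wrapper = LogitsProcessorWithLoRA()

try:
    wrapper.call_unbound(AscendLogitsProcessor.forward_with_super, 1)
    super_failed = False
except TypeError:
    # super(type, obj) rejects obj because it is not an instance of type.
    super_failed = True

explicit_result = wrapper.call_unbound(AscendLogitsProcessor.forward_explicit, 1)
print(super_failed, explicit_result)  # True 1
```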
### Dependencies
This PR depends on changes introduced by #3044 (LoRA support for
`AscendQKVParallelLinear` and `AscendMergedQKVParallelLinear` layers).
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
The LoRA-related tests, e.g., test_ilama_lora.py and
test_ilama_lora_tp2.py, use ilama-3.2-1B, and this model is regarded as
`TransformersForCausalLM`, where `embedding_modules` attribute lacks
`lm_head`. However, `LlamaForCausalLM` and most other models include
both `embed_tokens` and `lm_head` in `embedding_modules`. This attribute
contributes to `supported_lora_modules` when using LoRA in vllm.
Therefore, without `lm_head` in `embedding_modules`, the current tests using
ilama-3.2-1B cannot detect the above errors, since the replacement of
`lm_head` by `LogitsProcessorWithLoRA` is skipped. Simply using
Meta-Llama-3.1-8B-Instruct reproduces the above errors and verifies
whether these fixes work. More comprehensive tests for LoRA are still
needed.
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Fix quant_config input parameter bug in qwenvl series. Currently,
non-instantiated variables should be passed.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: booker123456 <945658361@qq.com>
### What this PR does / why we need it?
1. Support DeepSeek w4a8 per-channel quantization
2. Eager mode supports converting weights to the NZ format
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim
##### Installation steps
git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh
##### Generate w4a8 per-channel weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
- Pin vLLM commit to releases/v0.11.0 branch.
- Fix the breaking change introduced by vLLM commit
d4d9899860
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
17b4c6685c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The LoRA e2e test uses the ilama-3.2-1B model, which uses the
transformers.py model file. Its self-attention layer names end with
"\*.attn", not "\*.self_attn".
Some other models, such as baichuan.py and bert.py, also have attention
layer names that end with "\*.attn".
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py
- vLLM version: v0.10.2
- vLLM main:
17b4c6685c
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
Addresses a bug in DenseOptimRowParallelOp that occurs when tensor
parallelism is not used
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
### What this PR does / why we need it?
Fix bugs when mtp > 1, and reorder the input batch when mtp is not accepted.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by ci
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
---------
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
fix oom in aclgraph.
1. In the current token dispatch implementation, tensors are mounted on
class instances to facilitate parameter passing between different
methods. This approach prevents automatic recycling of these tensors. In
some cases, it may lead to out-of-memory error. To address this issue,
we manually set these tensors to None to release corresponding memory.
2. The `profile_run` method is designed to accurately estimate the
maximum NPU memory usage during vLLM inference. However, in certain
scenarios, MoE models perform inference via MC2, which includes
communication and consumes additional NPU memory. This leads to
inaccurate estimation by the profile run. We address this by actively
triggering MC2 during the profile run for initialization.
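The first point can be sketched with plain Python objects. `Buffer` and `TokenDispatcher` are hypothetical stand-ins, not the classes touched by this PR, and the immediate release relies on CPython reference counting.

```python
import weakref


class Buffer:
    """Stand-in for a large tensor."""


class TokenDispatcher:
    """Toy dispatcher that stashes an intermediate buffer on the instance."""

    def dispatch(self):
        # Storing the buffer on self keeps it alive after this method returns.
        self.cached = Buffer()
        return weakref.ref(self.cached)

    def combine(self):
        # Dropping the reference lets the buffer be reclaimed.
        self.cached = None


dispatcher = TokenDispatcher()
ref = dispatcher.dispatch()
alive_before = ref() is not None  # the instance attribute still holds the buffer
dispatcher.combine()
alive_after = ref() is not None  # reclaimed once the attribute is cleared (CPython)
print(alive_before, alive_after)  # True False
```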
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
PR #2894 makes `ascend_scheduler_config.enabled` always `True` for
non-MLA models. When `ascend_scheduler_config.enabled=True`,
`AscendScheduler`, which is a subclass of `Scheduler`, is always
initialized. However, when async_scheduling is enabled, vLLM needs to
initialize `AsyncScheduler`, so this prevents async_scheduling from being
enabled.
### Does this PR introduce _any_ user-facing change?
not-related
### How was this patch tested?
When the user sets `async_scheduling`, it means the user does not want to
use `AscendScheduler`, so we shouldn't set
`ascend_scheduler_config.enabled = True`.
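The decision described above could be sketched roughly as follows. The function and its parameters are hypothetical and only illustrate the rule; the actual patch may structure the check differently.

```python
def resolve_ascend_scheduler_enabled(
    user_enabled: bool, is_mla_model: bool, async_scheduling: bool
) -> bool:
    """Hypothetical helper: decide whether AscendScheduler should be enabled."""
    if async_scheduling:
        # The user asked for async scheduling, so do not force AscendScheduler on.
        return user_enabled
    if not is_mla_model:
        # Behaviour attributed to PR #2894: always enable for non-MLA models.
        return True
    return user_enabled
```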
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
### What this PR does / why we need it?
The Qwen3 moe MC2 graph currently has two redundant computational
operator implementations. After npu_moe_distribute_dispatch_v2, the
cumsum and cast operations have been added. By using
expert_token_nums_type=0 and not converting weight_scale to float32,
these two operators can be eliminated, thereby improving inference
performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
---------
Signed-off-by: florenceCH <gaoxiang120@huawei.com>
Co-authored-by: florenceCH <gaoxiang120@huawei.com>
### What this PR does / why we need it?
Upgrade vLLM to newest commit
- Fix the aclgraph doesn't work problem, caused by
24fab45d96
- Fix PoolerOutput import error, caused by
755ed7b05b
- Fix the aclgraph weight load error, consistent with the torchair fix.
4492e3a554
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All tests should pass
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
…to avoid unintentional copy ops blocking across different NPU streams,
improving disagg TTIT/TTFT (#2788)"
### What this PR does / why we need it?
This reverts commit 6995a7bc5b. We'll add
it back once the issue is fixed.
related issue: https://github.com/vllm-project/vllm-ascend/issues/3195
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
### What this PR does / why we need it?
This PR is for the adaptation and optimization of qwen3_vl and
qwen3_vl_moe on the Ascend platform.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: booker123456 <945658361@qq.com>
### What this PR does / why we need it?
It is a quick bugfix for the memory explosion issue that requires
further refactoring.
The dummy_run in eager mode may lead to OOM and the reason is that
`hidden_states` were not released in time.
The PR temporarily resolves the issue by manually clearing the cache,
and further refactoring will be conducted subsequently.
Before the modification, the dummy_run's memory showed an accumulation
issue.
<img width="1796" height="207" alt="image"
src="https://github.com/user-attachments/assets/05e2b04c-2f99-4085-9eda-c78b7d9a57b0"
/>
After modification, it can be observed that the memory is released
promptly.
And it was verified that the model responded normally after a single
data input.
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
Upgrade vLLM to newest commit.
1. Remove the useless func get_state_cls, it has been removed from vLLM
already.
e6750d0b18
2. Fix ut broken by
6160ba4151
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
vllm-ascend supports the
[msMonitor](https://gitcode.com/Ascend/mstt/tree/master/msmonitor) tool to
collect performance data of vllm-ascend
### Does this PR introduce _any_ user-facing change?
1. Add the env variable MSMONITOR_USE_DAEMON;
2. Users can enable the msMonitor tool by setting MSMONITOR_USE_DAEMON=1
before running a vllm-ascend model;
3. MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both be set
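The mutual exclusion between the two variables could be checked along these lines. This is a sketch under assumptions: the function name, the exception type and the error message are hypothetical, and only the two environment variable names come from this description.

```python
from typing import Mapping


def msmonitor_requested(env: Mapping[str, str]) -> bool:
    """Return True if msMonitor is requested; raise if it conflicts with the torch profiler."""
    use_msmonitor = env.get("MSMONITOR_USE_DAEMON", "0") == "1"
    if use_msmonitor and env.get("VLLM_TORCH_PROFILER_DIR"):
        # Hypothetical error type and message; the real patch may differ.
        raise ValueError(
            "MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both be set"
        )
    return use_msmonitor
```

In real use the function would be called with `os.environ`; taking a mapping as an argument keeps the check easy to test.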
### How was this patch tested?
1. Run a vllm-ascend model with MSMONITOR_USE_DAEMON unset or set to 0:
the model runs successfully;
2. Run a vllm-ascend model with MSMONITOR_USE_DAEMON=1, then run the
msMonitor tool to collect profile data;
3. Run a vllm-ascend model with both MSMONITOR_USE_DAEMON=1 and
VLLM_TORCH_PROFILER_DIR set: an error is raised
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: mei-feiyao <1332490378@qq.com>
### What this PR does / why we need it?
Remove useless PD check in deepseek
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
When MTP and oprojTP are both enabled, the torchair graph is recompiled,
which degrades performance. This PR fixes this issue.
- vLLM version: v0.10.2
- vLLM main:
486c5599e3
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
### What this PR does / why we need it?
To cut down the memory usage of large weight matrices, we often rely on
various linear operations:
- `ReplicatedLinear`: Stores the entire matrix, consuming excessive
memory.
- `RowParallelLinear`: Requires an `all_reduce` to merge partial results,
introducing additional communication overhead and potential accuracy
loss. Each token is handled across multiple devices rather than a single
device, which is undesirable in SP scenario.
- ...
Furthermore, in multi-way Data Parallelism (DP) configurations, layers
typically store redundant weight copies.
This PR introduces a shared-weight plugin for layers inheriting from
`LinearBase`. It offers the following advantages:
- It evenly distributes a set of layers with identical structures across
devices. Each layer retains its complete weights, eliminating redundant
memory usage.
- It supports asynchronous broadcasting to prefetch weights for upcoming
layers.
- It preserves the custom `process_weights_after_loading()` method to
make keeping NZ format possible.
- It is compatible with any linear class that inherits from
`LinearBase`, thereby preserving all the features of the original linear
implementation.
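To make the first advantage concrete, here is a sketch of one possible way to spread identical layers across devices. Round-robin assignment is an assumption made for illustration; this description does not say which scheme the plugin uses, and `assign_layer_owners` is a hypothetical name.

```python
from collections import Counter
from typing import List


def assign_layer_owners(num_layers: int, world_size: int) -> List[int]:
    """Return, for each layer index, the rank that stores its full weights.

    Round-robin is an assumed scheme, chosen only for illustration.
    """
    return [layer % world_size for layer in range(num_layers)]


owners = assign_layer_owners(num_layers=6, world_size=4)
print(owners)  # [0, 1, 2, 3, 0, 1]

# Each rank stores at most ceil(num_layers / world_size) full layers
# instead of a copy of all of them.
layers_per_rank = Counter(owners)
print(max(layers_per_rank.values()))  # 2
```

Under such a scheme, the owning rank would broadcast a layer's weights to the other ranks before that layer runs, which is where the asynchronous prefetch described above fits in.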
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM main:
f4a948f33f
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
---------
Signed-off-by: clrs97 <524936896@qq.com>
Co-authored-by: CalvinXKY <kyxiezju@163.com>