### What this PR does / why we need it?
Fix Qwen3-30B-A3B dp parallel hung issue when running with the dp
parallel example.
For large-parameter models of Qwen3-30B and above, weight loading alone
takes 4 to 5 minutes. Therefore, the 5-minute timeout in the current
example code implementation is too short, causing some DP instances to
be killed prematurely and eventually stuck in the DP synchronization
all-reduce operation.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
When running DP in a non-equilibrium scenario, which means there is some
dp groups executing `dummy_run`, we need to make sure it running the
same mode as other dp, thus improving then performance in dp scenario
### How was this patch tested?
Tested by adding log in `_dummy_run`
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Currently we run an extra profile_run with `num_tokens ==
self.mc2_tokens_capacity`. However, when setting `max_num_batched_tokens
< self.mc2_tokens_capacity`, this will trigger an assertion error that
requires num_tokens in `_dummy_run` to be smaller than
`max_num_batched_tokens`. This PR skips this extra `profile_run` if
`self.max_num_tokens <= self.mc2_tokens_capacity` so as to avoid this
bug.
This PR fixes a bug that `kernel_block_sizes` never equals to
`[self.cache_config.block_size]`. `kernel_block_sizes` is type of
List[List[int]], so the condition should be `kernel_block_sizes !=
[[self.cache_config.block_size]]`. This also helps to resolve a issue
that cpu_offload_gb cannot be enabled.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Bump version to v0.11.0rc2 and prepare vLLM Ascend v0.11.0rc0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
- Fixes Qwen3-Next because of vllm #24982
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
```
def main():
prompts = [
"窗前明月光,",
"The president of the United States is Mr.",
"The capital of France is",
"The future of AI is",
"感时花溅泪,",
"家书抵万金啥意思?",
"plz tell me a story: ",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4,
enforce_eager=True,
trust_remote_code=True,
max_model_len=256,
gpu_memory_utilization=0.7,
block_size=64
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
Fix CI by addressing max_split_size_mb config
### Does this PR introduce _any_ user-facing change?
No, test onyl
### How was this patch tested?
Full CI passed, espcially eagle one
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add restriction conditions to the ApplyTopPTopK operator : 1 <= K <=1024
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
add faqs:install vllm-ascend will overwrite existing torch-npu
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Fix the error "cur batch_size is invalid" during profile_run in the
torchair scenario.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
we add a patch for model weight loader to avoid using vLLM weight loader
v2, since v2 will lead unknown issue for torchair. While this patch make
some unknown memory usage problem. To quick fix the problem, let's
expend the `max_split_size_mb` to a larger value to avoid weight load
oom issue.
Further solution is to remove the patch and address weight loader v2
from vLLM.
Closes: https://github.com/vllm-project/vllm-ascend/issues/3251
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix dp+ep+tp inplace copy error when sp chunked the `hidden_states`.
### How was this patch tested?
test locally with the following scripts
```bash
python examples/offline_data_parallel.py \
--model="Qwen/Qwen3-30B-A3B" \
--dp-size=2 \
--tp-size=2 \
--enable-expert-parallel
```
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This PR provides user guide documents for Qwen3-VL 4B and
Qwen3-VL-235B-A22B.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: booker123456 <945658361@qq.com>
This PR fixes accuracy problem of aclgraph on A2. The problem is
introduced by PR #2980, which makes the `all_reduce` of shared_experts
exposed to torch dynamo. This PR moves all the codes into forward_impl
to shiled from torch dynamo.
- vLLM version: v0.10.2
- vLLM main:
17b4c6685c
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Before optimizing,the rmsnorm time in one decoding is 531.5us. After
optimizing,the rmsnorm time in one decoding is 105us.
I closed the previous
PR(https://github.com/vllm-project/vllm-ascend/pull/2456) by mistake and
resubmitted it now
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: socrahow <suzihao4@h-partners.com>
### What this PR does / why we need it?
Running multimodal model with ascend scheduler may cause assert error
【assert (request.num_tokens - request.num_computed_tokens) == 1】
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
17b4c6685c
---------
Signed-off-by: fan2956 <zhoufan53@huawei.com>
### What this PR does / why we need it?
- Fixes the bug that Multiple calls (maybe >100) to eagle3-qwen3-8b often incurs "attn_mask index out of range" error
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
```
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --served-model-name Eagle3 --port 8000 --model Qwen/Qwen3-8B --seed 42 -tp 1 --speculative_config '{"model": "Tengyunw/qwen3_8b_eagle3", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'
```
Co-authored-by: liuruijin17
[ricklrj@outlook.com](mailto:ricklrj@outlook.com)
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: Icey <1790571317@qq.com>
### What this PR does / why we need it?
1. Solved the issue where sizes capture failed for the Qwen3-32b-int8
model when aclgraph, dp1, and tp4 were enabled.
2. Added the exception thrown when sizes capture fails and provided a
solution
3. Add this common problem to the FAQ doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Relying on #3044, this PR aims to further fix:
1. The forward error occured when `LogitsProcessorWithLoRA` calls
`AscendLogitsProcessor.forward`. Since `LogitsProcessorWithLoRA`
bypasses the MRO to call it, `super().forward(...)` in
`AscendLogitsProcessor.forward` will raise an error. This PR fixes it by
directly invoking `LogitsProcessor.forward(self, ...)`;
2. The shape mismatch in `add_lora_logits` in punica_npu.py. The
`lora_a_stacked` and `lora_b_stacked` are organized as [num_loras, 1,
lora_rank, hidden_size] and [num_loras, 1, vocab_size, lora_rank] shapes
respectively, but they are misunderstood in #1583---the last two
dimensions were assumed in reverse order, which causes errors in
`bgmv_shrink` and `bgmv_expand`. This PR fixes it by reverting it to the
previous version to align with the implementation in punica_cpu.py in
vllm.
### Dependencies
This PR depends on changes introduced by #3044 (LoRA support for
`AscendQKVParallelLinear` and `AscendMergedQKVParallelLinear` layers).
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
The LoRA-related tests, e.g., test_ilama_lora.py and
test_ilama_lora_tp2.py, use ilama-3.2-1B, and this model is regarded as
`TransformersForCausalLM`, where `embedding_modules` attribute lacks
`lm_head`. However, `LlamaForCausalLM` and most other models include
both `embed_tokens` and `lm_head` in `embedding_modules`. This attribute
contributes to `supported_lora_modules` when using LoRA in vllm.
Therefore, without `lm_head` in `embedding_modules`, current tests using
ilama-3.2-1B are unable to find the abve errors since
`LogitsProcessorWithLoRA` replacing `lm_head` is skipped. Simply using
Meta-Llama-3.1-8B-Instruct can reproduce the above errors and check
whether these fixes can work. What's more, it's necessary to add more
comprehensive tests for LoRA.
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Fix quant_config input parameter bug in qwenvl series. Currently,
non-instantiated variables should be passed.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: booker123456 <945658361@qq.com>
### What this PR does / why we need it?
Add vLLM 0.11.0 release hourly job to monitor release branch changes
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
1.Support deepseek w4a8 per-channel quantization
2.The eager mode supports converting weights to the NZ format
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim
##### Installation steps
git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh
##### Generate w4a8 per-channel weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
- Pin vLLM commit to releases/v0.11.0 branch.
- Fix the break change by vLLM commit
d4d9899860
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
17b4c6685c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
LoRA e2e test uses ilama-3.2-1B model. It uses transformers.py model
files. Its self-attention layer names end with "\*.attn", not
"\*.self_attn".
There are some other model attention layer names end with "*.attn", such
as baichuan.py, bert.py.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py
- vLLM version: v0.10.2
- vLLM main:
17b4c6685c
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
Add e2e test related to weight updates in RL scenarios.
Due to CI issues, the newly added Python test files cannot locate the
correct path. As a temporary solution, use absolute paths to add test
cases.
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: Shangwei-Li <lishangwei2@huawei.com>
### What this PR does / why we need it?
Addresses a bug in DenseOptimRowParallelOp that occurs when tensor
parallelism is not used
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
### What this PR does / why we need it?
fix bugs when mtp>1, and reorder input batch when mtp is not accepted.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
by ci
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
---------
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
fix oom in aclgraph.
1. In the current token dispatch implementation, tensors are mounted on
class instances to facilitate parameter passing between different
methods. This approach prevents automatic recycling of these tensors. In
some cases, it may lead to out-of-memory error. To address this issue,
we manually set these tensors to None to release corresponding memory.
2. The `profile_run` method is designed to accurately estimate the
maximum NPU memory usage during vLLM inference. However, in certain
scenarios, MoE models perform inference via MC2, which includes
communication and consumes additional NPU memory. This leads to
inaccurate estimation by the profile run. We address this by actively
triggering the MC2 during profile run for initialization.```.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
PR #2894 make ascend_scheduler_config.enabled always be `True` for
non-mla models,when `ascend_scheduler_config.enabled=True `, it will
always initialize `AscendScheduler` which is a subclass of `Scheduler`,
but when we enbale async_scheduling,we need to initialize
`AsyncScheduler` in vllm, this will make async_scheduling can't be
enabled.
### Does this PR introduce _any_ user-facing change?
not-related
### How was this patch tested?
when user set `async_scheduling`, it means user don't want to use
`AscendScheduler`, so we shouldn't set `ascend_scheduler_config.enabled
= True`
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
What this PR does / why we need it?
The Qwen3 moe MC2 graph currently has two redundant computational
operator implementations. After npu_moe_distribute_dispatch_v2, the
cumsum and cast operations have been added. By using
expert_token_nums_type=0 and not converting weight_scale to float32,
these two operators can be eliminated, thereby improving inference
performance.
Does this PR introduce any user-facing change?
No
How was this patch tested?
No need
vLLM version: v0.10.2
vLLM main:
f225ea7dd9
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
---------
Signed-off-by: florenceCH <gaoxiang120@huawei.com>
Co-authored-by: florenceCH <gaoxiang120@huawei.com>
### What this PR does / why we need it?
Upgrade vLLM to newest commit
- Fix the aclgraph doesn't work problem, caused by
24fab45d96
- Fix PoolerOutput import error, caused by
755ed7b05b
- Fix the aclgraph weight load error to keep the same with torchair fix.
4492e3a554
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All test should pass
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
…to avoid unintentional copy ops blocking across different NPU streams,
improving disagg TTIT/TTFT (#2788)"
### What this PR does / why we need it?
This reverts commit 6995a7bc5b. We'll add
it back once the issue is fixed.
related issue: https://github.com/vllm-project/vllm-ascend/issues/3195
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
### What this PR does / why we need it?
This PR is for the adaptation and optimization of qwen3_vl and
qwen3_vl_moe on the Ascend platform.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: booker123456 <945658361@qq.com>
### What this PR does / why we need it?
`ready` label now is used for trigger full e2e test now. If a PR is
ready and merge conflict then, no need to drop the ready label.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Just a github action change. No need for function test.
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Revise the EPLB feature guide content.Add eplb params to ascend config.
### Does this PR introduce any user-facing change?
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
It is a quick bugfix for the memory explosion issue that requires
further refactoring.
The dummy_run in eager mode may lead to OOM and the reason is that
`hidden_states` were not released in time.
The PR temporarily resolves the issue by manually clearing the cache,
and further refactoring will be conducted subsequently.
Before the modification, the dummy_run's memory showed an accumulation
issue.
<img width="1796" height="207" alt="image"
src="https://github.com/user-attachments/assets/05e2b04c-2f99-4085-9eda-c78b7d9a57b0"
/>
After modification, it can be observed that the memory is released
promptly.
And it was verified that the model responded normally after a single
data input.
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### What this PR does / why we need it?
Correct the vllm interface e2e test Base container image name
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Tests in vllm ci pipeline
- vLLM version: v0.10.2
- vLLM main:
52d0cb8458
Signed-off-by: leo-pony <nengjunma@outlook.com>
Upgrade vLLM to newest commit.
1. Remove the useless func get_state_cls, it has been removed from vLLM
already.
e6750d0b18
2. Fix ut broken by
6160ba4151
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
vllm-ascend support [msMonitor
](https://gitcode.com/Ascend/mstt/tree/master/msmonitor)tool to collect
performance of vllm-ascend
### Does this PR introduce _any_ user-facing change?
1.add env MSMONITOR_USE_DAEMON;
2.user cann enable msMonitor tool by setting MSMONITOR_USE_DAEMON=1
before run vllm-ascend model;
3.MSMONITOR_USE_DAEMON and VLLM_TORCH_PROFILER_DIR cannot both set
### How was this patch tested?
1.run vllm-ascend model while not set MSMONITOR_USE_DAEMON=1 or set
MSMONITOR_USE_DAEMON=0, model will run successfully;
2.run vllm-ascend model while set MSMONITOR_USE_DAEMON=1, run msMonitor
tool to collect profile data;
3.run vllm-ascend model while set MSMONITOR_USE_DAEMON=1 and
VLLM_TORCH_PROFILER_DIR, will raise error
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: mei-feiyao <1332490378@qq.com>
### What this PR does / why we need it?
Remove useless PD check in deepseek
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
When MTP and oprojTP are enabled, it triggers the recompilation of the
torchair graph, leading to a decrease in performance, and this PR fixes
this issue.
- vLLM version: v0.10.2
- vLLM main:
486c5599e3
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
### What this PR does / why we need it?
Add OOT platform E2E test case to be run in the vllm buildkite pipeline.
Note: added test case is not run in vllm-ascend CI.
### Does this PR introduce _any_ user-facing change?
NA
- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9
Signed-off-by: leo-pony <nengjunma@outlook.com>