### What this PR does / why we need it?
Remove extra MLAPO installation step for DeepSeek-V3.2.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
This PR adds some qwen3-235b-w8a8 cases qwen3-30b-w8a8 cases, we need
test them daily
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
Corrected the errors in the information
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
In the DCP-PCP graph mode scenario, there is a shape issue with multiple
batches. This PR fixes this problem.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
The code bug caused an empty bubble. When the npu_paged_cache_load
operator was called, it forcibly transferred seq_len2 to the device,
which triggered synchronization and interrupted the CPU operator's
launch stream.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
1. Fix proxy format processing errors.
2. Layer-wise connector performance optimization.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
enable sleepmode level2 e2e test and add the check logic to ensure the
nz is not enabled.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
use e2e tests
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangx700 <wangxin700@huawei.com>
### What this PR does / why we need it?
This pr fixes ci on eplb
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
Adapts mtp function to Qwen3-next.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Add model feature matrix table.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
add new ut case for aclgraph in auto enable
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
1、in mla_v1 module, add torch_npu.npu_attention_update op when pcp and dcp
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
1、update prepare_finalize.py:fix A2 accuracy problem when pcp and dcp
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
This PR fixes a bug where the workspace of `_npu_paged_attention` in
setup is smaller than execution. For current implementation of
FULL_DECODE_ONLY with `_npu_paged_attention`, we use
`_npu_paged_attention_get_workspace` when capturing with `max_model_len`
as `seq_lens`. This assumes that PA with larger `seq_lens` inputs should
have larger workspace than smaller `seq_lens`. However, there are rare
cases where PA with smaller `seq_lens` incurs larger space. So I add
`get_workspace` directly into `update_attn_params`.
This change might introduce small(≈1%) performance degradation for low
num_tokens(such as 1) in decode phase, and there is no other known
memory issues. So I think this change is acceptable. We can remove this
if new attention op (such as `npu_fused_infer_attention_score`) does not
have such problems.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Fixes a compatible bug with `torch_npu.npu_fused_infer_attention_score`
which is discribed in
https://github.com/vllm-project/vllm-ascend/issues/4020.
@momo609 tells us this solution.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
The environment is same with this issue,
https://github.com/vllm-project/vllm-ascend/issues/4020.
We modify the code according to
https://github.com/vllm-project/vllm-ascend/pull/3918.
And run below codes:
```python
# run with Qwen3-next-mtp
prompts = [
"Who are you?",
]
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4,
enforce_eager=True,
distributed_executor_backend="mp",
gpu_memory_utilization=0.7,
speculative_config={
"method": "qwen3_next_mtp",
"num_speculative_tokens": 1,
},
max_model_len=4096)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Outputs:
```text
Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'
```
Now, `torch_npu.npu_fused_infer_attention_score` is compatible with
Qwen3-Next.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: drslark <slarksblood@qq.com>
### What this PR does / why we need it?
Add Add constraints for sequence parallelism for unsupported scenarios:
1. tp_size > 1
2. enable_expert_parallel must be True for MoE model
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Since we have upgraded to CANN 8.3rc1, we will no longer use the
privately maintained Mooncake repository, but instead use the official
release released by Mooncake:
https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.7.post2 .
Next step: this is only a temporary solution. We will integrate mooncake
into the vllm-ascend base image later for easier use. see
https://github.com/vllm-project/vllm-ascend/pull/3989
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
This PR updates the acc test standard for some cases, we need it to
better maintain acc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
1、in attention_v1 module, convert bsnd t0 tnd when pcp and dcp
2、fix tochair bug: service startup problem
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
- global_segment_size and local_buffer_size use constants for unified
management.
- Newly added support for input formats ending with GB, MB, KB, and B,
while being compatible with existing input methods.
### Does this PR introduce _any_ user-facing change?
- Users can use new input methods
- The documentation has also been modified
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: 李子琦 <liziqi_ing@163.com>
### What this PR does / why we need it?
Make kv-transfer env variable take effect and Fix load-balance proxy.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By CI.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Refactor accuracy test to nightly test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Add adxl timeout parameter in kv pool user guide, avoiding timeout error
when initializing connections between devices.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
### What this PR does / why we need it?
Add developer guide of eplb
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
Add version policy for main branch to clear how vllm-ascend work with
vllm
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add aclgraph developer guide.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
Refactor the DeepSeek-V3.2-Exp tutorial.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Refactor kv cache tensor initialization logic.
1. Unify the kvcache tensor initialization logic of deepseek and normal
models
2. spilt `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors`
and `_reshape_kv_cache_tensors`, following gpu modelrunner in vllm
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
1. prefill disaggregation scenario
4. deepseek + aclgraph/eager mode
5. qwen3 next
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
move quant before allgather in Allgather EP, rely on
https://github.com/vllm-project/vllm-ascend/pull/3334
Deepseek R1 W8A8 performance on A2 with
`HCCL_ALGO="level0:NA;level1:pipeline"`:
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|----------|----------|----------|
| 4k | 375.21 | 364.99 |
| 16k | 1465.23 | 1421.75 |
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
This PR adds full graph for multimodal nightly test, we need to maintain
this senario
### How was this patch tested?
by running the test
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
### What this PR does / why we need it?
Set adxl engine as the default Mooncake backend, because Ascend
Transport is no longer maintained.
Update README to include instructions for installing the adxl backend
Mooncake.
### Does this PR introduce _any_ user-facing change?
Users need to compile and install the mooncake backend for adxl
according to the revised README instructions.
### How was this patch tested?
By CI.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
Add accuracy test for multiple models:
- Meta_Llama_3.1_8B_Instruct
- Qwen2.5-Omni-7B
- Qwen3-VL-8B-Instruct
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This PR fixes deepseek v3.2 mtp bug.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
All existed ci tests should pass.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
### What this PR does / why we need it?
Add accuracy test for qwen3-8b-w8a8
- vLLM version: v0.11.0rc3
- vLLM main:
c9461e05a4
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Fix#3891.
The empty of `moe_comm_method` in the above issue is due to the wrong
check for MoE models. To be specific, the method `is_moe_model` only
checks whether a text-only model is a MoE model, without considering
multi-modal models, e.g., `VL` and `Omni`.
Check the config dict recursively to find if it has a key contains
"expert", without checking the model architecture.
It is worth noting that, we can't verify a model by if it contains
`FusedMoE` module because `is_moe_model` is called somewhere before the model loading, e.g., it's called when updating the ACLGraph config in
platform initialization.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Protect the scene where the first problem occurs. The execution should
be interrupted when the video memory application fails, rather than
waiting until an illegal address is accessed.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
1、revert TND modify when dcp pcp, which is introduced by
f57bdb09fc
2、deal aclgraph pad border issue
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
This PR upgrade CANN from 8.2rc1 to 8.3rc1 and remove the CANN version
check logic.
TODO: we notice that UT runs failed with CANN 8.3 image. So the base
image for UT is still 8.2. We'll fix it later.
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
The current test cases lack end-to-end (e2e) testing for the
deepseek-v2-lite network in ge graph mode.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: CodeNine-CJ <chenjian343@huawei.com>
### What this PR does / why we need it?
correct bug to fix the value of max_num_tokens
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
- Add support for DeepSeek v3.2 in FULL_DECODE_ONLY mode.
- Add unit test for sfa_v1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: 1Fire4 <wangdingyi2@huawei.com>
### What this PR does / why we need it?
In the `update_*attn_params` functions, the
`torch.npu.stream(update_stream)` context manager was previously located
inside the for-loop that updates parameters for each layer. This
resulted in redundant stream initiations for every layer, adding
unnecessary overhead.
This commit refactors the code by moving the stream context manager to
wrap the entire for-loop. This ensures that the update stream is
initiated only once per function call, rather than for each layer. This
change reduces 90us in each decode model.
update stream in every layer:
<img width="1720" height="383" alt="image"
src="https://github.com/user-attachments/assets/70e4cb69-5bc1-4180-a67d-c99132134be6"
/>
remove update stream in every layer:
<img width="1269" height="175" alt="image"
src="https://github.com/user-attachments/assets/0e290edb-b0ce-48fe-b032-1b924ade6ae5"
/>
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Because the previous commit hash was accidentally deleted or
overwritten. This patch correct the commit hash available for
https://github.com/AscendTransport/Mooncake to make nightly ci happy
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Upgrade torch-npu to the official release version 2.7.1
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>