Update developer doc for v0.11.0-dev. This PR mainly picks the developer doc
from main to v0.11.0-dev. All related features already work with 0.11.0.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Cherry-pick from main
https://github.com/vllm-project/vllm-ascend/pull/4015.
Currently, the usage of structured output feature in vllm-ascend is
totally the same as that in vllm.
Thus, IMO, it's better to remove this doc directly. Otherwise, if the
upstream doc changes and we don't update ours in time, our doc could
mislead users.
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Use group_list[0] to replace group_diff[0] in function
"cumsum_group_list" (moe_mlp.py).
This corrects the logic that converts the group list from cumsum format
to count format.
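A minimal sketch of the cumsum-to-count conversion, for illustration only. It uses plain Python lists instead of NPU tensors, and the helper name `cumsum_to_count` is hypothetical, not the function in moe_mlp.py. The point is that the first count is the first cumsum entry itself (`group_list[0]`), and every later count is the difference between adjacent entries.

```python
def cumsum_to_count(group_list):
    """Convert a cumulative-sum group list into per-group counts.

    Hypothetical helper: plain lists stand in for tensors.
    """
    if not group_list:
        return []
    # The first count equals the first cumulative value.
    counts = [group_list[0]]
    # Each later count is the difference between adjacent cumsum entries.
    for prev, cur in zip(group_list, group_list[1:]):
        counts.append(cur - prev)
    return counts


if __name__ == "__main__":
    print(cumsum_to_count([3, 5, 5, 9]))
```

An empty group shows up as a repeated cumsum value and converts to a count of zero.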
### Does this PR introduce _any_ user-facing change?
No
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
### What this PR does / why we need it?
This PR fixes a bug in the moe_mlp module by correcting the arguments
passed to the torch_npu.npu_dequant_swiglu_quant function. It properly
converts group_list from a cumulative sum to counts for the group_index
parameter.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/main
---------
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>
Co-authored-by: Mercykid-bash <ruanche0218@gmail.com>
In the PD separation scenario, the D node does not need to perform get
operations, and therefore does not need to create ZeroMQ (ZMQ)
communication.
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
Delete wrong configuration in deepseek v3.2 documentation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
NA.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
## Description
This PR addresses two key issues in the MoE module when redundant
experts are enabled, and fixes a calculation precision bug in the
forward inference of quantized MLP:
### 1. Shape Mismatch in EPLB Expert Map Update
- **Root Cause**:
When redundant experts are turned on, a shape inconsistency occurs
during the expert map update in `Vllm_adaptor`:
- The shape of `self.expert_map_per_layer[layer_id]` is
`[num_physical_experts,]` (aligned with physical expert count).
- The shape of `updated_expert_map` is `[num_logical_experts,]` (aligned
with logical expert count).
- Indices in `self.expert_map_per_layer[layer_id]` that exceed the
logical expert count cannot be properly mapped, leading to tensor shape
mismatch errors.
- The same shape mismatch exists in the `log2phy` map update (between
`self.log2phy_map_per_layer[layer_id]` and `updated_log2phy_map`).
- **Fix**:
- Fix the shape initialization of `expert_map_per_layer` and
`log2phy_map_per_layer` to be consistently set to
`[num_physical_experts,]` across the module lifecycle.
- Align the shape of `updated_expert_map` and `updated_log2phy_map` with
the pre-initialized physical-expert-sized tensors during update
operations, ensuring shape consistency for index mapping.
### 2. Calculation Precision Issue in Quantized MoE MLP Forward Inference
- **Root Cause**:
In the forward pass of `moe_mlp`, the
`torch_npu.npu_dequant_swiglu_quant` operator only accepts group lists
in **Count format** as input. However, the group list provided by
`quant_apply_mlp` was in **Cumsum format**, which caused operator input
format mismatch and degraded calculation precision.
- **Fix**:
- Convert the cumsum-formatted group list from `quant_apply_mlp` to
Count format before passing it to `torch_npu.npu_dequant_swiglu_quant`.
- Ensure the input format of the dequantization operator meets its
requirements, restoring the expected calculation precision for quantized
MoE MLP layers.
## Impact
- Resolves shape mismatch errors in EPLB expert/log2phy map updates when
redundant experts are enabled, ensuring stable expert routing.
- Fixes quantized MoE MLP forward precision issues on NPU, aligning
operator input formats with NPU kernel requirements.
- No breaking changes to existing interfaces; the fixes are
backward-compatible for scenarios without redundant experts enabled.
---------
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
With CANN 8.3 and corresponding PTA 2.7.1, `npu_top_k_top_p` supports
passing only k (1<=k<=1024) and p separately.
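As a rough reference for what "only k" or "only p" means, here is a pure-Python sketch of top-k and top-p filtering where either argument may be omitted. It is not the `npu_top_k_top_p` kernel and does not mirror its signature. The function name, the tie handling, and the rule that the token crossing the threshold `p` is kept are all assumptions for illustration. Only the range check on `k` comes from the description above.

```python
import math


def top_k_top_p_filter(logits, k=None, p=None):
    """Mask logits outside the top-k and/or top-p set with -inf.

    Hypothetical reference helper; either k or p may be None.
    """
    out = list(logits)
    if k is not None:
        if not 1 <= k <= 1024:
            raise ValueError("k must satisfy 1 <= k <= 1024")
        if k < len(out):
            # Keep everything at least as large as the k-th largest logit
            # (ties with the k-th value are kept in this sketch).
            kth = sorted(out, reverse=True)[k - 1]
            out = [x if x >= kth else -math.inf for x in out]
    if p is not None:
        # Softmax over the (possibly already masked) logits.
        m = max(out)
        exps = [math.exp(x - m) for x in out]
        total = sum(exps)
        order = sorted(range(len(out)), key=lambda i: exps[i], reverse=True)
        keep, cum = set(), 0.0
        for i in order:
            keep.add(i)
            cum += exps[i] / total
            if cum >= p:
                break
        out = [x if i in keep else -math.inf for i, x in enumerate(out)]
    return out


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]
    print(top_k_top_p_filter(logits, k=2))    # only k
    print(top_k_top_p_filter(logits, p=0.5))  # only p
```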
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E performance test with only `top_k` and only `p`, each passed
separately. This PR improves TPOT by 0.2 ms with `batch_size=16`.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Fix configuration errors in our documentation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
NA.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
The Qwen2.5-VL mrope precision problem will be solved once this PR is
merged.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested on G8600 with the textVQA dataset.
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: shaopeng-666 <lishaopeng21@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
With CANN 8.3, the npu_moe_gating_top_k operator supports 384 experts,
so Kimi can use the operator to get better performance.
---------
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
When retrieving the quantization method for MoE in eager mode, a
KeyError is raised because the quantization file (e.g., that of DeepSeek
v3.2 exp) does not match the model's naming convention:
"model.layers.3.mlp.experts.weight not in self.quant_description".
However, the quantization file contains per-expert keys like:
```json
"model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.gate_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.gate_proj.weight_offset": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.down_proj.weight": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.down_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.down_proj.weight_offset": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.up_proj.weight_scale": "W8A8_DYNAMIC",
"model.layers.3.mlp.experts.255.up_proj.weight_offset": "W8A8_DYNAMIC",
```
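One way such a lookup could tolerate the per-expert naming is sketched below. This is an assumption for illustration, not the fix actually applied in this PR. The helper name `get_moe_quant_type` is hypothetical, and `quant_description` is modeled as a plain dict.

```python
def get_moe_quant_type(quant_description, prefix):
    """Resolve the quant type for a fused MoE prefix.

    Hypothetical helper: tries the fused key first, then falls back to
    the per-expert "<prefix>.<id>.<proj>.weight" keys.
    """
    fused_key = f"{prefix}.weight"
    if fused_key in quant_description:
        return quant_description[fused_key]
    expert_prefix = f"{prefix}."
    found = {
        value
        for key, value in quant_description.items()
        if key.startswith(expert_prefix) and key.endswith(".weight")
    }
    if len(found) != 1:
        # Either no matching key or mixed quant types across experts.
        raise KeyError(f"{fused_key} not in quant_description")
    return found.pop()


if __name__ == "__main__":
    desc = {
        "model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC",
        "model.layers.3.mlp.experts.255.down_proj.weight": "W8A8_DYNAMIC",
        "model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC",
    }
    print(get_moe_quant_type(desc, "model.layers.3.mlp.experts"))
```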
Co-Authored-By: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Earlier we fixed a similar issue for qwen2.5-vl
(https://github.com/vllm-project/vllm-ascend/issues/4430). The
multimodal models in vllm v0.11.0 should all have the same problem. This
PR specifically fixes it for qwen3-vl-moe.
---------
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
### What this PR does / why we need it?
Add a custom op to accelerate the DeepSeek model. The fused op combines
the bmm and transpose, and is applied in the MLA module.
Cherry-picked from commit c68ddc11ce53334fc9a17bad58342148cbf14e86
### Does this PR introduce _any_ user-facing change?
No
---------
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
#3985 moved stream context initialization before the for-loops to
improve performance. However, we found that this might cause an accuracy
drop when used with PD disaggregation. Thus we partly revert this change
when using PD disaggregation, and we will fix this bug in the future.
### Does this PR introduce _any_ user-facing change?
No.
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Fix enabling EPLB when using MTP with float weights. This workaround
will be removed once EPLB supports MTP and float weights.
### How was this patch tested?
Deepseek-V3 + MTP + EPLB in A3.
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Signed-off-by: offline893 <158537145+offline893@users.noreply.github.com>
Co-authored-by: offline0806 <3337230449@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
This PR is cherry-picked from:
https://github.com/vllm-project/vllm-ascend/pull/2958 and
https://github.com/vllm-project/vllm-ascend/pull/4340
Past:
npu_moe_gating_top_k only supported the 'group_count=256' pattern.
Now:
1. npu_moe_gating_top_k supports all sizes of group_count.
2. The functionality of `torch_npu.npu_moe_gating_top_k_softmax` is
included in `torch_npu.npu_moe_gating_top_k`.
CANN: depends on 8.3.RC1
Performance:
1. GLM4.5-w8a8: TPS improves by 6%
2. Qwen3: the same as before
---------
Signed-off-by: 1092626063 <1092626063@qq.com>
### What this PR does / why we need it?
Redundant experts bugfix
The calculation logic for redundant experts has been fixed, allowing the
correct number of redundant experts to be calculated using the map.
Therefore, there is no longer a need to set the redundant expert
parameter when passing the map.
### Does this PR introduce _any_ user-facing change?
After configuring the path for experts_map, users do not need to
configure init_redundancy_expert.
### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.
---------
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Fix the error caused by ngram lacking the input argument
`dummy_compute_logits`.
### How was this patch tested?
CI passed with existing test.
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Resolve the interface compatibility issue of get_input_embeddings in
multimodal (MM) models: the get_input_embeddings function of other
models does not have the is_multimodal parameter.
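A compatibility shim for this kind of signature difference could look like the sketch below. It is an assumption for illustration and not necessarily how this PR resolves the issue. The wrapper name and the toy model classes are hypothetical. It checks the callee's signature with the standard `inspect` module and only forwards `is_multimodal` when the callee accepts it.

```python
import inspect


def call_get_input_embeddings(model, input_ids, is_multimodal=None):
    """Call model.get_input_embeddings, passing is_multimodal only if accepted."""
    fn = model.get_input_embeddings
    params = inspect.signature(fn).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if "is_multimodal" in params or accepts_kwargs:
        return fn(input_ids, is_multimodal=is_multimodal)
    return fn(input_ids)


class TextOnlyModel:
    # Toy model whose method lacks the is_multimodal parameter.
    def get_input_embeddings(self, input_ids):
        return ("text", input_ids)


class MultimodalModel:
    # Toy model whose method accepts is_multimodal.
    def get_input_embeddings(self, input_ids, is_multimodal=None):
        return ("mm", input_ids, is_multimodal)


if __name__ == "__main__":
    print(call_get_input_embeddings(TextOnlyModel(), [1, 2], is_multimodal=[True, False]))
    print(call_get_input_embeddings(MultimodalModel(), [1, 2], is_multimodal=[True, False]))
```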
---------
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
### What this PR does / why we need it?
This PR originally aimed to remove the env introduced by #3988 and use
the lock by default. As described in
https://github.com/vllm-project/vllm/issues/27858, we have tested the
writer lock method in various scenarios and the performance is almost
unaffected. Therefore, we believe it is safe to enable the lock by
default.
After discussion, we decided to keep the env `SHM_BARRIER` and set it to
true by default.
### Does this PR introduce _any_ user-facing change?
`SHM_BARRIER` is set to true by default.
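The default-on behavior can be pictured with the sketch below. Only the variable name `SHM_BARRIER` and its default of true come from this PR. The helper name and the set of accepted spellings are assumptions, not the actual parsing code in vllm-ascend.

```python
import os


def shm_barrier_enabled(environ=None):
    """Return True unless SHM_BARRIER is explicitly set to a false-like value.

    Hypothetical helper; the accepted spellings are an assumption.
    """
    env = os.environ if environ is None else environ
    raw = env.get("SHM_BARRIER", "true")  # enabled when unset
    return raw.strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    print(shm_barrier_enabled({}))
    print(shm_barrier_enabled({"SHM_BARRIER": "0"}))
```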
### How was this patch tested?
By CI.
---------
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
Previously, the dummy run executed compute_logits only once, regardless
of num_speculative_tokens. This caused execute_model to hang on
compute_logits when lm head tensor parallelism exceeded 1. The fix
ensures compute_logits executes correctly during dummy run, matching
num_speculative_tokens.
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
The expert mapping table and weights of the dynamic EPLB were not
updated, so accuracy stayed correct but the dynamic EPLB had no actual
effect. This bug has now been fixed.
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Disable NZ for the float weight case. This is only a quick fix for the
dev branch.
For the main branch, we'll consider more cases to make the fix more
general.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
qwen2.5 32B
<img width="441" height="221" alt="image"
src="https://github.com/user-attachments/assets/7ae18ffd-1ce2-43d9-9960-be45250ad0da"
/>
---------
Signed-off-by: 刘哲续 <liuzhexu1@huawei.com>
Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
### What this PR does / why we need it?
To fix the ops test, where `model_config` is set to `None` and therefore
does not have the `hf_config` attribute, we have added a check to
guarantee that `model_config` is not `None`.
Cherry-picked from main:
https://github.com/vllm-project/vllm-ascend/pull/4384.
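A guard of this kind can be sketched as follows. The helper name `get_hf_config` and the use of `SimpleNamespace` as a stand-in for the real config objects are assumptions for illustration. Only the attribute names `model_config` and `hf_config` come from the description above.

```python
from types import SimpleNamespace


def get_hf_config(vllm_config):
    """Return hf_config, or None when model_config is missing or None."""
    model_config = getattr(vllm_config, "model_config", None)
    if model_config is None:
        # Ops tests may build a config without a model; skip the lookup.
        return None
    return getattr(model_config, "hf_config", None)


if __name__ == "__main__":
    ops_test_cfg = SimpleNamespace(model_config=None)
    real_cfg = SimpleNamespace(
        model_config=SimpleNamespace(hf_config={"model_type": "demo"})
    )
    print(get_hf_config(ops_test_cfg))
    print(get_hf_config(real_cfg))
```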
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
When cudagraph_mode is set to FULL_DECODE_ONLY and dp > 1, the dummy-run
process is triggered. When calling the update_attn_params function, the
num_tokens parameter needs to be passed, and this value was obtained
through positions.shape[0]. However, multimodal models use mRoPE
(multi-dimensional rotary positional embeddings), which makes positions
two-dimensional. As a result, positions.shape[0] no longer equals the
number of tokens. We solve this problem by replacing positions.shape[0]
with num_tokens.
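The shape pitfall can be illustrated with nested Python lists standing in for tensors. The layout `(3, num_tokens)` for mRoPE positions is an assumption made for this sketch, and `leading_dim` is a hypothetical stand-in for `positions.shape[0]`.

```python
def leading_dim(positions):
    """Stand-in for positions.shape[0] on a nested-list 'tensor'."""
    return len(positions)


num_tokens = 8

# Text-only case: positions is 1-D, so the leading dim is the token count.
text_positions = list(range(num_tokens))

# mRoPE case (assumed layout): one row of positions per rotary section.
mrope_positions = [list(range(num_tokens)) for _ in range(3)]

if __name__ == "__main__":
    print(leading_dim(text_positions))   # equals num_tokens
    print(leading_dim(mrope_positions))  # number of sections, not tokens
```

Passing `num_tokens` explicitly avoids depending on the layout of `positions`.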
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
### What this PR does / why we need it?
Enable force_load_balance in aclgraph, solving OOM issues.
Cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/4366
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
Add single-node PD disaggregation instructions for the Qwen 2.5VL model.
### Does this PR introduce _any_ user-facing change?
No
---------
Signed-off-by: mazhixin <mazhixin7@huawei.com>
Signed-off-by: mazhixin000 <mazhixinkorea@163.com>
Co-authored-by: mazhixin <mazhixin7@huawei.com>