### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depend on pr5817
### Does this PR introduce _any_ user-facing change?
before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`
after this pr:
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`
### How was this patch tested?
#### test qwen3-235b eplb num_redundant_experts=16
without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |
with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
#### What this PR does / why we need it?
This PR adapt DispatchGmmCombineDecode operator to eplb tensor list and
expert token numbers.
This operator support gmm1, gmm2, gmm1Scale and gmm2Scale in format of
list.
This operator support couting how many token each local expert recieves
by expertTokensNum .
- vLLM version: v0.13.0
- vLLM main:
7157596103
More info about this operator, please refer to RFC: issue
https://github.com/vllm-project/vllm-ascend/issues/5476
#### Overview
This PR fixes a shape mismatch bug between `expert_placement_map` and
`log2phy_expert_map` when **redundant experts** are enabled in the
vLLM-Ascend platform. The issue occurred during the initialization of
expert maps and their updates via EPLB (Expert Load Balancer)
adjustment, leading to potential tensor shape errors and incorrect
expert routing in distributed MoE deployments.
#### Key Changes
1. **Unify expert map shape calculation logic**
- Ensure the shape of `expert_placement_map` and `log2phy_expert_map`
strictly aligns with the total number of experts (including redundant
experts) during initialization.
- Update the shape adjustment logic in EPLB dynamic update process to
match the initial expert map dimensions.
2. **Add shape consistency checks**
- Add assertion statements to verify the shape consistency of the two
maps after initialization and EPLB adjustment, preventing silent shape
mismatches in subsequent operations.
#### Impact
- Resolves tensor shape errors when using redundant experts with EPLB on
Ascend platform.
- Ensures correct expert routing and load balancing for MoE models with
redundant expert configurations.
- No breaking changes to existing functionality; compatible with
non-redundant expert deployments.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Add LongCat-Flash support.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed
- vLLM version: v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: chuyuelin <923822139@qq.com>
Co-authored-by: chuyuelin <chuyuelin1@huawei.com>
### What this PR does / why we need it?
Unify the loading logic for expert_map and log2phy.
1. The map generated when enabling the redundancy expert is incorrect.
The community generation map function only accepts the number of global
experts. When we pass in the number of logical experts plus redundant
experts, the local expert ID of the last card will index to an expert ID
that does not exist. Now we ensure that the index points to a real
existing expert ID, and each expert can be accessed. Moreover, when
redundant experts are not enabled, the output of our function remains
consistent with the community's function.
2. The map we generate is based on the length of the physical expert,
but in reality, we only need to use the length of the logical expert.
Later on, we will need to pad it accordingly, so we can simply generate
a map with the length of the logical [expert.]
3. Unify the initialization logic across different scenarios and
simplify the code for fused_moe.
**Before refactoring**
- map path is not None:
expert map: get_rank_placement_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.
log2phy: get_rank_log2phy_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.
- map path is None:
expert map: determine_expert_map from '_vllm.laye_r', The function does
not support the redundant experts of vllm-ascend.
log2phy: determine_default_log2phy_map from _'eplb_utils.py'_. The
function does not support the redundant experts of vllm-ascend.
**Refactoring**
eplb_utils.py
init_eplb_config
generate placement
generate expert map
generate log2phy
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Expert Mapping Test Generation:
ep size: 16, num of experts: 256, num of redundant experts: 16
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16]
+++++++++++++++++++++++++++++++++++++++++
Improved map:
[16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
Expert Mapping Test Generation:
ep size: 16, num of experts: 256, num of redundant experts: 0
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
+++++++++++++++++++++++++++++++++++++++
Improved map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
dsr1 baselie:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |
dsr1 eplb:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Currently, `torch_npu.npu_grouped_matmul_swiglu_quant` can only support
weight nz, so we need to trans w13_weight, w2_weight to nz forcely.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR adds model-side integration for the previously introduced
experimental AscendC fused operator DispatchGmmCombineDecode, used in
MoE decoding.
The operator implementation itself was added in a prior PR[#4139
](https://github.com/vllm-project/vllm-ascend/pull/4139).
This change only adapts the model execution path to optionally use the
fused operator.
When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set, the
original MC2 path composed of multiple operators (A8W8 dispatch → GMM →
SwiGLU → GMM → combine) might be replaced by the single fused operator
DispatchGmmCombineDecode.
By default, the existing multi-operator MC2 implementation is preserved.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
### What this PR does / why we need it?
Now `VLLM_ASCEND_ENABLE_NZ` will have three options:
0: disable nz;
1: only quant case enable nz;
2: enable nz as long as possible;
And `VLLM_ASCEND_ENABLE_NZ`=1 by default.
All cases are shown in the table below:
| | W4A4 | W4A8 | W8A8 | fp16/bf16 | fp32 |
|---|---|---|---|---|---|
| trans nz | can't support nz | trans nz by default | trans nz by
default | trans nz when VLLM_ASCEND_ENABLE_NZ is 2 | can't support nz |
| transpose | only support not transpose case | only support transpose
case | only support transpose case | linear: only support not transpose
case<br>gmm: only support transpose case | same to fp16/bf16 |
Some exceptional cases:
1. MLAPO op need to do some additional processing on the weights,
including trans nz. If use MLAPO op, some weight will be transformed to
nz forcely;
2. MLA/SFA's weight `W_UV` will be used by op
`torch.ops._C_ascend.batch_matmul_transpose`, and this op can't support
nz currently;
### Does this PR introduce _any_ user-facing change?
Now fp16/bf16 weight will not trans nz by default.
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
- Renames the MoE comm enum value `MoECommType.FUSED_ALLTOALL` to
`MoECommType.FUSED_MC2` and updates all call sites.
- Updates `select_moe_comm_method` to optionally select `FUSED_MC2` on
Ascend A3 when:
- `enable_expert_parallel=True`
- quantization is `w8a8_dynamic`
- `EP <= 16`
- `dynamic_eplb` is disabled
- `is_mtp_model = False`
- Replaces the old “fused all-to-all” comm implementation with
`FusedMC2CommImpl`, using `TokenDispatcherWithMC2` /
`PrepareAndFinalizeWithMC2` and `dispatch_ffn_combine`.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
### What this PR does / why we need it?
moe multistream overlap to improve the performance.
### How was this patch tested?
--additional-config '{"multistream_overlap_gate": true}'
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
pick from https://github.com/vllm-project/vllm-ascend/pull/4736 to fix
the merge conflict
### What this PR does / why we need it?
Currently, the all_reduce operation in _sync_metadata_across_dp is
performed with gloo backend which is extremely time-consuming when
DPEngineCores are in different nodes. This operation cannot be ignored
by async scheduling in multi-node-scenarios with speculative decoding
(e.g., EAGLE, mtp).
This pr eliminates the all_reduce operation for D Nodes and change the
input parameter of MoEDispatch & MoeCombine operators to make MC2EP
support different num_tokens across all ranks.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with PD disaggregation (2P: DP2TP8EP16 1D: DP8TP4EP32) scenarios
while enabling async scheduling. This pr can remove cross-node
all_reduce with gloo backend and further reduce latency with correct
accuracy.
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
aclgraph is stable and fast now. Let's drop torchair graph mode now.
TODO: some logic to adapt torchair should be cleaned up as well. We'll
do it in the following PR.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
This PR introduces the Ascend implementation of the
`dispatch_ffn_combine` kernel and wires it into the vLLM-Ascend runtime,
together with follow‑up fixes to ensure the kernel builds and runs
correctly in CI.
- Add full host and device implementation of the `dispatch_ffn_combine`
kernel under `csrc/dispatch_ffn_combine`, including tiling logic, MOE
routing helpers, and kernel utilities for quantized FFN dispatch.
- Integrate the new kernel with the PyTorch binding
(csrc/torch_binding.cpp, csrc/torch_binding_meta.cpp) and the Ascend
runtime (vllm_ascend/ascend_forward_context.py,
vllm_ascend/worker/model_runner_v1.py).
- Extend fused MoE communication and token dispatch support in
`vllm_ascend/ops/fused_moe`, adding methods/utilities needed by the new
dispatch path.
- Update quantization logic in vllm_ascend/quantization/w8a8_dynamic.py
to support the new FFN dispatch flow.
- Fix kernel build issues by adjusting `csrc/build_aclnn.sh`, CMake
configuration, and include/namespace usage in the new kernel files.
- Add an end‑to‑end nightly test
`tests/e2e/nightly/ops/test_dispatch_ffn_combine.py` and helper
utilities in `vllm_ascend/utils.py` to validate the new kernel.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0
---------
Signed-off-by: mojave2 <chenchen145@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
The M4 quantization method in ModelSlim adds bias to model weights that
originally do not have a linear bias. PR #4235 supported PD-MIX
quantization and M4 quantization, adding bias to `w8a8.py` and
`w8a8_dynamic.py`, and implementing adaptations in `ops/linear.py` to
prevent it from being reset to `None` by
`self.register_parameter("bias", None)`. However, this modification
introduced an issue where the bias was still being reset to `None` in
certain scenarios, causing errors during service startup. Therefore,
support for M4 quantization is temporarily being reverted in this PR.
___
- vLLM version: v0.11.2
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
In PD-separated deployment scenarios:
* MoE layers use dynamic quantization exclusively.
* For the Attention module, Prefill (P) nodes use **dynamic**
quantization, while Decode (D) nodes use **static** quantization.
In PD-mixed deployment scenarios:
* **All components fall back to dynamic quantization**, as it is
difficult to distinguish between Prefill and Decode tokens.
___
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: Slightwind <slightwindsec@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
There is a lot hack code for v0.11.0, which makes the code hard to
upgrade to newer vLLM version. Since v0.11.0 will release soon. Let's
drop v0.11.0 support first. Then we'll upgrade to v0.11.2 soon.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Redundant experts bugfix
### Does this PR introduce _any_ user-facing change?
After configuring the path for experts_map, users do not need to
configure iinit_redundancy_expert.
### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
In PR https://github.com/vllm-project/vllm-ascend/pull/3420, we
initially placed the quantization type (quant_type) in the MoECommMethod
class. However, since MoECommMethod follows a singleton pattern, it
couldn't accommodate scenarios where different layers in the model might
use different quantization approaches (e.g., MTP modules using
floating-point computation while the main model employs quantized
computation).
In this PR, we've moved the quantization type to the AscendFusedMoe
class and pass it as a parameter to MoECommMethod.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```bash
export HCCL_BUFFSIZE=1024
export VLLM_VERSION=0.11.0
vllm serve /home/data/DeepSeek-R1_w8a8/ \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--served-model-name dsv3 \
--max-model-len 32768 \
--max-num-batched-tokens 4096 \
--max-num-seqs 16 \
--quantization ascend \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}'
```
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
move quant before allgather in Allgather EP, rely on
https://github.com/vllm-project/vllm-ascend/pull/3334
Deepseek R1 W8A8 performance on A2 with
`HCCL_ALGO="level0:NA;level1:pipeline"`:
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|----------|----------|----------|
| 4k | 375.21 | 364.99 |
| 16k | 1465.23 | 1421.75 |
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
1. Rename common_fused_moe.py to fused_moe.py.
2. Rename fused_moe_prepare_and_finalize.py / FusedMoEPrepareAndFinalize
to prepare_finalize.py / PrepareAndFinalize.
3. Rename vllm_ascend/ops/moe to vllm_ascend/ops/fused_moe.
4. Move vllm_ascend/ops/fused_moe.py to
vllm_ascend/ops/fused_moe/fused_moe.py
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.11.0rc3
- vLLM main:
17c540a993
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
1. Replace manual memory cleanup with passing parameter.
2. FusedMoEPrepareAndFinalizeWithMC2 inherits All2All avoid duplicated
code.
3. Fix MC2 bug introduced in
https://github.com/vllm-project/vllm-ascend/pull/3365
4. Unify aclgraph & eager in W8A8_dynamic.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
**Problem Description:**
The existing implementation for the w4a8-dynamic linear method only
supports the old quantization format from msmodelslim. When attempting
to load models quantized with the new version, vLLM encounters errors
due to mismatched tensor shapes and unprocessed quantization parameters.
Relavant issues:
- https://github.com/vllm-project/vllm-ascend/issues/3192
- https://github.com/vllm-project/vllm-ascend/issues/3152
**Proposed Changes:**
1. Add support for w4a8 dynamic(new format) in
AscendW4A8DynamicLinearMethod and TorchairAscendW4A8DynamicLinearMethod
2. Add unit tests and e2e tests for w4a8 dynamic new and old format
models
<details>
<summary><b>details</b></summary>
1. **Support for new w4a8-dynamic format:**
* Detects quantization format by reading the "version" field in
quant_description to ensure backward compatibility.
* Handles the new pre-packed weight format (`2x int4` in an `int8`),
which has a halved dimension. It tells the vLLM loader how to unpack it
using `_packed_dim` and `_packed_factor`.
* Supports the new `scale_bias` parameter, setting its shape based on
the layer type, as required by msmodelslim. For api consistency and
future use, the `layer_type` parameter was also added to other
quantization methods.
* Updates the weight processing logic: new format weights are handled
with `.view(torch.int32)` since they're pre-packed, while old ones are
processed with `npu_convert_weight_to_int4pack`.
2. **New unit and E2E tests:**
* Added unit tests that verify the logic for both the old and new
formats.
* Split the distributed E2E test to confirm that both old and new format
models work correctly.
</details>
Theoretically, these changes will provide support for all common new
version w4a8(dynamic) models from msmodelslim.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
I implement relevant unit tests and e2e tests and test the changes with
following commands:
```bash
# unit tests
python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v
# e2e tests
pytest tests/e2e/singlecard/test_quantization.py -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s
```
I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic
format:
```
vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384
```
All tests mentioned passed locally.
**NOTE: I use quantization model from my own repo in
test_offline_inference_distributed.py**. Here is the description:
[Anionex/Qwen3-1.7B-W4A8-V1](https://modelscope.cn/models/Anionex/Qwen3-1.7B-W4A8-V1/summary)
(including quantization steps).This should be replaced by a model in
vllm-ascend ci modelscope repo.
Thanks for reading!
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Anionex <1005128408@qq.com>
This PR adds support for redundant experts in the EPLB.
Key points:
- Use global_num_experts = num_experts + num_redundant_experts
consistently.
- Backward compatible when num_redundant_experts=0.
Tested
On a 16-rank setup (W8A8) with static EPLB and expert_map_path,
verifying router logits shape and successful requests.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: yechao237 <yechao20180411@gmail.com>
What this PR does / why we need it?
1.Record expert map without dynamic eplb.
2.Add export PYTHONOPTIMIZE=1 when using dynamic eplb.
3.change eplb doc
Does this PR introduce any user-facing change?
How was this patch tested?
Qwen3_moe in A3.
- vLLM version: v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
The `row_idx` parameter is no longer used since
PR[#2689](https://github.com/vllm-project/vllm-ascend/pull/2689), so
remove it across multiple files to remove unnecessary calculations and
parameter passing.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
accuracy test passed for Qwen3 235B and DeepSeek V3 671B after this PR.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: CaranLic <740821011@qq.com>
### What this PR does / why we need it?
Currently, when executing to the Linear layer of models in vLLM-Ascend,
the weights format is ND in unquantized case and skipped ascend case.
This PR supplements the execution logic for Linear layer. We use a new
global variable: VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and
CANN version is 8.3, the weights of the Linear layer will be converted
to FRACTAL_NZ, in both unquantized case and skipped ascend case. We also
use VLLM_ASCEND_ENABLE_NZ to control the existing NZ conversion, such as
w8a8-quantized case.
### Does this PR introduce _any_ user-facing change?
Add a new global variable VLLM_ASCEND_ENABLE_NZ. If you want to use NZ
format, you should set VLLM_ASCEND_ENABLE_NZ=1.
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
### What this PR does / why we need it?
When using dynamic eplb,it will be blocking by nz tensor.We fix these
prolems by clone src tensor and recv tensor.
### Does this PR introduce any user-facing change?
### How was this patch tested?
Qwen3_moe in A3.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
Due to the special input data during the dummy run, the majority of
tokens are distributed on DP0TP0, which results in insufficient
available KV cache on DP0TP0.
This PR changes the `topk_ids` of the dummy_run input from all zeros to
random values.
This is a naive implementation for experts load balance so as to avoid
accumulating too much tokens on a single rank.
### How was this patch tested?
model: DeepSeek-v3-w8a8
```bash
vllm serve DeepSeek-v3-w8a8 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--enforce-eager \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--disable-log-stats \
--max-num-seqs 18 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config \
'{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
```
The Available memory: **2728672256** -> **6771544064**
KV Cache size: **38144** -> **95232** tokens
After enabling load balance
- vLLM version: v0.11.0
---------
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### Motivation
Currently dynamically experts balancing would stop-the-world.
Asynchronously expert load balancing would be better without flowing
problems:
Host-bound latency:
There are many cpu operations during EPLB such as
eplb-algorithm、creating p2p ops、and log2phy expert converting would
spend long cpu time, as ~1s.
Communication latency: The transfer time would cost much in the
situation without nvlink. As the weight of an expert maybe transfer to
multiple new positions, thus N times send/recv for one expert, with
result long latency. We had tested that batch_isend_irecv cost more
100ms for 16 experts weight transmission in A2 server of ascend.
SwiftBalancer would not stop-the-world anymore, in out test on NPU 1~2ms
cost for each layer while benefit 5ms-8ms decode latency with ep_size =
64.
The following updates have been made:
1、expert distribution recording with lower cost.
2、async cpu computing for eplb algo and other python operator.
3、new eplb algo with less expert rebalancing while almost the same
effect.
### Proposed Change
We will gradually migrate the EPLB logic to the VLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record experts distribution during forward. We using expert_token_num
after disptach instead of topk_ids, thus we got much smaller tensor
shape to reduce cost of hbm recording and add-operator.
2. Do all-gather for experts distribution. Using all-gather instead of
all-reduce as less traffic volume.
3. Wake up eplb worker process with experts distribution when
num_iterations comes. Run eplb algorithm in eplb worker.
4. Generate p2p send/recv ops and other operator such as log2phy would
cost long cpu time.
5. Lanch ibatch_send_recv in async_stream before forward.
6. After forward, wait for the ibatch_send_recv finish, then do uapte
expert map and expert weights.
### Co-author
Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com
- vLLM version: v0.10.2
- vLLM main:
567939953b
---------
Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
### What this PR does / why we need it?
1. Replace prepare/finalize operation in fused_moe.py by
moe_comm_method.prepare()/finalize()
2. Replace unified_fused_experts by moe_comm_method.fused_experts() in
fused_moe.py/w8a8_dynamic.py/w4a8_dynamic.py
3. Add calling _select_moe_comm_method in spec-decode proposers.
4. Currently, w4a8_dynamic does not support gatherep, use all2allv
instead.
5. Remove redundant code.
### Does this PR introduce _any_ user-facing change?
AllgatherEP switch is disabled in aclgraph/eager mode, just follow the
rules in modelrunner_v1._select_moe_comm_method()
### How was this patch tested?
e2e & ut
- vLLM version: v0.10.2
- vLLM main:
7f6f2c1182
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
This PR sets the default format of GMM w2_weight in w8a8_dynamic to be
NZ to improve performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: main
- vLLM main:
e40827280b
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
1. Move prepare/finalize operation from moe_comm_method to
/ops/moe/fused_moe_prepare_and_finalize
2. Adapt to token_dispatcher in moe_comm_method
3. Move
moe_comm_method/experts_selector/token_dispatcher/fused_moe_prepare_and_finalize
to /ops/moe
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut
- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55
Signed-off-by: weichen <calvin_zhu0210@outlook.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
* **Unify execution paths:** Consolidates the quantized and
non-quantized execution paths into a single `fused_experts` function,
removing duplicated logic and making the control flow clearer and easier
to maintain.
* **W8A8 dynamic quantization:** Adds support for W8A8 dynamic
quantization inside the unified MoE kernel. Communication routines are
updated to correctly handle dynamic quantization scales for activations.
* **Weight pre-processing:** Prae-transpose the `w13` and `w2` weight
matrices (as implemented in PR #2025) so that quantized and
non-quantized models follow the same code path for the MoE gating,
up-projection, and down-projection operations.
* **All-to-all communication:** Adds an `all-to-all` collective
communication pattern. For large token counts on modern hardware,
`all-to-all` is more efficient than the previous `all-gather` strategy.
However, `all-to-all` is not really captured and replayed due to
multiple D2H operations which will trigger synchronization, and thus
raise error when capture graphs. We only use `all-to-all` when fallback
to `compiled_graph_for_general_shape`.
* **Dynamic communication selection:** The model runner now selects the
optimal MoE communication method (`mc2`, `allgather`, or `alltoall`) at
runtime based on token count and the Ascend SoC version.
* **Limitation:** `all-gather` is not yet supported for quantized
models, which means there is still something left to do on A2.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
No further test cases needed.
- vLLM version: v0.10.1.1
- vLLM main:
d660c98c1b
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
In a mixed-precision scenario, quant_config is not None, but MoE needs
to perform unquantized computation; however, quantized computation is
currently being used. Therefore, we put the with_quant logic into
forward, avoid misjudging in mix-precision scenarios.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut
- vLLM version: v0.10.1.1
- vLLM main:
98ac0cb32d
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
After moved torchair related quantization section into
torchair_quantization, split the torchair from the origin quantization
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Integrate the arange operator to reduce the time spent and improve
performance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
56dcf4e7e9
---------
Signed-off-by: s30076806 <songjiayang2@h-partners.com>
### What this PR does / why we need it?
this pr refactor select_experts of moe module
i merge implementations of quantitative and non-quantitative method in a
new class
use such as vllm like ExpertsSelector.select_experts
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
test in qwen3-moe and all ut.
- vLLM version: v0.10.0
- vLLM main:
e18859298d
Signed-off-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: yangcheng (AJ) <y00806874@china.huawei.com>
### What this PR does / why we need it?
Remove redundant imported `envs`, using `envs_ascend` instead.
```python
import vllm.envs as envs_vllm
import vllm_ascend.envs as envs_ascend
```
- vLLM version: v0.10.0
- vLLM main:
71683ca6f6
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Supports Deepseek-R1 w4a8 quantization.
Since R1 w4a8 uses mixed quantization, only the MOE layer uses
w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which
includes the AscendW4A8DynamicFusedMoEMethod class.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and
`tests/ut/quantization/test_quantizer.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC`
to test deepseek w4a8_dynamic quantized model
#### 1.How to get weights using Modelslim
##### Installation steps
Use the branch master, the commit id is:
298e175d69b3b855111a1e09bbe2fcd12fdb4e24
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh
##### The required transformers environment
transformers>=4.48.2
##### Generate w4a8 weights
cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md Execute the
[pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96)
chapter
Reference command:python3 quant_deepseek_w4a8.py --model_path {Original
weight path} --save_path {Generate weight path} --mindie_format
##### Adapt to vllm-ascend
Since mindie_format generates mindie format, some adaptation
modifications are needed for vllm-ascend to use it:
`quant_model_description_w8a8_dynamic.json` rename to
`quant_model_description.json`, and add `"group_size": 256`
Modification in `config.json`:`"model_type":deepseekv2` is changed to
`"model_type":deepseek_v3`; `quantization_config` is removed;
tips:The group_size and weights match. If the w4a8 weights are not
generated using msmodelslim, you can check the group_size in
quantization_config in config.json.
#### 2.How to run w4a8
##### a.How to run eager mode
export VLLM_USE_V1=1 # v1
python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6
--enforce-eager
eg: python -m vllm.entrypoints.openai.api_server
--model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120 --max-num-seqs 128 --enforce-eager
##### b.How to run graph mode
export VLLM_USE_V1=1 # v1
export HCCL_BUFFSIZE=1024
python -m vllm.entrypoints.openai.api_server --model=$1
--trust-remote-code -tp $2 -dp $3 --enable_expert_parallel
--quantization ascend --port $4 --max-model-len $5
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
eg: python -m vllm.entrypoints.openai.api_server
--model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4
--enable_expert_parallel --quantization ascend --port 8002
--max-model-len 5120
--additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
- vLLM version: v0.10.0
- vLLM main:
c494f96fbc
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
This PR significantly optimizes performance for quantized Mixture of
Experts (MoE) layers by changing the order of quantization and
communication operations.
In the previous implementation, the `all2all` operation was performed on
unquantized `hidden_states` (in FP16/BF16) *before* quantization,
resulting in substantial communication overhead. By performing
quantization on each EP rank **first** and then sending the much smaller
quantized data, we reduce the communication volume by nearly 50%.
Additionally, this PR includes a minor optimization to cast `int` inputs
to `float` for the `argsort` operation, forcing it to run on a faster
NPU core instead of the AICPU.
These changes lead to a clear and significant performance gain in MoE
quantization scenarios.
- vLLM version: v0.10.0
- vLLM main:
7175817637
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py`
Adding e2e case in
`tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC`
to test qwen3 w4a8_dynamic quantized model
Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
of commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`
1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```
2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
--model vllm-ascend/Qwen3-8B-W4A8 \
--port 8000 \
--quantization ascend \
--tensor_parallel_size 2 \
--enforce-eager
```
- vLLM version: v0.10.0
- vLLM main:
4cd7fe6cea
---------
Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
This PR designs the shared expert multi-stream parallelism of
w8a8-dynamic-quantized MoE stage in more detail to achieve better
performance.
- vLLM version: v0.10.0
- vLLM main:
2cc571199b
Signed-off-by: whx-sjtu <2952154980@qq.com>