### What this PR does / why we need it?
Refactor `vllm_ascend/ops/fused_moe` to replace scattered MoE business
`**kwargs` with typed request objects and explicit stage boundaries.
- Prepare, dispatch, MLP, and quant stages now have clearer ownership.
- Main MoE path no longer depends on business `kwargs.get(...)` lookups.
- Comm and dispatcher interfaces are request-only on the main path.
- UTs can assert stage-level fields directly instead of inferring
behavior indirectly.
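To illustrate the idea of replacing business `**kwargs` with a typed request, here is a minimal sketch. The names `MoEDispatchRequest`, `legacy_dispatch`, `typed_dispatch`, and all field names are hypothetical and are not taken from `vllm_ascend/ops/fused_moe`; the sketch only shows why explicit fields are easier to assert on than `kwargs.get(...)` lookups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEDispatchRequest:
    """Hypothetical typed request; field names are illustrative only."""
    hidden_size: int
    top_k: int
    with_quant: bool = False


def legacy_dispatch(**kwargs):
    # Old style: a misspelled key silently falls back to the default value.
    return {"top_k": kwargs.get("top_k", 1),
            "with_quant": kwargs.get("with_quant", False)}


def typed_dispatch(request: MoEDispatchRequest):
    # New style: every field is explicit, so a typo fails when the request is built.
    return {"top_k": request.top_k, "with_quant": request.with_quant}


request = MoEDispatchRequest(hidden_size=4096, top_k=8, with_quant=True)
result = typed_dispatch(request)
```

A unit test can then compare `request.top_k` or `result` directly instead of inferring what a stage read from `kwargs`.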
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed.
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
This PR adapts the `w8a8_mxfp8` quantization method to support Qwen
Vision-Language (VL) models. Key changes include:
- Reshaping multi-dimensional input tensors to 2D before the quantized
matrix multiplication.
- Reshaping the 2D output back to its original multi-dimensional format.
- Adding specific output reshaping for the visual components of Qwen VL
models.
- Casting the bias tensor to `float32` to comply with the
`npu_quant_matmul` kernel requirements.
These changes are necessary to enable `w8a8_mxfp8` quantization for
models with multi-modal inputs like Qwen VL.
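The reshape steps above can be sketched as plain shape bookkeeping. This is an assumption-laden illustration in pure Python, not the PR's tensor code: the helper names `flatten_to_2d` and `restore_output_shape` are hypothetical, and the real change operates on NPU tensors around `npu_quant_matmul`.

```python
import math


def flatten_to_2d(shape):
    """Collapse all leading dimensions so the matmul sees (rows, hidden)."""
    *leading, hidden = shape
    return (math.prod(leading), hidden)


def restore_output_shape(original_shape, out_features):
    """Put the leading dimensions back after the 2D matmul."""
    return (*original_shape[:-1], out_features)


x_shape = (2, 16, 1024)  # hypothetical (batch, seq, hidden) input
x_2d = flatten_to_2d(x_shape)
y_shape = restore_output_shape(x_shape, out_features=4096)
```

Here `x_2d` is `(32, 1024)` and `y_shape` is `(2, 16, 4096)`, so the caller gets back the same leading dimensions it passed in.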
### Does this PR introduce _any_ user-facing change?
No, this is a backend enhancement to extend quantization support to new
model architectures. There are no user-facing API or behavior changes.
### How was this patch tested?
CI is expected to pass. Manual testing should be performed with a Qwen
VL model using `w8a8_mxfp8` quantization to verify correctness and
performance.
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: ksiyuan <ksiyuan@umich.edu>
### What this PR does / why we need it?
This PR fixes the layer name mapping logic in `AscendModelSlimConfig`
for quantization config loading.
1. **kimi_k2 model layer name mapping issue**: The `kimi_k2` model has a
unique layer naming convention that differs from the standard
`hf_to_vllm` mapping. One layer was defined in the mapper but was not
being correctly applied, causing quantization config lookup failures.
2. **Manual mapping registration timing issue**: The manual mapping
check in `apply_vllm_mapper` was executed before `vllm_config` was
initialized, causing `model_type` to be unavailable. This prevented some
models with manual mappings from being correctly registered.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Tested with `kimi_k2` model to verify the special layer name mapping
works correctly. Also tested with other models that have manual mappings
defined in `QUANT_MODEL_PREFIX_MAPPINGS` to ensure the registration
timing fix works properly.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Adapt to the model type of Qwen3-VL-8B-Instruct-W8A8
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>
### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints:
1. Only the PD separation mode is supported, and
MooncakeLayerwiseConnector is used to run the model.
2. Currently, only the activation value supports dynamic quantization,
and the KV cache supports static quantization. C8 quantization with MTP
is not supported. You can use ModelSlim for quantization. The
quantization procedure is as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Reapply the auto-detect quantization format feature (originally in
#6645, reverted in #6873) and extend it to support remote model
identifiers (e.g., `org/model-name`).
Changes:
- Reapply auto-detection of quantization method from model files
(`quant_model_description.json` for ModelSlim, `config.json` for
compressed-tensors)
- Add `get_model_file()` utility to handle file retrieval from both
local paths and remote repos (HuggingFace Hub / ModelScope)
- Update `detect_quantization_method()` to accept remote repo IDs with
optional `revision` parameter
- Update `maybe_update_config()` to work with remote model identifiers
- Add platform-level `auto_detect_quantization` support
- Add unit tests and e2e tests for both local and remote model ID
scenarios
Closes #6836
### Does this PR introduce _any_ user-facing change?
Yes. When `--quantization` is not explicitly specified, vllm-ascend will
now automatically detect the quantization format from the model files
for both local directories and remote model IDs.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This PR adds aclgraph support for model runner v2; see RFC #5208.
It contains these modifications:
- adapt to the newest commit of the vLLM main branch.
- provide a unified interface for the extra forward context shared by
model runner v1 and model runner v2.
- implement graph mode for main model.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
…the quantization layer name
### What this PR does / why we need it?
This PR modifies the loading logic for layer name prefixes in quantized
models. The goal is to reduce or eliminate the need for point-to-point
(hardcoded) modifications by leveraging the built-in mapper mechanism
already provided in vLLM's model code. For models that do not yet have a
corresponding mapper, the original point-to-point modification approach
has been retained to ensure backward compatibility.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The changes were validated using an offline deployment script to launch
and verify multiple multimodal models. Testing confirmed that the
updated loading logic correctly handles layer name prefixes across
different model architectures, with no regression in model
initialization or inference behavior.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
### What this PR does / why we need it?
This is a bug fix for an issue where MoE models fail to load
quantized weights in w4a8 format when EP is not enabled. The parameters
["weight_scale_second", "weight_offset_second", "scale_bias"] must be
parsed in per-group mode, regardless of other conditions.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
### What this PR does / why we need it?
This pull request introduces support for the Qwen3.5 MoE model on Ascend
devices. The key changes are:
* **Quantization Configuration for Qwen3.5 MoE**: Adds necessary prefix
mappings and packed module definitions for `qwen3_5_moe` in
`vllm_ascend/quantization/modelslim_config.py` to enable ModelSlim
quantization.
* **Triton Kernel Fix**: Corrects a bug in the `fused_gdn_gating` Triton
kernel. The calculation for `BLK_BATCHES` had an operator precedence
issue which is now resolved. The calculation has also been made more
robust with added clamping to prevent potential out-of-bounds memory
access in the unified buffer.
These changes enable the correct and efficient execution of Qwen3.5 MoE
models on Ascend hardware.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI should be used to verify the correctness of these changes. It is
recommended to run tests with the Qwen3.5 MoE model to ensure the new
configurations and the kernel fix work as expected.
Signed-off-by: xmpp777 <yangming2@huawei.com>
### What this PR does / why we need it?
We need to support quant config in glm46v.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We tested w8a8 weights produced with the 'Ascend/msit' quantization method.
The w8a8 weights ran successfully on NPU with vllm-ascend.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>
Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com>
### What this PR does / why we need it?
This pull request addresses a bug related to the fused mc2 functionality
within the EPLB (Expert Parallelism Load Balancing) system, specifically
impacting quantization and MoE communication.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
Signed-off-by: Spicy-Stick <873805887@qq.com>
Signed-off-by: root <root@localhost.localdomain>
### What this PR does / why we need it?
[bugfix]Qwen-Omni quantization bugfix
fix Qwen-Omni quantization weight mapping to float weight
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
### What this PR does / why we need it?
[bugfix]Qwen-Omni quantization model_type bugfix
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
### What this PR does / why we need it?
This pull request is for quantization adaptation of Qwen3Omni, and it
achieves operator-level optimization and AUT (Auto-Quantization Tuning)
component optimization through patch-based modifications.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
83b47f67b1
---------
Signed-off-by: tanhaoan333 <tanhaoan@huawei.com>
### What this PR does / why we need it?
`mxfp_compat` only provides dtype/symbol compatibility helpers for
different `torch_npu` versions, but it was placed under
`vllm_ascend.quantization`. Importing it from device/ops paths could
trigger `quantization/__init__.py` and pull in heavy quantization method
dependencies, increasing startup coupling and causing import-cycle risk
(especially on 310P paths).
### Does this PR introduce _any_ user-facing change?
No functional behavior change intended.
### How was this patch tested?
CI passed.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
## Summary
- Add automatic quantization format detection, eliminating the need to
manually specify `--quantization` when serving quantized models.
- The detection inspects only lightweight JSON files
(`quant_model_description.json` and `config.json`) at engine
initialization time, with no `.safetensors` reads.
- User-explicit `--quantization` flags are always respected;
auto-detection only applies when the flag is omitted.
## Details
**Detection priority:**
1. `quant_model_description.json` exists → `quantization="ascend"`
(ModelSlim)
2. `config.json` contains `"quant_method": "compressed-tensors"` →
`quantization="compressed-tensors"` (LLM-Compressor)
3. Neither → default float behavior
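The detection priority above can be sketched for a local directory as follows. This is not the PR's `detect_quantization_method()`; the function names are hypothetical, and the assumption that `quant_method` sits under a `quantization_config` key in `config.json` is mine.

```python
import json
import tempfile
from pathlib import Path


def sketch_detect_quantization(model_dir):
    """Return "ascend", "compressed-tensors", or None for a local directory."""
    model_dir = Path(model_dir)
    # Priority 1: ModelSlim writes quant_model_description.json.
    if (model_dir / "quant_model_description.json").is_file():
        return "ascend"
    # Priority 2: LLM-Compressor records quant_method in config.json.
    config_path = model_dir / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        # Assumption: the key lives under "quantization_config".
        quant_cfg = config.get("quantization_config") or {}
        if quant_cfg.get("quant_method") == "compressed-tensors":
            return "compressed-tensors"
    # Priority 3: no marker found, keep the default float behavior.
    return None


def resolve_quantization(user_flag, model_dir):
    # An explicit --quantization value always wins over detection.
    if user_flag is not None:
        return user_flag
    return sketch_detect_quantization(model_dir)


with tempfile.TemporaryDirectory() as tmp:
    cfg = {"quantization_config": {"quant_method": "compressed-tensors"}}
    (Path(tmp) / "config.json").write_text(json.dumps(cfg), encoding="utf-8")
    detected = resolve_quantization(None, tmp)
    forced = resolve_quantization("ascend", tmp)
```

Only small JSON files are opened, which matches the goal of avoiding `.safetensors` reads during detection.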
**Technical approach:**
Hooked into `NPUPlatform.check_and_update_config()` to run detection
after `VllmConfig.__post_init__`. Since `quant_config` is already `None`
at that point, we explicitly recreate it via
`VllmConfig._get_quantization_config()` to trigger the full quantization
initialization pipeline.
## Files Changed
| File | Description |
|------|-------------|
| `vllm_ascend/quantization/utils.py` | Added
`detect_quantization_method()` and `maybe_auto_detect_quantization()` |
| `vllm_ascend/platform.py` | Integrated auto-detection in
`check_and_update_config()` |
| `vllm_ascend/quantization/modelslim_config.py` | Improved error
handling for weight loading |
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
This PR adds support for Kimi-K2.5 models on NPU with bf16 and w4a8
weights.
The corresponding PR in the vllm community has been merged:
https://github.com/vllm-project/vllm/pull/34501
### Does this PR introduce _any_ user-facing change?
- No.
### How was this patch tested?
We tested the Kimi-K2.5 weights, available at:
https://modelscope.cn/models/Eco-Tech/Kimi-K2.5-W4A8
The w4a8 weights ran successfully on a 910B NPU with vllm-ascend.
- vLLM version: v0.15.0
- vLLM main:
9562912cea
---------
Signed-off-by: LoganJane <LoganJane73@hotmail.com>
### What this PR does / why we need it?
GLM5 adaptation
1. use torch_npu.npu_lightning_indexer for GLM5
2. forbid eagle proposer when fullgraph mode is enabled because of bugs
3. add quantization config for GLM5
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM main:
978a37c823
---------
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
## Summary
- Remove unused `set_rotation_config` and `apply_rotation` methods from
`AscendW4A4LaosDynamicLinearMethod`
- Remove unused `rotation_type` field and associated conditional
quantization parameters (`heads_rotation`, `kronecker_rotation_n`,
`kronecker_rotation_m`)
These rotation-related functions and parameters are never called in the
current W4A4 LAOS dynamic quantization workflow.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
Introduced 310P W8A8 Quantization Support: New modules and methods have
been added to enable W8A8 static quantization specifically for the
Ascend 310P platform.
Platform-Specific Quantization Configuration Loading: The system now
dynamically loads the appropriate quantization configurations
(AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether
the current hardware is an Ascend 310P device.
Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization
method for 310P is provided, handling the specifics of weight and
activation quantization, including input parameter broadcasting and
weight data manipulation.
Extended AscendModelSlimConfig for 310P: A specialized configuration
class for 310P integrates the new W8A8 linear method for both standard
linear layers and vocabulary parallel embeddings, ensuring proper
quantization application.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
### What this PR does / why we need it?
To use quantized weights generated by the LLM Compressor quantization
tool from the vLLM community, the vLLM Ascend engine needs to be
adapted to support the compressed-tensors quantization format.
1. Support W4A8 dynamic weights for MoE models.
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
---------
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Fix the weight-loading error for Qwen3VL dense quantized models.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
The Qwen3VL quantized model service initialized successfully. Inference
requests are processed correctly, and valid responses are returned.
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
### What this PR does / why we need it?
[Feature] Adapt the DispatchGmmCombineDecode operator to align with the
weight scale dtype of small operators.
- **Before**: weight scale must be float32.
- **After**: weight scale can be float32/float16 when x is float16, and
float32/bfloat16 when x is float32/bfloat16. The w1 scale and w2 scale
may also use different dtypes.
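The "After" rule can be written as a small lookup. This is only a sketch of the rule as stated above, using dtype names as strings; `ALLOWED_SCALE_DTYPES` and `scales_supported` are hypothetical names, not part of the operator's interface.

```python
# Allowed weight-scale dtypes per activation (x) dtype, per the "After" rule.
ALLOWED_SCALE_DTYPES = {
    "float16": {"float32", "float16"},
    "float32": {"float32", "bfloat16"},
    "bfloat16": {"float32", "bfloat16"},
}


def scales_supported(x_dtype, w1_scale_dtype, w2_scale_dtype):
    """Check w1 and w2 scales independently, so they may differ from each other."""
    allowed = ALLOWED_SCALE_DTYPES.get(x_dtype, set())
    return w1_scale_dtype in allowed and w2_scale_dtype in allowed
```

For example, `scales_supported("float16", "float32", "float16")` holds, while a float16 scale with bfloat16 activations does not.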
More info about this operator, please refer to RFC: issue
https://github.com/vllm-project/vllm-ascend/issues/5476
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
#### Perf
> When scale is of type fp16 or bf16, it will be cast to fp32 internally
within the operator, while the subsequent computations remain unchanged.
Therefore, this PR will introduce an additional cast operation but halve
the memory copy operations for scale. Furthermore, since the scale data
is only a few KB in size and participates in relatively few
computations, its impact is almost negligible compared to major
operations like matrix multiplication. Thus, the theoretical performance
change should be minimal.
test single operator cases from qwen3-235b,
- single A3 node(ep16), 64 moe experts, 4 experts / die (like qwen3-235b
ep32)
- batch=18/32, token_hidden_size 4096, moe_intermediate_size 1536
The test was conducted for 100 rounds, and the average of the last 95
rounds was taken.
| | bs18(us)| bs32(us)|
| -----| -----| -----|
|Without this PR|96.28|108.83|
|With this PR|96.06|107.90|
Note: Single-operator benchmarks represent an ideal scenario. They are
usually only useful for referencing relative changes and may not fully
align with performance data observed within the full model.
#### Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depend on pr5817
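For users migrating existing launch scripts, the renames listed above can be applied mechanically. The helper below is a hypothetical sketch, not part of vllm-ascend; it also assumes `gate_eplb` was passed as a top-level additional-config key, which the text does not state.

```python
# Old top-level key -> new key inside "eplb_config" (renames from this PR).
RENAMED_KEYS = {
    "num_iterations_eplb_update": "expert_heat_collection_interval",
    "num_wait_worker_iterations": "algorithm_execution_interval",
    "init_redundancy_expert": "num_redundant_experts",
}
# Keys that move into "eplb_config" without a rename.
MOVED_KEYS = ("dynamic_eplb", "expert_map_path")
# Assumption: the removed gate_eplb option was a top-level key.
DROPPED_KEYS = ("gate_eplb",)


def migrate_additional_config(old):
    """Return a new additional-config dict using the nested eplb_config layout."""
    new, eplb = {}, {}
    for key, value in old.items():
        if key in RENAMED_KEYS:
            eplb[RENAMED_KEYS[key]] = value
        elif key in MOVED_KEYS:
            eplb[key] = value
        elif key in DROPPED_KEYS:
            continue
        else:
            new[key] = value  # unrelated options are left untouched
    if eplb:
        new["eplb_config"] = eplb
    return new
```

The result can be serialized with `json.dumps` and passed to `--additional-config`.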
### Does this PR introduce _any_ user-facing change?
before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`
after this pr:
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`
### How was this patch tested?
#### test qwen3-235b eplb num_redundant_experts=16
without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |
with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
To use quantized weights generated by the LLM Compressor quantization
tool from the vLLM community, the vLLM Ascend engine needs to be
adapted to support the compressed-tensors quantization format.
1. Support W8A8 Int8 dynamic weights for MoE models.
2. Specify W4A16 quantization configuration.
Co-authored-by: menogrey 1299267905@qq.com
Co-authored-by: kunpengW-code 1289706727@qq.com
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This PR fixes Minimax model loading in vLLM Ascend backend by:
- Adding a model type check for "minimax" and "minimax_m2" to replace the
"mlp" prefix with "block_sparse_moe"
- Implementing special handling for Minimax expert layer naming
conventions
- Adding Minimax configuration to packed_modules_model_mapping for proper
qkv_proj and experts module handling
Without these changes, Minimax models fail to load on Ascend devices due
to incompatible layer naming and module packing.
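The prefix replacement described above can be sketched like this. The function name and the example layer path are hypothetical, and matching whole dotted path components is my choice for the sketch rather than something the PR specifies.

```python
MINIMAX_MODEL_TYPES = {"minimax", "minimax_m2"}


def remap_minimax_prefix(layer_name, model_type):
    """Rewrite the `mlp` path component to `block_sparse_moe` for Minimax only."""
    if model_type not in MINIMAX_MODEL_TYPES:
        return layer_name
    parts = layer_name.split(".")
    return ".".join("block_sparse_moe" if part == "mlp" else part
                    for part in parts)


# Hypothetical layer name, used only to show the rewrite.
remapped = remap_minimax_prefix("model.layers.0.mlp.experts.3.w1", "minimax_m2")
```

Other model types pass through unchanged, so the mapping does not affect existing quantization config lookups.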
### Does this PR introduce _any_ user-facing change?
Yes. Users can now successfully load and run Minimax models on Ascend
hardware with vLLM. This enables inference capabilities for this model
family on Ascend devices.
### How was this patch tested?
Local Testing:
- Verified model loading for minimax-xxx and minimax_m2-xxx model variants
on Atlas 800I A2 hardware
- Tested inference with sample prompts using vLLM's OpenAI-compatible API
server

Benchmark Validation:
- Compared throughput and latency metrics against GPU baseline
- Verified memory usage stays within expected limits for different batch
sizes
- Tested multi-card inference scenarios with tensor parallelism
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
---------
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
### What this PR does / why we need it?
Support mxfp8 quantization (Qwen linear layers).
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangyao <iwangyao@outlook.com>
#### What this PR does / why we need it?
This PR adapts the DispatchGmmCombineDecode operator to EPLB tensor lists
and expert token counts.
The operator now supports passing gmm1, gmm2, gmm1Scale, and gmm2Scale as
lists.
It also supports counting how many tokens each local expert receives
via expertTokensNum.
- vLLM version: v0.13.0
- vLLM main:
7157596103
More info about this operator, please refer to RFC: issue
https://github.com/vllm-project/vllm-ascend/issues/5476
#### Overview
This PR fixes a shape mismatch bug between `expert_placement_map` and
`log2phy_expert_map` when **redundant experts** are enabled in the
vLLM-Ascend platform. The issue occurred during the initialization of
expert maps and their updates via EPLB (Expert Load Balancer)
adjustment, leading to potential tensor shape errors and incorrect
expert routing in distributed MoE deployments.
#### Key Changes
1. **Unify expert map shape calculation logic**
- Ensure the shape of `expert_placement_map` and `log2phy_expert_map`
strictly aligns with the total number of experts (including redundant
experts) during initialization.
- Update the shape adjustment logic in EPLB dynamic update process to
match the initial expert map dimensions.
2. **Add shape consistency checks**
- Add assertion statements to verify the shape consistency of the two
maps after initialization and EPLB adjustment, preventing silent shape
mismatches in subsequent operations.
#### Impact
- Resolves tensor shape errors when using redundant experts with EPLB on
Ascend platform.
- Ensures correct expert routing and load balancing for MoE models with
redundant expert configurations.
- No breaking changes to existing functionality; compatible with
non-redundant expert deployments.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Add LongCat-Flash support.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed
- vLLM version: v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: chuyuelin <923822139@qq.com>
Co-authored-by: chuyuelin <chuyuelin1@huawei.com>
### What this PR does / why we need it?
Unify the loading logic for expert_map and log2phy.
1. The map generated when enabling the redundancy expert is incorrect.
The community generation map function only accepts the number of global
experts. When we pass in the number of logical experts plus redundant
experts, the local expert ID of the last card will index to an expert ID
that does not exist. Now we ensure that the index points to a real
existing expert ID, and each expert can be accessed. Moreover, when
redundant experts are not enabled, the output of our function remains
consistent with the community's function.
2. The map we generate is based on the length of the physical expert,
but in reality, we only need to use the length of the logical expert.
Later on, we will need to pad it accordingly, so we can simply generate
a map with the length of the logical expert.
3. Unify the initialization logic across different scenarios and
simplify the code for fused_moe.
**Before refactoring**
- map path is not None:
expert map: get_rank_placement_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.
log2phy: get_rank_log2phy_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.
- map path is None:
expert map: determine_expert_map from _'vllm.layer'_. The function does
not support the redundant experts of vllm-ascend.
log2phy: determine_default_log2phy_map from _'eplb_utils.py'_. The
function does not support the redundant experts of vllm-ascend.
**Refactoring**
- eplb_utils.py
  - init_eplb_config
    - generate placement
    - generate expert map
    - generate log2phy
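As a rough sketch of the "generate expert map" step, the function below builds a map whose length is the number of logical experts. It is not the code in `eplb_utils.py`: `build_expert_map` is a hypothetical name, and the rank 15 placement (experts 240 to 255 plus a redundant copy of expert 0) is inferred from the "Improved map" output in this PR's test log rather than from the placement algorithm itself.

```python
def build_expert_map(num_logical_experts, local_logical_ids):
    """expert_map[logical_id] = local slot on this rank, or -1 if not hosted."""
    expert_map = [-1] * num_logical_experts
    for local_slot, logical_id in enumerate(local_logical_ids):
        # If a logical id appeared twice on one rank, the later slot would win.
        expert_map[logical_id] = local_slot
    return expert_map


# Rank 15 with ep size 16, 256 logical experts, 16 redundant experts:
# 16 regular experts plus one redundant copy (inferred placement).
rank15_placement = list(range(240, 256)) + [0]
rank15_map = build_expert_map(256, rank15_placement)
```

Because the map is indexed by logical expert ID, every entry points at an expert that actually exists, and no padding to the physical expert count is needed.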
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Expert Mapping Test Generation:
ep size: 16, num of experts: 256, num of redundant experts: 16
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16]
+++++++++++++++++++++++++++++++++++++++++
Improved map:
[16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
Expert Mapping Test Generation:
ep size: 16, num of experts: 256, num of redundant experts: 0
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
+++++++++++++++++++++++++++++++++++++++
Improved map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
dsr1 baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |
dsr1 eplb:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
### What this PR does / why we need it?
Remove unnecessary attributes from set_ascend_forward_context:
1. prefetch_stream
2. weight_prefetch_method
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Currently, `torch_npu.npu_grouped_matmul_swiglu_quant` only supports
weights in nz format, so we need to forcibly convert w13_weight and
w2_weight to nz.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
This PR adds model-side integration for the previously introduced
experimental AscendC fused operator DispatchGmmCombineDecode, used in
MoE decoding.
The operator implementation itself was added in a prior PR
[#4139](https://github.com/vllm-project/vllm-ascend/pull/4139).
This change only adapts the model execution path to optionally use the
fused operator.
When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=2 is set, the
original MC2 path composed of multiple operators (A8W8 dispatch → GMM →
SwiGLU → GMM → combine) might be replaced by the single fused operator
DispatchGmmCombineDecode.
By default, the existing multi-operator MC2 implementation is preserved.
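The opt-in switch can be pictured as a simple environment check. This is a sketch, not the model-side code: the helper names are hypothetical, treating an unset variable as `"0"` is an assumption, and the real path may apply further conditions before choosing the fused operator.

```python
import os


def use_fused_mc2_decode(environ=None):
    """True when VLLM_ASCEND_ENABLE_FUSED_MC2 is set to 2."""
    environ = os.environ if environ is None else environ
    # Assumption: an unset variable behaves like "0" (fused path disabled).
    return environ.get("VLLM_ASCEND_ENABLE_FUSED_MC2", "0") == "2"


def pick_moe_decode_path(environ=None):
    # Default: keep the existing multi-operator MC2 implementation.
    if use_fused_mc2_decode(environ):
        return "DispatchGmmCombineDecode"
    return "multi-operator MC2"
```

Passing a plain dict as `environ` makes the selection easy to exercise in a unit test without touching the process environment.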
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangqiankun <wangqiankun13@huawei.com>