xc-llm-ascend

Files

Feng-xiaosuo c316679e65 adapt to minimax_m2 (#5624 )

### What this PR does / why we need it?
This PR fixes Minimax model loading in vLLM Ascend backend by:

Adding model type check for "minimax" and "minimax_m2" to replace "mlp"
prefix with "block_sparse_moe"
Implementing special handling for Minimax expert layer naming
conventions
Adding Minimax configuration to packed_modules_model_mapping for proper
qkv_proj and experts module handling
Without these changes, Minimax models fail to load on Ascend devices due
to incompatible layer naming and module packing.

### Does this PR introduce _any_ user-facing change?
Yes. Users can now successfully load and run Minimax models on Ascend
hardware with vLLM. This enables inference capabilities for this model
family on Ascend devices.

### How was this patch tested?
Local Testing:
Verified model loading for minimax-xxx and minimax_m2-xxx model variants
on Atlas 800I A2 hardware
Tested inference with sample prompts using vLLM's OpenAI-compatible API
server

Benchmark Validation:
Compared throughput and latency metrics against GPU baseline
Verified memory usage stays within expected limits for different batch
sizes
Tested multi-card inference scenarios with tensor parallelism

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

---------

Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>

2026-01-10 23:01:35 +08:00

compressed_tensors

[Feat] Support native Kimi-K2-Thinking native W4A16 quantized experts weights (#4516 )

2025-12-10 15:58:52 +08:00

__init__.py

[Core] Cherry pick from 0.7.1 to keep the main code newest (#127 )

2025-02-21 17:07:37 +08:00

quant_config.py

adapt to minimax_m2 (#5624 )

2026-01-10 23:01:35 +08:00

utils.py

support mxfp8 quantization (qwen dense) (#5723 )