[main][quantization] Support deepseek w4a8 per-channel quantization (#3011)

### What this PR does / why we need it?
1. Support DeepSeek w4a8 per-channel quantization.
2. Eager mode now supports converting weights to the NZ format; a rough sketch of what this conversion means is shown below.
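
The NZ (FRACTAL_NZ) layout is an Ascend-specific storage format that the NPU matmul kernels consume directly. The Python sketch below only illustrates what such a cast looks like through the public `torch_npu` API; it is not code from this patch (the actual conversion happens inside vllm-ascend's quantization layers), and the weight shape is made up.

```python
# Illustrative only -- not code from this PR. Shows what "converting a weight
# to the NZ format" means on Ascend NPUs via the public torch_npu API.
import torch
import torch_npu  # Ascend adapter for PyTorch

ACL_FORMAT_FRACTAL_NZ = 29  # CANN enum value for the FRACTAL_NZ layout

# Hypothetical weight, for demonstration only.
weight = torch.randn(7168, 2048, dtype=torch.float16).npu()

# Cast the tensor's internal storage layout from ND to FRACTAL_NZ.
weight_nz = torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
print(torch_npu.get_npu_format(weight_nz))  # prints 29 if the cast succeeded
```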
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim

##### Installation steps

```bash
git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh
```

##### Generate w4a8 per-channel weights

```bash
cd /example/DeepSeek
```
Command reference: `msmodelslim/example/DeepSeek/README.md`
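
Once the w4a8 per-channel weights have been generated, they can be served with vllm-ascend. The snippet below is a minimal offline-inference sketch and is not part of this patch; it assumes the msmodelslim output directory from the step above and vllm-ascend's `ascend` quantization method, and the model path and parallelism settings are placeholders to adjust for your environment.

```python
# Minimal offline-inference sketch (not part of this PR). Paths and
# parallelism are placeholders; "ascend" selects vllm-ascend's quantization
# method for modelslim-generated weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/DeepSeek-w4a8",  # placeholder: msmodelslim output dir
    quantization="ascend",           # load the Ascend-quantized checkpoint
    tensor_parallel_size=16,         # placeholder: match your NPU topology
    max_model_len=4096,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```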

- vLLM version: v0.10.2
- vLLM main: f225ea7dd9

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Author: Wang Kunpeng
Date: 2025-09-27 21:01:16 +08:00 (committed by GitHub)
Parent: e9359bd8fa
Commit: 859e861d92
6 changed files with 299 additions and 196 deletions


@@ -108,18 +108,19 @@ Please convert DeepSeek series models using `br_release_MindStudio_8.1.RC2_TR5_2
### 3. When converting deepseek series models with modelslim, what should you pay attention to?
-When using the weight generated by modelslim with the `--dynamic` parameter, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
+When the MLA portion of the weights uses `W8A8_DYNAMIC` quantization, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
The operation steps are as follows:
1. Search in the CANN package directory used, for example:
find /usr/local/Ascend/ -name fusion_config.json
2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find, the location is as follows:
2. Add `"AddRmsNormDynamicQuantFusionPass":"off",` and `"MultiAddRmsNormDynamicQuantFusionPass":"off",` to the fusion_config.json you find, the location is as follows:
```bash
{
"Switch":{
"GraphFusion":{
"AddRmsNormDynamicQuantFusionPass":"off",
"MultiAddRmsNormDynamicQuantFusionPass":"off",
```