5 Commits

Author SHA1 Message Date
zhangxinyuehfad
0d094531b4 [bugfix] Fixed the bug in retrieving the quantization method for mlp.… (#4797)
When retrieving the quantization method for MOE (e.g., the quantization
file of DeepSeek v3.2 exp do not match the model's naming convention in
eager mode), a KeyError is raised: "model.layers.3.mlp.experts.weight
not in self.quant_description". However the quantization file is like :
```bash
  "model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.gate_proj.weight_scale": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.gate_proj.weight_offset": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.down_proj.weight": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.down_proj.weight_scale": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.down_proj.weight_offset": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.up_proj.weight_scale": "W8A8_DYNAMIC",
  "model.layers.3.mlp.experts.255.up_proj.weight_offset": "W8A8_DYNAMIC",
```

Co-Authored-By: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
2025-12-09 08:47:19 +08:00
Slightwind
4f6d60eb06 [Feature] Add W4A4 Flat Quantization support (#3427)
Introduce W4A4 Flat Quantization for better model compression and
inference efficiency on Ascend devices.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-10-13 23:20:16 +08:00
22dimensions
f5a97e8fa5 [Quantization] register AscendQuantRMSNorm for quantization (#2856)
### What this PR does / why we need it?

modelslim will generate self.bias for rms norm in quantization, since
RMSNorm in vllm has no this parameter, so its nesscesary
to create a AscendQuantRmsNorm.
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

tested by deepseek-v3.1-w8a8

<img width="2496" height="592" alt="image"
src="https://github.com/user-attachments/assets/004c6e76-3d7a-4a1f-b59f-a14304012663"
/>


- vLLM version: main
- vLLM main:
d6249d0699

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-11 23:14:02 +08:00
22dimensions
d51694a77b [2/N][Refactor][Quantization] clean quantization patch (#2785)
### What this PR does / why we need it?
quantization patch is unused code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
tested by CI

- vLLM version: v0.10.1.1
- vLLM main:
f4962a6d55

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-08 17:31:53 +08:00
22dimensions
37f5a29cd4 [1/N][Refactor][Quantization] remove redundant quantizer class (#2680)
### What this PR does / why we need it?

AscendQuantizer/LLMQuantizer class is used to select quant method based
on quant config and some other arguments,
but it is more simple and clean replacing these classes with map. So i
remove them.

### Does this PR introduce _any_ user-facing change?
No 

### How was this patch tested?

ut and e2e test


- vLLM version: v0.10.1.1
- vLLM main:
6997a25ac6

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-04 11:35:14 +08:00