[Feature] Supports DSv3.1 PD separation and C8 quantization (#7222)

Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>

### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled, and DeepSeek V3.1 W8A8C8
supports only the PD-separation (disaggregated prefill/decode) scenario.
C8 refers to quantizing the KV cache to int8, which reduces the KV
cache's device memory usage and improves inference throughput.
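To make the memory effect concrete, here is a minimal, illustrative sketch of static int8 KV-cache quantization. This is not this PR's kernel path: the function names and the per-tensor scale are assumptions; the real flow uses calibrated scales produced by ModelSlim and Ascend kernels. The point is that the scale is fixed at calibration time ("static"), and int8 storage halves the KV-cache footprint relative to fp16/bf16.

```python
import torch

# Hedged sketch only: quant_kv_static / dequant_kv are hypothetical names,
# and a single per-tensor scale is an assumption for illustration.
def quant_kv_static(kv: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # "Static": scale is precomputed during calibration, not per request.
    return torch.clamp(torch.round(kv / scale), -128, 127).to(torch.int8)

def dequant_kv(kv_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original fp16 values on read.
    return kv_int8.to(torch.float16) * scale
```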
Constraints:
1. The model can be run only in PD-separation mode, with
MooncakeLayerwiseConnector as the KV connector (see the hedged
deployment sketch after the quantization commands below).
2. Currently, only activations support dynamic quantization, while the
KV cache uses static quantization. C8 quantization combined with MTP is
not supported. You can use ModelSlim for quantization; the procedure is
as follows:
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> \
    --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json \
    --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json \
    --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6
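After quantization, a PD-separated deployment pairs a prefill (KV-producer) instance with a decode (KV-consumer) instance. The sketch below is a hedged illustration built on vLLM's generic `--kv-transfer-config` flag; only the connector name MooncakeLayerwiseConnector comes from this PR, and the role values and overall launch shape are assumptions that may differ from the actual setup.

```bash
# Prefill node (produces KV blocks) -- illustrative only
vllm serve <path/quant_weight> \
  --kv-transfer-config '{"kv_connector": "MooncakeLayerwiseConnector", "kv_role": "kv_producer"}'

# Decode node (consumes KV blocks) -- illustrative only
vllm serve <path/quant_weight> \
  --kv-transfer-config '{"kv_connector": "MooncakeLayerwiseConnector", "kv_role": "kv_consumer"}'
```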

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>

@@ -1211,3 +1211,36 @@ def get_rope_dim(vllm_config):
        rope_dim = int(model_config.hf_text_config.rotary_dim)
    return rope_dim


def calc_split_factor(num_list: list[int]):
    # For each entry, the factor is total / num,
    # e.g. calc_split_factor([1, 1, 2]) == [4.0, 4.0, 2.0].
    total = sum(num_list)
    split_factor_list = []
    for num in num_list:
        split_factor_list.append(total / num)
    return split_factor_list


# NOTE: The last two dimensions of ND are converted to NZ.
def trans_nd_to_nz(cache_tensor: torch.Tensor):
    assert len(cache_tensor.shape) >= 2
    batch = cache_tensor.shape[:-2]
    a, b = cache_tensor.shape[-2], cache_tensor.shape[-1]
    dtype = cache_tensor.dtype
    # Fractal sizes: int8 packs 32 elements along the last axis, 16 otherwise.
    if dtype == torch.int8:
        a0, b0 = 16, 32
    else:
        a0, b0 = 16, 16
    nz_shape = list(batch) + [math.ceil(b / b0), math.ceil(a / a0), a0, b0]
    # Generate the axis order for the transpose operation.
    offset = len(cache_tensor.shape) - 2
    base = [2, 0, 1, 3]
    array_trans = [i for i in range(offset)] + [i + offset for i in base]
    # Perform shape transformation and transpose operation.
    # NOTE: the reshape requires a and b to be exact multiples of a0 and b0.
    *_, n1, m1, m0, n0 = nz_shape
    cache_tensor = cache_tensor.reshape(nz_shape[:-4] + [m1, m0, n1, n0])
    cache_tensor = cache_tensor.permute(*array_trans)
    return cache_tensor
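For reference, a quick shape check of `trans_nd_to_nz` under the layout rules above. The input shape is hypothetical; as noted in the code, the last two dimensions must be exact multiples of the fractal sizes (16 and 32 for int8), since `math.ceil` does not pad and the reshape would otherwise fail.

```python
import torch

# Assumes trans_nd_to_nz from the diff above is in scope.
x = torch.zeros(2, 32, 64, dtype=torch.int8)  # a=32 (mult. of 16), b=64 (mult. of 32)
y = trans_nd_to_nz(x)
# nz_shape is [2, n1=2, m1=2, m0=16, n0=32]; permute yields (batch, n1, m1, m0, n0).
print(y.shape)  # torch.Size([2, 2, 2, 16, 32])
```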