In PD-separated deployment scenarios:

* MoE layers use dynamic quantization exclusively.
* For the Attention module, Prefill (P) nodes use **dynamic** quantization, while Decode (D) nodes use **static** quantization.

In PD-mixed deployment scenarios:

* **All components fall back to dynamic quantization**, as it is difficult to distinguish between Prefill and Decode tokens.

___

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: Slightwind <slightwindsec@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
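For illustration, the per-role policy above can be sketched as a small dispatch helper. This is a minimal sketch with hypothetical names (`Role`, `attention_quant_mode`, `moe_quant_mode`), not the actual vLLM implementation:

```python
from enum import Enum


class Role(Enum):
    """Node role in the deployment (hypothetical enum for this sketch)."""
    PREFILL = "prefill"  # P node in a PD-separated deployment
    DECODE = "decode"    # D node in a PD-separated deployment
    MIXED = "mixed"      # PD-mixed deployment: P and D tokens share a batch


def attention_quant_mode(role: Role) -> str:
    """Quantization scheme for the Attention module, per the policy above.

    - PD-separated: Prefill nodes use dynamic, Decode nodes use static.
    - PD-mixed: fall back to dynamic, since Prefill and Decode tokens
      are hard to tell apart within a mixed batch.
    """
    if role is Role.DECODE:
        return "static"
    return "dynamic"


def moe_quant_mode(role: Role) -> str:
    """MoE layers use dynamic quantization in every deployment mode."""
    return "dynamic"
```

The Decode-only static path works because decode batches have a stable activation distribution, so precomputed scales are usable there; everywhere else, scales are computed on the fly.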