xc-llm-ascend

Files

sunbaosong d6bfae8eee support 32K model len on deepseek r1 W8A8 (#728)

### What this PR does / why we need it?

Optimize NPU memory usage. See
https://github.com/vllm-project/vllm-ascend/issues/723

With vLLM v0.8.4.rc2, DeepSeek R1 can only run with a model length of up
to 16K. Running with a model length of 32K fails with an out-of-memory
(OOM) error.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: sunbaosong <13793883820@163.com>

2025-05-06 10:12:07 +08:00

__init__.py

[Core] Cherry pick from 0.7.1 to keep the main code newest (#127)

2025-02-21 17:07:37 +08:00

func_wrapper.py

[quantization] Support w8a8 quantization (#580)

2025-04-20 18:14:05 +08:00

quant_config.py

[Feature] Add quant description file for new quant model generated by modelslim (#719)

2025-04-30 16:51:56 +08:00

quantizer.py

[quantization] Support w8a8 quantization (#580)