[BugFix] Fix bugs when using ascend quantization (#275)

### What this PR does / why we need it?
It fixes the following bugs:
1. When searching for a specific linear quantization implementation from a
tool (such as MindIE-Turbo), the mapping of packed linear layers is required
to identify the corresponding quant type.
2. The exception caught when importing MindIETurboQuantizer is narrowed down
to ImportError, so that other errors are properly raised instead of being
swallowed.
3. The API of AscendKVCacheMethod.apply is aligned with that in
AscendAttentionBackendImpl.
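Fixes 1 and 2 can be sketched as follows. This is an illustrative sketch only, not the actual vLLM-Ascend code: the mapping contents, the `resolve_quant_type` helper, and the `mindie_turbo` import path are all assumptions made for the example.

```python
# Hypothetical mapping from a packed linear layer to the per-shard
# names whose quant type must be looked up (fix 1).
PACKED_LINEAR_MAPPING = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def resolve_quant_type(prefix: str, quant_config: dict) -> str:
    """Resolve the quant type for a (possibly packed) linear layer.

    `quant_config` maps per-shard layer names to quant type strings.
    Illustrative helper; real code may also validate that all shards
    of a packed layer agree on the quant type.
    """
    for packed, shards in PACKED_LINEAR_MAPPING.items():
        if prefix.endswith(packed):
            # Look up the quant type via the first underlying shard.
            shard_prefix = prefix.replace(packed, shards[0])
            return quant_config.get(shard_prefix, "float")
    return quant_config.get(prefix, "float")

# Fix 2: catch only ImportError, so genuine bugs raised while the
# module initializes still propagate to the caller.
try:
    from mindie_turbo import MindIETurboQuantizer  # hypothetical path
except ImportError:
    MindIETurboQuantizer = None
```

A broad `except Exception` here would hide, for example, an AttributeError inside the quantizer module; restricting the guard to ImportError keeps the optional-dependency behavior while letting real errors surface.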

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By performing offline inference:

![image](https://github.com/user-attachments/assets/d63804cf-c060-451f-9cb0-d012e06b5333)

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Commit 7330416de3 (parent 5c7a95b01d), authored by Angazenn, committed by
GitHub on 2025-03-12 11:33:21 +08:00.
3 changed files with 53 additions and 20 deletions

```diff
@@ -744,10 +744,19 @@ class AscendAttentionBackendImpl(AttentionImpl):
             block_tables = attn_metadata.decode_metadata.block_tables if attn_metadata.decode_metadata else None
             # Details of kv_cache arrangement in attention quantization
             # are implemented by quant_method.
-            layer.quant_method.apply(layer, query, key, value, self.key_cache,
-                                     self.value_cache, self.scale,
-                                     self.seq_lens_tensor_cpu, block_tables,
-                                     isPrefill, attn_metadata, output)
+            layer.quant_method.apply(
+                layer,
+                query,
+                key,
+                value,
+                self.key_cache,
+                self.value_cache,
+                self.scale,
+                block_tables,
+                isPrefill,
+                attn_metadata,
+                output,
+                seq_lens_tensor_cpu=self.seq_lens_tensor_cpu)
         else:
             if self.key_cache is not None:
                 torch_npu._npu_reshape_and_cache(key=key,
```
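The aligned `apply` signature on the quant-method side can be sketched as below. This is a hedged stub, not the real implementation: the parameter order mirrors the call site in the diff, while the body and any dispatch logic are placeholders.

```python
class AscendKVCacheMethod:
    """Illustrative stub of the quant method used by the attention layer."""

    def apply(self, layer, query, key, value, key_cache, value_cache,
              scale, block_tables, is_prefill, attn_metadata, output,
              seq_lens_tensor_cpu=None):
        # Real code dispatches to NPU attention kernels here. The point of
        # the sketch is only that the positional parameters match what
        # AscendAttentionBackendImpl now passes, with seq_lens_tensor_cpu
        # accepted as a keyword argument.
        ...
```

Passing `seq_lens_tensor_cpu` by keyword, as the diff does, keeps the call robust if further positional parameters are inserted later.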