[v0.18.0][Bugfix] Fix Error "AttributeError: 'AscendCompressedTensorsConfig' object has no attribute 'enabling_fa_quant'" (#7748)
### What this PR does / why we need it?

Cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736

**Error information**

When quantized weights in the CompressedTensors format of the kimi-k2 model are used, the following error is reported:

`AttributeError: 'AscendCompressedTensorsConfig' object has no attribute 'enabling_fa_quant'`

**Error cause**

FA3 quantization currently supports only weights produced by ModelSlim quantization, so the FA3-related methods are not defined on `AscendCompressedTensorsConfig`.

**Solution**

Check whether the FA3 feature is enabled before invoking the related methods. Additionally, the unused `get_scaled_act_names` method and its corresponding unit test have been removed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests were updated by removing a deprecated test case, and the refactored logic was reviewed for correctness.

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
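The fix can be illustrated with a minimal, self-contained sketch. The class bodies and the `enable_fa_quant` helper below are illustrative assumptions for this reproduction; only the names mirror the PR, and the real vllm-ascend code (where the helper lives in `vllm_ascend.quantization.utils` and takes the vllm config) differs.

```python
# Minimal sketch of the failure mode and the guard that fixes it.

class ModelSlimStyleConfig:
    """Hypothetical ModelSlim-path config: defines the FA3 attribute."""
    enable_fa_quant = True

    def get_kv_quant_split_factor(self, layer_name):
        return 2, 2  # illustrative split factors


class AscendCompressedTensorsConfig:
    """CompressedTensors config: the FA3 attribute and methods are
    absent, so calling them unguarded raises AttributeError."""


def enable_fa_quant(quant_config):
    """Guard applied before invoking FA3-only methods: treat the
    feature as enabled only when the config defines the attribute."""
    return getattr(quant_config, "enable_fa_quant", False)


# The old `quant_config is not None` check passed for both configs and
# then crashed on the CompressedTensors path; the guard filters it out.
for cfg in (ModelSlimStyleConfig(), AscendCompressedTensorsConfig()):
    if enable_fa_quant(cfg):
        print(cfg.get_kv_quant_split_factor("layers.0.attn"))
```

With the guard, the CompressedTensors config silently takes the unquantized path instead of raising.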
```diff
@@ -109,6 +109,7 @@ from vllm_ascend.eplb.utils import model_register
 from vllm_ascend.ops.rotary_embedding import set_cos_and_sin, update_cos_sin
 from vllm_ascend.patch.worker.patch_draft_quarot import patch_load_weights
 from vllm_ascend.patch.worker.patch_module import patch_torch_npu_argsort
+from vllm_ascend.quantization.utils import enable_fa_quant
 from vllm_ascend.sample.sampler import AscendSampler
 from vllm_ascend.spec_decode import get_spec_decode_method
 from vllm_ascend.spec_decode.draft_proposer import AscendDraftModelProposer
```
```diff
@@ -2763,7 +2764,7 @@ class NPUModelRunner(GPUModelRunner):
             k_dim,
             v_dim,
         ]
-        if self.is_kv_consumer and self.vllm_config.quant_config is not None:
+        if self.is_kv_consumer and enable_fa_quant(self.vllm_config):
             k_tensor_split_factor, v_tensor_split_factor = (
                 self.vllm_config.quant_config.get_kv_quant_split_factor(layer_name, kv_head_dim_list)
             )
```
```diff
@@ -2950,7 +2951,7 @@ class NPUModelRunner(GPUModelRunner):
             v_dim,
         )
         k_cache_dtype = v_cache_dtype = current_kv_cache_spec.dtype
-        if self.is_kv_consumer and self.vllm_config.quant_config is not None:
+        if self.is_kv_consumer and enable_fa_quant(self.vllm_config):
             k_cache_dtype, v_cache_dtype = self.vllm_config.quant_config.get_kv_quant_dtype(
                 layer_name, current_kv_cache_spec.dtype, self.model_config
             )
```
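The second hunk follows a default-then-override pattern for choosing KV-cache dtypes. A hedged sketch of that pattern, with a hypothetical `FakeModelSlimConfig` stand-in and deliberately simplified signatures (the real `get_kv_quant_dtype` also takes the layer name and model config):

```python
class FakeModelSlimConfig:
    """Hypothetical stand-in for a ModelSlim quant config with FA3."""
    enable_fa_quant = True

    def get_kv_quant_dtype(self, spec_dtype):
        # Illustrative choice: FA3 quantization stores the KV cache as int8.
        return "int8", "int8"


def select_kv_cache_dtypes(spec_dtype, quant_config, is_kv_consumer):
    # Default: both caches use the KV-cache-spec dtype.
    k_dtype = v_dtype = spec_dtype
    # Override only when FA3 quantization is actually enabled. The old
    # `quant_config is not None` check also admitted configs (such as
    # AscendCompressedTensorsConfig) that lack the FA3 methods, which
    # is exactly the AttributeError this PR fixes.
    if is_kv_consumer and getattr(quant_config, "enable_fa_quant", False):
        k_dtype, v_dtype = quant_config.get_kv_quant_dtype(spec_dtype)
    return k_dtype, v_dtype


print(select_kv_cache_dtypes("bfloat16", None, True))
print(select_kv_cache_dtypes("bfloat16", FakeModelSlimConfig(), True))
```

Keeping the fallback assignment before the guarded override means every non-FA3 path, including CompressedTensors weights, retains the spec dtype without touching quant-config methods.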