[Bugfix] Fix the acceptance rate drop when applying eagle3 to a QuaRot model (#6914)
### What this PR does / why we need it?
When the target model has undergone rotational quantization (QuaRot), the
acceptance rate drops because the fc weight of the draft model has not been
rotationally quantized (issue: #6445). We fix this by applying the same
rotation to the draft model's fc weight as is applied to the target model
when the draft model is loaded.
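The fix rests on a standard property of orthogonal rotations: for an orthogonal matrix Q, (W Q)(Qᵀ x) = W x, so a weight whose input dimension has been rotated stays consistent with activations rotated the same way. A minimal NumPy sketch of this invariance (the function name `rotate_fc_weight` is illustrative, not the actual vllm-ascend code):

```python
import numpy as np

def rotate_fc_weight(fc_weight: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Rotate the input dimension of a [out, in] linear weight.

    `rotation` is an orthogonal [in, in] matrix (e.g. Hadamard-based, as in
    QuaRot). After this, the layer produces the same output as before when
    fed activations that were rotated by `rotation.T`.
    """
    return fc_weight @ rotation

# Check the invariance: (W @ Q) @ (Q.T @ x) == W @ x for orthogonal Q.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
x = rng.normal(size=(8,))
assert np.allclose(rotate_fc_weight(W, Q) @ (Q.T @ x), W @ x)
```

This is why skipping the fc rotation on the draft side degrades acceptance: the drafter consumes rotated hidden states from the target model while its fc weight still expects unrotated inputs.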
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
@@ -305,3 +305,14 @@
 #     https://github.com/vllm-project/vllm/pull/34336
 #   Future Plan:
 #     Remove this patch when vLLM merges the PR.
+# ** 16. File: worker/patch_qwen3_quarot.py**
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+#   1. `vllm.model_executor.models.llama_eagle3.Eagle3LlamaForCausalLM.load_weights`
+#    Why:
+#      vllm-ascend reuses vLLM's loading logic for the drafter model,
+#      but vLLM's loader does not apply Ascend quantization.
+#    How:
+#      Dynamically replace the `load_weights` function at runtime,
+#      and fix `target_config` into the new implementation with a closure.
+#    Future Plan:
+#      Remove this patch when vLLM merges the PR.