Remove chunked_prefill_for_mla and fix ring_mla bug (#2781)
### What this PR does / why we need it?
Remove the chunked-prefill-for-MLA branch from the MLA implementation, and
change the dtype of `prefill_mask` to avoid an accuracy problem.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main: ef7eefe17a
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
@@ -148,10 +148,6 @@ msgid ""
 " to be passed in."
 msgstr "在为MOE模型使用专家负载均衡时,需要传入专家映射路径。"
 
-#: ../../user_guide/configuration/additional_config.md
-msgid "`chunked_prefill_for_mla`"
-msgstr "`chunked_prefill_for_mla`"
-
 #: ../../user_guide/configuration/additional_config.md
 msgid "`False`"
 msgstr "`False`"
@@ -30,7 +30,6 @@ The following table lists the additional configuration options available in vLLM
 | `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
 | `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf or ut/e2e test case. |
 | `expert_map_path` | str | `None` | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |
-| `chunked_prefill_for_mla` | bool | `False` | Whether to enable the fused operator-like chunked_prefill. |
 | `enable_prefetch` | bool | `False` | Whether to enable weight prefetch. |
 | `kv_cache_dtype` | str | `None` | When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
 | `enable_shared_expert_dp` | bool | `False` | When the shared expert in DP, it has better performance but consumes more memory. Currently only DeepSeek series models are supported to use. |
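
Because this commit drops `chunked_prefill_for_mla` from the documented options, existing configs that still set it may need cleaning up. The following is a minimal sketch only, assuming the additional config is held as a plain Python dict. The helper name `strip_removed_options` and the `REMOVED_OPTIONS` set are hypothetical and not part of vLLM Ascend; how the library itself treats an unknown key is not stated in this commit.

```python
from typing import Any

# Option removed from the documented table by this commit.
# (Hypothetical constant, used only for this illustration.)
REMOVED_OPTIONS = frozenset({"chunked_prefill_for_mla"})


def strip_removed_options(additional_config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the config without options that are no longer documented."""
    return {
        key: value
        for key, value in additional_config.items()
        if key not in REMOVED_OPTIONS
    }


# Example config that still carries the removed option.
old_config = {
    "chunked_prefill_for_mla": True,
    "enable_prefetch": False,
    "kv_cache_dtype": "int8",
}
new_config = strip_removed_options(old_config)
```

The cleaned dict could then be supplied wherever the additional config is normally passed. The original dict is left unmodified so the caller can log or compare what was dropped.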