[Feature] Support moe multi-stream for aclgraph. (#2946)
This PR puts the calculation of shared experts into a separate stream,
overlapping it with the routed experts.
- vLLM version: v0.10.2
- vLLM main: fbd6523ac0
---------
Signed-off-by: whx-sjtu <2952154980@qq.com>
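For context, below is a minimal sketch of the overlap pattern this PR describes. It is not the PR's actual code: it assumes a CUDA-like secondary-stream API as exposed by torch_npu (`torch.npu.Stream`, `torch.npu.stream`, `Stream.wait_stream`), and `shared_expert` / `routed_experts` are placeholder callables.

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu device backend

def moe_forward(hidden_states, routed_experts, shared_expert, side_stream):
    """Run the shared expert on a side stream, overlapped with the routed experts."""
    main_stream = torch.npu.current_stream()
    # Make sure the side stream sees all prior writes to hidden_states.
    side_stream.wait_stream(main_stream)

    with torch.npu.stream(side_stream):
        # Shared-expert FFN runs concurrently with the routed-expert path below.
        shared_out = shared_expert(hidden_states)

    # Gating, dispatch, expert GEMMs and combine stay on the main stream.
    routed_out = routed_experts(hidden_states)

    # Join the two streams before mixing the partial results.
    main_stream.wait_stream(side_stream)
    return routed_out + shared_out
```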
@@ -195,8 +195,8 @@ msgid ""
 msgstr "是否将MLA的向量操作放到另一个流中。此选项仅对使用MLA的模型(例如,DeepSeek)有效。"

 #: ../../user_guide/configuration/additional_config.md
-msgid "`enable_multistream_moe`"
-msgstr "`enable_multistream_moe`"
+msgid "`multistream_overlap_shared_expert`"
+msgstr "`multistream_overlap_shared_expert`"

 #: ../../user_guide/configuration/additional_config.md
 msgid ""
@@ -35,6 +35,7 @@ The following table lists the additional configuration options available in vLLM
 | `enable_shared_expert_dp` | bool | `False` | Whether to run the shared expert in DP; this gives better performance but consumes more memory. Currently only DeepSeek series models are supported. |
 | `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
 | `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
+| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |

 The details of each config option are as follows:
@@ -45,7 +46,6 @@ The details of each config option are as follows:
 | `enabled` | bool | `False` | Whether to enable torchair graph mode. Currently only DeepSeek series models and PanguProMoE support torchair graph mode. |
 | `mode` | str | `None` | When using reduce-overhead mode for torchair, mode needs to be set. |
 | `enable_multistream_mla` | bool | `False` | Whether to put vector ops of MLA to another stream. This option only takes effect on models using MLA (e.g., DeepSeek). |
-| `enable_multistream_moe` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on DeepSeek MoE models. |
 | `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization. |
 | `enable_frozen_parameter` | bool | `True` | Whether to fix the memory address of weights during inference to reduce the input address refresh time during graph execution. |
 | `use_cached_graph` | bool | `False` | Whether to use cached graph. |
@@ -74,13 +74,13 @@ An example of additional configuration is as follows:
         "use_cached_graph": True,
         "graph_batch_sizes": [1, 2, 4, 8],
         "graph_batch_sizes_init": False,
-        "enable_multistream_moe": False,
         "enable_kv_nz": False
     },
     "ascend_scheduler_config": {
         "enabled": True,
         "enable_chunked_prefill": True,
     },
+    "multistream_overlap_shared_expert": True,
     "refresh": False,
 }
 ```
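As a hedged usage sketch (not part of this PR), the renamed knob can be passed through vLLM's `additional_config` engine argument, which is the mechanism the edited documentation describes; the model name below is only a placeholder for a MoE model with shared experts.

```python
from vllm import LLM

# Hypothetical example: any MoE model with shared experts would do.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    additional_config={
        # Replaces the old torchair_graph_config "enable_multistream_moe" knob,
        # per the diff above.
        "multistream_overlap_shared_expert": True,
    },
)

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```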