[Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442)
### What this PR does / why we need it?
Refactor MLP weight prefetch to consistency with MoE Model's prefetching
in terms of code and usage.
Environments VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE is removed, usage as following:
--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
This commit is contained in:
@@ -171,9 +171,6 @@ export TASK_QUEUE_ENABLE=1
|
||||
# Enable the AIVector core to directly schedule ROCE communication
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
|
||||
# Enable MLP prefetch for better performance.
|
||||
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
|
||||
|
||||
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
|
||||
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
|
||||
|
||||
@@ -187,7 +184,7 @@ vllm serve /model/Qwen3-32B-W8A8 \
|
||||
--max-model-len 5500 \
|
||||
--max-num-batched-tokens 40960 \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
|
||||
--additional-config '{"pa_shape_list":[48,64,72,80]}' \
|
||||
--additional-config '{"pa_shape_list":[48,64,72,80], "weight_prefetch_config":{"enabled":true}}' \
|
||||
--port 8113 \
|
||||
--block-size 128 \
|
||||
--gpu-memory-utilization 0.9
|
||||
@@ -348,9 +345,7 @@ Weight prefetching optimizes memory usage by preloading weights into the cache b
|
||||
|
||||
In dense model scenarios, the MLP's gate_up_proj and down_proj linear layers often exhibit relatively high MTE utilization. To address this, we create a separate pipeline specifically for weight prefetching, which runs in parallel with the original vector computation pipeline, such as RMSNorm and SiLU, before the MLP. This approach allows the weights to be preloaded to L2 cache ahead of time, reducing MTE utilization during the MLP computations and indirectly improving Cube computation efficiency by minimizing resource contention and optimizing data flow.
|
||||
|
||||
It is important to emphasize that, since we use vector computations to hide the weight prefetching pipeline, the setting of the prefetch buffer size is crucial. If the buffer size is too small, the optimization benefits will not be fully realized, while a larger buffer size may lead to resource contention, resulting in performance degradation. To accommodate different scenarios, we have exposed two environment variables `VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE` and `VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE` to allow for flexible buffer size configuration based on the specific workload.
|
||||
|
||||
This optimization requires setting the environment variable `VLLM_ASCEND_ENABLE_PREFETCH_MLP = 1` to be enabled.
|
||||
Previously, the environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP used to enable MLP weight prefetch and VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE and VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE used to set the weight prefetch size for MLP gate_up_proj and down_proj were deprecated. Please use the following configuration instead: "weight_prefetch_config": { "enabled": true, "prefetch_ratio": { "mlp": { "gate_up": 1.0, "down": 1.0}}}. See User Guide->Feature Guide->Weight Prefetch Guide for details.
|
||||
|
||||
### 6. Zerolike Elimination
|
||||
|
||||
|
||||
Reference in New Issue
Block a user