[Perf][V1] Fully overlap model execution (#2783)
This PR is based on top of
[#23569](https://github.com/vllm-project/vllm/pull/23569) and
[#24219](https://github.com/vllm-project/vllm/pull/24219).
### What this PR does / why we need it?
This PR allows the model runner to run asynchronously when async scheduling is
enabled, fully overlapping CPU-side work (including `prepare_inputs`) with the
model forward pass. The change is functional but does not yet support
speculative decoding, pipeline parallelism (PP), or guided decoding.
The expected speedup is 5-10% over the current async scheduling.
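Conceptually, `execute_model` launches the forward pass and the device-to-host
copy of the sampled tokens, then returns a handle immediately instead of
blocking, so the CPU can start preparing the next step's inputs right away.
The sketch below only illustrates this pattern; it uses CUDA streams as a
stand-in for NPU streams, and `AsyncOutputHandle` is an illustrative name, not
the PR's actual code:
```python
import torch


class AsyncOutputHandle:
    """Defers the device-to-host copy of sampled tokens until requested."""

    def __init__(self, sampled_token_ids: torch.Tensor,
                 copy_stream: torch.cuda.Stream) -> None:
        # The copy stream must wait for the forward pass that produced the
        # tokens on the current (compute) stream.
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):
            # Truly async D2H copies need a pinned destination buffer.
            self._host_tokens = torch.empty(sampled_token_ids.shape,
                                            dtype=sampled_token_ids.dtype,
                                            device="cpu",
                                            pin_memory=True)
            self._host_tokens.copy_(sampled_token_ids, non_blocking=True)
            # Keep the allocator from reusing the source before the copy ends.
            sampled_token_ids.record_stream(copy_stream)
            self._done = torch.cuda.Event()
            self._done.record()

    def get_output(self) -> list[list[int]]:
        # Block only here, when the scheduler finally consumes step N's
        # tokens; by then, step N+1's prepare_inputs has already run.
        self._done.synchronize()
        return self._host_tokens.tolist()
```
The same idea backs vLLM's `AsyncModelRunnerOutput`, which this PR adopts on
the worker side (see the diff below).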
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
server
```
python -m vllm.entrypoints.openai.api_server --model Qwen3-32B \
--trust-remote-code --enforce-eager \
--distributed-executor-backend=mp \
-tp=4 \
--port 8006 \
--max-model-len 32000 \
--block-size 128 \
--gpu-memory-utilization 0.99
```
client
```
python $TEST_PY --backend vllm --trust-remote-code --model Qwen3-32B \
--dataset-name random --random-input-len 2048 --random-output-len 2048 \
--ignore-eos \
--num-prompts 48 --max-concurrency 48 --request-rate inf --temperature 0 \
--metric-percentiles 90 --base-url http://localhost:8006 --save-result \
--result-dir $PROFILER_DIR
```
Benchmark results on Qwen3-32B (TPOT, ms):
| | forward async | scheduler async | sync |
|-|-|-|-|
| avg | 41.73 | 41.86 | 44.20 |
| improvement vs. scheduler async | 0.3% | 0 | 0 |
| improvement vs. sync | 5.58% | 0 | 0 |
Benchmark results on Qwen2.5-VL-7B-Instruct (TPOT, ms):
| | forward async | sync |
|-|-|-|
| avg | 23.22 | 29.16 |
| improvement | 20.3% | 0 |
- vLLM version: main
- vLLM main: e93f4cc9e3
Signed-off-by: jiangpeng36 <jiangpeng36@huawei.com>
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: jiangpeng36 <jiangpeng36@huawei.com>
Co-authored-by: Ronald1995 <ronaldautomobile@163.com>
```diff
@@ -18,7 +18,7 @@
 #
 
 import copy
-from typing import Optional
+from typing import Optional, Union
 
 import torch
 import torch.nn as nn
@@ -38,8 +38,8 @@ from vllm.tasks import SupportedTask
 from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, GiB_bytes
 from vllm.v1.core.sched.output import SchedulerOutput
 from vllm.v1.kv_cache_interface import KVCacheConfig, KVCacheSpec
-from vllm.v1.outputs import (EMPTY_MODEL_RUNNER_OUTPUT, DraftTokenIds,
-                             ModelRunnerOutput)
+from vllm.v1.outputs import (EMPTY_MODEL_RUNNER_OUTPUT, AsyncModelRunnerOutput,
+                             DraftTokenIds, ModelRunnerOutput)
 from vllm.v1.worker.worker_base import WorkerBase
 
 from vllm_ascend.ascend_config import init_ascend_config
@@ -191,7 +191,7 @@ class NPUWorker(WorkerBase):
     def execute_model(
         self,
         scheduler_output: "SchedulerOutput",
-    ) -> Optional[ModelRunnerOutput]:
+    ) -> Optional[Union[ModelRunnerOutput, AsyncModelRunnerOutput]]:
         intermediate_tensors = None
         if not get_pp_group().is_first_rank:
             intermediate_tensors = IntermediateTensors(
@@ -220,7 +220,7 @@ class NPUWorker(WorkerBase):
             new_output.kv_connector_output = kv_connector_output
             return new_output
 
-        assert isinstance(output, ModelRunnerOutput)
+        assert isinstance(output, (ModelRunnerOutput, AsyncModelRunnerOutput))
         return output
 
     def load_model(self) -> None:
```
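For context on the widened `execute_model` return type shown above:
`AsyncModelRunnerOutput` from `vllm.v1.outputs` is an abstract handle whose
`get_output()` blocks until the concrete `ModelRunnerOutput` is ready, so a
caller can resolve it as late as possible. The helper below is a hedged
sketch of such a caller; the name `resolve_model_output` is illustrative and
not part of the PR or of vLLM:
```python
from typing import Optional, Union

from vllm.v1.outputs import AsyncModelRunnerOutput, ModelRunnerOutput


def resolve_model_output(
    output: Optional[Union[ModelRunnerOutput, AsyncModelRunnerOutput]],
) -> Optional[ModelRunnerOutput]:
    # Materialize the async handle only when the tokens are actually
    # needed, letting the next step's CPU work proceed in the meantime.
    if isinstance(output, AsyncModelRunnerOutput):
        return output.get_output()
    return output
```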