forked from EngineX-Cambricon/enginex-mlu370-vllm
add qwen3
# Background

This example demonstrates the chunked parallel pipeline feature in vLLM. The mlu_hijck mechanism hijacks the code that needs to be modified into the current directory, so the main repository code stays untouched.

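The hijack approach can be pictured as plain Python monkey-patching: replace an attribute on an upstream module at runtime instead of editing its source. This is a hypothetical sketch (`upstream_scheduler`, `schedule`, and `hijack` are illustrative names; the actual mlu_hijck mechanism in this repository may work differently):

```python
import types

# Stand-in for a module in the main repository that we must not edit directly.
upstream = types.ModuleType("upstream_scheduler")
upstream.schedule = lambda batch: f"serial:{batch}"

def hijack(module, name, replacement):
    """Replace an attribute on a module at runtime, returning the original."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    return original

# Swap in a chunked-pipeline-aware implementation without touching upstream code.
original_schedule = hijack(
    upstream, "schedule", lambda batch: f"chunked:{batch}"
)

print(upstream.schedule("b0"))   # patched behavior is now in effect
print(original_schedule("b0"))   # the original remains callable if needed
```

Keeping the original callable around lets the hijacked version fall back to upstream behavior when the feature is disabled.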
# Supported Models

- LlamaForCausalLM
- CustomForCausalLM

# Running the Demo

Currently, Chunked Parallel Pipeline only supports running in paged mode via AsyncLLMEngine.

- Set the environment variable

```bash
export CHUNKED_PIPELINE_PARALLEL_EN=true
```
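On the Python side, such a switch is typically read from the environment and interpreted as a boolean. A minimal sketch, assuming a truthy-string convention (`env_flag` is a hypothetical helper; the real vLLM code may parse the variable differently):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean on/off switch."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "on", "yes")

os.environ["CHUNKED_PIPELINE_PARALLEL_EN"] = "true"
print(env_flag("CHUNKED_PIPELINE_PARALLEL_EN"))  # True
```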

- Launch the server process

```bash
# Set the engine iteration timeout threshold.
export VLLM_ENGINE_ITERATION_TIMEOUT_S=180
python -m vllm.entrypoints.openai.api_server \
    --port ${PORT} \
    --model ${MODEL_PATH} \
    --swap-space 16 \
    --pipeline-parallel-size ${PP_SIZE} \
    --max-num-batched-tokens ${MAX_TOKENS_NUM} \
    --enable-chunked-prefill \
    --worker-use-ray \
    --enforce-eager
```
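The effect of `--enable-chunked-prefill` with `--max-num-batched-tokens` can be pictured as splitting a long prompt's tokens into fixed-size chunks that are prefilled across successive pipeline iterations. This is a simplified sketch of the idea, not the actual vLLM scheduler:

```python
def chunk_prompt(token_ids, max_num_batched_tokens):
    """Split a prompt into prefill chunks no larger than the token budget."""
    return [
        token_ids[i:i + max_num_batched_tokens]
        for i in range(0, len(token_ids), max_num_batched_tokens)
    ]

prompt = list(range(10))          # a 10-token prompt
chunks = chunk_prompt(prompt, 4)  # token budget of 4 per iteration
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because each iteration consumes at most the token budget, decode requests from other sequences can be batched alongside a partial prefill, which is what keeps the pipeline stages busy.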

- Launch the client process

This example uses random prompts; a real dataset can be used instead.

```bash
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model ${MODEL_PATH} \
    --dataset-name random \
    --num-prompts ${NUM_PROMPT} \
    --port ${PORT} \
    --random-input-len ${INPUT_LEN} \
    --random-output-len 1 \
    --request-rate inf
```