chunked prefill support and memory opts
@@ -9,20 +9,21 @@
# - DO NOT install BI-V150 corex Triton 2.1.0 (pkgs/triton): that causes
# GPU hang on BI-V100 because the Triton CUDA PTX kernels are incompatible.
#
# Chunked prefill note:
# --enable-chunked-prefill is NOT supported by the vendor's vllm 0.6.3 for
# has_inner_state=True models on BI-V100. It causes "Engine loop has died"
# immediately on first request. Do NOT use that flag.
# Long-context memory is instead handled by query-chunking inside
# _forward_prefix_pytorch (see paged_attn.py, _ATTN_Q_CHUNK=256).
# Important: Qwen3.6-27B must use TP=4 combined with PP=2 to deploy on 8 GPUs.
#
# Recommended server start command:
# python3 -m vllm.entrypoints.openai.api_server \
# --model /workspace/models/Qwen3.6-27B --port 1111 \
# --served-model-name llm --max-model-len 20000 \
# --enforce-eager --trust-remote-code -tp 4 \
# --gpu-memory-utilization 0.95
# (No --enable-chunked-prefill, no --max-num-batched-tokens)
# Recommended server start command for TP=4, 50K context, without chunked prefill:
# CUDA_VISIBLE_DEVICES="4,5,6,7" VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 python3 -m vllm.entrypoints.openai.api_server \
# --model /workspace/models/Qwen3.6-27B --port 1111 --served-model-name llm \
# --max-model-len 50000 --enforce-eager --trust-remote-code -tp 4 --gpu-memory-utilization 0.90 \
# --max-num-seqs 1 --disable-log-requests --disable-frontend-multiprocessing \
# --max-num-batched-tokens 50000
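The device list and the `-tp` value in the command above have to agree: vLLM starts one worker per tensor-parallel rank and pipeline stage, so TP × PP must fit on the visible GPUs. A minimal pre-launch check, as a sketch only; `visible_gpu_count` and `check_parallel_layout` are hypothetical helpers, not part of vLLM, and the 8-GPU device list is an assumed example:

```python
def visible_gpu_count(cuda_visible_devices: str) -> int:
    """Count the device IDs in a CUDA_VISIBLE_DEVICES-style string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_parallel_layout(cuda_visible_devices: str, tp: int, pp: int = 1) -> None:
    """Raise ValueError if tp * pp workers do not fit on the visible GPUs."""
    needed = tp * pp
    available = visible_gpu_count(cuda_visible_devices)
    if needed > available:
        raise ValueError(
            f"tp={tp} x pp={pp} needs {needed} GPUs, only {available} visible"
        )


# The 4-GPU layout from the command above.
check_parallel_layout("4,5,6,7", tp=4)
# An 8-GPU layout with TP=4 and PP=2 (device IDs are an assumption).
check_parallel_layout("0,1,2,3,4,5,6,7", tp=4, pp=2)
```

Running the check before launching the server turns a mismatch such as `-tp 8` on four visible devices into an immediate, readable error.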

# Recommended server start command for TP=4 with 100K context (requires chunked prefill):
# CUDA_VISIBLE_DEVICES="4,5,6,7" VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 python3 -m vllm.entrypoints.openai.api_server \
# --model /workspace/models/Qwen3.6-27B --port 1111 --served-model-name llm \
# --max-model-len 100000 --enforce-eager --trust-remote-code -tp 4 --gpu-memory-utilization 0.95 \
# --max-num-seqs 1 --disable-log-requests --disable-frontend-multiprocessing \
# --max-num-batched-tokens 4096 --enable-chunked-prefill
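With `--enable-chunked-prefill`, the scheduler processes at most `--max-num-batched-tokens` tokens per engine step, so a long prompt is prefilled over several steps. A rough sketch of the arithmetic for this configuration, assuming a single sequence (`--max-num-seqs 1`) so the whole token budget goes to one prompt; `prefill_steps` is a hypothetical helper, not a vLLM function:

```python
import math


def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Number of engine steps needed to prefill one prompt under a token budget."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


# 100K-token prompt with the 4096-token budget from the command above.
print(prefill_steps(100_000, 4096))    # 25
# 50K-token prompt with a 50000-token budget: a single prefill step.
print(prefill_steps(50_000, 50_000))   # 1
```

A smaller budget lowers the peak activation memory of each step at the cost of more steps per prompt.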

# --- paged_attn.py: replace forward_prefix with pure-PyTorch fallback -------
# The Triton context_attention_fwd kernel hangs BI-V100 GPUs permanently
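The pure-PyTorch fallback keeps long-context memory bounded by processing queries in chunks (`_ATTN_Q_CHUNK=256`, per the chunked prefill note in this file), so the attention score matrix is at most chunk × context instead of context × context. The sketch below illustrates only the slicing and the resulting score-matrix size. The helper names, the 2-byte element size, and the per-head accounting are assumptions, not code from paged_attn.py:

```python
from typing import Iterator, Tuple

ATTN_Q_CHUNK = 256  # value quoted in the note; the helpers below are hypothetical


def query_chunks(q_len: int, chunk: int = ATTN_Q_CHUNK) -> Iterator[Tuple[int, int]]:
    """Yield [start, end) query ranges of at most `chunk` rows."""
    for start in range(0, q_len, chunk):
        yield start, min(start + chunk, q_len)


def score_bytes(q_rows: int, kv_len: int, elem_bytes: int = 2) -> int:
    """Size of one head's q_rows x kv_len score matrix (assumes 2-byte elements)."""
    return q_rows * kv_len * elem_bytes


ctx = 50_000
# Unchunked prefill: every query row attends to the full context at once.
full = score_bytes(ctx, ctx)
# Chunked prefill: only one query chunk's scores are alive at a time.
peak = max(score_bytes(end - start, ctx) for start, end in query_chunks(ctx))
print(full, peak)  # 5000000000 25600000
```

Under these assumptions, a 50K-token prefill drops from about 5 GB to about 25.6 MB of score memory per head, at the cost of looping over 196 query chunks.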