2026-05-29 16:49:33 +08:00
# BI-V100 patch script for Qwen3.6-27B (Qwen3_5 architecture)
#
# Triton situation on BI-V100:
# - Standard Triton 2.3.1 is already present in the image.
# - HAS_TRITON = False (hardcoded in vendor vllm), but Triton is still used
#   for TP-mode cache management (custom_cache_manager / libentry).
# - The vendor's triton_utils/__init__.py, custom_cache_manager.py, and libentry.py
#   are already correct for standard Triton 2.3.1; do NOT overwrite them.
# - DO NOT install BI-V150 corex Triton 2.1.0 (pkgs/triton): it causes a
#   GPU hang on BI-V100 because the Triton CUDA PTX kernels are incompatible.
#
# Chunked prefill note:
# --enable-chunked-prefill is NOT supported by the vendor's vllm 0.6.3 for
# has_inner_state=True models on BI-V100. It causes "Engine loop has died"
# immediately on the first request. Do NOT use that flag.
# Long-context memory is instead handled by query-chunking inside
# _forward_prefix_pytorch (see paged_attn.py, _ATTN_Q_CHUNK=256).
#
# Recommended server start command:
#   python3 -m vllm.entrypoints.openai.api_server \
#     --model /workspace/models/Qwen3.6-27B --port 1111 \
#     --served-model-name llm --max-model-len 20000 \
#     --enforce-eager --trust-remote-code -tp 4 \
#     --gpu-memory-utilization 0.95
# (No --enable-chunked-prefill, no --max-num-batched-tokens)
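#
# Preflight sketch (an illustration, not one of the original patch steps):
# the notes above assume that standard Triton 2.3.1 is what the image's
# python3 imports. One possible way to confirm that before patching is shown
# below. `triton_version_ok` is a hypothetical helper name; the check only
# reports a mismatch and does not abort.

```shell
# Hypothetical helper: succeed only when the given version string is 2.3.1.
triton_version_ok() {
    [ "$1" = "2.3.1" ]
}

# Ask python3 for the installed Triton version; empty if the import fails.
installed="$(python3 -c 'import triton; print(triton.__version__)' 2>/dev/null || true)"

if triton_version_ok "$installed"; then
    echo "Triton $installed: OK"
else
    echo "Unexpected Triton version '${installed:-none}'; expected 2.3.1" >&2
fi
```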
# --- paged_attn.py: replace forward_prefix with pure-PyTorch fallback -------
# The Triton context_attention_fwd kernel hangs BI-V100 GPUs permanently
# (standard Triton 2.3.1 PTX is not supported by the corex runtime either).
# Our paged_attn.py bypasses it entirely via _forward_prefix_pytorch, which
# also implements query-chunking (_ATTN_Q_CHUNK=256) to keep peak attention
# memory at O(256 × kv_len) instead of O(q_len × kv_len).
cp ./paged_attn.py /usr/local/corex/lib64/python3/dist-packages/vllm/attention/ops/paged_attn.py
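#
# Worked arithmetic for the query-chunking note above (an illustration only).
# It estimates the size of a single head's attention-score matrix with and
# without chunking. Assumptions not stated in this script: 2 bytes per score
# element (fp16/bf16) and a prefill whose q_len and kv_len both equal the
# --max-model-len of 20000. `attn_score_bytes` is a hypothetical helper, and
# the real fallback may allocate additional temporaries.

```shell
# Hypothetical helper: bytes for one head's score matrix.
# Args: query rows, kv length, bytes per element.
attn_score_bytes() {
    echo $(( $1 * $2 * $3 ))
}

KV_LEN=20000     # taken from --max-model-len in the start command
ELEM_BYTES=2     # assumption: fp16/bf16 scores
Q_CHUNK=256      # _ATTN_Q_CHUNK

full=$(attn_score_bytes "$KV_LEN" "$KV_LEN" "$ELEM_BYTES")
chunked=$(attn_score_bytes "$Q_CHUNK" "$KV_LEN" "$ELEM_BYTES")

echo "unchunked scores per head: $full bytes"
echo "chunked scores per head:   $chunked bytes"
```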
# --- transformers: Qwen3_5 tokenizer / model files --------------------------
pip install transformers==4.55.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
cp -r ./qwen3_5 /usr/local/lib/python3.10/site-packages/transformers/models/
python3 ./patch_transformers_qwen3_5.py
# --- vllm model: Qwen3.6-27B (Qwen3_5 arch) --------------------------------
cp ./mamba_cache.py /usr/local/corex/lib/python3/dist-packages/vllm/model_executor/models/
cp ./qwen3_5.py /usr/local/corex/lib/python3/dist-packages/vllm/model_executor/models/qwen3_5.py
python3 ./patch_vllm_qwen3_5.py
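#
# Optional safety sketch (not part of the original steps): the cp commands in
# this script overwrite vendor files in place. One possible way to keep the
# pristine copy is shown below. `backup_then_copy` and the `.orig` suffix are
# hypothetical choices made for this illustration.

```shell
# Hypothetical helper: copy SRC over DST, saving the original DST as DST.orig.
backup_then_copy() {
    src=$1
    dst=$2
    # Back up only once, so re-running the script never replaces the
    # pristine vendor file with an already-patched one.
    if [ -e "$dst" ] && [ ! -e "$dst.orig" ]; then
        cp -p "$dst" "$dst.orig" || return 1
    fi
    cp "$src" "$dst"
}
```

# Example use, with the same source and destination as the cp above:
#   backup_then_copy ./qwen3_5.py /usr/local/corex/lib/python3/dist-packages/vllm/model_executor/models/qwen3_5.py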
# --- xformers: bypass cudnnFlashAttnForward (head_dim=256 > 128 limit) ------
# Injects _run_sdpa_fallback (pure matmul+softmax) into xformers.py.
# Required because head_dim=256 > 128 and ixformer flash attention either
# crashes (is_causal=True) or produces wrong output (attn_mask path).
# The fallback uses query_start_loc to derive actual query lengths, so it
# works correctly during profiling runs with chunked-prefill-style batches.
python3 ./patch_xformers_sdpa_seq.py
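#
# Illustration of the query-length derivation mentioned above (a sketch, not
# the code that patch_xformers_sdpa_seq.py injects). It assumes that
# query_start_loc holds cumulative start offsets, so each sequence's query
# length is the difference between neighbouring entries.
# `query_lens_from_start_loc` is a hypothetical helper name.

```shell
# Hypothetical helper: turn cumulative offsets into per-sequence lengths.
# Args: the offsets, e.g. 0 5 12 20 for three sequences.
query_lens_from_start_loc() {
    prev=$1
    shift
    lens=""
    for loc in "$@"; do
        # Append the gap between this offset and the previous one.
        lens="${lens:+$lens }$(( loc - prev ))"
        prev=$loc
    done
    echo "$lens"
}

# Three sequences packed into one batch of 20 query tokens.
query_lens_from_start_loc 0 5 12 20
```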