diff --git a/docs/source/developer_guide/contribution/index.md b/docs/source/developer_guide/contribution/index.md index 4f7d497..467a104 100644 --- a/docs/source/developer_guide/contribution/index.md +++ b/docs/source/developer_guide/contribution/index.md @@ -1,6 +1,7 @@ # Contributing ## Building and Testing + It's recommended to set up a local development environment to build vllm-kunlun and run tests before you submit a PR. @@ -21,10 +22,18 @@ python -m vllm.entrypoints.openai.api_server \ --no-enable-chunked-prefill \ --distributed-executor-backend mp \ --served-model-name your_modified_models \ - --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", - "vllm.unified_attention", "vllm.unified_attention_with_output", - "vllm.mamba_mixer2"]}' \ + --compilation-config '{"splitting_ops": ["vllm.unified_attention", + "vllm.unified_attention_with_output", + "vllm.unified_attention_with_output_kunlun", + "vllm.mamba_mixer2", + "vllm.mamba_mixer", + "vllm.short_conv", + "vllm.linear_attention", + "vllm.plamo2_mamba_mixer", + "vllm.gdn_attention", + "vllm.sparse_attn_indexer"]}' \ ``` + Please save a screenshot of your service running successfully, and attach an accuracy report. ### Submit the commit @@ -36,7 +45,6 @@ git commit -sm "your commit info" 🎉 Congratulations! You have completed the development environment setup. - ## PR Title and Classification Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following: @@ -46,7 +54,7 @@ Only specific types of PRs will be reviewed. The PR title is prefixed appropriat - `[ModelRunner]` for new features or optimization in model runner. - `[Platform]` for new features or optimization in platform. - `[Worker]` for new features or optimization in worker. 
-- `[Core]` for new features or optimization in the core vllm-kunlun logic (such as platform, attention, communicators, model runner) 
+- `[Core]` for new features or optimization in the core vllm-kunlun logic (such as platform, attention, communicators, model runner)
 - `[Kernel]` for changes affecting compute kernels and ops.
 - `[Bugfix]` for bug fixes.
 - `[Doc]` for documentation fixes and improvements.
@@ -60,7 +68,7 @@ If the PR spans more than one category, please include all relevant prefixes.

 ## Others

-If you find any problem when contributing, you can join our slack group to talk with us and then feel free to submit a PR to improve the doc to help other developers. 
+If you run into any problems while contributing, you can join our Slack group to talk with us, and feel free to submit a PR to improve the docs for other developers.

 :::{toctree}
 :caption: Index
diff --git a/docs/source/developer_guide/performance/performance_benchmark/profiling.md b/docs/source/developer_guide/performance/performance_benchmark/profiling.md
index 1cb758a..8ce84e8 100644
--- a/docs/source/developer_guide/performance/performance_benchmark/profiling.md
+++ b/docs/source/developer_guide/performance/performance_benchmark/profiling.md
@@ -1,18 +1,23 @@
 # Profiling
-
-
 ## 🔧 Action Plan (Three Phases)
+
 ### Phase 1️⃣: Multi-Device Log Redirection Configuration
+
 #### Background
+
 By default, kernel logs from all 8 XPU devices are interleaved and emitted to `stdout`, resulting in:
+
 - It becomes impossible to distinguish which log originates from which device.
 - Timestamps become interleaved, making it difficult to analyze the temporal relationships.
 - Single-device bottlenecks are masked by global aggregation.

 #### Solution
+
 During model initialization, create separate log files for each device.
+#### Code Explanation (embedded in qwen2.py)
+
 ```python
 import os  # ← Ensure this is imported at the top of the file
 from vllm.distributed import get_tensor_model_parallel_rank  # ← Import function to get the tensor model parallel rank
@@ -30,38 +35,42 @@ class Qwen2Model(nn.Module):
         try:
             # Step 1: Get the current XPU device's rank (0~7)
             rank = get_tensor_model_parallel_rank()
-
+
             # Step 2: Create log directory (works with your get_kernel_time_ex.py)
             log_dir = "./xpu_logs"
             os.makedirs(log_dir, exist_ok=True)
-
+
             # Step 3: Generate a separate log file for each device
             log_file = os.path.join(log_dir, f"rank_{rank}.log")
-
+
             # Step 4: Core operation – redirect file descriptors
             # os.O_TRUNC: Clear previous logs on each run to avoid mixing outputs
             fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
             os.dup2(fd, 1)  # Redirect stdout → rank_X.log
             os.dup2(fd, 2)  # Redirect stderr → rank_X.log
             os.close(fd)  # Close original file descriptor; redirection persists
-
+
             # Optional: print a confirmation message (will go into rank_X.log)
             print(f"[Qwen2Model Init] Rank {rank} log redirected to {log_file}")
-
+
         except Exception as e:
             # Fallback mechanism: failure to redirect logs does not affect model loading
             print(f"[WARNING] Failed to redirect log for rank: {e}", flush=True)
 # ========== End of log redirection code ==========
 ```
+
 #### ⚠️ Common Issues
+
 **Q1**: Why not use Python's `logging` module?
 **A**: The XPU runtime kernel logs are emitted from the C++ layer and cannot be captured by Python's `logging` module. Redirection via low-level file descriptors is required.

 **Q2**: Will logs be lost if the model fails to load?
 **A**: The `try-except` block ensures that if log redirection fails, it falls back to the default behavior without affecting model startup.
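For reference, the descriptor-level redirection described above can be exercised outside vLLM. The sketch below is a minimal standalone illustration (the helper name and file paths are made up for the example, not part of qwen2.py): it duplicates the original fd 1, points it at a log file, and restores it afterwards, so even writes that bypass `sys.stdout` land in the file:

```python
import os
import tempfile

def run_with_fd_redirect(log_file, fn):
    """Run fn() with fd 1 (stdout) redirected to log_file, then restore it."""
    saved = os.dup(1)  # keep a copy of the original stdout descriptor
    fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
    os.dup2(fd, 1)     # fd 1 now points at log_file
    os.close(fd)       # redirection persists after closing the helper fd
    try:
        fn()
    finally:
        os.dup2(saved, 1)  # restore the original stdout
        os.close(saved)

log_path = os.path.join(tempfile.mkdtemp(), "rank_0.log")
# os.write targets fd 1 directly, mimicking logs emitted below the Python layer
run_with_fd_redirect(log_path, lambda: os.write(1, b"[Init] rank 0 log redirected\n"))
with open(log_path) as f:
    print(f.read().strip())
```

Because the redirection happens at the file-descriptor level, it also captures output that never passes through `sys.stdout`, which is exactly why the `logging` module cannot do this job.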
### Phase 2️⃣: Profiling Environment Activation
+
 #### 🚀 vLLM Launch
+
 ```bash
 unset XPU_DUMMY_EVENT
 export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
@@ -96,14 +105,21 @@ USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
       --no-enable-chunked-prefill \
       --distributed-executor-backend mp \
       --served-model-name Qwen2.5-72B-Instruct \
-      --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
- "vllm.unified_attention", "vllm.unified_attention_with_output",
- "vllm.mamba_mixer2"]}' 2>&1 | tee output_p800.log
+      --compilation-config '{"splitting_ops": ["vllm.unified_attention",
+ "vllm.unified_attention_with_output",
+ "vllm.unified_attention_with_output_kunlun",
+ "vllm.mamba_mixer2",
+ "vllm.mamba_mixer",
+ "vllm.short_conv",
+ "vllm.linear_attention",
+ "vllm.plamo2_mamba_mixer",
+ "vllm.gdn_attention",
+ "vllm.sparse_attn_indexer"]}' 2>&1 | tee output_p800.log
 ```
-
 #### 🚀 Client Load Testing
+
 ```bash
 #!/bin/bash
@@ -177,6 +193,7 @@ echo "=========================================================="
 ```

 ### Phase 3️⃣: Log Analysis and Bottleneck Identification
+
 ```text
 xpu_logs/
 ├─ rank_0.log
@@ -189,16 +206,19 @@ xpu_logs/
 └─ rank_7.log
 ```
+
 #### 🔍 Script Workflow (op_log.py)
+
 **Input**: Raw Kernel Logs (Sample Format)
+
 ```
 [XPURT_PROF] void xblas_xpu3::fc_cdnn_infer 123456 ns
 [XPURT_PROF] void kl3_all_reduce 987654 ns
 ```
+
 **Processing logic**

 :::::{tab-set}
-::::{tab-item} op_log.py
-
+::::{tab-item} op_log.py
 ```python
 """
@@ -382,8 +402,6 @@ if __name__ == '__main__':
 ::::{tab-item} op_log.sh
-
-
 ```bash
 for i in {0..7}; do
@@ -397,9 +415,12 @@ for i in {0..7}; do
     head -n 6 analysis_rank${i}.log | tail -n 5
 done
 ```
+
 ::::
 :::::
+
 #### 📈 Output Example (analysis_rank0.log)
+
 ```
 Filename: xpu_logs/rank_0.log
-xpu option: 2
@@ -410,9 +431,11 @@
 void xblas_xpu3::fc_cdnn_infer           176134    14782.525712413793       27.506
 void kl3_all_reduce_butterfly            164864    4197.28395862069         7.81
 ```
+
 #### 🚨 Troubleshooting Guide
-|Symptom|Cause|Solution|
-|-|-|-|
-|`xpu_logs` directory is empty|XPUAPI_DEBUG not enabled|Verify that the environment variable is correctly set|
-All 8 log files have identical content|Multi-process backend not activated|Ensure `--distributed-executor-backend` mp is specified|
-|Throughput drops >15%|Profiling overhead too high|Enable profiling only during analysis; disable in production|
+
+| Symptom | Cause | Solution |
+| --- | --- | --- |
+| `xpu_logs` directory is empty | XPUAPI_DEBUG not enabled | Verify that the environment variable is correctly set |
+| All 8 log files have identical content | Multi-process backend not activated | Ensure `--distributed-executor-backend mp` is specified |
+| Throughput drops >15% | Profiling overhead too high | Enable profiling only during analysis; disable in production |
diff --git a/docs/source/installation.md b/docs/source/installation.md
index f96ee56..5adc502 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -32,6 +32,9 @@ if [ $XPU_NUM -gt 0 ]; then
     DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
 fi
 export build_image="wjie520/vllm_kunlun:v0.0.1"
+# or export build_image="iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.32"
+
+
 docker run -itd ${DOCKER_DEVICE_CONFIG} \
     --net=host \
     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
@@ -49,6 +52,7 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 ### Install vLLM 0.11.0
 ```
 conda activate vllm_kunlun_0.10.1.1
+# or conda activate python310_torch25_cuda
 pip install vllm==0.11.0
--no-build-isolation --no-deps
 ```
@@ -79,6 +83,7 @@ bash xpytorch-cp310-torch251-ubuntu2004-x64.run

 ## Install the KL3-customized build of PyTorch (Only MIMO V2)
 ```
 wget -O xpytorch-cp310-torch251-ubuntu2004-x64.run https://klx-sdk-release-public.su.bcebos.com/kunlun2aiak_output/1231/xpytorch-cp310-torch251-ubuntu2004-x64.run
+bash xpytorch-cp310-torch251-ubuntu2004-x64.run
 ```

 ## Install custom ops
diff --git a/docs/source/tutorials/multi_xpu_GLM-4.5.md b/docs/source/tutorials/multi_xpu_GLM-4.5.md
index da07b74..1e3a326 100644
--- a/docs/source/tutorials/multi_xpu_GLM-4.5.md
+++ b/docs/source/tutorials/multi_xpu_GLM-4.5.md
@@ -113,7 +113,16 @@ python -m vllm.entrypoints.openai.api_server \
     --no-enable-chunked-prefill \
     --distributed-executor-backend mp \
     --served-model-name GLM-4.5 \
-    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", "vllm.unified_attention", "vllm.unified_attention_with_output", "vllm.mamba_mixer2"]}' > log_glm_plugin.txt 2>&1 &
+    --compilation-config '{"splitting_ops": ["vllm.unified_attention",
+ "vllm.unified_attention_with_output",
+ "vllm.unified_attention_with_output_kunlun",
+ "vllm.mamba_mixer2",
+ "vllm.mamba_mixer",
+ "vllm.short_conv",
+ "vllm.linear_attention",
+ "vllm.plamo2_mamba_mixer",
+ "vllm.gdn_attention",
+ "vllm.sparse_attn_indexer"]}' > log_glm_plugin.txt 2>&1 &
 ```

 If your service starts successfully, you can see the info shown below:
diff --git a/docs/source/tutorials/multi_xpu_Qwen3-Coder-480B-A35B(W8A8).md b/docs/source/tutorials/multi_xpu_Qwen3-Coder-480B-A35B(W8A8).md
index 1f15a0f..824a6ae 100644
--- a/docs/source/tutorials/multi_xpu_Qwen3-Coder-480B-A35B(W8A8).md
+++ b/docs/source/tutorials/multi_xpu_Qwen3-Coder-480B-A35B(W8A8).md
@@ -16,7 +16,8 @@
 if [ $XPU_NUM -gt 0 ]; then
     DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
 fi

-export build_image="xxxxxxxxxxxxxxxxx"
+export build_image="iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.32"
+ docker run -itd ${DOCKER_DEVICE_CONFIG} \ --net=host \ @@ -32,14 +33,12 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \ ### Preparation Weight -* Pull Qwen3-Coder-480B-A35B-Instruct bf16 weights -* Modify the weights configuration.json file and add the fields quantization_config and compression_config. +- Pull Qwen3-Coder-480B-A35B-Instruct bf16 weights +- Modify the weights configuration.json file and add the fields quantization_config and compression_config. ```json { - "architectures": [ - "Qwen3MoeForCausalLM" - ], + "architectures": ["Qwen3MoeForCausalLM"], "attention_dropout": 0.0, "decoder_sparse_step": 1, "eos_token_id": 151645, @@ -61,7 +60,7 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \ "num_key_value_heads": 8, "output_router_logits": false, "qkv_bias": false, - "rms_norm_eps": 1e-06, + "rms_norm_eps": 1e-6, "rope_scaling": null, "rope_theta": 10000000, "router_aux_loss_coef": 0.0, @@ -104,7 +103,6 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \ "sparsity_config": null } } - ``` ### Online Serving on Multi XPU @@ -128,5 +126,15 @@ python3 -m vllm.entrypoints.openai.api_server \ --enable-chunked-prefill=False \ --no-enable-prefix-caching \ --disable-log-requests \ - --gpu-memory-utilization 0.85 -``` \ No newline at end of file + --gpu-memory-utilization 0.9 \ + --compilation-config '{"splitting_ops": ["vllm.unified_attention", + "vllm.unified_attention_with_output", + "vllm.unified_attention_with_output_kunlun", + "vllm.mamba_mixer2", + "vllm.mamba_mixer", + "vllm.short_conv", + "vllm.linear_attention", + "vllm.plamo2_mamba_mixer", + "vllm.gdn_attention", + "vllm.sparse_attn_indexer"]}' 2>&1 | tee output_p800.log +``` diff --git a/docs/source/tutorials/single_xpu_Qwen3-8B.md b/docs/source/tutorials/single_xpu_Qwen3-8B.md index 154bc08..ac56c88 100644 --- a/docs/source/tutorials/single_xpu_Qwen3-8B.md +++ b/docs/source/tutorials/single_xpu_Qwen3-8B.md @@ -128,9 +128,16 @@ python -m vllm.entrypoints.openai.api_server \ --no-enable-chunked-prefill \ 
--distributed-executor-backend mp \
     --served-model-name Qwen3-8B \
-    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
- "vllm.unified_attention", "vllm.unified_attention_with_output",
- "vllm.mamba_mixer2"]}' \
+    --compilation-config '{"splitting_ops": ["vllm.unified_attention",
+ "vllm.unified_attention_with_output",
+ "vllm.unified_attention_with_output_kunlun",
+ "vllm.mamba_mixer2",
+ "vllm.mamba_mixer",
+ "vllm.short_conv",
+ "vllm.linear_attention",
+ "vllm.plamo2_mamba_mixer",
+ "vllm.gdn_attention",
+ "vllm.sparse_attn_indexer"]}' \
 ```

 If your service starts successfully, you can see the info shown below:
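As a footnote to the profiling guide above: the per-kernel aggregation that op_log.py performs on `[XPURT_PROF]` lines can be sketched in a few lines. This is a simplified stand-in, not the real script (its exact columns, options, and per-rank handling differ); the sample log lines follow the format shown in the Phase 3 section:

```python
import re
from collections import defaultdict

SAMPLE = """\
[XPURT_PROF] void xblas_xpu3::fc_cdnn_infer 123456 ns
[XPURT_PROF] void kl3_all_reduce 987654 ns
[XPURT_PROF] void xblas_xpu3::fc_cdnn_infer 111111 ns
"""

# kernel name, then its duration in nanoseconds
PAT = re.compile(r"\[XPURT_PROF\]\s+(.*?)\s+(\d+)\s+ns")

def summarize(text):
    """Aggregate (kernel, total ns, average ns), sorted by total descending."""
    times = defaultdict(list)
    for line in text.splitlines():
        m = PAT.search(line)
        if m:
            times[m.group(1)].append(int(m.group(2)))
    # the biggest totals surface first: those are the bottleneck candidates
    return sorted(
        ((name, sum(ts), sum(ts) / len(ts)) for name, ts in times.items()),
        key=lambda row: row[1],
        reverse=True,
    )

for name, total_ns, avg_ns in summarize(SAMPLE):
    print(f"{name:40s} total={total_ns} ns avg={avg_ns:.1f} ns")
```

Running the same aggregation over each `rank_N.log` (as op_log.sh does in a loop) then lets you compare per-device totals instead of one globally interleaved stream.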