[Doc] Optimize the document (#136)
@@ -1,6 +1,7 @@
 # Contributing
 
 ## Building and Testing
 
 It's recommended to set up a local development environment to build vllm-kunlun and run tests
 before you submit a PR.
 
@@ -21,10 +22,18 @@ python -m vllm.entrypoints.openai.api_server \
 --no-enable-chunked-prefill \
 --distributed-executor-backend mp \
 --served-model-name your_modified_models \
---compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
-"vllm.unified_attention", "vllm.unified_attention_with_output",
-"vllm.mamba_mixer2"]}' \
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' \
 ```
 
 Please save a screenshot of your service running successfully, and attach an accuracy report.
 
 ### Submit the commit
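The `--compilation-config` value in the command above is a single JSON string passed through the shell, which makes hand-edited quoting mistakes easy. A standalone sketch (not part of the repo) of generating and sanity-checking it with `json.dumps`:

```python
import json

# The splitting ops from the updated command above.
splitting_ops = [
    "vllm.unified_attention",
    "vllm.unified_attention_with_output",
    "vllm.unified_attention_with_output_kunlun",
    "vllm.mamba_mixer2",
    "vllm.mamba_mixer",
    "vllm.short_conv",
    "vllm.linear_attention",
    "vllm.plamo2_mamba_mixer",
    "vllm.gdn_attention",
    "vllm.sparse_attn_indexer",
]

# json.dumps guarantees valid JSON; wrap the result in single quotes
# when pasting it into the shell command.
config_arg = json.dumps({"splitting_ops": splitting_ops})
print(config_arg)
```

Round-tripping the string through `json.loads` before launch catches a malformed list earlier than a server-side parse error would.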
@@ -36,7 +45,6 @@ git commit -sm "your commit info"
 
 🎉 Congratulations! You have completed the development environment setup.
 
 
 ## PR Title and Classification
 
 Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
@@ -46,7 +54,7 @@ Only specific types of PRs will be reviewed. The PR title is prefixed appropriat
 - `[ModelRunner]` for new features or optimization in model runner.
 - `[Platform]` for new features or optimization in platform.
 - `[Worker]` for new features or optimization in worker.
 - `[Core]` for new features or optimization in the core vllm-kunlun logic (such as platform, attention, communicators, model runner)
 - `[Kernel]` for changes affecting compute kernels and ops.
 - `[Bugfix]` for bug fixes.
 - `[Doc]` for documentation fixes and improvements.
@@ -60,7 +68,7 @@ If the PR spans more than one category, please include all relevant prefixes.
 
 ## Others
 
 If you find any problem when contributing, you can join our slack group to talk with us, and feel free to submit a PR improving this doc to help other developers.
 
 :::{toctree}
 :caption: Index
@@ -1,18 +1,23 @@
 # Profiling
 
 
 ## 🔧 Action Plan (Three Phases)
 
 ### Phase 1️⃣: Multi-Device Log Redirection Configuration
 
 #### Background
 
 By default, kernel logs from all 8 XPU devices are interleaved and emitted to stdout, resulting in:
 
 - It becomes impossible to distinguish which log originates from which device.
 - Timestamps become interleaved, making it difficult to analyze the temporal relationships.
 - Single-device bottlenecks are masked by global aggregation.
 
 #### Solution
 
 During model initialization, create separate log files for each device.
 
 #### Code Explanation (embedded in qwen2.py)
 
 ```python
 import os  # ← Ensure this is imported at the top of the file
 from vllm.distributed import get_tensor_model_parallel_rank  # ← Import function to get the tensor model parallel rank
@@ -30,38 +35,42 @@ class Qwen2Model(nn.Module):
 try:
     # Step 1: Get the current XPU device's rank (0~7)
     rank = get_tensor_model_parallel_rank()
 
     # Step 2: Create log directory (works with your get_kernel_time_ex.py)
     log_dir = "./xpu_logs"
     os.makedirs(log_dir, exist_ok=True)
 
     # Step 3: Generate a separate log file for each device
     log_file = os.path.join(log_dir, f"rank_{rank}.log")
 
     # Step 4: Core operation – redirect file descriptors
     # os.O_TRUNC: Clear previous logs on each run to avoid mixing outputs
     fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
     os.dup2(fd, 1)  # Redirect stdout → rank_X.log
     os.dup2(fd, 2)  # Redirect stderr → rank_X.log
     os.close(fd)  # Close original file descriptor; redirection persists
 
     # Optional: print a confirmation message (will go into rank_X.log)
     print(f"[Qwen2Model Init] Rank {rank} log redirected to {log_file}")
 
 except Exception as e:
     # Fallback mechanism: failure to redirect logs does not affect model loading
     print(f"[WARNING] Failed to redirect log for rank: {e}", flush=True)
 # ========== End of log redirection code ==========
 
 ```
 
 #### ⚠️ Common Issues
 
 **Q1**: Why not use Python's `logging` module?
 **A**: The XPU runtime kernel logs are emitted from the C++ layer and cannot be captured by Python's `logging` module. Redirection via low-level file descriptors is required.
 **Q2**: Will logs be lost if the model fails to load?
 **A**: The `try-except` block ensures that if log redirection fails, it falls back to the default behavior without affecting model startup.
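The redirection technique can be exercised in isolation. A minimal standalone sketch (the helper names are hypothetical; unlike the in-tree code, it also saves duplicates of the original descriptors so the redirection can be undone):

```python
import os
import tempfile

def redirect_stdio_to_file(log_file):
    """Point this process's fd 1/2 (stdout/stderr) at log_file."""
    saved_out, saved_err = os.dup(1), os.dup(2)   # keep originals for undo
    fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
    os.dup2(fd, 1)   # stdout → log_file
    os.dup2(fd, 2)   # stderr → log_file
    os.close(fd)     # fds 1/2 still reference log_file after this close
    return saved_out, saved_err

def restore_stdio(saved_out, saved_err):
    """Undo redirect_stdio_to_file using the saved descriptors."""
    os.dup2(saved_out, 1)
    os.dup2(saved_err, 2)
    os.close(saved_out)
    os.close(saved_err)

if __name__ == "__main__":
    log_file = os.path.join(tempfile.mkdtemp(), "rank_0.log")
    saved = redirect_stdio_to_file(log_file)
    os.write(1, b"[demo] redirected line\n")   # fd-level write, bypasses Python buffering
    restore_stdio(*saved)
    with open(log_file) as f:
        print("captured:", f.read().strip())
```

Because `dup2` acts at the file-descriptor level, it also captures writes made by C++ code, which is why Python's `logging` module cannot do this job.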
 
 ### Phase 2️⃣: Profiling Environment Activation
 
 #### 🚀 vLLM Launch
 
 ```bash
 unset XPU_DUMMY_EVENT
 export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
@@ -96,14 +105,21 @@ USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
 --no-enable-chunked-prefill \
 --distributed-executor-backend mp \
 --served-model-name Qwen2.5-72B-Instruct \
---compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
-"vllm.unified_attention", "vllm.unified_attention_with_output",
-"vllm.mamba_mixer2"]}' 2>&1 | tee output_p800.log
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' 2>&1 | tee output_p800.log
 
 ```
 
 
 #### 🚀 Client Load Testing
 
 ```bash
 #!/bin/bash
 
@@ -177,6 +193,7 @@ echo "=========================================================="
 ```
 
 ### Phase 3️⃣: Log Analysis and Bottleneck Identification
 
 ```text
 xpu_logs/
 ├─ rank_0.log
@@ -189,16 +206,19 @@ xpu_logs/
 └─ rank_7.log
 
 ```
 
 #### 🔍 Script Workflow (op_log.py)
 
 **Input**: Raw Kernel Logs (Sample Format)
 
 ```
 [XPURT_PROF] void xblas_xpu3::fc_cdnn_infer<float16,...> 123456 ns
 [XPURT_PROF] void kl3_all_reduce<float16> 987654 ns
 ```
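The body of op_log.py is elided from this diff. As a rough sketch of the kind of per-kernel aggregation it describes (assuming only that each `[XPURT_PROF]` line ends with a duration in nanoseconds, as in the sample above; function names and columns here are illustrative, not the script's actual output):

```python
import re
from collections import defaultdict

# Assumed line shape, from the sample above:
# [XPURT_PROF] <kernel signature> <duration> ns
LINE_RE = re.compile(r"\[XPURT_PROF\]\s+(?P<kernel>.+?)\s+(?P<ns>\d+)\s+ns\s*$")

def summarize(lines):
    """Aggregate total time and call count per kernel; return rows sorted by
    total time descending, with each kernel's share of overall time."""
    totals = defaultdict(lambda: [0, 0])          # kernel -> [total_ns, calls]
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue                              # skip non-profiling lines
        entry = totals[m["kernel"]]
        entry[0] += int(m["ns"])
        entry[1] += 1
    grand = sum(t[0] for t in totals.values()) or 1
    return [(k, t[0], t[1], round(100.0 * t[0] / grand, 3))
            for k, t in sorted(totals.items(), key=lambda kv: -kv[1][0])]

sample = [
    "[XPURT_PROF] void xblas_xpu3::fc_cdnn_infer<float16,...> 123456 ns",
    "[XPURT_PROF] void kl3_all_reduce<float16> 987654 ns",
]
for kernel, total_ns, calls, pct in summarize(sample):
    print(f"{kernel}  {total_ns}  {calls}  {pct}")
```

Sorting by total time surfaces the dominant kernels first, which is what the bottleneck analysis in this phase is after.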
 
 **Processing logic**
 :::::{tab-set}
 ::::{tab-item} op_log.py
 
 
 ```python
 """
@@ -382,8 +402,6 @@ if __name__ == '__main__':
 
 ::::{tab-item} op_log.sh
 
 
 ```bash
 
 for i in {0..7}; do
@@ -397,9 +415,12 @@ for i in {0..7}; do
 head -n 6 analysis_rank${i}.log | tail -n 5
 done
 ```
 
 ::::
 :::::
 
 #### 📈 Output Example (analysis_rank0.log)
 
 ```
 Filename: xpu_logs/rank_0.log
 -xpu option: 2
@@ -410,9 +431,11 @@ void xblas_xpu3::fc_cdnn_infer<float16, float16, float16, float16, float,
 void kl3_all_reduce<float16> 176134 14782.525712413793 27.506
 void kl3_all_reduce_butterfly<float16> 164864 4197.28395862069 7.81
 ```
 
 #### 🚨 Troubleshooting Guide
-|Symptom|Cause|Solution|
-|-|-|-|
-|`xpu_logs` directory is empty|XPUAPI_DEBUG not enabled|Verify that the environment variable is correctly set|
-All 8 log files have identical content|Multi-process backend not activated|Ensure `--distributed-executor-backend` mp is specified|
-|Throughput drops >15%|Profiling overhead too high|Enable profiling only during analysis; disable in production|
+| Symptom | Cause | Solution |
+| -------------------------------------- | ----------------------------------- | ------------------------------------------------------------ |
+| `xpu_logs` directory is empty | XPUAPI_DEBUG not enabled | Verify that the environment variable is correctly set |
+| All 8 log files have identical content | Multi-process backend not activated | Ensure `--distributed-executor-backend mp` is specified |
+| Throughput drops >15% | Profiling overhead too high | Enable profiling only during analysis; disable in production |
@@ -32,6 +32,9 @@ if [ $XPU_NUM -gt 0 ]; then
 DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
 fi
 export build_image="wjie520/vllm_kunlun:v0.0.1"
+# or export build_image="iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.32"
 
 
 docker run -itd ${DOCKER_DEVICE_CONFIG} \
 --net=host \
 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
@@ -49,6 +52,7 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 ### Install vLLM 0.11.0
 ```
 conda activate vllm_kunlun_0.10.1.1
+# or conda activate python310_torch25_cuda
 
 pip install vllm==0.11.0 --no-build-isolation --no-deps
 ```
@@ -79,6 +83,7 @@ bash xpytorch-cp310-torch251-ubuntu2004-x64.run
 ## Install the KL3-customized build of PyTorch (Only MIMO V2)
 ```
 wget -O xpytorch-cp310-torch251-ubuntu2004-x64.run https://klx-sdk-release-public.su.bcebos.com/kunlun2aiak_output/1231/xpytorch-cp310-torch251-ubuntu2004-x64.run
+bash xpytorch-cp310-torch251-ubuntu2004-x64.run
 ```
 
 ## Install custom ops
@@ -113,7 +113,16 @@ python -m vllm.entrypoints.openai.api_server \
 --no-enable-chunked-prefill \
 --distributed-executor-backend mp \
 --served-model-name GLM-4.5 \
---compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", "vllm.unified_attention", "vllm.unified_attention_with_output", "vllm.mamba_mixer2"]}' > log_glm_plugin.txt 2>&1 &
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' > log_glm_plugin.txt 2>&1 &
 ```
 
 If your service starts successfully, you will see the info shown below:
@@ -16,7 +16,8 @@ if [ $XPU_NUM -gt 0 ]; then
 DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
 fi
 
-export build_image="xxxxxxxxxxxxxxxxx"
+export build_image="iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.32"
 
 
 docker run -itd ${DOCKER_DEVICE_CONFIG} \
 --net=host \
@@ -32,14 +33,12 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 
 ### Weight Preparation
 
-* Pull Qwen3-Coder-480B-A35B-Instruct bf16 weights
-* Modify the weights configuration.json file and add the fields quantization_config and compression_config.
+- Pull Qwen3-Coder-480B-A35B-Instruct bf16 weights
+- Modify the weights' configuration.json file and add the `quantization_config` and `compression_config` fields.
 
 ```json
 {
-"architectures": [
-"Qwen3MoeForCausalLM"
-],
+"architectures": ["Qwen3MoeForCausalLM"],
 "attention_dropout": 0.0,
 "decoder_sparse_step": 1,
 "eos_token_id": 151645,
@@ -61,7 +60,7 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 "num_key_value_heads": 8,
 "output_router_logits": false,
 "qkv_bias": false,
-"rms_norm_eps": 1e-06,
+"rms_norm_eps": 1e-6,
 "rope_scaling": null,
 "rope_theta": 10000000,
 "router_aux_loss_coef": 0.0,
@@ -104,7 +103,6 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 "sparsity_config": null
 }
 }
 
 ```
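Adding the two fields can also be scripted rather than hand-edited. A hedged sketch with placeholder contents (substitute the real `quantization_config`/`compression_config` blocks from the snippet above; the helper name is illustrative):

```python
import json
import os
import tempfile

def add_fields(config_path, patch):
    """Merge top-level fields into an existing configuration.json,
    leaving all other keys untouched, and write the result back in place."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg.update(patch)
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

if __name__ == "__main__":
    # Demo against a throwaway file; point config_path at the real
    # weights directory instead.
    config_path = os.path.join(tempfile.mkdtemp(), "configuration.json")
    with open(config_path, "w") as f:
        json.dump({"architectures": ["Qwen3MoeForCausalLM"]}, f)
    add_fields(config_path, {
        "quantization_config": {"...": "..."},   # placeholder, see snippet above
        "compression_config": {"...": "..."},    # placeholder, see snippet above
    })
    print(json.load(open(config_path)))
```

Using `dict.update` on the parsed JSON avoids accidentally clobbering unrelated keys while editing by hand.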
 
 ### Online Serving on Multi XPU
@@ -128,5 +126,15 @@ python3 -m vllm.entrypoints.openai.api_server \
 --enable-chunked-prefill=False \
 --no-enable-prefix-caching \
 --disable-log-requests \
---gpu-memory-utilization 0.85
-```
+--gpu-memory-utilization 0.9 \
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' 2>&1 | tee output_p800.log
+```
@@ -128,9 +128,16 @@ python -m vllm.entrypoints.openai.api_server \
 --no-enable-chunked-prefill \
 --distributed-executor-backend mp \
 --served-model-name Qwen3-8B \
---compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
-"vllm.unified_attention", "vllm.unified_attention_with_output",
-"vllm.mamba_mixer2"]}' \
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' \
 ```
 
 If your service starts successfully, you will see the info shown below: