# Commit: vllm 0.11.0 development branch

New file: docs/source/developer_guide/contribution/contributing.md
# Contributing

## Building and Testing

It's recommended to set up a local development environment to build vllm-kunlun and run tests before you submit a PR.

#### Run models locally

After completing the lint setup described in the quickstart, you can run your changes locally:

```{code-block} bash
:substitutions:

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8356 \
    --model your_modified_models \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max_num_seqs 128 \
    --max_num_batched_tokens 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name your_modified_models \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention",
        "vllm.unified_attention_with_output",
        "vllm.unified_attention_with_output_kunlun",
        "vllm.mamba_mixer2",
        "vllm.mamba_mixer",
        "vllm.short_conv",
        "vllm.linear_attention",
        "vllm.plamo2_mamba_mixer",
        "vllm.gdn_attention",
        "vllm.sparse_attn_indexer"]}'
```

Please save a screenshot of your service running successfully, and attach an accuracy report.
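To confirm the service is actually up, you can hit its OpenAI-compatible `/v1/completions` endpoint. The sketch below builds such a request with only the standard library; the port and model name simply mirror the launch command above, and the network call is left commented out so the snippet also runs without a live server.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str, max_tokens: int = 32):
    """Build an OpenAI-compatible /v1/completions request for the local server."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("http://0.0.0.0:8356", "your_modified_models", "Hello")
print(req.full_url)                    # http://0.0.0.0:8356/v1/completions
print(json.loads(req.data)["model"])   # your_modified_models

# With the server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```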
#### Submit the commit

```bash
# Commit changed files using `-s`
git commit -sm "your commit info"
```

🎉 Congratulations! You have completed the development environment setup.

## PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

- `[Attention]` for new features or optimizations in attention.
- `[Communicator]` for new features or optimizations in communicators.
- `[ModelRunner]` for new features or optimizations in the model runner.
- `[Platform]` for new features or optimizations in the platform.
- `[Worker]` for new features or optimizations in the worker.
- `[Core]` for new features or optimizations in the core vllm-kunlun logic (such as platform, attention, communicators, model runner).
- `[Kernel]` for changes affecting compute kernels and ops.
- `[Bugfix]` for bug fixes.
- `[Doc]` for documentation fixes and improvements.
- `[Test]` for tests (such as unit tests).
- `[CI]` for build or continuous integration improvements.
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

:::{note}
If the PR spans more than one category, please include all relevant prefixes.
:::

## Others

If you run into any problem while contributing, you can join our Slack group to talk with us, and feel free to submit a PR improving this doc to help other developers.

:::{toctree}
:caption: Index
:maxdepth: 1
testing
multi_node_test
:::
```python
if not os.path.exists(output_dir):  # Step 4: Check if the directory exists
    ...

# dump the mixed data to a jsonl file
dump_jsonl_data(mixed_data, output_path)  # Step 6: Securely write to the file
```

Dataset composition visualization:

```
┌───────────────────────────────────────┐
│        VL-Test (1000 samples)         │
├─────────────────┬─────────────────────┤
│    PureText     │       Vision        │
│  (333 samples)  │    (667 samples)    │
├─────────────────┼─────────────────────┤
│ • mmlu_pro      │ • math_vista        │
│ • ifeval        │ • mmmu_pro          │
│ • gsm8k         │                     │
└─────────────────┴─────────────────────┘
```
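The 1:2 text/vision split above can be reproduced with a small helper. `mix_datasets` below is a hypothetical name, and the sampling logic is only a sketch assuming each source dataset is a list of records:

```python
import random


def mix_datasets(text_samples, vision_samples, total=1000, text_ratio=1 / 3, seed=0):
    """Sample a text/vision mix at the given ratio (1:2 for the VL-Test set above)."""
    rng = random.Random(seed)          # fixed seed for a reproducible test set
    n_text = round(total * text_ratio)  # 333
    n_vision = total - n_text           # 667
    mixed = rng.sample(text_samples, n_text) + rng.sample(vision_samples, n_vision)
    rng.shuffle(mixed)
    return mixed


# Hypothetical source pools standing in for the real datasets
text = [{"source": "mmlu_pro", "id": i} for i in range(2000)]
vision = [{"source": "math_vista", "id": i} for i in range(2000)]
mixed_data = mix_datasets(text, vision)
print(len(mixed_data))  # 1000
```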
#### 3. Test

```python
from dotenv import dotenv_values

...

task_cfg = TaskConfig(
    ...
)

run_task(task_cfg=task_cfg)
```
Parameter Tuning Guide:

| Parameter | Current value | Effect | Adjustment suggestions |
| ----------------- | ------------- | ---------------------------------------- | -------------------------------------------------------- |
| `temperature` | 0.6 | Controls output diversity | Math problems: lower to 0.3 / creative writing: raise to 0.9 |
| `top_p` | 0.95 | Filters low-probability tokens | Reduces "nonsense" |
| `eval_batch_size` | 5 | Number of requests processed in parallel | With sufficient device memory, it can be increased to 10. |
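The suggestions in the table can be captured as per-task presets. `pick_sampling_params` is a hypothetical helper illustrating the suggested values; it is not part of the EvalScope API:

```python
def pick_sampling_params(task_type: str) -> dict:
    """Suggested generation settings per task type (see the tuning table above)."""
    presets = {
        "math": {"temperature": 0.3, "top_p": 0.95},      # deterministic reasoning
        "creative": {"temperature": 0.9, "top_p": 0.95},  # more diverse output
    }
    # the default matches the current configuration in the table
    return presets.get(task_type, {"temperature": 0.6, "top_p": 0.95})


print(pick_sampling_params("math"))     # {'temperature': 0.3, 'top_p': 0.95}
print(pick_sampling_params("general"))  # {'temperature': 0.6, 'top_p': 0.95}
```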
Run the test:
```bash
python accuracy.py 2>&1 | tee "$LOG_FILE"

# ========================================
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -eq 0 ]; then
    echo "✅ Evaluation completed! Log saved to: $LOG_FILE"
else
    echo "❌ Evaluation failed! Exit code: $EXIT_CODE. Please check the log: $LOG_FILE"
fi
```
#### 4. Common problem fixes

##### 4.1 Missing NLTK resource

```
Resource punkt_tab not found.
```

Solution:

```python
import nltk
import os

# ... (download_dir is defined above; elided in this excerpt)
os.makedirs(download_dir, exist_ok=True)
nltk.data.path.append(download_dir)

# Step 3: Download necessary resources
print("🔽 Start downloading punkt_tab resource...")
try:
    nltk.download("punkt_tab", download_dir=download_dir)
    print("✅ Download successful!")
except Exception as e:
    print(f"❌ Download failed: {e}")
    print("💡 Alternative: Download manually from GitHub")
    print(
        " URL: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip"
    )
```
The fork pattern is used to track the entire time period from the start to the end of the program:

```bash
/xxxx/xxxx/xprofiler -r500 --xpu=0 python test.py
```

- `-r`: Sets the trace time resolution in nanoseconds (ns). The default is 100. If an "out of space" error occurs, try increasing the `-r` value to 500.
- `--xpu`: Specifies the acquisition device ID; multi-card configuration is supported. `--xpu=all` enables all cards; the default is card 0.

More parameters can be found in the command-line parameters section later.
A temporary .sock file will be generated in the execution directory; its path needs to be exported before starting your own program:

```bash
export XPU_ENABLE_PROFILER_TRACING=1
export XPU_TRACING_OUTPUT_NAME=<xprofiler execution directory>/xprofiler.sock
# Start your own program
python xxx.py
```
```bash
export XPU_ENABLE_PROFILER_TRACING=1
# Here, the path to the .sock file from step 2 is used for assignment.
export XPU_TRACING_OUTPUT_NAME=<xprofiler execution directory>/xprofiler.sock
# Start your own program
python xxx.py
```
Note: If you want to specify a particular card to run on, you must export the XPU device-selection environment variable first.

##### More parameters

| Parameter | Example | Default | Description |
| -------------------------- | --------------------------------------- | ------- | ------------------------------------------------------------ |
| `-b` or `--buffer-size` | `-b=512` | 256 | Size of the trace buffer in MB. Usually not required, but if there are many trace signals it can be increased to avoid OOS (out of size). |
| `-x` or `--xpu` | `-x=0` / `--xpu=0` | 0 | Card number(s) to trace; multiple cards or all cards can be set. |
| `-t` or `--time` | `-t=10` | off | Enable time mode (in seconds) to capture information over a specified period. |
| `-d` or `--daemonize` | `-d` | 0 | Enable daemon mode to retrieve events in the background. |
| `-e` or `--export-profile` | `-e ./trace_output` / `-e ./output/trace.json` | ./ | Record the trace results to a file or folder. If not specified, a default xprofiler.trace.json is generated in the execution directory. |
| `-S` or `--settings` | `-S xprofiler.trace.json` | off | xprofiler reads a JSON file listing the events to trace. If not configured, xprofiler enables `--profile-api-trace` and `--sse-trace` by default. |
| `-A` or `--profiler-api-trace` | `-A` | on | Get driver events. |
| `-s` or `--sse-trace` | `-s` | on | Get all SSE events. |
| `-C` or `--cluster-trace` | `-C` | off | Get all cluster events. |
| `-n` or `--sdnn-trace` | `-n` | off | Get all SDNN events. |
| `-c` or `--sdnn-cluster-trace` | `-c` | off | Get all SDNN cluster events. |
| `-E` or `--cache-trace` | `-E` | off | Get bandwidth statistics events. |
| `-u` or `--debug` | `-u44`: open log (debug level) / `-u0`: close log | 33 | Debug the interface and enable driver/device event logging. |
#### 3. View Results

Search directly, or visit [Perfetto UI](https://ui.perfetto.dev/) to load the trace file.

With various performance data available, analysis and optimization can then be performed based on the results.

(Further details to be added later)
You can directly use vLLM's CLI benchmark; for more details, please refer to the vLLM benchmark documentation.

Server startup script reference:

```bash
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /xxxx/xxxx/model \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max_num_seqs 128 \
    --max_num_batched_tokens 32768 \
    --max-seq-len-to-capture 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name modelname \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention",
        "vllm.unified_attention_with_output",
        "vllm.unified_attention_with_output_kunlun",
        "vllm.mamba_mixer2",
        "vllm.mamba_mixer",
        "vllm.short_conv",
        "vllm.linear_attention",
        "vllm.plamo2_mamba_mixer",
        "vllm.gdn_attention",
        "vllm.sparse_attn_indexer"]}'
```

##### 1.2 Execute test
The following demonstrates the performance test of Qwen3-8B in a single-card setup.

The first step is to start the server. The example script is shown below.

```bash
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen3-8B \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max_num_seqs 128 \
    --max_num_batched_tokens 32768 \
    --max-seq-len-to-capture 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B-Instruct \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention",
        "vllm.unified_attention_with_output",
        "vllm.unified_attention_with_output_kunlun",
        "vllm.mamba_mixer2",
        "vllm.mamba_mixer",
        "vllm.short_conv",
        "vllm.linear_attention",
        "vllm.plamo2_mamba_mixer",
        "vllm.gdn_attention",
        "vllm.sparse_attn_indexer"]}'
```

##### 2.2 Start EvalScope
This document details the performance testing methods for vllm-kunlun and the analysis of their results.

:::{toctree}
:maxdepth: 1
benchmark_server
benchmark_kernel
profiling
:::
@@ -1,418 +0,0 @@
|
||||
## Profiling
|
||||
|
||||
|
||||
|
||||
### 🔧 Action Plan(Three Phases)
|
||||
#### Phase 1️⃣: Multi-Device Log Redirection Configuration
|
||||
##### Background
|
||||
By default, kernel logs from all 8 XPU devices are interleaved and emitted to [stdout], resulting in:
|
||||
- It becomes impossible to distinguish which log originates from which device.
|
||||
- Timestamps become interleaved, making it difficult to analyze the temporal relationships.
|
||||
- Single-device bottlenecks are masked by global aggregation.
|
||||
|
||||
##### Solution
|
||||
During model initialization, create separate log files for each device.
|
||||
##### Code Explanation (embedded in qwen2.py)
|
||||
```python
import os  # ← Ensure this is imported at the top of the file
from vllm.distributed import get_tensor_model_parallel_rank  # ← Get the tensor model parallel rank


class Qwen2Model(nn.Module):

    def __init__(self,
                 *,
                 vllm_config: VllmConfig,
                 prefix: str = "",
                 decoder_layer_type: type[nn.Module] = Qwen2DecoderLayer):
        super().__init__()

        # ========== [Expert Solution] Kunlun XPU Multi-Device Log Redirection ==========
        try:
            # Step 1: Get the current XPU device's rank (0~7)
            rank = get_tensor_model_parallel_rank()

            # Step 2: Create log directory (works with your get_kernel_time_ex.py)
            log_dir = "./xpu_logs"
            os.makedirs(log_dir, exist_ok=True)

            # Step 3: Generate a separate log file for each device
            log_file = os.path.join(log_dir, f"rank_{rank}.log")

            # Step 4: Core operation – redirect file descriptors
            # os.O_TRUNC: Clear previous logs on each run to avoid mixing outputs
            fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
            os.dup2(fd, 1)  # Redirect stdout → rank_X.log
            os.dup2(fd, 2)  # Redirect stderr → rank_X.log
            os.close(fd)    # Close original file descriptor; redirection persists

            # Optional: print a confirmation message (will go into rank_X.log)
            print(f"[Qwen2Model Init] Rank {rank} log redirected to {log_file}")

        except Exception as e:
            # Fallback mechanism: failure to redirect logs does not affect model loading
            print(f"[WARNING] Failed to redirect log for rank: {e}", flush=True)
        # ========== End of log redirection code ==========
```
##### ⚠️ Common Issues

**Q1**: Why not use Python's `logging` module?

**A**: The XPU runtime kernel logs are emitted from the C++ layer and cannot be captured by Python's `logging` module; redirection via low-level file descriptors is required.

**Q2**: Will logs be lost if the model fails to load?

**A**: The `try-except` block ensures that if log redirection fails, it falls back to the default behavior without affecting model startup.
#### Phase 2️⃣: Profiling Environment Activation

##### 🚀 vLLM Launch

```bash
unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XPU_USE_MOE_SORTED_THRES=1
export XFT_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_FAST_SWIGLU=1
export VLLM_HOST_IP=$(hostname -i)
echo "VLLM_HOST_IP: $VLLM_HOST_IP"

export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false

export XPUAPI_DEBUG=0x1              # Enable kernel performance logging
export XPURT_DISPATCH_MODE=PROFILING # Activate profiling mode

USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen2.5-72B-Instruct \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --dtype float16 \
    --max_num_seqs 512 \
    --max_num_batched_tokens 32768 \
    --max-seq-len-to-capture 32768 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen2.5-72B-Instruct \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention", "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2"]}' 2>&1 | tee output_p800.log
```
##### 🚀 Client Load Testing

```bash
#!/bin/bash

# Define test combinations array (concurrency x input length x output length)
TEST_COMBINATIONS=(
    "8x1024x1024"   # Medium-low concurrency
)

# Create result directory
RESULT_DIR="bench_$(date +%Y%m%d_%H%M)"
mkdir -p "$RESULT_DIR"

# Summary results file
SUMMARY_FILE="$RESULT_DIR/summary_results.csv"
echo "num_prompts,input_len,output_len,throughput,latency_mean,latency_p50,latency_p90,latency_p99" > "$SUMMARY_FILE"

# Progress counter
TOTAL_TESTS=${#TEST_COMBINATIONS[@]}
CURRENT_TEST=0

# Loop through the test combinations
for COMBINATION in "${TEST_COMBINATIONS[@]}"; do
    # Parse combination parameters
    NUM_PROMPTS=$(echo $COMBINATION | cut -d'x' -f1)
    INPUT_LEN=$(echo $COMBINATION | cut -d'x' -f2)
    OUTPUT_LEN=$(echo $COMBINATION | cut -d'x' -f3)

    # Update progress
    CURRENT_TEST=$((CURRENT_TEST + 1))

    echo "=========================================================="
    echo "Test progress: $CURRENT_TEST/$TOTAL_TESTS ($(printf "%.1f" $(echo "$CURRENT_TEST/$TOTAL_TESTS*100" | bc -l))%)"
    echo "Current test configuration: concurrency=$NUM_PROMPTS, input length=$INPUT_LEN, output length=$OUTPUT_LEN"
    echo "=========================================================="

    OUTPUT_FILE="$RESULT_DIR/p800_${NUM_PROMPTS}_${INPUT_LEN}_${OUTPUT_LEN}.log"

    # Run benchmark
    python3 -m vllm.entrypoints.cli.main bench serve \
        --host 127.0.0.1 \
        --port 8000 \
        --backend vllm \
        --model Qwen2.5-72B-Instruct \
        --dataset-name random \
        --num-prompts $NUM_PROMPTS \
        --random-input-len $INPUT_LEN \
        --random-output-len $OUTPUT_LEN \
        --tokenizer /ssd1/models/Qwen2.5-72B-Instruct \
        --ignore-eos 2>&1 | tee "$OUTPUT_FILE"

    # Wait 15 seconds to let the service recover
    echo "Waiting 15 seconds before the next round..."
    sleep 15

    # Extract key performance metrics from the output and append to the summary file
    THROUGHPUT=$(grep "Throughput" "$OUTPUT_FILE" | awk '{print $2}')
    LATENCY_MEAN=$(grep "Mean latency" "$OUTPUT_FILE" | awk '{print $3}')
    LATENCY_P50=$(grep "p50 latency" "$OUTPUT_FILE" | awk '{print $3}')
    LATENCY_P90=$(grep "p90 latency" "$OUTPUT_FILE" | awk '{print $3}')
    LATENCY_P99=$(grep "p99 latency" "$OUTPUT_FILE" | awk '{print $3}')

    echo "$NUM_PROMPTS,$INPUT_LEN,$OUTPUT_LEN,$THROUGHPUT,$LATENCY_MEAN,$LATENCY_P50,$LATENCY_P90,$LATENCY_P99" >> "$SUMMARY_FILE"
done

# Output summary report
echo "=========================================================="
echo "Benchmark completed! Results saved in: $RESULT_DIR"
echo "=========================================================="
```
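After a run, the summary CSV written by the script above can be inspected programmatically. A minimal sketch with the standard library, using the column names from the script's header line and made-up sample numbers:

```python
import csv
import io

# Sample contents in the format written by the benchmark script above
sample = """num_prompts,input_len,output_len,throughput,latency_mean,latency_p50,latency_p90,latency_p99
8,1024,1024,1234.5,2.1,2.0,2.8,3.4
16,1024,1024,1500.2,3.9,3.7,4.9,5.8
"""

# For a real run, replace io.StringIO(sample) with open("summary_results.csv")
rows = list(csv.DictReader(io.StringIO(sample)))
best = max(rows, key=lambda r: float(r["throughput"]))
print(f"best throughput: {best['throughput']} at concurrency {best['num_prompts']}")
# best throughput: 1500.2 at concurrency 16
```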
#### Phase 3️⃣: Log Analysis and Bottleneck Identification

```
xpu_logs/
├─ rank_0.log
├─ rank_1.log
├─ rank_2.log
├─ rank_3.log
├─ rank_4.log
├─ rank_5.log
├─ rank_6.log
└─ rank_7.log
```
##### 🔍 Script Workflow (op_log.py)

**Input**: Raw kernel logs (sample format)

```
[XPURT_PROF] void xblas_xpu3::fc_cdnn_infer<float16,...> 123456 ns
[XPURT_PROF] void kl3_all_reduce<float16> 987654 ns
```

**Processing logic**

:::::{tab-set}
::::{tab-item} op_log.py
```python
"""
A better version of 'get_op_time.py': more dump levels and kl3 support.

Usage: python3 get_kernel_time_ex.py --help
"""

import os
import sys
import re

unit_factors = [0.9, 1.3, 1.45]  # kunlun1, kunlun2, kunlun3
patterns = [r"\[XPURT_PROF\] (\S+)\s+\S+\s+(\S+) ns",
            r"\[XPURT_PROF\] (\S+)\s+(\S+)\s+\S+ ns"]
tab_space_num = 4


def get_total_time(res):
    total_time = 0.0
    for i in res.values():
        total_time += i
    return total_time


def print_info_op(res, cnt, unit, op):
    total_time = get_total_time(res)
    total_cnt = 0
    # print detailed op time
    lis = sorted(res.items(), key=lambda d: d[1], reverse=True)
    if sys.version_info.major == 2:
        import commands
        for i in range(len(lis)):
            (status, cmd_output) = commands.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
                total_cnt += cnt[lis[i][0]]
    elif sys.version_info.major == 3:
        import subprocess
        for i in range(len(lis)):
            (status, cmd_output) = subprocess.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
                total_cnt += cnt[lis[i][0]]
    print(f"{op} {total_time / unit} {total_cnt}")


def print_info_kernel(res, cnt, unit):
    total_time = get_total_time(res)
    total_cnt = 0
    print("Total time(ms) is {}".format(total_time / unit))
    # print detailed op time
    lis = sorted(res.items(), key=lambda d: d[1], reverse=True)
    if sys.version_info.major == 2:
        print("{:<90}{:<10}{:<15}{:<15}".format("Op type", "count", "time(ms)", "%"))
        import commands
        for i in range(len(lis)):
            (status, cmd_output) = commands.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
                print("{:<90}{:<10}{:<15}{:<15.5}".format(formt_type, cnt[lis[i][0]], lis[i][1] / unit,
                                                          lis[i][1] / total_time * 100))
                total_cnt += cnt[lis[i][0]]
    elif sys.version_info.major == 3:
        print("{:<90}{:<10}{:<20}{:<20}".format("Op type", "count", "time(ms)", "%"))
        import subprocess
        for i in range(len(lis)):
            (status, cmd_output) = subprocess.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
                print("{:<150}{:<10}{:<25}{:<20.5}".format(formt_type, cnt[lis[i][0]], lis[i][1] / unit,
                                                           lis[i][1] / total_time * 100))
                total_cnt += cnt[lis[i][0]]

    print("Total count is {}".format(total_cnt))


def count_head_spaces(s: str) -> int:
    count = 0
    for char in s:
        if char == ' ':
            count += 1
        else:
            break
    return count


def process_line(lines, pattern1, unit_factor, dump_level):
    """Process the lines of a file containing profiling info.

    Args:
        unit_factor: A factor differentiated by KUNLUN1 and KUNLUN2.
    """
    res = {}
    cnt = {}
    op = "init_op"
    unit = unit_factor * 1000 * 1000  # ns -> ms
    wait_next_one = False
    for i in range(len(lines)):
        cur_line = lines[i]
        if "gtest_" in cur_line:
            cur_level = count_head_spaces(cur_line) / tab_space_num
            if cur_level == dump_level:
                wait_next_one = False
                print_info_op(res, cnt, unit, op)
                # clear buffers
                res = {}
                cnt = {}
                op = cur_line.lstrip().rstrip()
            elif cur_level < dump_level:
                wait_next_one = True
                # skip recording kernel time until the next one
                continue
        if wait_next_one:
            # skip recording kernel time
            continue
        match = re.match(pattern1, lines[i])
        if match:
            op_type = match.group(1)
            op_time = match.group(2)
            if op_type in res:
                res[op_type] += float(op_time)
                cnt[op_type] += 1
            else:
                res[op_type] = float(op_time)
                cnt[op_type] = 1

    # flush the remaining totals
    if dump_level == -1:
        print_info_kernel(res, cnt, unit)
    else:
        print_info_op(res, cnt, unit, op)
    return res


def process_file(file_name, pattern2, unit_factor, dump_level=-1):
    """Process a file line by line.

    Iteratively process each line in the target file.
    """
    with open(file_name, "r") as f:
        lines = f.readlines()
        f1_res_list = process_line(lines, pattern2, unit_factor, dump_level)


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()

    group = parser.add_mutually_exclusive_group()
    group.add_argument('-xpu1', action='store_true', help='select xpu1')
    group.add_argument('-xpu2', action='store_true', help='select xpu2')
    group.add_argument('-xpu3', action='store_true', help='select xpu3')
    parser.add_argument('--level', type=int, default=-1, help='dump indentation level (default: -1)')

    parser.add_argument('filename', help='file to process')

    args = parser.parse_args()

    filename = args.filename
    xpu_version = 0
    if args.xpu2:
        xpu_version = 1
    if args.xpu3:
        xpu_version = 2
    dump_level = args.level
    print(f'Filename: {filename}')
    print(f'-xpu option: {xpu_version}')
    print(f'--level option: {dump_level}')

    unit_factor = unit_factors[xpu_version]
    pattern_idx = 0
    if xpu_version > 0:
        pattern_idx = 1
    process_file(filename, patterns[pattern_idx], unit_factor, dump_level)
```
::::

::::{tab-item} op_log.sh

```bash
for i in {0..7}; do
    python op_log.py -xpu3 xpu_logs/rank_${i}.log > analysis_rank${i}.log
    echo "Rank ${i} analysis complete"
done

for i in {0..7}; do
    echo "=== Rank $i ==="
    head -n 6 analysis_rank${i}.log | tail -n 5
done
```
::::
:::::
##### 📈 Output Example (analysis_rank0.log)

```
Filename: xpu_logs/rank_0.log
-xpu option: 2
--level option: -1
Total time(ms) is 53742.29571862069
Op type                                                                                              count     time(ms)              %
void xblas_xpu3::fc_cdnn_infer<float16, float16, float16, float16, float, float, float, float, 1>    661569    22736.262780689656    42.306
void kl3_all_reduce<float16>                                                                         176134    14782.525712413793    27.506
void kl3_all_reduce_butterfly<float16>                                                               164864    4197.28395862069      7.81
```
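To compare ranks at a glance, the `Total time(ms)` line of each analysis file can be extracted and tabulated. A sketch assuming the file naming produced by op_log.sh above:

```python
import re


def total_time_ms(text: str) -> float:
    """Extract the 'Total time(ms) is X' value from an op_log.py analysis log."""
    m = re.search(r"Total time\(ms\) is ([\d.]+)", text)
    if m is None:
        raise ValueError("no total-time line found")
    return float(m.group(1))


# Example on the output shown above
sample = "Filename: xpu_logs/rank_0.log\nTotal time(ms) is 53742.29571862069\n"
print(total_time_ms(sample))  # 53742.29571862069

# For all ranks (files written by op_log.sh):
# totals = {i: total_time_ms(open(f"analysis_rank{i}.log").read()) for i in range(8)}
```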
##### 🚨 Troubleshooting Guide

| Symptom | Cause | Solution |
| - | - | - |
| `xpu_logs` directory is empty | XPUAPI_DEBUG not enabled | Verify that the environment variable is correctly set |
| All 8 log files have identical content | Multi-process backend not activated | Ensure `--distributed-executor-backend mp` is specified |
| Throughput drops >15% | Profiling overhead too high | Enable profiling only during analysis; disable in production |