Initial commit for vLLM-Kunlun Plugin

This commit is contained in:
dongxinyu03
2025-12-10 12:05:39 +08:00
commit c728e52505
131 changed files with 28816 additions and 0 deletions

View File

@@ -0,0 +1,70 @@
# Contributing
## Building and Testing
It's recommended to set up a local development environment to build vllm-kunlun and run tests
before you submit a PR.
#### Run models locally
After completing Run lint setup which is shown in quicksatrt, you can run your changed locally:
```{code-block} bash
:substitutions:
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8356 \
--model /your_modified_models\
--trust-remote-code \
--tensor-parallel-size 1 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name your_modified_models \
--compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
"vllm.unified_attention", "vllm.unified_attention_with_output",
"vllm.mamba_mixer2"]}' \
```
Please save a screenshot of your service running successfully, and attach an accuracy report.
#### Submit the commit
```bash
# Commit changed files using `-s`
git commit -sm "your commit info"
```
🎉 Congratulations! You have completed the development environment setup.
## PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Attention]` for new features or optimization in attention.
- `[Communicator]` for new features or optimization in communicators.
- `[ModelRunner]` for new features or optimization in model runner.
- `[Platform]` for new features or optimization in platform.
- `[Worker]` for new features or optimization in worker.
- `[Core]` for new features or optimization in the core vllm-kunlun logic (such as platform, attention, communicators, model runner)
- `[Kernel]` for changes affecting compute kernels and ops.
- `[Bugfix]` for bug fixes.
- `[Doc]` for documentation fixes and improvements.
- `[Test]` for tests (such as unit tests).
- `[CI]` for build or continuous integration improvements.
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.
:::{note}
If the PR spans more than one category, please include all relevant prefixes.
:::
## Others
If you find any problem when contributing, you can join our slack group to talk with us and then feel free to submit a PR to improve the doc to help other developers.
:::{toctree}
:caption: Index
:maxdepth: 1
testing
multi_node_test
:::

View File

@@ -0,0 +1,271 @@
## Operator accuracy test
### torch_xray
torch_xray is an operator precision analysis tool that can dump module-level input-output precision comparisons and automatically construct operator unit tests.
#### 1.Download and install
***\*python3.10:\****
bos:/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-cp310-cp310-linux_x86_64.whl
[https://su.bcebos.com/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/](https://su.bcebos.com/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-py3-none-any.whl)torch_xray-999.9.9-cp310-cp310-linux_x86_64.whl
***\*python3.8:\****
bos:/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-cp38-cp38-linux_x86_64.whl
[https://su.bcebos.com/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/](https://su.bcebos.com/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-py3-none-any.whl)torch_xray-999.9.9-cp38-cp38-linux_x86_64.whl
Note that the same installation package must be used when using it in different environments.
#### 2.Use
##### Dump module-level inputs and outputs and compare their precision.
Below is a sample code snippet used to dump the input and output of the vision module and compare the errors in the vllm framework.
```bash
from torch_xray import PrecisionDebugger
def execute_model(
self,
scheduler_output: "SchedulerOutput",
intermediate_tensors: Optional[IntermediateTensors] = None,
) -> Union[ModelRunnerOutput, AsyncModelRunnerOutput, IntermediateTensors]:
# dump_path # Path to store dump results
# rank # Rank that needs to be dumped
# step # Setting the inference value to 1 is sufficient.
# model # The module to be dumped must be of type nn.module
debugger = PrecisionDebugger(dump_path="dump-vision", hook_name="dump", rank=[0], step=[1], model=self.model.visual, dump_torch_api=False)
debugger.start()
........
```
The results directory will generate an h5 file and a csv file.
```bash
-rw-r--r-- 1 root root 471231309 Oct 31 13:12 globalrank-0_localrank-0.h5
-rw-r--r-- 1 root root 71 Oct 31 13:11 globalrank-0_localrank-0_summary.csv
```
##### Data processing
```bash
summary xxx.h5 sum.txt
```
The generated h5 file is processed using the summary command to generate a txt file in which the results are presented in tabular form.
```bash
+-------+------+------+-----------------------------------------------------------+-------------+-------------+--------------+-------------+
| Index | Step | Rank | Module | Min | Max | Mean | Std |
+-------+------+------+-----------------------------------------------------------+-------------+-------------+--------------+-------------+
| 0 | 1 | 0 | patch_embed.proj.Conv3d.0.forward_params.weight | -0.0776367 | 0.0795898 | 6.8e-06 | 0.0072608 |
| 1 | 1 | 0 | patch_embed.proj.Conv3d.0.forward_params.bias | -3.046875 | 2.953125 | 0.0113748 | 0.3257138 |
| 2 | 1 | 0 | patch_embed.proj.Conv3d.0.forward_input.0 | -0.7490234 | 0.7021484 | 0.3302804 | 0.2339017 |
| 3 | 1 | 0 | patch_embed.proj.Conv3d.0.forward_output.0 | -4.0078125 | 5.1210938 | 0.0147052 | 0.3815643 |
| 4 | 1 | 0 | pos_embed.Embedding.0.forward_params.weight | -13.8125 | 20.25 | 0.0010043 | 0.2428094 |
| 5 | 1 | 0 | pos_embed.Embedding.0.forward_input.0 | 0.0 | 2303.0 | 1153.9191895 | 714.594360 |
| 6 | 1 | 0 | pos_embed.Embedding.0.forward_output.0 | -13.8125 | 20.25 | 0.0007552 | 0.2643428 |
| 7 | 1 | 0 | rotary_pos_emb.Qwen2_5_VisionRotaryEmbedding.0.forward... | 0.0 | 25.0 | 1.7337022 | 3.9271674 |
| 8 | 1 | 0 | blocks.0.norm1.LayerNorm.0.forward_params.weight | -0.5351562 | 3.140625 | 0.4660275 | 0.7907906 |
| 9 | 1 | 0 | blocks.0.norm1.LayerNorm.0.forward_params.bias | -2.359375 | 2.921875 | 0.0013793 | 0.1879374 |
| 10 | 1 | 0 | blocks.0.norm1.LayerNorm.0.forward_input.0 | -15.65625 | 20.21875 | 0.0155256 | 0.4382802 |
| 11 | 1 | 0 | blocks.0.norm1.LayerNorm.0.forward_output.0 | -6.1640625 | 6.7460938 | 0.0006746 | 0.2708515 |
| 12 | 1 | 0 | blocks.0.attn.qkv.QKVParallelLinear.0.forward_params.bias | -6.125 | 6.1875 | -0.0292423 | 0.8602651 |
| 13 | 1 | 0 | blocks.0.attn.qkv.QKVParallelLinear.0.forward_input.0 | -6.1640625 | 6.7460938 | 0.0006746 | 0.2708515 |
| 14 | 1 | 0 | blocks.0.attn.qkv.QKVParallelLinear.0.forward_output.0 | -6.5859375 | 7.6171875 | -0.0125549 | 1.0678084 |
| 15 | 1 | 0 | blocks.0.attn.proj.RowParallelLinear.0.forward_params... | -3.578125 | 3.203125 | -0.0043617 | 0.4846557 |
| 16 | 1 | 0 | blocks.0.attn.proj.RowParallelLinear.0.forward_input.0 | -1.9130859 | 1.4375 | 0.0005577 | 0.0947055 |
| 17 | 1 | 0 | blocks.0.attn.proj.RowParallelLinear.0.forward_output.0 | -9.109375 | 7.3867188 | -0.0034284 | 0.4465481 |
| 18 | 1 | 0 | blocks.0.norm2.LayerNorm.1.forward_params.weight | -0.1376953 | 14.5625 | 1.9166113 | 3.017405 |
| 19 | 1 | 0 | blocks.0.norm2.LayerNorm.1.forward_params.bias | -1.6328125 | 3.84375 | 0.0062865 | 0.2443586 |
| 20 | 1 | 0 | blocks.0.norm2.LayerNorm.1.forward_input.0 | -8.5859375 | 11.109375 | 0.0120974 | 0.4243064 |
| 21 | 1 | 0 | blocks.0.norm2.LayerNorm.1.forward_output.0 | -12.015625 | 14.265625 | -0.0012364 | 0.4973041 |
| 22 | 1 | 0 | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forwar... | -9.4375 | 0.7304688 | -2.4200516 | 1.6754951 |
| 23 | 1 | 0 | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forwar... | -12.015625 | 14.265625 | -0.0012364 | 0.4973041 |
| 24 | 1 | 0 | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forwar... | -12.59375 | 13.0625 | -2.1465943 | 1.8433502 |
| 25 | 1 | 0 | blocks.0.mlp.act_fn.GELU.0.forward_input.0 | -12.59375 | 13.0625 | -2.1465943 | 1.8433502 |
+-------+------+------+-----------------------------------------------------------+-------------+-------------+--------------+-------------+
```
##### Accuracy Comparison
```bash
# The results are stored in result.csv
compare xpu.h5 gpu.h5 result.csv
```
The `compare` command is used to process the H5 files generated on the GPU and XPU, resulting in a CSV file. This CSV file is then downloaded to the local machine and opened with Excel, yielding a result similar to the image below.
If you encounter a "no matched keys" problem, please refer to the instructions at the end of this article for a solution.
##### Example of results
```bash
+-------+--------+-----------------------------------------------------------+--------+-----------+-------------+-------------+--------+
| Index | Status | Module (Bench/Target) | Cosine | RMSE | IsClose (%) | Max Err (t) | GtNum |
+-------+--------+-----------------------------------------------------------+--------+-----------+-------------+-------------+--------+
| 0 | | patch_embed.proj.Conv3d.0.forward_params.weight | 1 | 0 | 100 | 0 | 0 |
| 1 | | patch_embed.proj.Conv3d.0.forward_params.bias | 1 | 0 | 100 | 0 | 0 |
| 2 | | patch_embed.proj.Conv3d.0.forward_input.0 | 1 | 0 | 100 | 0 | 0 |
| 3 | | patch_embed.proj.Conv3d.0.forward_output.0 | 1 | 9.90E-06 | 100 | 0.001953 | 267 |
| 4 | | pos_embed.Embedding.0.forward_params.weight | 1 | 0 | 100 | 0 | 0 |
| 5 | | pos_embed.Embedding.0.forward_input.0 | 1 | 0 | 100 | 0 | 0 |
| 6 | | pos_embed.Embedding.0.forward_output.0 | 1 | 0 | 100 | 0 | 0 |
| 7 | | rotary_pos_emb.Qwen2_5_VisionRotaryEmbedding.0.forward... | 1 | 0 | 100 | 0 | 0 |
| 8 | | blocks.0.norm1.LayerNorm.0.forward_params.weight | 1 | 0 | 100 | 0 | 0 |
| 9 | | blocks.0.norm1.LayerNorm.0.forward_params.bias | 1 | 0 | 100 | 0 | 0 |
| 10 | | blocks.0.norm1.LayerNorm.0.forward_input.0 | 1 | 1.14E-05 | 100 | 0.00390625 | 216 |
| 11 | | blocks.0.norm1.LayerNorm.0.forward_output.0 | 1 | 1.84E-05 | 99.98 | 0.0078125 | 1585 |
| 12 | | blocks.0.attn.qkv.QKVParallelLinear.0.forward_params.bias | 1 | 0 | 100 | 0 | 0 |
| 13 | | blocks.0.attn.qkv.QKVParallelLinear.0.forward_input.0 | 1 | 1.84E-05 | 99.98 | 0.0078125 | 1585 |
| 14 | | blocks.0.attn.qkv.QKVParallelLinear.0.forward_output.0 | 1 | 0.0002776 | 99.53 | 0.00390625 | 119074 |
| 15 | | blocks.0.attn.proj.RowParallelLinear.0.forward_params... | 1 | 0 | 100 | 0 | 0 |
| 16 | | blocks.0.attn.proj.RowParallelLinear.0.forward_input.0 | 1 | 3.40E-05 | 99.07 | 0.0012207 | 52482 |
| 17 | | blocks.0.attn.proj.RowParallelLinear.0.forward_output.0 | 1 | 0.0001283 | 99.07 | 0.00390625 | 50591 |
| 18 | | blocks.0.norm2.LayerNorm.1.forward_params.weight | 1 | 0 | 100 | 0 | 0 |
| 19 | | blocks.0.norm2.LayerNorm.1.forward_params.bias | 1 | 0 | 100 | 0 | 0 |
| 20 | | blocks.0.norm2.LayerNorm.1.forward_input.0 | 1 | 0.0001437 | 99.01 | 0.0039062 | 31376 |
| 21 | Fail | blocks.0.norm2.LayerNorm.1.forward_output.0 | 1 | 0.0002779 | 98.72 | 0.015625 | 40770 |
| 22 | | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward... | 1 | 0 | 100 | 0 | 0 |
| 23 | Fail | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward... | 1 | 0.0002779 | 98.72 | 0.015625 | 40770 |
| 24 | | blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward... | 1 | 0.000779 | 98.67 | 0.0078125 | 196313 |
| 25 | | blocks.0.mlp.act_fn.GELU.0.forward_input.0 | 1 | 0.000779 | 98.67 | 0.0078125 | 196313 |
| 26 | | blocks.0.mlp.act_fn.GELU.0.forward_output.0 | 1 | 0.0001012 | 98.08 | 0.0039062 | 153508 |
+-------+--------+-----------------------------------------------------------+--------+-----------+-------------+-------------+--------+
```
Generally, the main focus is on Min Err/Max Err.
##### Indicator Explanation
To be improved...
#### The dump operator is tested and run.
```bash
X_DEBUG=0x102 # trace operator name、arguments shape、dtype、data_range
X_DEDUP=True # Remove duplicates based on shape and dtype.
X_DUMP_NUM # The default value is 0, meaning no tensor data is saved. Setting it to n means that n parameters are randomly selected from each operator to save the actual parameters.
```
Below is a sample code snippet that dumps information such as the size and dtype of the forward operator of Qwen3_VisionTransformer. During runtime, an xray_debug directory will be automatically created in the current directory to store the dump results.
```bash
from torch_xray import begin_dump, end_dump
.............

class Qwen3_VisionTransformer(nn.Module):

def __init__(
self,
vision_config: Qwen3VLVisionConfig,
norm_eps: float = 1e-6,
quant_config: Optional[QuantizationConfig] = None,
prefix: str = "",
use_data_parallel: bool = False,
) -> None:
super().__init__()
self.hidden_size = vision_config.hidden_size
..........
def forward(
self,
x: torch.Tensor,
grid_thw: list[list[int]],
) -> torch.Tensor:
# Start dump
# X_DEBUG=0x102 # trace operator name、arguments shape、dtype、data_range
# X_DEDUP=True # Remove duplicates based on shape and dtype.
# The default value is 0, meaning no tensor data is saved. Setting it to n means that n parameters are randomly selected from each operator to save the actual parameters.
begin_dump(X_DEBUG=0x102, X_DEDUP=True, X_DUMP_NUM=5)
hidden_states = x.to(device=self.device, dtype=self.dtype)
hidden_states = self.patch_embed(hidden_states)
...........
# End dump
end_dump(clear_context=True)
return hidden_states
```
This is the file directory.
```bash
├── xary_debug/
│ ├── proc_xxx/ # Process-based storage results
│ ├── dump/ # The dumped tensor
│ ├── dump.json # Information needed to generate unit tests, such as input/output size and dtype.
```
##### Generate unit test
jprof --cpu_init --blacklist --factory=load dump.json
Create a pytests directory in the current directory to store unit tests.
##### Run unit test
The GPU only needs to copy the XPU's pytests directory and execute it.
Since the unit test program defaults to finding the actual dumped tensors using relative paths, this step must be performed in the xary_debug/ directory.
```bash
# detail_compare_path stores the unit test results.
pytest --detail_compare_path=./xxx.csv proc_xxx/pytests/ --seed 42
```
##### Results Comparison
```bash
# After obtaining two result CSV files, compare them and generate result.csv.
summary_diff_check ./xpu.csv ./gpu.csv ./result.csv
```
##### Example of results
```bash
+------------+-----------------------+-------------+-------------+-----------+----------+---------+---------+----------+
| name | op_name | dtype | shape | min-val | max-val | is_pass | xpu_max | gpu_max |
+------------+-----------------------+-------------+-------------+-----------+----------+---------+---------+----------+
| 00004-aten | aten.linspace.default | torch.float | [10] | 0 | 47 | pass | 0 | 1.91E-06 |
| 00005-aten | aten.linspace.default | torch.float | [26] | 0 | 47 | pass | 0 | 0 |
| 00027-aten | aten.add.Tensor | torch.int64 | [10, 26] | 0 | 0 | pass | 0 | 0 |
| 00028-aten | aten.add.Tensor | torch.int64 | [10, 26] | 0 | 0 | pass | 0 | 0 |
| 00037-aten | aten.add.Tensor | torch.float | [260, 1152] | -29.09375 | 33.75 | pass | 0 | 0 |
| 00038-aten | aten.add.Tensor | torch.float | [260, 1152] | -27.1875 | 37.625 | pass | 0 | 0 |
| 00047-aten | aten.add.Tensor | torch.float | [260, 1152] | -28.98438 | 42.34375 | pass | 0 | 0 |
| 00082-aten | aten.sub.Tensor | torch.int32 | [1] | 0 | 0 | pass | 0 | 0 |
+------------+-----------------------+-------------+-------------+-----------+----------+---------+---------+----------+
```
The main focus is on the values of gpu_1e-1, xpu_1e-1, etc., which represent the number of elements whose error between the gpu/xpu result and the cpu result exceeds the order of 1e-n. This serves as the primary basis for determining whether there is a problem with the operator's precision.
#### Replenish
##### Bypassing the issue of differing naming conventions between Kunlun Card and GPU modules, which prevents diff calculation.
```bash
#
blocks.0.mlp.linear_fc1.ColumnParallelLinear.0.forward_params.bias
#
blocks.0.mlp.linear_fc1.ColumnParalleLinear.forward_params.bias
```
As shown in the figure above, due to various reasons, the module names dumped by the GPU and XPU are often different, and the compare command cannot be used to identify them directly.
```python
for step in steps: # (['/'] for group creation order h5py >= 3.10.0)
# for bench_key, target_key in get_matched_names(
# list(dump_ben[str(step)].keys()),
# list(dump_tar[str(step)].keys()),
# fuzzy_match,
# ):
for bench_key, target_key in zip(
list(dump_ben[str(step)].keys()),
list(dump_tar[str(step)].keys()),
):
```
Modify torch_xray/compare/compare.py to skip the get_matched_name step. This modification will allow for line-by-line comparison even if module names differ, producing a compare result. However, it's crucial to ensure that the number of rows in the GPU and XPU dumps is consistent.

View File

@@ -0,0 +1,240 @@
## Overall accuracy test
### EvalScope
#### 1.Download and install
EvalScope supports use in Python environments. Users can install EvalScope via pip or from source code. Here are examples of both installation methods:
```bash
#pip
pip install evalscope[perf] -U
#git
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'
```
#### 2.Dataset preparation script
```python
from evalscope.collections import CollectionSchema, DatasetInfo, WeightedSampler
from evalscope.utils.io_utils import dump_jsonl_data
import os # Step 1: Import the os module
schema = CollectionSchema(
name="VL-Test",
datasets=[
CollectionSchema(
name="PureText",
weight=1,
datasets=[
DatasetInfo(
name="mmlu_pro",
weight=1,
task_type="exam",
tags=["en"],
args={"few_shot_num": 0},
),
DatasetInfo(
name="ifeval",
weight=1,
task_type="instruction",
tags=["en"],
args={"few_shot_num": 0},
),
DatasetInfo(
name="gsm8k",
weight=1,
task_type="math",
tags=["en"],
args={"few_shot_num": 0},
),
],
),
CollectionSchema(
name="Vision",
weight=2,
datasets=[
DatasetInfo(
name="math_vista",
weight=1,
task_type="math",
tags=["en"],
args={"few_shot_num": 0},
),
DatasetInfo(
name="mmmu_pro",
weight=1,
task_type="exam",
tags=["en"],
args={"few_shot_num": 0},
),
],
),
],
)
# get the mixed data
mixed_data = WeightedSampler(schema).sample(1000)
output_path = "outputs/vl_test.jsonl" # Step 2: Define the output file path
output_dir = os.path.dirname(output_path) # Step 3: Obtain the directory name
if not os.path.exists(output_dir): # Step 4: Check if the directory exists
os.makedirs(output_dir, exist_ok=True) # Step 5: Automatically create directories
# dump the mixed data to a jsonl file
dump_jsonl_data(mixed_data, output_path) # Step 6: Securely write to the file
```
Dataset composition visualization:
```
┌───────────────────────────────────────┐
│ VL-Test (1000 samples) │
├─────────────────┬─────────────────────┤
│ PureText │ Vision │
│ (333 samples) │ (667 samples) │
├─────────────────┼─────────────────────┤
│ • mmlu_pro │ • math_vista │
│ • ifeval │ • mmmu_pro │
│ • gsm8k │ │
└─────────────────┴─────────────────────┘
```
#### 3.Test
```python
from dotenv import dotenv_values
from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType
task_cfg = TaskConfig(
model="Qwen2.5-VL-7B-Instruct",
api_url="http://localhost:8804/v1",
api_key="EMPTY",
eval_type=EvalType.SERVICE,
datasets=[
"data_collection",
],
dataset_args={
"data_collection": {
"local_path": "../outputs/vl_test.jsonl",
}
},
eval_batch_size=5,
generation_config={
"max_tokens": 30000, # The maximum number of tokens that can be generated should be set to a large value to avoid output truncation.
"temperature": 0.6, # Sampling temperature (recommended value from qwen report)
"top_p": 0.95, # top-p sampling (recommended value from qwen report)
"top_k": 20, # Top-k sampling (recommended value from qwen report)
"n": 1, # Number of responses generated per request
"repetition_penalty": 1.0, # 1.0 = Penalty disabled, >1.0 = Penalty repeated.
},
)
run_task(task_cfg=task_cfg)
```
Parameter Tuning Guide:
| Parameter | Current value | Effect | Adjustment suggestions |
| ----------------- | ------------- | ---------------------------------------- | -------------------------------------------------------- |
| `temperature` | 0.6 | Control output diversity | Math problems ↓ 0.3 / Creative writing ↑ 0.9 |
| `top_p` | 0.95 | Filtering low-probability tokens | Reduce "nonsense" |
| `eval_batch_size` | 5 | Number of requests processed in parallel | With sufficient video memory, it can be increased to 10. |
Run the test:
```bash
#!/bin/bash
# ========================================
# Step 1: Set the log file path
# ========================================
LOG_FILE="accuracy_$(date +%Y%m%d_%H%M).log"
# ========================================
# Step 2: Execute the Python script and capture all output
# Meaning of 2>&1:
# - 2 represents standard error output (stderr)
# ->& represents redirection and merging
# - 1 represents standard output (stdout)
# Function: Merges error messages into standard output as well.
# ========================================
python accuracy.py 2>&1 | tee "$LOG_FILE"
# ========================================
# Step 3: Check Execution Status
# ${PIPESTATUS[0]} Get the exit code of the first command (Python) in the pipeline
# ========================================
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -eq 0 ]; then
echo "✅ Evaluation completed! Log saved to: $LOG_FILE"
else
echo "❌ Evaluation failed! Exit code: $EXIT_CODE Please check the log: $LOG_FILE"
fi
```
#### 4.Common problem fixes
##### 4.1 NLTK resource missing fix
```bash
Resource punkt_tab not found.
```
Solution
```python
import nltk
import os
# Step 1: Set the download path (select a writable directory)
download_dir = "/workspace/myenv/nltk_data"
os.makedirs(download_dir, exist_ok=True)
# Step 2: Configure NLTK data path
nltk.data.path.append(download_dir)
# Step 3: Download necessary resources
print("🔽 Start downloading punkt_tab resource...")
try:
nltk.download("punkt_tab", download_dir=download_dir)
print("✅ Download successful!")
except Exception as e:
print(f"❌ Download failed: {e}")
print("💡 Alternative: Download manually from GitHub")
print(
" URL: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip"
)
```
repair:
```bash
# Activate environment
source /workspace/myenv/bin/activate
# Run the repair script
python fix_nltk.py
# Rerun the test
bash run_accuracy_test.sh
```
#### 5.Results Display
```bash
+-------------+---------------------+--------------+---------------+-------+
| task_type | metric | dataset_name | average_score | count |
+-------------+---------------------+--------------+---------------+-------+
| exam | acc | mmmu_pro | 0.521 | 334 |
| math | acc | math_vista | 0.6066 | 333 |
| exam | acc | mmlu_pro | 0.5405 | 111 |
| instruction | prompt_level_strict | ifeval | 0.6937 | 111 |
| math | acc | gsm8k | 0.8288 | 111 |
+-------------+---------------------+--------------+---------------+-------+
```

View File

@@ -0,0 +1,10 @@
# Accuracy
This document details the accuracy testing methods for vllm-kunlun and the analysis of the results.
:::{toctree}
:caption: Accuracy
:maxdepth: 1
accuracy_server
accuracy_kernel
:::

View File

@@ -0,0 +1,18 @@
# GLM-Air-4.5
* vLLM Version: vLLM: 0.10.1.1 , vLLM-KunLun Version: v0.10.1.1
* Software Environment:OS: Ubuntu 22.04, PyTorch ≥ 2.5.1
* Hardware Environment: KunLun P800
* Parallel mode:TP8
```bash
+-------------+----------+---------------+---------+-----+--------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+-------------+----------+---------------+---------+-----+--------+---------+
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 1 | 43 | 0.9302 | default |
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 2 | 90 | 0.9222 | default |
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 3 | 105 | 0.8762 | default |
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 4 | 128 | 0.8984 | default |
| GLM-4.5-Air | math_500 | AveragePass@1 | Level 5 | 134 | 0.8955 | default |
+-------------+----------+---------------+---------+-----+--------+---------+
```

View File

@@ -0,0 +1,18 @@
# GLM-4.5
* vLLM Version: vLLM: 0.10.1.1 , vLLM-KunLun Version: v0.10.1.1
* Software Environment:OS: Ubuntu 22.04, PyTorch ≥ 2.5.1
* Hardware Environment: KunLun P800
* Parallel mode:TP8
```bash
+---------+----------+---------------+---------+-----+--------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+---------+----------+---------------+---------+-----+--------+---------+
| GLM-4.5 | math_500 | AveragePass@1 | Level 1 | 43 | 0.9302 | default |
| GLM-4.5 | math_500 | AveragePass@1 | Level 2 | 90 | 0.8111 | default |
| GLM-4.5 | math_500 | AveragePass@1 | Level 3 | 105 | 0.7143 | default |
| GLM-4.5 | math_500 | AveragePass@1 | Level 4 | 128 | 0.6172 | default |
| GLM-4.5 | math_500 | AveragePass@1 | Level 5 | 134 | 0.5149 | default |
+---------+----------+---------------+---------+-----+--------+---------+
```

View File

@@ -0,0 +1,18 @@
# InternVL3_5-30B-A3B
* vLLM Version: vLLM: 0.10.1.1 , vLLM-KunLun Version: v0.10.1.1
* Software Environment:OS: Ubuntu 22.04, PyTorch ≥ 2.5.1
* Hardware Environment: KunLun P800
* Parallel mode:TP8
```
+-------------+---------------------+--------------+---------------+-------+
| task_type | metric | dataset_name | average_score | count |
+-------------+---------------------+--------------+---------------+-------+
| exam | acc | mmmu_pro | 0.5449 | 334 |
| math | acc | math_vista | 0.6847 | 333 |
| exam | acc | mmlu_pro | 0.6126 | 111 |
| instruction | prompt_level_strict | ifeval | 0.7658 | 111 |
| math | acc | gsm8k | 0.9369 | 111 |
+-------------+---------------------+--------------+---------------+-------+
```

View File

@@ -0,0 +1,18 @@
# Qwen2.5-VL-7B-Instruct
* vLLM Version: vLLM: 0.10.1.1 , vLLM-KunLun Version: v0.10.1.1
* Software Environment:OS: Ubuntu 22.04, PyTorch ≥ 2.5.1
* Hardware Environment: KunLun P800
* Parallel mode:TP1
```
+-------------+---------------------+--------------+---------------+-------+
| task_type | metric | dataset_name | average_score | count |
+-------------+---------------------+--------------+---------------+-------+
| exam | acc | mmmu_pro | 0.521 | 334 |
| math | acc | math_vista | 0.6066 | 333 |
| exam | acc | mmlu_pro | 0.5405 | 111 |
| instruction | prompt_level_strict | ifeval | 0.6937 | 111 |
| math | acc | gsm8k | 0.8288 | 111 |
+-------------+---------------------+--------------+---------------+-------+
```

View File

@@ -0,0 +1,10 @@
# Accuracy Report
:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
Qwen2.5-VL-7B-Instruct
InternVL3_5-30B-A3B
GLM-4.5
GLM-4.5-Air
:::

View File

@@ -0,0 +1,8 @@
# Accuracy
:::{toctree}
:caption: Accuracy
:maxdepth: 1
accuracy/index
accuracy_report/index
:::

View File

@@ -0,0 +1,76 @@
# Kunlun Graph
## Why we need Kunlun Graph?
When in LLM inference, each token requires nearly thousand operator executions, and when host launching operators are slower than device, it will cause host bound. In severe cases, the device will be idle for more than half of the time. To solve this problem, we use graph in LLM inference.
```
eager mode:
host: | launch op1 | launch op2 | launch op3 | launch op4 | launch op5 |
device: | run op1 |free| run op2 |free| run op3 |free| run op4 |free| run op5 |
| <----- total time -----> |
graph mode:
host: | launch graph |
device: | run op1 | run op2 | run op3 | run op4 | run op5 |
| <----- total time -----> |
```
## How to use Kunlun Graph?
Kunlun Graph is enabled by default in V1 Engine, just need to check that `enforce_eager` is not set to `True`.
## How it works?
In short, graph mode works in two steps: **capture and replay**. When engine starts, we will capture all of the ops in model forward and save it as a graph, and when req come in, we just replay the graph on devices, and waiting for result.
But in reality, graph mode is not that simple.
### Padding and Bucketing
Due to graph can only replay the ops captured before, without doing tiling and checking graph input, we need to ensure the consistency of the graph input, but we know that model input's shape depends on the request scheduled by Scheduler, we can't ensure the consistency.
Obviously, we can solve this problem by capturing the biggest shape and padding all of the model input to it. But it will bring a lot of redundant computing and make performance worse. So we can capture multiple graphs with different shape, and pad the model input to the nearest graph, which will greatly reduce redundant computing. But when `max_num_batched_tokens` is very large, the number of graphs that need to be captured will also become very large. But we know that when intensor's shape is large, the computing time will be very long, and graph mode is not necessary in this case. So all of things we need to do is:
1. Set a threshold;
2. When `num_scheduled_tokens` is bigger than the threshold, use `eager_mode`;
3. Capture multiple graphs within a range below the threshold;
```
| graph1 |
| graph2 |
| graph3 |
| graph4 | # the threshold
| input1 | pad | # use graph1
| input2 | # don't need pad
| input3 | pad | # use graph4
| input4 | # use eager mode
```
### Piecewise and Full graph
Due to the increasing complexity of the attention layer in current LLM, we can't ensure all types of attention can run in graph. In MLA, prefill_tokens and decode_tokens have different calculation method, so when a batch has both prefills and decodes in MLA, graph mode is difficult to handle this situation.
vLLM solves this problem with piecewise graph mode. We use eager mode to launch attention's ops, and use graph to deal with others. But it also bring some problems: The cost of launching ops has become large again, although much smaller than eager mode, but it will also lead to host bound when cpu is poor or `num_tokens` is small.
## How it be implemented?
vLLM has already implemented most of the modules in graph mode. You can see more details at: [CUDA Graphs](https://docs.vllm.ai/en/latest/design/cuda_graphs.html)
When in graph mode, vLLM will call `current_platform.get_static_graph_wrapper_cls` to get current device's graph model wrapper, so what we need to do is to implement the graph mode wrapper on Kunlun: `Kunlun Graph Wrapper`.
vLLM has added `support_torch_compile` decorator to all models, this decorator will replace the `__init__` and `forward` interface of the model class, and when `forward` called, the code inside the `vllm_kunlun.compilation` will be executed, and it will do capture or replay as mentioned above.
## Limitation
1. `FULL` and `FULL_AND_PIECEWISE` are not supported now;
3. `use_inductor` is not supported now;

View File

@@ -0,0 +1,9 @@
# Feature Guide
This section provides an overview of the features implemented in vLLM-Kunlun. Developers can refer to this guide to understand how vLLM-Kunlun works.
:::{toctree}
:caption: Feature Guide
:maxdepth: 1
Kunlun_Graph
:::

View File

@@ -0,0 +1,7 @@
# Performance
:::{toctree}
:caption: Performance
:maxdepth: 1
performance_benchmark/index
:::

View File

@@ -0,0 +1,147 @@
## Operator performance
### XProfiler
#### 1.Download and install
- The download link for the x86_64 platform installation package xre-Linux-x86_64 is:
`https://klx-sdk-release-public.su.bcebos.com/xre/kl3-release/5.0.21.26/peermem/xre-Linux-x86_64-5.0.21.26.run`
`https://klx-sdk-release-public.su.bcebos.com/xre/kl3-release/5.0.21.26/peermem/xre-Linux-x86_64-5.0.21.26.tar.gz`
- If the client is using bdCentOS, we recommend using the following download link:
`https://klx-sdk-release-public.su.bcebos.com/xre/kl3-release/5.0.21.26/xre-bdcentos-x86_64-5.0.21.26.tar.gz`
After downloading and extracting, you can directly execute `xpu-installer` and `install_rt.sh` to install.
#### 2.Start using
XProfiler supports three modes: 1) fork mode; 2) time mode; and 3) daemon mode. After execution, XProfiler will generate two types of JSON files:
- xprofiler.settings.json: Records the event configuration for this trace.
- xprofiler.trace.json: Records the results of this trace.
The specific modes will be introduced below.
##### fork mode
The fork pattern is used to track the entire time period from the start to the end of a user program. This pattern is suitable for most inference tasks and is the simplest to use. An example is shown below:
```bash
/xxxx/xxxx/xprofiler -r500 --xpu=0 python test.py
```
- --r: Sets the trace time resolution in nanoseconds (ns). The default is 100. If an "out of space error" occurs, try increasing the -r value to 500.
- --xpu: Specifies the acquisition device ID, supporting multi-card configuration. --xpu=all enables all cards; the default is card 0.
More parameters can be found in the command-line parameters section later.
##### time mode
The time mode is used to track user programs for a period of time. This method is suitable for tasks that need to run for a long time.
Using the -t or --time command-line parameter, XPorfiler will run for the specified time and then exit, in seconds. In this mode, the application needs to be started separately. An example is as follows:
(1) Starting XPorfiler
```bash
/xxxx/xxxx/xprofiler -r 500 --xpu=0 -t600 # Time mode collects events within a specified time period, measured in seconds (s).
```
A temporary .sock file will be generated in the execution directory. The path needs to be configured in the environment variables.
(2) Start the program
```bash
export XPU_ENABLE_PROFILER_TRACING=1
export XPU_TRACING_OUTPUT_NAME=<xprofiler execution directory>/xprofiler.sock
# Start your own program
python xxx.py
```
##### deamon mode
The daemon mode is used to track the event timeline of a specified code segment, eliminating interference from redundant information. The startup command is the same as in fork mode.
(1) Insert start and stop interfaces.
```python
import xtorch_ops
# Only capture events during the generate phase
xtorch_ops.kunlun_profiler_start()
outputs = llm.generate(
inputs,
sampling_params=sampling_params,
lora_request=lora_request,
)
xtorch_ops.kunlun_profiler_end()
```
(2) Launch X profiler in a terminal
```python
# Specify the output file as the trace_output file in the current path.
/xxxx/xxxx/xprofiler-Linux_x86_64-2.0.2.0/bin/xprofiler -r 500 --xpu=0 -e ./trace_output -d
```
After startup, a .sock file will be generated in the current directory.
```bash
xprofiler.sock
```
(3) Launch your own program on another terminal.
```python
export XPU_ENABLE_PROFILER_TRACING=1
# Here, the path to the .sock file from step 2 is used for assignment.
export XPU_TRACING_OUTPUT_NAME=<xprofiler execution directory>/xprofiler.sock
# Start your own program
python xxx.py
```
Note: If you want to specify a particular card to run on, you must import the XPU_VISIBLE_DEVICES environment variable in the terminal in steps 2 and 3; otherwise, you will not be able to capture the data.
##### More parameters
| parameters | Example | default value | describe |
| -------------------------- | --------------------------------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| -b or --buffer-size | -b=512 | 256 | Specifies the size of the trace buffer in MB. This is generally not required. However, if there are many trace signals, the buffer size can be increased appropriately to avoid OOS (Out of Size). |
| -x or --xpu | -x=0--xpu=0 | 0 | Set the card number to be tracked; multiple cards or all cards can be set. |
| -t or --time | -t=10 | off | Enable time mode, in seconds, to capture information over a specified period. |
| -d or --deamonize | -r500 | 0 | Enable daemon mode to retrieve events in the background. |
| -r or --export-profile | -e ./trace_output-e ./output/trace.json | ./ | Record the trace results to a document or folder. If this parameter is not specified, a default xprofiler.trace.json file will be generated in the execution directory. |
| -S or --settings | -S xprofiler.trace.json | off | xprofiler reads a JSON file containing the events that need to be traced. If this parameter is not configured, xprofiler enables `--profile-api-trace` and `--sse-trace` by default. |
| -A or --profiler-api-trace | -A | on | Get driver events. |
| -s or --sse-trace | -s | on | Get all SSE events. |
| -C or --cluster-trace | -C | off | Retrieve all cluster events. |
| -n or --sdnn-trace | -n | off | Get all SDNN events. |
| -c or --sdnn-cluster-trace | -c | off | Retrieve all SDNN cluster events. |
| -E or --cache-trace | -E | off | Get bandwidth statistics events. |
| -u or --debug | -u44:open logdebug level-u0:close log | 33 | Debug the interface and enable driver event/device event logging.。 |
#### 3.View Results
The generated xprofiler.trace.json file can be viewed and analyzed using a visual interface. Two tools are introduced here.
##### Chrome browser
Enter chrome://tracing/ in your browser (you may need to enable developer tools the first time you access this site), and click "load" in the top left corner to import the file. Interface display.
![img](https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=89aef70f112a4394adcac8b03ef994db&docGuid=WFoZOcuqnSXJIE)
##### prefetto ui
Search directly, or visit[Perfetto UI](https://ui.perfetto.dev/#!/viewer?local_cache_key)The interface is as follows。
![img](https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=895a715344e9473c9ee93518c3064b27&docGuid=WFoZOcuqnSXJIE)
#### 4.Performance Analysis
With various performance data available, analysis and optimization can then be performed based on the results.
(Further details to be added later)

View File

@@ -0,0 +1,199 @@
## vLLM server performance
### vLLM benchmark CLI
You can directly use vLLM's CLI benchmark. For more details, please refer to[vLLM Developer Guide Benchmark Suites](https://docs.vllm.ai/en/stable/contributing/benchmarks.html)
#### 1.Online testing
##### 1.1Start the vLLM server
Server startup script reference
```bash
USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port xxxx \
--model /xxxx/xxxx/model\
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--dtype float16 \
--max_num_seqs 128 \
--max_num_batched_tokens 32768 \
--max-seq-len-to-capture 32768 \
--block-size 128 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name modelname \
--compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
"vllm.unified_attention", "vllm.unified_attention_with_output",
"vllm.mamba_mixer2"]}' \
```
##### 1.2Execute test
To run the test script, you can refer to the code below.
```bash
#!/bin/bash
# Run benchmark tests
python -m vllm.entrypoints.cli.main bench serve \
--host 127.0.0.1 \
--port xxxx \
--backend vllm \
--model modelname \
--dataset-name random \
--num-prompts 500 \
--random-input-len 1024 \
--random-output-len 1024 \
--tokenizer /xxxx/xxxx/model \
--ignore-eos 2>&1 | tee benchmark.log
```
##### 1.3Result
The following content will be displayed after the process is complete.
```bash
========== Serving Benchmark Result ==========
Successful requests: 500
Benchmark duration (s): 144.89
Total input tokens: 510414
Total generated tokens: 512000
Request throughput (req/s): 3.45
Output token throughput (tok/s): 3533.68
Total Token throughput (tok/s): 7056.42
----------Time to First Token----------
Mean TTFT (ms): 57959.61
Median TTFT (ms): 43551.93
P99 TTFT (ms): 116202.52
----------Time per Output Token (excl. 1st token)----------
Mean TPOT (ms): 33.30
Median TPOT (ms): 34.15
P99 TPOT (ms): 35.59
----------Inter-token Latency----------
Mean ITL (ms): 33.30
Median ITL (ms): 29.05
P99 ITL (ms): 46.14
============================================
```
Key Parameter Explanation:
| index | meaning | Optimization Objective |
| --------------------------- | ------------------------------------| ---------- |
| ***\*Output Throughput\**** | Output token generation rate | ↑ The higher the better |
| ***\*Mean TTFT\**** | First Token Delay (Time To First Token) | ↓ The lower the better |
| ***\*P99 TTFT\**** | 99% of requests have delayed first token. | ↓ The lower the better |
| ***\*Mean TPOT\**** | Average generation time per output token | ↓ The lower the better |
| ***\*P99 TPOT\**** | 99% of requests' time per token generation | ↓ The lower the better |
| ***\*ITL\**** | Delay between adjacent output tokens | ↓ The lower the better |
#### 2.Offline testing
Comming soon...
### EvalScope
EvalScope is a comprehensive model testing tool that can test not only model accuracy but also performance. For more information, please visit [website address missing].[EvalScope](https://evalscope.readthedocs.io/en/latest/index.html)A brief introduction follows.
#### 1.Download and install
EvalScope supports use in Python environments. Users can install EvalScope via pip or from source code. Here are examples of both installation methods:
```bash
#pip
pip install evalscope[perf] -U
#git
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'
```
After downloading, some modules may be missing, causing the program to fail to run. Just follow the prompts to install them.
#### 2.Start using
The following demonstrates the performance test of the Qwen3-8B in a single-card scenario.
##### 2.1Start the server
The first step is to start the server. The example script is shown below.
```bash
USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port xxxx \
--model /xxxx/xxxx/Qwen3-8B\
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--dtype float16 \
--max_num_seqs 128 \
--max_num_batched_tokens 32768 \
--max-seq-len-to-capture 32768 \
--block-size 128 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name Qwen3-8B \
--compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
"vllm.unified_attention", "vllm.unified_attention_with_output",
"vllm.mamba_mixer2"]}' \
```
##### 2.2 Start EvalScope
Start EvalScope to begin performance testing.
```bash
evalscope perf \
--parallel 1 10\#The number of concurrent requests can be tested at once, separated by spaces.
--number 10 20\#The total number of requests per request, aligned with spaces and the concurrency count.
--model Qwen3-8B \
--url http://127.0.0.1:xxxx/v1/chat/completions \
--api openai \
--dataset random \
--max-tokens 1024 \
--min-tokens 1024 \
--prefix-length 0 \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--tokenizer-path /xxxx/xxxx/Qwen3-8B\
--extra-args '{"ignore_eos": true}'
```
##### 2.3Results Analysis
The following figure shows the results. You can view other data from a single test through the logs. For the specific meaning of the parameters, please refer to the parameter interpretation in the vLLM benchmark test.
```bash
Performance Test Summary Report
Basic Information:
+-------------------+------------------------+
| Model | Qwen3-8B |
| Total Generated | 30,720.0 tokens |
| Total Test Time | 199.79 seconds |
| Avg Output Rate | 153.76 tokens/sec |
+-------------------+------------------------+
Detailed Performance Metrics
+-------+------+------------+------------+-----------+-----------+-----------+-----------+-----------+---------------+
| Conc. | RPS | Avg Lat.(s)| P99 Lat.(s)| Gen. Toks/s| Avg TTFT(s)| P99 TTFT(s)| Avg TPOT(s)| P99 TPOT(s)| Success Rate |
+-------+------+------------+------------+-----------+-----------+-----------+-----------+-----------+---------------+
| 1 | 0.07 | 16.191 | 16.475 | 70.40 | 0.080 | 0.085 | 0.016 | 0.016 | 100.0% |
| 10 | 0.53 | 18.927 | 19.461 | 540.87 | 0.503 | 0.562 | 0.018 | 0.019 | 100.0% |
+-------+------+------------+------------+-----------+-----------+-----------+-----------+-----------+---------------+
Best Performance Configuration
Highest RPS: Concurrency 10 (0.53 req/sec)
Lowest Latency: Concurrency 1 (16.191 seconds)
Performance Recommendations:
* The system seems not to have reached its performance bottleneck, try higher concurrency
```

View File

@@ -0,0 +1,11 @@
# Performance_benchmark
This document details the performance testing methods for vllm-kunlun and the analysis of the results to ultimately optimize performance. The main considerations are server throughput and operator performance.
:::{toctree}
:caption: Performance
:maxdepth: 1
benchmark_server
benchmark_kernel
profiling
:::

View File

@@ -0,0 +1,418 @@
## Profiling
### 🔧 Action PlanThree Phases
#### Phase 1⃣: Multi-Device Log Redirection Configuration
##### Background
By default, kernel logs from all 8 XPU devices are interleaved and emitted to [stdout], resulting in:
- It becomes impossible to distinguish which log originates from which device.
- Timestamps become interleaved, making it difficult to analyze the temporal relationships.
- Single-device bottlenecks are masked by global aggregation.
##### Solution
During model initialization, create separate log files for each device.
##### Code Explanation (embedded in qwen2.py)
```python
import os # ← Ensure this is imported at the top of the file
from vllm.distributed import get_tensor_model_parallel_rank # ← Import function to get the tensor model parallel rank
class Qwen2Model(nn.Module):
def __init__(self,
*,
vllm_config: VllmConfig,
prefix: str = "",
decoder_layer_type: type[nn.Module] = Qwen2DecoderLayer):
super().__init__()
# ========== [Expert Solution] Kunlun XPU Multi-Device Log Redirection ==========
try:
# Step 1: Get the current XPU device's rank (0~7)
rank = get_tensor_model_parallel_rank()
# Step 2: Create log directory (works with your get_kernel_time_ex.py)
log_dir = "./xpu_logs"
os.makedirs(log_dir, exist_ok=True)
# Step 3: Generate a separate log file for each device
log_file = os.path.join(log_dir, f"rank_{rank}.log")
# Step 4: Core operation redirect file descriptors
# os.O_TRUNC: Clear previous logs on each run to avoid mixing outputs
fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
os.dup2(fd, 1) # Redirect stdout → rank_X.log
os.dup2(fd, 2) # Redirect stderr → rank_X.log
os.close(fd) # Close original file descriptor; redirection persists
# Optional: print a confirmation message (will go into rank_X.log)
print(f"[Qwen2Model Init] Rank {rank} log redirected to {log_file}")
except Exception as e:
# Fallback mechanism: failure to redirect logs does not affect model loading
print(f"[WARNING] Failed to redirect log for rank: {e}", flush=True)
# ========== End of log redirection code ==========
```
##### ⚠️ Common Issues
**Q1**:Why not use Python's `logging` module?
**A**:The XPU runtime kernel logs are emitted from the C++ layer and cannot be captured by Pythons `logging` module. Redirection via low-level file descriptors is required.
**Q1**:Will logs be lost if the model fails to load??
**A**:The `try-except` block ensures that if log redirection fails, it falls back to the default behavior without affecting model startup.
#### Phase 2⃣: Profiling Environment Activation
##### 🚀 vLLM Launch
```bash
unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XPU_USE_MOE_SORTED_THRES=1
export XFT_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_FAST_SWIGLU=1
export VLLM_HOST_IP=$(hostname -i)
echo "VLLM_HOST_IP: $VLLM_HOST_IP"
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
export XPUAPI_DEBUG=0x1 # Enable kernel performance logging
export XPURT_DISPATCH_MODE=PROFILING # Activate profiling mode
USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
      --host 0.0.0.0 \
      --port 8000 \
      --model /models/Qwen2.5-72B-Instruct \
      --gpu-memory-utilization 0.9 \
      --trust-remote-code \
      --max-model-len 32768 \
      --tensor-parallel-size 8 \
      --dtype float16 \
      --max_num_seqs 512 \
      --max_num_batched_tokens 32768 \
      --max-seq-len-to-capture 32768 \
      --block-size 128 \
      --no-enable-prefix-caching \
      --no-enable-chunked-prefill \
      --distributed-executor-backend mp \
      --served-model-name Qwen2.5-72B-Instruct \
      --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
"vllm.unified_attention", "vllm.unified_attention_with_output",
"vllm.mamba_mixer2"]}' 2>&1 | tee output_p800.log
```
##### 🚀 Client Load Testing
```bash
#!/bin/bash
# Define test combinations array (concurrency x input length x output length)
TEST_COMBINATIONS=(
"8x1024x1024" # Medium-low concurrency
)
# Create result directory
RESULT_DIR="bench_$(date +%Y%m%d_%H%M)"
mkdir -p $RESULT_DIR
# Summary results file
SUMMARY_FILE="$RESULT_DIR/summary_results.csv"
echo "num_prompts,input_len,output_len,throughput,latency_mean,latency_p50,latency_p90,latency_p99" >$SUMMARY_FILE
# Progress counter
TOTAL_TESTS=${#TEST_COMBINATIONS[@]}
CURRENT_TEST=0
# Loop through different test combinations
for COMBINATION in "${TEST_COMBINATIONS[@]}"; do
# Parse combination parameters
NUM_PROMPTS=$(echo $COMBINATION | cut -d'x' -f1)
INPUT_LEN=$(echo $COMBINATION | cut -d'x' -f2)
OUTPUT_LEN=$(echo $COMBINATION | cut -d'x' -f3)
# Update progress
CURRENT_TEST=$((CURRENT_TEST + 1))
echo "=========================================================="
echo "Test progress: $CURRENT_TEST/$TOTAL_TESTS ($(printf "%.1f" $(echo "$CURRENT_TEST/$TOTAL_TESTS*100" | bc -l))%)"
echo "Current test configuration: concurrency=$NUM_PROMPTS, input length=$INPUT_LEN, output length=$OUTPUT_LEN"
echo "=========================================================="
OUTPUT_FILE="$RESULT_DIR/p800_${NUM_PROMPTS}_${INPUT_LEN}_${OUTPUT_LEN}.log"
# Run benchmark
python3 -m vllm.entrypoints.cli.main bench serve \
--host 127.0.0.1 \
--port 8000 \
--backend vllm \
--model Qwen2.5-72B-Instruct \
--dataset-name random \
--num-prompts $NUM_PROMPTS \
--random-input-len $INPUT_LEN \
--random-output-len $OUTPUT_LEN \
--tokenizer /ssd1/models/Qwen2.5-72B-Instruct \
--ignore-eos 2>&1 | tee $OUTPUT_FILE
# Wait 15 seconds to let the service recover
echo "Waiting 15 seconds before the next round..."
sleep 15
# Extract key performance metrics from output and append to summary file
THROUGHPUT=$(grep "Throughput" $OUTPUT_FILE | awk '{print $2}')
LATENCY_MEAN=$(grep "Mean latency" $OUTPUT_FILE | awk '{print $3}')
LATENCY_P50=$(grep "p50 latency" $OUTPUT_FILE | awk '{print $3}')
LATENCY_P90=$(grep "p90 latency" $OUTPUT_FILE | awk '{print $3}')
LATENCY_P99=$(grep "p99 latency" $OUTPUT_FILE | awk '{print $3}')
echo "$NUM_PROMPTS,$INPUT_LEN,$OUTPUT_LEN,$THROUGHPUT,$LATENCY_MEAN,$LATENCY_P50,$LATENCY_P90,$LATENCY_P99" >>$SUMMARY_FILE
done
# Output summary report
echo "=========================================================="
echo "Benchmark completed! Results saved in: $RESULT_DIR"
echo "=========================================================="
```
#### Phase 3⃣: Log Analysis and Bottleneck Identification
```lua
xpu_logs/
├─ rank_0.log
├─ rank_1.log
├─ rank_2.log
├─ rank_3.log
├─ rank_4.log
├─ rank_5.log
├─ rank_6.log
└─ rank_7.log
```
##### 🔍 Script Workflow (op_log.py)
**Input**:Raw Kernel Logs (Sample Format)
```
[XPURT_PROF] void xblas_xpu3::fc_cdnn_infer<float16,...> 123456 ns
[XPURT_PROF] void kl3_all_reduce<float16> 987654 ns
```
**Processing logic**
:::::{tab-set}
::::{tab-item} op_log.py
```python
"""
A better version of 'get_op_time.py', get more level dump and support kl3.
 
Usage: python3 get_kernel_time_ex.py --help
"""
 
import os
import sys
import re
 
unit_factors = [0.9, 1.3, 1.45] # kunlun1, kunlun2, kunlun3
patterns = ["\[XPURT_PROF\] (\S+)\s+\S+\s+(\S+) ns", "\[XPURT_PROF\] (\S+)\s+(\S+)\s+\S+ ns"]
tab_space_num = int(4)
 
def get_total_time(res):
    total_time = 0.0
    for i in res.values():
        total_time += i
    return  total_time
 
def print_info_op(res, cnt, unit, op):
    total_time = get_total_time(res)
    total_cnt = 0
    # print detailed op time
    lis=sorted(res.items(), key=lambda d:d[1], reverse=True)
    if sys.version_info.major == 2:
        import commands
        for i in range(len(lis)):
            (status, cmd_output) = commands.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
            total_cnt += cnt[lis[i][0]]
    elif sys.version_info.major == 3:
        import subprocess
        for i in range(len(lis)):
            (status, cmd_output) = subprocess.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
            total_cnt += cnt[lis[i][0]]
    print(f"{op} {total_time / unit} {total_cnt}")
 
def print_info_kernel(res, cnt, unit):
    total_time = get_total_time(res)
    total_cnt = 0
    print("Total time(ms) is {}".format(total_time / unit))
    # print detailed op time
    lis=sorted(res.items(), key=lambda d:d[1], reverse=True)
    if sys.version_info.major == 2:
        print("{:<90}{:<10}{:<15}{:<15}".format("Op type", "count", "time(ms)", "%"))
        import commands
        for i in range(len(lis)):
            (status, cmd_output) = commands.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
            print("{:<90}{:<10}{:<15}{:<15.5}".format(formt_type, cnt[lis[i][0]], lis[i][1] / unit, \
                lis[i][1] / total_time * 100))
            total_cnt += cnt[lis[i][0]]
    elif sys.version_info.major == 3:
        print("{:<90}{:<10}{:<20}{:<20}".format("Op type", "count", "time(ms)", "%"))
        import subprocess
        for i in range(len(lis)):
            (status, cmd_output) = subprocess.getstatusoutput("c++filt {}".format(lis[i][0]))
            if status == 0:
                formt_type = (cmd_output.split('('))[0]
            print("{:<150}{:<10}{:<25}{:<20.5}".format(formt_type, cnt[lis[i][0]], lis[i][1] / unit, \
                lis[i][1] / total_time * 100))
            total_cnt += cnt[lis[i][0]]
 
    print("Total count is {}".format(total_cnt))
 
def count_head_spaces(s: str) -> int:
   
    count = 0
    for char in s:
        if char == ' ':
            count += 1
        else:
            break
    return count
 
def process_line(lines, pattern1, unit_factor, dump_level):
    """ process a line in a file with profiling info
 
    Args:
        unit_factor: A factor differentiated by KUNLUN1 and KUNLUN2
 
    """
    res = {}
    cnt = {}
    op = "init_op"
    unit = unit_factor * 1000 * 1000 # ns -> ms
    wait_next_one = False
    for i in range(len(lines)):
        cur_line = lines[i]
        if "gtest_" in cur_line:
            cur_level = count_head_spaces(cur_line) / tab_space_num
            if cur_level == dump_level:
                wait_next_one = False
                print_info_op(res, cnt, unit, op)
                # clear buf
                res = {}
                cnt = {}
                op = cur_line.lstrip().rstrip()
            elif cur_level < dump_level:
                wait_next_one = True
                # skip record kernel time untime next one
                continue
        if wait_next_one:
            # skip record kernel time
            continue
        match = re.match(pattern1, lines[i])
        if match:
            op_type = match.group(1)
            op_time = match.group(2)
            if op_type in res:
                res[op_type] += float(op_time)
                cnt[op_type] += 1
            else:
                res[op_type] = float(op_time)
                cnt[op_type] = 1
 
    # get left total time
    if dump_level == -1:
        print_info_kernel(res, cnt, unit)
    else:
        print_info_op(res, cnt, unit, op)
    return res
 
def process_file(file_name, pattern2, unit_factor, dump_level = -1):
    """ Process a file line by line
 
    Iteratively process each line in the target file.
 
    """
 
    with open(file_name, "r") as f:
        lines = f.readlines()
        f1_res_list = process_line(lines, pattern2, unit_factor, dump_level)
 
if __name__ == '__main__':
    import argparse
 
    parser = argparse.ArgumentParser()
 
    group = parser.add_mutually_exclusive_group()
    group.add_argument('-xpu1', action='store_true', help='指定为 xpu1')
    group.add_argument('-xpu2', action='store_true', help='指定为 xpu2')
    group.add_argument('-xpu3', action='store_true', help='指定为 xpu3')
    parser.add_argument('--level', type=int, default=-1, help='指定 dump 缩进级别默认为 -1')
    parser.add_argument('filename', help='要处理的文件名')
 
    args = parser.parse_args()
 
    filename = args.filename
    xpu_version = 0
    if args.xpu2:
        xpu_version = 1
    if args.xpu3:
        xpu_version = 2
    dump_level = args.level
    print(f'Filename: {filename}')
    print(f'-xpu option: {xpu_version}')
    print(f'--level option: {dump_level}')
 
    unit_factor = unit_factors[xpu_version]
    pattern_idx = 0
    if xpu_version > 0:
        pattern_idx = 1
    process_file(filename, patterns[pattern_idx], unit_factor, dump_level)
 
```
::::
::::{tab-item} op_log.sh
```bash
for i in {0..7}; do
    python op_log.py -xpu3 xpu_logs/rank_${i}.log > analysis_rank${i}.log
    echo "Rank ${i} 分析完成"
done
for i in {0..7}; do
    echo "=== Rank $i ===" 
    head -n 6 analysis_rank${i}.log | tail -n 5
done
```
::::
:::::
##### 📈 Output Example (analysis_rank0.log)
```
Filename: xpu_logs/rank_0.log
-xpu option: 2
--level option: -1
Total time(ms) is 53742.29571862069
Op type                                                                                   count     time(ms)            %                   
void xblas_xpu3::fc_cdnn_infer<float16, float16, float16, float16, float, float, float, float, 1>                                                     661569    22736.262780689656       42.306              
void kl3_all_reduce<float16>                                                                                                                          176134    14782.525712413793       27.506              
void kl3_all_reduce_butterfly<float16>                                                                                                                164864    4197.28395862069         7.81           
```
##### 🚨 Troubleshooting Guide
|Symptom|Cause|Solution|
|-|-|-|
|`xpu_logs` directory is empty|XPUAPI_DEBUG not enabled|Verify that the environment variable is correctly set|
All 8 log files have identical content|Multi-process backend not activated|Ensure `--distributed-executor-backend` mp is specified|
|Throughput drops >15%|Profiling overhead too high|Enable profiling only during analysis; disable in production|