[Doc] Optimize the document (#136)
@@ -1,6 +1,7 @@
 # Contributing
 
 ## Building and Testing
 
 It's recommended to set up a local development environment to build vllm-kunlun and run tests
 before you submit a PR.
 
@@ -21,10 +22,18 @@ python -m vllm.entrypoints.openai.api_server \
 --no-enable-chunked-prefill \
 --distributed-executor-backend mp \
 --served-model-name your_modified_models \
---compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
-"vllm.unified_attention", "vllm.unified_attention_with_output",
-"vllm.mamba_mixer2"]}' \
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' \
 ```
 
 Please save a screenshot of your service running successfully, and attach an accuracy report.
 
 ### Submit the commit
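The `--compilation-config` value in the command above is a single JSON string passed through the shell, which makes hand-edited quoting mistakes easy. A standalone sketch (not part of the repo) of generating and sanity-checking it with `json.dumps`:

```python
import json

# The splitting ops from the updated command above.
splitting_ops = [
    "vllm.unified_attention",
    "vllm.unified_attention_with_output",
    "vllm.unified_attention_with_output_kunlun",
    "vllm.mamba_mixer2",
    "vllm.mamba_mixer",
    "vllm.short_conv",
    "vllm.linear_attention",
    "vllm.plamo2_mamba_mixer",
    "vllm.gdn_attention",
    "vllm.sparse_attn_indexer",
]

# json.dumps guarantees valid JSON; wrap the result in single quotes
# when pasting it into the shell command.
config_arg = json.dumps({"splitting_ops": splitting_ops})
print(config_arg)
```

Round-tripping the string through `json.loads` before launch catches a malformed list earlier than a server-side parse error would.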
@@ -36,7 +45,6 @@ git commit -sm "your commit info"
 
 🎉 Congratulations! You have completed the development environment setup.
 
 
 ## PR Title and Classification
 
 Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
@@ -46,7 +54,7 @@ Only specific types of PRs will be reviewed. The PR title is prefixed appropriat
 - `[ModelRunner]` for new features or optimization in model runner.
 - `[Platform]` for new features or optimization in platform.
 - `[Worker]` for new features or optimization in worker.
 - `[Core]` for new features or optimization in the core vllm-kunlun logic (such as platform, attention, communicators, model runner)
 - `[Kernel]` for changes affecting compute kernels and ops.
 - `[Bugfix]` for bug fixes.
 - `[Doc]` for documentation fixes and improvements.
@@ -60,7 +68,7 @@ If the PR spans more than one category, please include all relevant prefixes.
 
 ## Others
 
 If you find any problem when contributing, you can join our slack group to talk with us, and feel free to submit a PR improving this doc to help other developers.
 
 :::{toctree}
 :caption: Index
@@ -1,18 +1,23 @@
 # Profiling
 
 
 ## 🔧 Action Plan (Three Phases)
 
 ### Phase 1️⃣: Multi-Device Log Redirection Configuration
 
 #### Background
 
 By default, kernel logs from all 8 XPU devices are interleaved and emitted to stdout, resulting in:
 
 - It becomes impossible to distinguish which log originates from which device.
 - Timestamps become interleaved, making it difficult to analyze the temporal relationships.
 - Single-device bottlenecks are masked by global aggregation.
 
 #### Solution
 
 During model initialization, create separate log files for each device.
 
 #### Code Explanation (embedded in qwen2.py)
 
 ```python
 import os  # ← Ensure this is imported at the top of the file
 from vllm.distributed import get_tensor_model_parallel_rank  # ← Import function to get the tensor model parallel rank
@@ -30,38 +35,42 @@ class Qwen2Model(nn.Module):
 try:
     # Step 1: Get the current XPU device's rank (0~7)
     rank = get_tensor_model_parallel_rank()
 
     # Step 2: Create log directory (works with your get_kernel_time_ex.py)
     log_dir = "./xpu_logs"
     os.makedirs(log_dir, exist_ok=True)
 
     # Step 3: Generate a separate log file for each device
     log_file = os.path.join(log_dir, f"rank_{rank}.log")
 
     # Step 4: Core operation – redirect file descriptors
     # os.O_TRUNC: Clear previous logs on each run to avoid mixing outputs
     fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
     os.dup2(fd, 1)  # Redirect stdout → rank_X.log
     os.dup2(fd, 2)  # Redirect stderr → rank_X.log
     os.close(fd)  # Close original file descriptor; redirection persists
 
     # Optional: print a confirmation message (will go into rank_X.log)
     print(f"[Qwen2Model Init] Rank {rank} log redirected to {log_file}")
 
 except Exception as e:
     # Fallback mechanism: failure to redirect logs does not affect model loading
     print(f"[WARNING] Failed to redirect log for rank: {e}", flush=True)
 # ========== End of log redirection code ==========
 
 ```
 
 #### ⚠️ Common Issues
 
 **Q1**: Why not use Python's `logging` module?
 **A**: The XPU runtime kernel logs are emitted from the C++ layer and cannot be captured by Python's `logging` module. Redirection via low-level file descriptors is required.
 **Q2**: Will logs be lost if the model fails to load?
 **A**: The `try-except` block ensures that if log redirection fails, it falls back to the default behavior without affecting model startup.
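The redirection technique can be exercised in isolation. A minimal standalone sketch (the helper names are hypothetical; unlike the in-tree code, it also saves duplicates of the original descriptors so the redirection can be undone):

```python
import os
import tempfile

def redirect_stdio_to_file(log_file):
    """Point this process's fd 1/2 (stdout/stderr) at log_file."""
    saved_out, saved_err = os.dup(1), os.dup(2)   # keep originals for undo
    fd = os.open(log_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o664)
    os.dup2(fd, 1)   # stdout → log_file
    os.dup2(fd, 2)   # stderr → log_file
    os.close(fd)     # fds 1/2 still reference log_file after this close
    return saved_out, saved_err

def restore_stdio(saved_out, saved_err):
    """Undo redirect_stdio_to_file using the saved descriptors."""
    os.dup2(saved_out, 1)
    os.dup2(saved_err, 2)
    os.close(saved_out)
    os.close(saved_err)

if __name__ == "__main__":
    log_file = os.path.join(tempfile.mkdtemp(), "rank_0.log")
    saved = redirect_stdio_to_file(log_file)
    os.write(1, b"[demo] redirected line\n")   # fd-level write, bypasses Python buffering
    restore_stdio(*saved)
    with open(log_file) as f:
        print("captured:", f.read().strip())
```

Because `dup2` acts at the file-descriptor level, it also captures writes made by C++ code, which is why Python's `logging` module cannot do this job.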
 
 ### Phase 2️⃣: Profiling Environment Activation
 
 #### 🚀 vLLM Launch
 
 ```bash
 unset XPU_DUMMY_EVENT
 export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
@@ -96,14 +105,21 @@ USE_ORI_ROPE=1 VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
 --no-enable-chunked-prefill \
 --distributed-executor-backend mp \
 --served-model-name Qwen2.5-72B-Instruct \
---compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
-"vllm.unified_attention", "vllm.unified_attention_with_output",
-"vllm.mamba_mixer2"]}' 2>&1 | tee output_p800.log
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' 2>&1 | tee output_p800.log
 
 ```
 
 
 #### 🚀 Client Load Testing
 
 ```bash
 #!/bin/bash
 
@@ -177,6 +193,7 @@ echo "=========================================================="
 ```
 
 ### Phase 3️⃣: Log Analysis and Bottleneck Identification
 
 ```text
 xpu_logs/
 ├─ rank_0.log
@@ -189,16 +206,19 @@ xpu_logs/
 └─ rank_7.log
 
 ```
 
 #### 🔍 Script Workflow (op_log.py)
 
 **Input**: Raw Kernel Logs (Sample Format)
 
 ```
 [XPURT_PROF] void xblas_xpu3::fc_cdnn_infer<float16,...> 123456 ns
 [XPURT_PROF] void kl3_all_reduce<float16> 987654 ns
 ```
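The body of op_log.py is elided from this diff. As a rough sketch of the kind of per-kernel aggregation it describes (assuming only that each `[XPURT_PROF]` line ends with a duration in nanoseconds, as in the sample above; function names and columns here are illustrative, not the script's actual output):

```python
import re
from collections import defaultdict

# Assumed line shape, from the sample above:
# [XPURT_PROF] <kernel signature> <duration> ns
LINE_RE = re.compile(r"\[XPURT_PROF\]\s+(?P<kernel>.+?)\s+(?P<ns>\d+)\s+ns\s*$")

def summarize(lines):
    """Aggregate total time and call count per kernel; return rows sorted by
    total time descending, with each kernel's share of overall time."""
    totals = defaultdict(lambda: [0, 0])          # kernel -> [total_ns, calls]
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue                              # skip non-profiling lines
        entry = totals[m["kernel"]]
        entry[0] += int(m["ns"])
        entry[1] += 1
    grand = sum(t[0] for t in totals.values()) or 1
    return [(k, t[0], t[1], round(100.0 * t[0] / grand, 3))
            for k, t in sorted(totals.items(), key=lambda kv: -kv[1][0])]

sample = [
    "[XPURT_PROF] void xblas_xpu3::fc_cdnn_infer<float16,...> 123456 ns",
    "[XPURT_PROF] void kl3_all_reduce<float16> 987654 ns",
]
for kernel, total_ns, calls, pct in summarize(sample):
    print(f"{kernel}  {total_ns}  {calls}  {pct}")
```

Sorting by total time surfaces the dominant kernels first, which is what the bottleneck analysis in this phase is after.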
 
 **Processing logic**
 :::::{tab-set}
 ::::{tab-item} op_log.py
 
 
 ```python
 """
@@ -382,8 +402,6 @@ if __name__ == '__main__':
 
 ::::{tab-item} op_log.sh
 
 
 ```bash
 
 for i in {0..7}; do
@@ -397,9 +415,12 @@ for i in {0..7}; do
 head -n 6 analysis_rank${i}.log | tail -n 5
 done
 ```
 
 ::::
 :::::
 
 #### 📈 Output Example (analysis_rank0.log)
 
 ```
 Filename: xpu_logs/rank_0.log
 -xpu option: 2
@@ -410,9 +431,11 @@ void xblas_xpu3::fc_cdnn_infer<float16, float16, float16, float16, float,
 void kl3_all_reduce<float16> 176134 14782.525712413793 27.506
 void kl3_all_reduce_butterfly<float16> 164864 4197.28395862069 7.81
 ```
 
 #### 🚨 Troubleshooting Guide
-|Symptom|Cause|Solution|
-|-|-|-|
-|`xpu_logs` directory is empty|XPUAPI_DEBUG not enabled|Verify that the environment variable is correctly set|
-All 8 log files have identical content|Multi-process backend not activated|Ensure `--distributed-executor-backend` mp is specified|
-|Throughput drops >15%|Profiling overhead too high|Enable profiling only during analysis; disable in production|
+| Symptom | Cause | Solution |
+| -------------------------------------- | ----------------------------------- | ------------------------------------------------------------ |
+| `xpu_logs` directory is empty | XPUAPI_DEBUG not enabled | Verify that the environment variable is correctly set |
+| All 8 log files have identical content | Multi-process backend not activated | Ensure `--distributed-executor-backend mp` is specified |
+| Throughput drops >15% | Profiling overhead too high | Enable profiling only during analysis; disable in production |
@@ -32,6 +32,9 @@ if [ $XPU_NUM -gt 0 ]; then
 DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
 fi
 export build_image="wjie520/vllm_kunlun:v0.0.1"
+# or export build_image="iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.32"
 
 
 docker run -itd ${DOCKER_DEVICE_CONFIG} \
 --net=host \
 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
@@ -49,6 +52,7 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 ### Install vLLM 0.11.0
 ```
 conda activate vllm_kunlun_0.10.1.1
+# or conda activate python310_torch25_cuda
 
 pip install vllm==0.11.0 --no-build-isolation --no-deps
 ```
@@ -79,6 +83,7 @@ bash xpytorch-cp310-torch251-ubuntu2004-x64.run
 ## Install the KL3-customized build of PyTorch (Only MIMO V2)
 ```
 wget -O xpytorch-cp310-torch251-ubuntu2004-x64.run https://klx-sdk-release-public.su.bcebos.com/kunlun2aiak_output/1231/xpytorch-cp310-torch251-ubuntu2004-x64.run
+bash xpytorch-cp310-torch251-ubuntu2004-x64.run
 ```
 
 ## Install custom ops
@@ -113,7 +113,16 @@ python -m vllm.entrypoints.openai.api_server \
 --no-enable-chunked-prefill \
 --distributed-executor-backend mp \
 --served-model-name GLM-4.5 \
---compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", "vllm.unified_attention", "vllm.unified_attention_with_output", "vllm.mamba_mixer2"]}' > log_glm_plugin.txt 2>&1 &
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' > log_glm_plugin.txt 2>&1 &
 ```
 
 If your service starts successfully, you will see the info shown below:
@@ -16,7 +16,8 @@ if [ $XPU_NUM -gt 0 ]; then
 DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
 fi
 
-export build_image="xxxxxxxxxxxxxxxxx"
+export build_image="iregistry.baidu-int.com/xmlir/xmlir_ubuntu_2004_x86_64:v0.32"
 
 
 docker run -itd ${DOCKER_DEVICE_CONFIG} \
 --net=host \
@@ -32,14 +33,12 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 
 ### Weight Preparation
 
-* Pull Qwen3-Coder-480B-A35B-Instruct bf16 weights
-* Modify the weights configuration.json file and add the fields quantization_config and compression_config.
+- Pull Qwen3-Coder-480B-A35B-Instruct bf16 weights
+- Modify the weights' configuration.json file and add the `quantization_config` and `compression_config` fields.
 
 ```json
 {
-"architectures": [
-"Qwen3MoeForCausalLM"
-],
+"architectures": ["Qwen3MoeForCausalLM"],
 "attention_dropout": 0.0,
 "decoder_sparse_step": 1,
 "eos_token_id": 151645,
@@ -61,7 +60,7 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 "num_key_value_heads": 8,
 "output_router_logits": false,
 "qkv_bias": false,
-"rms_norm_eps": 1e-06,
+"rms_norm_eps": 1e-6,
 "rope_scaling": null,
 "rope_theta": 10000000,
 "router_aux_loss_coef": 0.0,
@@ -104,7 +103,6 @@ docker run -itd ${DOCKER_DEVICE_CONFIG} \
 "sparsity_config": null
 }
 }
 
 ```
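Adding the two fields can also be scripted rather than hand-edited. A hedged sketch with placeholder contents (substitute the real `quantization_config`/`compression_config` blocks from the snippet above; the helper name is illustrative):

```python
import json
import os
import tempfile

def add_fields(config_path, patch):
    """Merge top-level fields into an existing configuration.json,
    leaving all other keys untouched, and write the result back in place."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg.update(patch)
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

if __name__ == "__main__":
    # Demo against a throwaway file; point config_path at the real
    # weights directory instead.
    config_path = os.path.join(tempfile.mkdtemp(), "configuration.json")
    with open(config_path, "w") as f:
        json.dump({"architectures": ["Qwen3MoeForCausalLM"]}, f)
    add_fields(config_path, {
        "quantization_config": {"...": "..."},   # placeholder, see snippet above
        "compression_config": {"...": "..."},    # placeholder, see snippet above
    })
    print(json.load(open(config_path)))
```

Using `dict.update` on the parsed JSON avoids accidentally clobbering unrelated keys while editing by hand.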
 
 ### Online Serving on Multi XPU
@@ -128,5 +126,15 @@ python3 -m vllm.entrypoints.openai.api_server \
 --enable-chunked-prefill=False \
 --no-enable-prefix-caching \
 --disable-log-requests \
---gpu-memory-utilization 0.85
-```
+--gpu-memory-utilization 0.9 \
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' 2>&1 | tee output_p800.log
+```
@@ -128,9 +128,16 @@ python -m vllm.entrypoints.openai.api_server \
 --no-enable-chunked-prefill \
 --distributed-executor-backend mp \
 --served-model-name Qwen3-8B \
---compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
-"vllm.unified_attention", "vllm.unified_attention_with_output",
-"vllm.mamba_mixer2"]}' \
+--compilation-config '{"splitting_ops": ["vllm.unified_attention",
+"vllm.unified_attention_with_output",
+"vllm.unified_attention_with_output_kunlun",
+"vllm.mamba_mixer2",
+"vllm.mamba_mixer",
+"vllm.short_conv",
+"vllm.linear_attention",
+"vllm.plamo2_mamba_mixer",
+"vllm.gdn_attention",
+"vllm.sparse_attn_indexer"]}' \
 ```
 
 If your service starts successfully, you will see the info shown below: