Commit to the vllm 0.11.0 development branch
@@ -14,4 +14,4 @@ vllm-kunlun uses the following environment variables to configure the system:
| `export XMLIR_FORCE_USE_XPU_GRAPH` | `1` | **Forces XPU Graph mode to be enabled.** This captures and optimizes the model execution graph, significantly boosting inference performance. |
| `export VLLM_HOST_IP` | `$(hostname -i)` | **Sets the host IP address for the vLLM service.** A shell command dynamically obtains the current host's internal IP, which is used for inter-node communication in a distributed deployment. |
| `export XMLIR_ENABLE_MOCK_TORCH_COMPILE` | `false` | **Disables the mock Torch compile path.** Set to `false` to ensure the actual compilation and optimization flow is used rather than mock mode. |
| `FUSED_QK_ROPE_OP` | `0` | **Controls whether the fused QK-Norm and RoPE implementation is used.** Default is `0` (use the original/standard RoPE). Setting it to `1` may be required to enable Qwen3. |
| `USE_ORI_ROPE` | `1` | **Controls whether the original RoPE (Rotary Position Embedding) implementation is used.** Default is `1` (use the original/standard RoPE). Setting it to `0` may be required to enable Qwen3 (possibly a KunlunCore-specific quantization or optimization technique), which requires specific model support. |
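The variables documented above are typically exported before launching the server. A minimal setup sketch, mirroring the defaults from the table (adjust `FUSED_QK_ROPE_OP`/`USE_ORI_ROPE` per model):

```shell
# Environment setup for vllm-kunlun, using the defaults documented above.
export XMLIR_FORCE_USE_XPU_GRAPH=1            # capture and optimize the execution graph
export VLLM_HOST_IP=$(hostname -i)            # host IP for inter-node communication
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false  # use the real compile path, not mock mode
export FUSED_QK_ROPE_OP=0                     # standard RoPE; 1 enables the fused QK-Norm + RoPE path
export USE_ORI_ROPE=1                         # original RoPE implementation
```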
@@ -42,7 +42,7 @@ Online example:

```
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
- --model /models/Qwen3-8B-Instruct \
+ --model /models/Qwen3-8B \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
@@ -52,9 +52,17 @@ python -m vllm.entrypoints.openai.api_server \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name Qwen3-8B-Instruct \
- --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
- "vllm.unified_attention", "vllm.unified_attention_with_output",
- "vllm.mamba_mixer2"]}' \
+ --compilation-config '{"splitting_ops": ["vllm.unified_attention",
+ "vllm.unified_attention_with_output",
+ "vllm.unified_attention_with_output_kunlun",
+ "vllm.mamba_mixer2",
+ "vllm.mamba_mixer",
+ "vllm.short_conv",
+ "vllm.linear_attention",
+ "vllm.plamo2_mamba_mixer",
+ "vllm.gdn_attention",
+ "vllm.sparse_attn_indexer"]}' \
```
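Once launched, the server exposes an OpenAI-compatible HTTP API. A client-side sketch, assuming the server from the example above is reachable on `localhost:8000` and was started with `--served-model-name Qwen3-8B-Instruct`:

```shell
# Build the chat-completions payload; "model" must match --served-model-name.
PAYLOAD='{"model": "Qwen3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'

# Sanity-check that the payload is valid JSON before sending it.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send the request (requires the server above to be running):
# curl -s http://localhost:8000/v1/chat/completions \
#      -H "Content-Type: application/json" -d "$PAYLOAD"
```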
@@ -4,30 +4,12 @@
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :------------ | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qwen2 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen2.5 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-Coder | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| QwQ-32B | ✅ | | | ✅ | | ✅ | ✅ |
| Llama2 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3.1 | ✅ | | | ✅ | | ✅ | ✅ |
| GLM-4.5 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GLM-4.5-Air | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
- | Qwen3-Next | 🔜 Coming soon | | | | | | |
| gpt-oss | 🔜 Coming soon | | | | | | |
| DeepSeek-V3 | 🔜 Coming soon | | | | | | |
| DeepSeek-V3.2 | 🔜 Coming soon | | | | | | |
+ | Qwen3-Next | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

## Multimodal Language Models
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :----------- | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qianfan-VL | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5-VL | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL2.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternS1 | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5-Omni | 🔜 Coming soon | | | | | | |
- | Qwen3-VL | 🔜 Coming soon | | | | | | |
+ | Qwen3-VL | ✅ | | | ✅ | | ✅ | ✅ |