Commit to the vllm 0.11.0 development branch
@@ -14,4 +14,4 @@ vllm-kunlun uses the following environment variables to configure the system:
| `export XMLIR_FORCE_USE_XPU_GRAPH` | `1` | **Forces XPU Graph mode to be enabled.** This captures and optimizes the model execution graph, significantly boosting inference performance. |
| `export VLLM_HOST_IP` | `$(hostname -i)` | **Sets the host IP address for the vLLM service.** A shell command dynamically obtains the current host's internal IP, which is used for inter-node communication in a distributed deployment. |
| `export XMLIR_ENABLE_MOCK_TORCH_COMPILE` | `false` | **Disables the mock Torch compile path.** Set to `false` to ensure the actual compilation and optimization flow is used rather than mock mode. |
| `FUSED_QK_ROPE_OP` | `0` | **Controls whether the fused QK-Norm and RoPE implementation is used.** Default is `0` (use the original/standard RoPE). Setting it to `1` may be required to enable Qwen3. |
| `USE_ORI_ROPE` | `1` | **Controls whether the original RoPE (Rotary Position Embedding) implementation is used.** Default is `1` (use the original/standard RoPE). Setting it to `0` may be required to enable Qwen3 (possibly a KunlunCore-specific quantization or optimization technique), which requires specific model support. |
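The variables documented above are typically exported before launching the server. A minimal setup sketch, mirroring the defaults from the table (adjust `FUSED_QK_ROPE_OP`/`USE_ORI_ROPE` per model):

```shell
# Environment setup for vllm-kunlun, using the defaults documented above.
export XMLIR_FORCE_USE_XPU_GRAPH=1            # capture and optimize the execution graph
export VLLM_HOST_IP=$(hostname -i)            # host IP for inter-node communication
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false  # use the real compile path, not mock mode
export FUSED_QK_ROPE_OP=0                     # standard RoPE; 1 enables the fused QK-Norm + RoPE path
export USE_ORI_ROPE=1                         # original RoPE implementation
```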
@@ -42,7 +42,7 @@ Online example:

```
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
- --model /models/Qwen3-8B-Instruct \
+ --model /models/Qwen3-8B \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
@@ -52,9 +52,17 @@ python -m vllm.entrypoints.openai.api_server \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name Qwen3-8B-Instruct \
- --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
- "vllm.unified_attention", "vllm.unified_attention_with_output",
- "vllm.mamba_mixer2"]}' \
+ --compilation-config '{"splitting_ops": ["vllm.unified_attention",
+ "vllm.unified_attention_with_output",
+ "vllm.unified_attention_with_output_kunlun",
+ "vllm.mamba_mixer2",
+ "vllm.mamba_mixer",
+ "vllm.short_conv",
+ "vllm.linear_attention",
+ "vllm.plamo2_mamba_mixer",
+ "vllm.gdn_attention",
+ "vllm.sparse_attn_indexer"]}' \
```
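Once launched, the server exposes an OpenAI-compatible HTTP API. A client-side sketch, assuming the server from the example above is reachable on `localhost:8000` and was started with `--served-model-name Qwen3-8B-Instruct`:

```shell
# Build the chat-completions payload; "model" must match --served-model-name.
PAYLOAD='{"model": "Qwen3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'

# Sanity-check that the payload is valid JSON before sending it.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send the request (requires the server above to be running):
# curl -s http://localhost:8000/v1/chat/completions \
#      -H "Content-Type: application/json" -d "$PAYLOAD"
```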
@@ -4,30 +4,12 @@
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :------------ | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qwen2 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen2.5 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-Coder | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| QwQ-32B | ✅ | | | ✅ | | ✅ | ✅ |
| Llama2 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3.1 | ✅ | | | ✅ | | ✅ | ✅ |
| GLM-4.5 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GLM-4.5-Air | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
- | Qwen3-Next | 🔜 Coming soon | | | | | | |
| gpt-oss | 🔜 Coming soon | | | | | | |
| DeepSeek-V3 | 🔜 Coming soon | | | | | | |
| DeepSeek-V3.2 | 🔜 Coming soon | | | | | | |
+ | Qwen3-Next | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

## Multimodal Language Models
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :----------- | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qianfan-VL | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5-VL | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL2.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternS1 | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5-Omni | 🔜 Coming soon | | | | | | |
- | Qwen3-VL | 🔜 Coming soon | | | | | | |
+ | Qwen3-VL | ✅ | | | ✅ | | ✅ | ✅ |