Initial commit for vLLM-Kunlun Plugin

2025-12-10 12:05:39 +08:00
commit c728e52505
131 changed files with 28816 additions and 0 deletions
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -0,0 +1,9 @@
+# Tutorials
+
+:::{toctree}
+:caption: Deployment
+:maxdepth: 1
+single_xpu_Qwen3-8B
+multi_xpu_GLM-4.5
+multi_xpu_Qwen3-Coder-480B-A35B(W8A8)
+:::
--- a/docs/source/tutorials/multi_xpu_GLM-4.5.md
+++ b/docs/source/tutorials/multi_xpu_GLM-4.5.md
@@ -0,0 +1,153 @@
+# Multi XPU (GLM-4.5)
+
+## Run vllm-kunlun on Multi XPU
+
+Set up the environment using a container:
+
+```bash
+docker run -itd \
+        --net=host \
+        --cap-add=SYS_PTRACE --security-opt=seccomp=unconfined \
+        --ulimit=memlock=-1 --ulimit=nofile=120000 --ulimit=stack=67108864 \
+        --shm-size=128G \
+        --privileged \
+        --name=glm-vllm-01011 \
+        -v ${PWD}:/data \
+        -w /workspace \
+        -v /usr/local/bin/:/usr/local/bin/ \
+        -v /lib/x86_64-linux-gnu/libxpunvidia-ml.so.1:/lib/x86_64-linux-gnu/libxpunvidia-ml.so.1 \
+        iregistry.baidu-int.com/hac_test/aiak-inference-llm:xpu_dev_20251113_221821 bash
+
+docker exec -it glm-vllm-01011 /bin/bash
+```
+
+### Offline Inference on Multi XPU
+
+Run the following offline inference script inside the container:
+
+```{code-block} python
+from vllm import LLM, SamplingParams
+
+def main():
+
+    model_path = "/data/GLM-4.5"
+
+    llm_params = {
+        "model": model_path,
+        "tensor_parallel_size": 8,
+        "trust_remote_code": True,
+        "dtype": "float16",
+        "enable_chunked_prefill": False,
+        "distributed_executor_backend": "mp",
+    }
+
+    llm = LLM(**llm_params)
+
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Hello, who are you?"
+                }
+            ]
+        }
+    ]
+
+    sampling_params = SamplingParams(
+        max_tokens=100,
+        temperature=0.7,
+        top_k=50,
+        top_p=1.0,
+        stop_token_ids=[181896]
+    )
+
+    outputs = llm.chat(messages, sampling_params=sampling_params)
+
+    response = outputs[0].outputs[0].text
+    print("=" * 50)
+    print("Input content:", messages)
+    print("Model response:\n", response)
+    print("=" * 50)
+
+if __name__ == "__main__":
+    main()
+
+```
+
+If the script runs successfully, you will see output similar to the following:
+
+```bash
+==================================================
+Input content: [{'role': 'user', 'content': [{'type': 'text', 'text': 'Hello, who are you?'}]}]
+Model response:
+ <think>
+Well, the user asked a rather direct question about identity. This question seems simple, but there could be several underlying intentions—perhaps they are testing my reliability for the first time, or they simply want to confirm the identity of the conversational partner. From the common positioning of AI assistants, the user has provided a clear and flat way to define identity while leaving room for potential follow-up questions.\n\nThe user used "you" instead of "your", which leans towards a more informal tone, so the response style can be a bit more relaxed. However, since this is the initial response, it is better to maintain a moderate level of professionalism. Mentioning
+==================================================
+```
+
+### Online Serving on Multi XPU
+
+Start the vLLM server on multi XPU (8-way tensor parallelism):
+
+```{code-block} bash
+python -m vllm.entrypoints.openai.api_server \
+      --host localhost \
+      --port 8989 \
+      --model /data/GLM-4.5 \
+      --gpu-memory-utilization 0.95 \
+      --trust-remote-code \
+      --max-model-len 131072 \
+      --tensor-parallel-size 8 \
+      --dtype float16 \
+      --max-num-seqs 128 \
+      --max-num-batched-tokens 4096 \
+      --max-seq-len-to-capture 4096 \
+      --block-size 128 \
+      --no-enable-prefix-caching \
+      --no-enable-chunked-prefill \
+      --distributed-executor-backend mp \
+      --served-model-name GLM-4.5 \
+      --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", "vllm.unified_attention", "vllm.unified_attention_with_output", "vllm.mamba_mixer2"]}'  > log_glm_plugin.txt 2>&1 &
+```
+
+If the service starts successfully, you will see output like the following:
+
+```bash
+(APIServer pid=51171) INFO:     Started server process [51171]
+(APIServer pid=51171) INFO:     Waiting for application startup.
+(APIServer pid=51171) INFO:     Application startup complete.
+```
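
Because the launch command above runs in the background and redirects its output to a log file, a script can wait for the server by polling rather than watching the log. The following is a minimal sketch, assuming the server exposes vLLM's `/health` endpoint on the host and port used above; `wait_until_ready` and its parameters are illustrative names, not part of vllm-kunlun.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0,
                     opener=urllib.request.urlopen):
    """Poll `{base_url}/health` until it answers HTTP 200 or the timeout expires.

    `opener` is injectable so the loop can be exercised without a live server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the server started above, this could be called as `wait_until_ready("http://localhost:8989")`. Loading a large model across 8 XPUs can take a while, so the default timeout here is deliberately generous and is only a guess.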
+
+Once your server is started, you can query the model with input prompts:
+
+```bash
+curl http://localhost:8989/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "GLM-4.5",
+    "messages": [
+      {"role": "user", "content": "Hello, who are you?"}
+    ],
+    "max_tokens": 100,
+    "temperature": 0.7
+  }'
+```
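
The same request can be sent from Python. This is a sketch using only the standard library, assuming the server from this tutorial is listening on `localhost:8989` and serving the model as `GLM-4.5`; the helper names `build_chat_request`, `extract_reply`, and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8989", model="GLM-4.5",
                       max_tokens=100, temperature=0.7):
    """Build the POST request that mirrors the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response):
    """Return the assistant text from a chat.completion response dict."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send the prompt to the server and return the assistant text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        return extract_reply(json.load(resp))
```

Calling `chat("Hello, who are you?")` sends the same payload as the curl example. Keeping request construction separate from the network call makes the payload easy to inspect without a running server.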
+
+If you query the server successfully, you can see the info shown below (client):
+
+```bash
+{"id":"chatcmpl-6af7318de7394bc4ae569e6324a162fa","object":"chat.completion","created":1763101638,"model":"GLM-4.5","choices":[{"index":0,"message":{"role":"assistant","content":"\n<think>The user asked, \"Hello, who are you?\" This is a question about my identity. First, I need to confirm the user's intent. They might be using this service for the first time or have never interacted with similar AI assistants before, so they want to know my background and capabilities.\n\nNext, I should ensure my answer is clear and friendly, focusing on key points: who I am, who developed me, and what I can do. I should avoid technical jargon and keep the response conversational so it's easy to understand.\n\nAdditionally, the user may have potential needs, such as wanting to know what I am capable of.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_tr
+```
+
+Logs of the vllm server:
+
+```bash
+(APIServer pid=54567) INFO:     127.0.0.1:60338 - "POST /v1/completions HTTP/1.1" 200 OK
+(APIServer pid=54567) INFO 11-13 14:35:48 [loggers.py:123] Engine 000: Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
+```
--- a/docs/source/tutorials/multi_xpu_Qwen3-Coder-480B-A35B(W8A8).md
+++ b/docs/source/tutorials/multi_xpu_Qwen3-Coder-480B-A35B(W8A8).md
@@ -0,0 +1,132 @@
+# Multi XPU (Qwen3-Coder-480B-A35B(W8A8))
+
+## Run vllm-kunlun on Multi XPU
+
+Set up the environment using a container:
+
+```bash
+#!/bin/bash
+# rundocker.sh
+XPU_NUM=8
+DOCKER_DEVICE_CONFIG=""
+if [ $XPU_NUM -gt 0 ]; then
+    for idx in $(seq 0 $((XPU_NUM-1))); do
+        DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
+    done
+    DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
+fi
+
+export build_image="xxxxxxxxxxxxxxxxx" 
+
+docker run -itd ${DOCKER_DEVICE_CONFIG} \
+    --net=host \
+    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
+    --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
+    -v /home/users/vllm-kunlun:/home/vllm-kunlun \
+    -v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
+    --name "$1" \
+    -w /workspace \
+    "$build_image" /bin/bash
+```
+
+### Prepare the Weights
+
+* Download the Qwen3-Coder-480B-A35B-Instruct bf16 weights.
+* Edit the model configuration file (`config.json`) shipped with the weights and add the `quantization_config` and `compression_config` fields, which appear at the end of the example below.
+
+```json
+{
+  "architectures": [
+    "Qwen3MoeForCausalLM"
+  ],
+  "attention_dropout": 0.0,
+  "decoder_sparse_step": 1,
+  "eos_token_id": 151645,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 6144,
+  "initializer_range": 0.02,
+  "intermediate_size": 8192,
+  "max_position_embeddings": 262144,
+  "max_window_layers": 62,
+  "mlp_only_layers": [],
+  "model_type": "qwen3_moe",
+  "moe_intermediate_size": 2560,
+  "norm_topk_prob": true,
+  "num_attention_heads": 96,
+  "num_experts": 160,
+  "num_experts_per_tok": 8,
+  "num_hidden_layers": 62,
+  "num_key_value_heads": 8,
+  "output_router_logits": false,
+  "qkv_bias": false,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": null,
+  "rope_theta": 10000000,
+  "router_aux_loss_coef": 0.0,
+  "shared_expert_intermediate_size": 0,
+  "sliding_window": null,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.51.0",
+  "use_cache": true,
+  "use_qk_norm": true,
+  "use_sliding_window": false,
+  "vocab_size": 151936,
+  "quantization_config": {
+    "quant_method": "compressed-tensors"
+  },
+  "compression_config": {
+    "format": "pack_quantized",
+    "config_groups": {
+      "linear_w8a8": {
+        "targets": ["Linear"],
+        "weights": {
+          "type": "int",
+          "num_bits": 8,
+          "strategy": "channel",
+          "group_size": null,
+          "symmetric": true,
+          "dynamic": false
+        },
+        "input_activations": {
+          "type": "int",
+          "num_bits": 8,
+          "strategy": "token",
+          "group_size": null,
+          "symmetric": true,
+          "dynamic": true
+        }
+      }
+    },
+    "ignore": [],
+    "sparsity_config": null
+  }
+}
+
+```
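
Editing a large JSON file by hand is error-prone, so the two fields can also be added programmatically. The sketch below copies the values from the example above and leaves every other key untouched; `add_w8a8_fields` is a hypothetical helper, and the path in the usage note is an assumption.

```python
import json
from pathlib import Path

# Values copied from the example configuration above.
QUANTIZATION_CONFIG = {"quant_method": "compressed-tensors"}
COMPRESSION_CONFIG = {
    "format": "pack_quantized",
    "config_groups": {
        "linear_w8a8": {
            "targets": ["Linear"],
            "weights": {
                "type": "int", "num_bits": 8, "strategy": "channel",
                "group_size": None, "symmetric": True, "dynamic": False,
            },
            "input_activations": {
                "type": "int", "num_bits": 8, "strategy": "token",
                "group_size": None, "symmetric": True, "dynamic": True,
            },
        }
    },
    "ignore": [],
    "sparsity_config": None,
}


def add_w8a8_fields(config_path):
    """Add the two W8A8 fields to a model config file, keeping all other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["quantization_config"] = QUANTIZATION_CONFIG
    config["compression_config"] = COMPRESSION_CONFIG
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Assuming the weights live in the directory passed to `--model` in the next section, the call would look like `add_w8a8_fields("/Qwen/Qwen3-Coder-480B-A35B-Instruct/config.json")`. The file is rewritten in place, so back it up first if the original is needed.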
+
+### Online Serving on Multi XPU
+
+Start the vLLM server on multi XPU:
+
+```bash
+python3 -m vllm.entrypoints.openai.api_server \
+ --host 0.0.0.0 \
+ --port 8898 \
+ --model /Qwen/Qwen3-Coder-480B-A35B-Instruct \
+ --dtype float16 \
+ --trust-remote-code \
+ --tensor-parallel-size 8 \
+ --block-size 128 \
+ --max-model-len 40960 \
+ --max-num-seqs 512 \
+ --max-num-batched-tokens 40960 \
+ --max-seq-len-to-capture 40960 \
+ --distributed-executor-backend mp \
+ --enable-chunked-prefill=False \
+ --no-enable-prefix-caching \
+ --disable-log-requests \
+ --gpu-memory-utilization 0.85
+```
--- a/docs/source/tutorials/single_xpu_Qwen3-8B.md
+++ b/docs/source/tutorials/single_xpu_Qwen3-8B.md
@@ -0,0 +1,168 @@
+# Single XPU (Qwen3-8B)
+
+## Run vllm-kunlun on Single XPU
+
+Set up the environment using a container:
+
+```bash
+#!/bin/bash
+# rundocker.sh
+XPU_NUM=8
+DOCKER_DEVICE_CONFIG=""
+if [ $XPU_NUM -gt 0 ]; then
+    for idx in $(seq 0 $((XPU_NUM-1))); do
+        DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpu${idx}:/dev/xpu${idx}"
+    done
+    DOCKER_DEVICE_CONFIG="${DOCKER_DEVICE_CONFIG} --device=/dev/xpuctrl:/dev/xpuctrl"
+fi
+
+export build_image="xxxxxxxxxxxxxxxxx"
+
+docker run -itd ${DOCKER_DEVICE_CONFIG} \
+    --net=host \
+    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
+    --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=32g \
+    -v /home/users/vllm-kunlun:/home/vllm-kunlun \
+    -v /usr/local/bin/xpu-smi:/usr/local/bin/xpu-smi \
+    --name "$1" \
+    -w /workspace \
+    "$build_image" /bin/bash
+```
+
+### Offline Inference on Single XPU
+
+Run the following offline inference script inside the container:
+
+```{code-block} python
+from vllm import LLM, SamplingParams
+
+def main():
+
+    model_path = "/models/Qwen3-8B"
+
+    llm_params = {
+        "model": model_path,
+        "tensor_parallel_size": 1,
+        "trust_remote_code": True,
+        "dtype": "float16",
+        "enable_chunked_prefill": False,
+        "distributed_executor_backend": "mp",
+    }
+
+    llm = LLM(**llm_params)
+
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "tell a joke"
+                }
+            ]
+        }
+    ]
+
+    sampling_params = SamplingParams(
+        max_tokens=200,
+        temperature=1.0,
+        top_k=50,
+        top_p=1.0,
+        stop_token_ids=[181896]
+    )
+
+    outputs = llm.chat(messages, sampling_params=sampling_params)
+
+    response = outputs[0].outputs[0].text
+    print("=" * 50)
+    print("Input content:", messages)
+    print("Model response:\n", response)
+    print("=" * 50)
+
+if __name__ == "__main__":
+    main()
+
+```
+
+If the script runs successfully, you will see output similar to the following:
+
+```bash
+==================================================
+Input content: [{'role': 'user', 'content': [{'type': 'text', 'text': 'tell a joke'}]}]
+Model response:
+ <think>
+
+Okay, the user asked me to tell a joke. First, I need to consider the user's needs. They might just want to relax or need some entertainment. Next, I need to choose a suitable joke that is not too complicated, easy to understand, and also interesting.
+
+
+The user might expect the joke to be in Chinese, so I need to ensure that the joke conforms to the language habits and cultural background of Chinese. I need to avoid sensitive topics, such as politics, religion, or anything that might cause misunderstanding. Then, I have to consider the structure of the joke, which usually involves a setup and an unexpected ending to create humor.
+
+For example, I could tell a light-hearted story about everyday life, such as animals or common scenarios. For instance, the story of a turtle and a rabbit racing, but with a twist. However, I need to ensure that the joke is of moderate length and not too long, so the user doesn't lose interest. Additionally, I should pay attention to using colloquial language and avoid stiff or complex sentence structures.
+
+I might also need to check if this joke is common to avoid repetition. If the user has heard something similar before, I may need to come up with a different angle.
+==================================================
+```
+
+### Online Serving on Single XPU
+
+Start the vLLM server on a single XPU:
+
+```{code-block} bash
+python -m vllm.entrypoints.openai.api_server \
+      --host 0.0.0.0 \
+      --port 9000 \
+      --model /models/Qwen3-8B \
+      --gpu-memory-utilization 0.9 \
+      --trust-remote-code \
+      --max-model-len 32768 \
+      --tensor-parallel-size 1 \
+      --dtype float16 \
+      --max-num-seqs 128 \
+      --max-num-batched-tokens 32768 \
+      --max-seq-len-to-capture 32768 \
+      --block-size 128 \
+      --no-enable-prefix-caching \
+      --no-enable-chunked-prefill \
+      --distributed-executor-backend mp \
+      --served-model-name Qwen3-8B \
+      --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
+            "vllm.unified_attention", "vllm.unified_attention_with_output",
+            "vllm.mamba_mixer2"]}'
+```
+
+If the service starts successfully, you will see output like the following:
+
+```bash
+(APIServer pid=118459) INFO:     Started server process [118459]
+(APIServer pid=118459) INFO:     Waiting for application startup.
+(APIServer pid=118459) INFO:     Application startup complete.
+```
+
+Once your server is started, you can query the model with input prompts:
+
+```bash
+curl http://localhost:9000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen3-8B",
+        "prompt": "What is your name?",
+        "max_tokens": 100,
+        "temperature": 0
+    }'
+```
+
+If you query the server successfully, you can see the info shown below (client):
+
+```bash
+{"id":"cmpl-80ee8b893dc64053947b0bea86352faa","object":"text_completion","created":1763015742,"model":"Qwen3-8B","choices":[{"index":0,"text":" is the S, and ,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null},"kv_transfer_params":null}
+```
+
+Logs of the vllm server:
+
+```bash
+(APIServer pid=54567) INFO:     127.0.0.1:60338 - "POST /v1/completions HTTP/1.1" 200 OK
+(APIServer pid=54567) INFO 11-13 14:35:48 [loggers.py:123] Engine 000: Avg prompt throughput: 0.5 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
+```
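
The periodic engine statistics line in the server log can be turned into numbers for quick monitoring. The following sketch parses the fields shown in the sample log above; it assumes the log line keeps exactly this wording, which may differ between vLLM versions, and `parse_stats` is an illustrative name.

```python
import re

# Matches the throughput and queue fields of the engine statistics line shown above.
_STATS = re.compile(
    r"Avg prompt throughput: (?P<prompt_tps>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<generation_tps>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs"
)


def parse_stats(line):
    """Return the throughput and queue counters from one log line, or None."""
    match = _STATS.search(line)
    if match is None:
        return None  # not a statistics line (for example, an access-log entry)
    return {
        "prompt_tps": float(match["prompt_tps"]),
        "generation_tps": float(match["generation_tps"]),
        "running": int(match["running"]),
        "waiting": int(match["waiting"]),
    }
```

Applied to each line of the server log, this skips access-log entries and returns a dictionary only for statistics lines.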