Init attention backend for Intel XPU (#10656)
Co-authored-by: guangyey <guangye.yu@intel.com>
Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
@@ -26,6 +26,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **AITER (ROCm)** | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
| **Wave (ROCm)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Ascend (NPU)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Intel XPU** | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |

### MLA Backends
@@ -190,6 +191,13 @@ python3 -m sglang.launch_server \
--attention-backend ascend
```

- Intel XPU

```bash
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend intel_xpu
```

- Wave
```bash
python3 -m sglang.launch_server \
@@ -75,6 +75,7 @@ Its core features include:
platforms/tpu.md
platforms/nvidia_jetson.md
platforms/ascend_npu.md
platforms/xpu.md

.. toctree::
:maxdepth: 1

docs/platforms/xpu.md (new file, 92 lines)
@@ -0,0 +1,92 @@
# XPU

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on Intel GPUs; [see more context about Intel GPU support in the PyTorch ecosystem](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html).

Specifically, SGLang is optimized for [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/242616/intel-arc-pro-b-series-graphics.html) and [Intel® Arc™ B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/240391/intel-arc-b-series-graphics.html).

## Optimized Model List

A number of LLMs have been optimized on Intel GPUs, and more are on the way:

| Model Name | BF16 |
|:---:|:---:|
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
| Qwen2.5-1.5B | [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) |

**Note:** The model identifiers listed in the table above have been verified on [Intel® Arc™ B580 Graphics](https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html).

## Installation

### Install From Source

Currently, SGLang on XPU supports installation from source only. Please refer to ["Getting Started on Intel GPU"](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html) to install the XPU dependencies.

```bash
# Create and activate a conda environment
conda create -n sgl-xpu python=3.12 -y
conda activate sgl-xpu

# Use the PyTorch XPU channel as the primary pip index to avoid installing the
# larger CUDA-enabled build and prevent potential runtime issues.
pip3 install torch==2.8.0+xpu torchao torchvision torchaudio pytorch-triton-xpu==3.4.0 --index-url https://download.pytorch.org/whl/xpu
pip3 install xgrammar --no-deps  # xgrammar pulls in CUDA-enabled Triton, which may conflict with the XPU build

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Use the dedicated XPU toml file
cd python
cp pyproject_xpu.toml pyproject.toml

# Install SGLang's dependent libraries and build the main package
pip install --upgrade pip setuptools
pip install -v .
```
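After installation, a quick sanity check can confirm that the XPU build of PyTorch sees the device. This is an illustrative snippet, not part of SGLang; `torch.xpu.is_available()` is the standard PyTorch API for XPU device detection:

```python
import importlib.util

def has_xpu_torch() -> bool:
    """Best-effort check: is PyTorch importable, and does it report an XPU device?"""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch
    # torch.xpu.is_available() returns False on non-XPU builds rather than raising
    return hasattr(torch, "xpu") and torch.xpu.is_available()

print(has_xpu_torch())
```

If this prints `False` inside the `sgl-xpu` environment, re-check that the `+xpu` wheel (not a CUDA or CPU build) was installed.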

### Install Using Docker

The Docker image for XPU is under active development. Please stay tuned.

## Launch of the Serving Engine

Example command to launch SGLang serving:

```bash
python -m sglang.launch_server \
--model <MODEL_ID_OR_PATH> \
--trust-remote-code \
--disable-overlap-schedule \
--device xpu \
--host 0.0.0.0 \
--tp 2 \
--attention-backend intel_xpu \
--page-size 128
```

Here `--tp 2` runs tensor parallelism across multiple GPUs, `--attention-backend intel_xpu` selects the Intel-optimized XPU attention backend, and `--page-size` accepts 32, 64, or 128 for the intel_xpu backend.

## Benchmarking with Requests

You can benchmark performance via the `bench_serving` script. Run the following command in another terminal:

```bash
python -m sglang.bench_serving \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--request-rate inf \
--random-range-ratio 1.0
```
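As a rough cross-check on the numbers `bench_serving` reports, total token throughput relates the token counts configured above to wall-clock time. This small helper is illustrative only and not part of SGLang:

```python
def total_token_throughput(num_prompts: int, input_len: int,
                           output_len: int, elapsed_s: float) -> float:
    """Total tokens processed (input + output, across all prompts) per second."""
    total_tokens = num_prompts * (input_len + output_len)
    return total_tokens / elapsed_s

# With the example settings above (1 prompt, 1024 input + 1024 output tokens),
# a 16-second run would correspond to 2048 / 16 = 128 tokens/s.
print(total_token_throughput(1, 1024, 1024, 16.0))  # → 128.0
```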

Detailed explanations of the parameters can be looked up with:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be formed with the [OpenAI Completions API](https://docs.sglang.ai/basic_usage/openai_api_completions.html) and sent via the command line (e.g. using `curl`) or from your own script.
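As a sketch of such a script, the snippet below forms a Completions-style request body and posts it using only the standard library. The endpoint assumes SGLang's default port 30000; adjust the host, port, and model to match your launch command:

```python
import json
from urllib import request

def build_payload(model: str, prompt: str, max_tokens: int = 32) -> dict:
    """Form an OpenAI Completions API request body."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def send(payload: dict, url: str = "http://localhost:30000/v1/completions") -> dict:
    """POST the payload to a running SGLang server and return the parsed JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_payload("meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello, XPU!")
print(json.dumps(payload))
# send(payload) returns the completion once the server from the launch section is running.
```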