Init attention backend for Intel XPU (#10656)
Co-authored-by: guangyey <guangye.yu@intel.com>
Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com>
@@ -26,6 +26,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **AITER (ROCm)** | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ |
| **Wave (ROCm)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Ascend (NPU)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Intel XPU** | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |

### MLA Backends
@@ -190,6 +191,13 @@ python3 -m sglang.launch_server \
--attention-backend ascend
```

- Intel XPU

```bash
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend intel_xpu
```

- Wave
```bash
python3 -m sglang.launch_server \
@@ -75,6 +75,7 @@ Its core features include:
platforms/tpu.md
platforms/nvidia_jetson.md
platforms/ascend_npu.md
platforms/xpu.md

.. toctree::
:maxdepth: 1

docs/platforms/xpu.md (new file, 92 lines)
@@ -0,0 +1,92 @@
# XPU

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on Intel GPUs; [see more context about Intel GPU support in the PyTorch ecosystem](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html).

Specifically, SGLang is optimized for [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/242616/intel-arc-pro-b-series-graphics.html) and [Intel® Arc™ B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/240391/intel-arc-b-series-graphics.html).

## Optimized Model List

A number of LLMs have been optimized on Intel GPUs, and more are on the way:

| Model Name | BF16 |
|:---:|:---:|
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
| Qwen2.5-1.5B | [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) |

**Note:** The model identifiers listed in the table above have been verified on [Intel® Arc™ B580 Graphics](https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html).

## Installation

### Install From Source

Currently, SGLang on XPU supports installation from source only. Please refer to ["Getting Started on Intel GPU"](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html) to install the XPU dependencies.

```bash
# Create and activate a conda environment
conda create -n sgl-xpu python=3.12 -y
conda activate sgl-xpu

# Use the PyTorch XPU channel as the primary pip index to avoid installing the
# larger CUDA-enabled build and prevent potential runtime issues.
pip3 install torch==2.8.0+xpu torchao torchvision torchaudio pytorch-triton-xpu==3.4.0 --index-url https://download.pytorch.org/whl/xpu
pip3 install xgrammar --no-deps  # xgrammar pulls in CUDA-enabled Triton, which may conflict with the XPU build

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Use the dedicated XPU toml file
cd python
cp pyproject_xpu.toml pyproject.toml

# Install SGLang's dependent libraries and build the main package
pip install --upgrade pip setuptools
pip install -v .
```
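After installation, a quick sanity check can confirm that the XPU build of PyTorch sees the device. This is an illustrative snippet, not part of SGLang; `torch.xpu.is_available()` is the standard PyTorch API for XPU device detection:

```python
import importlib.util

def has_xpu_torch() -> bool:
    """Best-effort check: is PyTorch importable, and does it report an XPU device?"""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch
    # torch.xpu.is_available() returns False on non-XPU builds rather than raising
    return hasattr(torch, "xpu") and torch.xpu.is_available()

print(has_xpu_torch())
```

If this prints `False` inside the `sgl-xpu` environment, re-check that the `+xpu` wheel (not a CUDA or CPU build) was installed.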

### Install Using Docker

The Docker image for XPU is under active development. Please stay tuned.

## Launch of the Serving Engine

Example command to launch SGLang serving:

```bash
python -m sglang.launch_server \
--model <MODEL_ID_OR_PATH> \
--trust-remote-code \
--disable-overlap-schedule \
--device xpu \
--host 0.0.0.0 \
--tp 2 \
--attention-backend intel_xpu \
--page-size 128
```

Here `--tp 2` runs tensor parallelism across multiple GPUs, `--attention-backend intel_xpu` selects the Intel-optimized XPU attention backend, and `--page-size` accepts 32, 64, or 128 for the intel_xpu backend.

## Benchmarking with Requests

You can benchmark performance via the `bench_serving` script. Run the following command in another terminal:

```bash
python -m sglang.bench_serving \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--request-rate inf \
--random-range-ratio 1.0
```
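As a rough cross-check on the numbers `bench_serving` reports, total token throughput relates the token counts configured above to wall-clock time. This small helper is illustrative only and not part of SGLang:

```python
def total_token_throughput(num_prompts: int, input_len: int,
                           output_len: int, elapsed_s: float) -> float:
    """Total tokens processed (input + output, across all prompts) per second."""
    total_tokens = num_prompts * (input_len + output_len)
    return total_tokens / elapsed_s

# With the example settings above (1 prompt, 1024 input + 1024 output tokens),
# a 16-second run would correspond to 2048 / 16 = 128 tokens/s.
print(total_token_throughput(1, 1024, 1024, 16.0))  # → 128.0
```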

Detailed explanations of the parameters can be looked up with:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be formed with the [OpenAI Completions API](https://docs.sglang.ai/basic_usage/openai_api_completions.html) and sent via the command line (e.g. using `curl`) or from your own script.
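As a sketch of such a script, the snippet below forms a Completions-style request body and posts it using only the standard library. The endpoint assumes SGLang's default port 30000; adjust the host, port, and model to match your launch command:

```python
import json
from urllib import request

def build_payload(model: str, prompt: str, max_tokens: int = 32) -> dict:
    """Form an OpenAI Completions API request body."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def send(payload: dict, url: str = "http://localhost:30000/v1/completions") -> dict:
    """POST the payload to a running SGLang server and return the parsed JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_payload("meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello, XPU!")
print(json.dumps(payload))
# send(payload) returns the completion once the server from the launch section is running.
```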