Sync from v0.13
5
docs/getting_started/installation/.nav.yml
Normal file
@@ -0,0 +1,5 @@
nav:
  - README.md
  - gpu.md
  - cpu.md
  - TPU: https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/
29
docs/getting_started/installation/README.md
Normal file
@@ -0,0 +1,29 @@
|
||||
# Installation
|
||||
|
||||
vLLM supports the following hardware platforms:
|
||||
|
||||
- [GPU](gpu.md)
    - [NVIDIA CUDA](gpu.md#nvidia-cuda)
    - [AMD ROCm](gpu.md#amd-rocm)
    - [Intel XPU](gpu.md#intel-xpu)
- [CPU](cpu.md)
    - [Intel/AMD x86](cpu.md#intelamd-x86)
    - [ARM AArch64](cpu.md#arm-aarch64)
    - [Apple silicon](cpu.md#apple-silicon)
    - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
|
||||
|
||||
## Hardware Plugins
|
||||
|
||||
The backends below live **outside** the main `vllm` repository and follow the [Hardware-Pluggable RFC](../../design/plugin_system.md).
|
||||
|
||||
| Accelerator | PyPI / package | Repository |
|-------------|----------------|------------|
| Google TPU | `tpu-inference` | <https://github.com/vllm-project/tpu-inference> |
| Ascend NPU | `vllm-ascend` | <https://github.com/vllm-project/vllm-ascend> |
| Intel Gaudi (HPU) | N/A, install from source | <https://github.com/vllm-project/vllm-gaudi> |
| MetaX MACA GPU | N/A, install from source | <https://github.com/MetaX-MACA/vLLM-metax> |
| Rebellions ATOM / REBEL NPU | `vllm-rbln` | <https://github.com/rebellions-sw/vllm-rbln> |
| IBM Spyre AIU | `vllm-spyre` | <https://github.com/vllm-project/vllm-spyre> |
| Cambricon MLU | `vllm-mlu` | <https://github.com/Cambricon/vllm-mlu> |
| Baidu Kunlun XPU | N/A, install from source | <https://github.com/baidu/vLLM-Kunlun> |
87
docs/getting_started/installation/cpu.apple.inc.md
Normal file
@@ -0,0 +1,87 @@
|
||||
# --8<-- [start:installation]
|
||||
|
||||
vLLM has experimental support for macOS with Apple Silicon. For now, users must build from source to natively run on macOS.
|
||||
|
||||
Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
|
||||
- OS: `macOS Sonoma` or later
|
||||
- SDK: `Xcode 15.4` or later with Command Line Tools
|
||||
- Compiler: `Apple Clang >= 15.0.0`
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
|
||||
Currently, there are no pre-built Apple silicon CPU wheels.
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
|
||||
After installing Xcode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from source.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/vllm-project/vllm.git
|
||||
cd vllm
|
||||
uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
|
||||
uv pip install -e .
|
||||
```
|
||||
|
||||
!!! tip
|
||||
The `--index-strategy unsafe-best-match` flag is needed to resolve dependencies across multiple package indexes (PyTorch CPU index and PyPI). Without this flag, you may encounter `typing-extensions` version conflicts.
|
||||
|
||||
The term "unsafe" refers to the package resolution strategy, not security. By default, `uv` only searches the first index where a package is found to prevent dependency confusion attacks. This flag allows `uv` to search all configured indexes to find the best compatible versions. Since both PyTorch and PyPI are trusted package sources, using this strategy is safe and appropriate for vLLM installation.
|
||||
|
||||
!!! note
|
||||
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which is currently the only supported device.
|
||||
|
||||
!!! example "Troubleshooting"
|
||||
If the build fails with errors like the following, where standard C++ headers cannot be found, try removing and reinstalling your [Command Line Tools for Xcode](https://developer.apple.com/download/all/).
|
||||
|
||||
```text
|
||||
[...] fatal error: 'map' file not found
|
||||
1 | #include <map>
|
||||
| ^~~~~
|
||||
1 error generated.
|
||||
[2/8] Building CXX object CMakeFiles/_C.dir/csrc/cpu/pos_encoding.cpp.o
|
||||
|
||||
[...] fatal error: 'cstddef' file not found
|
||||
10 | #include <cstddef>
|
||||
| ^~~~~~~~~
|
||||
1 error generated.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
If the build fails with C++11/C++17 compatibility errors like the following, the build system is likely defaulting to an older C++ standard:
|
||||
|
||||
```text
|
||||
[...] error: 'constexpr' is not a type
|
||||
[...] error: expected ';' before 'constexpr'
|
||||
[...] error: 'constexpr' does not name a type
|
||||
```
|
||||
|
||||
**Solution**: Your compiler might be using an older C++ standard. Edit `cmake/cpu_extension.cmake` and add `set(CMAKE_CXX_STANDARD 17)` before `set(CMAKE_CXX_STANDARD_REQUIRED ON)`.
|
||||
|
||||
To check your compiler's C++ standard support:
|
||||
```bash
|
||||
clang++ -std=c++17 -pedantic -dM -E -x c++ /dev/null | grep __cplusplus
|
||||
```
|
||||
On Apple Clang 16 you should see: `#define __cplusplus 201703L`
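The number in the `__cplusplus` macro encodes the language standard the compiler is using. As a rough aid for reading the output of the check above, the sketch below maps the standard values to their names. The helper name `standard_from_define` is made up for this illustration and is not part of vLLM or its build system.

```python
# Values of the __cplusplus macro defined by the ISO C++ standards.
CPLUSPLUS_TO_STANDARD = {
    199711: "C++98/03",
    201103: "C++11",
    201402: "C++14",
    201703: "C++17",
    202002: "C++20",
}


def standard_from_define(line: str) -> str:
    """Translate a '#define __cplusplus <value>L' line into a standard name."""
    value = line.split()[-1].rstrip("Ll")  # drop the trailing long suffix
    return CPLUSPLUS_TO_STANDARD.get(int(value), f"unknown ({value})")


print(standard_from_define("#define __cplusplus 201703L"))  # C++17
```

If the reported standard is older than C++17, the `CMAKE_CXX_STANDARD` change described above is the likely fix.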
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
Currently, there are no pre-built Apple silicon CPU images.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:extra-information]
|
||||
# --8<-- [end:extra-information]
|
||||
153
docs/getting_started/installation/cpu.arm.inc.md
Normal file
@@ -0,0 +1,153 @@
|
||||
# --8<-- [start:installation]
|
||||
|
||||
vLLM offers basic model inferencing and serving on Arm CPU platforms, with support for NEON and the FP32, FP16, and BF16 data types.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
|
||||
- OS: Linux
|
||||
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
|
||||
- Instruction Set Architecture (ISA): NEON support is required
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
|
||||
Pre-built vLLM wheels for Arm are available since version 0.11.2. These wheels contain pre-compiled C++ binaries.
|
||||
|
||||
```bash
|
||||
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
|
||||
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_VERSION}/cpu
|
||||
```
|
||||
|
||||
??? console "pip"
|
||||
```bash
|
||||
pip install vllm==${VLLM_VERSION}+cpu --extra-index-url https://wheels.vllm.ai/${VLLM_VERSION}/cpu
|
||||
```
|
||||
|
||||
The `uv` approach works for vLLM `v0.6.6` and later. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
|
||||
|
||||
**Install the latest code**
|
||||
|
||||
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides working pre-built Arm CPU wheels for every commit since `v0.11.2` on <https://wheels.vllm.ai/nightly>. For native CPU wheels, this index should be used:
|
||||
|
||||
* `https://wheels.vllm.ai/nightly/cpu/vllm`
|
||||
|
||||
To install from the nightly index, run:
|
||||
```bash
|
||||
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu
|
||||
```
|
||||
|
||||
??? console "pip (there's a caveat)"
|
||||
|
||||
Using `pip` to install from nightly indices is _not supported_, because `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version. In contrast, `uv` gives the extra index [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes).
|
||||
|
||||
If you insist on using `pip`, you have to specify the full URL (link address) of the wheel file (which can be obtained from https://wheels.vllm.ai/nightly/cpu/vllm).
|
||||
|
||||
```bash
|
||||
pip install https://wheels.vllm.ai/4fa7ce46f31cbd97b4651694caf9991cc395a259/vllm-0.13.0rc2.dev104%2Bg4fa7ce46f.cpu-cp38-abi3-manylinux_2_35_aarch64.whl # current nightly build (the filename will change!)
|
||||
```
|
||||
|
||||
**Install specific revisions**
|
||||
|
||||
If you want to access the wheels for previous commits (e.g., to bisect a behavior change or a performance regression), you can specify the commit hash in the URL:
|
||||
|
||||
```bash
|
||||
export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit hash from the main branch
|
||||
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu
|
||||
```
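The release, nightly, and per-commit commands above all use the same index URL pattern, `https://wheels.vllm.ai/<ref>/cpu`. The sketch below only illustrates that pattern; the helper name `cpu_wheel_index` is hypothetical and is not provided by vLLM.

```python
import re

WHEELS_HOST = "https://wheels.vllm.ai"


def cpu_wheel_index(ref: str) -> str:
    """Build the CPU wheel index URL for a release tag, "nightly", or a commit hash."""
    # Release tags such as "v0.11.2" are used without the leading "v",
    # mirroring the `sed 's/^v//'` step in the shell commands.
    if re.match(r"v\d", ref):
        ref = ref[1:]
    return f"{WHEELS_HOST}/{ref}/cpu"


print(cpu_wheel_index("v0.11.2"))
print(cpu_wheel_index("nightly"))
print(cpu_wheel_index("730bd35378bf2a5b56b6d3a45be28b3092d26519"))
```

The resulting string is what gets passed to `--extra-index-url`.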
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
|
||||
First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
|
||||
|
||||
```bash
|
||||
sudo apt-get update -y
|
||||
sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
|
||||
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
|
||||
```
|
||||
|
||||
Second, clone the vLLM project:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/vllm-project/vllm.git vllm_source
|
||||
cd vllm_source
|
||||
```
|
||||
|
||||
Third, install required dependencies:
|
||||
|
||||
```bash
|
||||
uv pip install -r requirements/cpu-build.txt --torch-backend cpu
|
||||
uv pip install -r requirements/cpu.txt --torch-backend cpu
|
||||
```
|
||||
|
||||
??? console "pip"
|
||||
```bash
|
||||
pip install --upgrade pip
|
||||
pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
|
||||
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
|
||||
```
|
||||
|
||||
Finally, build and install vLLM:
|
||||
|
||||
```bash
|
||||
VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation
|
||||
```
|
||||
|
||||
If you want to develop vLLM, install it in editable mode instead.
|
||||
|
||||
```bash
|
||||
VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation
|
||||
```
|
||||
|
||||
Testing has been conducted on AWS Graviton3 instances for compatibility.
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image.
|
||||
|
||||
Stable vLLM Docker images are pre-built for Arm starting from version 0.12.0. Available image tags are listed here: [https://gallery.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo).
|
||||
|
||||
```bash
|
||||
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
|
||||
docker pull public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${VLLM_VERSION}
|
||||
```
|
||||
|
||||
You can also access the latest code with Docker images. These are not intended for production use and are meant for CI and testing only. They will expire after several days.
|
||||
|
||||
The latest code can contain bugs and may not be stable. Please use it with caution.
|
||||
|
||||
```bash
|
||||
export VLLM_COMMIT=6299628d326f429eba78736acb44e76749b281f5 # use full commit hash from the main branch
|
||||
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}-arm64-cpu
|
||||
```
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.cpu \
|
||||
--tag vllm-cpu-env .
|
||||
|
||||
# Launching OpenAI server
|
||||
docker run --rm \
|
||||
--privileged=true \
|
||||
--shm-size=4g \
|
||||
-p 8000:8000 \
|
||||
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
|
||||
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
|
||||
vllm-cpu-env \
|
||||
--model=meta-llama/Llama-3.2-1B-Instruct \
|
||||
--dtype=bfloat16 \
|
||||
other vLLM OpenAI server arguments
|
||||
```
|
||||
|
||||
!!! tip
|
||||
An alternative to `--privileged=true` is `--cap-add SYS_NICE --security-opt seccomp=unconfined`.
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:extra-information]
|
||||
# --8<-- [end:extra-information]
|
||||
296
docs/getting_started/installation/cpu.md
Normal file
@@ -0,0 +1,296 @@
|
||||
# CPU
|
||||
|
||||
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor-specific instructions:
|
||||
|
||||
=== "Intel/AMD x86"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:installation"
|
||||
|
||||
=== "ARM AArch64"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:installation"
|
||||
|
||||
=== "Apple silicon"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:installation"
|
||||
|
||||
=== "IBM Z (S390X)"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:installation"
|
||||
|
||||
## Requirements
|
||||
|
||||
- Python: 3.10 -- 3.13
|
||||
|
||||
=== "Intel/AMD x86"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:requirements"
|
||||
|
||||
=== "ARM AArch64"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:requirements"
|
||||
|
||||
=== "Apple silicon"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:requirements"
|
||||
|
||||
=== "IBM Z (S390X)"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:requirements"
|
||||
|
||||
## Set up using Python
|
||||
|
||||
### Create a new Python environment
|
||||
|
||||
--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
|
||||
|
||||
### Pre-built wheels
|
||||
|
||||
When specifying the index URL, please make sure to use the `cpu` variant subdirectory.
|
||||
For example, the nightly build index is: `https://wheels.vllm.ai/nightly/cpu/`.
|
||||
|
||||
=== "Intel/AMD x86"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-wheels"
|
||||
|
||||
=== "ARM AArch64"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:pre-built-wheels"
|
||||
|
||||
=== "Apple silicon"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:pre-built-wheels"
|
||||
|
||||
=== "IBM Z (S390X)"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:pre-built-wheels"
|
||||
|
||||
### Build wheel from source
|
||||
|
||||
#### Set up using Python-only build (without compilation) {#python-only-build}
|
||||
|
||||
Please refer to the instructions for [Python-only build on GPU](./gpu.md#python-only-build), and replace the build commands with:
|
||||
|
||||
```bash
|
||||
VLLM_USE_PRECOMPILED=1 VLLM_PRECOMPILED_WHEEL_VARIANT=cpu VLLM_TARGET_DEVICE=cpu uv pip install --editable .
|
||||
```
|
||||
|
||||
#### Full build (with compilation) {#full-build}
|
||||
|
||||
=== "Intel/AMD x86"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-wheel-from-source"
|
||||
|
||||
=== "ARM AArch64"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-wheel-from-source"
|
||||
|
||||
=== "Apple silicon"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:build-wheel-from-source"
|
||||
|
||||
=== "IBM Z (S390X)"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-wheel-from-source"
|
||||
|
||||
## Set up using Docker
|
||||
|
||||
### Pre-built images
|
||||
|
||||
=== "Intel/AMD x86"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-images"
|
||||
|
||||
=== "ARM AArch64"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:pre-built-images"
|
||||
|
||||
=== "Apple silicon"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:pre-built-images"
|
||||
|
||||
=== "IBM Z (S390X)"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:pre-built-images"
|
||||
|
||||
### Build image from source
|
||||
|
||||
=== "Intel/AMD x86"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-image-from-source"
|
||||
|
||||
=== "ARM AArch64"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
|
||||
|
||||
=== "Apple silicon"
|
||||
|
||||
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
|
||||
|
||||
=== "IBM Z (S390X)"
|
||||
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-image-from-source"
|
||||
|
||||
## Related runtime environment variables
|
||||
|
||||
- `VLLM_CPU_KVCACHE_SPACE`: specifies the KV cache size in GiB (e.g., `VLLM_CPU_KVCACHE_SPACE=40` reserves 40 GiB for the KV cache). A larger value lets vLLM run more requests in parallel. Set it based on your hardware configuration and memory usage pattern. Default value is `0`.
- `VLLM_CPU_OMP_THREADS_BIND`: specifies the CPU cores dedicated to the OpenMP threads. It can be set to CPU ID lists, `auto` (the default), or `nobind` (to disable binding to individual CPU cores and inherit user-defined OpenMP variables). For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` creates 32 OpenMP threads bound to CPU cores 0-31. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` creates 2 tensor parallel processes: the 32 OpenMP threads of rank 0 are bound to CPU cores 0-31, and the OpenMP threads of rank 1 are bound to CPU cores 32-63. With `auto`, the OpenMP threads of each rank are bound to the CPU cores of its own NUMA node. With `nobind`, the number of OpenMP threads is determined by the standard `OMP_NUM_THREADS` environment variable.
- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specifies the number of CPU cores per rank that are not dedicated to the OpenMP threads. The variable only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`. Default value is `None`. If the value is not set and `auto` thread binding is used, no CPU is reserved when `world_size == 1`, and 1 CPU per rank is reserved when `world_size > 1`.
- `CPU_VISIBLE_MEMORY_NODES`: specifies the NUMA memory nodes visible to vLLM CPU workers, similar to `CUDA_VISIBLE_DEVICES`. The variable only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`. It gives more control over the auto thread-binding feature, such as masking nodes and changing the node binding sequence.
- `VLLM_CPU_SGL_KERNEL` (x86 only, experimental): whether to use small-batch optimized kernels for linear layers and MoE layers, especially for low-latency scenarios such as online serving. The kernels require the AMX instruction set, BFloat16 weights, and weight shapes divisible by 32. Default is `0` (False).
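To make the `VLLM_CPU_OMP_THREADS_BIND` format more concrete, the following sketch expands a value such as `0-31|32-63` into one list of core IDs per rank. It is an illustration only and not vLLM's own parser. The function name `expand_thread_bind` is hypothetical, and the sketch assumes that ranks are separated by `|` and that each rank is a comma-separated list of core IDs or inclusive `start-end` ranges.

```python
def expand_thread_bind(value: str) -> list[list[int]]:
    """Expand e.g. "0-31|32-63" into one list of core IDs per rank.

    Illustrative only: assumes "|" separates ranks and each rank is a
    comma-separated list of core IDs or inclusive "start-end" ranges.
    """
    if value in ("auto", "nobind"):
        raise ValueError(f"{value!r} selects a binding mode, not a core list")
    ranks = []
    for rank_spec in value.split("|"):
        cores = []
        for part in rank_spec.split(","):
            if "-" in part:
                start, end = (int(x) for x in part.split("-"))
                cores.extend(range(start, end + 1))  # ranges are inclusive
            else:
                cores.append(int(part))
        ranks.append(cores)
    return ranks


ranks = expand_thread_bind("0-31|32-63")
print(len(ranks), len(ranks[0]), ranks[1][0], ranks[1][-1])  # 2 32 32 63
```

Each inner list corresponds to one rank, so its length is the number of OpenMP threads that rank would run.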
|
||||
|
||||
## FAQ
|
||||
|
||||
### Which `dtype` should be used?
|
||||
|
||||
- Currently, vLLM CPU uses the model's default setting as `dtype`. However, due to unstable float16 support in torch CPU, it is recommended to explicitly set `dtype=bfloat16` if there are any performance or accuracy problems.
|
||||
|
||||
### How to launch a vLLM service on CPU?
|
||||
|
||||
- For online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPU 31 for the framework and use CPUs 0-30 for inference threads:
|
||||
|
||||
```bash
|
||||
export VLLM_CPU_KVCACHE_SPACE=40
|
||||
export VLLM_CPU_OMP_THREADS_BIND=0-30
|
||||
vllm serve facebook/opt-125m --dtype=bfloat16
|
||||
```
|
||||
|
||||
or using default auto thread binding:
|
||||
|
||||
```bash
|
||||
export VLLM_CPU_KVCACHE_SPACE=40
|
||||
export VLLM_CPU_NUM_OF_RESERVED_CPU=1
|
||||
vllm serve facebook/opt-125m --dtype=bfloat16
|
||||
```
|
||||
|
||||
Note that it is recommended to manually reserve 1 CPU for the vLLM front-end process when `world_size == 1`.
|
||||
|
||||
### What are supported models on CPU?
|
||||
|
||||
For the full and up-to-date list of models validated on CPU platforms, please see the official documentation: [Supported Models on CPU](https://docs.vllm.ai/en/latest/models/hardware_supported_models/cpu)
|
||||
|
||||
### How to find benchmark configuration examples for supported CPU models?
|
||||
|
||||
For any model listed under [Supported Models on CPU](https://docs.vllm.ai/en/latest/models/hardware_supported_models/cpu), optimized runtime configurations are provided in the vLLM Benchmark Suite’s [CPU test cases](https://github.com/vllm-project/vllm/blob/main/.buildkite/performance-benchmarks/tests/serving-tests-cpu.json).
|
||||
For details on how these optimized configurations are determined, see: [performance-benchmark-details](https://github.com/vllm-project/vllm/tree/main/.buildkite/performance-benchmarks#performance-benchmark-details).
|
||||
To benchmark the supported models using these optimized settings, follow the steps in [running vLLM Benchmark Suite manually](https://docs.vllm.ai/en/latest/contributing/benchmarks/#manually-trigger-the-benchmark) and run the Benchmark Suite on a CPU environment.
|
||||
|
||||
Below is an example command to benchmark all CPU-supported models using optimized configurations.
|
||||
|
||||
```bash
|
||||
ON_CPU=1 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
|
||||
```
|
||||
|
||||
The benchmark results will be saved in `./benchmark/results/`.
|
||||
In the directory, the generated `.commands` files contain all example commands for the benchmark.
|
||||
|
||||
We recommend configuring tensor-parallel-size to match the number of NUMA nodes on your system. Note that the current release does not support tensor-parallel-size=6.
|
||||
To determine the number of NUMA nodes available, use the following command:
|
||||
|
||||
```bash
|
||||
lscpu | grep "NUMA node(s):" | awk '{print $3}'
|
||||
```
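The same lookup can be done from Python when choosing `tensor-parallel-size` in a launch script. The sketch below parses `lscpu`-style text that is embedded as a sample so it runs anywhere. The helper names are hypothetical, and rejecting a value of 6 simply mirrors the note above about the current release.

```python
import re


def numa_node_count(lscpu_output: str) -> int:
    """Read the 'NUMA node(s):' value from `lscpu` output."""
    match = re.search(r"^\s*NUMA node\(s\):\s*(\d+)", lscpu_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'NUMA node(s):' line found")
    return int(match.group(1))


def suggested_tp_size(lscpu_output: str) -> int:
    """Suggest a tensor-parallel size equal to the NUMA node count."""
    nodes = numa_node_count(lscpu_output)
    if nodes == 6:
        # The text above notes that tensor-parallel-size=6 is not supported.
        raise ValueError("tensor-parallel-size=6 is not supported; pick another value")
    return nodes


# Shortened sample of `lscpu` output.
SAMPLE = """\
Architecture:        x86_64
NUMA node(s):        2
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
"""

print(suggested_tp_size(SAMPLE))  # 2
```

On a real machine, the sample string would be replaced by the captured output of `lscpu`.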
|
||||
|
||||
For performance reference, users may also consult the [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm&deviceName=cpu), which publishes default-model CPU results produced using the same Benchmark Suite.
|
||||
|
||||
### How to decide `VLLM_CPU_OMP_THREADS_BIND`?
|
||||
|
||||
- The default `auto` thread binding is recommended for most cases. Ideally, each OpenMP thread is bound to a dedicated physical core, the threads of each rank are bound to the same NUMA node, and 1 CPU per rank is reserved for other vLLM components when `world_size > 1`. If you see performance problems or unexpected binding behavior, try binding threads manually as follows.
|
||||
|
||||
- On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
|
||||
|
||||
??? console "Commands"
|
||||
|
||||
```console
|
||||
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores
|
||||
|
||||
# The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core.
|
||||
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
|
||||
0 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000
|
||||
1 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000
|
||||
2 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000
|
||||
3 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000
|
||||
4 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000
|
||||
5 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000
|
||||
6 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
|
||||
7 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
|
||||
8 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000
|
||||
9 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000
|
||||
10 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000
|
||||
11 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000
|
||||
12 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000
|
||||
13 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000
|
||||
14 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
|
||||
15 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
|
||||
|
||||
# On this platform, it is recommended to only bind openMP threads on logical CPU cores 0-7 or 8-15
|
||||
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
|
||||
$ python examples/offline_inference/basic/basic.py
|
||||
```
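Picking one logical CPU per physical core, as in the listing above, can also be scripted. The following sketch parses text in the layout printed by `lscpu -e` and keeps the lowest logical CPU ID of each physical core. The function name is hypothetical, the sample is shortened, and the code assumes that `CORE` IDs are unique across the whole machine.

```python
def one_cpu_per_core(lscpu_extended: str) -> list[int]:
    """Return the lowest logical CPU ID for each physical core.

    Expects a header row containing at least the "CPU" and "CORE"
    columns, as printed by `lscpu -e`.
    """
    rows = [line.split() for line in lscpu_extended.strip().splitlines()]
    header, body = rows[0], rows[1:]
    cpu_col, core_col = header.index("CPU"), header.index("CORE")

    first_cpu_of_core: dict[int, int] = {}
    for row in body:
        cpu, core = int(row[cpu_col]), int(row[core_col])
        # Keep the smallest logical CPU ID seen for this physical core.
        first_cpu_of_core[core] = min(cpu, first_cpu_of_core.get(core, cpu))
    return sorted(first_cpu_of_core.values())


# Shortened sample: 4 logical CPUs sharing 2 physical cores.
SAMPLE = """\
CPU NODE SOCKET CORE
0   0    0      0
1   0    0      1
2   0    0      0
3   0    0      1
"""

print(one_cpu_per_core(SAMPLE))  # [0, 1]
```

For the 16-CPU platform shown above, the same logic would select logical CPUs 0-7, which matches the recommended binding.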
|
||||
|
||||
- When deploying the vLLM CPU backend on a multi-socket machine with NUMA and enabling tensor parallel or pipeline parallel, each NUMA node is treated as a TP/PP rank. Make sure the CPU cores of a single rank are on the same NUMA node to avoid cross-NUMA-node memory access.
|
||||
|
||||
### How to decide `VLLM_CPU_KVCACHE_SPACE`?
|
||||
|
||||
This value is 4GB by default. A larger space supports more concurrent requests and longer context lengths. However, keep the memory capacity of each NUMA node in mind: the memory usage of each TP rank is the sum of its `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`. If that sum exceeds the capacity of a single NUMA node, the TP worker is killed with `exitcode 9` due to out-of-memory.
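The capacity rule above is simple arithmetic, sketched below. The function names are hypothetical, the model weights are assumed to be split evenly across TP ranks, and other memory overhead (activations, the Python runtime) is ignored, so treat the result as a lower bound rather than a guarantee.

```python
def rank_memory_gib(weight_shard_gib: float, kv_cache_space_gib: float) -> float:
    """Memory used by one TP rank: weight shard plus KV cache space."""
    return weight_shard_gib + kv_cache_space_gib


def fits_numa_node(
    total_weight_gib: float,
    tp_size: int,
    kv_cache_space_gib: float,
    numa_node_gib: float,
) -> bool:
    """Check whether one rank fits into the memory of a single NUMA node."""
    # Assumption: weights are sharded evenly across the TP ranks.
    weight_shard_gib = total_weight_gib / tp_size
    return rank_memory_gib(weight_shard_gib, kv_cache_space_gib) <= numa_node_gib


print(fits_numa_node(140, 2, 40, 128))  # True: 70 + 40 = 110 GiB per rank
print(fits_numa_node(140, 1, 40, 128))  # False: 140 + 40 = 180 GiB
```

If the check fails, either lower `VLLM_CPU_KVCACHE_SPACE` or spread the model over more NUMA nodes.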
|
||||
|
||||
### How to do performance tuning for vLLM CPU?
|
||||
|
||||
First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU cores usage via `htop`.
|
||||
|
||||
Use multiples of 32 as `--block-size`, which is 128 by default.
|
||||
|
||||
Inference batch size is an important parameter for the performance. A larger batch usually provides higher throughput, a smaller batch provides lower latency. Tuning the max batch size starting from the default value to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM:
|
||||
|
||||
- `--max-num-batched-tokens` limits the number of tokens in a single batch and has more impact on first-token performance. The default value is:
    - Offline inference: `4096 * world_size`
    - Online serving: `2048 * world_size`
- `--max-num-seqs` limits the number of sequences in a single batch and has more impact on output-token performance. The default value is:
    - Offline inference: `256 * world_size`
    - Online serving: `128 * world_size`
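As a starting point for tuning, the defaults listed above can be computed for a given `world_size`. The sketch below only restates those numbers; the table and function names are hypothetical and are not read from vLLM itself.

```python
# Per-rank defaults as listed above, keyed by usage mode.
_DEFAULTS_PER_RANK = {
    "offline": {"max_num_batched_tokens": 4096, "max_num_seqs": 256},
    "online": {"max_num_batched_tokens": 2048, "max_num_seqs": 128},
}


def default_batch_limits(mode: str, world_size: int) -> dict[str, int]:
    """Scale the per-rank defaults by the number of ranks."""
    per_rank = _DEFAULTS_PER_RANK[mode]
    return {name: value * world_size for name, value in per_rank.items()}


print(default_batch_limits("online", 2))
# {'max_num_batched_tokens': 4096, 'max_num_seqs': 256}
```

From these values, raise the limits to favor throughput or lower them to favor latency.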
|
||||
|
||||
vLLM CPU supports data parallel (DP), tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more details of tuning DP, TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommended to use DP, TP and PP together if there are enough CPU sockets and memory nodes.
|
||||
|
||||
### Which quantization configs does vLLM CPU support?
|
||||
|
||||
- vLLM CPU supports the following quantization methods:
|
||||
- AWQ (x86 only)
|
||||
- GPTQ (x86 only)
|
||||
- compressed-tensor INT8 W8A8 (x86, s390x)
|
||||
|
||||
### (x86 only) What is the purpose of `VLLM_CPU_SGL_KERNEL`?
|
||||
|
||||
- It requires the `amx` CPU flag.
|
||||
- `VLLM_CPU_SGL_KERNEL` can provide better performance for MoE models and small-batch scenarios.
|
||||
|
||||
### Why do I see `get_mempolicy: Operation not permitted` when running in Docker?
|
||||
|
||||
In some container environments (like Docker), NUMA-related syscalls used by vLLM (e.g., `get_mempolicy`, `migrate_pages`) are blocked/denied in the runtime's default seccomp/capabilities settings. This may lead to warnings like `get_mempolicy: Operation not permitted`. Functionality is not affected, but NUMA memory binding/migration optimizations may not take effect and performance can be suboptimal.
|
||||
|
||||
To enable these optimizations inside Docker with the least privilege, use the following options:
|
||||
|
||||
```bash
|
||||
docker run ... --cap-add SYS_NICE --security-opt seccomp=unconfined ...
|
||||
|
||||
# 1) `--cap-add SYS_NICE` is to address `get_mempolicy` EPERM issue.
|
||||
|
||||
# 2) `--security-opt seccomp=unconfined` is to enable `migrate_pages` for `numa_migrate_pages()`.
|
||||
# Actually, `seccomp=unconfined` bypasses the seccomp for container,
|
||||
# if it's unacceptable, you can customize your own seccomp profile,
|
||||
# based on docker/runtime default.json and add `migrate_pages` to `SCMP_ACT_ALLOW` list.
|
||||
|
||||
# reference : https://docs.docker.com/engine/security/seccomp/
|
||||
```
|
||||
|
||||
Alternatively, running with `--privileged=true` also works but is broader and not generally recommended.
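If `seccomp=unconfined` is not acceptable, the custom-profile route mentioned in the comments above can be scripted. The sketch below adds `migrate_pages` to an allow rule of a seccomp profile that is already loaded as a dictionary. The `BASE_PROFILE` shown here is a tiny stand-in with the same top-level shape as a Docker seccomp profile, not Docker's real default, and the function name is hypothetical.

```python
import copy
import json


def allow_syscall(profile: dict, name: str) -> dict:
    """Return a copy of a seccomp profile in which `name` is allowed."""
    updated = copy.deepcopy(profile)
    for rule in updated.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW" and name in rule.get("names", []):
            return updated  # already allowed, nothing to add
    updated.setdefault("syscalls", []).append(
        {"names": [name], "action": "SCMP_ACT_ALLOW"}
    )
    return updated


# Stand-in profile for illustration; start from Docker's default profile in practice.
BASE_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
}

print(json.dumps(allow_syscall(BASE_PROFILE, "migrate_pages"), indent=2))
```

The resulting JSON can be saved to a file and passed to Docker with `--security-opt seccomp=<path to profile>` in place of `seccomp=unconfined`.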
|
||||
|
||||
In Kubernetes, the following configuration can be added to the workload YAML to achieve the same effect:
|
||||
|
||||
```yaml
|
||||
securityContext:
  seccompProfile:
    type: Unconfined
  capabilities:
    add:
    - SYS_NICE
|
||||
```
|
||||
98
docs/getting_started/installation/cpu.s390x.inc.md
Normal file
@@ -0,0 +1,98 @@
|
||||
# --8<-- [start:installation]
|
||||
|
||||
vLLM has experimental support for s390x architecture on IBM Z platform. For now, users must build from source to natively run on IBM Z platform.
|
||||
|
||||
Currently, the CPU implementation for s390x architecture supports FP32 datatype only.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
|
||||
- OS: `Linux`
|
||||
- Compiler: `gcc/g++ >= 12.3.0`
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
- Python packages to build and install: `pyarrow`, `torch`, and `torchvision`
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
|
||||
Currently, there are no pre-built IBM Z CPU wheels.
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
|
||||
Install the following packages from the package manager before building vLLM. For example, on RHEL 9.4:
|
||||
|
||||
```bash
|
||||
dnf install -y \
|
||||
which procps findutils tar vim git gcc g++ make patch cython zlib-devel \
|
||||
libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
|
||||
openssl-devel openblas openblas-devel wget autoconf automake libtool cmake numactl-devel
|
||||
```
|
||||
|
||||
Install Rust >= 1.80, which is needed to install the `outlines-core` and `uvloop` Python packages.
|
||||
|
||||
```bash
|
||||
curl https://sh.rustup.rs -sSf | sh -s -- -y && \
|
||||
. "$HOME/.cargo/env"
|
||||
```
|
||||
|
||||
Execute the following commands to build and install vLLM from source.
|
||||
|
||||
!!! tip
|
||||
Build the `torchvision` and `pyarrow` dependencies from source before building vLLM.
|
||||
|
||||
```bash
|
||||
sed -i '/^torch/d' requirements/build.txt # remove torch from requirements/build.txt since we use nightly builds
|
||||
uv pip install -v \
|
||||
--torch-backend auto \
|
||||
-r requirements/build.txt \
|
||||
-r requirements/cpu.txt && \
|
||||
VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \
|
||||
uv pip install dist/*.whl
|
||||
```
|
||||
|
||||
??? console "pip"
|
||||
```bash
|
||||
sed -i '/^torch/d' requirements/build.txt # remove torch from requirements/build.txt since we use nightly builds
|
||||
pip install -v \
|
||||
--extra-index-url https://download.pytorch.org/whl/nightly/cpu \
|
||||
-r requirements/build.txt \
|
||||
-r requirements/cpu.txt && \
|
||||
VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \
|
||||
pip install dist/*.whl
|
||||
```
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
Currently, there are no pre-built IBM Z CPU images.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.s390x \
|
||||
--tag vllm-cpu-env .
|
||||
|
||||
# Launch OpenAI server
|
||||
docker run --rm \
|
||||
--privileged=true \
|
||||
--shm-size 4g \
|
||||
-p 8000:8000 \
|
||||
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
|
||||
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
|
||||
vllm-cpu-env \
|
||||
--model meta-llama/Llama-3.2-1B-Instruct \
|
||||
--dtype float \
|
||||
other vLLM OpenAI server arguments
|
||||
```
|
||||
|
||||
!!! tip
|
||||
An alternative to `--privileged=true` is `--cap-add SYS_NICE --security-opt seccomp=unconfined`.
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:extra-information]
|
||||
# --8<-- [end:extra-information]
|
||||
135
docs/getting_started/installation/cpu.x86.inc.md
Normal file
@@ -0,0 +1,135 @@
|
||||
# --8<-- [start:installation]
|
||||
|
||||
vLLM supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16, and BF16.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
|
||||
- OS: Linux
|
||||
- CPU flags: `avx512f` (Recommended), `avx512_bf16` (Optional), `avx512_vnni` (Optional)
|
||||
|
||||
!!! tip
|
||||
Use `lscpu` to check the CPU flags.
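For a scripted version of this check, here is a minimal Python sketch. It assumes a Linux host that exposes the flags in `/proc/cpuinfo`-style text. The helper names `parse_cpu_flags` and `report_flags` are hypothetical and not part of vLLM.

```python
# Flag names are taken from the requirements list above.
RECOMMENDED_FLAGS = ("avx512f",)
OPTIONAL_FLAGS = ("avx512_bf16", "avx512_vnni")


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Collect the entries of every `flags` line in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def report_flags(cpuinfo_text: str) -> dict:
    """Map each flag this page mentions to whether the CPU advertises it."""
    present = parse_cpu_flags(cpuinfo_text)
    return {name: name in present for name in RECOMMENDED_FLAGS + OPTIONAL_FLAGS}
```

On a Linux machine, passing the contents of `/proc/cpuinfo` to `report_flags` gives a quick yes/no answer for each flag.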
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
|
||||
Currently, there are no pre-built x86 CPU wheels.
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
|
||||
Install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
|
||||
|
||||
```bash
|
||||
sudo apt-get update -y
|
||||
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
|
||||
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
|
||||
```
|
||||
|
||||
Clone the vLLM project:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/vllm-project/vllm.git vllm_source
|
||||
cd vllm_source
|
||||
```
|
||||
|
||||
Install the required dependencies:
|
||||
|
||||
```bash
|
||||
uv pip install -r requirements/cpu-build.txt --torch-backend cpu
|
||||
uv pip install -r requirements/cpu.txt --torch-backend cpu
|
||||
```
|
||||
|
||||
??? console "pip"
|
||||
```bash
|
||||
pip install --upgrade pip
|
||||
pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
|
||||
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
|
||||
```
|
||||
|
||||
Build and install vLLM:
|
||||
|
||||
```bash
|
||||
VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation
|
||||
```
|
||||
|
||||
If you want to develop vLLM, install it in editable mode instead.
|
||||
|
||||
```bash
|
||||
VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation
|
||||
```
|
||||
|
||||
Optionally, build a portable wheel which you can then install elsewhere:
|
||||
|
||||
```bash
|
||||
VLLM_TARGET_DEVICE=cpu uv build --wheel
|
||||
```
|
||||
|
||||
```bash
|
||||
uv pip install dist/*.whl
|
||||
```
|
||||
|
||||
??? console "pip"
|
||||
```bash
|
||||
VLLM_TARGET_DEVICE=cpu python -m build --wheel --no-isolation
|
||||
```
|
||||
|
||||
```bash
|
||||
pip install dist/*.whl
|
||||
```
|
||||
|
||||
!!! example "Troubleshooting"
|
||||
- **NumPy ≥2.0 error**: Downgrade using `pip install "numpy<2.0"`.
|
||||
- **CMake picks up CUDA**: Add `CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON` to prevent CUDA detection during CPU builds, even if CUDA is installed.
|
||||
- `AMD` processors must be 4th gen (Zen 4/Genoa) or newer to support [AVX512](https://www.phoronix.com/review/amd-zen4-avx512) when running vLLM on CPU.
|
||||
- If you receive an error such as: `Could not find a version that satisfies the requirement torch==X.Y.Z+cpu+cpu`, consider updating [pyproject.toml](https://github.com/vllm-project/vllm/blob/main/pyproject.toml) to help pip resolve the dependency.
|
||||
```toml title="pyproject.toml"
|
||||
[build-system]
|
||||
requires = [
|
||||
"cmake>=3.26.1",
|
||||
...
|
||||
"torch==X.Y.Z+cpu" # <-------
|
||||
]
|
||||
```
|
||||
- If you are building vLLM from source and not using the pre-built images, remember to set `LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"` on x86 machines before running vLLM.
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
[https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
|
||||
|
||||
!!! warning
|
||||
If deploying the pre-built images on machines without `avx512f`, `avx512_bf16`, or `avx512_vnni` support, an `Illegal instruction` error may be raised. It is recommended to build images for these machines with the appropriate build arguments (e.g., `--build-arg VLLM_CPU_DISABLE_AVX512=true`, `--build-arg VLLM_CPU_AVX512BF16=false`, or `--build-arg VLLM_CPU_AVX512VNNI=false`) to disable unsupported features. Please note that without `avx512f`, AVX2 will be used and this version is not recommended because it only has basic feature support.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
```bash
|
||||
# Each VLLM_CPU_* build argument below defaults to false; set it to true to change it.
docker build -f docker/Dockerfile.cpu \
    --build-arg VLLM_CPU_AVX512BF16=false \
    --build-arg VLLM_CPU_AVX512VNNI=false \
    --build-arg VLLM_CPU_DISABLE_AVX512=false \
    --tag vllm-cpu-env \
    --target vllm-openai .
|
||||
|
||||
# Launching OpenAI server
|
||||
docker run --rm \
|
||||
--security-opt seccomp=unconfined \
|
||||
--cap-add SYS_NICE \
|
||||
--shm-size=4g \
|
||||
-p 8000:8000 \
|
||||
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
|
||||
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
|
||||
vllm-cpu-env \
|
||||
--model=meta-llama/Llama-3.2-1B-Instruct \
|
||||
--dtype=bfloat16 \
|
||||
other vLLM OpenAI server arguments
|
||||
```
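Choosing the three build arguments can be scripted from the CPU flags. The sketch below is one possible policy and not an official vLLM tool: it enables an optional feature only when the CPU advertises the matching flag, and disables AVX512 entirely when `avx512f` is missing. The function name `cpu_build_args` is hypothetical.

```python
def cpu_build_args(flags: set) -> list:
    """Return `docker build` arguments for a CPU with the given flag set."""

    def arg(name: str, enabled: bool) -> str:
        return f"--build-arg {name}={'true' if enabled else 'false'}"

    has_avx512 = "avx512f" in flags
    return [
        # Without avx512f the image falls back to AVX2, as the warning above notes.
        arg("VLLM_CPU_DISABLE_AVX512", not has_avx512),
        # Assumption: the optional features are only useful on top of avx512f.
        arg("VLLM_CPU_AVX512BF16", has_avx512 and "avx512_bf16" in flags),
        arg("VLLM_CPU_AVX512VNNI", has_avx512 and "avx512_vnni" in flags),
    ]
```

The returned strings can be joined with spaces and appended to the `docker build` command line.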
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:extra-information]
|
||||
# --8<-- [end:extra-information]
|
||||
17
docs/getting_started/installation/device.template.md
Normal file
@@ -0,0 +1,17 @@
|
||||
# Installation
|
||||
|
||||
## Requirements
|
||||
|
||||
## Set up using Python
|
||||
|
||||
### Pre-built wheels
|
||||
|
||||
### Build wheel from source
|
||||
|
||||
## Set up using Docker
|
||||
|
||||
### Pre-built images
|
||||
|
||||
### Build image from source
|
||||
|
||||
## Extra information
|
||||
264
docs/getting_started/installation/gpu.cuda.inc.md
Normal file
@@ -0,0 +1,264 @@
|
||||
# --8<-- [start:installation]
|
||||
|
||||
vLLM contains pre-compiled C++ and CUDA (12.9) binaries.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
|
||||
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
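As a rough illustration, the requirement can be expressed as a tuple comparison in Python. The helper name `meets_min_capability` is hypothetical. The sketch only compares numbers and does not query the GPU itself.

```python
MIN_COMPUTE_CAPABILITY = (7, 0)  # from the requirement above


def meets_min_capability(capability) -> bool:
    """Return True if a (major, minor) compute capability is 7.0 or higher."""
    return tuple(capability) >= MIN_COMPUTE_CAPABILITY
```

With PyTorch installed, `torch.cuda.get_device_capability(0)` returns such a `(major, minor)` tuple for the first device, so its result can be passed in directly.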
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
|
||||
!!! note
|
||||
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details.
|
||||
|
||||
To be performant, vLLM has to compile many CUDA kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different build configurations.
|
||||
|
||||
Therefore, it is recommended to install vLLM in a **fresh** environment. If you have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-wheel-from-source) for more details.
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
|
||||
```bash
|
||||
uv pip install vllm --torch-backend=auto
|
||||
```
|
||||
|
||||
??? console "pip"
|
||||
```bash
|
||||
# Install vLLM with CUDA 12.9.
|
||||
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
|
||||
```
|
||||
|
||||
We recommend leveraging `uv` to [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu128`), set `--torch-backend=cu128` (or `UV_TORCH_BACKEND=cu128`). If this doesn't work, try running `uv self update` to update `uv` first.
|
||||
|
||||
!!! note
|
||||
NVIDIA Blackwell GPUs (B200, GB200) require a minimum of CUDA 12.8, so make sure you are installing PyTorch wheels with at least that version. PyTorch itself offers a [dedicated interface](https://pytorch.org/get-started/locally/) to determine the appropriate pip command to run for a given target configuration.
|
||||
|
||||
As of now, vLLM's binaries are compiled with CUDA 12.9 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.8, 13.0, and public PyTorch release versions:
|
||||
|
||||
```bash
|
||||
# Install vLLM with a specific CUDA version (e.g., 13.0).
|
||||
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
|
||||
export CUDA_VERSION=130 # or other
|
||||
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux_2_31_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
|
||||
```
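The URL in the command above follows a fixed pattern, so it can also be assembled programmatically. This Python sketch mirrors that pattern. The function names are hypothetical, and the version used in the usage note is only an example value.

```python
def release_wheel_url(version: str, cuda: str) -> str:
    """Build the release wheel URL for a vLLM version and CUDA variant.

    `version` may carry a leading "v" (as in a Git tag); `cuda` is the
    digits-only variant such as "130".
    """
    version = version[1:] if version.startswith("v") else version  # like sed 's/^v//'
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{version}/vllm-{version}+cu{cuda}-cp38-abi3-manylinux_2_31_x86_64.whl"
    )


def torch_index_url(cuda: str) -> str:
    """Build the matching PyTorch extra index URL."""
    return f"https://download.pytorch.org/whl/cu{cuda}"
```

For example, `release_wheel_url("v0.13.0", "130")` yields the wheel path for a hypothetical `0.13.0` release built against CUDA 13.0.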
|
||||
|
||||
#### Install the latest code
|
||||
|
||||
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for every commit since `v0.5.3` on <https://wheels.vllm.ai/nightly>. There are multiple indices that could be used:
|
||||
|
||||
* `https://wheels.vllm.ai/nightly`: the default variant (CUDA with version specified in `VLLM_MAIN_CUDA_VERSION`) built with the last commit on the `main` branch. Currently it is CUDA 12.9.
|
||||
* `https://wheels.vllm.ai/nightly/<variant>`: all other variants. Currently this includes `cu130` and `cpu`. The default variant (`cu129`) also has a subdirectory for consistency.
|
||||
|
||||
To install from nightly index, run:
|
||||
|
||||
```bash
|
||||
uv pip install -U vllm \
|
||||
--torch-backend=auto \
|
||||
--extra-index-url https://wheels.vllm.ai/nightly # add variant subdirectory here if needed
|
||||
```
|
||||
|
||||
!!! warning "`pip` caveat"
|
||||
|
||||
Using `pip` to install from nightly indices is _not supported_, because `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version. In contrast, `uv` gives the extra index [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes).
|
||||
|
||||
If you insist on using `pip`, you have to specify the full URL of the wheel file (which can be obtained from the web page).
|
||||
|
||||
```bash
|
||||
pip install -U https://wheels.vllm.ai/nightly/vllm-0.11.2.dev399%2Bg3c7461c18-cp38-abi3-manylinux_2_31_x86_64.whl # current nightly build (the filename will change!)
|
||||
pip install -U https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-0.11.2.dev399%2Bg3c7461c18-cp38-abi3-manylinux_2_31_x86_64.whl # from specific commit
|
||||
```
|
||||
|
||||
##### Install specific revisions
|
||||
|
||||
If you want to access the wheels for previous commits (e.g., to bisect a behavior change or performance regression), you can specify the commit hash in the URL:
|
||||
|
||||
```bash
|
||||
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
|
||||
uv pip install vllm \
|
||||
--torch-backend=auto \
|
||||
--extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} # add variant subdirectory here if needed
|
||||
```
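The nightly and per-commit indices share one URL layout: host, then `nightly` or a full commit hash, then an optional variant subdirectory. A small Python sketch of that layout follows. The function name `wheel_index_url` is hypothetical, and rejecting abbreviated hashes is an assumption based on the "use full commit hash" comment above.

```python
import re

WHEEL_HOST = "https://wheels.vllm.ai"
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def wheel_index_url(ref: str = "nightly", variant: str = "") -> str:
    """Return the extra index URL for `nightly` or a full commit hash."""
    if ref != "nightly" and not _FULL_SHA.fullmatch(ref):
        raise ValueError("expected 'nightly' or a full 40-character commit hash")
    url = f"{WHEEL_HOST}/{ref}"
    # The variant (for example "cu130" or "cpu") is an optional subdirectory.
    return f"{url}/{variant}" if variant else url
```

The result is meant to be passed to `uv pip install` via `--extra-index-url`.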
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
|
||||
#### Set up using Python-only build (without compilation) {#python-only-build}
|
||||
|
||||
If you only need to change Python code, you can build and install vLLM without compilation. Using `uv pip`'s [`--editable` flag](https://docs.astral.sh/uv/pip/packages/#editable-packages), changes you make to the code will be reflected when you run vLLM:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/vllm-project/vllm.git
|
||||
cd vllm
|
||||
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
|
||||
```
|
||||
|
||||
This command will do the following:
|
||||
|
||||
1. Look for the current branch in your vLLM clone.
|
||||
1. Identify the corresponding base commit in the main branch.
|
||||
1. Download the pre-built wheel of the base commit.
|
||||
1. Use its compiled libraries in the installation.
|
||||
|
||||
!!! note
|
||||
1. If you change C++ or kernel code, you cannot use Python-only build; otherwise you will see an import error about library not found or undefined symbol.
|
||||
2. If you rebase your dev branch, it is recommended to uninstall vllm and re-run the above command to make sure your libraries are up to date.
|
||||
|
||||
If you see a wheel-not-found error when running the above command, the commit your branch is based on may have just been merged and its wheel may still be building. In that case, wait about an hour and try again, or manually select an earlier commit with the `VLLM_PRECOMPILED_WHEEL_COMMIT` environment variable.
|
||||
|
||||
```bash
|
||||
export VLLM_PRECOMPILED_WHEEL_COMMIT=$(git rev-parse HEAD~1) # or earlier commit on main
|
||||
export VLLM_USE_PRECOMPILED=1
|
||||
uv pip install --editable .
|
||||
```
|
||||
|
||||
There are more environment variables to control the behavior of Python-only build:
|
||||
|
||||
* `VLLM_PRECOMPILED_WHEEL_LOCATION`: specify the exact wheel URL or local file path of a pre-compiled wheel to use. All other logic to find the wheel will be skipped.
|
||||
* `VLLM_PRECOMPILED_WHEEL_COMMIT`: override the commit hash to download the pre-compiled wheel. It can be `nightly` to use the last **already built** commit on the main branch.
|
||||
* `VLLM_PRECOMPILED_WHEEL_VARIANT`: specify the variant subdirectory to use on the nightly index, e.g., `cu129`, `cpu`. If not specified, the CUDA variant with `VLLM_MAIN_CUDA_VERSION` will be tried, then fall back to the default variant on the remote index.
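The precedence among these variables can be pictured with a short Python sketch. This is a simplified model of the behavior described above, not the actual build script logic. The function name `precompiled_wheel_source` and the returned dictionary shape are hypothetical.

```python
def precompiled_wheel_source(env: dict, base_commit: str, default_variant: str = "") -> dict:
    """Decide where a Python-only build would look for its pre-compiled wheel."""
    location = env.get("VLLM_PRECOMPILED_WHEEL_LOCATION")
    if location:
        # An explicit URL or local path skips all other lookup logic.
        return {"kind": "location", "value": location}
    # Otherwise an overriding commit (or "nightly") wins over the detected base commit.
    commit = env.get("VLLM_PRECOMPILED_WHEEL_COMMIT") or base_commit
    # Simplification: the real fallback tries the main CUDA variant first.
    variant = env.get("VLLM_PRECOMPILED_WHEEL_VARIANT") or default_variant
    return {"kind": "index", "commit": commit, "variant": variant}
```

In practice `env` would be `os.environ` and `base_commit` the commit your branch is based on.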
|
||||
|
||||
You can find more information about vLLM's wheels in [Install the latest code](#install-the-latest-code).
|
||||
|
||||
!!! note
|
||||
There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
|
||||
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [Install the latest code](#install-the-latest-code) for instructions on how to install a specified wheel.
|
||||
|
||||
#### Full build (with compilation) {#full-build}
|
||||
|
||||
If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/vllm-project/vllm.git
|
||||
cd vllm
|
||||
uv pip install -e .
|
||||
```
|
||||
|
||||
!!! tip
|
||||
Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
|
||||
|
||||
For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache`.
|
||||
As long as `which ccache` command can find the `ccache` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
|
||||
|
||||
When using `ccache` with `pip install -e .`, you should run `CCACHE_NOHASHDIR="true" pip install --no-build-isolation -e .`. This is because `pip` creates a new folder with a random name for each build, preventing `ccache` from recognizing that the same files are being built.
|
||||
|
||||
[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
|
||||
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
|
||||
|
||||
!!! note "Faster Kernel Development"
|
||||
For frequent C++/CUDA kernel changes, after the initial `uv pip install -e .` setup, consider using the [Incremental Compilation Workflow](../../contributing/incremental_build.md) for significantly faster rebuilds of only the modified kernel code.
|
||||
|
||||
##### Use an existing PyTorch installation
|
||||
|
||||
There are scenarios where the PyTorch dependency cannot be easily installed with `uv`, for example, when building vLLM with non-default PyTorch builds (like nightly or a custom build).
|
||||
|
||||
To build vLLM using an existing PyTorch installation:
|
||||
|
||||
```bash
|
||||
# install PyTorch first, either from PyPI or from source
|
||||
git clone https://github.com/vllm-project/vllm.git
|
||||
cd vllm
|
||||
python use_existing_torch.py
|
||||
uv pip install -r requirements/build.txt
|
||||
uv pip install --no-build-isolation -e .
|
||||
```
|
||||
|
||||
Alternatively, if you are exclusively using `uv` to create and manage virtual environments, it has [a unique mechanism](https://docs.astral.sh/uv/concepts/projects/config/#disabling-build-isolation) for disabling build isolation for specific packages. vLLM can leverage this mechanism to specify `torch` as the package to disable build isolation for:
|
||||
|
||||
```bash
|
||||
# install PyTorch first, either from PyPI or from source
|
||||
git clone https://github.com/vllm-project/vllm.git
|
||||
cd vllm
|
||||
# pip install -e . does not work directly, only uv can do this
|
||||
uv pip install -e .
|
||||
```
|
||||
|
||||
##### Use the local cutlass for compilation
|
||||
|
||||
Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead.
|
||||
To achieve this, you can set the environment variable `VLLM_CUTLASS_SRC_DIR` to point to your local cutlass directory.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/vllm-project/vllm.git
|
||||
cd vllm
|
||||
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass uv pip install -e .
|
||||
```
|
||||
|
||||
##### Troubleshooting
|
||||
|
||||
To avoid overloading your system, you can limit the number of compilation jobs that run simultaneously via the environment variable `MAX_JOBS`. For example:
|
||||
|
||||
```bash
|
||||
export MAX_JOBS=6
|
||||
uv pip install -e .
|
||||
```
|
||||
|
||||
This is especially useful when you are building on less powerful machines. For example, when you use WSL it only [assigns 50% of the total memory by default](https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings), so using `export MAX_JOBS=1` can avoid compiling multiple files simultaneously and running out of memory.
|
||||
A side effect is a much slower build process.
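One way to pick a value between `1` and the core count is to cap the job count by available memory. The Python sketch below does that. The per-job memory budget of 4 GiB is an illustrative assumption, not a measured figure for vLLM, and the function name `suggest_max_jobs` is hypothetical.

```python
def suggest_max_jobs(cpu_count: int, available_mem_gib: float, gib_per_job: float = 4.0) -> int:
    """Suggest a MAX_JOBS value bounded by both CPU cores and memory.

    gib_per_job is an assumed memory budget per compilation job; tune it
    for your machine.
    """
    by_memory = int(available_mem_gib // gib_per_job)
    # Never go below one job, and never above the number of cores.
    return max(1, min(cpu_count, by_memory))
```

The result can be exported as `MAX_JOBS` before running the install command.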
|
||||
|
||||
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
|
||||
|
||||
```bash
|
||||
# Use `--ipc=host` to make sure the shared memory is large enough.
|
||||
docker run \
|
||||
--gpus all \
|
||||
-it \
|
||||
--rm \
|
||||
--ipc=host nvcr.io/nvidia/pytorch:23.10-py3
|
||||
```
|
||||
|
||||
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
|
||||
|
||||
```bash
|
||||
export CUDA_HOME=/usr/local/cuda
|
||||
export PATH="${CUDA_HOME}/bin:$PATH"
|
||||
```
|
||||
|
||||
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
|
||||
|
||||
```bash
|
||||
nvcc --version # verify that nvcc is in your PATH
|
||||
${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
|
||||
```
|
||||
|
||||
#### Unsupported OS build
|
||||
|
||||
vLLM can fully run only on Linux but for development purposes, you can still build it on other systems (for example, macOS), allowing for imports and a more convenient development environment. The binaries will not be compiled and won't work on non-Linux systems.
|
||||
|
||||
Simply set the `VLLM_TARGET_DEVICE` environment variable to `empty` before installing:
|
||||
|
||||
```bash
|
||||
export VLLM_TARGET_DEVICE=empty
|
||||
uv pip install -e .
|
||||
```
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image.
|
||||
|
||||
Another way to access the latest code is to use the docker images:
|
||||
|
||||
```bash
|
||||
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
|
||||
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
|
||||
```
|
||||
|
||||
These Docker images are used for CI and testing only and are not intended for production use. They expire after several days.
|
||||
|
||||
The latest code can contain bugs and may not be stable. Please use it with caution.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
See [Building vLLM's Docker Image from Source](../../deployment/docker.md#building-vllms-docker-image-from-source) for instructions on building the Docker image.
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:supported-features]
|
||||
|
||||
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
|
||||
|
||||
# --8<-- [end:supported-features]
|
||||
125
docs/getting_started/installation/gpu.md
Normal file
@@ -0,0 +1,125 @@
|
||||
# GPU
|
||||
|
||||
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor-specific instructions:
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:installation"
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:installation"
|
||||
|
||||
=== "Intel XPU"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
|
||||
|
||||
## Requirements
|
||||
|
||||
- OS: Linux
|
||||
- Python: 3.10 -- 3.13
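A quick way to confirm the interpreter version before installing is a small Python check. This sketch reads the range above as inclusive of both 3.10 and 3.13, which is an assumption. The helper name `python_supported` is hypothetical.

```python
import sys

MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 13)


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) version falls in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON
```

Calling `python_supported()` without arguments checks the running interpreter.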
|
||||
|
||||
!!! note
|
||||
vLLM does not support Windows natively. To run vLLM on Windows, you can use the Windows Subsystem for Linux (WSL) with a compatible Linux distribution, or use some community-maintained forks, e.g. [https://github.com/SystemPanic/vllm-windows](https://github.com/SystemPanic/vllm-windows).
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:requirements"
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:requirements"
|
||||
|
||||
=== "Intel XPU"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
|
||||
|
||||
## Set up using Python
|
||||
|
||||
### Create a new Python environment
|
||||
|
||||
--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:set-up-using-python"
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:set-up-using-python"
|
||||
|
||||
=== "Intel XPU"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
|
||||
|
||||
### Pre-built wheels {#pre-built-wheels}
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-wheels"
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-wheels"
|
||||
|
||||
=== "Intel XPU"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
|
||||
|
||||
### Build wheel from source
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-wheel-from-source"
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-wheel-from-source"
|
||||
|
||||
=== "Intel XPU"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
|
||||
|
||||
## Set up using Docker
|
||||
|
||||
### Pre-built images
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-images"
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-images"
|
||||
|
||||
=== "Intel XPU"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
|
||||
|
||||
### Build image from source
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-image-from-source"
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-image-from-source"
|
||||
|
||||
=== "Intel XPU"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
|
||||
|
||||
## Supported features
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:supported-features"
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:supported-features"
|
||||
|
||||
=== "Intel XPU"
|
||||
|
||||
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"
|
||||
245
docs/getting_started/installation/gpu.rocm.inc.md
Normal file
@@ -0,0 +1,245 @@
|
||||
# --8<-- [start:installation]
|
||||
|
||||
vLLM supports AMD GPUs with ROCm 6.3 or above, and torch 2.8.0 and above.
|
||||
|
||||
!!! tip
|
||||
[Docker](#set-up-using-docker) is the recommended way to use vLLM on ROCm.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
|
||||
- GPU: MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201), Ryzen AI MAX / AI 300 Series (gfx1151/1150)
|
||||
- ROCm 6.3 or above
|
||||
- MI350 requires ROCm 7.0 or above
|
||||
- Ryzen AI MAX / AI 300 Series requires ROCm 7.0.2 or above
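The per-GPU minimums above can be encoded as a lookup table. The Python sketch below is a convenience built from that list only. The names `MIN_ROCM`, `parse_version`, and `rocm_sufficient` are hypothetical and not part of vLLM or ROCm.

```python
# Minimum ROCm version per gfx architecture, taken from the list above.
MIN_ROCM = {
    "gfx90a": (6, 3, 0),   # MI200s
    "gfx942": (6, 3, 0),   # MI300
    "gfx950": (7, 0, 0),   # MI350
    "gfx1100": (6, 3, 0),  # Radeon RX 7900 series
    "gfx1101": (6, 3, 0),
    "gfx1200": (6, 3, 0),  # Radeon RX 9000 series
    "gfx1201": (6, 3, 0),
    "gfx1151": (7, 0, 2),  # Ryzen AI MAX / AI 300 Series
    "gfx1150": (7, 0, 2),
}


def parse_version(text: str) -> tuple:
    """Turn "7.0" or "7.0.2" into a three-part integer tuple."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def rocm_sufficient(arch: str, rocm_version: str) -> bool:
    """Check an installed ROCm version against the minimum for `arch`."""
    if arch not in MIN_ROCM:
        raise ValueError(f"architecture not in the list above: {arch}")
    return parse_version(rocm_version) >= MIN_ROCM[arch]
```

The architecture string is the `gfx` identifier reported by `rocminfo`.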
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
|
||||
There is no extra information on creating a new Python environment for this device.
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
|
||||
Currently, there are no pre-built ROCm wheels.
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
|
||||
!!! tip
|
||||
- If the following installation steps do not work for you, refer to [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). The Dockerfile doubles as a record of the installation steps.
|
||||
|
||||
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
|
||||
|
||||
- [ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/index.html)
|
||||
- [PyTorch](https://pytorch.org/)
|
||||
|
||||
To install PyTorch, you can start from a fresh Docker image, e.g., `rocm/pytorch:rocm7.0_ubuntu22.04_py3.10_pytorch_release_2.8.0` or `rocm/pytorch-nightly`. If you are using a Docker image, you can skip to Step 3.
|
||||
|
||||
Alternatively, you can install PyTorch using PyTorch wheels. See the PyTorch [Getting Started](https://pytorch.org/get-started/locally/) guide for installation instructions. Example:
|
||||
|
||||
```bash
|
||||
# Install PyTorch
|
||||
pip uninstall torch -y
|
||||
pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm7.0
|
||||
```
|
||||
|
||||
1. Install [Triton for ROCm](https://github.com/ROCm/triton.git)
|
||||
|
||||
Install ROCm's Triton following the instructions from [ROCm/triton](https://github.com/ROCm/triton.git)
|
||||
|
||||
```bash
|
||||
python3 -m pip install ninja cmake wheel pybind11
|
||||
pip uninstall -y triton
|
||||
git clone https://github.com/ROCm/triton.git
|
||||
cd triton
|
||||
# git checkout $TRITON_BRANCH
|
||||
git checkout f9e5bf54
|
||||
if [ ! -f setup.py ]; then cd python; fi
|
||||
python3 setup.py install
|
||||
cd ../..
|
||||
```
|
||||
|
||||
!!! note
|
||||
- The validated `$TRITON_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
|
||||
- If you see HTTP errors while downloading packages during the Triton build, try again; the errors are intermittent.
|
||||
|
||||
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/Dao-AILab/flash-attention.git)
|
||||
|
||||
Install ROCm's flash attention (v2.8.0) following the instructions from [Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention#amd-rocm-support)
|
||||
|
||||
The example below targets ROCm 7.0 and assumes your gfx architecture is `gfx942`. To find your gfx architecture, run `rocminfo | grep gfx`.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/Dao-AILab/flash-attention.git
|
||||
cd flash-attention
|
||||
# git checkout $FA_BRANCH
|
||||
git checkout 0e60e394
|
||||
git submodule update --init
|
||||
GPU_ARCHS="gfx942" python3 setup.py install
|
||||
cd ..
|
||||
```
|
||||
|
||||
!!! note
|
||||
- The validated `$FA_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
|
||||
|
||||
|
||||
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
|
||||
|
||||
```bash
|
||||
python3 -m pip uninstall -y aiter
|
||||
git clone --recursive https://github.com/ROCm/aiter.git
|
||||
cd aiter
|
||||
git checkout $AITER_BRANCH_OR_COMMIT
|
||||
git submodule sync; git submodule update --init --recursive
|
||||
python3 setup.py develop
|
||||
```
|
||||
|
||||
!!! note
|
||||
- You will need to set `$AITER_BRANCH_OR_COMMIT` to the branch or commit you want to build.
|
||||
- The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
|
||||
|
||||
|
||||
4. Build vLLM. For example, vLLM on ROCm 7.0 can be built with the following steps:
|
||||
|
||||
???+ console "Commands"
|
||||
|
||||
```bash
|
||||
pip install --upgrade pip
|
||||
|
||||
# Build & install AMD SMI
|
||||
pip install /opt/rocm/share/amd_smi
|
||||
|
||||
# Install dependencies
|
||||
pip install --upgrade numba \
|
||||
scipy \
|
||||
huggingface-hub[cli,hf_transfer] \
|
||||
setuptools_scm
|
||||
pip install -r requirements/rocm.txt
|
||||
|
||||
# To build for a single architecture (e.g., MI300) for faster installation (recommended):
|
||||
export PYTORCH_ROCM_ARCH="gfx942"
|
||||
|
||||
# To build vLLM for multiple arch MI210/MI250/MI300, use this instead
|
||||
# export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
|
||||
|
||||
python3 setup.py develop
|
||||
```
|
||||
|
||||
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
|
||||
|
||||
!!! tip
|
||||
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
|
||||
|
||||
!!! tip
|
||||
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
|
||||
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html).
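The `gfx942` value passed to `GPU_ARCHS` and `PYTORCH_ROCM_ARCH` in the steps above can also be derived from `rocminfo` output programmatically. The following is a minimal sketch, assuming `rocminfo` prints architecture names such as `gfx942` somewhere in its output. The helper names `parse_gfx_archs` and `to_pytorch_rocm_arch` are hypothetical and are not part of vLLM or ROCm.

```python
import re
import shutil
import subprocess


def parse_gfx_archs(rocminfo_output: str) -> list[str]:
    """Return unique gfx architecture names in first-seen order."""
    archs: list[str] = []
    for match in re.finditer(r"\bgfx[0-9a-f]+\b", rocminfo_output):
        name = match.group(0)
        if name not in archs:
            archs.append(name)
    return archs


def to_pytorch_rocm_arch(archs: list[str]) -> str:
    """Join architectures with ';', the separator used in the build step."""
    return ";".join(archs)


if __name__ == "__main__":
    if shutil.which("rocminfo") is None:
        print("rocminfo not found; is ROCm installed?")
    else:
        output = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=True
        ).stdout
        print(to_pytorch_rocm_arch(parse_gfx_archs(output)))
```

Review the result before exporting it: integrated GPUs or other devices reported by `rocminfo` may also appear in the list, and building for fewer architectures keeps the build faster.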
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
|
||||
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
|
||||
AMD also offers a nightly prebuilt Docker image on [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev), which has vLLM and all its dependencies installed.
|
||||
|
||||
???+ console "Commands"
|
||||
```bash
|
||||
docker pull rocm/vllm-dev:nightly # to get the latest image
|
||||
docker run -it --rm \
|
||||
--network=host \
|
||||
--group-add=video \
|
||||
--ipc=host \
|
||||
--cap-add=SYS_PTRACE \
|
||||
--security-opt seccomp=unconfined \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
-v <path/to/your/models>:/app/models \
|
||||
-e HF_HOME="/app/models" \
|
||||
rocm/vllm-dev:nightly
|
||||
```
|
||||
|
||||
!!! tip
|
||||
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
|
||||
for instructions on how to use this prebuilt docker image.
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
Building the Docker image from source is the recommended way to use vLLM with ROCm.
|
||||
|
||||
??? info "(Optional) Build an image with ROCm software stack"
|
||||
|
||||
Build a Docker image from [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base), which sets up the ROCm software stack needed by vLLM.
|
||||
**This step is optional, as this rocm_base image is usually prebuilt and stored on [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under the tag `rocm/vllm-dev:base` to speed up the user experience.**
|
||||
If you choose to build this rocm_base image yourself, the steps are as follows.
|
||||
|
||||
It is important to run the Docker build with BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling `docker build`, or enable BuildKit in the Docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon:
|
||||
|
||||
```json
|
||||
{
|
||||
"features": {
|
||||
"buildkit": true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default:
|
||||
|
||||
```bash
|
||||
DOCKER_BUILDKIT=1 docker build \
|
||||
-f docker/Dockerfile.rocm_base \
|
||||
-t rocm/vllm-dev:base .
|
||||
```
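The branches and commits pinned by `docker/Dockerfile.rocm_base` (for example the `$TRITON_BRANCH` and `$FA_BRANCH` values) can be read out of the file instead of being copied by hand. The following is a minimal sketch, assuming the pins are declared as standard Dockerfile `ARG NAME=value` lines. The function name `parse_docker_args` is hypothetical, and the sample text is an illustrative excerpt rather than the real file contents.

```python
import re

# Matches `ARG NAME=value`; ARG lines without a default are ignored.
ARG_LINE = re.compile(r"^\s*ARG\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_docker_args(dockerfile_text: str) -> dict[str, str]:
    """Map each `ARG NAME=value` line to its default value, stripping quotes."""
    defaults: dict[str, str] = {}
    for line in dockerfile_text.splitlines():
        match = ARG_LINE.match(line)
        if match:
            name, value = match.groups()
            defaults[name] = value.strip().strip('"').strip("'")
    return defaults


if __name__ == "__main__":
    # Hypothetical excerpt; the real file may use different names and values.
    sample = 'ARG TRITON_BRANCH="f9e5bf54"\nARG FA_BRANCH="0e60e394"\n'
    for name, value in parse_docker_args(sample).items():
        print(f"{name}={value}")
```

To use it against a checkout, read the file with `pathlib.Path("docker/Dockerfile.rocm_base").read_text()` and pass the text to the function.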
|
||||
|
||||
#### Build an image with vLLM
|
||||
|
||||
First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image.
|
||||
It is important to run the Docker build with BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling `docker build`, or enable BuildKit in the Docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon:
|
||||
|
||||
```json
|
||||
{
|
||||
"features": {
|
||||
"buildkit": true
|
||||
}
|
||||
}
|
||||
```
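If `/etc/docker/daemon.json` already contains other settings, the `buildkit` feature flag should be merged into it rather than replacing the file. The following is a minimal sketch that works on an in-memory dictionary only; the function name `enable_buildkit` and the sample configuration are hypothetical, and writing the file back and restarting the daemon are left to you.

```python
import json


def enable_buildkit(daemon_config: dict) -> dict:
    """Return a copy of the config with features.buildkit set to true."""
    merged = dict(daemon_config)
    features = dict(merged.get("features", {}))
    features["buildkit"] = True
    merged["features"] = features
    return merged


if __name__ == "__main__":
    # Hypothetical existing configuration; yours may differ or be empty.
    existing = {"log-driver": "json-file"}
    print(json.dumps(enable_buildkit(existing), indent=2))
```

Copying the dictionaries before modifying them keeps the original configuration untouched, so it can be compared with the merged result before anything is written to disk.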
|
||||
|
||||
[docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) uses ROCm 7.0 by default, but also supports ROCm 5.7, 6.0, 6.1, 6.2, 6.3, and 6.4 in older vLLM branches.
|
||||
It provides flexibility to customize the build of docker image using the following arguments:
|
||||
|
||||
- `BASE_IMAGE`: Specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is built using [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
|
||||
- `ARG_PYTORCH_ROCM_ARCH`: Overrides the gfx architecture values from the base Docker image.
|
||||
|
||||
Their values can be passed in when running `docker build` with `--build-arg` options.
|
||||
|
||||
To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default:
|
||||
|
||||
???+ console "Commands"
|
||||
```bash
|
||||
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
|
||||
```
|
||||
|
||||
To run the above docker image `vllm-rocm`, use the below command:
|
||||
|
||||
???+ console "Commands"
|
||||
```bash
|
||||
docker run -it \
|
||||
--network=host \
|
||||
--group-add=video \
|
||||
--ipc=host \
|
||||
--cap-add=SYS_PTRACE \
|
||||
--security-opt seccomp=unconfined \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
-v <path/to/model>:/app/model \
|
||||
vllm-rocm
|
||||
```
|
||||
|
||||
Here, `<path/to/model>` is the location where the model is stored, for example, the weights for Llama 2 or Llama 3 models.
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:supported-features]
|
||||
|
||||
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
|
||||
|
||||
# --8<-- [end:supported-features]
|
||||
84
docs/getting_started/installation/gpu.xpu.inc.md
Normal file
@@ -0,0 +1,84 @@
|
||||
# --8<-- [start:installation]
|
||||
|
||||
vLLM initially supports basic model inference and serving on the Intel GPU platform.
|
||||
|
||||
# --8<-- [end:installation]
|
||||
# --8<-- [start:requirements]
|
||||
|
||||
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU
|
||||
- OneAPI requirements: oneAPI 2025.1
|
||||
- Python: 3.12
|
||||
!!! warning
|
||||
The provided IPEX wheel is specific to Python 3.12, so this version is required.
|
||||
|
||||
# --8<-- [end:requirements]
|
||||
# --8<-- [start:set-up-using-python]
|
||||
|
||||
There is no extra information on creating a new Python environment for this device.
|
||||
|
||||
# --8<-- [end:set-up-using-python]
|
||||
# --8<-- [start:pre-built-wheels]
|
||||
|
||||
Currently, there are no pre-built XPU wheels.
|
||||
|
||||
# --8<-- [end:pre-built-wheels]
|
||||
# --8<-- [start:build-wheel-from-source]
|
||||
|
||||
- First, install the required [driver](https://dgpu-docs.intel.com/driver/installation.html#installing-gpu-drivers) and [Intel oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) 2025.1 or later.
|
||||
- Second, install Python packages for vLLM XPU backend building:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/vllm-project/vllm.git
|
||||
cd vllm
|
||||
pip install --upgrade pip
|
||||
pip install -v -r requirements/xpu.txt
|
||||
```
|
||||
|
||||
- Then, build and install vLLM XPU backend:
|
||||
|
||||
```bash
|
||||
VLLM_TARGET_DEVICE=xpu python setup.py install
|
||||
```
|
||||
|
||||
# --8<-- [end:build-wheel-from-source]
|
||||
# --8<-- [start:pre-built-images]
|
||||
|
||||
Currently, we release prebuilt XPU images on [Docker Hub](https://hub.docker.com/r/intel/vllm/tags) based on released vLLM versions. For more information, please refer to the [release notes](https://github.com/intel/ai-containers/blob/main/vllm).
|
||||
|
||||
# --8<-- [end:pre-built-images]
|
||||
# --8<-- [start:build-image-from-source]
|
||||
|
||||
```bash
|
||||
docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
|
||||
docker run -it \
|
||||
--rm \
|
||||
--network=host \
|
||||
--device /dev/dri:/dev/dri \
|
||||
-v /dev/dri/by-path:/dev/dri/by-path \
|
||||
--ipc=host \
|
||||
--privileged \
|
||||
vllm-xpu-env
|
||||
```
|
||||
|
||||
# --8<-- [end:build-image-from-source]
|
||||
# --8<-- [start:supported-features]
|
||||
|
||||
The XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. **Pipeline parallel** is supported on a single node with `mp` as the backend. For example:
|
||||
|
||||
```bash
|
||||
vllm serve facebook/opt-13b \
|
||||
--dtype=bfloat16 \
|
||||
--max_model_len=1024 \
|
||||
--distributed-executor-backend=mp \
|
||||
--pipeline-parallel-size=2 \
|
||||
-tp=8
|
||||
```
|
||||
|
||||
By default, a Ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a Ray cluster before execution; see the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
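As a rough sanity check on how many devices a configuration needs, the sketch below assumes the world size is the product of the tensor-parallel and pipeline-parallel sizes. Any other parallelism dimensions are ignored here, and `expected_world_size` is a hypothetical helper, not a vLLM API.

```python
def expected_world_size(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """Number of workers implied by the two parallel sizes (assumed product)."""
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * pipeline_parallel_size


if __name__ == "__main__":
    # Values from the `vllm serve` example: -tp=8, --pipeline-parallel-size=2
    print(expected_world_size(tensor_parallel_size=8, pipeline_parallel_size=2))
```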
|
||||
|
||||
# --8<-- [end:supported-features]
|
||||
# --8<-- [start:distributed-backend]
|
||||
|
||||
XPU platform uses **torch-ccl** for torch<2.8 and **xccl** for torch>=2.8 as distributed backend, since torch 2.8 supports **xccl** as built-in backend for XPU.
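The version rule above can be written as a small check. This is a minimal sketch: the function name `xpu_distributed_backend` is hypothetical, and the returned strings are simply the labels used in this section, not necessarily the identifiers that PyTorch expects when initializing a process group.

```python
import re


def xpu_distributed_backend(torch_version: str) -> str:
    """Pick the backend label from a torch version string such as '2.8.0+xpu'."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognized torch version: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Compare as integer tuples so that 2.10 sorts after 2.8.
    return "xccl" if (major, minor) >= (2, 8) else "torch-ccl"


if __name__ == "__main__":
    for version in ("2.7.1", "2.8.0+xpu"):
        print(version, "->", xpu_distributed_backend(version))
```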
|
||||
|
||||
# --8<-- [end:distributed-backend]
|
||||
@@ -0,0 +1,6 @@
|
||||
On NVIDIA CUDA only, it's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
|
||||
|
||||
```bash
|
||||
uv venv --python 3.12 --seed
|
||||
source .venv/bin/activate
|
||||
```
|
||||
307
docs/getting_started/quickstart.md
Normal file
@@ -0,0 +1,307 @@
|
||||
# Quickstart
|
||||
|
||||
This guide will help you quickly get started with vLLM to perform:
|
||||
|
||||
- [Offline batched inference](#offline-batched-inference)
|
||||
- [Online serving using OpenAI-compatible server](#openai-compatible-server)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- OS: Linux
|
||||
- Python: 3.10 -- 3.13
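Before installing, it can help to confirm that the current interpreter and operating system match these prerequisites. The following is a minimal sketch using only the standard library; `prerequisite_problems` is a hypothetical helper and not part of vLLM.

```python
import platform
import sys


def prerequisite_problems(version: tuple[int, int], system: str) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if system != "Linux":
        problems.append(f"OS is {system}, expected Linux")
    if not (3, 10) <= version <= (3, 13):
        problems.append(f"Python {version[0]}.{version[1]} is outside 3.10 to 3.13")
    return problems


if __name__ == "__main__":
    issues = prerequisite_problems(sys.version_info[:2], platform.system())
    print("OK" if not issues else "\n".join(issues))
```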
|
||||
|
||||
## Installation
|
||||
|
||||
=== "NVIDIA CUDA"
|
||||
|
||||
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
|
||||
|
||||
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
|
||||
|
||||
```bash
|
||||
uv venv --python 3.12 --seed
|
||||
source .venv/bin/activate
|
||||
uv pip install vllm --torch-backend=auto
|
||||
```
|
||||
|
||||
`uv` can [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`).
|
||||
|
||||
Another delightful way is to use `uv run` with `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating any permanent environment:
|
||||
|
||||
```bash
|
||||
uv run --with vllm vllm --help
|
||||
```
|
||||
|
||||
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` to the conda environment through `pip` if you want to manage it within the environment.
|
||||
|
||||
```bash
|
||||
conda create -n myenv python=3.12 -y
|
||||
conda activate myenv
|
||||
pip install --upgrade uv
|
||||
uv pip install vllm --torch-backend=auto
|
||||
```
|
||||
|
||||
=== "AMD ROCm"
|
||||
|
||||
Use a pre-built docker image from Docker Hub. The public stable image is [rocm/vllm:latest](https://hub.docker.com/r/rocm/vllm). There is also a development image at [rocm/vllm-dev](https://hub.docker.com/r/rocm/vllm-dev).
|
||||
|
||||
The `-v` flag in the `docker run` command below mounts a local directory into the container. Replace `<path/to/your/models>` with the path on your host machine to the directory containing your models. The models will then be accessible inside the container at `/app/models`.
|
||||
|
||||
???+ console "Commands"
|
||||
```bash
|
||||
docker pull rocm/vllm-dev:nightly # to get the latest image
|
||||
docker run -it --rm \
|
||||
--network=host \
|
||||
--group-add=video \
|
||||
--ipc=host \
|
||||
--cap-add=SYS_PTRACE \
|
||||
--security-opt seccomp=unconfined \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
-v <path/to/your/models>:/app/models \
|
||||
-e HF_HOME="/app/models" \
|
||||
rocm/vllm-dev:nightly
|
||||
```
|
||||
|
||||
=== "Google TPU"
|
||||
|
||||
To run vLLM on Google TPUs, you need to install the `vllm-tpu` package.
|
||||
|
||||
```bash
|
||||
uv pip install vllm-tpu
|
||||
```
|
||||
|
||||
!!! note
|
||||
For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).
|
||||
|
||||
!!! note
|
||||
For more details and non-CUDA platforms, please refer to the [installation guide](installation/README.md) for specific instructions on how to install vLLM.
|
||||
|
||||
## Offline Batched Inference
|
||||
|
||||
With vLLM installed, you can start generating text for a list of input prompts (i.e., offline batch inference). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
|
||||
|
||||
The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
|
||||
|
||||
- [LLM][vllm.LLM] is the main class for running offline inference with vLLM engine.
|
||||
- [SamplingParams][vllm.SamplingParams] specifies the parameters for the sampling process.
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
```
|
||||
|
||||
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](../api/README.md#inference-parameters).
|
||||
|
||||
!!! important
|
||||
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
|
||||
|
||||
However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
|
||||
|
||||
```python
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The president of the United States is",
|
||||
"The capital of France is",
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
```
|
||||
|
||||
The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](../models/supported_models.md).
|
||||
|
||||
```python
|
||||
llm = LLM(model="facebook/opt-125m")
|
||||
```
|
||||
|
||||
!!! note
|
||||
By default, vLLM downloads models from [Hugging Face](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
|
||||
|
||||
```shell
|
||||
export VLLM_USE_MODELSCOPE=True
|
||||
```
|
||||
|
||||
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
|
||||
|
||||
```python
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
|
||||
!!! note
|
||||
The `llm.generate` method does not automatically apply the model's chat template to the input prompt. Therefore, if you are using an Instruct model or Chat model, you should manually apply the corresponding chat template to ensure the expected behavior. Alternatively, you can use the `llm.chat` method and pass a list of messages which have the same format as those passed to OpenAI's `client.chat.completions`:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
# Using tokenizer to apply chat template
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("/path/to/chat_model")
|
||||
messages_list = [
|
||||
[{"role": "user", "content": prompt}]
|
||||
for prompt in prompts
|
||||
]
|
||||
texts = tokenizer.apply_chat_template(
|
||||
messages_list,
|
||||
tokenize=False,
|
||||
add_generation_prompt=True,
|
||||
)
|
||||
|
||||
# Generate outputs
|
||||
outputs = llm.generate(texts, sampling_params)
|
||||
|
||||
# Print the outputs.
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
# Using chat interface.
|
||||
outputs = llm.chat(messages_list, sampling_params)
|
||||
for idx, output in enumerate(outputs):
|
||||
prompt = prompts[idx]
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
```
|
||||
|
||||
## OpenAI-Compatible Server
|
||||
|
||||
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
|
||||
By default, it starts the server at `http://localhost:8000`. You can specify the address with the `--host` and `--port` arguments. The server currently hosts one model at a time and implements endpoints such as [list models](https://platform.openai.com/docs/api-reference/models/list), [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create), and [create completion](https://platform.openai.com/docs/api-reference/completions/create).
|
||||
|
||||
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen2.5-1.5B-Instruct
|
||||
```
|
||||
|
||||
!!! note
|
||||
By default, the server uses a predefined chat template stored in the tokenizer.
|
||||
You can learn about overriding it [here](../serving/openai_compatible_server.md#chat-template).
|
||||
!!! important
|
||||
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
|
||||
|
||||
To disable this behavior, please pass `--generation-config vllm` when launching the server.
|
||||
|
||||
This server can be queried in the same format as OpenAI API. For example, to list the models:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/models
|
||||
```
|
||||
|
||||
You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header.
|
||||
You can pass multiple keys after `--api-key`, and the server will accept any of them, which can be useful for key rotation.
|
||||
|
||||
### OpenAI Completions API with vLLM
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen2.5-1.5B-Instruct",
|
||||
"prompt": "San Francisco is a",
|
||||
"max_tokens": 7,
|
||||
"temperature": 0
|
||||
}'
|
||||
```
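The same request can be prepared without any third-party packages. This is a minimal sketch of the `curl` call above using the standard library; `completion_request` is a hypothetical helper name, and the payload fields mirror the ones in the `curl` example.

```python
import json
import urllib.request


def completion_request(
    prompt: str,
    model: str = "Qwen/Qwen2.5-1.5B-Instruct",
    base_url: str = "http://localhost:8000/v1",
) -> urllib.request.Request:
    """Prepare a POST request for the completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 7, "temperature": 0}
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = completion_request("San Francisco is a")
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            print(json.dumps(json.load(response), indent=2))
    except (OSError, ValueError) as exc:
        print(f"Request failed: {exc}")
```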
|
||||
|
||||
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
# Modify OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
completion = client.completions.create(
|
||||
model="Qwen/Qwen2.5-1.5B-Instruct",
|
||||
prompt="San Francisco is a",
|
||||
)
|
||||
print("Completion result:", completion)
|
||||
```
|
||||
|
||||
A more detailed client example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
|
||||
|
||||
### OpenAI Chat Completions API with vLLM
|
||||
|
||||
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
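In practice, the chat history is the list of messages that the client sends with each request: the client appends every new turn and resends the whole list. The sketch below illustrates that bookkeeping with plain dictionaries. The helper `add_turn` is hypothetical, the role set is limited to the roles used in this guide, and the assistant reply is a placeholder rather than real model output.

```python
ROLES = {"system", "user", "assistant"}  # roles used in this guide


def add_turn(history: list[dict[str, str]], role: str, content: str) -> list[dict[str, str]]:
    """Return a new history with one more message appended."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return history + [{"role": role, "content": content}]


if __name__ == "__main__":
    history: list[dict[str, str]] = []
    history = add_turn(history, "system", "You are a helpful assistant.")
    history = add_turn(history, "user", "Tell me a joke.")
    # Placeholder for the text returned by the server for the previous turn.
    history = add_turn(history, "assistant", "<model reply goes here>")
    history = add_turn(history, "user", "Explain why it is funny.")
    for message in history:
        print(f"{message['role']}: {message['content']}")
```

The resulting list is what would be passed as the `messages` field of a chat completion request.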
|
||||
|
||||
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "Qwen/Qwen2.5-1.5B-Instruct",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "Who won the world series in 2020?"}
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
Alternatively, you can use the `openai` Python package:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
# Set OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-1.5B-Instruct",
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "Tell me a joke."},
|
||||
],
|
||||
)
|
||||
print("Chat response:", chat_response)
|
||||
```
|
||||
|
||||
## On Attention Backends
|
||||
|
||||
Currently, vLLM supports multiple backends for efficient Attention computation across different platforms and accelerator architectures. It automatically selects the most performant backend compatible with your system and model specifications.
|
||||
|
||||
If desired, you can also manually set the backend of your choice using the `--attention-backend` CLI argument:
|
||||
|
||||
```bash
|
||||
# For online serving
|
||||
vllm serve Qwen/Qwen2.5-1.5B-Instruct --attention-backend FLASH_ATTN
|
||||
|
||||
# For offline inference
|
||||
python script.py --attention-backend FLASHINFER
|
||||
```
|
||||
|
||||
Some of the available backend options include:
|
||||
|
||||
- On NVIDIA CUDA: `FLASH_ATTN` or `FLASHINFER`.
|
||||
- On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`.
|
||||
|
||||
For AMD ROCm, you can further control the specific Attention implementation using the following options:
|
||||
|
||||
- Triton Unified Attention: Set the environment variables `VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
|
||||
- AITER Unified Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
|
||||
- Triton Prefill-Decode Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=true` as a CLI argument.
|
||||
- AITER Multi-head Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=1` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
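The four combinations above can be kept in one table so that the environment variables and the CLI flag stay in sync. The following is a minimal sketch that only formats a shell command line and does not launch anything; the preset keys and the helper `serve_command` are hypothetical names, while the variables and flag values are copied from the list above.

```python
# preset name -> (environment variables, use_prefill_decode_attention)
ROCM_ATTENTION_PRESETS = {
    "triton_unified": (
        {"VLLM_ROCM_USE_AITER": "0", "VLLM_ROCM_USE_AITER_MHA": "0"},
        False,
    ),
    "aiter_unified": (
        {
            "VLLM_ROCM_USE_AITER": "1",
            "VLLM_USE_AITER_UNIFIED_ATTENTION": "1",
            "VLLM_ROCM_USE_AITER_MHA": "0",
        },
        False,
    ),
    "triton_prefill_decode": (
        {"VLLM_ROCM_USE_AITER": "1", "VLLM_ROCM_USE_AITER_MHA": "0"},
        True,
    ),
    "aiter_mha": (
        {"VLLM_ROCM_USE_AITER": "1", "VLLM_ROCM_USE_AITER_MHA": "1"},
        False,
    ),
}


def serve_command(preset: str, model: str) -> str:
    """Format a `vllm serve` command line for the chosen preset."""
    env, use_prefill_decode = ROCM_ATTENTION_PRESETS[preset]
    env_prefix = " ".join(f"{name}={value}" for name, value in env.items())
    flag_value = "true" if use_prefill_decode else "false"
    flag = f"--attention-config.use_prefill_decode_attention={flag_value}"
    return f"{env_prefix} vllm serve {model} {flag}"


if __name__ == "__main__":
    print(serve_command("aiter_mha", "Qwen/Qwen2.5-1.5B-Instruct"))
```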
|
||||
|
||||
!!! warning
|
||||
There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.
|
||||