forked from EngineX-Cambricon/enginex-mlu370-vllm
add qwen3
This commit is contained in:
178
vllm-v0.6.2/docs/source/getting_started/amd-installation.rst
Normal file
178
vllm-v0.6.2/docs/source/getting_started/amd-installation.rst
Normal file
@@ -0,0 +1,178 @@
|
||||
.. _installation_rocm:
|
||||
|
||||
Installation with ROCm
|
||||
======================
|
||||
|
||||
vLLM supports AMD GPUs with ROCm 6.2.
|
||||
|
||||
Requirements
|
||||
------------
|
||||
|
||||
* OS: Linux
|
||||
* Python: 3.9 -- 3.12
|
||||
* GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)
|
||||
* ROCm 6.2
|
||||
|
||||
Installation options:
|
||||
|
||||
#. :ref:`Build from source with docker <build_from_source_docker_rocm>`
|
||||
#. :ref:`Build from source <build_from_source_rocm>`
|
||||
|
||||
.. _build_from_source_docker_rocm:
|
||||
|
||||
Option 1: Build from source with docker (recommended)
|
||||
-----------------------------------------------------
|
||||
|
||||
You can build and install vLLM from source.
|
||||
|
||||
First, build a docker image from `Dockerfile.rocm <https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm>`_ and launch a docker container from the image.
|
||||
It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
{
|
||||
"features": {
|
||||
"buildkit": true
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
`Dockerfile.rocm <https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm>`_ uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches.
|
||||
It provides flexibility to customize the build of docker image using the following arguments:
|
||||
|
||||
* `BASE_IMAGE`: specifies the base image used when running ``docker build``, specifically the PyTorch on ROCm base image.
|
||||
* `BUILD_FA`: specifies whether to build CK flash-attention. The default is 1. For `Radeon RX 7900 series (gfx1100) <https://rocm.docs.amd.com/projects/radeon/en/latest/index.html>`_, this should be set to 0 before flash-attention supports this target.
|
||||
* `FX_GFX_ARCHS`: specifies the GFX architecture that is used to build CK flash-attention, for example, `gfx90a;gfx942` for MI200 and MI300. The default is `gfx90a;gfx942`
|
||||
* `FA_BRANCH`: specifies the branch used to build the CK flash-attention in `ROCm's flash-attention repo <https://github.com/ROCmSoftwarePlatform/flash-attention>`_. The default is `ae7928c`
|
||||
* `BUILD_TRITON`: specifies whether to build triton flash-attention. The default value is 1.
|
||||
|
||||
Their values can be passed in when running ``docker build`` with ``--build-arg`` options.
|
||||
|
||||
|
||||
To build vllm on ROCm 6.2 for MI200 and MI300 series, you can use the default:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
|
||||
|
||||
To build vllm on ROCm 6.2 for Radeon RX7900 series (gfx1100), you should specify ``BUILD_FA`` as below:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ DOCKER_BUILDKIT=1 docker build --build-arg BUILD_FA="0" -f Dockerfile.rocm -t vllm-rocm .
|
||||
|
||||
To run the above docker image ``vllm-rocm``, use the below command:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ docker run -it \
|
||||
--network=host \
|
||||
--group-add=video \
|
||||
--ipc=host \
|
||||
--cap-add=SYS_PTRACE \
|
||||
--security-opt seccomp=unconfined \
|
||||
--device /dev/kfd \
|
||||
--device /dev/dri \
|
||||
-v <path/to/model>:/app/model \
|
||||
vllm-rocm \
|
||||
bash
|
||||
|
||||
Where the `<path/to/model>` is the location where the model is stored, for example, the weights for llama2 or llama3 models.
|
||||
|
||||
|
||||
.. _build_from_source_rocm:
|
||||
|
||||
Option 2: Build from source
|
||||
---------------------------
|
||||
|
||||
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
|
||||
|
||||
- `ROCm <https://rocm.docs.amd.com/en/latest/deploy/linux/index.html>`_
|
||||
- `PyTorch <https://pytorch.org/>`_
|
||||
|
||||
For installing PyTorch, you can start from a fresh docker image, e.g, `rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0`, `rocm/pytorch-nightly`.
|
||||
|
||||
Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch `Getting Started <https://pytorch.org/get-started/locally/>`_
|
||||
|
||||
|
||||
1. Install `Triton flash attention for ROCm <https://github.com/ROCm/triton>`_
|
||||
|
||||
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from `ROCm/triton <https://github.com/ROCm/triton/blob/triton-mlir/README.md>`_
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ python3 -m pip install ninja cmake wheel pybind11
|
||||
$ pip uninstall -y triton
|
||||
$ git clone https://github.com/OpenAI/triton.git
|
||||
$ cd triton
|
||||
$ git checkout e192dba
|
||||
$ cd python
|
||||
$ pip3 install .
|
||||
$ cd ../..
|
||||
|
||||
.. note::
|
||||
- If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
|
||||
|
||||
|
||||
2. Optionally, if you choose to use CK flash attention, you can install `flash attention for ROCm <https://github.com/ROCm/flash-attention/tree/ck_tile>`_
|
||||
|
||||
|
||||
Install ROCm's flash attention (v2.5.9.post1) following the instructions from `ROCm/flash-attention <https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support>`_
|
||||
Alternatively, wheels intended for vLLM use can be accessed under the releases.
|
||||
|
||||
For example, for ROCm 6.2, suppose your gfx arch is `gfx90a`.
|
||||
Note to get your gfx architecture, run `rocminfo |grep gfx`.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ git clone https://github.com/ROCm/flash-attention.git
|
||||
$ cd flash-attention
|
||||
$ git checkout 3cea2fb
|
||||
$ git submodule update --init
|
||||
$ GPU_ARCHS="gfx90a" python3 setup.py install
|
||||
$ cd ..
|
||||
|
||||
.. note::
|
||||
- You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
|
||||
|
||||
3. Build vLLM.
|
||||
|
||||
For example, vLLM on ROCM 6.2 can be built with the following steps:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ pip install --upgrade pip
|
||||
|
||||
$ # Install PyTorch
|
||||
$ pip uninstall torch -y
|
||||
$ pip install --no-cache-dir --pre torch==2.6.0.dev20240918 --index-url https://download.pytorch.org/whl/nightly/rocm6.2
|
||||
|
||||
$ # Build & install AMD SMI
|
||||
$ pip install /opt/rocm/share/amd_smi
|
||||
|
||||
$ # Install dependencies
|
||||
$ pip install --upgrade numba scipy huggingface-hub[cli]
|
||||
$ pip install "numpy<2"
|
||||
$ pip install -r requirements-rocm.txt
|
||||
|
||||
$ # Build vLLM for MI210/MI250/MI300.
|
||||
$ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
|
||||
$ python3 setup.py develop
|
||||
|
||||
|
||||
This may take 5-10 minutes. Currently, :code:`pip install .` does not work for ROCm installation.
|
||||
|
||||
|
||||
.. tip::
|
||||
|
||||
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
|
||||
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
|
||||
- To use CK flash-attention or PyTorch naive attention, please use this flag ``export VLLM_USE_TRITON_FLASH_ATTN=0`` to turn off triton flash attention.
|
||||
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
|
||||
|
||||
|
||||
.. tip::
|
||||
- For MI300x (gfx942) users, to achieve optimal performance, please refer to `MI300x tuning guide <https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html>`_ for performance optimization and tuning tips on system and workflow level.
|
||||
For vLLM, please refer to `vLLM performance optimization <https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization>`_.
|
||||
|
||||
|
||||
164
vllm-v0.6.2/docs/source/getting_started/cpu-installation.rst
Normal file
164
vllm-v0.6.2/docs/source/getting_started/cpu-installation.rst
Normal file
@@ -0,0 +1,164 @@
|
||||
.. _installation_cpu:
|
||||
|
||||
Installation with CPU
|
||||
========================
|
||||
|
||||
vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM CPU backend supports the following vLLM features:
|
||||
|
||||
- Tensor Parallel (``-tp = N``)
|
||||
- Quantization (``INT8 W8A8, AWQ``)
|
||||
|
||||
.. note::
|
||||
More advanced features on `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.
|
||||
|
||||
Table of contents:
|
||||
|
||||
#. :ref:`Requirements <cpu_backend_requirements>`
|
||||
#. :ref:`Quick start using Dockerfile <cpu_backend_quick_start_dockerfile>`
|
||||
#. :ref:`Build from source <build_cpu_backend_from_source>`
|
||||
#. :ref:`Related runtime environment variables <env_intro>`
|
||||
#. :ref:`Intel Extension for PyTorch <ipex_guidance>`
|
||||
#. :ref:`Performance tips <cpu_backend_performance_tips>`
|
||||
|
||||
.. _cpu_backend_requirements:
|
||||
|
||||
Requirements
|
||||
------------
|
||||
|
||||
* OS: Linux
|
||||
* Compiler: gcc/g++>=12.3.0 (optional, recommended)
|
||||
* Instruction set architecture (ISA) requirement: AVX512 (optional, recommended)
|
||||
|
||||
.. _cpu_backend_quick_start_dockerfile:
|
||||
|
||||
Quick start using Dockerfile
|
||||
----------------------------
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
|
||||
$ docker run -it \
|
||||
--rm \
|
||||
--network=host \
|
||||
--cpuset-cpus=<cpu-id-list, optional> \
|
||||
--cpuset-mems=<memory-node, optional> \
|
||||
vllm-cpu-env
|
||||
|
||||
.. _build_cpu_backend_from_source:
|
||||
|
||||
Build from source
|
||||
-----------------
|
||||
|
||||
- First, install recommended compiler. We recommend to use ``gcc/g++ >= 12.3.0`` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ sudo apt-get update -y
|
||||
$ sudo apt-get install -y gcc-12 g++-12 libnuma-dev
|
||||
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
|
||||
|
||||
- Second, install Python packages for vLLM CPU backend building:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ pip install --upgrade pip
|
||||
$ pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy
|
||||
$ pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
|
||||
|
||||
- Finally, build and install vLLM CPU backend:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ VLLM_TARGET_DEVICE=cpu python setup.py install
|
||||
|
||||
.. note::
|
||||
- AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, will brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
|
||||
|
||||
- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable VLLM_CPU_AVX512BF16=1 before the building.
|
||||
|
||||
.. _env_intro:
|
||||
|
||||
Related runtime environment variables
|
||||
-------------------------------------
|
||||
|
||||
- ``VLLM_CPU_KVCACHE_SPACE``: specify the KV Cache size (e.g, ``VLLM_CPU_KVCACHE_SPACE=40`` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
|
||||
|
||||
- ``VLLM_CPU_OMP_THREADS_BIND``: specify the CPU cores dedicated to the OpenMP threads. For example, ``VLLM_CPU_OMP_THREADS_BIND=0-31`` means there will be 32 OpenMP threads bound on 0-31 CPU cores. ``VLLM_CPU_OMP_THREADS_BIND=0-31|32-63`` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores.
|
||||
|
||||
.. _ipex_guidance:
|
||||
|
||||
Intel Extension for PyTorch
|
||||
---------------------------
|
||||
|
||||
- `Intel Extension for PyTorch (IPEX) <https://github.com/intel/intel-extension-for-pytorch>`_ extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware.
|
||||
|
||||
.. _cpu_backend_performance_tips:
|
||||
|
||||
Performance tips
|
||||
-----------------
|
||||
|
||||
- We highly recommend to use TCMalloc for high performance memory allocation and better cache locality. For example, on Ubuntu 22.4, you can run:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
|
||||
$ find / -name *libtcmalloc* # find the dynamic link library path
|
||||
$ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
|
||||
$ python examples/offline_inference.py # run vLLM
|
||||
|
||||
- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_CPU_KVCACHE_SPACE=40
|
||||
$ export VLLM_CPU_OMP_THREADS_BIND=0-29
|
||||
$ vllm serve facebook/opt-125m
|
||||
|
||||
- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using ``VLLM_CPU_OMP_THREADS_BIND``. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores
|
||||
|
||||
# The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core.
|
||||
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
|
||||
0 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000
|
||||
1 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000
|
||||
2 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000
|
||||
3 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000
|
||||
4 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000
|
||||
5 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000
|
||||
6 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
|
||||
7 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
|
||||
8 0 0 0 0:0:0:0 yes 2401.0000 800.0000 800.000
|
||||
9 0 0 1 1:1:1:0 yes 2401.0000 800.0000 800.000
|
||||
10 0 0 2 2:2:2:0 yes 2401.0000 800.0000 800.000
|
||||
11 0 0 3 3:3:3:0 yes 2401.0000 800.0000 800.000
|
||||
12 0 0 4 4:4:4:0 yes 2401.0000 800.0000 800.000
|
||||
13 0 0 5 5:5:5:0 yes 2401.0000 800.0000 800.000
|
||||
14 0 0 6 6:6:6:0 yes 2401.0000 800.0000 800.000
|
||||
15 0 0 7 7:7:7:0 yes 2401.0000 800.0000 800.000
|
||||
|
||||
# On this platform, it is recommend to only bind openMP threads on logical CPU cores 0-7 or 8-15
|
||||
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
|
||||
$ python examples/offline_inference.py
|
||||
|
||||
- If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using ``VLLM_CPU_OMP_THREADS_BIND`` to avoid cross NUMA node memory access.
|
||||
|
||||
CPU Backend Considerations
|
||||
--------------------------
|
||||
|
||||
- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.
|
||||
|
||||
- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance.
|
||||
|
||||
- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the `topology <https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa>`_. For NUMA architecture, two optimizations are to recommended: Tensor Parallel or Data Parallel.
|
||||
|
||||
* Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With `TP feature on CPU <https://github.com/vllm-project/vllm/pull/6125>`_ merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
|
||||
|
||||
|
||||
* Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like `Nginx <../serving/deploying_with_nginx.html>`_ or HAProxy are recommended. Anyscale Ray project provides the feature on LLM `serving <https://docs.ray.io/en/latest/serve/index.html>`_. Here is the example to setup a scalable LLM serving with `Ray Serve <https://github.com/intel/llm-on-ray/blob/main/docs/setup.md>`_.
|
||||
142
vllm-v0.6.2/docs/source/getting_started/debugging.rst
Normal file
142
vllm-v0.6.2/docs/source/getting_started/debugging.rst
Normal file
@@ -0,0 +1,142 @@
|
||||
.. _debugging:
|
||||
|
||||
===============
|
||||
Debugging Tips
|
||||
===============
|
||||
|
||||
This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please `search existing issues <https://github.com/vllm-project/vllm/issues?q=is%3Aissue>`_ first to see if it has already been reported. If not, please `file a new issue <https://github.com/vllm-project/vllm/issues/new/choose>`_, providing as much relevant information as possible.
|
||||
|
||||
.. note::
|
||||
|
||||
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
|
||||
|
||||
Hangs downloading a model
|
||||
----------------------------------------
|
||||
If the model isn't already downloaded to disk, vLLM will download it from the internet which can take time and depend on your internet connection.
|
||||
It's recommended to download the model first using the `huggingface-cli <https://huggingface.co/docs/huggingface_hub/en/guides/cli>`_ and passing the local path to the model to vLLM. This way, you can isolate the issue.
|
||||
|
||||
Hangs loading a model from disk
|
||||
----------------------------------------
|
||||
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
|
||||
It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
|
||||
|
||||
.. note::
|
||||
|
||||
To isolate the model downloading and loading issue, you can use the ``--load-format dummy`` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
|
||||
|
||||
Model is too large
|
||||
----------------------------------------
|
||||
If the model is too large to fit in a single GPU, you might want to `consider tensor parallelism <https://docs.vllm.ai/en/latest/serving/distributed_serving.html#distributed-inference-and-serving>`_ to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using `this example <https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html>`_ . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
|
||||
|
||||
Enable more logging
|
||||
----------------------------------------
|
||||
If other strategies don't solve the problem, it's likely that the vLLM instance is stuck somewhere. You can use the following environment variables to help debug the issue:
|
||||
|
||||
- ``export VLLM_LOGGING_LEVEL=DEBUG`` to turn on more logging.
|
||||
- ``export CUDA_LAUNCH_BLOCKING=1`` to identify which CUDA kernel is causing the problem.
|
||||
- ``export NCCL_DEBUG=TRACE`` to turn on more logging for NCCL.
|
||||
- ``export VLLM_TRACE_FUNCTION=1`` to record all function calls for inspection in the log files to tell which function crashes or hangs.
|
||||
|
||||
Incorrect network setup
|
||||
----------------------------------------
|
||||
The vLLM instance cannot get the correct IP address if you have a complicated network config. You can find a log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl`` and the IP address should be the correct one.
|
||||
If it's not, override the IP address using the environment variable ``export VLLM_HOST_IP=<your_ip_address>``.
|
||||
|
||||
You might also need to set ``export NCCL_SOCKET_IFNAME=<your_network_interface>`` and ``export GLOO_SOCKET_IFNAME=<your_network_interface>`` to specify the network interface for the IP address.
|
||||
|
||||
Error near ``self.graph.replay()``
|
||||
----------------------------------------
|
||||
If vLLM crashes and the error trace captures it somewhere around ``self.graph.replay()`` in ``vllm/worker/model_runner.py``, it is a CUDA error inside CUDAGraph.
|
||||
To identify the particular CUDA operation that causes the error, you can add ``--enforce-eager`` to the command line, or ``enforce_eager=True`` to the :class:`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
|
||||
|
||||
Incorrect hardware/driver
|
||||
----------------------------------------
|
||||
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Test PyTorch NCCL
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
dist.init_process_group(backend="nccl")
|
||||
local_rank = dist.get_rank() % torch.cuda.device_count()
|
||||
torch.cuda.set_device(local_rank)
|
||||
data = torch.FloatTensor([1,] * 128).to("cuda")
|
||||
dist.all_reduce(data, op=dist.ReduceOp.SUM)
|
||||
torch.cuda.synchronize()
|
||||
value = data.mean().item()
|
||||
world_size = dist.get_world_size()
|
||||
assert value == world_size, f"Expected {world_size}, got {value}"
|
||||
|
||||
print("PyTorch NCCL is successful!")
|
||||
|
||||
# Test PyTorch GLOO
|
||||
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
|
||||
cpu_data = torch.FloatTensor([1,] * 128)
|
||||
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
|
||||
value = cpu_data.mean().item()
|
||||
assert value == world_size, f"Expected {world_size}, got {value}"
|
||||
|
||||
print("PyTorch GLOO is successful!")
|
||||
|
||||
if world_size <= 1:
|
||||
exit()
|
||||
|
||||
# Test vLLM NCCL, with cuda graph
|
||||
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
|
||||
|
||||
pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
|
||||
pynccl.disabled = False
|
||||
|
||||
s = torch.cuda.Stream()
|
||||
with torch.cuda.stream(s):
|
||||
data.fill_(1)
|
||||
pynccl.all_reduce(data, stream=s)
|
||||
value = data.mean().item()
|
||||
assert value == world_size, f"Expected {world_size}, got {value}"
|
||||
|
||||
print("vLLM NCCL is successful!")
|
||||
|
||||
g = torch.cuda.CUDAGraph()
|
||||
with torch.cuda.graph(cuda_graph=g, stream=s):
|
||||
pynccl.all_reduce(data, stream=torch.cuda.current_stream())
|
||||
|
||||
data.fill_(1)
|
||||
g.replay()
|
||||
torch.cuda.current_stream().synchronize()
|
||||
value = data.mean().item()
|
||||
assert value == world_size, f"Expected {world_size}, got {value}"
|
||||
|
||||
print("vLLM NCCL with cuda graph is successful!")
|
||||
|
||||
dist.destroy_process_group(gloo_group)
|
||||
dist.destroy_process_group()
|
||||
|
||||
If you are testing with a single node, adjust ``--nproc-per-node`` to the number of GPUs you want to use:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
|
||||
|
||||
If you are testing with multi-nodes, adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup and set ``MASTER_ADDR`` to the correct IP address of the master node, reachable from all nodes. Then, run:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
|
||||
|
||||
If the script runs successfully, you should see the message ``sanity check is successful!``.
|
||||
|
||||
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as ``export NCCL_P2P_DISABLE=1`` to see if it helps. Please check `their documentation <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html>`__ for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
|
||||
|
||||
.. note::
|
||||
|
||||
A multi-node environment is more complicated than a single-node one. If you see errors such as ``torch.distributed.DistNetworkError``, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
|
||||
|
||||
- In the first node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py``.
|
||||
- In the second node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py``.
|
||||
|
||||
Adjust ``--nproc-per-node``, ``--nnodes``, and ``--node-rank`` according to your setup, being sure to execute different commands (with different ``--node-rank``) on different nodes.
|
||||
|
||||
Known Issues
|
||||
----------------------------------------
|
||||
- In ``v0.5.2``, ``v0.5.3``, and ``v0.5.3.post1``, there is a bug caused by `zmq <https://github.com/zeromq/pyzmq/issues/2000>`_ , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of ``vllm`` to include the `fix <https://github.com/vllm-project/vllm/pull/6759>`_.
|
||||
@@ -0,0 +1,8 @@
|
||||
Examples
|
||||
=================================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: Scripts
|
||||
|
||||
%EXAMPLE_DOCS%
|
||||
402
vllm-v0.6.2/docs/source/getting_started/gaudi-installation.rst
Normal file
402
vllm-v0.6.2/docs/source/getting_started/gaudi-installation.rst
Normal file
@@ -0,0 +1,402 @@
|
||||
Installation with Intel® Gaudi® AI Accelerators
|
||||
===============================================
|
||||
|
||||
This README provides instructions on running vLLM with Intel Gaudi devices.
|
||||
|
||||
Requirements and Installation
|
||||
=============================
|
||||
|
||||
Please follow the instructions provided in the `Gaudi Installation
|
||||
Guide <https://docs.habana.ai/en/latest/Installation_Guide/index.html>`__
|
||||
to set up the execution environment. To achieve the best performance,
|
||||
please follow the methods outlined in the `Optimizing Training Platform
|
||||
Guide <https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html>`__.
|
||||
|
||||
Requirements
|
||||
------------
|
||||
|
||||
- OS: Ubuntu 22.04 LTS
|
||||
- Python: 3.10
|
||||
- Intel Gaudi accelerator
|
||||
- Intel Gaudi software version 1.18.0
|
||||
|
||||
|
||||
Quick start using Dockerfile
|
||||
----------------------------
|
||||
.. code:: console
|
||||
|
||||
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
|
||||
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
|
||||
|
||||
|
||||
.. tip::
|
||||
If you're observing the following error: ``docker: Error response from daemon: Unknown runtime specified habana.``, please refer to "Install Using Containers" section of `Intel Gaudi Software Stack and Driver Installation <https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html>`__. Make sure you have ``habana-container-runtime`` package installed and that ``habana`` container runtime is registered.
|
||||
|
||||
|
||||
Build from source
|
||||
-----------------
|
||||
|
||||
Environment verification
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To verify that the Intel Gaudi software was correctly installed, run:
|
||||
|
||||
.. code:: console
|
||||
|
||||
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
|
||||
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
|
||||
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
|
||||
$ pip list | grep neural # verify that neural_compressor is installed
|
||||
|
||||
Refer to `Intel Gaudi Software Stack
|
||||
Verification <https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade>`__
|
||||
for more details.
|
||||
|
||||
Run Docker Image
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
It is highly recommended to use the latest Docker image from Intel Gaudi
|
||||
vault. Refer to the `Intel Gaudi
|
||||
documentation <https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#pull-prebuilt-containers>`__
|
||||
for more details.
|
||||
|
||||
Use the following commands to run a Docker image:
|
||||
|
||||
.. code:: console
|
||||
|
||||
$ docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
|
||||
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
|
||||
|
||||
Build and Install vLLM
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To build and install vLLM from source, run:
|
||||
|
||||
.. code:: console
|
||||
|
||||
$ git clone https://github.com/vllm-project/vllm.git
|
||||
$ cd vllm
|
||||
$ python setup.py develop
|
||||
|
||||
|
||||
Currently, the latest features and performance optimizations are developed in Gaudi's `vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__ and we periodically upstream them to vLLM main repo. To install latest `HabanaAI/vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__, run the following:
|
||||
|
||||
.. code:: console
|
||||
|
||||
$ git clone https://github.com/HabanaAI/vllm-fork.git
|
||||
$ cd vllm-fork
|
||||
$ git checkout habana_main
|
||||
$ python setup.py develop
|
||||
|
||||
|
||||
Supported Features
|
||||
==================
|
||||
|
||||
- `Offline batched
|
||||
inference <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference>`__
|
||||
- Online inference via `OpenAI-Compatible
|
||||
Server <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server>`__
|
||||
- HPU autodetection - no need to manually select device within vLLM
|
||||
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
|
||||
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
|
||||
prefill attention, Root Mean Square Layer Normalization, Rotary
|
||||
Positional Encoding
|
||||
- Tensor parallelism support for multi-card inference
|
||||
- Inference with `HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__
|
||||
for accelerating low-batch latency and throughput
|
||||
- Attention with Linear Biases (ALiBi)
|
||||
|
||||
Unsupported Features
|
||||
====================
|
||||
|
||||
- Beam search
|
||||
- LoRA adapters
|
||||
- Quantization
|
||||
- Prefill chunking (mixed-batch inferencing)
|
||||
|
||||
Supported Configurations
|
||||
========================
|
||||
|
||||
The following configurations have been validated to be function with
|
||||
Gaudi2 devices. Configurations that are not listed may or may not work.
|
||||
|
||||
- `meta-llama/Llama-2-7b <https://huggingface.co/meta-llama/Llama-2-7b>`__
|
||||
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
|
||||
datatype with random or greedy sampling
|
||||
- `meta-llama/Llama-2-7b-chat-hf <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`__
|
||||
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
|
||||
datatype with random or greedy sampling
|
||||
- `meta-llama/Meta-Llama-3-8B <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__
|
||||
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
|
||||
datatype with random or greedy sampling
|
||||
- `meta-llama/Meta-Llama-3-8B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__
|
||||
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
|
||||
datatype with random or greedy sampling
|
||||
- `meta-llama/Meta-Llama-3.1-8B <https://huggingface.co/meta-llama/Meta-Llama-3.1-8B>`__
|
||||
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
|
||||
datatype with random or greedy sampling
|
||||
- `meta-llama/Meta-Llama-3.1-8B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct>`__
|
||||
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
|
||||
datatype with random or greedy sampling
|
||||
- `meta-llama/Llama-2-70b <https://huggingface.co/meta-llama/Llama-2-70b>`__
|
||||
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
|
||||
- `meta-llama/Llama-2-70b-chat-hf <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`__
|
||||
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
|
||||
- `meta-llama/Meta-Llama-3-70B <https://huggingface.co/meta-llama/Meta-Llama-3-70B>`__
|
||||
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
|
||||
- `meta-llama/Meta-Llama-3-70B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct>`__
|
||||
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
|
||||
- `meta-llama/Meta-Llama-3.1-70B <https://huggingface.co/meta-llama/Meta-Llama-3.1-70B>`__
|
||||
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
|
||||
- `meta-llama/Meta-Llama-3.1-70B-Instruct <https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct>`__
|
||||
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
|
||||
|
||||
Performance Tuning
|
||||
==================
|
||||
|
||||
Execution modes
|
||||
---------------
|
||||
|
||||
Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via ``PT_HPU_LAZY_MODE`` environment variable), and ``--enforce-eager`` flag.
|
||||
|
||||
.. list-table:: vLLM execution modes
|
||||
:widths: 25 25 50
|
||||
:header-rows: 1
|
||||
|
||||
* - ``PT_HPU_LAZY_MODE``
|
||||
- ``enforce_eager``
|
||||
- execution mode
|
||||
* - 0
|
||||
- 0
|
||||
- torch.compile
|
||||
* - 0
|
||||
- 1
|
||||
- PyTorch eager mode
|
||||
* - 1
|
||||
- 0
|
||||
- HPU Graphs
|
||||
* - 1
|
||||
- 1
|
||||
- PyTorch lazy mode
|
||||
|
||||
.. warning::
|
||||
In 1.18.0, all modes utilizing ``PT_HPU_LAZY_MODE=0`` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
|
||||
|
||||
|
||||
Bucketing mechanism
|
||||
-------------------
|
||||
|
||||
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. `Intel Gaudi Graph Compiler <https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime>`__ is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
|
||||
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently it is achieved by "bucketing" model's forward pass across two dimensions - ``batch_size`` and ``sequence_length``.
|
||||
|
||||
.. note::
|
||||
Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
|
||||
|
||||
Bucketing ranges are determined with 3 parameters - ``min``, ``step`` and ``max``. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
|
||||
|
||||
.. code-block::
|
||||
|
||||
INFO 08-01 21:37:59 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
|
||||
INFO 08-01 21:37:59 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
|
||||
INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
|
||||
INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
|
||||
|
||||
``min`` determines the lowest value of the bucket. ``step`` determines the interval between buckets, and ``max`` determines the upper bound of the bucket. Furthermore, interval between ``min`` and ``step`` has special handling - ``min`` gets multiplied by consecutive powers of two, until ``step`` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.
|
||||
|
||||
Example (with ramp-up)
|
||||
|
||||
.. code-block::
|
||||
|
||||
min = 2, step = 32, max = 64
|
||||
=> ramp_up = (2, 4, 8, 16)
|
||||
=> stable = (32, 64)
|
||||
=> buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64)
|
||||
|
||||
Example (without ramp-up)
|
||||
|
||||
.. code-block::
|
||||
|
||||
min = 128, step = 128, max = 512
|
||||
=> ramp_up = ()
|
||||
=> stable = (128, 256, 384, 512)
|
||||
=> buckets = ramp_up + stable => (128, 256, 384, 512)
|
||||
|
||||
|
||||
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.
|
||||
|
||||
.. warning::
|
||||
If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario.
|
||||
|
||||
As an example, if a request of 3 sequences, with max sequence length of 412 comes in to an idle vLLM server, it will be padded executed as ``(4, 512)`` prefill bucket, as ``batch_size`` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After prefill stage, it will be executed as ``(4, 512)`` decode bucket and will continue as that bucket until either batch dimension changes (due to request being finished) - in which case it will become a ``(2, 512)`` bucket, or context length increases above 512 tokens, in which case it will become ``(4, 640)`` bucket.
|
||||
|
||||
.. note::
|
||||
Bucketing is transparent to a client - padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
|
||||
|
||||
Warmup
|
||||
------
|
||||
|
||||
Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
|
||||
|
||||
.. code-block::
|
||||
|
||||
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
|
||||
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
|
||||
INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
|
||||
...
|
||||
INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
|
||||
INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
|
||||
INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
|
||||
INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
|
||||
...
|
||||
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
|
||||
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
|
||||
|
||||
This example uses the same buckets as in *Bucketing mechanism* section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
|
||||
|
||||
.. tip::
|
||||
Compiling all the buckets might take some time and can be turned off with ``VLLM_SKIP_WARMUP=true`` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
|
||||
|
||||
HPU Graph capture
|
||||
-----------------
|
||||
|
||||
`HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__ are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.
|
||||
|
||||
|
||||
When HPU Graphs are being used, they share the common memory pool ("usable memory") as KV cache, determined by ``gpu_memory_utilization`` flag (``0.9`` by default).
|
||||
Before KV cache gets allocated, model weights are loaded onto the device, and a forward pass of the model is executed on dummy data, to estimate memory usage.
|
||||
Only after that, ``gpu_memory_utilization`` flag is utilized - at its default value, will mark 90% of free device memory at that point as usable.
|
||||
Next, KV cache gets allocated, model is warmed up, and HPU Graphs are captured.
|
||||
Environment variable ``VLLM_GRAPH_RESERVED_MEM`` defines the ratio of memory reserved for HPU Graphs capture.
|
||||
With its default value (``VLLM_GRAPH_RESERVED_MEM=0.1``), 10% of usable memory will be reserved for graph capture (later referred to as "usable graph memory"), and the remaining 90% will be utilized for KV cache.
|
||||
Environment variable ``VLLM_GRAPH_PROMPT_RATIO`` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (``VLLM_GRAPH_PROMPT_RATIO=0.3``), both stages have equal memory constraints.
|
||||
Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. ``VLLM_GRAPH_PROMPT_RATIO=0.2`` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
|
||||
|
||||
.. note::
|
||||
``gpu_memory_utilization`` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, ``gpu_memory_utilization`` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
|
||||
|
||||
User can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. Strategy affects the order of capturing graphs. There are two strategies implemented:
|
||||
- ``max_bs`` - graph capture queue will sorted in descending order by their batch sizes. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. ``(64, 128)``, ``(64, 256)``, ``(32, 128)``, ``(32, 256)``, ``(1, 128)``, ``(1,256)``), default strategy for decode
|
||||
- ``min_tokens`` - graph capture queue will be sorted in ascending order by the number of tokens each graph processes (``batch_size*sequence_length``), default strategy for prompt
|
||||
|
||||
When there's large amount of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by ``max_bs`` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in ``min_tokens`` strategy.
|
||||
|
||||
|
||||
.. note::
|
||||
``VLLM_GRAPH_PROMPT_RATIO`` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up entirety of usable prefill graph memory (usable graph memory * ``VLLM_GRAPH_PROMPT_RATIO``) for capturing prefill HPU Graphs, next it will attempt do the same for decode graphs and usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding reserved memory pool. The behavior on that mechanism can be observed in the example below.
|
||||
|
||||
|
||||
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
|
||||
|
||||
.. code-block::
|
||||
|
||||
INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
|
||||
INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
|
||||
INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
|
||||
INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
|
||||
INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
|
||||
INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
|
||||
INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
|
||||
INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
|
||||
INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
|
||||
INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
|
||||
INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
|
||||
INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
|
||||
...
|
||||
INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
|
||||
INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
|
||||
INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
|
||||
...
|
||||
INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
|
||||
INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
|
||||
...
|
||||
INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
|
||||
INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
|
||||
INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
|
||||
INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
|
||||
INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
|
||||
INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
|
||||
INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
|
||||
INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
|
||||
INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)
|
||||
|
||||
|
||||
Recommended vLLM Parameters
|
||||
---------------------------
|
||||
|
||||
- We recommend running inference on Gaudi 2 with ``block_size`` of 128
|
||||
for BF16 data type. Using default values (16, 32) might lead to
|
||||
sub-optimal performance due to Matrix Multiplication Engine
|
||||
under-utilization (see `Gaudi
|
||||
Architecture <https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html>`__).
|
||||
- For max throughput on Llama 7B, we recommend running with batch size
|
||||
of 128 or 256 and max context length of 2048 with HPU Graphs enabled.
|
||||
If you encounter out-of-memory issues, see troubleshooting section.
|
||||
|
||||
Environment variables
|
||||
---------------------
|
||||
|
||||
**Diagnostic and profiling knobs:**
|
||||
|
||||
- ``VLLM_PROFILER_ENABLED``: if ``true``, high level profiler will be enabled. Resulting JSON traces can be viewed in `perfetto.habana.ai <https://perfetto.habana.ai/#!/viewer>`__. Disabled by default.
|
||||
- ``VLLM_HPU_LOG_STEP_GRAPH_COMPILATION``: if ``true``, will log graph compilations per each vLLM engine step, only when there was any - highly recommended to use alongside ``PT_HPU_METRICS_GC_DETAILS=1``. Disabled by default.
|
||||
- ``VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL``: if ``true``, will log graph compilations per each vLLM engine step, always, even if there were none. Disabled by default.
|
||||
- ``VLLM_HPU_LOG_STEP_CPU_FALLBACKS``: if ``true``, will log cpu fallbacks per each vLLM engine step, only when there was any. Disabled by default.
|
||||
- ``VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL``: if ``true``, will log cpu fallbacks per each vLLM engine step, always, even if there were none. Disabled by default.
|
||||
|
||||
**Performance tuning knobs:**
|
||||
|
||||
- ``VLLM_SKIP_WARMUP``: if ``true``, warmup will be skipped, ``false`` by default
|
||||
- ``VLLM_GRAPH_RESERVED_MEM``: percentage of memory dedicated for HPUGraph capture, ``0.1`` by default
|
||||
- ``VLLM_GRAPH_PROMPT_RATIO``: percentage of reserved graph memory dedicated for prompt graphs, ``0.3`` by default
|
||||
- ``VLLM_GRAPH_PROMPT_STRATEGY``: strategy determining order of prompt graph capture, ``min_tokens`` or ``max_bs``, ``min_tokens`` by default
|
||||
- ``VLLM_GRAPH_DECODE_STRATEGY``: strategy determining order of decode graph capture, ``min_tokens`` or ``max_bs``, ``max_bs`` by default
|
||||
- ``VLLM_{phase}_{dim}_BUCKET_{param}`` - collection of 12 environment variables configuring ranges of bucketing mechanism
|
||||
|
||||
- ``{phase}`` is either ``PROMPT`` or ``DECODE``
|
||||
- ``{dim}`` is either ``BS``, ``SEQ`` or ``BLOCK``
|
||||
- ``{param}`` is either ``MIN``, ``STEP`` or ``MAX``
|
||||
- Default values:
|
||||
|
||||
- Prompt:
|
||||
- batch size min (``VLLM_PROMPT_BS_BUCKET_MIN``): ``1``
|
||||
- batch size step (``VLLM_PROMPT_BS_BUCKET_STEP``): ``min(max_num_seqs, 32)``
|
||||
- batch size max (``VLLM_PROMPT_BS_BUCKET_MAX``): ``min(max_num_seqs, 64)``
|
||||
- sequence length min (``VLLM_PROMPT_SEQ_BUCKET_MIN``): ``block_size``
|
||||
- sequence length step (``VLLM_PROMPT_SEQ_BUCKET_STEP``): ``block_size``
|
||||
- sequence length max (``VLLM_PROMPT_SEQ_BUCKET_MAX``): ``max_model_len``
|
||||
|
||||
- Decode:
|
||||
- batch size min (``VLLM_DECODE_BS_BUCKET_MIN``): ``1``
|
||||
- batch size step (``VLLM_DECODE_BS_BUCKET_STEP``): ``min(max_num_seqs, 32)``
|
||||
- batch size max (``VLLM_DECODE_BS_BUCKET_MAX``): ``max_num_seqs``
|
||||
- sequence length min (``VLLM_DECODE_BLOCK_BUCKET_MIN``): ``block_size``
|
||||
- sequence length step (``VLLM_DECODE_BLOCK_BUCKET_STEP``): ``block_size``
|
||||
- sequence length max (``VLLM_DECODE_BLOCK_BUCKET_MAX``): ``max(128, (max_num_seqs*max_model_len)/block_size)``
|
||||
|
||||
|
||||
Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
|
||||
|
||||
- ``PT_HPU_LAZY_MODE``: if ``0``, PyTorch Eager backend for Gaudi will be used, if ``1`` PyTorch Lazy backend for Gaudi will be used, ``1`` is default
|
||||
- ``PT_HPU_ENABLE_LAZY_COLLECTIVES``: required to be ``true`` for tensor parallel inference with HPU Graphs
|
||||
|
||||
Troubleshooting: Tweaking HPU Graphs
|
||||
====================================
|
||||
|
||||
If you experience device out-of-memory issues or want to attempt
|
||||
inference at higher batch sizes, try tweaking HPU Graphs by following
|
||||
the below:
|
||||
|
||||
- Tweak ``gpu_memory_utilization`` knob. It will decrease the
|
||||
allocation of KV cache, leaving some headroom for capturing graphs
|
||||
with larger batch size. By default ``gpu_memory_utilization`` is set
|
||||
to 0.9. It attempts to allocate ~90% of HBM left for KV cache after
|
||||
short profiling run. Note that decreasing reduces the number of KV
|
||||
cache blocks you have available, and therefore reduces the effective
|
||||
maximum number of tokens you can handle at a given time.
|
||||
|
||||
- If this method is not efficient, you can disable ``HPUGraph``
|
||||
completely. With HPU Graphs disabled, you are trading latency and
|
||||
throughput at lower batches for potentially higher throughput on
|
||||
higher batches. You can do that by adding ``--enforce-eager`` flag to
|
||||
server (for online inference), or by passing ``enforce_eager=True``
|
||||
argument to LLM constructor (for offline inference).
|
||||
219
vllm-v0.6.2/docs/source/getting_started/installation.rst
Normal file
219
vllm-v0.6.2/docs/source/getting_started/installation.rst
Normal file
@@ -0,0 +1,219 @@
|
||||
.. _installation:
|
||||
|
||||
============
|
||||
Installation
|
||||
============
|
||||
|
||||
vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.
|
||||
|
||||
Requirements
|
||||
============
|
||||
|
||||
* OS: Linux
|
||||
* Python: 3.9 -- 3.12
|
||||
* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
|
||||
|
||||
Install released versions
|
||||
=========================
|
||||
|
||||
You can install vLLM using pip:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ # (Recommended) Create a new conda environment.
|
||||
$ conda create -n myenv python=3.10 -y
|
||||
$ conda activate myenv
|
||||
|
||||
$ # Install vLLM with CUDA 12.1.
|
||||
$ pip install vllm
|
||||
|
||||
.. note::
|
||||
|
||||
Although we recommend using ``conda`` to create and manage Python environments, it is highly recommended to use ``pip`` to install vLLM. This is because ``pip`` can install ``torch`` with separate library packages like ``NCCL``, while ``conda`` installs ``torch`` with statically linked ``NCCL``. This can cause issues when vLLM tries to use ``NCCL``. See `this issue <https://github.com/vllm-project/vllm/issues/8420>`_ for more details.
|
||||
|
||||
.. note::
|
||||
|
||||
As of now, vLLM's binaries are compiled with CUDA 12.1 and public PyTorch release versions by default.
|
||||
We also provide vLLM binaries compiled with CUDA 11.8 and public PyTorch release versions:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ # Install vLLM with CUDA 11.8.
|
||||
$ export VLLM_VERSION=0.6.1.post1
|
||||
$ export PYTHON_VERSION=310
|
||||
$ pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
|
||||
|
||||
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
|
||||
|
||||
Therefore, it is recommended to install vLLM with a **fresh new** conda environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See below for instructions.
|
||||
|
||||
|
||||
.. _install-the-latest-code:
|
||||
|
||||
Install the latest code
|
||||
=======================
|
||||
|
||||
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on a x86 platform with CUDA 12 for every commit since ``v0.5.3``. You can download and install it with the following command:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
|
||||
|
||||
If you want to access the wheels for previous commits, you can specify the commit hash in the URL:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
|
||||
$ pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
|
||||
|
||||
Note that the wheels are built with Python 3.8 ABI (see `PEP 425 <https://peps.python.org/pep-0425/>`_ for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (``1.0.0.dev``) is just a placeholder to have a unified URL for the wheels. The actual versions of wheels are contained in the wheel metadata. Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before.
|
||||
|
||||
Another way to access the latest code is to use the docker images:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
|
||||
$ docker pull public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:${VLLM_COMMIT}
|
||||
|
||||
These docker images are used for CI and testing only, and they are not intended for production use. They will be expired after several days.
|
||||
|
||||
The latest code can contain bugs and may not be stable. Please use it with caution.
|
||||
|
||||
.. _build_from_source:
|
||||
|
||||
Build from source
|
||||
=================
|
||||
|
||||
.. _python-only-build:
|
||||
|
||||
Python-only build (without compilation)
|
||||
---------------------------------------
|
||||
|
||||
If you only need to change Python code, you can simply build vLLM without compilation.
|
||||
|
||||
The first step is to install the latest vLLM wheel:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
|
||||
|
||||
You can find more information about vLLM's wheels `above <#install-the-latest-code>`_.
|
||||
|
||||
After verifying that the installation is successful, you can use `the following script <https://github.com/vllm-project/vllm/blob/main/python_only_dev.py>`_:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ git clone https://github.com/vllm-project/vllm.git
|
||||
$ cd vllm
|
||||
$ python python_only_dev.py
|
||||
|
||||
The script will:
|
||||
|
||||
* Find the installed vLLM package in the current environment.
|
||||
* Copy built files to the current directory.
|
||||
* Rename the installed vLLM package.
|
||||
* Symbolically link the current directory to the installed vLLM package.
|
||||
|
||||
Now, you can edit the Python code in the current directory, and the changes will be reflected when you run vLLM.
|
||||
|
||||
Once you have finished editing or want to install another vLLM wheel, you should exit the development environment using `the same script <https://github.com/vllm-project/vllm/blob/main/python_only_dev.py>`_ with the ``--quit-dev`` (or ``-q`` for short) flag:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ python python_only_dev.py --quit-dev
|
||||
|
||||
The ``--quit-dev`` flag will:
|
||||
|
||||
* Remove the symbolic link from the current directory to the vLLM package.
|
||||
* Restore the original vLLM package from the backup.
|
||||
|
||||
If you update the vLLM wheel and rebuild from the source to make further edits, you will need to repeat the `Python-only build <#python-only-build>`_ steps again.
|
||||
|
||||
.. note::
|
||||
|
||||
There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
|
||||
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to `the section above <#install-the-latest-code>`_ for instructions on how to install a specified wheel.
|
||||
|
||||
Full build (with compilation)
|
||||
-----------------------------
|
||||
|
||||
If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ git clone https://github.com/vllm-project/vllm.git
|
||||
$ cd vllm
|
||||
$ pip install -e .
|
||||
|
||||
.. tip::
|
||||
|
||||
Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
|
||||
For example, you can install `ccache <https://github.com/ccache/ccache>`_ using ``conda install ccache`` or ``apt install ccache`` .
|
||||
As long as ``which ccache`` command can find the ``ccache`` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
|
||||
|
||||
|
||||
Use an existing PyTorch installation
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.:
|
||||
|
||||
* Building vLLM with PyTorch nightly or a custom PyTorch build.
|
||||
* Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run ``pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124`` to `install PyTorch nightly <https://pytorch.org/get-started/locally/>`_, and then build vLLM on top of it.
|
||||
|
||||
To build vLLM using an existing PyTorch installation:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ git clone https://github.com/vllm-project/vllm.git
|
||||
$ cd vllm
|
||||
$ python use_existing_torch.py
|
||||
$ pip install -r requirements-build.txt
|
||||
$ pip install -e . --no-build-isolation
|
||||
|
||||
|
||||
Troubleshooting
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
To avoid your system being overloaded, you can limit the number of compilation jobs
|
||||
to be run simultaneously, via the environment variable ``MAX_JOBS``. For example:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export MAX_JOBS=6
|
||||
$ pip install -e .
|
||||
|
||||
This is especially useful when you are building on less powerful machines. For example, when you use WSL it only `assigns 50% of the total memory by default <https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings>`_, so using ``export MAX_JOBS=1`` can avoid compiling multiple files simultaneously and running out of memory.
|
||||
A side effect is a much slower build process.
|
||||
|
||||
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ # Use `--ipc=host` to make sure the shared memory is large enough.
|
||||
$ docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
|
||||
|
||||
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from `the official website <https://developer.nvidia.com/cuda-toolkit-archive>`_. After installation, set the environment variable ``CUDA_HOME`` to the installation path of CUDA Toolkit, and make sure that the ``nvcc`` compiler is in your ``PATH``, e.g.:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export CUDA_HOME=/usr/local/cuda
|
||||
$ export PATH="${CUDA_HOME}/bin:$PATH"
|
||||
|
||||
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ nvcc --version # verify that nvcc is in your PATH
|
||||
$ ${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
|
||||
|
||||
|
||||
Unsupported OS build
|
||||
--------------------
|
||||
|
||||
vLLM can fully run only on Linux but for development purposes, you can still build it on other systems (for example, macOS), allowing for imports and a more convenient development environment. The binaries will not be compiled and won't work on non-Linux systems.
|
||||
|
||||
Simply disable the ``VLLM_TARGET_DEVICE`` environment variable before installing:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ export VLLM_TARGET_DEVICE=empty
|
||||
$ pip install -e .
|
||||
140
vllm-v0.6.2/docs/source/getting_started/neuron-installation.rst
Normal file
140
vllm-v0.6.2/docs/source/getting_started/neuron-installation.rst
Normal file
@@ -0,0 +1,140 @@
|
||||
.. _installation_neuron:
|
||||
|
||||
Installation with Neuron
|
||||
========================
|
||||
|
||||
vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching.
|
||||
Paged Attention and Chunked Prefill are currently in development and will be available soon.
|
||||
Data types currently supported in Neuron SDK are FP16 and BF16.
|
||||
|
||||
Requirements
|
||||
------------
|
||||
|
||||
* OS: Linux
|
||||
* Python: 3.9 -- 3.11
|
||||
* Accelerator: NeuronCore_v2 (in trn1/inf2 instances)
|
||||
* Pytorch 2.0.1/2.1.1
|
||||
* AWS Neuron SDK 2.16/2.17 (Verified on python 3.8)
|
||||
|
||||
Installation steps:
|
||||
|
||||
- :ref:`Build from source <build_from_source_neuron>`
|
||||
|
||||
- :ref:`Step 0. Launch Trn1/Inf2 instances <launch_instances>`
|
||||
- :ref:`Step 1. Install drivers and tools <install_drivers>`
|
||||
- :ref:`Step 2. Install transformers-neuronx and its dependencies <install_tnx>`
|
||||
- :ref:`Step 3. Install vLLM from source <install_vllm>`
|
||||
|
||||
.. _build_from_source_neuron:
|
||||
|
||||
.. note::
|
||||
|
||||
The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with vLLM >= 0.5.3. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
|
||||
|
||||
Build from source
|
||||
-----------------
|
||||
|
||||
Following instructions are applicable to Neuron SDK 2.16 and beyond.
|
||||
|
||||
.. _launch_instances:
|
||||
|
||||
Step 0. Launch Trn1/Inf2 instances
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Here are the steps to launch trn1/inf2 instances, in order to install `PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 22.04 LTS <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html>`_.
|
||||
|
||||
- Please follow the instructions at `launch an Amazon EC2 Instance <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html#ec2-launch-instance>`_ to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
|
||||
- To get more information about instances sizes and pricing see: `Trn1 web page <https://aws.amazon.com/ec2/instance-types/trn1/>`_, `Inf2 web page <https://aws.amazon.com/ec2/instance-types/inf2/>`_
|
||||
- Select Ubuntu Server 22.04 TLS AMI
|
||||
- When launching a Trn1/Inf2, please adjust your primary EBS volume size to a minimum of 512GB.
|
||||
- After launching the instance, follow the instructions in `Connect to your instance <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html>`_ to connect to the instance
|
||||
|
||||
.. _install_drivers:
|
||||
|
||||
Step 1. Install drivers and tools
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The installation of drivers and tools wouldn't be necessary, if `Deep Learning AMI Neuron <https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html>`_ is installed. In case the drivers and tools are not installed on the operating system, follow the steps below:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
# Configure Linux for Neuron repository updates
|
||||
. /etc/os-release
|
||||
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
|
||||
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
|
||||
EOF
|
||||
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
|
||||
|
||||
# Update OS packages
|
||||
sudo apt-get update -y
|
||||
|
||||
# Install OS headers
|
||||
sudo apt-get install linux-headers-$(uname -r) -y
|
||||
|
||||
# Install git
|
||||
sudo apt-get install git -y
|
||||
|
||||
# install Neuron Driver
|
||||
sudo apt-get install aws-neuronx-dkms=2.* -y
|
||||
|
||||
# Install Neuron Runtime
|
||||
sudo apt-get install aws-neuronx-collectives=2.* -y
|
||||
sudo apt-get install aws-neuronx-runtime-lib=2.* -y
|
||||
|
||||
# Install Neuron Tools
|
||||
sudo apt-get install aws-neuronx-tools=2.* -y
|
||||
|
||||
# Add PATH
|
||||
export PATH=/opt/aws/neuron/bin:$PATH
|
||||
|
||||
|
||||
.. _install_tnx:
|
||||
|
||||
Step 2. Install transformers-neuronx and its dependencies
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
`transformers-neuronx <https://github.com/aws-neuron/transformers-neuronx>`_ will be the backend to support inference on trn1/inf2 instances.
|
||||
Follow the steps below to install transformer-neuronx package and its dependencies.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
# Install Python venv
|
||||
sudo apt-get install -y python3.10-venv g++
|
||||
|
||||
# Create Python venv
|
||||
python3.10 -m venv aws_neuron_venv_pytorch
|
||||
|
||||
# Activate Python venv
|
||||
source aws_neuron_venv_pytorch/bin/activate
|
||||
|
||||
# Install Jupyter notebook kernel
|
||||
pip install ipykernel
|
||||
python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
|
||||
pip install jupyter notebook
|
||||
pip install environment_kernels
|
||||
|
||||
# Set pip repository pointing to the Neuron repository
|
||||
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
|
||||
|
||||
# Install wget, awscli
|
||||
python -m pip install wget
|
||||
python -m pip install awscli
|
||||
|
||||
# Update Neuron Compiler and Framework
|
||||
python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx
|
||||
|
||||
.. _install_vllm:
|
||||
|
||||
Step 3. Install vLLM from source
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Once neuronx-cc and transformers-neuronx packages are installed, we will be able to install vllm as follows:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ git clone https://github.com/vllm-project/vllm.git
|
||||
$ cd vllm
|
||||
$ pip install -U -r requirements-neuron.txt
|
||||
$ VLLM_TARGET_DEVICE="neuron" pip install .
|
||||
|
||||
If neuron packages are detected correctly in the installation process, ``vllm-0.3.0+neuron212`` will be installed.
|
||||
@@ -0,0 +1,116 @@
|
||||
.. _installation_openvino:
|
||||
|
||||
Installation with OpenVINO
|
||||
==========================
|
||||
|
||||
vLLM powered by OpenVINO supports all LLM models from :doc:`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs (`the list of supported GPUs <https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu>`_). OpenVINO vLLM backend supports the following advanced vLLM features:
|
||||
|
||||
- Prefix caching (``--enable-prefix-caching``)
|
||||
- Chunked prefill (``--enable-chunked-prefill``)
|
||||
|
||||
**Table of contents**:
|
||||
|
||||
- :ref:`Requirements <openvino_backend_requirements>`
|
||||
- :ref:`Quick start using Dockerfile <openvino_backend_quick_start_dockerfile>`
|
||||
- :ref:`Build from source <install_openvino_backend_from_source>`
|
||||
- :ref:`Performance tips <openvino_backend_performance_tips>`
|
||||
- :ref:`Limitations <openvino_backend_limitations>`
|
||||
|
||||
.. _openvino_backend_requirements:
|
||||
|
||||
Requirements
|
||||
------------
|
||||
|
||||
* OS: Linux
|
||||
* Instruction set architecture (ISA) requirement: at least AVX2.
|
||||
|
||||
.. _openvino_backend_quick_start_dockerfile:
|
||||
|
||||
Quick start using Dockerfile
|
||||
----------------------------
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ docker build -f Dockerfile.openvino -t vllm-openvino-env .
|
||||
$ docker run -it --rm vllm-openvino-env
|
||||
|
||||
.. _install_openvino_backend_from_source:
|
||||
|
||||
Install from source
|
||||
-------------------
|
||||
|
||||
- First, install Python. For example, on Ubuntu 22.04, you can run:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ sudo apt-get update -y
|
||||
$ sudo apt-get install python3
|
||||
|
||||
- Second, install prerequisites vLLM OpenVINO backend installation:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ pip install --upgrade pip
|
||||
$ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
|
||||
|
||||
- Finally, install vLLM with OpenVINO backend:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v .
|
||||
|
||||
- [Optional] To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: `https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html <https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html>`_.
|
||||
|
||||
.. _openvino_backend_performance_tips:
|
||||
|
||||
Performance tips
|
||||
----------------
|
||||
|
||||
vLLM OpenVINO backend environment variables
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
- ``VLLM_OPENVINO_DEVICE`` to specify which device utilize for the inference. If there are multiple GPUs in the system, additional indexes can be used to choose the proper one (e.g, ``VLLM_OPENVINO_DEVICE=GPU.1``). If the value is not specified, CPU device is used by default.
|
||||
|
||||
- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off. You can also export model with different compression techniques using `optimum-cli` and pass exported folder as `<model_id>`
|
||||
|
||||
CPU performance tips
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
CPU uses the following environment variables to control behavior:
|
||||
|
||||
- ``VLLM_OPENVINO_KVCACHE_SPACE`` to specify the KV Cache size (e.g, ``VLLM_OPENVINO_KVCACHE_SPACE=40`` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
|
||||
|
||||
- ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
|
||||
|
||||
To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (``--enable-chunked-prefill``). Based on the experiments, the recommended batch size is ``256`` (``--max-num-batched-tokens``)
|
||||
|
||||
OpenVINO best known configuration for CPU is:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
|
||||
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256
|
||||
|
||||
GPU performance tips
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
GPU device implements the logic for automatic detection of available GPU memory and, by default, tries to reserve as much memory as possible for the KV cache (taking into account ``gpu_memory_utilization`` option). However, this behavior can be overridden by explicitly specifying the desired amount of memory for the KV cache using ``VLLM_OPENVINO_KVCACHE_SPACE`` environment variable (e.g, ``VLLM_OPENVINO_KVCACHE_SPACE=8`` means 8 GB space for KV cache).
|
||||
|
||||
Currently, the best performance using GPU can be achieved with the default vLLM execution parameters for models with quantized weights (8 and 4-bit integer data types are supported) and `preemption-mode=swap`.
|
||||
|
||||
OpenVINO best known configuration for GPU is:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
|
||||
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json
|
||||
|
||||
.. _openvino_backend_limitations:
|
||||
|
||||
Limitations
|
||||
-----------
|
||||
|
||||
- LoRA serving is not supported.
|
||||
|
||||
- Only LLM models are currently supported. LLaVa and encoder-decoder models are not currently enabled in vLLM OpenVINO integration.
|
||||
|
||||
- Tensor and pipeline parallelism are not currently enabled in vLLM integration.
|
||||
181
vllm-v0.6.2/docs/source/getting_started/quickstart.rst
Normal file
181
vllm-v0.6.2/docs/source/getting_started/quickstart.rst
Normal file
@@ -0,0 +1,181 @@
|
||||
.. _quickstart:
|
||||
|
||||
==========
|
||||
Quickstart
|
||||
==========
|
||||
|
||||
This guide will help you quickly get started with vLLM to:
|
||||
|
||||
* :ref:`Run offline batched inference <offline_batched_inference>`
|
||||
* :ref:`Run OpenAI-compatible inference <openai_compatible_server>`
|
||||
|
||||
Prerequisites
|
||||
--------------
|
||||
- OS: Linux
|
||||
- Python: 3.9 -- 3.12
|
||||
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
|
||||
|
||||
Installation
|
||||
--------------
|
||||
|
||||
You can install vLLM using pip. It's recommended to use `conda <https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html>`_ to create and manage Python environments.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ conda create -n myenv python=3.10 -y
|
||||
$ conda activate myenv
|
||||
$ pip install vllm
|
||||
|
||||
Please refer to the :ref:`installation documentation <installation>` for more details on installing vLLM.
|
||||
|
||||
.. _offline_batched_inference:
|
||||
|
||||
Offline Batched Inference
|
||||
-------------------------
|
||||
|
||||
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). The example script for this section can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`__.
|
||||
|
||||
The first line of this example imports the classes :class:`~vllm.LLM` and :class:`~vllm.SamplingParams`:
|
||||
|
||||
- :class:`~vllm.LLM` is the main class for running offline inference with vLLM engine.
|
||||
- :class:`~vllm.SamplingParams` specifies the parameters for the sampling process.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
The next section defines a list of input prompts and sampling parameters for text generation. The `sampling temperature <https://arxiv.org/html/2402.05201v1>`_ is set to ``0.8`` and the `nucleus sampling probability <https://en.wikipedia.org/wiki/Top-p_sampling>`_ is set to ``0.95``. You can find more information about the sampling parameters `here <https://docs.vllm.ai/en/stable/dev/sampling_params.html>`__.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
prompts = [
|
||||
"Hello, my name is",
|
||||
"The president of the United States is",
|
||||
"The capital of France is",
|
||||
"The future of AI is",
|
||||
]
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
|
||||
The :class:`~vllm.LLM` class initializes vLLM's engine and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_ for offline inference. The list of supported models can be found :ref:`here <supported_models>`.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
llm = LLM(model="facebook/opt-125m")
|
||||
|
||||
.. note::
|
||||
|
||||
By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_, set the environment variable ``VLLM_USE_MODELSCOPE`` before initializing the engine.
|
||||
|
||||
Now, the fun part! The outputs are generated using ``llm.generate``. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all of the output tokens.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated_text = output.outputs[0].text
|
||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
||||
|
||||
.. _openai_compatible_server:
|
||||
|
||||
OpenAI-Compatible Server
|
||||
------------------------
|
||||
|
||||
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
|
||||
By default, it starts the server at ``http://localhost:8000``. You can specify the address with ``--host`` and ``--port`` arguments. The server currently hosts one model at a time and implements endpoints such as `list models <https://platform.openai.com/docs/api-reference/models/list>`_, `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_, and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_ endpoints.
|
||||
|
||||
Run the following command to start the vLLM server with the `Qwen2.5-1.5B-Instruct <https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct>`_ model:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ vllm serve Qwen/Qwen2.5-1.5B-Instruct
|
||||
|
||||
.. note::
|
||||
|
||||
By default, the server uses a predefined chat template stored in the tokenizer. You can learn about overriding it `here <https://github.com/vllm-project/vllm/blob/main/docs/source/serving/openai_compatible_server.md#chat-template>`__.
|
||||
|
||||
This server can be queried in the same format as OpenAI API. For example, to list the models:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ curl http://localhost:8000/v1/models
|
||||
|
||||
You can pass in the argument ``--api-key`` or environment variable ``VLLM_API_KEY`` to enable the server to check for API key in the header.
|
||||
|
||||
OpenAI Completions API with vLLM
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Once your server is started, you can query the model with input prompts:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ curl http://localhost:8000/v1/completions \
|
||||
$ -H "Content-Type: application/json" \
|
||||
$ -d '{
|
||||
$ "model": "Qwen/Qwen2.5-1.5B-Instruct",
|
||||
$ "prompt": "San Francisco is a",
|
||||
$ "max_tokens": 7,
|
||||
$ "temperature": 0
|
||||
$ }'
|
||||
|
||||
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the ``openai`` python package:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
|
||||
# Modify OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
|
||||
prompt="San Francisco is a")
|
||||
print("Completion result:", completion)
|
||||
|
||||
A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.
|
||||
|
||||
OpenAI Chat Completions API with vLLM
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
|
||||
|
||||
You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ curl http://localhost:8000/v1/chat/completions \
|
||||
$ -H "Content-Type: application/json" \
|
||||
$ -d '{
|
||||
$ "model": "Qwen/Qwen2.5-1.5B-Instruct",
|
||||
$ "messages": [
|
||||
$ {"role": "system", "content": "You are a helpful assistant."},
|
||||
$ {"role": "user", "content": "Who won the world series in 2020?"}
|
||||
$ ]
|
||||
$ }'
|
||||
|
||||
Alternatively, you can use the ``openai`` python package:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from openai import OpenAI
|
||||
# Set OpenAI's API key and API base to use vLLM's API server.
|
||||
openai_api_key = "EMPTY"
|
||||
openai_api_base = "http://localhost:8000/v1"
|
||||
|
||||
client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url=openai_api_base,
|
||||
)
|
||||
|
||||
chat_response = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-1.5B-Instruct",
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "Tell me a joke."},
|
||||
]
|
||||
)
|
||||
print("Chat response:", chat_response)
|
||||
184
vllm-v0.6.2/docs/source/getting_started/tpu-installation.rst
Normal file
184
vllm-v0.6.2/docs/source/getting_started/tpu-installation.rst
Normal file
@@ -0,0 +1,184 @@
|
||||
.. _installation_tpu:
|
||||
|
||||
#####################
|
||||
Installation with TPU
|
||||
#####################
|
||||
|
||||
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
|
||||
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
|
||||
are available in different versions each with different hardware specifications.
|
||||
For more information about TPUs, see `TPU System Architecture <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm>`_.
|
||||
For more information on the TPU versions supported with vLLM, see:
|
||||
|
||||
* `TPU v6e <https://cloud.google.com/tpu/docs/v6e>`_
|
||||
* `TPU v5e <https://cloud.google.com/tpu/docs/v5e>`_
|
||||
* `TPU v5p <https://cloud.google.com/tpu/docs/v5p>`_
|
||||
* `TPU v4 <https://cloud.google.com/tpu/docs/v4>`_
|
||||
|
||||
These TPU versions allow you to configure the physical arrangements of the TPU
|
||||
chips. This can improve throughput and networking performance. For more
|
||||
information see:
|
||||
|
||||
* `TPU v6e topologies <https://cloud.google.com/tpu/docs/v6e#configurations>`_
|
||||
* `TPU v5e topologies <https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config>`_
|
||||
* `TPU v5p topologies <https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config>`_
|
||||
* `TPU v4 topologies <https://cloud.google.com/tpu/docs/v4#tpu-v4-config>`_
|
||||
|
||||
In order for you to use Cloud TPUs you need to have TPU quota granted to your
|
||||
Google Cloud Platform project. TPU quotas specify how many TPUs you can use in a
|
||||
GPC project and are specified in terms of TPU version, the number of TPU you
|
||||
want to use, and quota type. For more information, see `TPU quota <https://cloud.google.com/tpu/docs/quota#tpu_quota>`_.
|
||||
|
||||
For TPU pricing information, see `Cloud TPU pricing <https://cloud.google.com/tpu/pricing>`_.
|
||||
|
||||
You may need additional persistent storage for your TPU VMs. For more
|
||||
information, see `Storage options for Cloud TPU data <https://cloud.devsite.corp.google.com/tpu/docs/storage-options>`_.
|
||||
|
||||
Requirements
|
||||
------------
|
||||
|
||||
* Google Cloud TPU VM
|
||||
* TPU versions: v6e, v5e, v5p, v4
|
||||
* Python: 3.10 or newer
|
||||
|
||||
Provision Cloud TPUs
|
||||
====================
|
||||
|
||||
You can provision Cloud TPUs using the `Cloud TPU API <https://cloud.google.com/tpu/docs/reference/rest>`_`
|
||||
or the `queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_`
|
||||
API. This section shows how to create TPUs using the queued resource API.
|
||||
For more information about using the Cloud TPU API, see `Create a Cloud TPU using the Create Node API <https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api>`_.
|
||||
`Queued resources <https://cloud.devsite.corp.google.com/tpu/docs/queued-resources>`_
|
||||
enable you to request Cloud TPU resources in a queued manner. When you request
|
||||
queued resources, the request is added to a queue maintained by the Cloud TPU
|
||||
service. When the requested resource becomes available, it's assigned to your
|
||||
Google Cloud project for your immediate exclusive use.
|
||||
|
||||
Provision a Cloud TPU with the queued resource API
|
||||
--------------------------------------------------
|
||||
Create a TPU v5e with 4 TPU chips:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
|
||||
--node-id TPU_NAME \
|
||||
--project PROJECT_ID \
|
||||
--zone ZONE \
|
||||
--accelerator-type ACCELERATOR_TYPE \
|
||||
--runtime-version RUNTIME_VERSION \
|
||||
--service-account SERVICE_ACCOUNT
|
||||
|
||||
.. list-table:: Parameter descriptions
|
||||
:header-rows: 1
|
||||
|
||||
* - Parameter name
|
||||
- Description
|
||||
* - QUEUED_RESOURCE_ID
|
||||
- The user-assigned ID of the queued resource request.
|
||||
* - TPU_NAME
|
||||
- The user-assigned name of the TPU which is created when the queued
|
||||
resource request is allocated.
|
||||
* - PROJECT_ID
|
||||
- Your Google Cloud project
|
||||
* - ZONE
|
||||
- The `zone <https://cloud.google.com/tpu/docs/regions-zones>`_ where you
|
||||
want to create your Cloud TPU.
|
||||
* - ACCELERATOR_TYPE
|
||||
- The TPU version you want to use. Specify the TPU version, followed by a
|
||||
'-' and the number of TPU cores. For example `v5e-4` specifies a v5e TPU
|
||||
with 4 cores. For more information, see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
|
||||
* - RUNTIME_VERSION
|
||||
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
|
||||
* - SERVICE_ACCOUNT
|
||||
- The email address for your service account. You can find it in the IAM
|
||||
Cloud Console under *Service Accounts*. For example:
|
||||
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
|
||||
|
||||
Connect to your TPU using SSH:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
gcloud compute tpus tpu-vm ssh TPU_NAME
|
||||
|
||||
Create and activate a Conda environment for vLLM:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
conda create -n vllm python=3.10 -y
|
||||
conda activate vllm
|
||||
|
||||
Clone the vLLM repository and go to the vLLM directory:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
git clone https://github.com/vllm-project/vllm.git && cd vllm
|
||||
|
||||
Uninstall the existing `torch` and `torch_xla` packages:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip uninstall torch torch-xla -y
|
||||
|
||||
Install build dependencies:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install -r requirements-tpu.txt
|
||||
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
|
||||
|
||||
Run the setup script:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
VLLM_TARGET_DEVICE="tpu" python setup.py develop
|
||||
|
||||
|
||||
Provision Cloud TPUs with GKE
|
||||
-----------------------------
|
||||
|
||||
For more information about using TPUs with GKE, see
|
||||
https://cloud.google.com/kubernetes-engine/docs/how-to/tpus
|
||||
https://cloud.google.com/kubernetes-engine/docs/concepts/tpus
|
||||
https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus
|
||||
|
||||
.. _build_docker_tpu:
|
||||
|
||||
Build a docker image with :code:`Dockerfile.tpu`
|
||||
------------------------------------------------
|
||||
|
||||
You can use `Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_
|
||||
to build a Docker image with TPU support.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ docker build -f Dockerfile.tpu -t vllm-tpu .
|
||||
|
||||
Run the Docker image with the following command:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ # Make sure to add `--privileged --net host --shm-size=16G`.
|
||||
$ docker run --privileged --net host --shm-size=16G -it vllm-tpu
|
||||
|
||||
.. note::
|
||||
|
||||
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the possible input shapes and compiles an XLA graph for each different shape.
|
||||
The compilation time may take 20~30 minutes in the first run.
|
||||
However, the compilation time reduces to ~5 minutes afterwards because the XLA graphs are cached in the disk (in :code:`VLLM_XLA_CACHE_PATH` or :code:`~/.cache/vllm/xla_cache` by default).
|
||||
|
||||
.. tip::
|
||||
|
||||
If you encounter the following error:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
from torch._C import * # noqa: F403
|
||||
ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory
|
||||
|
||||
|
||||
Install OpenBLAS with the following command:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
|
||||
|
||||
80
vllm-v0.6.2/docs/source/getting_started/xpu-installation.rst
Normal file
80
vllm-v0.6.2/docs/source/getting_started/xpu-installation.rst
Normal file
@@ -0,0 +1,80 @@
|
||||
.. _installation_xpu:
|
||||
|
||||
Installation with XPU
|
||||
========================
|
||||
|
||||
vLLM initially supports basic model inferencing and serving on Intel GPU platform.
|
||||
|
||||
Table of contents:
|
||||
|
||||
#. :ref:`Requirements <xpu_backend_requirements>`
|
||||
#. :ref:`Quick start using Dockerfile <xpu_backend_quick_start_dockerfile>`
|
||||
#. :ref:`Build from source <build_xpu_backend_from_source>`
|
||||
|
||||
.. _xpu_backend_requirements:
|
||||
|
||||
Requirements
|
||||
------------
|
||||
|
||||
* OS: Linux
|
||||
* Supported Hardware: Intel Data Center GPU, Intel ARC GPU
|
||||
* OneAPI requirements: oneAPI 2024.2
|
||||
|
||||
.. _xpu_backend_quick_start_dockerfile:
|
||||
|
||||
Quick start using Dockerfile
|
||||
----------------------------
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ docker build -f Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
|
||||
$ docker run -it \
|
||||
--rm \
|
||||
--network=host \
|
||||
--device /dev/dri \
|
||||
-v /dev/dri/by-path:/dev/dri/by-path \
|
||||
vllm-xpu-env
|
||||
|
||||
.. _build_xpu_backend_from_source:
|
||||
|
||||
Build from source
|
||||
-----------------
|
||||
|
||||
- First, install required driver and intel OneAPI 2024.2 or later.
|
||||
|
||||
- Second, install Python packages for vLLM XPU backend building:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ source /opt/intel/oneapi/setvars.sh
|
||||
$ pip install --upgrade pip
|
||||
$ pip install -v -r requirements-xpu.txt
|
||||
|
||||
- Finally, build and install vLLM XPU backend:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ VLLM_TARGET_DEVICE=xpu python setup.py install
|
||||
|
||||
.. note::
|
||||
- FP16 is the default data type in the current XPU backend. The BF16 data
|
||||
type will be supported in the future.
|
||||
|
||||
|
||||
Distributed inference and serving
|
||||
---------------------------------
|
||||
|
||||
XPU platform supports tensor-parallel inference/serving and also supports pipeline parallel as a beta feature for online serving. We requires Ray as the distributed runtime backend. For example, a reference execution likes following:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ python -m vllm.entrypoints.openai.api_server \
|
||||
$ --model=facebook/opt-13b \
|
||||
$ --dtype=bfloat16 \
|
||||
$ --device=xpu \
|
||||
$ --max_model_len=1024 \
|
||||
$ --distributed-executor-backend=ray \
|
||||
$ --pipeline-parallel-size=2 \
|
||||
$ -tp=8
|
||||
|
||||
By default, a ray instance will be launched automatically if no existing one is detected in system, with ``num-gpus`` equals to ``parallel_config.world_size``. We recommend properly starting a ray cluster before execution, referring helper `script <https://github.com/vllm-project/vllm/tree/main/examples/run_cluster.sh>`_.
|
||||
Reference in New Issue
Block a user