Sync from v0.13
47
docs/serving/context_parallel_deployment.md
Normal file
@@ -0,0 +1,47 @@
|
||||
# Context Parallel Deployment
|
||||
|
||||
Context parallel mainly addresses the problem of serving long-context requests. Since prefill and decode have quite different characteristics and SLOs (service level objectives), context parallel needs to be implemented separately for each. The major considerations are:
|
||||
|
||||
- For long context prefill, we need to control the TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
|
||||
- For long context decode, we need more space for the KV cache to increase the batch size (and hence the throughput).
|
||||
|
||||
## Prefill Context Parallel
|
||||
|
||||
During prefill, for a long request with `T` new tokens, we need to compute query/key/value tensors for these new tokens. Say we have `N` GPUs; we can split the request into `N` chunks, and each GPU computes one chunk of the query/key/value tensors.
|
||||
|
||||
Depending on the use case, there are two possible strategies:
|
||||
|
||||
1. Partial query, full key/value: If the request is only moderately long (we can afford to hold the full key/value tensors) and the goal is to accelerate the prefill (amortizing its computation time across query tokens), we can gather the key/value tensors from all GPUs and let each GPU compute the attention output corresponding to the query tokens of its chunk.
|
||||
2. Partial query, partial key/value: If the request is so long that we can no longer afford to hold the full key/value tensors, each GPU can only compute its own chunk of the query/key/value tensors, and techniques like [ring-attention](http://arxiv.org/abs/2310.01889) are used to send/recv key/value tensors chunk by chunk (see the sketch below).
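To make the two strategies concrete, here is a minimal illustrative sketch (plain Python, not vLLM code) of which query tokens each rank owns and how much key/value data it needs under each strategy; the token count and GPU count are arbitrary assumptions:

```python
# Illustrative sketch only: token ownership under the two prefill
# context-parallel strategies, for a request of T new tokens on N GPUs.
T, N = 32_000, 4

for rank in range(N):
    # Every rank computes query/key/value tensors for one contiguous chunk.
    start, stop = rank * T // N, (rank + 1) * T // N

    # Strategy 1 (partial query, full key/value): after gathering K/V from
    # all ranks, this rank runs attention for its queries over all T tokens.
    kv_gathered = T

    # Strategy 2 (partial query, partial key/value): only the local K/V chunk
    # is resident; remote chunks are exchanged step by step (ring attention).
    kv_resident = stop - start

    print(f"rank {rank}: queries [{start}, {stop}), "
          f"K/V after gather: {kv_gathered}, resident K/V: {kv_resident}")
```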
|
||||
|
||||
Both approaches are under active development.
|
||||
|
||||
## Decode Context Parallel
|
||||
|
||||
Due to the auto-regressive nature of decoding, every decoding step computes a small number of query tokens against a large number of key/value tokens stored in the paged KV cache. The core question of decode context parallel is how to shard the KV cache across GPUs.
|
||||
|
||||
For a model with `H` kv-heads, a request with `T` tokens in the context needs to store `H * T` key/value tensors in the KV cache.
|
||||
|
||||
1. If one GPU can hold them all, and the performance is good enough, then no parallelization is needed.
|
||||
2. If one GPU cannot hold them all, or we want to hold more requests in the KV cache, we can first shard the KV cache along the `H` dimension, which is plain tensor parallel sharding. It's as simple as adding `-tp <num_gpus>` to the command line.
|
||||
3. Since `H` is limited (determined by the model architecture), when we continue to increase the tensor parallel size, the KV cache on each GPU is duplicated `tp_size / H` times. Duplication is, of course, bad for efficiency, so we add decode context parallel to further shard the KV cache along the `T` dimension. This is as simple as adding `-dcp <size>` to the command line. Note that `size` does not increase the number of GPUs we need to launch; it just reduces the KV cache duplication. The dcp size should lie in the range `[1, tp_size/H]`. With a larger dcp size, the KV cache duplication is reduced, but the communication overhead increases.
|
||||
|
||||
Theoretically, it is possible to extend the dcp size beyond `tp_size / H` to further shard the KV cache and accelerate the decoding phase. However, since the number of query tokens is limited in decoding, it's unclear what the remaining `dcp_size - tp_size / H` GPUs should do for non-attention layers. For the sake of simplicity, the dcp size is upper bounded by `tp_size / H`. If you want to further accelerate the decoding phase, consider increasing `tp_size` first, and then increasing the dcp size.
|
||||
|
||||
Note that the KV cache can grow during decoding, so the sharding strategy needs to be carefully implemented. We use an interleaving strategy to shard the KV cache along the `T` dimension, so that the KV cache for future tokens is naturally sharded along the `T` dimension as well. This was proposed by [Chao Hong from Moonshot](https://github.com/youzhedian), and is also explained in detail in [this paper](http://arxiv.org/abs/2507.07120).
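A minimal sketch of such an interleaved sharding rule (illustrative only, not vLLM's actual implementation):

```python
# Interleaved sharding of KV cache entries along the token dimension:
# token t's key/value entry lives on DCP rank (t % dcp_size), so tokens
# generated later during decoding keep spreading evenly across ranks.
dcp_size = 8

def kv_home_rank(token_position: int) -> int:
    return token_position % dcp_size

print([kv_home_rank(t) for t in range(12)])
# -> [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]
```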
|
||||
|
||||
Case studies:
|
||||
|
||||
For DeepSeek-R1, we have 1 kv-head when MLA is enabled. The typical single-node deployment with `-tp 8` causes 8x KV cache duplication. We can consider adding `-dcp 8` to reduce the KV cache duplication.
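An illustrative launch command for this case (assuming the `deepseek-ai/DeepSeek-R1` checkpoint on a single 8-GPU node):

```bash
# 8-way tensor parallel plus 8-way decode context parallel removes the
# 8x KV cache duplication described above.
vllm serve deepseek-ai/DeepSeek-R1 -tp 8 -dcp 8
```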
|
||||
|
||||
For Kimi-K2, the architecture is similar to DeepSeek-R1, but with more parameters. When we deploy it with `-tp 16`, the KV cache duplication is 16x. We can add `-dcp 16` to completely remove the KV cache duplication, at the cost of more communication overhead. We can also add `-dcp 8` to reduce the KV cache duplication to 2x. Although it still duplicates the KV cache twice, the communication overhead is smaller since the DCP communication only happens inside one node.
|
||||
|
||||
For Qwen3-235B-A22B, we have 4 kv-heads. When we deploy it with `-tp 8`, the KV cache duplication is 2x. Then we can add `-dcp 2` to remove the KV cache duplication.
|
||||
|
||||
In short, for decode context parallel, try to increase `-tp` size until you get satisfactory performance, and then add `-dcp` to reduce the KV cache duplication.
|
||||
|
||||
Decode context parallel is supported in vLLM, for both MLA and GQA models. Some attention backends also support the combination of decode context parallel and MTP (multi-token prediction) to further accelerate the decoding phase.
|
||||
|
||||
## Technical Discussions
|
||||
|
||||
The main discussions happen in the `#sig-context-parallel` channel of [vLLM Slack](https://slack.vllm.ai/).
|
||||
133
docs/serving/data_parallel_deployment.md
Normal file
@@ -0,0 +1,133 @@
|
||||
# Data Parallel Deployment
|
||||
|
||||
vLLM supports Data Parallel deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests.
|
||||
|
||||
This will work with both dense and MoE models.
|
||||
|
||||
For MoE models, particularly those like DeepSeek that employ MLA (Multi-head Latent Attention), it can be advantageous to use data parallel for the attention layers and expert or tensor parallel (EP or TP) for the expert layers.
|
||||
|
||||
In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned, and expert layers across all ranks are required to synchronize during every forward pass, even when there are fewer requests to be processed than DP ranks.
|
||||
|
||||
By default, expert layers form a tensor parallel group of size `DP × TP`. To use expert parallelism instead, include the `--enable-expert-parallel` CLI arg (on all nodes in the multi-node case). See [Expert Parallel Deployment](expert_parallel_deployment.md) for details on how attention and expert layers behave differently with EP enabled.
|
||||
|
||||
In vLLM, each DP rank is deployed as a separate "core engine" process that communicates with front-end process(es) via ZMQ sockets. Data Parallel attention can be combined with Tensor Parallel attention, in which case each DP engine owns a number of per-GPU worker processes equal to the configured TP size.
|
||||
|
||||
For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP Coordinator process that communicates with all ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form a group of size `DP × TP` (using either tensor parallelism by default, or expert parallelism if `--enable-expert-parallel` is set).
|
||||
|
||||
In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
|
||||
|
||||
This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the `LLM` class); for an example, see [examples/offline_inference/data_parallel.py](../../examples/offline_inference/data_parallel.py).
|
||||
|
||||
There are two distinct modes supported for online deployments - self-contained with internal load balancing, or external per-rank process deployment and load balancing - plus a hybrid mode that combines aspects of both.
|
||||
|
||||
## Internal Load Balancing
|
||||
|
||||
vLLM supports "self-contained" data parallel deployments that expose a single API endpoint.
|
||||
|
||||
It can be configured by simply including e.g. `--data-parallel-size=4` in the vllm serve command line arguments. This will require 4 GPUs. It can be combined with tensor parallel, for example `--data-parallel-size=4 --tensor-parallel-size=2`, which would require 8 GPUs. When sizing DP deployments, remember that `--max-num-seqs` applies per DP rank.
|
||||
|
||||
Running a single data parallel deployment across multiple nodes requires a different `vllm serve` to be run on each node, specifying which DP ranks should run on that node. In this case, there will still be a single HTTP entrypoint - the API server(s) will run only on one node, but it doesn't necessarily need to be co-located with the DP ranks.
|
||||
|
||||
This will run DP=4, TP=2 on a single 8-GPU node:
|
||||
|
||||
```bash
|
||||
vllm serve $MODEL --data-parallel-size 4 --tensor-parallel-size 2
|
||||
```
|
||||
|
||||
This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on the second node:
|
||||
|
||||
```bash
|
||||
# Node 0 (with ip address 10.99.48.128)
|
||||
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
|
||||
--data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
|
||||
# Node 1
|
||||
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 2 \
|
||||
--data-parallel-start-rank 2 \
|
||||
--data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
|
||||
```
|
||||
|
||||
This will run DP=4 with only the API server on the first node and all engines on the second node:
|
||||
|
||||
```bash
|
||||
# Node 0 (with ip address 10.99.48.128)
|
||||
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
|
||||
--data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
|
||||
# Node 1
|
||||
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 \
|
||||
--data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
|
||||
```
|
||||
|
||||
This DP mode can also be used with Ray by specifying `--data-parallel-backend=ray`:
|
||||
|
||||
```bash
|
||||
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
|
||||
--data-parallel-backend=ray
|
||||
```
|
||||
|
||||
There are several notable differences when using Ray:
|
||||
|
||||
- A single launch command (on any node) is needed to start all local and remote DP ranks, so it is more convenient than launching separately on each node
|
||||
- There is no need to specify `--data-parallel-address`, and the node where the command is run is used as `--data-parallel-address`
|
||||
- There is no need to specify `--data-parallel-rpc-port`
|
||||
- When a single DP group requires multiple nodes, *e.g.* when a single model replica needs to run on at least two nodes, make sure to set `VLLM_RAY_DP_PACK_STRATEGY="span"`, in which case `--data-parallel-size-local` is ignored and determined automatically
|
||||
- Remote DP ranks will be allocated based on node resources of the Ray cluster
|
||||
|
||||
Currently, the internal DP load balancing is done within the API server process(es) and is based on the running and waiting queues in each of the engines. This could be made more sophisticated in the future by incorporating KV cache aware logic.
|
||||
|
||||
When deploying large DP sizes using this method, the API server process can become a bottleneck. In this case, the orthogonal `--api-server-count` command line option can be used to scale this out (for example `--api-server-count=4`). This is transparent to users - a single HTTP endpoint / port is still exposed. Note that this API server scale-out is "internal" and still confined to the "head" node.
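For example, a hypothetical single-node launch with scaled-out API servers might look like:

```bash
# DP=8 with 4 API server processes behind a single HTTP port.
vllm serve $MODEL --data-parallel-size 8 --api-server-count 4
```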
|
||||
|
||||
<figure markdown="1">
|
||||

|
||||
</figure>
|
||||
|
||||
## Hybrid Load Balancing
|
||||
|
||||
Hybrid load balancing sits between the internal and external approaches. Each node runs its own API server(s) that only queue requests to the data-parallel engines colocated on that node. An upstream load balancer (for example, an ingress controller or traffic router) spreads user requests across those per-node endpoints.
|
||||
|
||||
Enable this mode with `--data-parallel-hybrid-lb` while still launching every node with the global data-parallel size. The key differences from internal load balancing are:
|
||||
|
||||
- You must provide `--data-parallel-size-local` and `--data-parallel-start-rank` so each node knows which ranks it owns.
|
||||
- Not compatible with `--headless` since every node exposes an API endpoint.
|
||||
- Scale `--api-server-count` per node based on the number of local ranks
|
||||
|
||||
In this configuration, each node keeps scheduling decisions local, which reduces cross-node traffic and avoids single node bottlenecks at larger DP sizes.
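A hypothetical two-node DP=4 launch in this mode (reusing the addressing flags from the examples above; an external load balancer would sit in front of port 8000 on both nodes):

```bash
# Node 0 (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
    --data-parallel-start-rank 0 --data-parallel-hybrid-lb \
    --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
# Node 1
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 --data-parallel-hybrid-lb \
    --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
```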
|
||||
|
||||
## External Load Balancing
|
||||
|
||||
For larger scale deployments especially, it can make sense to handle the orchestration and load balancing of data parallel ranks externally.
|
||||
|
||||
In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
|
||||
|
||||
This can already be done trivially for non-MoE models, since each deployed server is fully independent. No data parallel CLI options need to be used for this.
|
||||
|
||||
We support an equivalent topology for MoE DP+EP which can be configured via the following CLI arguments.
|
||||
|
||||
If DP ranks are co-located (same node / ip address), a default RPC port is used, but a different HTTP server port must be specified for each rank:
|
||||
|
||||
```bash
|
||||
# Rank 0
|
||||
CUDA_VISIBLE_DEVICES=0 vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 0 \
|
||||
--port 8000
|
||||
# Rank 1
|
||||
CUDA_VISIBLE_DEVICES=1 vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 1 \
|
||||
--port 8001
|
||||
```
|
||||
|
||||
For multi-node cases, the address/port of rank 0 must also be specified:
|
||||
|
||||
```bash
|
||||
# Rank 0 (with ip address 10.99.48.128)
|
||||
vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 0 \
|
||||
--data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
|
||||
# Rank 1
|
||||
vllm serve $MODEL --data-parallel-size 2 --data-parallel-rank 1 \
|
||||
--data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
|
||||
```
|
||||
|
||||
The coordinator process also runs in this scenario, co-located with the DP rank 0 engine.
|
||||
|
||||
<figure markdown="1">
|
||||

|
||||
</figure>
|
||||
|
||||
In the above diagram, each of the dotted boxes corresponds to a separate launch of `vllm serve` - these could be separate Kubernetes pods, for example.
|
||||
16
docs/serving/distributed_troubleshooting.md
Normal file
@@ -0,0 +1,16 @@
|
||||
# Troubleshooting distributed deployments
|
||||
|
||||
For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
|
||||
|
||||
## Verify inter-node GPU communication
|
||||
|
||||
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script](../usage/troubleshooting.md#incorrect-hardwaredriver). If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
|
||||
|
||||
## No available node types can fulfill resource request
|
||||
|
||||
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <https://github.com/vllm-project/vllm/issues/7815>.
|
||||
|
||||
## Ray observability
|
||||
|
||||
Debugging a distributed system can be challenging due to the large scale and complexity. Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters. For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). For information about troubleshooting Kubernetes clusters, see the
|
||||
[official KubeRay troubleshooting guide](https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html).
|
||||
322
docs/serving/expert_parallel_deployment.md
Normal file
@@ -0,0 +1,322 @@
|
||||
# Expert Parallel Deployment
|
||||
|
||||
vLLM supports Expert Parallelism (EP), which allows experts in Mixture-of-Experts (MoE) models to be deployed on separate GPUs, increasing locality, efficiency, and throughput overall.
|
||||
|
||||
EP is typically coupled with Data Parallelism (DP). While DP can be used independently of EP, EP is more efficient when used in conjunction with DP. You can read more about data parallelism [here](data_parallel_deployment.md).
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
|
||||
|
||||
1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](../../tools/ep_kernels).
|
||||
2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
|
||||
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](../../tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
|
||||
|
||||
### Backend Selection Guide
|
||||
|
||||
vLLM provides multiple communication backends for EP. Use `--all2all-backend` to select one:
|
||||
|
||||
| Backend | Use Case | Features | Best For |
|
||||
|---------|----------|----------|----------|
|
||||
| `allgather_reducescatter` | Default backend | Standard all2all using allgather/reducescatter primitives | General purpose, works with any EP+DP configuration |
|
||||
| `pplx` | Single node | Chunked prefill support, efficient intra-node communication | Single-node deployments, development |
|
||||
| `deepep_high_throughput` | Multi-node prefill | Grouped GEMM with continuous layout, optimized for prefill | Prefill-dominated workloads, high-throughput scenarios |
|
||||
| `deepep_low_latency` | Multi-node decode | CUDA graph support, masked layout, optimized for decode | Decode-dominated workloads, low-latency scenarios |
|
||||
| `flashinfer_all2allv` | MNNVL systems | FlashInfer alltoallv kernels for multi-node NVLink | Systems with NVLink across nodes |
|
||||
| `naive` | Testing/debugging | Simple broadcast-based implementation | Debugging, not recommended for production |
|
||||
|
||||
## Single Node Deployment
|
||||
|
||||
!!! warning
|
||||
EP is an experimental feature. Argument names and default values may change in the future.
|
||||
|
||||
### Configuration
|
||||
|
||||
Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as:
|
||||
|
||||
```text
|
||||
EP_SIZE = TP_SIZE × DP_SIZE
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
- `TP_SIZE`: Tensor parallel size
|
||||
- `DP_SIZE`: Data parallel size
|
||||
- `EP_SIZE`: Expert parallel size (computed automatically)
|
||||
|
||||
### Layer Behavior with EP Enabled
|
||||
|
||||
When EP is enabled, different layers in MoE models behave differently:
|
||||
|
||||
| Layer Type | Behavior | Parallelism Used |
|
||||
|------------|----------|------------------|
|
||||
| **Expert (MoE) Layers** | Sharded across all EP ranks | Expert Parallel (EP) of size `TP × DP` |
|
||||
| **Attention Layers** | Behavior depends on TP size | See below |
|
||||
|
||||
**Attention layer parallelism:**
|
||||
|
||||
- **When `TP = 1`**: Attention weights are **replicated** across all DP ranks (data parallelism)
|
||||
- **When `TP > 1`**: Attention weights are **sharded** using tensor parallelism across TP ranks within each DP group
|
||||
|
||||
For example, with `TP=2, DP=4` (8 GPUs total):
|
||||
|
||||
- Expert layers form an EP group of size 8, with experts distributed across all GPUs
|
||||
- Attention layers use TP=2 within each of the 4 DP groups
|
||||
|
||||
!!! note "Key Difference from Data Parallel Deployment"
|
||||
Without `--enable-expert-parallel`, MoE layers would use tensor parallelism (forming a TP group of size `TP × DP`), similar to dense models. With EP enabled, expert layers switch to expert parallelism, which can provide better efficiency and locality for MoE models.
|
||||
|
||||
### Example Command
|
||||
|
||||
The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on an H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
|
||||
|
||||
```bash
|
||||
# Single node EP deployment with the pplx backend:
# TP=1, DP=8, expert parallelism enabled across all 8 GPUs.
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --all2all-backend pplx
|
||||
```
|
||||
|
||||
## Multi-Node Deployment
|
||||
|
||||
For multi-node deployment, use the DeepEP communication kernel with one of two modes (see [Backend Selection Guide](#backend-selection-guide) above).
|
||||
|
||||
### Deployment Steps
|
||||
|
||||
1. **Run one command per node** - Each node requires its own launch command
|
||||
2. **Configure networking** - Ensure proper IP addresses and port configurations
|
||||
3. **Set node roles** - First node handles requests, additional nodes run in headless mode
|
||||
|
||||
### Example: 2-Node Deployment
|
||||
|
||||
The following example deploys `DeepSeek-V3-0324` across 2 nodes using `deepep_low_latency` mode:
|
||||
|
||||
```bash
|
||||
# Node 1 (Primary - handles incoming requests)
# TP=1 per rank, EP enabled, DP=16 in total with 8 local ranks on this node.
# Replace 192.168.1.100 with the actual IP of Node 1. The RPC port can be any
# port reachable by all nodes. Scaling --api-server-count out to the number of
# local ranks is recommended.
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --all2all-backend deepep_low_latency \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address 192.168.1.100 \
    --data-parallel-rpc-port 13345 \
    --api-server-count=8

# Node 2 (Secondary - headless mode, no API server)
# Same total DP size; --data-parallel-start-rank 8 is the rank offset for this
# node, and the address/RPC port point at the primary node (Node 1).
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --all2all-backend deepep_low_latency \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address 192.168.1.100 \
    --data-parallel-rpc-port 13345 \
    --headless
|
||||
```
|
||||
|
||||
### Key Configuration Notes
|
||||
|
||||
- **Headless mode**: Secondary nodes run with `--headless` flag, meaning all client requests are handled by the primary node
|
||||
- **Rank calculation**: `--data-parallel-start-rank` should equal the cumulative local DP size of previous nodes
|
||||
- **Load scaling**: Adjust `--api-server-count` on the primary node to handle higher request loads
|
||||
|
||||
### Network Configuration
|
||||
|
||||
!!! important "InfiniBand Clusters"
|
||||
On InfiniBand networked clusters, set this environment variable to prevent initialization hangs:
|
||||
```bash
|
||||
export GLOO_SOCKET_IFNAME=eth0
|
||||
```
|
||||
This ensures torch distributed group discovery uses Ethernet instead of InfiniBand for initial setup.
|
||||
|
||||
## Expert Parallel Load Balancer (EPLB)
|
||||
|
||||
While MoE models are typically trained so that each expert receives a similar number of tokens, in practice the distribution of tokens across experts can be highly skewed. vLLM provides an Expert Parallel Load Balancer (EPLB) to redistribute expert mappings across EP ranks, evening the load across experts.
|
||||
|
||||
### Configuration
|
||||
|
||||
Enable EPLB with the `--enable-eplb` flag.
|
||||
|
||||
When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.
|
||||
|
||||
### EPLB Parameters
|
||||
|
||||
Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. The available keys and their descriptions are:
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|-----------|-------------|---------|
|
||||
| `window_size`| Number of engine steps to track for rebalancing decisions | 1000 |
|
||||
| `step_interval`| Frequency of rebalancing (every N engine steps) | 3000 |
|
||||
| `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
|
||||
| `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
|
||||
| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
|
||||
| `policy` | The policy type for expert parallel load balancing | `"default"` |
|
||||
|
||||
For example:
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-30B-A3B \
|
||||
--enable-eplb \
|
||||
--eplb-config '{"window_size":1000,"step_interval":3000,"num_redundant_experts":2,"log_balancedness":true}'
|
||||
```
|
||||
|
||||
??? tip "Prefer individual arguments instead of JSON?"
|
||||
|
||||
```bash
|
||||
vllm serve Qwen/Qwen3-30B-A3B \
|
||||
--enable-eplb \
|
||||
--eplb-config.window_size 1000 \
|
||||
--eplb-config.step_interval 3000 \
|
||||
--eplb-config.num_redundant_experts 2 \
|
||||
--eplb-config.log_balancedness true
|
||||
```
|
||||
|
||||
### Expert Distribution Formula
|
||||
|
||||
- **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
|
||||
- **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts
|
||||
|
||||
### Memory Footprint Overhead
|
||||
|
||||
EPLB uses redundant experts that need to fit in GPU memory. This means that EPLB may not be a good fit for memory constrained environments or when KV cache space is at a premium.
|
||||
|
||||
This overhead equals `NUM_MOE_LAYERS * BYTES_PER_EXPERT * (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS`.
|
||||
For DeepSeekV3, this is approximately `2.4 GB` for one redundant expert per EP rank.
|
||||
|
||||
### Example Command
|
||||
|
||||
Single node deployment with EPLB enabled:
|
||||
|
||||
```bash
|
||||
# Single node with EPLB load balancing:
# TP=1, DP=8, expert parallelism with the pplx backend, EPLB enabled.
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --all2all-backend pplx \
    --enable-eplb \
    --eplb-config '{"window_size":1000,"step_interval":3000,"num_redundant_experts":2,"log_balancedness":true}'
|
||||
```
|
||||
|
||||
For multi-node deployment, add these EPLB flags to each node's command. We recommend setting `num_redundant_experts` to 32 (i.e. `--eplb-config '{"num_redundant_experts":32}'`) in large-scale use cases so the most popular experts are always available.
|
||||
|
||||
## Advanced Configuration
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
- **DeepEP kernels**: The `high_throughput` and `low_latency` kernels are optimized for disaggregated serving and may show poor performance for mixed workloads
|
||||
- **Dual Batch Overlap**: Use `--enable-dbo` to overlap all-to-all communication with compute. See [Dual Batch Overlap](../design/dbo.md) for more details.
|
||||
- **Async scheduling (experimental)**: Try `--async-scheduling` to overlap scheduling with model execution.
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
- **`non-zero status: 7 cannot register cq buf`**: When using InfiniBand/RoCE, make sure the host VM and pods report `ulimit -l` as "unlimited".
|
||||
- **`init failed for transport: IBGDA`**: The InfiniBand GDA kernel modules are missing. Run `tools/ep_kernels/configure_system_drivers.sh` on each GPU node and reboot. This also fixes the error `NVSHMEM API called before NVSHMEM initialization has completed`.
|
||||
- **NVSHMEM peer disconnect**: Usually a networking misconfiguration. If deploying via Kubernetes, verify that every pod runs with `hostNetwork: true`, `securityContext.privileged: true` to access Infiniband.
|
||||
|
||||
### Benchmarking
|
||||
|
||||
- Use simulator flags `VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random` and `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` so token routing is balanced across EP ranks.
|
||||
|
||||
- Increasing `VLLM_MOE_DP_CHUNK_SIZE` may increase throughput by increasing the maximum batch size for inter-rank token transfers. This may cause DeepEP to throw `assert self.nvshmem_qp_depth >= (num_max_dispatch_tokens_per_rank + 1) * 2`, which can be fixed by increasing environment variable `NVSHMEM_QP_DEPTH`.
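A sketch combining the simulator flags above with a serve command (the model and parallelism flags are placeholders for your own setup):

```bash
VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random \
VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --data-parallel-size 8 --enable-expert-parallel
```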
|
||||
|
||||
## Disaggregated Serving (Prefill/Decode Split)
|
||||
|
||||
For production deployments requiring strict SLA guarantees for time-to-first-token and inter-token latency, disaggregated serving allows independent scaling of prefill and decode operations.
|
||||
|
||||
### Architecture Overview
|
||||
|
||||
- **Prefill Instance**: Uses `deepep_high_throughput` backend for optimal prefill performance
|
||||
- **Decode Instance**: Uses `deepep_low_latency` backend for minimal decode latency
|
||||
- **KV Cache Transfer**: Connects instances via NIXL or other KV connectors
|
||||
|
||||
### Setup Steps
|
||||
|
||||
1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](../../tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-CUDA platforms, to install NIXL with a non-CUDA UCX build, run the [install_nixl_from_source_ubuntu.py](../../tools/install_nixl_from_source_ubuntu.py) script.
|
||||
|
||||
2. **Configure Both Instances**: Add this flag to both the prefill and decode instances: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'`. Note that you may also specify one or more NIXL backends, for example: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`
|
||||
|
||||
3. **Client Orchestration**: Use the client-side script below to coordinate prefill/decode operations. We are actively working on routing solutions.
|
||||
|
||||
### Client Orchestration Example
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
import uuid
|
||||
|
||||
try:
|
||||
# 1: Set up clients for prefill and decode instances
|
||||
openai_api_key = "EMPTY" # vLLM doesn't require a real API key
|
||||
|
||||
# Replace these IP addresses with your actual instance addresses
|
||||
prefill_client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url="http://192.168.1.100:8000/v1", # Prefill instance URL
|
||||
)
|
||||
decode_client = OpenAI(
|
||||
api_key=openai_api_key,
|
||||
base_url="http://192.168.1.101:8001/v1", # Decode instance URL
|
||||
)
|
||||
|
||||
# Get model name from prefill instance
|
||||
models = prefill_client.models.list()
|
||||
model = models.data[0].id
|
||||
print(f"Using model: {model}")
|
||||
|
||||
# 2: Prefill Phase
|
||||
# Generate unique request ID to link prefill and decode operations
|
||||
request_id = str(uuid.uuid4())
|
||||
print(f"Request ID: {request_id}")
|
||||
|
||||
prefill_response = prefill_client.completions.create(
|
||||
model=model,
|
||||
# Prompt must exceed vLLM's block size (16 tokens) for PD to work
|
||||
prompt="Write a detailed explanation of Paged Attention for Transformers works including the management of KV cache for multi-turn conversations",
|
||||
max_tokens=1, # Force prefill-only operation
|
||||
extra_body={
|
||||
"kv_transfer_params": {
|
||||
"do_remote_decode": True, # Enable remote decode
|
||||
"do_remote_prefill": False, # This is the prefill instance
|
||||
"remote_engine_id": None, # Will be populated by vLLM
|
||||
"remote_block_ids": None, # Will be populated by vLLM
|
||||
"remote_host": None, # Will be populated by vLLM
|
||||
"remote_port": None, # Will be populated by vLLM
|
||||
}
|
||||
},
|
||||
extra_headers={"X-Request-Id": request_id},
|
||||
)
|
||||
|
||||
print("-" * 50)
|
||||
print("✓ Prefill completed successfully")
|
||||
print(f"Prefill response: {prefill_response.choices[0].text}")
|
||||
|
||||
# 3: Decode Phase
|
||||
# Transfer KV cache parameters from prefill to decode instance
|
||||
decode_response = decode_client.completions.create(
|
||||
model=model,
|
||||
prompt="This prompt is ignored during decode", # Original prompt not needed
|
||||
max_tokens=150, # Generate up to 150 tokens
|
||||
extra_body={
|
||||
"kv_transfer_params": prefill_response.kv_transfer_params # Pass KV cache info
|
||||
},
|
||||
extra_headers={"X-Request-Id": request_id}, # Same request ID
|
||||
)
|
||||
|
||||
print("-" * 50)
|
||||
print("✓ Decode completed successfully")
|
||||
print(f"Final response: {decode_response.choices[0].text}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error during disaggregated serving: {e}")
|
||||
print("Check that both prefill and decode instances are running and accessible")
|
||||
```
|
||||
|
||||
### Benchmarking
|
||||
|
||||
- To simulate the decode deployment of disaggregated serving, pass `--kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'` to the `vllm serve` invocation. The connector populates KV cache with random values so decode can be profiled in isolation.
|
||||
|
||||
- **CUDAGraph capture**: Use `--compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'` to enable CUDA graph capture for decode only and save KV cache.
|
||||
32
docs/serving/integrations/langchain.md
Normal file
@@ -0,0 +1,32 @@
|
||||
# LangChain
|
||||
|
||||
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain).
|
||||
|
||||
To install LangChain, run
|
||||
|
||||
```bash
|
||||
pip install langchain langchain_community -q
|
||||
```
|
||||
|
||||
To run inference on a single GPU or multiple GPUs, use the `VLLM` class from `langchain_community`.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from langchain_community.llms import VLLM
|
||||
|
||||
llm = VLLM(
|
||||
model="mosaicml/mpt-7b",
|
||||
trust_remote_code=True, # mandatory for hf models
|
||||
max_new_tokens=128,
|
||||
top_k=10,
|
||||
top_p=0.95,
|
||||
temperature=0.8,
|
||||
# for distributed inference
|
||||
# tensor_parallel_size=...,
|
||||
)
|
||||
|
||||
print(llm("What is the capital of France ?"))
|
||||
```
|
||||
|
||||
Please refer to this [Tutorial](https://python.langchain.com/docs/integrations/llms/vllm) for more details.
|
||||
24
docs/serving/integrations/llamaindex.md
Normal file
@@ -0,0 +1,24 @@
|
||||
# LlamaIndex
|
||||
|
||||
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).
|
||||
|
||||
To install LlamaIndex, run
|
||||
|
||||
```bash
|
||||
pip install llama-index-llms-vllm -q
|
||||
```
|
||||
|
||||
To run inference on a single GPU or multiple GPUs, use the `Vllm` class from LlamaIndex.
|
||||
|
||||
```python
|
||||
from llama_index.llms.vllm import Vllm
|
||||
|
||||
llm = Vllm(
|
||||
model="microsoft/Orca-2-7b",
|
||||
tensor_parallel_size=4,
|
||||
max_new_tokens=100,
|
||||
vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
|
||||
)
|
||||
```
|
||||
|
||||
Please refer to this [Tutorial](https://docs.llamaindex.ai/en/latest/examples/llm/vllm/) for more details.
|
||||
60
docs/serving/offline_inference.md
Normal file
@@ -0,0 +1,60 @@
|
||||
# Offline Inference
|
||||
|
||||
Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
|
||||
|
||||
For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
|
||||
and runs it in vLLM using the default configuration.
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
# Initialize the vLLM engine.
|
||||
llm = LLM(model="facebook/opt-125m")
|
||||
```
|
||||
|
||||
After initializing the `LLM` instance, use the available APIs to perform model inference.
|
||||
The available APIs depend on the model type:
|
||||
|
||||
- [Generative models](../models/generative_models.md) output logprobs, which are sampled from to obtain the final output text (see the example below).
|
||||
- [Pooling models](../models/pooling_models.md) output their hidden states directly.
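For example, a generative model can be prompted with `LLM.generate`; a minimal sketch using the same `facebook/opt-125m` model as above:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Generate a completion for a single prompt and print the sampled text.
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```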
|
||||
|
||||
!!! info
|
||||
[API Reference](../api/README.md#offline-inference)
|
||||
|
||||
## Ray Data LLM API
|
||||
|
||||
Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
|
||||
This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
|
||||
|
||||
- Streaming execution processes datasets that exceed aggregate cluster memory.
|
||||
- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
|
||||
- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
|
||||
- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.
|
||||
- Reading and writing to most popular file formats and cloud object storage.
|
||||
- Scaling up the workload without code changes.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
import ray # Requires ray>=2.44.1
|
||||
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor
|
||||
|
||||
config = vLLMEngineProcessorConfig(model_source="unsloth/Llama-3.2-1B-Instruct")
|
||||
processor = build_llm_processor(
|
||||
config,
|
||||
preprocess=lambda row: {
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a bot that completes unfinished haikus."},
|
||||
{"role": "user", "content": row["item"]},
|
||||
],
|
||||
"sampling_params": {"temperature": 0.3, "max_tokens": 250},
|
||||
},
|
||||
postprocess=lambda row: {"answer": row["generated_text"]},
|
||||
)
|
||||
|
||||
ds = ray.data.from_items(["An old silent pond..."])
|
||||
ds = processor(ds)
|
||||
ds.write_parquet("local:///tmp/data/")
|
||||
```
|
||||
|
||||
For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).
|
||||
934
docs/serving/openai_compatible_server.md
Normal file
@@ -0,0 +1,934 @@
|
||||
# OpenAI-Compatible Server
|
||||
|
||||
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
|
||||
|
||||
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`](../configuration/serve_args.md) command. (You can also use our [Docker](../deployment/docker.md) image.)
|
||||
|
||||
```bash
|
||||
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
|
||||
--dtype auto \
|
||||
--api-key token-abc123
|
||||
```
|
||||
|
||||
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
client = OpenAI(
|
||||
base_url="http://localhost:8000/v1",
|
||||
api_key="token-abc123",
|
||||
)
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="NousResearch/Meta-Llama-3-8B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "Hello!"},
|
||||
],
|
||||
)
|
||||
|
||||
print(completion.choices[0].message)
|
||||
```
|
||||
|
||||
!!! tip
|
||||
vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
|
||||
You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
|
||||
|
||||
!!! important
|
||||
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
|
||||
|
||||
To disable this behavior, please pass `--generation-config vllm` when launching the server.
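For example, to launch with vLLM's own defaults for sampling parameters:

```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --generation-config vllm
```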
|
||||
|
||||
## Supported APIs
|
||||
|
||||
We currently support the following OpenAI APIs:
|
||||
|
||||
- [Completions API](#completions-api) (`/v1/completions`)
|
||||
- Only applicable to [text generation models](../models/generative_models.md).
|
||||
- *Note: `suffix` parameter is not supported.*
|
||||
- [Chat Completions API](#chat-api) (`/v1/chat/completions`)
|
||||
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template](../serving/openai_compatible_server.md#chat-template).
|
||||
- *Note: `user` parameter is ignored.*
|
||||
- *Note:* Setting the `parallel_tool_calls` parameter to `false` ensures vLLM only returns zero or one tool call per request. Setting it to `true` (the default) allows returning more than one tool call per request. There is no guarantee more than one tool call will be returned if this is set to `true`, as that behavior is model dependent and not all models are designed to support parallel tool calls.
|
||||
- [Embeddings API](#embeddings-api) (`/v1/embeddings`)
|
||||
- Only applicable to [embedding models](../models/pooling_models.md).
|
||||
- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
|
||||
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
|
||||
- [Translation API](#translations-api) (`/v1/audio/translations`)
|
||||
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
|
||||
|
||||
In addition, we have the following custom APIs:
|
||||
|
||||
- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
|
||||
- Applicable to any model with a tokenizer.
|
||||
- [Pooling API](#pooling-api) (`/pooling`)
|
||||
- Applicable to all [pooling models](../models/pooling_models.md).
|
||||
- [Classification API](#classification-api) (`/classify`)
|
||||
- Only applicable to [classification models](../models/pooling_models.md).
|
||||
- [Score API](#score-api) (`/score`)
|
||||
- Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
|
||||
- [Re-rank API](#re-rank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
|
||||
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
|
||||
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
|
||||
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
|
||||
- Only applicable to [cross-encoder models](../models/pooling_models.md).
|
||||
|
||||
## Chat Template
|
||||
|
||||
In order for the language model to support chat protocol, vLLM requires the model to include
|
||||
a chat template in its tokenizer configuration. The chat template is a Jinja2 template that
|
||||
specifies how roles, messages, and other chat-specific tokens are encoded in the input.
|
||||
|
||||
An example chat template for `NousResearch/Meta-Llama-3-8B-Instruct` can be found [here](https://github.com/meta-llama/llama3?tab=readme-ov-file#instruction-tuned-models)
|
||||
|
||||
Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models,
|
||||
you can manually specify their chat template in the `--chat-template` parameter with the file path to the chat
|
||||
template, or the template in string form. Without a chat template, the server will not be able to process chat
|
||||
and all chat requests will error.
|
||||
|
||||
```bash
|
||||
vllm serve <model> --chat-template ./path-to-chat-template.jinja
|
||||
```
|
||||
|
||||
The vLLM community provides a set of chat templates for popular models. You can find them under the [examples](../../examples) directory.
|
||||
|
||||
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
|
||||
both a `type` and a `text` field. An example is provided below:
|
||||
|
||||
```python
|
||||
completion = client.chat.completions.create(
|
||||
model="NousResearch/Meta-Llama-3-8B-Instruct",
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"},
|
||||
],
|
||||
},
|
||||
],
|
||||
)
|
||||
```
|
||||
|
||||
Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
|
||||
`meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
|
||||
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
|
||||
*"Detected the chat template content format to be..."*, and internally converts incoming requests to match
|
||||
the detected format, which can be one of:
|
||||
|
||||
- `"string"`: A string.
|
||||
- Example: `"Hello world"`
|
||||
- `"openai"`: A list of dictionaries, similar to OpenAI schema.
|
||||
- Example: `[{"type": "text", "text": "Hello world!"}]`
|
||||
|
||||
If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
|
||||
to override which format to use.
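For example, to force the OpenAI-style list format for a model such as `meta-llama/Llama-Guard-3-1B` (an illustrative invocation):

```bash
vllm serve meta-llama/Llama-Guard-3-1B --chat-template-content-format openai
```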
|
||||
|
||||
## Extra Parameters
|
||||
|
||||
vLLM supports a set of parameters that are not part of the OpenAI API.
|
||||
In order to use them, you can pass them as extra parameters in the OpenAI client.
|
||||
Alternatively, you can merge them directly into the JSON payload if you are calling the HTTP API yourself.
|
||||
|
||||
```python
|
||||
completion = client.chat.completions.create(
|
||||
model="NousResearch/Meta-Llama-3-8B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
|
||||
],
|
||||
extra_body={
|
||||
"structured_outputs": {"choice": ["positive", "negative"]},
|
||||
},
|
||||
)
|
||||
```
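The same request expressed as a raw HTTP call, with the extra parameter merged directly into the JSON payload (this assumes the server was launched with `--api-key token-abc123` as above):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer token-abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    "structured_outputs": {"choice": ["positive", "negative"]}
  }'
```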
|
||||
|
||||
## Extra HTTP Headers
|
||||
|
||||
Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
|
||||
with `--enable-request-id-headers`.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
completion = client.chat.completions.create(
|
||||
model="NousResearch/Meta-Llama-3-8B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
|
||||
],
|
||||
extra_headers={
|
||||
"x-request-id": "sentiment-classification-00001",
|
||||
},
|
||||
)
|
||||
print(completion._request_id)
|
||||
|
||||
completion = client.completions.create(
|
||||
model="NousResearch/Meta-Llama-3-8B-Instruct",
|
||||
prompt="A robot may not injure a human being",
|
||||
extra_headers={
|
||||
"x-request-id": "completion-test",
|
||||
},
|
||||
)
|
||||
print(completion._request_id)
|
||||
```
|
||||
|
||||
## API Reference
|
||||
|
||||
### Completions API
|
||||
|
||||
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
|
||||
Code example: [examples/online_serving/openai_completion_client.py](../../examples/online_serving/openai_completion_client.py)
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
|
||||
```
|
||||
|
||||
### Chat API
|
||||
|
||||
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
|
||||
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
|
||||
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
|
||||
see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information.
|
||||
|
||||
- *Note: `image_url.detail` parameter is not supported.*
|
||||
|
||||
Code example: [examples/online_serving/openai_chat_completion_client.py](../../examples/online_serving/openai_chat_completion_client.py)
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
|
||||
```
|
||||
|
||||
### Embeddings API
|
||||
|
||||
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
|
||||
Code example: [examples/pooling/embed/openai_embedding_client.py](../../examples/pooling/embed/openai_embedding_client.py)
|
||||
|
||||
If the model has a [chat template](../serving/openai_compatible_server.md#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
|
||||
which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
from openai._types import NOT_GIVEN, NotGiven
|
||||
from openai.types.chat import ChatCompletionMessageParam
|
||||
from openai.types.create_embedding_response import CreateEmbeddingResponse
|
||||
|
||||
def create_chat_embeddings(
|
||||
client: OpenAI,
|
||||
*,
|
||||
messages: list[ChatCompletionMessageParam],
|
||||
model: str,
|
||||
encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
|
||||
) -> CreateEmbeddingResponse:
|
||||
return client.post(
|
||||
"/embeddings",
|
||||
cast_to=CreateEmbeddingResponse,
|
||||
body={"messages": messages, "model": model, "encoding_format": encoding_format},
|
||||
)
|
||||
```
|
||||
|
||||
#### Multi-modal inputs
|
||||
|
||||
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
|
||||
and passing a list of `messages` in the request. Refer to the examples below for illustration.
|
||||
|
||||
=== "VLM2Vec"
|
||||
|
||||
To serve the model:
|
||||
|
||||
```bash
|
||||
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
|
||||
--trust-remote-code \
|
||||
--max-model-len 4096 \
|
||||
--chat-template examples/template_vlm2vec_phi3v.jinja
|
||||
```
|
||||
|
||||
!!! important
|
||||
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
|
||||
to run this model in embedding mode instead of text generation mode.
|
||||
|
||||
The custom chat template is completely different from the original one for this model,
|
||||
and can be found here: [examples/template_vlm2vec_phi3v.jinja](../../examples/template_vlm2vec_phi3v.jinja)
|
||||
|
||||
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `client.post` interface via the `create_chat_embeddings` helper defined above:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
client = OpenAI(
|
||||
base_url="http://localhost:8000/v1",
|
||||
api_key="EMPTY",
|
||||
)
|
||||
image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
|
||||
|
||||
response = create_chat_embeddings(
|
||||
client,
|
||||
model="TIGER-Lab/VLM2Vec-Full",
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image_url", "image_url": {"url": image_url}},
|
||||
{"type": "text", "text": "Represent the given image."},
|
||||
],
|
||||
}
|
||||
],
|
||||
encoding_format="float",
|
||||
)
|
||||
|
||||
print("Image embedding output:", response.data[0].embedding)
|
||||
```
|
||||
|
||||
=== "DSE-Qwen2-MRL"
|
||||
|
||||
To serve the model:
|
||||
|
||||
```bash
|
||||
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
|
||||
--trust-remote-code \
|
||||
--max-model-len 8192 \
|
||||
--chat-template examples/template_dse_qwen2_vl.jinja
|
||||
```
|
||||
|
||||
!!! important
|
||||
Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
|
||||
|
||||
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
|
||||
by a custom chat template: [examples/template_dse_qwen2_vl.jinja](../../examples/template_dse_qwen2_vl.jinja)
|
||||
|
||||
!!! important
|
||||
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
|
||||
example below for details.
|
||||
|
||||
Full example: [examples/pooling/embed/openai_chat_embedding_client_for_multimodal.py](../../examples/pooling/embed/openai_chat_embedding_client_for_multimodal.py)
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [pooling parameters][vllm.PoolingParams] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/pooling_params.py:common-pooling-params"
|
||||
--8<-- "vllm/pooling_params.py:embedding-pooling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported by default:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/pooling/embed/protocol.py:embedding-extra-params"
|
||||
```
|
||||
|
||||
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/pooling/embed/protocol.py:chat-embedding-extra-params"
|
||||
```
|
||||
|
||||
### Transcriptions API
|
||||
|
||||
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
|
||||
!!! note
|
||||
To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
|
||||
|
||||
Code example: [examples/online_serving/openai_transcription_client.py](../../examples/online_serving/openai_transcription_client.py)
|
||||
|
||||
#### API Enforced Limits
|
||||
|
||||
Set the maximum audio file size (in MB) that vLLM will accept via the
`VLLM_MAX_AUDIO_CLIP_FILESIZE_MB` environment variable. The default is 25 MB.
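
For example, to raise the limit when starting the server (the 50 MB value below is illustrative; pick one that matches your workload):

```bash
# Allow audio uploads of up to 50 MB (the default is 25 MB)
VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=50 vllm serve openai/whisper-large-v3-turbo
```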
|
||||
|
||||
#### Uploading Audio Files
|
||||
|
||||
The Transcriptions API supports uploading audio files in various formats including FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WEBM.
|
||||
|
||||
**Using OpenAI Python Client:**
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(
|
||||
base_url="http://localhost:8000/v1",
|
||||
api_key="token-abc123",
|
||||
)
|
||||
|
||||
# Upload audio file from disk
|
||||
with open("audio.mp3", "rb") as audio_file:
|
||||
transcription = client.audio.transcriptions.create(
|
||||
model="openai/whisper-large-v3-turbo",
|
||||
file=audio_file,
|
||||
language="en",
|
||||
response_format="verbose_json",
|
||||
)
|
||||
|
||||
print(transcription.text)
|
||||
```
|
||||
|
||||
**Using curl with multipart/form-data:**
|
||||
|
||||
??? code
|
||||
|
||||
```bash
|
||||
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
|
||||
-H "Authorization: Bearer token-abc123" \
|
||||
-F "file=@audio.mp3" \
|
||||
-F "model=openai/whisper-large-v3-turbo" \
|
||||
-F "language=en" \
|
||||
-F "response_format=verbose_json"
|
||||
```
|
||||
|
||||
**Supported Parameters:**
|
||||
|
||||
- `file`: The audio file to transcribe (required)
|
||||
- `model`: The model to use for transcription (required)
|
||||
- `language`: The language code (e.g., "en", "zh") (optional)
|
||||
- `prompt`: Optional text to guide the transcription style (optional)
|
||||
- `response_format`: Format of the response ("json", "text", "verbose_json") (optional)
|
||||
- `temperature`: Sampling temperature between 0 and 1 (optional)
|
||||
|
||||
For the complete list of supported parameters including sampling parameters and vLLM extensions, see the [protocol definitions](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/protocol.py#L2182).
|
||||
|
||||
**Response Format:**
|
||||
|
||||
For `verbose_json` response format:
|
||||
|
||||
??? code
|
||||
|
||||
```json
|
||||
{
|
||||
"text": "Hello, this is a transcription of the audio file.",
|
||||
"language": "en",
|
||||
"duration": 5.42,
|
||||
"segments": [
|
||||
{
|
||||
"id": 0,
|
||||
"seek": 0,
|
||||
"start": 0.0,
|
||||
"end": 2.5,
|
||||
"text": "Hello, this is a transcription",
|
||||
"tokens": [50364, 938, 428, 307, 275, 28347],
|
||||
"temperature": 0.0,
|
||||
"avg_logprob": -0.245,
|
||||
"compression_ratio": 1.235,
|
||||
"no_speech_prob": 0.012
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
Currently, the `verbose_json` response format does not populate `avg_logprob`, `compression_ratio`, or `no_speech_prob`; treat the values shown above as placeholders.
|
||||
|
||||
#### Extra Parameters
|
||||
|
||||
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
|
||||
```
|
||||
|
||||
### Translations API
|
||||
|
||||
Our Translation API is compatible with [OpenAI's Translations API](https://platform.openai.com/docs/api-reference/audio/createTranslation);
|
||||
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
|
||||
Whisper models can translate audio from any of the 55 supported non-English languages into English.
Note that the popular `openai/whisper-large-v3-turbo` model does not support translation.
|
||||
|
||||
!!! note
|
||||
To use the Translation API, please install with extra audio dependencies using `pip install vllm[audio]`.
|
||||
|
||||
Code example: [examples/online_serving/openai_translation_client.py](../../examples/online_serving/openai_translation_client.py)
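
As a minimal sketch, you can also call the endpoint directly with `curl`, mirroring the Transcriptions example above (the model name is illustrative; use a Whisper variant that supports translation):

```bash
curl -X POST "http://localhost:8000/v1/audio/translations" \
    -H "Authorization: Bearer token-abc123" \
    -F "file=@audio.mp3" \
    -F "model=openai/whisper-large-v3"
```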
|
||||
|
||||
#### Extra Parameters
|
||||
|
||||
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:translation-sampling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/openai/protocol.py:translation-extra-params"
|
||||
```
|
||||
|
||||
### Tokenizer API
|
||||
|
||||
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
|
||||
It consists of two endpoints:
|
||||
|
||||
- `/tokenize` corresponds to calling `tokenizer.encode()`.
|
||||
- `/detokenize` corresponds to calling `tokenizer.decode()`.
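
For example, assuming a server launched with a hypothetical model, both endpoints can be exercised directly with `curl` (the `prompt` and `tokens` field names are assumptions based on the HF-style tokenizer wrapper; check the protocol definitions if your version differs):

```bash
# Tokenize a prompt into token IDs
curl -X POST "http://localhost:8000/tokenize" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello, world!"}'

# Detokenize a list of token IDs back into text (IDs shown are illustrative)
curl -X POST "http://localhost:8000/detokenize" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "tokens": [9906, 11, 1917, 0]}'
```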
|
||||
|
||||
### Pooling API
|
||||
|
||||
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
|
||||
|
||||
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrarily nested list, not just a 1-D list of floats.
|
||||
|
||||
Code example: [examples/pooling/pooling/openai_pooling_client.py](../../examples/pooling/pooling/openai_pooling_client.py)
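
For a quick check without the example script, you can post a prompt directly (a sketch assuming the `/pooling` route; the model name is illustrative):

```bash
curl -X POST "http://localhost:8000/pooling" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "intfloat/e5-mistral-7b-instruct",
        "input": "vLLM is great!"
    }'
```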
|
||||
|
||||
### Classification API
|
||||
|
||||
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
|
||||
|
||||
We automatically wrap any other transformer via `as_seq_cls_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
|
||||
|
||||
Code example: [examples/pooling/classify/openai_classification_client.py](../../examples/pooling/classify/openai_classification_client.py)
|
||||
|
||||
#### Example Requests
|
||||
|
||||
You can classify multiple texts by passing an array of strings:
|
||||
|
||||
```bash
|
||||
curl -v "http://127.0.0.1:8000/classify" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "jason9693/Qwen2.5-1.5B-apeach",
|
||||
"input": [
|
||||
"Loved the new café—coffee was great.",
|
||||
"This update broke everything. Frustrating."
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
??? console "Response"
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
|
||||
"object": "list",
|
||||
"created": 1745383065,
|
||||
"model": "jason9693/Qwen2.5-1.5B-apeach",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"label": "Default",
|
||||
"probs": [
|
||||
0.565970778465271,
|
||||
0.4340292513370514
|
||||
],
|
||||
"num_classes": 2
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"label": "Spoiled",
|
||||
"probs": [
|
||||
0.26448777318000793,
|
||||
0.7355121970176697
|
||||
],
|
||||
"num_classes": 2
|
||||
}
|
||||
],
|
||||
"usage": {
|
||||
"prompt_tokens": 20,
|
||||
"total_tokens": 20,
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens_details": null
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
You can also pass a string directly to the `input` field:
|
||||
|
||||
```bash
|
||||
curl -v "http://127.0.0.1:8000/classify" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "jason9693/Qwen2.5-1.5B-apeach",
|
||||
"input": "Loved the new café—coffee was great."
|
||||
}'
|
||||
```
|
||||
|
||||
??? console "Response"
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
|
||||
"object": "list",
|
||||
"created": 1745383213,
|
||||
"model": "jason9693/Qwen2.5-1.5B-apeach",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"label": "Default",
|
||||
"probs": [
|
||||
0.565970778465271,
|
||||
0.4340292513370514
|
||||
],
|
||||
"num_classes": 2
|
||||
}
|
||||
],
|
||||
"usage": {
|
||||
"prompt_tokens": 10,
|
||||
"total_tokens": 10,
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens_details": null
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [pooling parameters][vllm.PoolingParams] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/pooling_params.py:common-pooling-params"
|
||||
--8<-- "vllm/pooling_params.py:classification-pooling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/pooling/classify/protocol.py:classification-extra-params"
|
||||
```
|
||||
|
||||
### Score API
|
||||
|
||||
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair.
|
||||
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
|
||||
|
||||
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
|
||||
|
||||
Code example: [examples/pooling/score/openai_cross_encoder_score.py](../../examples/pooling/score/openai_cross_encoder_score.py)
|
||||
|
||||
#### Single inference
|
||||
|
||||
You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.
|
||||
|
||||
```bash
|
||||
curl -X 'POST' \
|
||||
'http://127.0.0.1:8000/score' \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"encoding_format": "float",
|
||||
"text_1": "What is the capital of France?",
|
||||
"text_2": "The capital of France is Paris."
|
||||
}'
|
||||
```
|
||||
|
||||
??? console "Response"
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "score-request-id",
|
||||
"object": "list",
|
||||
"created": 693447,
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"object": "score",
|
||||
"score": 1
|
||||
}
|
||||
],
|
||||
"usage": {}
|
||||
}
|
||||
```
|
||||
|
||||
#### Batch inference
|
||||
|
||||
You can pass a string to `text_1` and a list to `text_2`, forming multiple sentence pairs
|
||||
where each pair is built from `text_1` and a string in `text_2`.
|
||||
The total number of pairs is `len(text_2)`.
|
||||
|
||||
??? console "Request"
|
||||
|
||||
```bash
|
||||
curl -X 'POST' \
|
||||
'http://127.0.0.1:8000/score' \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"text_1": "What is the capital of France?",
|
||||
"text_2": [
|
||||
"The capital of Brazil is Brasilia.",
|
||||
"The capital of France is Paris."
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
??? console "Response"
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "score-request-id",
|
||||
"object": "list",
|
||||
"created": 693570,
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"object": "score",
|
||||
"score": 0.001094818115234375
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"object": "score",
|
||||
"score": 1
|
||||
}
|
||||
],
|
||||
"usage": {}
|
||||
}
|
||||
```
|
||||
|
||||
You can pass a list to both `text_1` and `text_2`, forming multiple sentence pairs
|
||||
where each pair is built from a string in `text_1` and the corresponding string in `text_2` (similar to `zip()`).
|
||||
The total number of pairs is `len(text_2)`.
|
||||
|
||||
??? console "Request"
|
||||
|
||||
```bash
|
||||
curl -X 'POST' \
|
||||
'http://127.0.0.1:8000/score' \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"encoding_format": "float",
|
||||
"text_1": [
|
||||
"What is the capital of Brazil?",
|
||||
"What is the capital of France?"
|
||||
],
|
||||
"text_2": [
|
||||
"The capital of Brazil is Brasilia.",
|
||||
"The capital of France is Paris."
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
??? console "Response"
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "score-request-id",
|
||||
"object": "list",
|
||||
"created": 693447,
|
||||
"model": "BAAI/bge-reranker-v2-m3",
|
||||
"data": [
|
||||
{
|
||||
"index": 0,
|
||||
"object": "score",
|
||||
"score": 1
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"object": "score",
|
||||
"score": 1
|
||||
}
|
||||
],
|
||||
"usage": {}
|
||||
}
|
||||
```
|
||||
|
||||
#### Multi-modal inputs
|
||||
|
||||
You can pass multi-modal inputs to scoring models by including a `content` list of multi-modal items (images, etc.) in the request. Refer to the examples below for illustration.
|
||||
|
||||
=== "JinaVL-Reranker"
|
||||
|
||||
To serve the model:
|
||||
|
||||
```bash
|
||||
vllm serve jinaai/jina-reranker-m0
|
||||
```
|
||||
|
||||
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
|
||||
|
||||
    ??? code
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
response = requests.post(
|
||||
"http://localhost:8000/v1/score",
|
||||
json={
|
||||
"model": "jinaai/jina-reranker-m0",
|
||||
"text_1": "slm markdown",
|
||||
"text_2": {
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
|
||||
},
|
||||
},
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {
|
||||
"url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
|
||||
},
|
||||
},
|
||||
],
|
||||
},
|
||||
},
|
||||
)
|
||||
response.raise_for_status()
|
||||
response_json = response.json()
|
||||
print("Scoring output:", response_json["data"][0]["score"])
|
||||
print("Scoring output:", response_json["data"][1]["score"])
|
||||
```
|
||||
Full example: [examples/pooling/score/openai_cross_encoder_score_for_multimodal.py](../../examples/pooling/score/openai_cross_encoder_score_for_multimodal.py)
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [pooling parameters][vllm.PoolingParams] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/pooling_params.py:common-pooling-params"
|
||||
--8<-- "vllm/pooling_params.py:classification-pooling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/pooling/score/protocol.py:score-extra-params"
|
||||
```
|
||||
|
||||
### Re-rank API
|
||||
|
||||
Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and
each document in a list. Usually, the score for a query-document pair refers to the similarity between the two texts or multi-modal inputs (images, etc.), on a scale of 0 to 1.
|
||||
|
||||
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
|
||||
|
||||
The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
|
||||
`score` task. Additionally, `/rerank`, `/v1/rerank`, and `/v2/rerank`
|
||||
endpoints are compatible with both [Jina AI's re-rank API interface](https://jina.ai/reranker/) and
|
||||
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
|
||||
popular open-source tools.
|
||||
|
||||
Code example: [examples/pooling/score/openai_reranker.py](../../examples/pooling/score/openai_reranker.py)
|
||||
|
||||
#### Example Request
|
||||
|
||||
Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
|
||||
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.
|
||||
|
||||
??? console "Request"
|
||||
|
||||
```bash
|
||||
curl -X 'POST' \
|
||||
'http://127.0.0.1:8000/v1/rerank' \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"model": "BAAI/bge-reranker-base",
|
||||
"query": "What is the capital of France?",
|
||||
"documents": [
|
||||
"The capital of Brazil is Brasilia.",
|
||||
"The capital of France is Paris.",
|
||||
"Horses and cows are both animals"
|
||||
]
|
||||
}'
|
||||
```
|
||||
|
||||
??? console "Response"
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
|
||||
"model": "BAAI/bge-reranker-base",
|
||||
"usage": {
|
||||
"total_tokens": 56
|
||||
},
|
||||
"results": [
|
||||
{
|
||||
"index": 1,
|
||||
"document": {
|
||||
"text": "The capital of France is Paris."
|
||||
},
|
||||
"relevance_score": 0.99853515625
|
||||
},
|
||||
{
|
||||
"index": 0,
|
||||
"document": {
|
||||
"text": "The capital of Brazil is Brasilia."
|
||||
},
|
||||
"relevance_score": 0.0005860328674316406
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
#### Extra parameters
|
||||
|
||||
The following [pooling parameters][vllm.PoolingParams] are supported.
|
||||
|
||||
```python
|
||||
--8<-- "vllm/pooling_params.py:common-pooling-params"
|
||||
--8<-- "vllm/pooling_params.py:classification-pooling-params"
|
||||
```
|
||||
|
||||
The following extra parameters are supported:
|
||||
|
||||
```python
|
||||
--8<-- "vllm/entrypoints/pooling/score/protocol.py:rerank-extra-params"
|
||||
```
|
||||
|
||||
## Ray Serve LLM
|
||||
|
||||
Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.
|
||||
|
||||
Key capabilities:
|
||||
|
||||
- Exposes an OpenAI-compatible HTTP API as well as a Pythonic API.
|
||||
- Scales from a single GPU to a multi-node cluster without code changes.
|
||||
- Provides observability and autoscaling policies through Ray dashboards and metrics.
|
||||
|
||||
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: [examples/online_serving/ray_serve_deepseek.py](../../examples/online_serving/ray_serve_deepseek.py).
|
||||
|
||||
Learn more about Ray Serve LLM with the official [Ray Serve LLM documentation](https://docs.ray.io/en/latest/serve/llm/serving-llms.html).
|
||||
220
docs/serving/parallelism_scaling.md
Normal file
@@ -0,0 +1,220 @@
|
||||
# Parallelism and Scaling
|
||||
|
||||
## Distributed inference strategies for a single-model replica
|
||||
|
||||
To choose a distributed inference strategy for a single-model replica, use the following guidelines:
|
||||
|
||||
- **Single GPU (no distributed inference):** if the model fits on a single GPU, distributed inference is probably unnecessary. Run inference on that GPU.
|
||||
- **Single-node multi-GPU using tensor parallel inference:** if the model is too large for a single GPU but fits on a single node with multiple GPUs, use *tensor parallelism*. For example, set `tensor_parallel_size=4` when using a node with 4 GPUs.
|
||||
- **Multi-node multi-GPU using tensor parallel and pipeline parallel inference:** if the model is too large for a single node, combine *tensor parallelism* with *pipeline parallelism*. Set `tensor_parallel_size` to the number of GPUs per node and `pipeline_parallel_size` to the number of nodes. For example, set `tensor_parallel_size=8` and `pipeline_parallel_size=2` when using 2 nodes with 8 GPUs per node.
|
||||
|
||||
Increase the number of GPUs and nodes until there is enough GPU memory for the model. Set `tensor_parallel_size` to the number of GPUs per node and `pipeline_parallel_size` to the number of nodes.
|
||||
|
||||
After you provision sufficient resources to fit the model, run `vllm`. Look for log messages like:
|
||||
|
||||
```text
|
||||
INFO 07-23 13:56:04 [kv_cache_utils.py:775] GPU KV cache size: 643,232 tokens
|
||||
INFO 07-23 13:56:04 [kv_cache_utils.py:779] Maximum concurrency for 40,960 tokens per request: 15.70x
|
||||
```
|
||||
|
||||
The `GPU KV cache size` line reports the total number of tokens that can be stored in the GPU KV cache at once. The `Maximum concurrency` line provides an estimate of how many requests can be served concurrently if each request requires the specified number of tokens (40,960 in the example above); here, 643,232 / 40,960 ≈ 15.7. The tokens-per-request number is taken from the model configuration's maximum sequence length, `ModelConfig.max_model_len`. If these numbers are lower than your throughput requirements, add more GPUs or nodes to your cluster.
|
||||
|
||||
!!! note "Edge case: uneven GPU splits"
|
||||
    If the model fits within a single node but the GPU count doesn't evenly divide the model size, enable pipeline parallelism, which splits the model along layers and supports uneven splits. In this scenario, set `tensor_parallel_size=1` and `pipeline_parallel_size` to the number of GPUs. Furthermore, if the GPUs on the node do not have NVLink interconnect (e.g. L40S), leverage pipeline parallelism instead of tensor parallelism for higher throughput and lower communication overhead; see the example below.
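
For example, on a node with 6 GPUs and no NVLink (the GPU count and model are illustrative), a layer-wise split looks like this:

```bash
# Split the model by layers across 6 GPUs instead of sharding attention heads
vllm serve facebook/opt-13b \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 6
```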
|
||||
|
||||
### Distributed serving of *Mixture of Experts* (*MoE*) models
|
||||
|
||||
It's often advantageous to exploit the inherent parallelism of experts by using a separate parallelism strategy for the expert layers. vLLM supports large-scale deployment combining Data Parallel attention with Expert or Tensor Parallel MoE layers. For more information, see [Data Parallel Deployment](data_parallel_deployment.md).
|
||||
|
||||
## Single-node deployment
|
||||
|
||||
vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. The implementation includes [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf).
|
||||
|
||||
The default distributed runtimes are [Ray](https://github.com/ray-project/ray) for multi-node inference and native Python `multiprocessing` for single-node inference. You can override the defaults by setting `distributed_executor_backend` in the `LLM` class or `--distributed-executor-backend` in the API server. Use `mp` for `multiprocessing` or `ray` for Ray.
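
For example, to pin the multiprocessing backend explicitly on a single node (a sketch; `facebook/opt-13b` stands in for your model):

```bash
vllm serve facebook/opt-13b \
    --tensor-parallel-size 4 \
    --distributed-executor-backend mp
```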
|
||||
|
||||
For multi-GPU inference, set `tensor_parallel_size` in the `LLM` class to the desired GPU count. For example, to run inference on 4 GPUs:
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
|
||||
output = llm.generate("San Francisco is a")
|
||||
```
|
||||
|
||||
For multi-GPU serving, include `--tensor-parallel-size` when starting the server. For example, to run the API server on 4 GPUs:
|
||||
|
||||
```bash
|
||||
vllm serve facebook/opt-13b \
|
||||
--tensor-parallel-size 4
|
||||
```
|
||||
|
||||
To enable pipeline parallelism, add `--pipeline-parallel-size`. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
|
||||
|
||||
```bash
|
||||
# Eight GPUs total
|
||||
vllm serve gpt2 \
|
||||
--tensor-parallel-size 4 \
|
||||
--pipeline-parallel-size 2
|
||||
```
|
||||
|
||||
## Multi-node deployment
|
||||
|
||||
If a single node lacks sufficient GPUs to hold the model, deploy vLLM across multiple nodes. Ensure that every node provides an identical execution environment, including the model path and Python packages. Using container images is recommended because they provide a convenient way to keep environments consistent and to hide host heterogeneity.
|
||||
|
||||
### What is Ray?
|
||||
|
||||
Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments can use Ray as the runtime engine.
|
||||
|
||||
vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens.
|
||||
|
||||
Ray also offers high-level APIs for large-scale [offline batch inference](https://docs.ray.io/en/latest/data/working-with-llms.html) and [online serving](https://docs.ray.io/en/latest/serve/llm) that can leverage vLLM as the engine. These APIs add production-grade fault tolerance, scaling, and distributed observability to vLLM workloads.
|
||||
|
||||
For details, see the [Ray documentation](https://docs.ray.io/en/latest/index.html).
|
||||
|
||||
### Ray cluster setup with containers
|
||||
|
||||
The helper script [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
|
||||
|
||||
Choose one node as the head node and run:
|
||||
|
||||
```bash
|
||||
bash run_cluster.sh \
|
||||
vllm/vllm-openai \
|
||||
<HEAD_NODE_IP> \
|
||||
--head \
|
||||
/path/to/the/huggingface/home/in/this/node \
|
||||
-e VLLM_HOST_IP=<HEAD_NODE_IP>
|
||||
```
|
||||
|
||||
On each worker node, run:
|
||||
|
||||
```bash
|
||||
bash run_cluster.sh \
|
||||
vllm/vllm-openai \
|
||||
<HEAD_NODE_IP> \
|
||||
--worker \
|
||||
/path/to/the/huggingface/home/in/this/node \
|
||||
-e VLLM_HOST_IP=<WORKER_NODE_IP>
|
||||
```
|
||||
|
||||
Note that `VLLM_HOST_IP` is unique for each worker. Keep the shells running these commands open; closing any shell terminates the cluster. Ensure that all nodes can communicate with each other through their IP addresses.
|
||||
|
||||
!!! warning "Network security"
|
||||
For security, set `VLLM_HOST_IP` to an address on a private network segment. Traffic sent over this network is unencrypted, and the endpoints exchange data in a format that can be exploited to execute arbitrary code if an adversary gains network access. Ensure that untrusted parties cannot reach the network.
|
||||
|
||||
From any node, enter a container and run `ray status` and `ray list nodes` to verify that Ray finds the expected number of nodes and GPUs.
|
||||
|
||||
!!! tip
|
||||
Alternatively, set up the Ray cluster using KubeRay. For more information, see [KubeRay vLLM documentation](https://docs.ray.io/en/latest/cluster/kubernetes/examples/rayserve-llm-example.html).
|
||||
|
||||
### Running vLLM on a Ray cluster
|
||||
|
||||
!!! tip
|
||||
If Ray is running inside containers, run the commands in the remainder of this guide *inside the containers*, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it <container_name> /bin/bash`.
|
||||
|
||||
Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient.
|
||||
|
||||
The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
|
||||
|
||||
```bash
|
||||
vllm serve /path/to/the/model/in/the/container \
|
||||
--tensor-parallel-size 8 \
|
||||
--pipeline-parallel-size 2 \
|
||||
--distributed-executor-backend ray
|
||||
```
|
||||
|
||||
Alternatively, you can set `tensor_parallel_size` to the total number of GPUs in the cluster:
|
||||
|
||||
```bash
|
||||
vllm serve /path/to/the/model/in/the/container \
|
||||
--tensor-parallel-size 16 \
|
||||
--distributed-executor-backend ray
|
||||
```
|
||||
|
||||
### Running vLLM with MultiProcessing
|
||||
|
||||
Besides Ray, multi-node vLLM deployments can also use `multiprocessing` as the runtime engine. Here's an example of deploying a model across 2 nodes (8 GPUs per node) with `tp_size=8` and `pp_size=2`.
|
||||
|
||||
Choose one node as the head node and run:
|
||||
|
||||
```bash
|
||||
vllm serve /path/to/the/model/in/the/container \
|
||||
--tensor-parallel-size 8 --pipeline-parallel-size 2 \
|
||||
--nnodes 2 --node-rank 0 \
|
||||
--master-addr <HEAD_NODE_IP>
|
||||
```
|
||||
|
||||
On the worker node, run:
|
||||
|
||||
```bash
|
||||
vllm serve /path/to/the/model/in/the/container \
|
||||
--tensor-parallel-size 8 --pipeline-parallel-size 2 \
|
||||
--nnodes 2 --node-rank 1 \
|
||||
--master-addr <HEAD_NODE_IP> --headless
|
||||
```
|
||||
|
||||
## Optimizing network communication for tensor parallelism
|
||||
|
||||
Efficient tensor parallelism requires fast internode communication, preferably through high-speed network adapters such as InfiniBand.
|
||||
To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the
|
||||
[examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) helper script.
|
||||
Contact your system administrator for more information about the required flags.
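
As a sketch, the head-node invocation from the cluster setup section above might then become (the NCCL flags are illustrative and hardware-dependent):

```bash
bash run_cluster.sh \
    vllm/vllm-openai \
    <HEAD_NODE_IP> \
    --head \
    /path/to/the/huggingface/home/in/this/node \
    --privileged \
    -e NCCL_IB_HCA=mlx5 \
    -e VLLM_HOST_IP=<HEAD_NODE_IP>
```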
|
||||
|
||||
## Enabling GPUDirect RDMA
|
||||
|
||||
GPUDirect RDMA (Remote Direct Memory Access) is an NVIDIA technology that allows network adapters to directly access GPU memory, bypassing the CPU and system memory. This direct access reduces latency and CPU overhead, which is beneficial for large data transfers between GPUs across nodes.
|
||||
|
||||
To enable GPUDirect RDMA with vLLM, configure the following settings:
|
||||
|
||||
- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk.
|
||||
- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC).
|
||||
|
||||
If you use Docker, set up the container as follows:
|
||||
|
||||
```bash
|
||||
docker run --gpus all \
|
||||
--ipc=host \
|
||||
--shm-size=16G \
|
||||
-v /dev/shm:/dev/shm \
|
||||
vllm/vllm-openai
|
||||
```
|
||||
|
||||
If you use Kubernetes, set up the pod spec as follows:
|
||||
|
||||
```yaml
|
||||
...
|
||||
spec:
|
||||
containers:
|
||||
- name: vllm
|
||||
image: vllm/vllm-openai
|
||||
securityContext:
|
||||
capabilities:
|
||||
add: ["IPC_LOCK"]
|
||||
volumeMounts:
|
||||
- mountPath: /dev/shm
|
||||
name: dshm
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 8
|
||||
requests:
|
||||
nvidia.com/gpu: 8
|
||||
volumes:
|
||||
- name: dshm
|
||||
emptyDir:
|
||||
medium: Memory
|
||||
...
|
||||
```
|
||||
|
||||
!!! tip "Confirm GPUDirect RDMA operation"
|
||||
To confirm your InfiniBand card is using GPUDirect RDMA, run vLLM with detailed NCCL logs: `NCCL_DEBUG=TRACE vllm serve ...`.
|
||||
|
||||
Then look for the NCCL version and the network used.
|
||||
|
||||
- If you find `[send] via NET/IB/GDRDMA` in the logs, then NCCL is using InfiniBand with GPUDirect RDMA, which *is* efficient.
|
||||
- If you find `[send] via NET/Socket` in the logs, NCCL used a raw TCP socket, which *is not* efficient for cross-node tensor parallelism.
|
||||
|
||||
!!! tip "Pre-download Hugging Face models"
|
||||
If you use Hugging Face models, downloading the model before starting vLLM is recommended. Download the model on every node to the same path, or store the model on a distributed file system accessible by all nodes. Then pass the path to the model in place of the repository ID. Otherwise, supply a Hugging Face token by appending `-e HF_TOKEN=<TOKEN>` to `run_cluster.sh`.
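
A minimal sketch, assuming a hypothetical model repository and the same local path on every node:

```bash
# Run on every node (or once onto a shared filesystem)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
    --local-dir /models/Llama-3.1-8B-Instruct

# Then point vLLM at the local path instead of the repository ID
vllm serve /models/Llama-3.1-8B-Instruct
```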
|
||||
|
||||
## Troubleshooting distributed deployments
|
||||
|
||||
For information about distributed debugging, see [Troubleshooting distributed deployments](distributed_troubleshooting.md).
|
||||