Sync from v0.13
This commit is contained in:
9
docs/usage/README.md
Normal file
9
docs/usage/README.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# Using vLLM
|
||||
|
||||
First, vLLM must be [installed](../getting_started/installation/README.md) for your chosen device in either a Python or Docker environment.
|
||||
|
||||
Then, vLLM supports the following usage patterns:
|
||||
|
||||
- [Inference and Serving](../serving/offline_inference.md): Run a single instance of a model.
|
||||
- [Deployment](../deployment/docker.md): Scale up model instances for production.
|
||||
- [Training](../training/rlhf.md): Train or fine-tune a model.
|
||||
35
docs/usage/faq.md
Normal file
35
docs/usage/faq.md
Normal file
@@ -0,0 +1,35 @@
|
||||
# Frequently Asked Questions
|
||||
|
||||
> Q: How can I serve multiple models on a single port using the OpenAI API?
|
||||
|
||||
A: Assuming that you're referring to using OpenAI compatible server to serve multiple models at once, that is not currently supported, you can run multiple instances of the server (each serving a different model) at the same time, and have another layer to route the incoming request to the correct server accordingly.
|
||||
|
||||
---
|
||||
|
||||
> Q: Which model to use for offline inference embedding?
|
||||
|
||||
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
|
||||
more are listed [here](../models/supported_models.md).
|
||||
|
||||
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
|
||||
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
|
||||
but they are expected to be inferior to models that are specifically trained on embedding tasks.
|
||||
|
||||
---
|
||||
|
||||
> Q: Can the output of a prompt vary across runs in vLLM?
|
||||
|
||||
A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
|
||||
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
|
||||
see the [Numerical Accuracy section](https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations).
|
||||
|
||||
In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
|
||||
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
|
||||
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
|
||||
different tokens being sampled. Once a different token is sampled, further divergence is likely.
|
||||
|
||||
## Mitigation Strategies
|
||||
|
||||
- For improved stability and reduced variance, use `float32`. Note that this will require more memory.
|
||||
- If using `bfloat16`, switching to `float16` can also help.
|
||||
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
|
||||
52
docs/usage/metrics.md
Normal file
52
docs/usage/metrics.md
Normal file
@@ -0,0 +1,52 @@
|
||||
# Production Metrics
|
||||
|
||||
vLLM exposes a number of metrics that can be used to monitor the health of the
|
||||
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
|
||||
OpenAI compatible API server.
|
||||
|
||||
You can start the server using Python, or using [Docker](../deployment/docker.md):
|
||||
|
||||
```bash
|
||||
vllm serve unsloth/Llama-3.2-1B-Instruct
|
||||
```
|
||||
|
||||
Then query the endpoint to get the latest metrics from the server:
|
||||
|
||||
??? console "Output"
|
||||
|
||||
```console
|
||||
$ curl http://0.0.0.0:8000/metrics
|
||||
|
||||
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
|
||||
# TYPE vllm:iteration_tokens_total histogram
|
||||
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
|
||||
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
|
||||
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
|
||||
vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
|
||||
vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
|
||||
vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
|
||||
vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
|
||||
vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
|
||||
vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
|
||||
...
|
||||
```
|
||||
|
||||
The following metrics are exposed:
|
||||
|
||||
## General Metrics
|
||||
|
||||
--8<-- "docs/generated/metrics/general.md"
|
||||
|
||||
## Speculative Decoding Metrics
|
||||
|
||||
--8<-- "docs/generated/metrics/spec_decode.md"
|
||||
|
||||
## NIXL KV Connector Metrics
|
||||
|
||||
--8<-- "docs/generated/metrics/nixl_connector.md"
|
||||
|
||||
## Deprecation Policy
|
||||
|
||||
Note: when metrics are deprecated in version `X.Y`, they are hidden in version `X.Y+1`
|
||||
but can be re-enabled using the `--show-hidden-metrics-for-version=X.Y` escape hatch,
|
||||
and are then removed in version `X.Y+2`.
|
||||
41
docs/usage/reproducibility.md
Normal file
41
docs/usage/reproducibility.md
Normal file
@@ -0,0 +1,41 @@
|
||||
# Reproducibility
|
||||
|
||||
vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. To achieve
|
||||
reproducible results:
|
||||
|
||||
- In offline mode, you can either set `VLLM_ENABLE_V1_MULTIPROCESSING=0` which makes scheduling deterministic,
|
||||
or enable [batch invariance](../features/batch_invariance.md) to make the outputs insensitive to scheduling.
|
||||
- In online mode, you can only enable [batch invariance](../features/batch_invariance.md).
|
||||
|
||||
Example: [examples/offline_inference/reproducibility.py](../../examples/offline_inference/reproducibility.py)
|
||||
|
||||
!!! warning
|
||||
|
||||
Setting `VLLM_ENABLE_V1_MULTIPROCESSING=0` will change the random state of user code
|
||||
(i.e. the code that constructs [LLM][vllm.LLM] class).
|
||||
|
||||
!!! note
|
||||
|
||||
Even with the above settings, vLLM only provides reproducibility
|
||||
when it runs on the same hardware and the same vLLM version.
|
||||
|
||||
## Setting the global seed
|
||||
|
||||
The `seed` parameter in vLLM is used to control the random states for various random number generators.
|
||||
|
||||
If a specific seed value is provided, the random states for `random`, `np.random`, and `torch.manual_seed` will be set accordingly.
|
||||
|
||||
### Default Behavior
|
||||
|
||||
In V1, the `seed` parameter defaults to `0` which sets the random state for each worker, so the results will remain consistent for each vLLM run even if `temperature > 0`.
|
||||
|
||||
It is impossible to un-specify a seed for V1 because different workers need to sample the same outputs
|
||||
for workflows such as speculative decoding. For more information, see: <https://github.com/vllm-project/vllm/pull/17929>
|
||||
|
||||
!!! note
|
||||
|
||||
The random state in user code (i.e. the code that constructs [LLM][vllm.LLM] class) is updated by vLLM
|
||||
only if the workers are run in the same process as user code, i.e.: `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
|
||||
|
||||
By default, `VLLM_ENABLE_V1_MULTIPROCESSING=1` so you can use vLLM without having to worry about
|
||||
accidentally making deterministic subsequent operations that rely on random state.
|
||||
223
docs/usage/security.md
Normal file
223
docs/usage/security.md
Normal file
@@ -0,0 +1,223 @@
|
||||
# Security
|
||||
|
||||
## Inter-Node Communication
|
||||
|
||||
All communications between nodes in a multi-node vLLM deployment are **insecure by default** and must be protected by placing the nodes on an isolated network. This includes:
|
||||
|
||||
1. PyTorch Distributed communications
|
||||
2. KV cache transfer communications
|
||||
3. Tensor, Pipeline, and Data parallel communications
|
||||
|
||||
### Configuration Options for Inter-Node Communications
|
||||
|
||||
The following options control internode communications in vLLM:
|
||||
|
||||
#### 1. **Environment Variables:**
|
||||
|
||||
- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on
|
||||
|
||||
#### 2. **KV Cache Transfer Configuration:**
|
||||
|
||||
- `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1)
|
||||
- `--kv-port`: The port for KV cache transfer communications (default: 14579)
|
||||
|
||||
#### 3. **Data Parallel Configuration:**
|
||||
|
||||
- `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1)
|
||||
- `data_parallel_master_port`: Port of the data parallel master (default: 29500)
|
||||
|
||||
### Notes on PyTorch Distributed
|
||||
|
||||
vLLM uses PyTorch's distributed features for some internode communication. For
|
||||
detailed information about PyTorch Distributed security considerations, please
|
||||
refer to the [PyTorch Security
|
||||
Guide](https://github.com/pytorch/pytorch/security/policy#using-distributed-features).
|
||||
|
||||
Key points from the PyTorch security guide:
|
||||
|
||||
- PyTorch Distributed features are intended for internal communication only
|
||||
- They are not built for use in untrusted environments or networks
|
||||
- No authorization protocol is included for performance reasons
|
||||
- Messages are sent unencrypted
|
||||
- Connections are accepted from anywhere without checks
|
||||
|
||||
### Security Recommendations
|
||||
|
||||
#### 1. **Network Isolation:**
|
||||
|
||||
- Deploy vLLM nodes on a dedicated, isolated network
|
||||
- Use network segmentation to prevent unauthorized access
|
||||
- Implement appropriate firewall rules
|
||||
|
||||
#### 2. **Configuration Best Practices:**
|
||||
|
||||
- Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults
|
||||
- Configure firewalls to only allow necessary ports between nodes
|
||||
|
||||
#### 3. **Access Control:**
|
||||
|
||||
- Restrict physical and network access to the deployment environment
|
||||
- Implement proper authentication and authorization for management interfaces
|
||||
- Follow the principle of least privilege for all system components
|
||||
|
||||
### 4. **Restrict Domains Access for Media URLs:**
|
||||
|
||||
Restrict domains that vLLM can access for media URLs by setting
|
||||
`--allowed-media-domains` to prevent Server-Side Request Forgery (SSRF) attacks.
|
||||
(e.g. `--allowed-media-domains upload.wikimedia.org github.com www.bogotobogo.com`)
|
||||
|
||||
Also, consider setting `VLLM_MEDIA_URL_ALLOW_REDIRECTS=0` to prevent HTTP
|
||||
redirects from being followed to bypass domain restrictions.
|
||||
|
||||
## Security and Firewalls: Protecting Exposed vLLM Systems
|
||||
|
||||
While vLLM is designed to allow unsafe network services to be isolated to
|
||||
private networks, there are components—such as dependencies and underlying
|
||||
frameworks—that may open insecure services listening on all network interfaces,
|
||||
sometimes outside of vLLM's direct control.
|
||||
|
||||
A major concern is the use of `torch.distributed`, which vLLM leverages for
|
||||
distributed communication, including when using vLLM on a single host. When vLLM
|
||||
uses TCP initialization (see [PyTorch TCP Initialization
|
||||
documentation](https://docs.pytorch.org/docs/stable/distributed.html#tcp-initialization)),
|
||||
PyTorch creates a `TCPStore` that, by default, listens on all network
|
||||
interfaces. This means that unless additional protections are put in place,
|
||||
these services may be accessible to any host that can reach your machine via any
|
||||
network interface.
|
||||
|
||||
**From a PyTorch perspective, any use of `torch.distributed` should be
|
||||
considered insecure by default.** This is a known and intentional behavior from
|
||||
the PyTorch team.
|
||||
|
||||
### Firewall Configuration Guidance
|
||||
|
||||
The best way to protect your vLLM system is to carefully configure a firewall to
|
||||
expose only the minimum network surface area necessary. In most cases, this
|
||||
means:
|
||||
|
||||
- **Block all incoming connections except to the TCP port the API server is
|
||||
listening on.**
|
||||
|
||||
- Ensure that ports used for internal communication (such as those for
|
||||
`torch.distributed` and KV cache transfer) are only accessible from trusted
|
||||
hosts or networks.
|
||||
|
||||
- Never expose these internal ports to the public internet or untrusted
|
||||
networks.
|
||||
|
||||
Consult your operating system or application platform documentation for specific
|
||||
firewall configuration instructions.
|
||||
|
||||
## API Key Authentication Limitations
|
||||
|
||||
### Overview
|
||||
|
||||
The `--api-key` flag (or `VLLM_API_KEY` environment variable) provides authentication for vLLM's HTTP server, but **only for OpenAI-compatible API endpoints under the `/v1` path prefix**. Many other sensitive endpoints are exposed on the same HTTP server without any authentication enforcement.
|
||||
|
||||
**Important:** Do not rely exclusively on `--api-key` for securing access to vLLM. Additional security measures are required for production deployments.
|
||||
|
||||
### Protected Endpoints (Require API Key)
|
||||
|
||||
When `--api-key` is configured, the following `/v1` endpoints require Bearer token authentication:
|
||||
|
||||
- `/v1/models` - List available models
|
||||
- `/v1/chat/completions` - Chat completions
|
||||
- `/v1/completions` - Text completions
|
||||
- `/v1/embeddings` - Generate embeddings
|
||||
- `/v1/audio/transcriptions` - Audio transcription
|
||||
- `/v1/audio/translations` - Audio translation
|
||||
- `/v1/messages` - Anthropic-compatible messages API
|
||||
- `/v1/responses` - Response management
|
||||
- `/v1/score` - Scoring API
|
||||
- `/v1/rerank` - Reranking API
|
||||
|
||||
### Unprotected Endpoints (No API Key Required)
|
||||
|
||||
The following endpoints **do not require authentication** even when `--api-key` is configured:
|
||||
|
||||
**Inference endpoints:**
|
||||
|
||||
- `/invocations` - SageMaker-compatible endpoint (routes to the same inference functions as `/v1` endpoints)
|
||||
- `/inference/v1/generate` - Generate completions
|
||||
- `/pooling` - Pooling API
|
||||
- `/classify` - Classification API
|
||||
- `/score` - Scoring API (non-`/v1` variant)
|
||||
- `/rerank` - Reranking API (non-`/v1` variant)
|
||||
|
||||
**Operational control endpoints (always enabled):**
|
||||
|
||||
- `/pause` - Pause generation (causes denial of service)
|
||||
- `/resume` - Resume generation
|
||||
- `/scale_elastic_ep` - Trigger scaling operations
|
||||
|
||||
**Utility endpoints:**
|
||||
|
||||
- `/tokenize` - Tokenize text
|
||||
- `/detokenize` - Detokenize tokens
|
||||
- `/health` - Health check
|
||||
- `/ping` - SageMaker health check
|
||||
- `/version` - Version information
|
||||
- `/load` - Server load metrics
|
||||
|
||||
**Tokenizer information endpoint (only when `--enable-tokenizer-info-endpoint` is set):**
|
||||
|
||||
This endpoint is **only available when the `--enable-tokenizer-info-endpoint` flag is set**. It may expose sensitive information such as chat templates and tokenizer configuration:
|
||||
|
||||
- `/tokenizer_info` - Get comprehensive tokenizer information including chat templates and configuration
|
||||
|
||||
**Development endpoints (only when `VLLM_SERVER_DEV_MODE=1`):**
|
||||
|
||||
These endpoints are **only available when the environment variable `VLLM_SERVER_DEV_MODE` is set to `1`**. They are intended for development and debugging purposes and should never be enabled in production:
|
||||
|
||||
- `/server_info` - Get detailed server configuration
|
||||
- `/reset_prefix_cache` - Reset prefix cache (can disrupt service)
|
||||
- `/reset_mm_cache` - Reset multimodal cache (can disrupt service)
|
||||
- `/sleep` - Put engine to sleep (causes denial of service)
|
||||
- `/wake_up` - Wake engine from sleep
|
||||
- `/is_sleeping` - Check if engine is sleeping
|
||||
- `/collective_rpc` - Execute arbitrary RPC methods on the engine (extremely dangerous)
|
||||
|
||||
**Profiler endpoints (only when `VLLM_TORCH_PROFILER_DIR` or `VLLM_TORCH_CUDA_PROFILE` are set):**
|
||||
|
||||
These endpoints are only available when profiling is enabled and should only be used for local development:
|
||||
|
||||
- `/start_profile` - Start PyTorch profiler
|
||||
- `/stop_profile` - Stop PyTorch profiler
|
||||
|
||||
**Note:** The `/invocations` endpoint is particularly concerning as it provides unauthenticated access to the same inference capabilities as the protected `/v1` endpoints.
|
||||
|
||||
### Security Implications
|
||||
|
||||
An attacker who can reach the vLLM HTTP server can:
|
||||
|
||||
1. **Bypass authentication** by using non-`/v1` endpoints like `/invocations`, `/inference/v1/generate`, `/pooling`, `/classify`, `/score`, or `/rerank` to run arbitrary inference without credentials
|
||||
2. **Cause denial of service** by calling `/pause` or `/scale_elastic_ep` without a token
|
||||
3. **Access operational controls** to manipulate server state (e.g., pausing generation)
|
||||
4. **If `--enable-tokenizer-info-endpoint` is set:** Access sensitive tokenizer configuration including chat templates, which may reveal prompt engineering strategies or other implementation details
|
||||
5. **If `VLLM_SERVER_DEV_MODE=1` is set:** Execute arbitrary RPC commands via `/collective_rpc`, reset caches, put the engine to sleep, and access detailed server configuration
|
||||
|
||||
### Recommended Security Practices
|
||||
|
||||
#### 1. Minimize Exposed Endpoints
|
||||
|
||||
**CRITICAL:** Never set `VLLM_SERVER_DEV_MODE=1` in production environments. Development endpoints expose extremely dangerous functionality including:
|
||||
|
||||
- Arbitrary RPC execution via `/collective_rpc`
|
||||
- Cache manipulation that can disrupt service
|
||||
- Detailed server configuration disclosure
|
||||
|
||||
Similarly, never enable profiler endpoints (`VLLM_TORCH_PROFILER_DIR` or `VLLM_TORCH_CUDA_PROFILE`) in production.
|
||||
|
||||
**Be cautious with `--enable-tokenizer-info-endpoint`:** Only enable the `/tokenizer_info` endpoint if you need to expose tokenizer configuration information. This endpoint reveals chat templates and tokenizer settings that may contain sensitive implementation details or prompt engineering strategies.
|
||||
|
||||
#### 2. Deploy Behind a Reverse Proxy
|
||||
|
||||
The most effective approach is to deploy vLLM behind a reverse proxy (such as nginx, Envoy, or a Kubernetes Gateway) that:
|
||||
|
||||
- Explicitly allowlists only the endpoints you want to expose to end users
|
||||
- Blocks all other endpoints, including the unauthenticated inference and operational control endpoints
|
||||
- Implements additional authentication, rate limiting, and logging at the proxy layer
|
||||
|
||||
## Reporting Security Vulnerabilities
|
||||
|
||||
If you believe you have found a security vulnerability in vLLM, please report it following the project's security policy. For more information on how to report security issues and the project's security policy, please see the [vLLM Security Policy](https://github.com/vllm-project/vllm/blob/main/SECURITY.md).
|
||||
327
docs/usage/troubleshooting.md
Normal file
327
docs/usage/troubleshooting.md
Normal file
@@ -0,0 +1,327 @@
|
||||
# Troubleshooting
|
||||
|
||||
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
|
||||
|
||||
!!! note
|
||||
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
|
||||
|
||||
## Hangs downloading a model
|
||||
|
||||
If the model isn't already downloaded to disk, vLLM will download it from the internet which can take time and depend on your internet connection.
|
||||
It's recommended to download the model first using the [huggingface-cli](https://huggingface.co/docs/huggingface_hub/en/guides/cli) and passing the local path to the model to vLLM. This way, you can isolate the issue.
|
||||
|
||||
## Hangs loading a model from disk
|
||||
|
||||
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
|
||||
It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
|
||||
|
||||
!!! note
|
||||
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
|
||||
|
||||
## Out of memory
|
||||
|
||||
If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options](../configuration/conserving_memory.md) to reduce the memory consumption.
|
||||
|
||||
## Generation quality changed
|
||||
|
||||
In v0.8.0, the source of default sampling parameters was changed in <https://github.com/vllm-project/vllm/pull/12622>. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator.
|
||||
|
||||
In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.
|
||||
|
||||
You can check if this is happening by trying the old defaults with `--generation-config vllm` for online and `generation_config="vllm"` for offline. If, after trying this, your generation quality improves we would recommend continuing to use the vLLM defaults and petition the model creator on <https://huggingface.co> to update their default `generation_config.json` so that it produces better quality generations.
|
||||
|
||||
## Enable more logging
|
||||
|
||||
If other strategies don't solve the problem, it's likely that the vLLM instance is stuck somewhere. You can use the following environment variables to help debug the issue:
|
||||
|
||||
- `export VLLM_LOGGING_LEVEL=DEBUG` to turn on more logging.
|
||||
- `export VLLM_LOG_STATS_INTERVAL=1.` to get log statistics more frequently for tracking running queue, waiting queue and cache hit states.
|
||||
- `export CUDA_LAUNCH_BLOCKING=1` to identify which CUDA kernel is causing the problem.
|
||||
- `export NCCL_DEBUG=TRACE` to turn on more logging for NCCL.
|
||||
- `export VLLM_TRACE_FUNCTION=1` to record all function calls for inspection in the log files to tell which function crashes or hangs. (WARNING: This flag will slow down the token generation by **over 100x**. Do not use unless absolutely needed.)
|
||||
|
||||
## Breakpoints
|
||||
|
||||
Setting normal `pdb` breakpoints may not work in vLLM's codebase if they are executed in a subprocess. You will experience something like:
|
||||
|
||||
``` text
|
||||
File "/usr/local/uv/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/bdb.py", line 100, in trace_dispatch
|
||||
return self.dispatch_line(frame)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
File "/usr/local/uv/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/bdb.py", line 125, in dispatch_line
|
||||
if self.quitting: raise BdbQuit
|
||||
^^^^^^^^^^^^^
|
||||
bdb.BdbQuit
|
||||
```
|
||||
|
||||
One solution is using [forked-pdb](https://github.com/Lightning-AI/forked-pdb). Install with `pip install fpdb` and set a breakpoint with something like:
|
||||
|
||||
``` python
|
||||
__import__('fpdb').ForkedPdb().set_trace()
|
||||
```
|
||||
|
||||
Another option is to disable multiprocessing entirely, with the `VLLM_ENABLE_V1_MULTIPROCESSING` environment variable.
|
||||
This keeps the scheduler in the same process, so you can use stock `pdb` breakpoints:
|
||||
|
||||
``` python
|
||||
import os
|
||||
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
|
||||
```
|
||||
|
||||
## Incorrect network setup
|
||||
|
||||
The vLLM instance cannot get the correct IP address if you have a complicated network config. You can find a log such as `DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl` and the IP address should be the correct one.
|
||||
If it's not, override the IP address using the environment variable `export VLLM_HOST_IP=<your_ip_address>`.
|
||||
|
||||
You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>` and `export GLOO_SOCKET_IFNAME=<your_network_interface>` to specify the network interface for the IP address.
|
||||
|
||||
## Error near `self.graph.replay()`
|
||||
|
||||
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
|
||||
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the [LLM][vllm.LLM] class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
|
||||
|
||||
## Incorrect hardware/driver
|
||||
|
||||
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
# Test PyTorch NCCL
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
dist.init_process_group(backend="nccl")
|
||||
local_rank = dist.get_rank() % torch.cuda.device_count()
|
||||
torch.cuda.set_device(local_rank)
|
||||
data = torch.FloatTensor([1,] * 128).to("cuda")
|
||||
dist.all_reduce(data, op=dist.ReduceOp.SUM)
|
||||
torch.cuda.synchronize()
|
||||
value = data.mean().item()
|
||||
world_size = dist.get_world_size()
|
||||
assert value == world_size, f"Expected {world_size}, got {value}"
|
||||
|
||||
print("PyTorch NCCL is successful!")
|
||||
|
||||
# Test PyTorch GLOO
|
||||
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
|
||||
cpu_data = torch.FloatTensor([1,] * 128)
|
||||
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
|
||||
value = cpu_data.mean().item()
|
||||
assert value == world_size, f"Expected {world_size}, got {value}"
|
||||
|
||||
print("PyTorch GLOO is successful!")
|
||||
|
||||
if world_size <= 1:
|
||||
exit()
|
||||
|
||||
# Test vLLM NCCL, with cuda graph
|
||||
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
|
||||
|
||||
pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
|
||||
# pynccl is enabled by default for 0.6.5+,
|
||||
# but for 0.6.4 and below, we need to enable it manually.
|
||||
# keep the code for backward compatibility when because people
|
||||
# prefer to read the latest documentation.
|
||||
pynccl.disabled = False
|
||||
|
||||
s = torch.cuda.Stream()
|
||||
with torch.cuda.stream(s):
|
||||
data.fill_(1)
|
||||
out = pynccl.all_reduce(data, stream=s)
|
||||
value = out.mean().item()
|
||||
assert value == world_size, f"Expected {world_size}, got {value}"
|
||||
|
||||
print("vLLM NCCL is successful!")
|
||||
|
||||
g = torch.cuda.CUDAGraph()
|
||||
with torch.cuda.graph(cuda_graph=g, stream=s):
|
||||
out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())
|
||||
|
||||
data.fill_(1)
|
||||
g.replay()
|
||||
torch.cuda.current_stream().synchronize()
|
||||
value = out.mean().item()
|
||||
assert value == world_size, f"Expected {world_size}, got {value}"
|
||||
|
||||
print("vLLM NCCL with cuda graph is successful!")
|
||||
|
||||
dist.destroy_process_group(gloo_group)
|
||||
dist.destroy_process_group()
|
||||
```
|
||||
|
||||
If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:
|
||||
|
||||
```bash
|
||||
NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
|
||||
```
|
||||
|
||||
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
|
||||
|
||||
```bash
|
||||
NCCL_DEBUG=TRACE torchrun --nnodes 2 \
|
||||
--nproc-per-node=2 \
|
||||
--rdzv_backend=c10d \
|
||||
--rdzv_endpoint=$MASTER_ADDR test.py
|
||||
```
|
||||
|
||||
If the script runs successfully, you should see the message `sanity check is successful!`.
|
||||
|
||||
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
|
||||
|
||||
!!! note
|
||||
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
|
||||
|
||||
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
|
||||
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
|
||||
|
||||
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
|
||||
|
||||
## Python multiprocessing
|
||||
|
||||
### `RuntimeError` Exception
|
||||
|
||||
If you have seen a warning in your logs like this:
|
||||
|
||||
```console
|
||||
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
|
||||
initialized. We must use the `spawn` multiprocessing start method. Setting
|
||||
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
|
||||
https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing
|
||||
for more information.
|
||||
```
|
||||
|
||||
or an error from Python that looks like this:
|
||||
|
||||
??? console "Logs"
|
||||
|
||||
```console
|
||||
RuntimeError:
|
||||
An attempt has been made to start a new process before the
|
||||
current process has finished its bootstrapping phase.
|
||||
|
||||
This probably means that you are not using fork to start your
|
||||
child processes and you have forgotten to use the proper idiom
|
||||
in the main module:
|
||||
|
||||
if __name__ == '__main__':
|
||||
freeze_support()
|
||||
...
|
||||
|
||||
The "freeze_support()" line can be omitted if the program
|
||||
is not going to be frozen to produce an executable.
|
||||
|
||||
To fix this issue, refer to the "Safe importing of main module"
|
||||
section in https://docs.python.org/3/library/multiprocessing.html
|
||||
```
|
||||
|
||||
then you must update your Python code to guard usage of `vllm` behind a `if
|
||||
__name__ == '__main__':` block. For example, instead of this:
|
||||
|
||||
```python
|
||||
import vllm
|
||||
|
||||
llm = vllm.LLM(...)
|
||||
```
|
||||
|
||||
try this instead:
|
||||
|
||||
```python
|
||||
if __name__ == '__main__':
|
||||
import vllm
|
||||
|
||||
llm = vllm.LLM(...)
|
||||
```
|
||||
|
||||
## `torch.compile` Error
|
||||
|
||||
vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
|
||||
|
||||
??? code
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
@torch.compile
|
||||
def f(x):
|
||||
# a simple function to test torch.compile
|
||||
x = x + 1
|
||||
x = x * 2
|
||||
x = x.sin()
|
||||
return x
|
||||
|
||||
x = torch.randn(4, 4).cuda()
|
||||
print(f(x))
|
||||
```
|
||||
|
||||
If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See <https://github.com/vllm-project/vllm/issues/12219> for example.
|
||||
|
||||
## Model failed to be inspected
|
||||
|
||||
If you see an error like:
|
||||
|
||||
```text
|
||||
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
|
||||
raise ValueError(
|
||||
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.
|
||||
```
|
||||
|
||||
It means that vLLM failed to import the model file.
|
||||
Usually, it is related to missing dependencies or outdated binaries in the vLLM build.
|
||||
Please read the logs carefully to determine the root cause of the error.
|
||||
|
||||
## Model not supported
|
||||
|
||||
If you see an error like:
|
||||
|
||||
```text
|
||||
Traceback (most recent call last):
|
||||
...
|
||||
File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
|
||||
for arch in architectures:
|
||||
TypeError: 'NoneType' object is not iterable
|
||||
```
|
||||
|
||||
or:
|
||||
|
||||
```text
|
||||
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
|
||||
raise ValueError(
|
||||
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
|
||||
```
|
||||
|
||||
But you are sure that the model is in the [list of supported models](../models/supported_models.md), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](../configuration/model_resolution.md) to explicitly specify the vLLM implementation for the model.
|
||||
|
||||
## Failed to infer device type
|
||||
|
||||
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](../../vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](https://github.com/vllm-project/vllm/pull/14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
|
||||
|
||||
## NCCL error: unhandled system error during `ncclCommInitRank`
|
||||
|
||||
If your serving workload uses GPUDirect RDMA for distributed serving across multiple nodes and encounters an error during `ncclCommInitRank`, with no clear error message even with `NCCL_DEBUG=INFO` set, it might look like this:
|
||||
|
||||
```text
|
||||
Error executing method 'init_device'. This might cause deadlock in distributed execution.
|
||||
Traceback (most recent call last):
|
||||
...
|
||||
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
|
||||
self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
|
||||
self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
|
||||
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
|
||||
raise RuntimeError(f"NCCL error: {error_str}")
|
||||
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
|
||||
...
|
||||
```
|
||||
|
||||
This indicates vLLM failed to initialize the NCCL communicator, possibly due to a missing `IPC_LOCK` linux capability or an unmounted `/dev/shm`. Refer to [Enabling GPUDirect RDMA](../serving/parallelism_scaling.md#enabling-gpudirect-rdma) for guidance on properly configuring the environment for GPUDirect RDMA.
|
||||
|
||||
## CUDA error: the provided PTX was compiled with an unsupported toolchain
|
||||
|
||||
If you see an error like `RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.`, it means that the CUDA PTX in vLLM's wheels was compiled with a toolchain unsupported by your system. The released vLLM wheels have to be compiled with a specific version of CUDA toolkit, and the compiled code might fail to run on lower versions of CUDA drivers. Read [cuda compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/) for more details. The solution is to install `cuda-compat` package from your package manager. For example, on Ubuntu, you can run `sudo apt-get install cuda-compat-12-9`, and then add `export LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:$LD_LIBRARY_PATH` to your `.bashrc` file. When successfully installed, you should see that the output of `nvidia-smi` will show `CUDA Version: 12.9`. Note that we use CUDA 12.9 as an example here, you may want to install a higher version of cuda-compat package in case vLLM's default CUDA version goes higher.
|
||||
|
||||
## Known Issues
|
||||
|
||||
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](https://github.com/vllm-project/vllm/pull/6759).
|
||||
- To address a memory overhead issue in older NCCL versions (see [bug](https://github.com/NVIDIA/nccl/issues/1234)), vLLM versions `>= 0.4.3, <= 0.10.1.1` would set the environment variable `NCCL_CUMEM_ENABLE=0`. External processes connecting to vLLM also needed to set this variable to prevent hangs or crashes. Since the underlying NCCL bug was fixed in NCCL 2.22.3, this override was removed in newer vLLM versions to allow for NCCL performance optimizations.
|
||||
- In some PCIe machines (e.g. machines without NVLink), if you see an error like `transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'`, it's likely caused by a driver bug. See [this issue](https://github.com/NVIDIA/nccl/issues/1838) for more details. In that case, you can try to set `NCCL_CUMEM_HOST_ENABLE=0` to disable the feature, or upgrade your driver to the latest version.
|
||||
61
docs/usage/usage_stats.md
Normal file
61
docs/usage/usage_stats.md
Normal file
@@ -0,0 +1,61 @@
|
||||
# Usage Stats Collection
|
||||
|
||||
vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent, does not contain any sensitive information.
|
||||
|
||||
A subset of the data, after cleaning and aggregation, will be publicly released for the community's benefit. For example, you can see the 2024 usage report [here](https://2024.vllm.ai).
|
||||
|
||||
## What data is collected?
|
||||
|
||||
The list of data collected by the latest version of vLLM can be found here: [vllm/usage/usage_lib.py](../../vllm/usage/usage_lib.py)
|
||||
|
||||
Here is an example as of v0.4.0:
|
||||
|
||||
??? console "Output"
|
||||
|
||||
```json
|
||||
{
|
||||
"uuid": "fbe880e9-084d-4cab-a395-8984c50f1109",
|
||||
"provider": "GCP",
|
||||
"num_cpu": 24,
|
||||
"cpu_type": "Intel(R) Xeon(R) CPU @ 2.20GHz",
|
||||
"cpu_family_model_stepping": "6,85,7",
|
||||
"total_memory": 101261135872,
|
||||
"architecture": "x86_64",
|
||||
"platform": "Linux-5.10.0-28-cloud-amd64-x86_64-with-glibc2.31",
|
||||
"gpu_count": 2,
|
||||
"gpu_type": "NVIDIA L4",
|
||||
"gpu_memory_per_device": 23580639232,
|
||||
"model_architecture": "OPTForCausalLM",
|
||||
"vllm_version": "0.3.2+cu123",
|
||||
"context": "LLM_CLASS",
|
||||
"log_time": 1711663373492490000,
|
||||
"source": "production",
|
||||
"dtype": "torch.float16",
|
||||
"tensor_parallel_size": 1,
|
||||
"block_size": 16,
|
||||
"gpu_memory_utilization": 0.9,
|
||||
"quantization": null,
|
||||
"kv_cache_dtype": "auto",
|
||||
"enable_lora": false,
|
||||
"enable_prefix_caching": false,
|
||||
"enforce_eager": false,
|
||||
"disable_custom_all_reduce": true
|
||||
}
|
||||
```
|
||||
|
||||
You can preview the collected data by running the following command:
|
||||
|
||||
```bash
|
||||
tail ~/.config/vllm/usage_stats.json
|
||||
```
|
||||
|
||||
## Opting out
|
||||
|
||||
You can opt out of usage stats collection by setting the `VLLM_NO_USAGE_STATS` or `DO_NOT_TRACK` environment variable, or by creating a `~/.config/vllm/do_not_track` file:
|
||||
|
||||
```bash
|
||||
# Any of the following methods can disable usage stats collection
|
||||
export VLLM_NO_USAGE_STATS=1
|
||||
export DO_NOT_TRACK=1
|
||||
mkdir -p ~/.config/vllm && touch ~/.config/vllm/do_not_track
|
||||
```
|
||||
186
docs/usage/v1_guide.md
Normal file
186
docs/usage/v1_guide.md
Normal file
@@ -0,0 +1,186 @@
|
||||
# vLLM V1
|
||||
|
||||
!!! announcement
|
||||
|
||||
We have fully deprecated V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
|
||||
|
||||
If you have a use case that works on V0 Engine but not V1, please share it on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
|
||||
|
||||
vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.
|
||||
|
||||
Building on V0’s success, vLLM V1 retains the stable and proven components from V0
|
||||
(such as the models, GPU kernels, and utilities). At the same time, it significantly
|
||||
re-architects the core systems, covering the scheduler, KV cache manager, worker,
|
||||
sampler, and API server, to provide a cohesive, maintainable framework that better
|
||||
accommodates continued growth and innovation.
|
||||
|
||||
Specifically, V1 aims to:
|
||||
|
||||
- Provide a **simple, modular, and easy-to-hack codebase**.
|
||||
- Ensure **high performance** with near-zero CPU overhead.
|
||||
- **Combine key optimizations** into a unified architecture.
|
||||
- Require **zero configs** by enabling features/optimizations by default.
|
||||
|
||||
We see significant performance improvements from upgrading to V1 core engine, in
|
||||
particular for long context scenarios. Please see performance benchmark (To be
|
||||
added).
|
||||
|
||||
For more details, check out the vLLM V1 blog post [vLLM V1: A Major
|
||||
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).
|
||||
|
||||
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine, therefore this guide will be updated constantly as more features get supported on vLLM V1.
|
||||
|
||||
## Differences from V0
|
||||
|
||||
This section lists some differences in behavior between V0 and V1.
|
||||
|
||||
### Chunked Prefill
|
||||
|
||||
Chunked prefill is enabled by default whenever possible, unlike in V0 where it was conditionally enabled based on model characteristics.
|
||||
|
||||
### CUDA Graphs
|
||||
|
||||
CUDA graph capture takes up more memory in V1 than in V0.
|
||||
|
||||
### Semantic Changes to Logprobs
|
||||
|
||||
#### Logprobs Calculation
|
||||
|
||||
By default, logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
|
||||
before applying any logits post-processing such as temperature scaling or penalty
|
||||
adjustments). As a result, the returned logprobs do not reflect the final adjusted
|
||||
probabilities used during sampling.
|
||||
|
||||
You can adjust this behavior by setting the `--logprobs-mode` flag.
|
||||
Four modes are supported: `raw_logprobs` (default), `processed_logprobs`, `raw_logits`, `processed_logits`.
|
||||
Raw means the values before applying any logit processors, like bad words.
|
||||
Processed means the values after applying all processors, including temperature and top_k/top_p.
|
||||
|
||||
#### Prompt Logprobs with Prefix Caching
|
||||
|
||||
While V1 supports passing prompt logprobs with prefix caching enabled, it no longer caches the logprobs.
|
||||
For a request requiring prompt logprobs, the engine will ignore the prefix cache and recompute the prefill of full prompt to generate the logprobs.
|
||||
|
||||
## Feature Support
|
||||
|
||||
For each item, its support in vLLM V1 falls into one of the following states:
|
||||
|
||||
- **🟢 Functional**: Fully operational with optimizations comparable to or better than V0.
|
||||
- **🟡 In Progress**: Planned to be in vLLM V1, with open PRs/RFCs.
|
||||
- **🔴 Removed**: Dropped from vLLM V1. Will only consider re-introducing if there is strong demand.
|
||||
|
||||
!!! note
|
||||
vLLM V1’s unified scheduler treats both prompt and output tokens the same
|
||||
way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
|
||||
allocate a fixed token budget per request, enabling features like chunked prefills,
|
||||
prefix caching, and speculative decoding without a strict separation between prefill
|
||||
and decode phases.
|
||||
|
||||
The V1 scheduler supports multiple scheduling policies, including First-Come,
|
||||
First-Served (FCFS) and priority-based scheduling (where requests are processed
|
||||
based on assigned priority, with FCFS as a tie-breaker), configurable via the
|
||||
`--scheduling-policy` argument.
|
||||
|
||||
### Hardware
|
||||
|
||||
| Hardware | Status |
|
||||
|------------------|-----------------------------------------------|
|
||||
| **NVIDIA** | <nobr>🟢</nobr> |
|
||||
| **AMD** | <nobr>🟢</nobr> |
|
||||
| **INTEL GPU** | <nobr>🟢</nobr> |
|
||||
| **TPU** | <nobr>🟢</nobr> |
|
||||
| **CPU** | <nobr>🟢</nobr> |
|
||||
|
||||
!!! note
|
||||
|
||||
More hardware platforms may be supported via plugins, e.g.:
|
||||
|
||||
- [vllm-ascend](https://github.com/vllm-project/vllm-ascend)
|
||||
- [vllm-spyre](https://github.com/vllm-project/vllm-spyre)
|
||||
- [vllm-gaudi](https://github.com/vllm-project/vllm-gaudi)
|
||||
- [vllm-openvino](https://github.com/vllm-project/vllm-openvino)
|
||||
|
||||
Please check their corresponding repositories for more details.
|
||||
|
||||
### Models
|
||||
|
||||
| Model Type | Status |
|
||||
|-----------------------------|-------------------------------------------------------------------------|
|
||||
| **Decoder-only Models** | <nobr>🟢</nobr> |
|
||||
| **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others) </nobr> |
|
||||
| **Pooling Models** | <nobr>🟢</nobr> |
|
||||
| **Mamba Models** | <nobr>🟢</nobr> |
|
||||
| **Multimodal Models** | <nobr>🟢</nobr> |
|
||||
|
||||
See below for the status of models that are not yet supported or have more features planned in V1.
|
||||
|
||||
#### Pooling Models
|
||||
|
||||
Now fully supported, with prefix caching and chunked prefill newly available for last-pooling models.
|
||||
|
||||
We are working on enabling prefix caching and chunked prefill for more categories of pooling models.
|
||||
|
||||
#### Mamba Models
|
||||
|
||||
Models using selective state-space mechanisms instead of standard transformer attention are supported.
|
||||
Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`,`FalconMambaForCausalLM`) are supported.
|
||||
|
||||
Hybrid models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
|
||||
`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`, `Plamo2ForCausalLM`).
|
||||
|
||||
Hybrid models with mechanisms different to Mamba are also supported (e.g, `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`, `Lfm2ForCausalLM`).
|
||||
|
||||
Please note that prefix caching is not yet supported for any of the above models.
|
||||
|
||||
#### Encoder-Decoder Models
|
||||
|
||||
Whisper is supported. Other models requiring cross-attention between separate
|
||||
encoder and decoder (e.g., `BartForConditionalGeneration`,
|
||||
`MllamaForConditionalGeneration`) are no longer supported.
|
||||
|
||||
### Features
|
||||
|
||||
| Feature | Status |
|
||||
|---------------------------------------------|-----------------------------------------------------------------------------------|
|
||||
| **Prefix Caching** | <nobr>🟢 Functional</nobr> |
|
||||
| **Chunked Prefill** | <nobr>🟢 Functional</nobr> |
|
||||
| **LoRA** | <nobr>🟢 Functional</nobr> |
|
||||
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
|
||||
| **FP8 KV Cache** | <nobr>🟢 Functional</nobr> |
|
||||
| **Spec Decode** | <nobr>🟢 Functional</nobr> |
|
||||
| **Prompt Logprobs with Prefix Caching** | <nobr>🟢 Functional</nobr> |
|
||||
| **Structured Output Alternative Backends** | <nobr>🟢 Functional</nobr> |
|
||||
| **Concurrent Partial Prefills** | <nobr>🟡 [In Progress](https://github.com/vllm-project/vllm/issues/14003)</nobr> |
|
||||
| **best_of** | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/issues/13361)</nobr> |
|
||||
| **Per-Request Logits Processors** | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/pull/13360)</nobr> |
|
||||
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Removed</nobr> |
|
||||
| **Request-level Structured Output Backend** | <nobr>🔴 Removed</nobr> |
|
||||
|
||||
!!! note
|
||||
|
||||
vLLM V1’s unified scheduler treats both prompt and output tokens the same
|
||||
way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
|
||||
allocate a fixed token budget per request, enabling features like chunked prefills,
|
||||
prefix caching, and speculative decoding without a strict separation between prefill
|
||||
and decode phases.
|
||||
|
||||
#### Removed Features
|
||||
|
||||
As part of the major architectural rework in vLLM V1, several legacy features have been removed.
|
||||
|
||||
##### Sampling features
|
||||
|
||||
- **best_of**: This feature has been removed due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
|
||||
- **Per-Request Logits Processors**: In V0, users could pass custom
|
||||
processing functions to adjust logits on a per-request basis. In vLLM V1, this
|
||||
feature has been removed. Instead, we now support **global logits processors**
|
||||
which are set at startup time, see [RFC #17799](https://github.com/vllm-project/vllm/issues/17799).
|
||||
|
||||
##### KV Cache features
|
||||
|
||||
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
|
||||
to handle request preemptions.
|
||||
|
||||
##### Structured Output features
|
||||
|
||||
- **Request-level Structured Output Backend**: Removed; alternative backends (outlines, guidance) with fallbacks are supported now.
|
||||
Reference in New Issue
Block a user