Sync from v0.13
docs/configuration/README.md
# Configuration Options

This section lists the most common options for running vLLM.

There are three main levels of configuration, from highest priority to lowest priority:

- [Request parameters](../serving/openai_compatible_server.md#completions-api) and [input arguments](../api/README.md#inference-parameters)
- [Engine arguments](./engine_args.md)
- [Environment variables](./env_vars.md)
docs/configuration/conserving_memory.md
# Conserving Memory

Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

## Tensor Parallelism (TP)

Tensor parallelism (the `tensor_parallel_size` option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs:

```python
from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
```

!!! warning
    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.

!!! note
    With tensor parallelism enabled, each process reads the whole model and splits it into chunks, which makes the disk reading time even longer (proportional to the degree of tensor parallelism).

    You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but afterwards you can load the sharded checkpoint much faster. The model loading time should then remain constant regardless of the degree of tensor parallelism.

## Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration.

Dynamic quantization is also supported via the `quantization` option -- see [here](../features/quantization/README.md) for more details.
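The saving is easy to estimate from the bits per parameter. A back-of-the-envelope sketch (weights only, ignoring the KV cache and activations; the model size and bit widths below are illustrative):

```python
def weight_memory_gib(num_params_billions: float, bits_per_param: int) -> float:
    """Rough weight-only memory estimate; ignores KV cache and activations."""
    return num_params_billions * 1e9 * bits_per_param / 8 / 2**30

bf16_gib = weight_memory_gib(8, 16)  # ~14.9 GiB for an 8B model at BF16
int8_gib = weight_memory_gib(8, 8)   # ~7.5 GiB with 8-bit quantization
```

Halving the bits halves the weight memory, which is why an 8-bit or 4-bit checkpoint can make an otherwise OOM-ing model fit.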
## Context length and batch size

You can further reduce memory usage by limiting the context length of the model (the `max_model_len` option)
and the maximum batch size (the `max_num_seqs` option):

```python
from vllm import LLM

llm = LLM(model="adept/fuyu-8b", max_model_len=2048, max_num_seqs=2)
```

## Reduce CUDA Graphs

By default, we optimize model inference using CUDA graphs, which take up extra memory in the GPU.

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

??? code

    ```python
    from vllm import LLM
    from vllm.config import CompilationConfig, CompilationMode

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        compilation_config=CompilationConfig(
            mode=CompilationMode.VLLM_COMPILE,
            # By default, it goes up to max_num_seqs
            cudagraph_capture_sizes=[1, 2, 4, 8, 16],
        ),
    )
    ```

You can disable graph capturing completely via the `enforce_eager` flag:

```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
```

## Adjust cache size

If you run out of CPU RAM, try the following options:

- (Multi-modal models only) you can set the size of the multi-modal cache via the `mm_processor_cache_gb` engine argument (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache via the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
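For example, on the CPU backend the KV cache budget can be set from Python before vLLM is imported (the value `8` here is illustrative):

```python
import os

# Must be set before vLLM is imported/started so the engine picks it up.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"  # KV cache size in GiB (CPU backend only)
```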
## Multi-modal input limits

You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:

```python
from vllm import LLM

# Accept up to 3 images and 1 video per prompt
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={"image": 3, "video": 1},
)
```

You can go a step further and disable unused modalities completely by setting their limit to zero.
For example, if your application only accepts image input, there is no need to allocate any memory for videos.

```python
from vllm import LLM

# Accept any number of images but no videos
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={"video": 0},
)
```

You can even run a multi-modal model for text-only inference:

```python
from vllm import LLM

# Don't accept images. Just text.
llm = LLM(
    model="google/gemma-3-27b-it",
    limit_mm_per_prompt={"image": 0},
)
```
### Configurable options

`limit_mm_per_prompt` also accepts configurable options per modality. In the configurable form, you still specify `count`, and you may optionally provide size hints that control how vLLM profiles and reserves memory for your multi-modal inputs. This helps you tune memory for the actual media you expect, instead of the model's absolute maxima.

Configurable options by modality:

- `image`: `{"count": int, "width": int, "height": int}`
- `video`: `{"count": int, "num_frames": int, "width": int, "height": int}`
- `audio`: `{"count": int, "length": int}`

Details can be found in [`ImageDummyOptions`][vllm.config.multimodal.ImageDummyOptions], [`VideoDummyOptions`][vllm.config.multimodal.VideoDummyOptions], and [`AudioDummyOptions`][vllm.config.multimodal.AudioDummyOptions].

Examples:

```python
from vllm import LLM

# Up to 5 images per prompt, profile with 512x512.
# Up to 1 video per prompt, profile with 32 frames at 640x640.
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    limit_mm_per_prompt={
        "image": {"count": 5, "width": 512, "height": 512},
        "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640},
    },
)
```

For backward compatibility, passing an integer works as before and is interpreted as `{"count": <int>}`. For example:

- `limit_mm_per_prompt={"image": 5}` is equivalent to `limit_mm_per_prompt={"image": {"count": 5}}`
- You can mix formats: `limit_mm_per_prompt={"image": 5, "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640}}`
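The equivalence can be sketched with a small helper (hypothetical, for illustration only — `normalize_limit` is not a vLLM API):

```python
def normalize_limit(limit):
    """Map the integer shorthand to the configurable dict form."""
    return {"count": limit} if isinstance(limit, int) else dict(limit)

# A mixed-format limit_mm_per_prompt value, as in the examples above
limits = {"image": 5, "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640}}
normalized = {modality: normalize_limit(value) for modality, value in limits.items()}
# normalized["image"] is now {"count": 5}
```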
!!! note
    - The size hints affect memory profiling only. They shape the dummy inputs used to compute reserved activation sizes; they do not change how inputs are actually processed at inference time.
    - If a hint exceeds what the model can accept, vLLM clamps it to the model's effective maximum and may log a warning.

!!! warning
    These size hints currently only affect activation memory profiling. Encoder cache size is determined by the actual inputs at runtime and is not limited by these hints.

## Multi-modal processor arguments

For certain models, you can adjust the multi-modal processor arguments to
reduce the size of the processed multi-modal inputs, which in turn saves memory.

Here are some examples:

```python
from vllm import LLM

# Available for Qwen2-VL series models
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    mm_processor_kwargs={"max_pixels": 768 * 768},  # Default is 1280 * 28 * 28
)

# Available for InternVL series models
llm = LLM(
    model="OpenGVLab/InternVL2-2B",
    mm_processor_kwargs={"max_dynamic_patch": 4},  # Default is 12
)
```
docs/configuration/engine_args.md
---
toc_depth: 3
---

# Engine Arguments

Engine arguments control the behavior of the vLLM engine.

- For [offline inference](../serving/offline_inference.md), they are part of the arguments to the [LLM][vllm.LLM] class.
- For [online serving](../serving/openai_compatible_server.md), they are part of the arguments to `vllm serve`.

The engine argument classes, [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs], are a combination of the configuration classes defined in [vllm.config][]. Therefore, if you are interested in developer documentation, we recommend looking at these configuration classes, as they are the source of truth for types, defaults and docstrings.

--8<-- "docs/cli/json_tip.inc.md"

## `EngineArgs`

--8<-- "docs/argparse/engine_args.md"

## `AsyncEngineArgs`

--8<-- "docs/argparse/async_engine_args.md"
docs/configuration/env_vars.md
# Environment Variables

vLLM uses the following environment variables to configure the system:

!!! warning
    Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP for vLLM's **internal usage**. They are not the port and IP of the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).

```python
--8<-- "vllm/envs.py:env-vars-definition"
```
docs/configuration/model_resolution.md
# Model Resolution

vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in the `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:

- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.

To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:

```python
from vllm import LLM

llm = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)
```

Our [list of supported models](../models/supported_models.md) shows the model architectures that are recognized by vLLM.
docs/configuration/optimization.md
# Optimization and Tuning

This guide covers optimization strategies and performance tuning for vLLM V1.

!!! tip
    Running out of memory? Consult [this guide](./conserving_memory.md) on how to conserve memory.

## Preemption

Due to the autoregressive nature of the transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
In such cases, vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, you may see the following warning:

```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```

While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
If you frequently encounter preemptions, consider the following actions:

- Increase `gpu_memory_utilization`. vLLM pre-allocates GPU cache using this percentage of memory. By increasing utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This reduces the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This shards model weights across GPUs, allowing each GPU to have more memory available for KV cache. However, increasing this value may cause excessive synchronization overhead.
- Increase `pipeline_parallel_size`. This distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, indirectly leaving more memory available for KV cache. However, increasing this value may cause latency penalties.
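To see why the first knob helps, here is a rough sketch of the memory budget (all numbers are illustrative; real accounting also includes activations and CUDA graph memory):

```python
def kv_cache_budget_gib(gpu_gib: float, weights_gib: float, gpu_memory_utilization: float) -> float:
    """Approximate GiB left for KV cache: vLLM claims utilization * total GPU
    memory, and the model weights come out of that claim."""
    return gpu_gib * gpu_memory_utilization - weights_gib

# On an 80 GiB GPU with ~16 GiB of weights, raising utilization from
# 0.90 to 0.95 frees roughly 4 more GiB for KV cache.
gain = kv_cache_budget_gib(80, 16, 0.95) - kv_cache_budget_gib(80, 16, 0.90)
```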
You can monitor the number of preemption requests through the Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.

In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
## Chunked Prefill

Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.

In V1, **chunked prefill is enabled by default whenever possible**. With chunked prefill enabled, the scheduling policy prioritizes decode requests: it batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.

This policy has two benefits:

- It improves inter-token latency (ITL) and generation decode speed because decode requests are prioritized.
- It helps achieve better GPU utilization by colocating compute-bound (prefill) and memory-bound (decode) requests in the same batch.

### Performance Tuning with Chunked Prefill

You can tune the performance by adjusting `max_num_batched_tokens`:

- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
- Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192`, especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).

```python
from vllm import LLM

# Set max_num_batched_tokens to tune performance
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_batched_tokens=16384)
```

See the related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).
## Parallelism Strategies

vLLM supports multiple parallelism strategies that can be combined to optimize performance across different hardware configurations.

### Tensor Parallelism (TP)

Tensor parallelism shards model parameters across multiple GPUs within each model layer. This is the most common strategy for large model inference within a single node.

**When to use:**

- When the model is too large to fit on a single GPU
- When you need to reduce memory pressure per GPU to allow more KV cache space for higher throughput

```python
from vllm import LLM

# Split model across 4 GPUs
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)
```

For models that are too large to fit on a single GPU (like 70B parameter models), tensor parallelism is essential.
### Pipeline Parallelism (PP)

Pipeline parallelism distributes model layers across multiple GPUs. Each GPU processes different parts of the model in sequence.

**When to use:**

- When you've already maxed out efficient tensor parallelism but need to distribute the model further, or across nodes
- For very deep and narrow models where layer distribution is more efficient than tensor sharding

Pipeline parallelism can be combined with tensor parallelism for very large models:

```python
from vllm import LLM

# Combine pipeline and tensor parallelism
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
```
### Expert Parallelism (EP)

Expert parallelism is a specialized form of parallelism for Mixture of Experts (MoE) models, where different expert networks are distributed across GPUs.

**When to use:**

- Specifically for MoE models (like DeepSeekV3, Qwen3MoE, Llama-4)
- When you want to balance the expert computation load across GPUs

Expert parallelism is enabled by setting `enable_expert_parallel=True`, which will use expert parallelism instead of tensor parallelism for MoE layers.
It will use the same degree of parallelism as what you have set for tensor parallelism.
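A minimal sketch (the model name is an illustrative MoE checkpoint; any MoE model works the same way):

```python
from vllm import LLM

# With tensor_parallel_size=4, the MoE layers are partitioned across
# 4 GPUs using EP instead of TP once enable_expert_parallel is set.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    tensor_parallel_size=4,
    enable_expert_parallel=True,
)
```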
### Data Parallelism (DP)

Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.

**When to use:**

- When you have enough GPUs to replicate the entire model
- When you need to scale throughput rather than model size
- In multi-user environments where isolation between request batches is beneficial

Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
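A minimal sketch for offline inference (assumes at least 2 GPUs; each replica serves its own batches):

```python
from vllm import LLM

# Two complete copies of the model, processing request batches in parallel.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    data_parallel_size=2,
)
```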
### Batch-level DP for Multi-Modal Encoders

By default, TP is used to shard the weights of multi-modal encoders just like for language decoders,
in order to reduce the memory and compute load on each GPU.

However, since multi-modal encoders are very small compared to language decoders,
there is relatively little gain from TP. On the other hand, TP incurs significant communication
overhead because an all-reduce is performed after every layer.

Given this, it may be advantageous to instead shard the batched input data across the TP ranks, essentially
performing batch-level DP. This has been shown to improve throughput and TTFT by around 10% for
`tensor_parallel_size=8`. For vision encoders that use hardware-unoptimized Conv3D operations,
batch-level DP can provide another 40% improvement compared to regular TP.

Nevertheless, since the weights of the multi-modal encoder are replicated across each TP rank,
there is a minor increase in memory consumption, which may cause OOM if you can barely fit the model already.

You can enable batch-level DP by setting `mm_encoder_tp_mode="data"`, for example:

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    tensor_parallel_size=4,
    # When mm_encoder_tp_mode="data",
    # the vision encoder uses DP=4 (not TP=4) to shard the input data,
    # so the TP size becomes the effective DP size.
    # Note that this is independent of the DP size for the language decoder,
    # which is used in the expert parallel setting.
    mm_encoder_tp_mode="data",
    # The language decoder uses TP=4 to shard the weights regardless
    # of the setting of mm_encoder_tp_mode.
)
```

!!! important
    Batch-level DP is not to be confused with API request-level DP
    (which is instead controlled by `data_parallel_size`).

Batch-level DP needs to be implemented on a per-model basis,
and enabled by setting `supports_encoder_tp_data = True` in the model class.
Regardless, you need to set `mm_encoder_tp_mode="data"` in the engine arguments to use this feature.

Known supported models (with corresponding benchmarks):

- dots_ocr (<https://github.com/vllm-project/vllm/pull/25466>)
- GLM-4.1V or above (<https://github.com/vllm-project/vllm/pull/23168>)
- InternVL (<https://github.com/vllm-project/vllm/pull/23909>)
- Kimi-VL (<https://github.com/vllm-project/vllm/pull/23817>)
- Llama4 (<https://github.com/vllm-project/vllm/pull/18368>)
- MiniCPM-V-2.5 or above (<https://github.com/vllm-project/vllm/pull/23327>, <https://github.com/vllm-project/vllm/pull/23948>)
- Qwen2-VL or above (<https://github.com/vllm-project/vllm/pull/22742>, <https://github.com/vllm-project/vllm/pull/24955>, <https://github.com/vllm-project/vllm/pull/25445>)
- Step3 (<https://github.com/vllm-project/vllm/pull/22697>)
## Input Processing

### Parallel Processing

You can run input processing in parallel via [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
This is useful when input processing (which runs inside the API server)
becomes a bottleneck compared to model execution (which runs inside the engine core)
and you have excess CPU capacity.

```console
# Run 4 API processes and 1 engine core process
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4

# Run 4 API processes and 2 engine core processes
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
```

!!! note
    API server scale-out is only available for online inference.

!!! warning
    By default, 8 CPU threads are used in each API server to load media items (e.g. images)
    from request data.

    If you apply API server scale-out, consider adjusting `VLLM_MEDIA_LOADING_THREAD_COUNT`
    to avoid CPU resource exhaustion.

!!! note
    API server scale-out disables [multi-modal IPC caching](#ipc-caching)
    because it requires a one-to-one correspondence between API and engine core processes.

    This does not impact [multi-modal processor caching](#processor-caching).
## Multi-Modal Caching

Multi-modal caching avoids repeated transfer or processing of the same multi-modal data,
which commonly occurs in multi-turn conversations.

### Processor Caching

Multi-modal processor caching is automatically enabled
to avoid repeatedly processing the same multi-modal inputs in `BaseMultiModalProcessor`.

### IPC Caching

Multi-modal IPC caching is automatically enabled when
there is a one-to-one correspondence between API (`P0`) and engine core (`P1`) processes,
to avoid repeatedly transferring the same multi-modal inputs between them.

#### Key-Replicated Cache

By default, IPC caching uses a **key-replicated cache**, where cache keys exist
in both the API (`P0`) and engine core (`P1`) processes, but the actual cache
data resides only in `P1`.

#### Shared Memory Cache

When multiple worker processes are involved (e.g., when TP > 1), a
**shared-memory cache** is more efficient. This can be enabled by setting
`mm_processor_cache_type="shm"`. In this mode, cache keys are stored
on `P0`, while the cache data itself lives in shared memory accessible by all
processes.

### Configuration

You can adjust the size of the cache by setting the value of `mm_processor_cache_gb` (default 4 GiB).

If you do not benefit much from the cache, you can disable both IPC
and processor caching completely via `mm_processor_cache_gb=0`.

Examples:

```python
from vllm import LLM

# Use a larger cache
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    mm_processor_cache_gb=8,
)

# Use a shared-memory based IPC cache
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    tensor_parallel_size=2,
    mm_processor_cache_type="shm",
    mm_processor_cache_gb=8,
)

# Disable the cache
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    mm_processor_cache_gb=0,
)
```

### Cache Placement

Based on the configuration, the contents of the multi-modal caches on `P0` and `P1` are as follows:

| `mm_processor_cache_type` | Cache Type | `P0` Cache | `P1` Engine Cache | `P1` Worker Cache | Max. Memory |
|---------------------------|------------|------------|-------------------|-------------------|-------------|
| `lru` | Processor Caching | K + V | N/A | N/A | `mm_processor_cache_gb * data_parallel_size` |
| `lru` | Key-Replicated Caching | K | K + V | N/A | `mm_processor_cache_gb * api_server_count` |
| `shm` | Shared Memory Caching | K | N/A | V | `mm_processor_cache_gb * api_server_count` |
| N/A | Disabled | N/A | N/A | N/A | `0` |

K: stores the hashes of multi-modal items.
V: stores the processed tensor data of multi-modal items.
docs/configuration/serve_args.md
# Server Arguments

The `vllm serve` command is used to launch the OpenAI-compatible server.

## CLI Arguments

To see the available options, take a look at the [CLI Reference](../cli/README.md)!

## Configuration file

You can load CLI arguments via a [YAML](https://yaml.org/) config file.
The argument names must be the long form of those outlined [above](serve_args.md).

For example:

```yaml
# config.yaml

model: meta-llama/Llama-3.1-8B-Instruct
host: "127.0.0.1"
port: 6379
uvicorn-log-level: "info"
```

To use the above config file:

```bash
vllm serve --config config.yaml
```

!!! note
    In case an argument is supplied both via the command line and the config file, the value from the command line takes precedence.
    The order of priority is `command line > config file values > defaults`.
    e.g. in `vllm serve SOME_MODEL --config config.yaml`, `SOME_MODEL` takes precedence over the `model` value in the config file.