Sync from v0.13

docs/.nav.yml  (new file)
@@ -0,0 +1,65 @@

nav:
  - Home: README.md
  - User Guide:
      - usage/README.md
      - Getting Started:
          - getting_started/quickstart.md
          - getting_started/installation
          - Examples: examples
      - General:
          - usage/v1_guide.md
          - usage/*
      - Inference and Serving:
          - serving/offline_inference.md
          - serving/openai_compatible_server.md
          - serving/*
          - serving/integrations
      - Deployment:
          - deployment/*
          - deployment/frameworks
          - deployment/integrations
      - Training: training
      - Configuration:
          - configuration/*
          - TPU: https://docs.vllm.ai/projects/tpu/en/latest/
      - Models:
          - models/supported_models.md
          - models/generative_models.md
          - models/pooling_models.md
          - models/extensions
          - Hardware Supported Models:
              - models/hardware_supported_models/*
              - TPU: https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/
      - Features: features
  - Developer Guide:
      - contributing/README.md
      - General:
          - glob: contributing/*
            flatten_single_child_sections: true
      - Model Implementation:
          - contributing/model/README.md
          - contributing/model/basic.md
          - contributing/model/registration.md
          - contributing/model/tests.md
          - contributing/model/multimodal.md
          - contributing/model/transcription.md
      - CI: contributing/ci
      - Design Documents:
          - Plugins:
              - design/*plugin*.md
          - design/*
      - Benchmarking:
          - benchmarking/README.md
          - benchmarking/cli.md
          - benchmarking/sweeps.md
          - benchmarking/dashboard.md
  - API Reference:
      - api/README.md
      - api/vllm
      - CLI Reference: cli
  - Community:
      - community/*
      - Governance: governance
      - Blog: https://blog.vllm.ai
      - Forum: https://discuss.vllm.ai
      - Slack: https://slack.vllm.ai
(deleted file)
@@ -1,20 +0,0 @@

# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
(modified file)
@@ -1,19 +1,68 @@
-# vLLM documents
+---
+hide:
+  - navigation
+  - toc
+---

-## Build the docs
+# Welcome to vLLM

-```bash
-# Install dependencies.
-pip install -r requirements-docs.txt
+<figure markdown="span">
+  { align="center" alt="vLLM Light" class="logo-light" width="60%" }
+  { align="center" alt="vLLM Dark" class="logo-dark" width="60%" }
+</figure>

-# Build the docs.
-make clean
-make html
-```
+<p style="text-align:center">
+<strong>Easy, fast, and cheap LLM serving for everyone
+</strong>
+</p>

-## Open the docs with your browser
+<p style="text-align:center">
+<script async defer src="https://buttons.github.io/buttons.js"></script>
+<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
+<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-show-count="true" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
+<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-show-count="true" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
+</p>

-```bash
-python -m http.server -d build/html/
-```
-
-Launch your browser and open localhost:8000.
+vLLM is a fast and easy-to-use library for LLM inference and serving.
+
+Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
+
+Where to get started with vLLM depends on the type of user. If you are looking to:
+
+- Run open-source models on vLLM, we recommend starting with the [Quickstart Guide](./getting_started/quickstart.md)
+- Build applications with vLLM, we recommend starting with the [User Guide](./usage/README.md)
+- Build vLLM, we recommend starting with the [Developer Guide](./contributing/README.md)
+
+For information about the development of vLLM, see:
+
+- [Roadmap](https://roadmap.vllm.ai)
+- [Releases](https://github.com/vllm-project/vllm/releases)
+
+vLLM is fast with:
+
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
+- Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
+- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
+- Speculative decoding
+- Chunked prefill
+
+vLLM is flexible and easy to use with:
+
+- Seamless integration with popular HuggingFace models
+- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
+- Tensor, pipeline, data, and expert parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPUs, plus hardware plugins for platforms such as Intel Gaudi, IBM Spyre, and Huawei Ascend
+- Prefix caching support
+- Multi-LoRA support
+
+For more information, check out the following:
+
+- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
+- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
+- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
+- [vLLM Meetups](community/meetups.md)
docs/api/README.md  (new file)
@@ -0,0 +1,98 @@

# Summary

## Configuration

API documentation for vLLM's configuration classes.

- [vllm.config.ModelConfig][]
- [vllm.config.CacheConfig][]
- [vllm.config.LoadConfig][]
- [vllm.config.ParallelConfig][]
- [vllm.config.SchedulerConfig][]
- [vllm.config.DeviceConfig][]
- [vllm.config.SpeculativeConfig][]
- [vllm.config.LoRAConfig][]
- [vllm.config.MultiModalConfig][]
- [vllm.config.PoolerConfig][]
- [vllm.config.StructuredOutputsConfig][]
- [vllm.config.ProfilerConfig][]
- [vllm.config.ObservabilityConfig][]
- [vllm.config.KVTransferConfig][]
- [vllm.config.CompilationConfig][]
- [vllm.config.VllmConfig][]

## Offline Inference

LLM Class.

- [vllm.LLM][]
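A minimal offline-inference sketch using the `LLM` class together with `SamplingParams` (the model name below is only an example; any supported model works):

```python
from vllm import LLM, SamplingParams

# Load a small model and sample up to 32 tokens per prompt.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```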
LLM Inputs.

- [vllm.inputs.PromptType][]
- [vllm.inputs.TextPrompt][]
- [vllm.inputs.TokensPrompt][]

## vLLM Engines

Engine classes for offline and online inference.

- [vllm.LLMEngine][]
- [vllm.AsyncLLMEngine][]

## Inference Parameters

Inference parameters for vLLM APIs.

- [vllm.SamplingParams][]
- [vllm.PoolingParams][]

## Multi-Modality

vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.

Multi-modal inputs can be passed alongside text and token prompts to [supported models](../models/supported_models.md#list-of-multimodal-language-models)
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
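For illustration, a hedged sketch of passing an image alongside a text prompt (the model name and chat template below are examples only; use a multi-modal model and its own prompt format):

```python
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # example multi-modal model
image = Image.open("example.jpg")            # any local image

# The image is supplied via the `multi_modal_data` field of the prompt dict.
outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is in this picture? ASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```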
Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).

- [vllm.multimodal.MULTIMODAL_REGISTRY][]

### Inputs

User-facing inputs.

- [vllm.multimodal.inputs.MultiModalDataDict][]

Internal data structures.

- [vllm.multimodal.inputs.PlaceholderRange][]
- [vllm.multimodal.inputs.NestedTensors][]
- [vllm.multimodal.inputs.MultiModalFieldElem][]
- [vllm.multimodal.inputs.MultiModalFieldConfig][]
- [vllm.multimodal.inputs.MultiModalKwargsItem][]
- [vllm.multimodal.inputs.MultiModalKwargsItems][]
- [vllm.multimodal.inputs.MultiModalKwargs][]
- [vllm.multimodal.inputs.MultiModalInputs][]

### Data Parsing

- [vllm.multimodal.parse][]

### Data Processing

- [vllm.multimodal.processing][]

### Memory Profiling

- [vllm.multimodal.profiling][]

### Registry

- [vllm.multimodal.registry][]

## Model Development

- [vllm.model_executor.models.interfaces_base][]
- [vllm.model_executor.models.interfaces][]
- [vllm.model_executor.models.adapters][]
docs/api/vllm/.meta.yml  (new file)
@@ -0,0 +1,2 @@

search:
  exclude: true
BIN  docs/assets/contributing/dockerfile-stages-dependency.png  (new file, 174 KiB)
BIN  docs/assets/contributing/load-pattern-examples.png  (new file, 577 KiB)
BIN  docs/assets/deployment/anything-llm-chat-with-doc.png  (new file, 118 KiB)
BIN  docs/assets/deployment/anything-llm-chat-without-doc.png  (new file, 136 KiB)
BIN  docs/assets/deployment/anything-llm-provider.png  (new file, 110 KiB)
BIN  docs/assets/deployment/anything-llm-upload-doc.png  (new file, 111 KiB)
BIN  docs/assets/deployment/architecture_helm_deployment.png  (new file, 968 KiB)
BIN  docs/assets/deployment/chatbox-chat.png  (new file, 107 KiB)
BIN  docs/assets/deployment/chatbox-settings.png  (new file, 95 KiB)
BIN  docs/assets/deployment/dify-chat.png  (new file, 143 KiB)
BIN  docs/assets/deployment/dify-create-chatbot.png  (new file, 265 KiB)
BIN  docs/assets/deployment/dify-settings.png  (new file, 52 KiB)
BIN  docs/assets/deployment/dp_external_lb.png  (new file, 84 KiB)
BIN  docs/assets/deployment/dp_internal_lb.png  (new file, 68 KiB)
BIN  docs/assets/deployment/hf-inference-endpoints-catalog.png  (new file, 627 KiB)
BIN  docs/assets/deployment/hf-inference-endpoints-choose-infra.png  (new file, 350 KiB)
BIN  …  (new file, 814 KiB)
BIN  …  (new file, 267 KiB)
BIN  …  (new file, 354 KiB)
BIN  …  (new file, 781 KiB)
BIN  docs/assets/deployment/hf-inference-endpoints-new-endpoint.png  (new file, 51 KiB)
BIN  …  (new file, 359 KiB)
BIN  docs/assets/deployment/hf-inference-endpoints-select-model.png  (new file, 82 KiB)
BIN  docs/assets/deployment/open_webui.png  (new file, 57 KiB)
BIN  docs/assets/deployment/streamlit-chat.png  (new file, 106 KiB)
BIN  docs/assets/design/arch_overview/entrypoints.excalidraw.png  (new file, 120 KiB)
BIN  docs/assets/design/arch_overview/llm_engine.excalidraw.png  (new file, 174 KiB)
BIN  docs/assets/design/cuda_graphs/current_design.png  (new file, 70 KiB)
BIN  docs/assets/design/cuda_graphs/executor_runtime.png  (new file, 60 KiB)
BIN  docs/assets/design/cuda_graphs/previous_design.png  (new file, 44 KiB)
BIN  docs/assets/design/cuda_graphs/wrapper_flow.png  (new file, 87 KiB)
BIN  docs/assets/design/debug_vllm_compile/design_diagram.png  (new file, 314 KiB)
BIN  docs/assets/design/debug_vllm_compile/dynamic_shapes.png  (new file, 359 KiB)
BIN  docs/assets/design/debug_vllm_compile/tlparse_inductor.png  (new file, 257 KiB)
BIN  …  (new file, 187 KiB)
BIN  …  (new file, 189 KiB)
BIN  …  (new file, 227 KiB)
BIN  …  (new file, 128 KiB)
BIN  docs/assets/design/hierarchy.png  (new file, 170 KiB)
BIN  …  (new file, 24 KiB)
BIN  docs/assets/design/hybrid_kv_cache_manager/full_attn.png  (new file, 4.0 KiB)
BIN  docs/assets/design/hybrid_kv_cache_manager/memory_layout.png  (new file, 62 KiB)
BIN  docs/assets/design/hybrid_kv_cache_manager/overview.png  (new file, 39 KiB)
BIN  docs/assets/design/hybrid_kv_cache_manager/sw_attn.png  (new file, 4.5 KiB)
BIN  docs/assets/design/metrics/intervals-1.png  (new file, 185 KiB)
BIN  docs/assets/design/metrics/intervals-2.png  (new file, 162 KiB)
BIN  docs/assets/design/metrics/intervals-3.png  (new file, 161 KiB)
BIN  …  (before: 27 KiB, after: 27 KiB)
BIN  …  (before: 109 KiB, after: 109 KiB)
BIN  …  (before: 17 KiB, after: 17 KiB)
BIN  …  (before: 41 KiB, after: 41 KiB)
BIN  …  (before: 32 KiB, after: 32 KiB)
BIN  …  (before: 42 KiB, after: 42 KiB)
BIN  …  (before: 167 KiB, after: 167 KiB)
BIN  docs/assets/design/prefix_caching/example-time-1.png  (new file, 47 KiB)
BIN  docs/assets/design/prefix_caching/example-time-3.png  (new file, 50 KiB)
BIN  docs/assets/design/prefix_caching/example-time-4.png  (new file, 59 KiB)
BIN  docs/assets/design/prefix_caching/example-time-5.png  (new file, 54 KiB)
BIN  docs/assets/design/prefix_caching/example-time-6.png  (new file, 54 KiB)
BIN  docs/assets/design/prefix_caching/example-time-7.png  (new file, 55 KiB)
BIN  docs/assets/design/prefix_caching/free.png  (new file, 18 KiB)
BIN  docs/assets/design/prefix_caching/overview.png  (new file, 32 KiB)
BIN  docs/assets/design/tpu/most_model_len.png  (new file, 12 KiB)
BIN  docs/assets/features/disagg_encoder/disagg_encoder_flow.png  (new file, 84 KiB)
BIN  docs/assets/features/disagg_prefill/abstraction.jpg  (new file, 102 KiB)
BIN  docs/assets/features/disagg_prefill/high_level_design.png  (new file, 91 KiB)
BIN  docs/assets/features/disagg_prefill/overview.jpg  (new file, 173 KiB)
BIN  docs/assets/features/disagg_prefill/workflow.png  (new file, 88 KiB)
BIN  docs/assets/logos/vllm-logo-only-light.ico  (new file, 17 KiB)
BIN  …  (before: 53 KiB, after: 53 KiB)
BIN  …  (before: 86 KiB, after: 86 KiB)
BIN  …  (before: 88 KiB, after: 88 KiB)
docs/benchmarking/README.md  (new file)
@@ -0,0 +1,7 @@

# Benchmark Suites

vLLM provides comprehensive benchmarking tools for performance testing and evaluation:

- **[Benchmark CLI](./cli.md)**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing.
- **[Parameter Sweeps](./sweeps.md)**: Automate `vllm bench` runs for multiple configurations, useful for [optimization and tuning](../configuration/optimization.md).
- **[Performance Dashboard](./dashboard.md)**: Automated CI that publishes benchmarks on each commit.

docs/benchmarking/cli.md  (new file, 1007 lines)
docs/benchmarking/dashboard.md  (new file)
@@ -0,0 +1,58 @@

# Performance Dashboard

The performance dashboard is used to confirm whether new changes improve or degrade performance under various workloads.
It is updated by triggering benchmark runs on every commit that carries both the `perf-benchmarks` and `ready` labels, and whenever a PR is merged into vLLM.

The results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).

## Manually Trigger the Benchmark

Use the [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with the vLLM benchmark suite.
For a CPU environment, use the image with the "-cpu" postfix.

Here is an example `docker run` command for CPU:

```bash
docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN='' --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu
```

Then run the command below inside the Docker instance:

```bash
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```

When run, the benchmark script generates results under the **benchmark/results** folder, along with `benchmark_results.md` and `benchmark_results.json`.

### Runtime environment variables

- `ON_CPU`: set the value to '1' on Intel® Xeon® processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
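For example, a run against a remote vLLM server from a CPU image might be set up like this (the host and port values are purely illustrative):

```bash
export ON_CPU=1               # running on Intel Xeon
export REMOTE_HOST=10.0.0.5   # hypothetical IP of the server under test
export REMOTE_PORT=8000       # hypothetical port of the server under test
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```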
For more on visualizing the results, see [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results).

More information on the performance benchmarks and their parameters can be found in the [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and the [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md).

## Continuous Benchmarking

Continuous benchmarking provides automated performance monitoring for vLLM across different models and GPU devices. This helps track vLLM's performance characteristics over time and identify any performance regressions or improvements.

### How It Works

The continuous benchmarking is triggered via a [GitHub workflow CI](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) in the PyTorch infrastructure repository, which runs automatically every 4 hours. The workflow executes three types of performance tests:

- **Serving tests**: Measure request handling and API performance
- **Throughput tests**: Evaluate token generation rates
- **Latency tests**: Assess response time characteristics

### Benchmark Configuration

The benchmarking currently runs on a predefined set of models configured in the [vllm-benchmarks directory](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks). To add new models for benchmarking:

1. Navigate to the appropriate GPU directory in the benchmarks configuration
2. Add your model specifications to the corresponding configuration files
3. The new models will be included in the next scheduled benchmark run
docs/benchmarking/sweeps.md  (new file)
@@ -0,0 +1,178 @@

# Parameter Sweeps

## Online Benchmark

### Basic

`vllm bench sweep serve` automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple configurations.

Follow these steps to run the script:

1. Construct the base command for `vllm serve`, and pass it to the `--serve-cmd` option.
2. Construct the base command for `vllm bench serve`, and pass it to the `--bench-cmd` option.
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.

    - Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:

        ```json
        [
            {
                "max_num_seqs": 32,
                "max_num_batched_tokens": 1024
            },
            {
                "max_num_seqs": 64,
                "max_num_batched_tokens": 1024
            },
            {
                "max_num_seqs": 64,
                "max_num_batched_tokens": 2048
            },
            {
                "max_num_seqs": 128,
                "max_num_batched_tokens": 2048
            },
            {
                "max_num_seqs": 128,
                "max_num_batched_tokens": 4096
            },
            {
                "max_num_seqs": 256,
                "max_num_batched_tokens": 4096
            }
        ]
        ```

4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.

    - Example: Using different input/output lengths for the random dataset:

        ```json
        [
            {
                "random_input_len": 128,
                "random_output_len": 32
            },
            {
                "random_input_len": 256,
                "random_output_len": 64
            },
            {
                "random_input_len": 512,
                "random_output_len": 128
            }
        ]
        ```

5. Determine where you want to save the results, and pass that to `--output-dir`.

Example command:

```bash
vllm bench sweep serve \
    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
    --serve-params benchmarks/serve_hparams.json \
    --bench-params benchmarks/bench_hparams.json \
    -o benchmarks/results
```

!!! important
    If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
    You can use `--dry-run` to preview the commands to be run.

We only start the server once for each `--serve-params` entry and keep it running for multiple `--bench-params` entries.
Between benchmark runs, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to get a clean slate for the next run, as shown below.
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
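For reference, the state reset between runs amounts to hitting these two server endpoints (assuming the server listens on localhost:8000):

```bash
curl -X POST http://localhost:8000/reset_prefix_cache
curl -X POST http://localhost:8000/reset_mm_cache
```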
!!! note
    By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.

!!! tip
    You can use the `--resume` option to continue the parameter sweep if one of the runs failed.

### SLA auto-tuner

`vllm bench sweep serve_sla` is a wrapper over `vllm bench sweep serve` that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.

For example, to ensure E2E latency within different target values for 99% of requests:

```json
[
    {
        "p99_e2el_ms": "<=200"
    },
    {
        "p99_e2el_ms": "<=500"
    },
    {
        "p99_e2el_ms": "<=1000"
    },
    {
        "p99_e2el_ms": "<=2000"
    }
]
```

Example command:

```bash
vllm bench sweep serve_sla \
    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
    --serve-params benchmarks/serve_hparams.json \
    --bench-params benchmarks/bench_hparams.json \
    --sla-params benchmarks/sla_hparams.json \
    --sla-variable max_concurrency \
    -o benchmarks/results
```

The algorithm for adjusting the SLA variable is as follows:

1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
    - For example, the initial request rate is set to the concurrency under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
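A minimal sketch of this doubling-plus-bisection strategy (illustrative only; `run_benchmark` and `meets_sla` are hypothetical stand-ins for the real runner, and the initial value is assumed to satisfy the SLA):

```python
def find_max_sla_value(initial: int, run_benchmark, meets_sla) -> int:
    lo = hi = max(initial, 1)
    # Step 2: keep doubling until the SLA is violated, giving a window (lo, hi].
    while meets_sla(run_benchmark(hi)):
        lo, hi = hi, hi * 2
    # Step 3: binary search inside the window for the largest passing value.
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if meets_sla(run_benchmark(mid)):
            lo = mid
        else:
            hi = mid
    return lo
```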
!!! important
    SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.

    For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.

## Visualization

### Basic

`vllm bench sweep plot` can be used to plot performance curves from parameter sweep results.

Example command:

```bash
vllm bench sweep plot benchmarks/results/<timestamp> \
    --var-x max_concurrency \
    --row-by random_input_len \
    --col-by random_output_len \
    --curve-by api_server_count,max_num_batched_tokens \
    --filter-by 'max_concurrency<=1024'
```

!!! tip
    You can use `--dry-run` to preview the figures to be plotted.

### Pareto chart

`vllm bench sweep plot_pareto` helps pick configurations that balance per-user and per-GPU throughput.

Higher concurrency or batch size can raise GPU efficiency (per-GPU throughput) but adds per-user latency; lower concurrency improves the per-user rate but underutilizes GPUs. The Pareto frontier shows the best achievable pairs across your runs.

- x-axis: tokens/s/user = `output_throughput` ÷ concurrency (`--user-count-var`, default `max_concurrency`, fallback `max_concurrent_requests`).
- y-axis: tokens/s/GPU = `output_throughput` ÷ GPU count (`--gpu-count-var` if set; otherwise `gpu_count` is TP × PP × DP).
- Output: a single figure at `OUTPUT_DIR/pareto/PARETO.png`.
- Use `--label-by` to show the configuration used for each data point (default: `max_concurrency,gpu_count`).
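A worked example of the two axes, using hypothetical numbers:

```python
output_throughput = 2048      # tok/s, taken from a benchmark result
max_concurrency = 64          # the --user-count-var value for that run
gpu_count = 2 * 1 * 1         # TP × PP × DP

tokens_per_s_per_user = output_throughput / max_concurrency  # 32.0
tokens_per_s_per_gpu = output_throughput / gpu_count          # 1024.0
```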
Example:

```bash
vllm bench sweep plot_pareto benchmarks/results/<timestamp> \
    --label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
```

docs/cli/.meta.yml  (new file)
@@ -0,0 +1 @@

toc_depth: 3

docs/cli/.nav.yml  (new file)
@@ -0,0 +1,8 @@

nav:
  - README.md
  - serve.md
  - chat.md
  - complete.md
  - run-batch.md
  - vllm bench:
      - bench/**/*.md
docs/cli/README.md  (new file)
@@ -0,0 +1,188 @@

# vLLM CLI Guide

The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:

```bash
vllm --help
```

Available Commands:

```bash
vllm {chat,complete,serve,bench,collect-env,run-batch}
```

## serve

Starts the vLLM OpenAI-compatible API server.

Start with a model:

```bash
vllm serve meta-llama/Llama-2-7b-hf
```

Specify the port:

```bash
vllm serve meta-llama/Llama-2-7b-hf --port 8100
```

Serve over a Unix domain socket:

```bash
vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
```

Check with `--help` for more options:

```bash
# To list all groups
vllm serve --help=listgroup

# To view an argument group
vllm serve --help=ModelConfig

# To view a single argument
vllm serve --help=max-num-seqs

# To search by keyword
vllm serve --help=max

# To view full help with pager (less/more)
vllm serve --help=page
```

See [vllm serve](./serve.md) for the full reference of all available arguments.

## chat

Generate chat completions via the running API server.

```bash
# Directly connect to localhost API without arguments
vllm chat

# Specify API url
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick chat with a single prompt
vllm chat --quick "hi"
```

See [vllm chat](./chat.md) for the full reference of all available arguments.

## complete

Generate text completions based on the given prompt via the running API server.

```bash
# Directly connect to localhost API without arguments
vllm complete

# Specify API url
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick complete with a single prompt
vllm complete --quick "The future of AI is"
```

See [vllm complete](./complete.md) for the full reference of all available arguments.

## bench

Run benchmark tests for latency, online serving throughput, and offline inference throughput.

To use benchmark commands, please install with extra dependencies using `pip install vllm[bench]`.

Available Commands:

```bash
vllm bench {latency, serve, throughput}
```

### latency

Benchmark the latency of a single batch of requests.

```bash
vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```

See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.

### serve

Benchmark the online serving throughput.

```bash
vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --host server-host \
    --port server-port \
    --random-input-len 32 \
    --random-output-len 4 \
    --num-prompts 5
```

See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.

### throughput

Benchmark offline inference throughput.

```bash
vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```

See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.

## collect-env

Start collecting environment information.

```bash
vllm collect-env
```

## run-batch

Run batch prompts and write results to a file.

Running with a local file:

```bash
vllm run-batch \
    -i offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```

Using a remote file:

```bash
vllm run-batch \
    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```

See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.

## More Help

For detailed options of any subcommand, use:

```bash
vllm <subcommand> --help
```
docs/cli/bench/latency.md  (new file)
@@ -0,0 +1,9 @@

# vllm bench latency

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/bench_latency.inc.md"

docs/cli/bench/serve.md  (new file)
@@ -0,0 +1,9 @@

# vllm bench serve

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/bench_serve.inc.md"

docs/cli/bench/sweep/plot.md  (new file)
@@ -0,0 +1,9 @@

# vllm bench sweep plot

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/bench_sweep_plot.inc.md"

docs/cli/bench/sweep/plot_pareto.md  (new file)
@@ -0,0 +1,9 @@

# vllm bench sweep plot_pareto

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/bench_sweep_plot_pareto.inc.md"

docs/cli/bench/sweep/serve.md  (new file)
@@ -0,0 +1,9 @@

# vllm bench sweep serve

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/bench_sweep_serve.inc.md"

docs/cli/bench/sweep/serve_sla.md  (new file)
@@ -0,0 +1,9 @@

# vllm bench sweep serve_sla

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/bench_sweep_serve_sla.inc.md"

docs/cli/bench/throughput.md  (new file)
@@ -0,0 +1,9 @@

# vllm bench throughput

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/bench_throughput.inc.md"

docs/cli/chat.md  (new file)
@@ -0,0 +1,5 @@

# vllm chat

## Arguments

--8<-- "docs/argparse/chat.inc.md"

docs/cli/complete.md  (new file)
@@ -0,0 +1,5 @@

# vllm complete

## Arguments

--8<-- "docs/argparse/complete.inc.md"
docs/cli/json_tip.inc.md  (new file)
@@ -0,0 +1,9 @@

When passing JSON CLI arguments, the following sets of arguments are equivalent:

- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
- `--json-arg.key1 value1 --json-arg.key2.key3 value2`

Additionally, list elements can be passed individually using `+`:

- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
docs/cli/run-batch.md  (new file)
@@ -0,0 +1,9 @@

# vllm run-batch

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/run-batch.inc.md"

docs/cli/serve.md  (new file)
@@ -0,0 +1,9 @@

# vllm serve

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"

## Arguments

--8<-- "docs/argparse/serve.inc.md"

docs/community/contact_us.md  (new file)
@@ -0,0 +1,3 @@

# Contact Us

--8<-- "README.md:contact-us"
docs/community/meetups.md  (new file)
@@ -0,0 +1,46 @@

# Meetups

We host regular meetups around the world, where the vLLM team shares project updates and guest speakers from industry share their experience and insights.

## Upcoming Meetups

Stay tuned for upcoming meetups! Follow us on [Twitter/X](https://x.com/vllm_project), join our [Slack](https://slack.vllm.ai), and follow vLLM on [Luma](https://luma.com/vLLM-Meetups) to get notified about new events.

## Past Meetups

Below you'll find slides and recordings from our previous meetups:

- [vLLM Bangkok Meetup](https://luma.com/v0f647nv), November 21st 2025. [[Slides]](https://drive.google.com/drive/folders/1H0DS57F8HQ5q3kSOSoRmucPJWL3E0A_X?usp=sharing)
- [vLLM Zurich Meetup](https://luma.com/0gls27kb), November 6th 2025. [[Slides]](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) [[Recording]](https://www.youtube.com/watch?v=6m6ZE6yVEDI)
- [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w), November 1st 2025. [[Slides]](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link)
- [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg), October 25th 2025. [[Slides]](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6)
- [vLLM Toronto Meetup](https://luma.com/e80e0ymm), September 25th 2025. [[Slides]](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing)
- [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ), August 30th 2025. [[Slides]](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA)
- [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet), August 27th 2025. [[Slides]](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing)
- [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg), August 23rd 2025. [[Slides]](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH)
- [vLLM Korea Meetup](https://luma.com/cgcgprmh), August 19th 2025. [[Slides]](https://drive.google.com/file/d/1bcrrAE1rxUgx0mjIeOWT6hNe2RefC5Hm/view)
- [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA), August 2nd 2025. [[Slides]](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF) [[Recording]](https://www.chaspark.com/#/live/1166916873711665152)
- [NYC vLLM Meetup](https://lu.ma/c1rqyf1f), May 7th 2025. [[Slides]](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing)
- [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day), April 3rd 2025. [[Slides]](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing)
- [vLLM x Ollama Inference Night](https://lu.ma/vllm-ollama), March 27th 2025. [[Slides]](https://docs.google.com/presentation/d/16T2PDD1YwRnZ4Tu8Q5r6n53c5Lr5c73UV9Vd2_eBo4U/edit?usp=sharing)
- [The first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg), March 16th 2025. [[Slides]](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing)
- [The East Coast vLLM Meetup](https://lu.ma/7mu4k4xx), March 11th 2025. [[Slides]](https://docs.google.com/presentation/d/1NHiv8EUFF1NLd3fEYODm56nDmL26lEeXCaDgyDlTsRs/edit#slide=id.g31441846c39_0_0)
- [The ninth vLLM meetup](https://lu.ma/h7g3kuj9), with Meta, February 27th 2025. [[Slides]](https://docs.google.com/presentation/d/1jzC_PZVXrVNSFVCW-V4cFXb6pn7zZ2CyP_Flwo05aqg/edit?usp=sharing)
- [The eighth vLLM meetup](https://lu.ma/zep56hui), with Google Cloud, January 22nd 2025. [[Slides]](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing)
- [The seventh vLLM meetup](https://lu.ma/h0qvrajz), with Snowflake, November 14th 2024. [[Slides]](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing)
- [The sixth vLLM meetup](https://lu.ma/87q3nvnh), with NVIDIA, September 9th 2024. [[Slides]](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing)
- [The fifth vLLM meetup](https://lu.ma/lp0gyjqr), with AWS, July 24th 2024. [[Slides]](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing)
- [The fourth vLLM meetup](https://lu.ma/agivllm), with Cloudflare and BentoML, June 11th 2024. [[Slides]](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing)
- [The third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/), with Roblox, April 2nd 2024. [[Slides]](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing)
- [The second vLLM meetup](https://lu.ma/ygxbpzhl), with IBM Research, January 31st 2024. [[Slides]](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing) [[Video (vLLM Update)]](https://youtu.be/Y0C-DUvEnZQ) [[Video (IBM Research & torch.compile)]](https://youtu.be/m0dMtFLI-dg)
- [The first vLLM meetup](https://lu.ma/first-vllm-meetup), with a16z, October 5th 2023. [[Slides]](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing)

## Get Involved

**Want to host or speak at a vLLM meetup?** We're always looking for speakers and sponsors for our meetups. Whether you want to:

- Share your vLLM feature, use case, project extension, or deployment experience
- Host a meetup in your city
- Sponsor an event

please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu).
docs/community/sponsors.md  (new file)
@@ -0,0 +1,44 @@

# Sponsors

vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!

<!-- Note: Please sort them in alphabetical order. -->
<!-- Note: Please keep these consistent with README.md. -->

Cash Donations:

- a16z
- Dropbox
- Sequoia Capital
- Skywork AI
- ZhenFund

Compute Resources:

- Alibaba Cloud
- AMD
- Anyscale
- Arm
- AWS
- Crusoe Cloud
- Databricks
- DeepInfra
- Google Cloud
- IBM
- Intel
- Lambda Lab
- Nebius
- Novita AI
- NVIDIA
- Red Hat
- Replicate
- Roblox
- RunPod
- Trainy
- UC Berkeley
- UC San Diego
- Volcengine

Slack Sponsor: Anyscale

We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.
docs/configuration/README.md  (new file)
@@ -0,0 +1,9 @@

# Configuration Options

This section lists the most common options for running vLLM.

There are three main levels of configuration, from highest priority to lowest priority (see the example after this list):

- [Request parameters](../serving/openai_compatible_server.md#completions-api) and [input arguments](../api/README.md#inference-parameters)
- [Engine arguments](./engine_args.md)
- [Environment variables](./env_vars.md)
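An illustrative sketch of the first two levels (model name, port, and values are assumptions): an engine argument sets a server-wide limit, while request parameters adjust behavior per request within that limit.

```bash
# Engine argument: cap the context window for the whole server.
vllm serve meta-llama/Llama-2-7b-hf --max-model-len 4096

# Request parameters: per-request settings such as max_tokens and
# temperature take priority, within the limits the engine allows.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 64, "temperature": 0.2}'
```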