Sync from v0.13

2026-01-19 10:38:50 +08:00
parent b2ef04d792
commit 5aef6c175a
3714 changed files with 854317 additions and 89342 deletions
--- a/docs/cli/.meta.yml
+++ b/docs/cli/.meta.yml
@@ -0,0 +1 @@
+toc_depth: 3
--- a/docs/cli/.nav.yml
+++ b/docs/cli/.nav.yml
@@ -0,0 +1,8 @@
+nav:
+  - README.md
+  - serve.md
+  - chat.md
+  - complete.md
+  - run-batch.md
+  - vllm bench:
+    - bench/**/*.md
--- a/docs/cli/README.md
+++ b/docs/cli/README.md
@@ -0,0 +1,188 @@
+# vLLM CLI Guide
+
+The `vllm` command-line tool runs and manages vLLM models. To see the top-level help message, run:
+
+```bash
+vllm --help
+```
+
+Available Commands:
+
+```bash
+vllm {chat,complete,serve,bench,collect-env,run-batch}
+```
+
+## serve
+
+Start the vLLM OpenAI-compatible API server.
+
+Start with a model:
+
+```bash
+vllm serve meta-llama/Llama-2-7b-hf
+```
+
+Specify the port:
+
+```bash
+vllm serve meta-llama/Llama-2-7b-hf --port 8100
+```
+
+Serve over a Unix domain socket:
+
+```bash
+vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
+```
+
+Use `--help` to explore more options:
+
+```bash
+# To list all groups
+vllm serve --help=listgroup
+
+# To view an argument group
+vllm serve --help=ModelConfig
+
+# To view a single argument
+vllm serve --help=max-num-seqs
+
+# To search by keyword
+vllm serve --help=max
+
+# To view the full help in a pager (less/more)
+vllm serve --help=page
+```
+
+See [vllm serve](./serve.md) for the full reference of all available arguments.
+
+## chat
+
+Generate chat completions via the running API server.
+
+```bash
+# Directly connect to localhost API without arguments
+vllm chat
+
+# Specify the API URL
+vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
+
+# Quick chat with a single prompt
+vllm chat --quick "hi"
+```
+
+See [vllm chat](./chat.md) for the full reference of all available arguments.
+
+## complete
+
+Generate text completions based on the given prompt via the running API server.
+
+```bash
+# Directly connect to localhost API without arguments
+vllm complete
+
+# Specify the API URL
+vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
+
+# Quick complete with a single prompt
+vllm complete --quick "The future of AI is"
+```
+
+See [vllm complete](./complete.md) for the full reference of all available arguments.
+
+## bench
+
+Run benchmark tests for latency, online serving throughput, and offline inference throughput.
+
+To use the benchmark commands, install vLLM with the extra benchmark dependencies using `pip install vllm[bench]`.
+
+Available Commands:
+
+```bash
+vllm bench {latency,serve,throughput}
+```
+
+### latency
+
+Benchmark the latency of a single batch of requests.
+
+```bash
+vllm bench latency \
+    --model meta-llama/Llama-3.2-1B-Instruct \
+    --input-len 32 \
+    --output-len 1 \
+    --enforce-eager \
+    --load-format dummy
+```
+
+See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.
+
+### serve
+
+Benchmark the online serving throughput.
+
+```bash
+vllm bench serve \
+    --model meta-llama/Llama-3.2-1B-Instruct \
+    --host server-host \
+    --port server-port \
+    --random-input-len 32 \
+    --random-output-len 4 \
+    --num-prompts 5
+```
+
+See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
+
+### throughput
+
+Benchmark offline inference throughput.
+
+```bash
+vllm bench throughput \
+    --model meta-llama/Llama-3.2-1B-Instruct \
+    --input-len 32 \
+    --output-len 1 \
+    --enforce-eager \
+    --load-format dummy
+```
+
+See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.
+
+## collect-env
+
+Collect environment information.
+
+```bash
+vllm collect-env
+```
+
+## run-batch
+
+Run batch prompts and write the results to a file.
+
+Using a local file:
+
+```bash
+vllm run-batch \
+    -i offline_inference/openai_batch/openai_example_batch.jsonl \
+    -o results.jsonl \
+    --model meta-llama/Meta-Llama-3-8B-Instruct
+```
+
+Using a remote file:
+
+```bash
+vllm run-batch \
+    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
+    -o results.jsonl \
+    --model meta-llama/Meta-Llama-3-8B-Instruct
+```
+
+See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.
+
+## More Help
+
+For detailed options of any subcommand, use:
+
+```bash
+vllm <subcommand> --help
+```
--- a/docs/cli/bench/latency.md
+++ b/docs/cli/bench/latency.md
@@ -0,0 +1,9 @@
+# vllm bench latency
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/bench_latency.inc.md"
--- a/docs/cli/bench/serve.md
+++ b/docs/cli/bench/serve.md
@@ -0,0 +1,9 @@
+# vllm bench serve
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/bench_serve.inc.md"
--- a/docs/cli/bench/sweep/plot.md
+++ b/docs/cli/bench/sweep/plot.md
@@ -0,0 +1,9 @@
+# vllm bench sweep plot
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/bench_sweep_plot.inc.md"
--- a/docs/cli/bench/sweep/plot_pareto.md
+++ b/docs/cli/bench/sweep/plot_pareto.md
@@ -0,0 +1,9 @@
+# vllm bench sweep plot_pareto
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/bench_sweep_plot_pareto.inc.md"
--- a/docs/cli/bench/sweep/serve.md
+++ b/docs/cli/bench/sweep/serve.md
@@ -0,0 +1,9 @@
+# vllm bench sweep serve
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/bench_sweep_serve.inc.md"
--- a/docs/cli/bench/sweep/serve_sla.md
+++ b/docs/cli/bench/sweep/serve_sla.md
@@ -0,0 +1,9 @@
+# vllm bench sweep serve_sla
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/bench_sweep_serve_sla.inc.md"
--- a/docs/cli/bench/throughput.md
+++ b/docs/cli/bench/throughput.md
@@ -0,0 +1,9 @@
+# vllm bench throughput
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/bench_throughput.inc.md"
--- a/docs/cli/chat.md
+++ b/docs/cli/chat.md
@@ -0,0 +1,5 @@
+# vllm chat
+
+## Arguments
+
+--8<-- "docs/argparse/chat.inc.md"
--- a/docs/cli/complete.md
+++ b/docs/cli/complete.md
@@ -0,0 +1,5 @@
+# vllm complete
+
+## Arguments
+
+--8<-- "docs/argparse/complete.inc.md"
--- a/docs/cli/json_tip.inc.md
+++ b/docs/cli/json_tip.inc.md
@@ -0,0 +1,9 @@
+When passing JSON CLI arguments, the following sets of arguments are equivalent:
+
+- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
+- `--json-arg.key1 value1 --json-arg.key2.key3 value2`
+
+Additionally, list elements can be passed individually using `+`:
+
+- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
+- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
--- a/docs/cli/run-batch.md
+++ b/docs/cli/run-batch.md
@@ -0,0 +1,9 @@
+# vllm run-batch
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/run-batch.inc.md"
--- a/docs/cli/serve.md
+++ b/docs/cli/serve.md
@@ -0,0 +1,9 @@
+# vllm serve
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Arguments
+
+--8<-- "docs/argparse/serve.inc.md"