Sync from v0.13
This commit is contained in:
1
docs/cli/.meta.yml
Normal file
1
docs/cli/.meta.yml
Normal file
@@ -0,0 +1 @@
|
||||
toc_depth: 3
|
||||
8
docs/cli/.nav.yml
Normal file
8
docs/cli/.nav.yml
Normal file
@@ -0,0 +1,8 @@
|
||||
nav:
|
||||
- README.md
|
||||
- serve.md
|
||||
- chat.md
|
||||
- complete.md
|
||||
- run-batch.md
|
||||
- vllm bench:
|
||||
- bench/**/*.md
|
||||
188
docs/cli/README.md
Normal file
188
docs/cli/README.md
Normal file
@@ -0,0 +1,188 @@
|
||||
# vLLM CLI Guide
|
||||
|
||||
The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
|
||||
|
||||
```bash
|
||||
vllm --help
|
||||
```
|
||||
|
||||
Available Commands:
|
||||
|
||||
```bash
|
||||
vllm {chat,complete,serve,bench,collect-env,run-batch}
|
||||
```
|
||||
|
||||
## serve
|
||||
|
||||
Starts the vLLM OpenAI Compatible API server.
|
||||
|
||||
Start with a model:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-2-7b-hf
|
||||
```
|
||||
|
||||
Specify the port:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-2-7b-hf --port 8100
|
||||
```
|
||||
|
||||
Serve over a Unix domain socket:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
|
||||
```
|
||||
|
||||
Check with --help for more options:
|
||||
|
||||
```bash
|
||||
# To list all groups
|
||||
vllm serve --help=listgroup
|
||||
|
||||
# To view a argument group
|
||||
vllm serve --help=ModelConfig
|
||||
|
||||
# To view a single argument
|
||||
vllm serve --help=max-num-seqs
|
||||
|
||||
# To search by keyword
|
||||
vllm serve --help=max
|
||||
|
||||
# To view full help with pager (less/more)
|
||||
vllm serve --help=page
|
||||
```
|
||||
|
||||
See [vllm serve](./serve.md) for the full reference of all available arguments.
|
||||
|
||||
## chat
|
||||
|
||||
Generate chat completions via the running API server.
|
||||
|
||||
```bash
|
||||
# Directly connect to localhost API without arguments
|
||||
vllm chat
|
||||
|
||||
# Specify API url
|
||||
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
|
||||
|
||||
# Quick chat with a single prompt
|
||||
vllm chat --quick "hi"
|
||||
```
|
||||
|
||||
See [vllm chat](./chat.md) for the full reference of all available arguments.
|
||||
|
||||
## complete
|
||||
|
||||
Generate text completions based on the given prompt via the running API server.
|
||||
|
||||
```bash
|
||||
# Directly connect to localhost API without arguments
|
||||
vllm complete
|
||||
|
||||
# Specify API url
|
||||
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
|
||||
|
||||
# Quick complete with a single prompt
|
||||
vllm complete --quick "The future of AI is"
|
||||
```
|
||||
|
||||
See [vllm complete](./complete.md) for the full reference of all available arguments.
|
||||
|
||||
## bench
|
||||
|
||||
Run benchmark tests for latency online serving throughput and offline inference throughput.
|
||||
|
||||
To use benchmark commands, please install with extra dependencies using `pip install vllm[bench]`.
|
||||
|
||||
Available Commands:
|
||||
|
||||
```bash
|
||||
vllm bench {latency, serve, throughput}
|
||||
```
|
||||
|
||||
### latency
|
||||
|
||||
Benchmark the latency of a single batch of requests.
|
||||
|
||||
```bash
|
||||
vllm bench latency \
|
||||
--model meta-llama/Llama-3.2-1B-Instruct \
|
||||
--input-len 32 \
|
||||
--output-len 1 \
|
||||
--enforce-eager \
|
||||
--load-format dummy
|
||||
```
|
||||
|
||||
See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.
|
||||
|
||||
### serve
|
||||
|
||||
Benchmark the online serving throughput.
|
||||
|
||||
```bash
|
||||
vllm bench serve \
|
||||
--model meta-llama/Llama-3.2-1B-Instruct \
|
||||
--host server-host \
|
||||
--port server-port \
|
||||
--random-input-len 32 \
|
||||
--random-output-len 4 \
|
||||
--num-prompts 5
|
||||
```
|
||||
|
||||
See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
|
||||
|
||||
### throughput
|
||||
|
||||
Benchmark offline inference throughput.
|
||||
|
||||
```bash
|
||||
vllm bench throughput \
|
||||
--model meta-llama/Llama-3.2-1B-Instruct \
|
||||
--input-len 32 \
|
||||
--output-len 1 \
|
||||
--enforce-eager \
|
||||
--load-format dummy
|
||||
```
|
||||
|
||||
See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.
|
||||
|
||||
## collect-env
|
||||
|
||||
Start collecting environment information.
|
||||
|
||||
```bash
|
||||
vllm collect-env
|
||||
```
|
||||
|
||||
## run-batch
|
||||
|
||||
Run batch prompts and write results to file.
|
||||
|
||||
Running with a local file:
|
||||
|
||||
```bash
|
||||
vllm run-batch \
|
||||
-i offline_inference/openai_batch/openai_example_batch.jsonl \
|
||||
-o results.jsonl \
|
||||
--model meta-llama/Meta-Llama-3-8B-Instruct
|
||||
```
|
||||
|
||||
Using remote file:
|
||||
|
||||
```bash
|
||||
vllm run-batch \
|
||||
-i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
|
||||
-o results.jsonl \
|
||||
--model meta-llama/Meta-Llama-3-8B-Instruct
|
||||
```
|
||||
|
||||
See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.
|
||||
|
||||
## More Help
|
||||
|
||||
For detailed options of any subcommand, use:
|
||||
|
||||
```bash
|
||||
vllm <subcommand> --help
|
||||
```
|
||||
9
docs/cli/bench/latency.md
Normal file
9
docs/cli/bench/latency.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench latency
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/bench_latency.inc.md"
|
||||
9
docs/cli/bench/serve.md
Normal file
9
docs/cli/bench/serve.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench serve
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/bench_serve.inc.md"
|
||||
9
docs/cli/bench/sweep/plot.md
Normal file
9
docs/cli/bench/sweep/plot.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench sweep plot
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/bench_sweep_plot.inc.md"
|
||||
9
docs/cli/bench/sweep/plot_pareto.md
Normal file
9
docs/cli/bench/sweep/plot_pareto.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench sweep plot_pareto
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/bench_sweep_plot_pareto.inc.md"
|
||||
9
docs/cli/bench/sweep/serve.md
Normal file
9
docs/cli/bench/sweep/serve.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench sweep serve
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/bench_sweep_serve.inc.md"
|
||||
9
docs/cli/bench/sweep/serve_sla.md
Normal file
9
docs/cli/bench/sweep/serve_sla.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench sweep serve_sla
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/bench_sweep_serve_sla.inc.md"
|
||||
9
docs/cli/bench/throughput.md
Normal file
9
docs/cli/bench/throughput.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm bench throughput
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/bench_throughput.inc.md"
|
||||
5
docs/cli/chat.md
Normal file
5
docs/cli/chat.md
Normal file
@@ -0,0 +1,5 @@
|
||||
# vllm chat
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/chat.inc.md"
|
||||
5
docs/cli/complete.md
Normal file
5
docs/cli/complete.md
Normal file
@@ -0,0 +1,5 @@
|
||||
# vllm complete
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/complete.inc.md"
|
||||
9
docs/cli/json_tip.inc.md
Normal file
9
docs/cli/json_tip.inc.md
Normal file
@@ -0,0 +1,9 @@
|
||||
When passing JSON CLI arguments, the following sets of arguments are equivalent:
|
||||
|
||||
- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
|
||||
- `--json-arg.key1 value1 --json-arg.key2.key3 value2`
|
||||
|
||||
Additionally, list elements can be passed individually using `+`:
|
||||
|
||||
- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
|
||||
- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
|
||||
9
docs/cli/run-batch.md
Normal file
9
docs/cli/run-batch.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm run-batch
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/run-batch.inc.md"
|
||||
9
docs/cli/serve.md
Normal file
9
docs/cli/serve.md
Normal file
@@ -0,0 +1,9 @@
|
||||
# vllm serve
|
||||
|
||||
## JSON CLI Arguments
|
||||
|
||||
--8<-- "docs/cli/json_tip.inc.md"
|
||||
|
||||
## Arguments
|
||||
|
||||
--8<-- "docs/argparse/serve.inc.md"
|
||||
Reference in New Issue
Block a user