[Doc] Update max_tokens to max_completion_tokens in all docs (#6248)
### What this PR does / why we need it?
Fix:
```
DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field.
```
- vLLM version: v0.14.1
- vLLM main: d68209402d
Signed-off-by: shen-shanshan <467638484@qq.com>
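For illustration only, the whole PR amounts to renaming one request field. A minimal Python sketch of that rename (the helper `migrate_payload` is hypothetical, not part of this PR or of vLLM):

```python
# Hypothetical helper (not from this PR): rename the deprecated
# `max_tokens` request field to `max_completion_tokens`, which is
# what the docs in this commit do by hand.
def migrate_payload(payload: dict) -> dict:
    """Return a copy of an OpenAI-style request body with `max_tokens`
    renamed to `max_completion_tokens`; an existing new-style key wins."""
    out = dict(payload)
    if "max_tokens" in out and "max_completion_tokens" not in out:
        out["max_completion_tokens"] = out.pop("max_tokens")
    return out

body = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "The future of AI is",
    "max_tokens": 7,
    "temperature": 0,
}
print(migrate_payload(body)["max_completion_tokens"])  # → 7
```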
@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
     "model": "Qwen/Qwen2.5-7B-Instruct",
     "prompt": "The future of AI is",
-    "max_tokens": 7,
+    "max_completion_tokens": 7,
     "temperature": 0
 }'
 ```
@@ -66,7 +66,7 @@ curl http://localhost:8000/v1/completions \
     "C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
     "D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
     "<|im_start|>assistant\n"'",
-    "max_tokens": 1,
+    "max_completion_tokens": 1,
     "temperature": 0,
     "stop": ["<|im_end|>"]
 }' | python3 -m json.tool
@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
     "model": "Qwen/Qwen2.5-7B-Instruct",
     "prompt": "The future of AI is",
-    "max_tokens": 7,
+    "max_completion_tokens": 7,
     "temperature": 0
 }'
 ```
@@ -39,7 +39,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
 #### Mooncake Connector

 1. The request is sent to the Proxy’s `_handle_completions` endpoint.
-2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
+2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
 3. After the P node’s scheduler finishes prefill, `update_from_output` invokes the schedule connector’s `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
 4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
 5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
@@ -49,7 +49,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
 1. The request is sent to the Proxy’s `_handle_completions` endpoint.
 2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
 3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
-4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
+4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
 5. During processing, the P node’s scheduler pushes KV cache layer-wise; once all layers pushing is complete, it releases the request and notifies the D node to begin decoding.
 6. The D node performs decoding and returns the result.
@@ -112,7 +112,7 @@ JSON
 -d '{
     "model": "Qwen/Qwen2.5-0.5B-Instruct",
     "prompt": "Explain gravity in one sentence.",
-    "max_tokens": 32,
+    "max_completion_tokens": 32,
     "temperature": 0
 }' | python -m json.tool
 ```
@@ -50,7 +50,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
     "model": "Qwen/Qwen2.5-0.5B-Instruct",
     "prompt": "Beijing is a",
-    "max_tokens": 5,
+    "max_completion_tokens": 5,
     "temperature": 0
 }' | python3 -m json.tool
 ```
@@ -168,7 +168,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
     "model": "Qwen/Qwen2.5-0.5B-Instruct",
     "prompt": "Beijing is a",
-    "max_tokens": 5,
+    "max_completion_tokens": 5,
     "temperature": 0
 }' | python3 -m json.tool
 ```
@@ -78,7 +78,7 @@ curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
     "prompt": "The future of AI is",
-    "max_tokens": 64,
+    "max_completion_tokens": 64,
     "top_p": 0.95,
    "top_k": 50,
     "temperature": 0.6
@@ -108,7 +108,7 @@ curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
     "prompt": "The future of AI is",
-    "max_tokens": 64,
+    "max_completion_tokens": 64,
     "top_p": 0.95,
     "top_k": 50,
     "temperature": 0.6
@@ -138,7 +138,7 @@ curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
     "prompt": "The future of AI is",
-    "max_tokens": 64,
+    "max_completion_tokens": 64,
     "top_p": 0.95,
     "top_k": 50,
     "temperature": 0.6
@@ -179,7 +179,7 @@ curl http://localhost:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
     "prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
-    "max_tokens": 64,
+    "max_completion_tokens": 64,
     "top_p": 0.95,
     "top_k": 50,
     "temperature": 0.6
@@ -221,7 +221,7 @@ prompts = [
     "The future of AI is",
 ]
 # Create a sampling params object.
-sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
+sampling_params = SamplingParams(max_completion_tokens=100, temperature=0.0)
 # Create an LLM.
 llm = LLM(
     model="Qwen/Qwen3-0.6B",
@@ -264,7 +264,7 @@ prompts = [
     "The future of AI is",
 ]
 # Create a sampling params object.
-sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
+sampling_params = SamplingParams(max_completion_tokens=100, temperature=0.0)
 # Create an LLM.
 llm = LLM(
     model="Qwen/Qwen2.5-7B-Instruct",
@@ -307,7 +307,7 @@ prompts = [
     "The future of AI is",
 ]
 # Create a sampling params object.
-sampling_params = SamplingParams(max_tokens=100, top_p=0.95, top_k=50, temperature=0.6)
+sampling_params = SamplingParams(max_completion_tokens=100, top_p=0.95, top_k=50, temperature=0.6)
 # Create an LLM.
 llm = LLM(
     model="Qwen/Qwen2.5-VL-3B-Instruct",
@@ -243,7 +243,7 @@ curl http://<node0_ip>:<port>/v1/completions \
 -d '{
     "model": "deepseek_r1",
     "prompt": "The future of AI is",
-    "max_tokens": 50,
+    "max_completion_tokens": 50,
     "temperature": 0
 }'
 ```
@@ -661,7 +661,7 @@ curl http://<node0_ip>:<port>/v1/completions \
 -d '{
     "model": "deepseek_v3",
     "prompt": "The future of AI is",
-    "max_tokens": 50,
+    "max_completion_tokens": 50,
     "temperature": 0
 }'
 ```
@@ -840,7 +840,7 @@ curl http://<node0_ip>:<port>/v1/completions \
 -d '{
     "model": "deepseek_v3.2",
     "prompt": "The future of AI is",
-    "max_tokens": 50,
+    "max_completion_tokens": 50,
     "temperature": 0
 }'
 ```
@@ -142,7 +142,7 @@ llm = LLM(
 )

 sampling_params = SamplingParams(
-    max_tokens=512
+    max_completion_tokens=512
 )

 image_messages = [
@@ -238,7 +238,7 @@ llm = LLM(
 )

 sampling_params = SamplingParams(
-    max_tokens=512
+    max_completion_tokens=512
 )

 image_messages = [
@@ -127,7 +127,7 @@ curl http://<IP>:<Port>/v1/completions \
 -d '{
     "model": "qwen-2.5-7b-instruct",
     "prompt": "Beijing is a",
-    "max_tokens": 5,
+    "max_completion_tokens": 5,
     "temperature": 0
 }'
 ```
@@ -156,7 +156,7 @@ curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/j
             ]
         }
     ],
-    "max_tokens": 100,
+    "max_completion_tokens": 100,
     "temperature": 0.7
 }'
@@ -269,7 +269,7 @@ curl http://<node0_ip>:<port>/v1/completions \
 -d '{
     "model": "qwen3",
     "prompt": "The future of AI is",
-    "max_tokens": 50,
+    "max_completion_tokens": 50,
     "temperature": 0
 }'
 ```
@@ -62,7 +62,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
     "temperature": 0.6,
     "top_p": 0.95,
     "top_k": 20,
-    "max_tokens": 4096
+    "max_completion_tokens": 4096
 }'
 ```
@@ -108,7 +108,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
     "model": "qwen3-32b-w4a4",
     "prompt": "what is large language model?",
-    "max_tokens": "128",
+    "max_completion_tokens": "128",
     "top_p": "0.95",
     "top_k": "40",
     "temperature": "0.0"
@@ -106,7 +106,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
     "model": "qwen3-8b-w4a8",
     "prompt": "what is large language model?",
-    "max_tokens": "128",
+    "max_completion_tokens": "128",
     "top_p": "0.95",
     "top_k": "40",
     "temperature": "0.0"
@@ -82,7 +82,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
     "temperature": 0.6,
     "top_p": 0.95,
     "top_k": 20,
-    "max_tokens": 4096
+    "max_completion_tokens": 4096
 }'
 ```
@@ -214,7 +214,7 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso
     "temperature": 0.6,
     "top_p": 0.95,
     "top_k": 20,
-    "max_tokens": 4096
+    "max_completion_tokens": 4096
 }'
 ```
@@ -75,7 +75,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
     "temperature": 0.6,
     "top_p": 0.95,
     "top_k": 20,
-    "max_tokens": 32
+    "max_completion_tokens": 32
 }'
 ```
@@ -103,7 +103,7 @@ if __name__ == '__main__':
     prompts = [
         "Who are you?",
     ]
-    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
+    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
     llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
               tensor_parallel_size=4,
               enforce_eager=True,
@@ -123,7 +123,7 @@ def main():
         temperature=0.6,
         top_p=0.95,
         top_k=20,
-        max_tokens=16384,
+        max_completion_tokens=16384,
     )

     processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
@@ -243,7 +243,7 @@ evalscope eval \
     --datasets omni_bench, gsm8k, bbh \
     --dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
     --eval-batch-size 1 \
-    --generation-config '{"max_tokens": 10000, "temperature": 0.6}' \
+    --generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
     --limit 100
 ```
@@ -120,7 +120,7 @@ curl http://localhost:8000/v1/chat/completions \
         {"type": "text", "text": "What is the text in the illustrate?"}
         ]}
     ],
-    "max_tokens": 100
+    "max_completion_tokens": 100
 }'
 ```
@@ -182,7 +182,7 @@ curl http://localhost:8000/v1/chat/completions \
         {"type": "text", "text": "What is in this video?"}
         ]}
     ],
-    "max_tokens": 100
+    "max_completion_tokens": 100
 }'
 ```
@@ -932,7 +932,7 @@ curl http://192.0.0.1:8080/v1/completions \
 -d '{
     "model": "qwen3-moe",
     "prompt": "Who are you?",
-    "max_tokens": 100,
+    "max_completion_tokens": 100,
     "temperature": 0
 }'
 ```
@@ -271,7 +271,7 @@ curl http://192.0.0.1:8080/v1/chat/completions \
         {"type": "text", "text": "What is the text in the illustrate?"}
         ]}
     ],
-    "max_tokens": 100,
+    "max_completion_tokens": 100,
     "temperature": 0
 }'
 ```
@@ -186,7 +186,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
     "model": "qwen",
     "prompt": "tell me how to sleep well",
-    "max_tokens": 100,
+    "max_completion_tokens": 100,
     "temperature": 0
 }'
 ```
@@ -296,13 +296,13 @@ Configure the localhost, port, and model weight path in the command to your own
 Short question:

 ```shell
-curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
+curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
 ```

 Long question:

 ```shell
-curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
+curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
 ```

 ### Colocation Scenario
@@ -355,11 +355,11 @@ Configure the localhost, port, and model weight path in the command to your own
 Short question:

 ```shell
-curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
+curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
 ```

 Long question:

 ```shell
-curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
+curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
 ```
@@ -51,7 +51,7 @@ The following is a simple example of how to use sleep mode.
 # record npu memory use baseline in case other process is running
 used_bytes_baseline = total - free
 llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
-sampling_params = SamplingParams(temperature=0, max_tokens=10)
+sampling_params = SamplingParams(temperature=0, max_completion_tokens=10)
 output = llm.generate(prompt, sampling_params)

 llm.sleep(level=1)
@@ -110,7 +110,7 @@ The following is a simple example of how to use sleep mode.
 -d '{
     "model": "Qwen/Qwen2.5-0.5B-Instruct",
     "prompt": "The future of AI is",
-    "max_tokens": 7,
+    "max_completion_tokens": 7,
     "temperature": 0
 }'
 ```