[Doc] Update max_tokens to max_completion_tokens in all docs (#6248)

### What this PR does / why we need it?

Fix:

```
DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field.
```
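The fix is a mechanical rename: every request body or `SamplingParams` call in the docs that set `max_tokens` now sets `max_completion_tokens` instead. A minimal sketch of the rename as a payload migration (the `migrate_payload` helper is hypothetical, for illustration only; the PR itself edits each doc by hand):

```python
def migrate_payload(payload: dict) -> dict:
    """Return a copy of an OpenAI-style request body with the deprecated
    `max_tokens` field renamed to `max_completion_tokens`.

    Hypothetical helper for illustration; not part of this PR.
    """
    migrated = dict(payload)
    if "max_tokens" in migrated and "max_completion_tokens" not in migrated:
        # Pop the deprecated key and reinsert its value under the new name.
        migrated["max_completion_tokens"] = migrated.pop("max_tokens")
    return migrated


request = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "The future of AI is",
    "max_tokens": 7,
    "temperature": 0,
}
print(migrate_payload(request))
```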

- vLLM version: v0.14.1
- vLLM main: d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
Author: Shanshan Shen
Date: 2026-01-26 11:57:40 +08:00 (committed by GitHub)
Parent: 418fccf0bc
Commit: e3eefdecbd
28 changed files with 43 additions and 43 deletions

View File

@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
-"max_tokens": 7,
+"max_completion_tokens": 7,
"temperature": 0
}'
```

View File

@@ -66,7 +66,7 @@ curl http://localhost:8000/v1/completions \
"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
"D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
"<|im_start|>assistant\n"'",
-"max_tokens": 1,
+"max_completion_tokens": 1,
"temperature": 0,
"stop": ["<|im_end|>"]
}' | python3 -m json.tool

View File

@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is",
-"max_tokens": 7,
+"max_completion_tokens": 7,
"temperature": 0
}'
```

View File

@@ -39,7 +39,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
#### Mooncake Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint.
-2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
+2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
3. After the P node's scheduler finishes prefill, `update_from_output` invokes the scheduler connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
@@ -49,7 +49,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
-4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
+4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
5. During processing, the P node's scheduler pushes KV cache layer-wise; once all layers have been pushed, it releases the request and notifies the D node to begin decoding.
6. The D node performs decoding and returns the result.
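The two schemes differ only in which node the Proxy contacts first and in the `kv_transfer_params` it attaches at each hop. A minimal sketch of those parameter sets, following the wording of the steps above (the helper names are hypothetical, and the three prefill-side settings are grouped into one dict purely for illustration):

```python
from typing import Optional


def prefill_side_params() -> dict:
    # Attached when forwarding to a P node: the P node only prefills,
    # so generation is capped at a single token (steps 2 / 4 above).
    return {"do_remote_decode": True, "max_completion_tokens": 1, "min_tokens": 1}


def decode_side_params(metaserver: Optional[str] = None) -> dict:
    # Attached when forwarding to a D node; the push scheme additionally
    # passes the metaserver endpoint the D node should contact.
    params: dict = {"do_remote_prefill": True}
    if metaserver is not None:
        params["metaserver"] = metaserver
    return params
```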

View File

@@ -112,7 +112,7 @@ JSON
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Explain gravity in one sentence.",
-"max_tokens": 32,
+"max_completion_tokens": 32,
"temperature": 0
}' | python -m json.tool
```

View File

@@ -50,7 +50,7 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Beijing is a",
-"max_tokens": 5,
+"max_completion_tokens": 5,
"temperature": 0
}' | python3 -m json.tool
```

View File

@@ -168,7 +168,7 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Beijing is a",
-"max_tokens": 5,
+"max_completion_tokens": 5,
"temperature": 0
}' | python3 -m json.tool
```

View File

@@ -78,7 +78,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
-"max_tokens": 64,
+"max_completion_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
@@ -108,7 +108,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
-"max_tokens": 64,
+"max_completion_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
@@ -138,7 +138,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of AI is",
-"max_tokens": 64,
+"max_completion_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
@@ -179,7 +179,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
-"max_tokens": 64,
+"max_completion_tokens": 64,
"top_p": 0.95,
"top_k": 50,
"temperature": 0.6
@@ -221,7 +221,7 @@ prompts = [
"The future of AI is",
]
# Create a sampling params object.
-sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
+sampling_params = SamplingParams(max_completion_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen3-0.6B",
@@ -264,7 +264,7 @@ prompts = [
"The future of AI is",
]
# Create a sampling params object.
-sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
+sampling_params = SamplingParams(max_completion_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct",
@@ -307,7 +307,7 @@ prompts = [
"The future of AI is",
]
# Create a sampling params object.
-sampling_params = SamplingParams(max_tokens=100, top_p=0.95, top_k=50, temperature=0.6)
+sampling_params = SamplingParams(max_completion_tokens=100, top_p=0.95, top_k=50, temperature=0.6)
# Create an LLM.
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",

View File

@@ -243,7 +243,7 @@ curl http://<node0_ip>:<port>/v1/completions \
-d '{
"model": "deepseek_r1",
"prompt": "The future of AI is",
-"max_tokens": 50,
+"max_completion_tokens": 50,
"temperature": 0
}'
```

View File

@@ -661,7 +661,7 @@ curl http://<node0_ip>:<port>/v1/completions \
-d '{
"model": "deepseek_v3",
"prompt": "The future of AI is",
-"max_tokens": 50,
+"max_completion_tokens": 50,
"temperature": 0
}'
```

View File

@@ -840,7 +840,7 @@ curl http://<node0_ip>:<port>/v1/completions \
-d '{
"model": "deepseek_v3.2",
"prompt": "The future of AI is",
-"max_tokens": 50,
+"max_completion_tokens": 50,
"temperature": 0
}'
```

View File

@@ -142,7 +142,7 @@ llm = LLM(
)
sampling_params = SamplingParams(
-max_tokens=512
+max_completion_tokens=512
)
image_messages = [
@@ -238,7 +238,7 @@ llm = LLM(
)
sampling_params = SamplingParams(
-max_tokens=512
+max_completion_tokens=512
)
image_messages = [

View File

@@ -127,7 +127,7 @@ curl http://<IP>:<Port>/v1/completions \
-d '{
"model": "qwen-2.5-7b-instruct",
"prompt": "Beijing is a",
-"max_tokens": 5,
+"max_completion_tokens": 5,
"temperature": 0
}'
```

View File

@@ -156,7 +156,7 @@ curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/j
]
}
],
-"max_tokens": 100,
+"max_completion_tokens": 100,
"temperature": 0.7
}'

View File

@@ -269,7 +269,7 @@ curl http://<node0_ip>:<port>/v1/completions \
-d '{
"model": "qwen3",
"prompt": "The future of AI is",
-"max_tokens": 50,
+"max_completion_tokens": 50,
"temperature": 0
}'
```

View File

@@ -62,7 +62,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
-"max_tokens": 4096
+"max_completion_tokens": 4096
}'
```

View File

@@ -108,7 +108,7 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "qwen3-32b-w4a4",
"prompt": "what is large language model?",
-"max_tokens": "128",
+"max_completion_tokens": "128",
"top_p": "0.95",
"top_k": "40",
"temperature": "0.0"

View File

@@ -106,7 +106,7 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "qwen3-8b-w4a8",
"prompt": "what is large language model?",
-"max_tokens": "128",
+"max_completion_tokens": "128",
"top_p": "0.95",
"top_k": "40",
"temperature": "0.0"

View File

@@ -82,7 +82,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
-"max_tokens": 4096
+"max_completion_tokens": 4096
}'
```

View File

@@ -214,7 +214,7 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
-"max_tokens": 4096
+"max_completion_tokens": 4096
}'
```

View File

@@ -75,7 +75,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
-"max_tokens": 32
+"max_completion_tokens": 32
}'
```
@@ -103,7 +103,7 @@ if __name__ == '__main__':
prompts = [
"Who are you?",
]
-sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
+sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4,
enforce_eager=True,

View File

@@ -123,7 +123,7 @@ def main():
temperature=0.6,
top_p=0.95,
top_k=20,
-max_tokens=16384,
+max_completion_tokens=16384,
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
@@ -243,7 +243,7 @@ evalscope eval \
--datasets omni_bench, gsm8k, bbh \
--dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
--eval-batch-size 1 \
---generation-config '{"max_tokens": 10000, "temperature": 0.6}' \
+--generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
--limit 100
```

View File

@@ -120,7 +120,7 @@ curl http://localhost:8000/v1/chat/completions \
{"type": "text", "text": "What is the text in the illustrate?"}
]}
],
-"max_tokens": 100
+"max_completion_tokens": 100
}'
```
@@ -182,7 +182,7 @@ curl http://localhost:8000/v1/chat/completions \
{"type": "text", "text": "What is in this video?"}
]}
],
-"max_tokens": 100
+"max_completion_tokens": 100
}'
```

View File

@@ -932,7 +932,7 @@ curl http://192.0.0.1:8080/v1/completions \
-d '{
"model": "qwen3-moe",
"prompt": "Who are you?",
-"max_tokens": 100,
+"max_completion_tokens": 100,
"temperature": 0
}'
```

View File

@@ -271,7 +271,7 @@ curl http://192.0.0.1:8080/v1/chat/completions \
{"type": "text", "text": "What is the text in the illustrate?"}
]}
],
-"max_tokens": 100,
+"max_completion_tokens": 100,
"temperature": 0
}'
```

View File

@@ -186,7 +186,7 @@ curl http://localhost:8000/v1/completions \
-d '{
"model": "qwen",
"prompt": "tell me how to sleep well",
-"max_tokens": 100,
+"max_completion_tokens": 100,
"temperature": 0
}'
```

View File

@@ -296,13 +296,13 @@ Configure the localhost, port, and model weight path in the command to your own
Short question:
```shell
-curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
+curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
```
Long question:
```shell
-curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
+curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
```
### Colocation Scenario
@@ -355,11 +355,11 @@ Configure the localhost, port, and model weight path in the command to your own
Short question:
```shell
-curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'
+curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
```
Long question:
```shell
-curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'
+curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
```

View File

@@ -51,7 +51,7 @@ The following is a simple example of how to use sleep mode.
# record npu memory use baseline in case other process is running
used_bytes_baseline = total - free
llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
-sampling_params = SamplingParams(temperature=0, max_tokens=10)
+sampling_params = SamplingParams(temperature=0, max_completion_tokens=10)
output = llm.generate(prompt, sampling_params)
llm.sleep(level=1)
@@ -110,7 +110,7 @@ The following is a simple example of how to use sleep mode.
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "The future of AI is",
-"max_tokens": 7,
+"max_completion_tokens": 7,
"temperature": 0
}'
```