[Doc] Update max_tokens to max_completion_tokens in all docs (#6248)

### What this PR does / why we need it?

Fix:

```
DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field.
```

- vLLM version: v0.14.1
- vLLM main: d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-26 11:57:40 +08:00
parent 418fccf0bc
commit e3eefdecbd
28 changed files with 43 additions and 43 deletions
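
The rename applied throughout the docs in this change can be sketched as a small payload transformation. This is only an illustration, assuming request bodies are plain Python dicts; the helper name `rename_token_limit` is hypothetical and is not part of vLLM or of this PR.

```python
from typing import Any

OLD_KEY = "max_tokens"
NEW_KEY = "max_completion_tokens"


def rename_token_limit(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `payload` with `max_tokens` renamed.

    Hypothetical helper: it mirrors the textual rename in the diff,
    nothing more. The input dict is left unmodified.
    """
    updated = dict(payload)  # shallow copy so the caller's dict is untouched
    if OLD_KEY in updated:
        value = updated.pop(OLD_KEY)
        # If the caller already set the new field, keep that value.
        updated.setdefault(NEW_KEY, value)
    return updated


if __name__ == "__main__":
    body = {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0,
    }
    print(rename_token_limit(body))
```

Returning a copy rather than mutating in place keeps the original request available for logging or retries.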
--- a/docs/source/developer_guide/evaluation/using_evalscope.md
+++ b/docs/source/developer_guide/evaluation/using_evalscope.md
@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
-        "max_tokens": 7,
+        "max_completion_tokens": 7,
        "temperature": 0
    }'
 ```
--- a/docs/source/developer_guide/evaluation/using_lm_eval.md
+++ b/docs/source/developer_guide/evaluation/using_lm_eval.md
@@ -66,7 +66,7 @@ curl http://localhost:8000/v1/completions \
 "C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
 "D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
 "<|im_start|>assistant\n"'",
-        "max_tokens": 1,
+        "max_completion_tokens": 1,
        "temperature": 0,
        "stop": ["<|im_end|>"]
    }' | python3 -m json.tool
--- a/docs/source/developer_guide/evaluation/using_opencompass.md
+++ b/docs/source/developer_guide/evaluation/using_opencompass.md
@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
-        "max_tokens": 7,
+        "max_completion_tokens": 7,
        "temperature": 0
    }'
 ```
--- a/docs/source/developer_guide/feature_guide/disaggregated_prefill.md
+++ b/docs/source/developer_guide/feature_guide/disaggregated_prefill.md
@@ -39,7 +39,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
 #### Mooncake Connector

 1. The request is sent to the Proxy’s `_handle_completions` endpoint.
-2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
+2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
 3. After the P node’s scheduler finishes prefill, `update_from_output` invokes the schedule connector’s `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
 4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
 5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
@@ -49,7 +49,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
 1. The request is sent to the Proxy’s `_handle_completions` endpoint.
 2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
 3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
-4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
+4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
 5. During processing, the P node’s scheduler pushes KV cache layer-wise; once all layers have been pushed, it releases the request and notifies the D node to begin decoding.
 6. The D node performs decoding and returns the result.
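
The prefill-side request that the Proxy builds in the steps above can be sketched roughly as follows. This is a minimal illustration, not the Proxy's actual code: the function name `build_prefill_request` is hypothetical, and placing the token limits at the top level of the request (rather than inside `kv_transfer_params`) is an assumption.

```python
import copy
from typing import Any


def build_prefill_request(request: dict[str, Any]) -> dict[str, Any]:
    """Derive the request forwarded to a P node (hypothetical sketch).

    The P node only needs to run prefill, so generation is capped at a
    single token and the KV cache is marked for remote decode.
    """
    prefill = copy.deepcopy(request)  # never mutate the client's request
    prefill["kv_transfer_params"] = {"do_remote_decode": True}
    # Assumption: the limits are ordinary top-level sampling fields.
    prefill["max_completion_tokens"] = 1
    prefill["min_tokens"] = 1
    return prefill


if __name__ == "__main__":
    client_request = {"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "Hello"}
    print(build_prefill_request(client_request))
```

Setting both limits to 1 forces the P node to emit exactly one token, so it finishes as soon as prefill is done.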

--- a/docs/source/developer_guide/performance_and_debug/msprobe_guide.md
+++ b/docs/source/developer_guide/performance_and_debug/msprobe_guide.md
@@ -112,7 +112,7 @@ JSON
     -d '{
           "model": "Qwen/Qwen2.5-0.5B-Instruct",
           "prompt": "Explain gravity in one sentence.",
-           "max_tokens": 32,
+           "max_completion_tokens": 32,
           "temperature": 0
         }' | python -m json.tool
   ```
--- a/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md
+++ b/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md
@@ -50,7 +50,7 @@ curl http://localhost:8000/v1/completions \
    -d '{
         "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "prompt": "Beijing is a",
-        "max_tokens": 5,
+        "max_completion_tokens": 5,
        "temperature": 0
 }' | python3 -m json.tool
 ```