[Doc] Update max_tokens to max_completion_tokens in all docs (#6248)
### What this PR does / why we need it?
Fix:
```
DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field.
```
- vLLM version: v0.14.1
- vLLM main: d68209402d
Signed-off-by: shen-shanshan <467638484@qq.com>
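The rename applied throughout this commit can be sketched as a small payload rewrite. This is only an illustration: `migrate_payload` is a hypothetical helper that is not part of this change, and the sample payload is copied from one of the curl examples touched below.

```python
import json


def migrate_payload(payload: dict) -> dict:
    """Return a copy of `payload` with `max_tokens` renamed to `max_completion_tokens`."""
    migrated = dict(payload)  # shallow copy so the caller's dict is left untouched
    if "max_tokens" in migrated:
        value = migrated.pop("max_tokens")
        # If the caller already set max_completion_tokens, keep that value.
        migrated.setdefault("max_completion_tokens", value)
    return migrated


old_payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "The future of AI is",
    "max_tokens": 7,
    "temperature": 0,
}
new_payload = migrate_payload(old_payload)
print(json.dumps(new_payload, indent=4))
```

The helper copies the payload rather than mutating it, so the same request dict can still be sent unchanged to a server that only understands the old field.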
@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
 "model": "Qwen/Qwen2.5-7B-Instruct",
 "prompt": "The future of AI is",
-"max_tokens": 7,
+"max_completion_tokens": 7,
 "temperature": 0
 }'
 ```
@@ -66,7 +66,7 @@ curl http://localhost:8000/v1/completions \
 "C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
 "D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
 "<|im_start|>assistant\n"'",
-"max_tokens": 1,
+"max_completion_tokens": 1,
 "temperature": 0,
 "stop": ["<|im_end|>"]
 }' | python3 -m json.tool
@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
 "model": "Qwen/Qwen2.5-7B-Instruct",
 "prompt": "The future of AI is",
-"max_tokens": 7,
+"max_completion_tokens": 7,
 "temperature": 0
 }'
 ```
@@ -39,7 +39,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
 #### Mooncake Connector

 1. The request is sent to the Proxy’s `_handle_completions` endpoint.
-2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
+2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
 3. After the P node’s scheduler finishes prefill, `update_from_output` invokes the schedule connector’s `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
 4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
 5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
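The proxy side of the five steps above can be sketched roughly as follows. This is a minimal simulation, not the project's implementation: the node names, the round-robin selection, `fake_prefill_node`, and the `prefill_node` key are hypothetical, and the sketch assumes the token limits are top-level request fields next to `kv_transfer_params`, which the text does not state explicitly.

```python
from itertools import cycle

# Hypothetical node names; a real proxy would track live P and D instances.
_PREFILLERS = cycle(["p-node-0", "p-node-1"])
_DECODERS = cycle(["d-node-0", "d-node-1"])


def select_prefiller() -> str:
    """Pick a P node (round-robin here, purely for illustration)."""
    return next(_PREFILLERS)


def select_decoder() -> str:
    """Pick a D node (round-robin here, purely for illustration)."""
    return next(_DECODERS)


def build_prefill_request(request: dict) -> dict:
    """Step 2: ask the P node to prefill only and emit exactly one token."""
    prefill = dict(request)
    prefill["kv_transfer_params"] = {"do_remote_decode": True}
    prefill["max_completion_tokens"] = 1
    prefill["min_tokens"] = 1
    return prefill


def fake_prefill_node(node: str, prefill_request: dict) -> dict:
    """Step 3 stand-in: the P node answers with do_remote_prefill=True."""
    assert prefill_request["kv_transfer_params"]["do_remote_decode"] is True
    # "prefill_node" is a made-up key showing that the D node needs to know
    # where to pull the KV cache from.
    return {"kv_transfer_params": {"do_remote_prefill": True, "prefill_node": node}}


def handle_completions(request: dict) -> tuple[str, dict]:
    """Steps 1-4: prefill on a P node, then build the request for a D node."""
    p_node = select_prefiller()
    prefill_response = fake_prefill_node(p_node, build_prefill_request(request))
    d_node = select_decoder()
    decode = dict(request)  # the original token limit is kept for decoding
    decode["kv_transfer_params"] = prefill_response["kv_transfer_params"]
    return d_node, decode


d_node, decode_request = handle_completions(
    {"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "The future of AI is", "max_completion_tokens": 7}
)
print(d_node, decode_request)
```

Only the prefill copy is capped at one token; the request forwarded to the D node keeps the caller's original limit.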
@@ -49,7 +49,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
 1. The request is sent to the Proxy’s `_handle_completions` endpoint.
 2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
 3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
-4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
+4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
 5. During processing, the P node’s scheduler pushes KV cache layer-wise; once all layers pushing is complete, it releases the request and notifies the D node to begin decoding.
 6. The D node performs decoding and returns the result.

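The ordering in the six steps above, where the D node is contacted first and the P node pushes KV cache layer by layer, can be illustrated with a small event trace. Everything here is a hypothetical sketch: the `METASERVER_URL` value, the `metaserver` key name, the function names, and the placement of the token limits as top-level fields are assumptions made for illustration only.

```python
# Hypothetical endpoint; the real proxy exposes its own metaserver route.
METASERVER_URL = "http://proxy.example:8000/v1/metaserver"


def build_decode_request(request: dict, metaserver_url: str) -> dict:
    """Step 2: the request goes to a D node first, pointing at the metaserver."""
    decode = dict(request)
    decode["kv_transfer_params"] = {"do_remote_prefill": True, "metaserver": metaserver_url}
    return decode


def build_prefill_request(decode_request: dict) -> dict:
    """Step 4: the metaserver forwards a one-token prefill request to a P node."""
    prefill = {k: v for k, v in decode_request.items() if k != "kv_transfer_params"}
    prefill["kv_transfer_params"] = {"do_remote_decode": True}
    prefill["max_completion_tokens"] = 1
    prefill["min_tokens"] = 1
    return prefill


def push_kv_layers(num_layers: int, events: list) -> None:
    """Step 5: push KV cache one layer at a time, then release and notify."""
    for layer in range(num_layers):
        events.append(f"P: push KV layer {layer}")
    events.append("P: release request and notify D")


def run_push_scheme(request: dict, num_layers: int = 2):
    """Record the order of events for one request in the push scheme."""
    events = ["Proxy -> D: do_remote_prefill"]
    decode_request = build_decode_request(request, METASERVER_URL)
    events.append("D: WAITING_FOR_REMOTE_KVS, contact metaserver")
    prefill_request = build_prefill_request(decode_request)
    events.append("metaserver -> P: do_remote_decode")
    push_kv_layers(num_layers, events)
    events.append("D: decode and return result")
    return events, decode_request, prefill_request


events, decode_request, prefill_request = run_push_scheme(
    {"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "The future of AI is", "max_completion_tokens": 7}
)
for event in events:
    print(event)
```

Compared with the pull scheme, the D node never fetches the cache itself; it simply waits until the P node reports that every layer has been pushed.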
@@ -112,7 +112,7 @@ JSON
 -d '{
 "model": "Qwen/Qwen2.5-0.5B-Instruct",
 "prompt": "Explain gravity in one sentence.",
-"max_tokens": 32,
+"max_completion_tokens": 32,
 "temperature": 0
 }' | python -m json.tool
 ```
@@ -50,7 +50,7 @@ curl http://localhost:8000/v1/completions \
 -d '{
 "model": "Qwen/Qwen2.5-0.5B-Instruct",
 "prompt": "Beijing is a",
-"max_tokens": 5,
+"max_completion_tokens": 5,
 "temperature": 0
 }' | python3 -m json.tool
 ```