[Doc] Update max_tokens to max_completion_tokens in all docs (#6248)

### What this PR does / why we need it?

Fix:

```
DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field.
```
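
The rename this commit applies across the docs can be sketched as a small payload helper. This is a minimal illustration, not code from the PR: `migrate_payload`, `OLD_KEY`, and `NEW_KEY` are hypothetical names, and it assumes the target endpoint and vLLM version accept `max_completion_tokens` for the request being sent.

```python
import copy

# Hypothetical constants for illustration; only the two field names come from the warning above.
OLD_KEY = "max_tokens"
NEW_KEY = "max_completion_tokens"


def migrate_payload(payload: dict) -> dict:
    """Return a copy of a request payload with the deprecated key renamed."""
    migrated = copy.deepcopy(payload)  # leave the caller's dict untouched
    if OLD_KEY in migrated:
        value = migrated.pop(OLD_KEY)
        # If the new key is already set explicitly, keep that value.
        migrated.setdefault(NEW_KEY, value)
    return migrated


if __name__ == "__main__":
    request = {"model": "Qwen/Qwen2.5-7B-Instruct", "max_tokens": 7, "temperature": 0}
    print(migrate_payload(request))
```

The helper only rewrites the JSON body; whether a given endpoint emits the warning for `max_tokens` should be checked against the server version in use.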

- vLLM version: v0.14.1
- vLLM main: d68209402d

Signed-off-by: shen-shanshan <467638484@qq.com>
Shanshan Shen
2026-01-26 11:57:40 +08:00
committed by GitHub
parent 418fccf0bc
commit e3eefdecbd
28 changed files with 43 additions and 43 deletions

@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
-d '{ -d '{
"model": "Qwen/Qwen2.5-7B-Instruct", "model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 7, "max_completion_tokens": 7,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -66,7 +66,7 @@ curl http://localhost:8000/v1/completions \
"C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\ "C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\
"D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\ "D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\
"<|im_start|>assistant\n"'", "<|im_start|>assistant\n"'",
"max_tokens": 1, "max_completion_tokens": 1,
"temperature": 0, "temperature": 0,
"stop": ["<|im_end|>"] "stop": ["<|im_end|>"]
}' | python3 -m json.tool }' | python3 -m json.tool

@@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \
-d '{ -d '{
"model": "Qwen/Qwen2.5-7B-Instruct", "model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 7, "max_completion_tokens": 7,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -39,7 +39,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
#### Mooncake Connector #### Mooncake Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint. 1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`. 2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
3. After the P node's scheduler finishes prefill, `update_from_output` invokes the schedule connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy. 3. After the P node's scheduler finishes prefill, `update_from_output` invokes the schedule connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
4. The Proxy calls `select_decoder` to choose a D node and forwards the request. 4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result. 5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
@@ -49,7 +49,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
1. The request is sent to the Proxy's `_handle_completions` endpoint. 1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint. 2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete. 3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`. 4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
5. During processing, the P node's scheduler pushes KV cache layer-wise; once all layers pushing is complete, it releases the request and notifies the D node to begin decoding. 5. During processing, the P node's scheduler pushes KV cache layer-wise; once all layers pushing is complete, it releases the request and notifies the D node to begin decoding.
6. The D node performs decoding and returns the result. 6. The D node performs decoding and returns the result.
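
The request shaping described in the steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Proxy's actual code: `build_prefill_request`, `build_decode_request`, and the `limit_key` parameter are hypothetical, and the sketch assumes the one-token limits are set as top-level request fields next to `kv_transfer_params`.

```python
from typing import Any


def build_prefill_request(
    request: dict[str, Any],
    limit_key: str = "max_completion_tokens",  # assumption: name of the output-limit field
) -> dict[str, Any]:
    """Shape the request forwarded to a P node: prefill only, one token out."""
    prefill = dict(request)  # shallow copy so the original request is preserved
    prefill["kv_transfer_params"] = {"do_remote_decode": True}
    prefill[limit_key] = 1
    prefill["min_tokens"] = 1
    return prefill


def build_decode_request(
    request: dict[str, Any],
    prefill_kv_params: dict[str, Any],
) -> dict[str, Any]:
    """Shape the request forwarded to a D node, reusing the P node's returned params."""
    decode = dict(request)
    # In the pull scheme the P node returns params with do_remote_prefill=True.
    decode["kv_transfer_params"] = dict(prefill_kv_params)
    return decode


if __name__ == "__main__":
    original = {"model": "m", "prompt": "hi", "max_completion_tokens": 64}
    print(build_prefill_request(original))
    print(build_decode_request(original, {"do_remote_prefill": True}))
```

Copying the request before modifying it keeps the client's original output limit available for the decode stage.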

@@ -112,7 +112,7 @@ JSON
-d '{ -d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct", "model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Explain gravity in one sentence.", "prompt": "Explain gravity in one sentence.",
"max_tokens": 32, "max_completion_tokens": 32,
"temperature": 0 "temperature": 0
}' | python -m json.tool }' | python -m json.tool
``` ```

@@ -50,7 +50,7 @@ curl http://localhost:8000/v1/completions \
-d '{ -d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct", "model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Beijing is a", "prompt": "Beijing is a",
"max_tokens": 5, "max_completion_tokens": 5,
"temperature": 0 "temperature": 0
}' | python3 -m json.tool }' | python3 -m json.tool
``` ```

@@ -168,7 +168,7 @@ curl http://localhost:8000/v1/completions \
-d '{ -d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct", "model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "Beijing is a", "prompt": "Beijing is a",
"max_tokens": 5, "max_completion_tokens": 5,
"temperature": 0 "temperature": 0
}' | python3 -m json.tool }' | python3 -m json.tool
``` ```

@@ -78,7 +78,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 64, "max_completion_tokens": 64,
"top_p": 0.95, "top_p": 0.95,
"top_k": 50, "top_k": 50,
"temperature": 0.6 "temperature": 0.6
@@ -108,7 +108,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 64, "max_completion_tokens": 64,
"top_p": 0.95, "top_p": 0.95,
"top_k": 50, "top_k": 50,
"temperature": 0.6 "temperature": 0.6
@@ -138,7 +138,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 64, "max_completion_tokens": 64,
"top_p": 0.95, "top_p": 0.95,
"top_k": 50, "top_k": 50,
"temperature": 0.6 "temperature": 0.6
@@ -179,7 +179,7 @@ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:", "prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:",
"max_tokens": 64, "max_completion_tokens": 64,
"top_p": 0.95, "top_p": 0.95,
"top_k": 50, "top_k": 50,
"temperature": 0.6 "temperature": 0.6
@@ -221,7 +221,7 @@ prompts = [
"The future of AI is", "The future of AI is",
] ]
# Create a sampling params object. # Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0) sampling_params = SamplingParams(max_completion_tokens=100, temperature=0.0)
# Create an LLM. # Create an LLM.
llm = LLM( llm = LLM(
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
@@ -264,7 +264,7 @@ prompts = [
"The future of AI is", "The future of AI is",
] ]
# Create a sampling params object. # Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0) sampling_params = SamplingParams(max_completion_tokens=100, temperature=0.0)
# Create an LLM. # Create an LLM.
llm = LLM( llm = LLM(
model="Qwen/Qwen2.5-7B-Instruct", model="Qwen/Qwen2.5-7B-Instruct",
@@ -307,7 +307,7 @@ prompts = [
"The future of AI is", "The future of AI is",
] ]
# Create a sampling params object. # Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, top_p=0.95, top_k=50, temperature=0.6) sampling_params = SamplingParams(max_completion_tokens=100, top_p=0.95, top_k=50, temperature=0.6)
# Create an LLM. # Create an LLM.
llm = LLM( llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct", model="Qwen/Qwen2.5-VL-3B-Instruct",

@@ -243,7 +243,7 @@ curl http://<node0_ip>:<port>/v1/completions \
-d '{ -d '{
"model": "deepseek_r1", "model": "deepseek_r1",
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 50, "max_completion_tokens": 50,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -661,7 +661,7 @@ curl http://<node0_ip>:<port>/v1/completions \
-d '{ -d '{
"model": "deepseek_v3", "model": "deepseek_v3",
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 50, "max_completion_tokens": 50,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -840,7 +840,7 @@ curl http://<node0_ip>:<port>/v1/completions \
-d '{ -d '{
"model": "deepseek_v3.2", "model": "deepseek_v3.2",
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 50, "max_completion_tokens": 50,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -142,7 +142,7 @@ llm = LLM(
) )
sampling_params = SamplingParams( sampling_params = SamplingParams(
max_tokens=512 max_completion_tokens=512
) )
image_messages = [ image_messages = [
@@ -238,7 +238,7 @@ llm = LLM(
) )
sampling_params = SamplingParams( sampling_params = SamplingParams(
max_tokens=512 max_completion_tokens=512
) )
image_messages = [ image_messages = [

@@ -127,7 +127,7 @@ curl http://<IP>:<Port>/v1/completions \
-d '{ -d '{
"model": "qwen-2.5-7b-instruct", "model": "qwen-2.5-7b-instruct",
"prompt": "Beijing is a", "prompt": "Beijing is a",
"max_tokens": 5, "max_completion_tokens": 5,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -156,7 +156,7 @@ curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/j
] ]
} }
], ],
"max_tokens": 100, "max_completion_tokens": 100,
"temperature": 0.7 "temperature": 0.7
}' }'

@@ -269,7 +269,7 @@ curl http://<node0_ip>:<port>/v1/completions \
-d '{ -d '{
"model": "qwen3", "model": "qwen3",
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 50, "max_completion_tokens": 50,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -62,7 +62,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
"temperature": 0.6, "temperature": 0.6,
"top_p": 0.95, "top_p": 0.95,
"top_k": 20, "top_k": 20,
"max_tokens": 4096 "max_completion_tokens": 4096
}' }'
``` ```

@@ -108,7 +108,7 @@ curl http://localhost:8000/v1/completions \
-d '{ -d '{
"model": "qwen3-32b-w4a4", "model": "qwen3-32b-w4a4",
"prompt": "what is large language model?", "prompt": "what is large language model?",
"max_tokens": "128", "max_completion_tokens": "128",
"top_p": "0.95", "top_p": "0.95",
"top_k": "40", "top_k": "40",
"temperature": "0.0" "temperature": "0.0"

@@ -106,7 +106,7 @@ curl http://localhost:8000/v1/completions \
-d '{ -d '{
"model": "qwen3-8b-w4a8", "model": "qwen3-8b-w4a8",
"prompt": "what is large language model?", "prompt": "what is large language model?",
"max_tokens": "128", "max_completion_tokens": "128",
"top_p": "0.95", "top_p": "0.95",
"top_k": "40", "top_k": "40",
"temperature": "0.0" "temperature": "0.0"

@@ -82,7 +82,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
"temperature": 0.6, "temperature": 0.6,
"top_p": 0.95, "top_p": 0.95,
"top_k": 20, "top_k": 20,
"max_tokens": 4096 "max_completion_tokens": 4096
}' }'
``` ```

@@ -214,7 +214,7 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso
"temperature": 0.6, "temperature": 0.6,
"top_p": 0.95, "top_p": 0.95,
"top_k": 20, "top_k": 20,
"max_tokens": 4096 "max_completion_tokens": 4096
}' }'
``` ```

@@ -75,7 +75,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
"temperature": 0.6, "temperature": 0.6,
"top_p": 0.95, "top_p": 0.95,
"top_k": 20, "top_k": 20,
"max_tokens": 32 "max_completion_tokens": 32
}' }'
``` ```
@@ -103,7 +103,7 @@ if __name__ == '__main__':
prompts = [ prompts = [
"Who are you?", "Who are you?",
] ]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32) sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
tensor_parallel_size=4, tensor_parallel_size=4,
enforce_eager=True, enforce_eager=True,

@@ -123,7 +123,7 @@ def main():
temperature=0.6, temperature=0.6,
top_p=0.95, top_p=0.95,
top_k=20, top_k=20,
max_tokens=16384, max_completion_tokens=16384,
) )
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH) processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
@@ -243,7 +243,7 @@ evalscope eval \
--datasets omni_bench, gsm8k, bbh \ --datasets omni_bench, gsm8k, bbh \
--dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \ --dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
--eval-batch-size 1 \ --eval-batch-size 1 \
--generation-config '{"max_tokens": 10000, "temperature": 0.6}' \ --generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
--limit 100 --limit 100
``` ```

@@ -120,7 +120,7 @@ curl http://localhost:8000/v1/chat/completions \
{"type": "text", "text": "What is the text in the illustrate?"} {"type": "text", "text": "What is the text in the illustrate?"}
]} ]}
], ],
"max_tokens": 100 "max_completion_tokens": 100
}' }'
``` ```
@@ -182,7 +182,7 @@ curl http://localhost:8000/v1/chat/completions \
{"type": "text", "text": "What is in this video?"} {"type": "text", "text": "What is in this video?"}
]} ]}
], ],
"max_tokens": 100 "max_completion_tokens": 100
}' }'
``` ```

@@ -932,7 +932,7 @@ curl http://192.0.0.1:8080/v1/completions \
-d '{ -d '{
"model": "qwen3-moe", "model": "qwen3-moe",
"prompt": "Who are you?", "prompt": "Who are you?",
"max_tokens": 100, "max_completion_tokens": 100,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -271,7 +271,7 @@ curl http://192.0.0.1:8080/v1/chat/completions \
{"type": "text", "text": "What is the text in the illustrate?"} {"type": "text", "text": "What is the text in the illustrate?"}
]} ]}
], ],
"max_tokens": 100, "max_completion_tokens": 100,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -186,7 +186,7 @@ curl http://localhost:8000/v1/completions \
-d '{ -d '{
"model": "qwen", "model": "qwen",
"prompt": "tell me how to sleep well", "prompt": "tell me how to sleep well",
"max_tokens": 100, "max_completion_tokens": 100,
"temperature": 0 "temperature": 0
}' }'
``` ```

@@ -296,13 +296,13 @@ Configure the localhost, port, and model weight path in the command to your own
Short question: Short question:
```shell ```shell
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }' curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
``` ```
Long question: Long question:
```shell ```shell
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }' curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. 
In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
``` ```
### Colocation Scenario ### Colocation Scenario
@@ -355,11 +355,11 @@ Configure the localhost, port, and model weight path in the command to your own
Short question: Short question:
```shell ```shell
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }' curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }'
``` ```
Long question: Long question:
```shell ```shell
curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }' curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. 
In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }'
``` ```

@@ -51,7 +51,7 @@ The following is a simple example of how to use sleep mode.
# record npu memory use baseline in case other process is running # record npu memory use baseline in case other process is running
used_bytes_baseline = total - free used_bytes_baseline = total - free
llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True) llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
sampling_params = SamplingParams(temperature=0, max_tokens=10) sampling_params = SamplingParams(temperature=0, max_completion_tokens=10)
output = llm.generate(prompt, sampling_params) output = llm.generate(prompt, sampling_params)
llm.sleep(level=1) llm.sleep(level=1)
@@ -110,7 +110,7 @@ The following is a simple example of how to use sleep mode.
-d '{ -d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct", "model": "Qwen/Qwen2.5-0.5B-Instruct",
"prompt": "The future of AI is", "prompt": "The future of AI is",
"max_tokens": 7, "max_completion_tokens": 7,
"temperature": 0 "temperature": 0
}' }'
``` ```