From e3eefdecbd4aa8c2f621eadc51c23121e3b04509 Mon Sep 17 00:00:00 2001 From: Shanshan Shen <467638484@qq.com> Date: Mon, 26 Jan 2026 11:57:40 +0800 Subject: [PATCH] [Doc] Update `max_tokens` to `max_completion_tokens` in all docs (#6248) ### What this PR does / why we need it? Fix: ``` DeprecationWarning: max_tokens is deprecated in favor of the max_completion_tokens field. ``` - vLLM version: v0.14.1 - vLLM main: https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60 Signed-off-by: shen-shanshan <467638484@qq.com> --- .../developer_guide/evaluation/using_evalscope.md | 2 +- .../developer_guide/evaluation/using_lm_eval.md | 2 +- .../evaluation/using_opencompass.md | 2 +- .../feature_guide/disaggregated_prefill.md | 4 ++-- .../performance_and_debug/msprobe_guide.md | 2 +- .../service_profiling_guide.md | 2 +- docs/source/quick_start.md | 2 +- docs/source/tutorials/310p.md | 14 +++++++------- docs/source/tutorials/DeepSeek-R1.md | 2 +- docs/source/tutorials/DeepSeek-V3.1.md | 2 +- docs/source/tutorials/DeepSeek-V3.2.md | 2 +- docs/source/tutorials/Qwen-VL-Dense.md | 4 ++-- docs/source/tutorials/Qwen2.5-7B.md | 2 +- docs/source/tutorials/Qwen2.5-Omni.md | 2 +- docs/source/tutorials/Qwen3-235B-A22B.md | 2 +- docs/source/tutorials/Qwen3-30B-A3B.md | 2 +- docs/source/tutorials/Qwen3-32B-W4A4.md | 2 +- docs/source/tutorials/Qwen3-8B-W4A8.md | 2 +- docs/source/tutorials/Qwen3-Coder-30B-A3B.md | 2 +- docs/source/tutorials/Qwen3-Dense.md | 2 +- docs/source/tutorials/Qwen3-Next.md | 4 ++-- .../tutorials/Qwen3-Omni-30B-A3B-Thinking.md | 4 ++-- docs/source/tutorials/Qwen3-VL-30B-A3B-Instruct.md | 4 ++-- .../pd_disaggregation_mooncake_multi_node.md | 2 +- .../pd_disaggregation_mooncake_single_node.md | 2 +- docs/source/tutorials/ray.md | 2 +- docs/source/user_guide/feature_guide/kv_pool.md | 8 ++++---- docs/source/user_guide/feature_guide/sleep_mode.md | 4 ++-- 28 files changed, 43 insertions(+), 43 deletions(-) diff --git 
a/docs/source/developer_guide/evaluation/using_evalscope.md b/docs/source/developer_guide/evaluation/using_evalscope.md index 99d0b783..014322d4 100644 --- a/docs/source/developer_guide/evaluation/using_evalscope.md +++ b/docs/source/developer_guide/evaluation/using_evalscope.md @@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \ -d '{ "model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "The future of AI is", - "max_tokens": 7, + "max_completion_tokens": 7, "temperature": 0 }' ``` diff --git a/docs/source/developer_guide/evaluation/using_lm_eval.md b/docs/source/developer_guide/evaluation/using_lm_eval.md index 961531a7..2e5453c3 100644 --- a/docs/source/developer_guide/evaluation/using_lm_eval.md +++ b/docs/source/developer_guide/evaluation/using_lm_eval.md @@ -66,7 +66,7 @@ curl http://localhost:8000/v1/completions \ "C. Asset-Liability Ratio=65.22%, Current Ratio=1.75\n"\ "D. Asset-Liability Ratio=68.00%, Current Ratio=2.50<|im_end|>\n"\ "<|im_start|>assistant\n"'", - "max_tokens": 1, + "max_completion_tokens": 1, "temperature": 0, "stop": ["<|im_end|>"] }' | python3 -m json.tool diff --git a/docs/source/developer_guide/evaluation/using_opencompass.md b/docs/source/developer_guide/evaluation/using_opencompass.md index 9d21c66c..e4268dba 100644 --- a/docs/source/developer_guide/evaluation/using_opencompass.md +++ b/docs/source/developer_guide/evaluation/using_opencompass.md @@ -48,7 +48,7 @@ curl http://localhost:8000/v1/completions \ -d '{ "model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "The future of AI is", - "max_tokens": 7, + "max_completion_tokens": 7, "temperature": 0 }' ``` diff --git a/docs/source/developer_guide/feature_guide/disaggregated_prefill.md b/docs/source/developer_guide/feature_guide/disaggregated_prefill.md index 9e358862..efeb3113 100644 --- a/docs/source/developer_guide/feature_guide/disaggregated_prefill.md +++ b/docs/source/developer_guide/feature_guide/disaggregated_prefill.md @@ -39,7 +39,7 @@ Our design diagram is shown below, 
illustrating the pull and push schemes respec #### Mooncake Connector 1. The request is sent to the Proxy’s `_handle_completions` endpoint. 2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`. 3. After the P node’s scheduler finishes prefill, `update_from_output` invokes the schedule connector’s `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy. 4. The Proxy calls `select_decoder` to choose a D node and forwards the request. 5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result. @@ -49,7 +49,7 @@ Our design diagram is shown below, illustrating the pull and push schemes respec 1. The request is sent to the Proxy’s `_handle_completions` endpoint. 2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint. 3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete. 4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`. 5. During processing, the P node’s scheduler pushes KV cache layer-wise; once all layers have been pushed, it releases the request and notifies the D node to begin decoding. 6. The D node performs decoding and returns the result. diff --git a/docs/source/developer_guide/performance_and_debug/msprobe_guide.md b/docs/source/developer_guide/performance_and_debug/msprobe_guide.md index b18e2a3d..30558f00 100644 --- a/docs/source/developer_guide/performance_and_debug/msprobe_guide.md +++ b/docs/source/developer_guide/performance_and_debug/msprobe_guide.md @@ -112,7 +112,7 @@ JSON -d '{ "model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": "Explain gravity in one sentence.", - "max_tokens": 32, + "max_completion_tokens": 32, "temperature": 0 }' | python -m json.tool ``` diff --git a/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md b/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md index 987c3749..d8b91b7a 100644 --- a/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md +++ b/docs/source/developer_guide/performance_and_debug/service_profiling_guide.md @@ -50,7 +50,7 @@ curl http://localhost:8000/v1/completions \ -d '{ "model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": "Beijing is a", - "max_tokens": 5, + "max_completion_tokens": 5, "temperature": 0 }' | python3 -m json.tool ``` diff --git a/docs/source/quick_start.md b/docs/source/quick_start.md index 3fc56039..97133676 100644 --- a/docs/source/quick_start.md +++ b/docs/source/quick_start.md @@ -168,7 +168,7 @@ curl http://localhost:8000/v1/completions \ -d '{ "model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": "Beijing is a", - "max_tokens": 5, + "max_completion_tokens": 5, "temperature": 0 }' | python3 -m json.tool ``` diff --git
a/docs/source/tutorials/310p.md b/docs/source/tutorials/310p.md index ad7be5a2..a157cd1e 100644 --- a/docs/source/tutorials/310p.md +++ b/docs/source/tutorials/310p.md @@ -78,7 +78,7 @@ curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "prompt": "The future of AI is", - "max_tokens": 64, + "max_completion_tokens": 64, "top_p": 0.95, "top_k": 50, "temperature": 0.6 @@ -108,7 +108,7 @@ curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "prompt": "The future of AI is", - "max_tokens": 64, + "max_completion_tokens": 64, "top_p": 0.95, "top_k": 50, "temperature": 0.6 @@ -138,7 +138,7 @@ curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "prompt": "The future of AI is", - "max_tokens": 64, + "max_completion_tokens": 64, "top_p": 0.95, "top_k": 50, "temperature": 0.6 @@ -179,7 +179,7 @@ curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "prompt": "[unused9]系统:[unused10][unused9]用户:'${question}'[unused10][unused9]助手:", - "max_tokens": 64, + "max_completion_tokens": 64, "top_p": 0.95, "top_k": 50, "temperature": 0.6 @@ -221,7 +221,7 @@ prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.0) # Create an LLM. llm = LLM( model="Qwen/Qwen3-0.6B", @@ -264,7 +264,7 @@ prompts = [ "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=100, temperature=0.0) # Create an LLM. llm = LLM( model="Qwen/Qwen2.5-7B-Instruct", @@ -307,7 +307,7 @@ prompts = [ "The future of AI is", ] # Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, top_p=0.95, top_k=50, temperature=0.6) # Create an LLM. llm = LLM( model="Qwen/Qwen2.5-VL-3B-Instruct", diff --git a/docs/source/tutorials/DeepSeek-R1.md b/docs/source/tutorials/DeepSeek-R1.md index f0e16f96..1305b454 100644 --- a/docs/source/tutorials/DeepSeek-R1.md +++ b/docs/source/tutorials/DeepSeek-R1.md @@ -243,7 +243,7 @@ curl http://:/v1/completions \ -d '{ "model": "deepseek_r1", "prompt": "The future of AI is", - "max_tokens": 50, + "max_completion_tokens": 50, "temperature": 0 }' ``` diff --git a/docs/source/tutorials/DeepSeek-V3.1.md b/docs/source/tutorials/DeepSeek-V3.1.md index 37c82125..250cafab 100644 --- a/docs/source/tutorials/DeepSeek-V3.1.md +++ b/docs/source/tutorials/DeepSeek-V3.1.md @@ -661,7 +661,7 @@ curl http://:/v1/completions \ -d '{ "model": "deepseek_v3", "prompt": "The future of AI is", - "max_tokens": 50, + "max_completion_tokens": 50, "temperature": 0 }' ``` diff --git a/docs/source/tutorials/DeepSeek-V3.2.md b/docs/source/tutorials/DeepSeek-V3.2.md index e375053e..b44a4059 100644 --- a/docs/source/tutorials/DeepSeek-V3.2.md +++ b/docs/source/tutorials/DeepSeek-V3.2.md @@ -840,7 +840,7 @@ curl http://:/v1/completions \ -d '{ "model": "deepseek_v3.2", "prompt": "The future of AI is", - "max_tokens": 50, + "max_completion_tokens": 50, "temperature": 0 }' ``` diff --git a/docs/source/tutorials/Qwen-VL-Dense.md b/docs/source/tutorials/Qwen-VL-Dense.md index d7fcca7e..94841a5f 100644 --- a/docs/source/tutorials/Qwen-VL-Dense.md +++ b/docs/source/tutorials/Qwen-VL-Dense.md @@ -142,7 +142,7 @@ llm = LLM( ) sampling_params = SamplingParams( max_tokens=512 ) image_messages = [ @@ -238,7 +238,7 @@ llm = LLM( ) sampling_params = SamplingParams( max_tokens=512 ) image_messages = [ diff --git a/docs/source/tutorials/Qwen2.5-7B.md
b/docs/source/tutorials/Qwen2.5-7B.md index b0c2689d..abab2564 100644 --- a/docs/source/tutorials/Qwen2.5-7B.md +++ b/docs/source/tutorials/Qwen2.5-7B.md @@ -127,7 +127,7 @@ curl http://:/v1/completions \ -d '{ "model": "qwen-2.5-7b-instruct", "prompt": "Beijing is a", - "max_tokens": 5, + "max_completion_tokens": 5, "temperature": 0 }' ``` diff --git a/docs/source/tutorials/Qwen2.5-Omni.md b/docs/source/tutorials/Qwen2.5-Omni.md index 020ed5cc..d5582d5b 100644 --- a/docs/source/tutorials/Qwen2.5-Omni.md +++ b/docs/source/tutorials/Qwen2.5-Omni.md @@ -156,7 +156,7 @@ curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/j ] } ], - "max_tokens": 100, + "max_completion_tokens": 100, "temperature": 0.7 }' diff --git a/docs/source/tutorials/Qwen3-235B-A22B.md b/docs/source/tutorials/Qwen3-235B-A22B.md index 85181437..bcdde1bb 100644 --- a/docs/source/tutorials/Qwen3-235B-A22B.md +++ b/docs/source/tutorials/Qwen3-235B-A22B.md @@ -269,7 +269,7 @@ curl http://:/v1/completions \ -d '{ "model": "qwen3", "prompt": "The future of AI is", - "max_tokens": 50, + "max_completion_tokens": 50, "temperature": 0 }' ``` diff --git a/docs/source/tutorials/Qwen3-30B-A3B.md b/docs/source/tutorials/Qwen3-30B-A3B.md index 62243f9f..b6724e85 100644 --- a/docs/source/tutorials/Qwen3-30B-A3B.md +++ b/docs/source/tutorials/Qwen3-30B-A3B.md @@ -62,7 +62,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso "temperature": 0.6, "top_p": 0.95, "top_k": 20, - "max_tokens": 4096 + "max_completion_tokens": 4096 }' ``` diff --git a/docs/source/tutorials/Qwen3-32B-W4A4.md b/docs/source/tutorials/Qwen3-32B-W4A4.md index dbc07a8f..28b9ecb4 100644 --- a/docs/source/tutorials/Qwen3-32B-W4A4.md +++ b/docs/source/tutorials/Qwen3-32B-W4A4.md @@ -108,7 +108,7 @@ curl http://localhost:8000/v1/completions \ -d '{ "model": "qwen3-32b-w4a4", "prompt": "what is large language model?", - "max_tokens": "128", + "max_completion_tokens": "128", "top_p": "0.95", 
"top_k": "40", "temperature": "0.0" diff --git a/docs/source/tutorials/Qwen3-8B-W4A8.md b/docs/source/tutorials/Qwen3-8B-W4A8.md index 476e151a..8d07b117 100644 --- a/docs/source/tutorials/Qwen3-8B-W4A8.md +++ b/docs/source/tutorials/Qwen3-8B-W4A8.md @@ -106,7 +106,7 @@ curl http://localhost:8000/v1/completions \ -d '{ "model": "qwen3-8b-w4a8", "prompt": "what is large language model?", - "max_tokens": "128", + "max_completion_tokens": "128", "top_p": "0.95", "top_k": "40", "temperature": "0.0" diff --git a/docs/source/tutorials/Qwen3-Coder-30B-A3B.md b/docs/source/tutorials/Qwen3-Coder-30B-A3B.md index 0fa8c89a..a0670f31 100644 --- a/docs/source/tutorials/Qwen3-Coder-30B-A3B.md +++ b/docs/source/tutorials/Qwen3-Coder-30B-A3B.md @@ -82,7 +82,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso "temperature": 0.6, "top_p": 0.95, "top_k": 20, - "max_tokens": 4096 + "max_completion_tokens": 4096 }' ``` diff --git a/docs/source/tutorials/Qwen3-Dense.md b/docs/source/tutorials/Qwen3-Dense.md index a5261435..86f13ee3 100644 --- a/docs/source/tutorials/Qwen3-Dense.md +++ b/docs/source/tutorials/Qwen3-Dense.md @@ -214,7 +214,7 @@ curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/jso "temperature": 0.6, "top_p": 0.95, "top_k": 20, - "max_tokens": 4096 + "max_completion_tokens": 4096 }' ``` diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md index 043ae359..d778ecc9 100644 --- a/docs/source/tutorials/Qwen3-Next.md +++ b/docs/source/tutorials/Qwen3-Next.md @@ -75,7 +75,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso "temperature": 0.6, "top_p": 0.95, "top_k": 20, - "max_tokens": 32 + "max_completion_tokens": 32 }' ``` @@ -103,7 +103,7 @@ if __name__ == '__main__': prompts = [ "Who are you?", ] sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32) llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct", tensor_parallel_size=4, enforce_eager=True, diff --git a/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md b/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md index a3d6d7b1..748a8a8b 100644 --- a/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md +++ b/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md @@ -123,7 +123,7 @@ def main(): temperature=0.6, top_p=0.95, top_k=20, max_tokens=16384, ) processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH) @@ -243,7 +243,7 @@ evalscope eval \ --datasets omni_bench, gsm8k, bbh \ --dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \ --eval-batch-size 1 \ - --generation-config '{"max_tokens": 10000, "temperature": 0.6}' \ + --generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \ --limit 100 ``` diff --git a/docs/source/tutorials/Qwen3-VL-30B-A3B-Instruct.md b/docs/source/tutorials/Qwen3-VL-30B-A3B-Instruct.md index 705dc5a6..d1368d84 100644 --- a/docs/source/tutorials/Qwen3-VL-30B-A3B-Instruct.md +++ b/docs/source/tutorials/Qwen3-VL-30B-A3B-Instruct.md @@ -120,7 +120,7 @@ curl http://localhost:8000/v1/chat/completions \ {"type": "text", "text": "What is the text in the illustrate?"} ]} ], - "max_tokens": 100 + "max_completion_tokens": 100 }' ``` @@ -182,7 +182,7 @@ curl http://localhost:8000/v1/chat/completions \ {"type": "text", "text": "What is in this video?"} ]} ], - "max_tokens": 100 + "max_completion_tokens": 100 }' ``` diff --git a/docs/source/tutorials/pd_disaggregation_mooncake_multi_node.md b/docs/source/tutorials/pd_disaggregation_mooncake_multi_node.md index 74745ef4..797df320 100644 --- a/docs/source/tutorials/pd_disaggregation_mooncake_multi_node.md +++ b/docs/source/tutorials/pd_disaggregation_mooncake_multi_node.md @@ -932,7 +932,7 @@ curl http://192.0.0.1:8080/v1/completions \ -d '{ "model": "qwen3-moe", "prompt":
"Who are you?", - "max_tokens": 100, + "max_completion_tokens": 100, "temperature": 0 }' ``` diff --git a/docs/source/tutorials/pd_disaggregation_mooncake_single_node.md b/docs/source/tutorials/pd_disaggregation_mooncake_single_node.md index d9eedd18..4e5a5df4 100644 --- a/docs/source/tutorials/pd_disaggregation_mooncake_single_node.md +++ b/docs/source/tutorials/pd_disaggregation_mooncake_single_node.md @@ -271,7 +271,7 @@ curl http://192.0.0.1:8080/v1/chat/completions \ {"type": "text", "text": "What is the text in the illustrate?"} ]} ], - "max_tokens": 100, + "max_completion_tokens": 100, "temperature": 0 }' ``` diff --git a/docs/source/tutorials/ray.md b/docs/source/tutorials/ray.md index 1b6527a2..af1c3fc0 100644 --- a/docs/source/tutorials/ray.md +++ b/docs/source/tutorials/ray.md @@ -186,7 +186,7 @@ curl http://localhost:8000/v1/completions \ -d '{ "model": "qwen", "prompt": "tell me how to sleep well", - "max_tokens": 100, + "max_completion_tokens": 100, "temperature": 0 }' ``` diff --git a/docs/source/user_guide/feature_guide/kv_pool.md b/docs/source/user_guide/feature_guide/kv_pool.md index 8e919e73..63f3d26f 100644 --- a/docs/source/user_guide/feature_guide/kv_pool.md +++ b/docs/source/user_guide/feature_guide/kv_pool.md @@ -296,13 +296,13 @@ Configure the localhost, port, and model weight path in the command to your own Short question: ```shell -curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }' +curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. 
The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }' ``` Long question: ```shell -curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }' +curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. 
However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }' ``` ### Colocation Scenario @@ -355,11 +355,11 @@ Configure the localhost, port, and model weight path in the command to your own Short question: ```shell -curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }' +curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_completion_tokens": 200, "temperature":0.0 }' ``` Long question: ```shell -curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. 
However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }' +curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? 
Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_completion_tokens": 256, "temperature":0.0 }' ``` diff --git a/docs/source/user_guide/feature_guide/sleep_mode.md b/docs/source/user_guide/feature_guide/sleep_mode.md index 4a81595d..64a24576 100644 --- a/docs/source/user_guide/feature_guide/sleep_mode.md +++ b/docs/source/user_guide/feature_guide/sleep_mode.md @@ -51,7 +51,7 @@ The following is a simple example of how to use sleep mode. # record npu memory use baseline in case other process is running used_bytes_baseline = total - free llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True) sampling_params = SamplingParams(temperature=0, max_tokens=10) output = llm.generate(prompt, sampling_params) llm.sleep(level=1) @@ -110,7 +110,7 @@ The following is a simple example of how to use sleep mode. -d '{ "model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": "The future of AI is", - "max_tokens": 7, + "max_completion_tokens": 7, "temperature": 0 }' ```
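The hunks above apply the same mechanical rename to HTTP request bodies across the docs. A minimal sketch of that rename as a payload transform is shown below; the `migrate_payload` helper and the sample payload are illustrative only (they are not part of vLLM or of the docs being patched), assuming the OpenAI-style convention that `max_completion_tokens` supersedes the deprecated `max_tokens` field:

```python
def migrate_payload(payload: dict) -> dict:
    """Return a copy of an OpenAI-style request body with the deprecated
    `max_tokens` field renamed to `max_completion_tokens`."""
    migrated = dict(payload)
    if "max_tokens" in migrated and "max_completion_tokens" not in migrated:
        # Move the value under the non-deprecated key.
        migrated["max_completion_tokens"] = migrated.pop("max_tokens")
    return migrated

# Example request body in the old shape, as it appears in the docs.
old = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Beijing is a",
    "max_tokens": 5,
    "temperature": 0,
}
new = migrate_payload(old)
print(new["max_completion_tokens"])  # -> 5
```

Note that the deprecation warning quoted in the PR description concerns API request fields; the offline `SamplingParams` class in vLLM, to my knowledge, still takes `max_tokens`, which is why the offline-inference examples are best left on the old name.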