From 9268ad11e3db29e8896815549b7fb1a2ce387af8 Mon Sep 17 00:00:00 2001
From: ming1212 <104972349+ming1212@users.noreply.github.com>
Date: Thu, 18 Dec 2025 15:16:33 +0800
Subject: [PATCH] =?UTF-8?q?Qwen3-Next=EF=BC=9AUpdate=20the=20gpu-memory-ut?=
 =?UTF-8?q?ilization=20parameter=20to=200.7=20(#5129)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What this PR does / why we need it?
Update the `gpu-memory-utilization` parameter to 0.7.

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9

---------

Signed-off-by: ming1212 <2717180080@qq.com>
Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com>
Signed-off-by: Mengqing Cao
Co-authored-by: Mengqing Cao
---
 docs/source/tutorials/Qwen3-Next.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md
index 9bde7964..a0a7a6d1 100644
--- a/docs/source/tutorials/Qwen3-Next.md
+++ b/docs/source/tutorials/Qwen3-Next.md
@@ -19,6 +19,9 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
 Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)
 
 ## Deployment
+
+If the machine environment is an Atlas 800I A3 (64G*16), the deployment steps are identical.
+
 ### Run docker container
 
 ```{code-block} bash
@@ -92,7 +95,7 @@ Run the following script to start the vLLM server on multi-NPU:
 For an Atlas A2 with 64 GB of NPU card memory, tensor-parallel-size should be at least 4, and for 32 GB of memory, tensor-parallel-size should be at least 8.
 
 ```bash
-vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.85 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
+vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
 ```
 
 Once your server is started, you can query the model with input prompts.
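
For context on the tuned value: `--gpu-memory-utilization` caps the fraction of each device's memory that vLLM pre-allocates for model weights, activations, and KV cache. On a 64 GB NPU, 0.7 budgets roughly 44.8 GB (64 GB × 0.7) for vLLM and leaves about 19 GB of headroom for other runtime allocations. A minimal sketch for sanity-checking per-card memory on Ascend hardware before adjusting the value; it assumes the standard `npu-smi` utility is available on the host or inside the container:

```bash
# List all NPUs with their current utilization and device-memory usage;
# compare the memory columns before and after changing --gpu-memory-utilization.
npu-smi info
```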
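
The patched tutorial still ends with "Once your server is started, you can query the model with input prompts." A minimal query sketch, assuming the server listens on vLLM's default port 8000 (the patched command does not pass `--port`) and using the OpenAI-compatible completions endpoint that `vllm serve` exposes:

```bash
# Send a completion request to the OpenAI-compatible API;
# "model" must match the path passed to `vllm serve`.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 64
      }'
```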