Qwen3-Next: Update the gpu-memory-utilization parameter to 0.7 (#5129)
### What this PR does / why we need it?
Update the gpu-memory-utilization parameter to 0.7
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
---------
Signed-off-by: ming1212 <2717180080@qq.com>
Signed-off-by: ming1212 <104972349+ming1212@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
@@ -19,6 +19,9 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)
## Deployment
If the machine environment is an Atlas 800I A3 (64 GB × 16), the deployment approach remains the same.
### Run docker container
```{code-block} bash
@@ -92,7 +95,7 @@ Run the following script to start the vLLM server on multi-NPU:
For an Atlas A2 with 64 GB of NPU memory per card, `--tensor-parallel-size` should be at least 4; with 32 GB per card, it should be at least 8.
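A back-of-the-envelope check makes these minimums plausible. The sketch below assumes bf16 weights (2 bytes per parameter) and that all 80B parameters are resident on the NPUs (MoE experts are loaded even though only a few billion are active per token); it ignores KV cache and activation memory, so treat it as a rough bound rather than an exact sizing.

```python
# Rough sizing check for the tensor-parallel minimums above.
# Assumptions (not from the source doc): bf16 weights, 2 bytes/param,
# all 80B parameters resident; KV cache and activations ignored.
PARAMS_B = 80          # total parameters, in billions
BYTES_PER_PARAM = 2    # bf16
weights_gb = PARAMS_B * BYTES_PER_PARAM  # ~160 GB of weights in total

for tp, card_gb in [(4, 64), (8, 32)]:
    per_card = weights_gb / tp           # weight shard per NPU
    budget = card_gb * 0.7               # --gpu-memory-utilization 0.7
    print(f"TP={tp}: {per_card:.0f} GB weights/card, "
          f"budget {budget:.1f} GB -> fits: {per_card < budget}")
```

With TP=4 on 64 GB cards, each card holds roughly 40 GB of weights against a ~44.8 GB budget; with TP=8 on 32 GB cards, roughly 20 GB against ~22.4 GB. Either way the margin is thin, which is consistent with lowering `--gpu-memory-utilization` to a conservative 0.7.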
```diff
-vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.85 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
+vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 4096 --gpu-memory-utilization 0.7 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```
Once your server is started, you can query the model with input prompts.
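A minimal client-side sketch of such a query, using the OpenAI-compatible `/v1/completions` endpoint that `vllm serve` exposes. The base URL assumes vLLM's default port 8000; adjust the host and port to your deployment, and note the example prompt and sampling parameters are illustrative, not from the original doc.

```python
import json
import urllib.request

# Hypothetical query sketch against the OpenAI-compatible API served
# by `vllm serve`. Base URL assumes the default port 8000.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 32,
    "temperature": 0.7,
}

def query(prompt: str) -> str:
    """Send a completion request and return the generated text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps({**payload, "prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

An equivalent one-off check can be done with `curl` against the same endpoint once the server reports it is ready.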