[Doc] Update the weight download URL. (#5238)

### What this PR does / why we need it?
Update the weight download URLs, because the models were renamed.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: menogrey <1299267905@qq.com>
Commit f883a2edb9 (parent c3a8d13ca7) by zhangyiming, committed via GitHub, 2025-12-23 08:53:30 +08:00
2 changed files with 21 additions and 21 deletions


@@ -25,7 +25,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
 ### Model Weight
 - `DeepSeek-V3.1`(BF16 version): [Download model weight](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1)
 - `DeepSeek-V3.1-w8a8`(Quantized version without mtp): [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-w8a8).
-- `DeepSeek-V3.1_w8a8mix_mtp`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
+- `DeepSeek-V3.1-w8a8-mtp-QuaRot`(Quantized version with mix mtp): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8-mtp-QuaRot). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.
 - `Method of Quantify`: [msmodelslim](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-v31-w8a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96-mtp-%E9%87%8F%E5%8C%96). You can use these methods to quantify the model.
 It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
@@ -82,7 +82,7 @@ If you want to deploy multi-node environment, you need to set up environment on
 ### Single-node Deployment
-- Quantized model `DeepSeek-V3.1_w8a8mix_mtp` can be deployed on 1 Atlas 800 A3 (64G × 16).
+- Quantized model `DeepSeek-V3.1-w8a8-mtp-QuaRot` can be deployed on 1 Atlas 800 A3 (64G × 16).
 Run the following script to execute online inference.
@@ -107,7 +107,7 @@ export HCCL_SOCKET_IFNAME=$nic_name
 export VLLM_ASCEND_ENABLE_MLAPO=1
 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
-vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
+vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port 8015 \
 --data-parallel-size 4 \
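`HCCL_SOCKET_IFNAME` (set via `$nic_name` in the hunk above) must name the network interface that carries the node's service IP. A hypothetical helper that resolves the interface name from an interface-to-address mapping; the mapping itself would come from e.g. `ip -4 addr`, and all names here are illustrative:

```python
def nic_for_ip(addr_map: dict, local_ip: str) -> str:
    """Return the name of the interface bound to local_ip.

    addr_map maps interface name -> IPv4 address, e.g.
    {"lo": "127.0.0.1", "enp67s0f5": "10.0.0.1"} (names illustrative).
    """
    for nic, ip in addr_map.items():
        if ip == local_ip:
            return nic
    raise ValueError(f"no interface bound to {local_ip}")
```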
@@ -135,7 +135,7 @@ The parameters are explained as follows:
 ### Multi-node Deployment
-- `DeepSeek-V3.1_w8a8mix_mtp`: require at least 2 Atlas 800 A2 (64G × 8).
+- `DeepSeek-V3.1-w8a8-mtp-QuaRot`: require at least 2 Atlas 800 A2 (64G × 8).
 Run the following scripts on two nodes respectively.
@@ -172,7 +172,7 @@ export VLLM_ASCEND_ENABLE_MLAPO=1
 export HCCL_INTRA_PCIE_ENABLE=1
 export HCCL_INTRA_ROCE_ENABLE=0
-vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
+vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port 8004 \
 --data-parallel-size 4 \
@@ -226,7 +226,7 @@ export VLLM_ASCEND_ENABLE_MLAPO=1
 export HCCL_INTRA_PCIE_ENABLE=1
 export HCCL_INTRA_ROCE_ENABLE=0
-vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
+vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port 8004 \
 --headless \
@@ -255,7 +255,7 @@ vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
 We recommend using Mooncake for deployment: [Mooncake](./pd_disaggregation_mooncake_multi_node.md).
 Take Atlas 800 A3 (64G × 16) for example, we recommend to deploy 2P1D (4 nodes) rather than 1P1D (2 nodes), because there is no enough NPU memory to serve high concurrency in 1P1D case.
-- `DeepSeek-V3.1_w8a8mix_mtp 2P1D Layerwise` require 4 Atlas 800 A3 (64G × 16).
+- `DeepSeek-V3.1-w8a8-mtp-QuaRot 2P1D Layerwise` require 4 Atlas 800 A3 (64G × 16).
 To run the vllm-ascend `Prefill-Decode Disaggregation` service, you need to deploy a `launch_dp_program.py` script and a `run_dp_template.sh` script on each node and deploy a `proxy.sh` script on prefill master node to forward requests.
@@ -400,7 +400,7 @@ export ASCEND_RT_VISIBLE_DEVICE=$1
 export ASCEND_BUFFER_POOL=4:8
 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
-vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
+vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port $2 \
 --data-parallel-size $3 \
@@ -479,7 +479,7 @@ export ASCEND_RT_VISIBLE_DEVICE=$1
 export ASCEND_BUFFER_POOL=4:8
 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
-vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
+vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port $2 \
 --data-parallel-size $3 \
@@ -558,7 +558,7 @@ export ASCEND_RT_VISIBLE_DEVICE=$1
 export ASCEND_BUFFER_POOL=4:8
 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
-vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
+vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port $2 \
 --data-parallel-size $3 \
@@ -637,7 +637,7 @@ export ASCEND_RT_VISIBLE_DEVICE=$1
 export ASCEND_BUFFER_POOL=4:8
 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
-vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \
+vllm serve /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port $2 \
 --data-parallel-size $3 \
@@ -772,7 +772,7 @@ Here are two accuracy evaluation methods.
 ### Using AISBench
 1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
-2. After execution, you can get the result, here is the result of `DeepSeek-V3.1_w8a8mix_mtp` in `vllm-ascend:0.11.0rc1` for reference only.
+2. After execution, you can get the result, here is the result of `DeepSeek-V3.1-w8a8-mtp-QuaRot` in `vllm-ascend:0.11.0rc1` for reference only.
 | dataset | version | metric | mode | vllm-api-general-chat | note |
 |----- | ----- | ----- | ----- | -----| ----- |
@@ -790,7 +790,7 @@ Refer to [Using AISBench for performance evaluation](../developer_guide/evaluati
 ### Using vLLM Benchmark
-Run performance evaluation of `DeepSeek-V3.1_w8a8mix_mtp` as an example.
+Run performance evaluation of `DeepSeek-V3.1-w8a8-mtp-QuaRot` as an example.
 Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
@@ -802,7 +802,7 @@ There are three `vllm bench` subcommand:
 Take the `serve` as an example. Run the code as follows.
 ```shell
-vllm bench serve --model vllm-ascend/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model /weights/DeepSeek-V3.1-w8a8-mtp-QuaRot --dataset-name random --random-input 1024 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
 ```
 After about several minutes, you can get the performance evaluation result.


@@ -19,7 +19,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
 - `DeepSeek-V3.2-Exp`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
 - `DeepSeek-V3.2-Exp-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
 - `DeepSeek-V3.2`(BF16 version): require 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. Model weight in BF16 not found now.
-- `DeepSeek-V3.2-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot)
+- `DeepSeek-V3.2-w8a8`(Quantized version): require 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot)
 It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/`
@@ -289,7 +289,7 @@ Before you start, please
 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
-vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot \
+vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port $2 \
 --data-parallel-size $3 \
@@ -364,7 +364,7 @@ Before you start, please
 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
-vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot \
+vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port $2 \
 --data-parallel-size $3 \
@@ -441,7 +441,7 @@ Before you start, please
 export VLLM_ASCEND_ENABLE_MLAPO=1
-vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot \
+vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port $2 \
 --data-parallel-size $3 \
@@ -519,7 +519,7 @@ Before you start, please
 export VLLM_ASCEND_ENABLE_MLAPO=1
-vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot \
+vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
 --host 0.0.0.0 \
 --port $2 \
 --data-parallel-size $3 \
@@ -626,7 +626,7 @@ As an example, take the `gsm8k` dataset as a test dataset, and run accuracy eval
 ```shell
 lm_eval \
 --model local-completions \
- --model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
+ --model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
 --tasks gsm8k \
 --output_path ./
 ```
@@ -654,7 +654,7 @@ Take the `serve` as an example. Run the code as follows.
 ```shell
 export VLLM_USE_MODELSCOPE=true
-vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
 ```
 After about several minutes, you can get the performance evaluation result. With this tutorial, the performance result is: