add release note for 0.12.0 (#4995)

Add release note for v0.12.0rc1
Update deepseek3.2 tutorial doc

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>


@@ -2,7 +2,7 @@
## Introduction
DeepSeek-V3.2 is a sparse attention model. Its main architecture is similar to DeepSeek-V3.1, but it adds a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.
This document shows the main verification steps for the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, and accuracy and performance evaluation.
@@ -18,6 +18,8 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur
- `DeepSeek-V3.2-Exp` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
- `DeepSeek-V3.2` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelscope.cn/models/deepseek-ai/DeepSeek-V3.2/)
- `DeepSeek-V3.2-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot)
It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.
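For example, with the ModelScope CLI (a sketch; the model id is taken from the link above, and the target directory is assumed to be shared across nodes):
```shell
# Assumes `pip install modelscope` has been done.
modelscope download --model deepseek-ai/DeepSeek-V3.2 --local_dir /root/.cache/deepseek-ai/DeepSeek-V3.2
```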
@@ -27,10 +29,10 @@ If you want to deploy multi-node environment, you need to verify multi-node comm
### Installation
You can use our official docker image and install the extra operator to support `DeepSeek-V3.2`.
:::{note}
We strongly recommend installing the triton-ascend package to speed up inference.
:::
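A sketch of that installation (the package name is an assumption; check the triton-ascend project for a wheel matching your CANN and torch_npu versions):
```shell
# Assumed package name; adjust if your environment provides a dedicated wheel.
pip install triton-ascend
```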
:::::{tab-set}
@@ -39,7 +41,7 @@ Only AArch64 architecture are supported currently due to extra operator's instal
::::{tab-item} A3 series
:sync: A3
Start the docker image on each of your nodes.
```{code-block} bash
:substitutions:
@@ -78,23 +80,11 @@ docker run --rm \
-it $IMAGE bash
```
::::
::::{tab-item} A2 series
:sync: A2
Start the docker image on each of your nodes.
```{code-block} bash
:substitutions:
@@ -125,18 +115,6 @@ docker run --rm \
-it $IMAGE bash
```
::::
:::::
@@ -144,229 +122,457 @@ In addition, if you don't want to use the docker image as above, you can also bu
- Install `vllm-ascend` from source; refer to [installation](../installation.md).
- Install the extra operator to support `DeepSeek-V3.2`; refer to the tabs above.
If you want to deploy a multi-node environment, you need to set up the environment on each node.
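For example, you can check NPU link reachability between nodes with `hccn_tool`, which ships with the Ascend driver (a sketch; the device id and peer IP are illustrative):
```shell
# Ping the RoCE interface of a peer node's NPU from local NPU 0.
hccn_tool -i 0 -ping -g address <peer_npu_ip>
```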
## Deployment
### Single-node Deployment
Only the quantized model `DeepSeek-V3.2-Exp-w8a8` can be deployed on a single Atlas 800 A3 node.
Run the following script to start online inference.
```shell
#!/bin/sh
export VLLM_USE_MODELSCOPE=true
vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 16 \
--seed 1024 \
--quantization ascend \
--served-model-name deepseek_v3.2 \
--max-num-seqs 16 \
--max-model-len 17450 \
--max-num-batched-tokens 17450 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92
```
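Once the server is up, you can send a quick test request through the OpenAI-compatible API (a minimal sketch; the model name matches `--served-model-name` above):
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek_v3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```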
### Multi-node Deployment
- `DeepSeek-V3.2-Exp`: requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes.
- `DeepSeek-V3.2-Exp-w8a8`: requires 2 Atlas 800 A2 (64G × 8) nodes.
:::::{tab-set}
:sync-group: install
::::{tab-item} DeepSeek-V3.2-Exp A3 series
:sync: A3
Run the following scripts on two nodes respectively.
**Node 0**
```shell
#!/bin/sh
# Obtain these values via `ifconfig`:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxxx"
local_ip="xxxx"
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name deepseek_v3.2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 17450 \
--max-num-batched-tokens 17450 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9
```
**Node 1**
```shell
#!/bin/sh
# Obtain these values via `ifconfig`:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxx"
local_ip="xxx"
# node0_ip must match the local_ip set on node 0 (the master node)
node0_ip="xxxx"
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 16 \
--seed 1024 \
--served-model-name deepseek_v3.2 \
--max-num-seqs 16 \
--max-model-len 17450 \
--max-num-batched-tokens 17450 \
--enable-expert-parallel \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92
```
::::
::::{tab-item} DeepSeek-V3.2-Exp-W8A8 A2 series
:sync: A2
Run the following scripts on two nodes respectively.
**Node 0**
```shell
#!/bin/sh
# Obtain these values via `ifconfig`:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxxx"
local_ip="xxxx"
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek_v3.2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 17450 \
--max-num-batched-tokens 17450 \
--trust-remote-code \
--quantization ascend \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9
```
**Node 1**
```shell
#!/bin/sh
# Obtain these values via `ifconfig`:
# nic_name is the network interface name corresponding to local_ip on the current node
nic_name="xxx"
local_ip="xxx"
# node0_ip must match the local_ip set on node 0 (the master node)
node0_ip="xxxx"
export VLLM_USE_MODELSCOPE=True
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
--host 0.0.0.0 \
--port 8000 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--seed 1024 \
--served-model-name deepseek_v3.2 \
--max-num-seqs 16 \
--max-model-len 17450 \
--max-num-batched-tokens 17450 \
--enable-expert-parallel \
--trust-remote-code \
--quantization ascend \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92
```
::::
:::::
:::{note}
In this tutorial, we suppose you downloaded the model weight to `/root/.cache/`. Feel free to change it to your own path.
:::
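Since node 1 runs with `--headless`, only node 0 exposes the API server. You can poll node 0 until it is ready (a minimal sketch, assuming the default port 8000 from the scripts above):
```shell
# Replace <node0_ip> with the local_ip configured on node 0.
until curl -sf http://<node0_ip>:8000/v1/models > /dev/null; do sleep 10; done
echo "server is ready"
```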
### Prefill-Decode Disaggregation
This section shows how to deploy `DeepSeek-V3.2` in a multi-node environment with 1P1D (one prefill group, one decode group) for better performance.
Before you start, please:
1. Prepare the script `launch_online_dp.py` on each node.
```python
import argparse
import multiprocessing
import os
import subprocess
import sys


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dp-size",
        type=int,
        required=True,
        help="Data parallel size."
    )
    parser.add_argument(
        "--tp-size",
        type=int,
        default=1,
        help="Tensor parallel size."
    )
    parser.add_argument(
        "--dp-size-local",
        type=int,
        default=-1,
        help="Local data parallel size."
    )
    parser.add_argument(
        "--dp-rank-start",
        type=int,
        default=0,
        help="Starting rank for data parallel."
    )
    parser.add_argument(
        "--dp-address",
        type=str,
        required=True,
        help="IP address of the data parallel master node."
    )
    parser.add_argument(
        "--dp-rpc-port",
        type=int,
        default=12345,
        help="Port of the data parallel master node."
    )
    parser.add_argument(
        "--vllm-start-port",
        type=int,
        default=9000,
        help="Starting port for the engines."
    )
    return parser.parse_args()


args = parse_args()
dp_size = args.dp_size
tp_size = args.tp_size
dp_size_local = args.dp_size_local
if dp_size_local == -1:
    dp_size_local = dp_size
dp_rank_start = args.dp_rank_start
dp_address = args.dp_address
dp_rpc_port = args.dp_rpc_port
vllm_start_port = args.vllm_start_port


def run_command(visible_devices, dp_rank, vllm_engine_port):
    # Launch one engine instance via the template script; the positional
    # arguments map to $1..$7 in run_dp_template.sh.
    command = [
        "bash",
        "./run_dp_template.sh",
        visible_devices,
        str(vllm_engine_port),
        str(dp_size),
        str(dp_rank),
        dp_address,
        str(dp_rpc_port),
        str(tp_size),
    ]
    subprocess.run(command, check=True)


if __name__ == "__main__":
    template_path = "./run_dp_template.sh"
    if not os.path.exists(template_path):
        print(f"Template file {template_path} does not exist.")
        sys.exit(1)
    processes = []
    for i in range(dp_size_local):
        dp_rank = dp_rank_start + i
        vllm_engine_port = vllm_start_port + i
        # Each local DP rank gets a contiguous slice of tp_size devices.
        visible_devices = ",".join(
            str(x) for x in range(i * tp_size, (i + 1) * tp_size))
        process = multiprocessing.Process(target=run_command,
                                          args=(visible_devices, dp_rank,
                                                vllm_engine_port))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()
```
2. Prepare the script `run_dp_template.sh` on each node. Each node uses its own variant, as listed below.
1. Prefill node 0
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.105 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export VLLM_TORCH_PROFILER_DIR="./vllm_profile"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 32550 \
--trust-remote-code \
--max-num-seqs 64 \
--gpu-memory-utilization 0.82 \
--quantization ascend \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
2. Prefill node 1
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.113 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export VLLM_TORCH_PROFILER_DIR="./vllm_profile"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 32550 \
--trust-remote-code \
--max-num-seqs 64 \
--gpu-memory-utilization 0.82 \
--quantization ascend \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
3. Decode node 0
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.117 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
# Mooncake
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export VLLM_TORCH_PROFILER_DIR="./vllm_profile"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_MLAPO=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 4 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 6, 8]}' \
--trust-remote-code \
--max-num-seqs 1 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--async-scheduling \
--quantization ascend \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
4. Decode node 1
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.181 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
# Mooncake
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export VLLM_TORCH_PROFILER_DIR="./vllm_profile"
export VLLM_TORCH_PROFILER_WITH_STACK=0
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_MLAPO=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 4 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[2, 4, 6, 8]}' \
--trust-remote-code \
--async-scheduling \
--max-num-seqs 1 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--quantization ascend \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
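All four template variants share the same positional interface (`$1` visible devices, `$2` engine port, `$3` DP size, `$4` DP rank, `$5` DP address, `$6` RPC port, `$7` TP size). They differ mainly in `local_ip`, the KV role (`kv_producer` on prefill nodes vs `kv_consumer` on decode nodes), and the decode-only flags (`--compilation-config`, `--async-scheduling`, a small `--max-num-batched-tokens`).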
Once the preparation is done, you can start the server with the following command on each node:
1. Prefill node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
2. Prefill node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
3. Decode node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
4. Decode node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
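As a worked example of how these launch arguments expand: on decode node 1, `launch_online_dp.py` starts `--dp-size-local 4` local engines with DP ranks 4, 5, 6, and 7 (`--dp-rank-start 4`), engine ports 9100-9103 (`--vllm-start-port 9100`), and `ASCEND_RT_VISIBLE_DEVICES` slices 0-3, 4-7, 8-11, and 12-15 (`--tp-size 4`), each invoking `run_dp_template.sh` with those values.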
## Functional Verification
@@ -391,15 +597,11 @@ Here are two accuracy evaluation methods.
1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result.
### Using Language Model Evaluation Harness
As an example, take the `gsm8k` dataset as the test dataset and run accuracy evaluation of `DeepSeek-V3.2-W8A8` in online mode.
1. Refer to [Using lm_eval](../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation.
@@ -408,17 +610,12 @@ As an example, take the `gsm8k` dataset as a test dataset, and run accuracy eval
```shell
lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
3. After execution, you can get the result.
## Performance
@@ -428,7 +625,7 @@ Refer to [Using AISBench for performance evaluation](../developer_guide/evaluati
### Using vLLM Benchmark
Run the performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
@@ -441,7 +638,15 @@ Take the `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/DeepSeek-V3.2-W8A8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result. With this tutorial's deployment, the performance result is:
**Hardware**: A3-752T, 4 nodes
**Deployment**: 1P1D; prefill node: DP2 + TP16, decode node: DP8 + TP4
**Input/Output**: 64k/3k
**Performance**: 255 tps, TPOT 23 ms
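A sketch of a `vllm bench serve` invocation approximating this 64k/3k workload (flag names follow the vLLM benchmark docs; the host/port, prompt count, and request rate are illustrative and should be pointed at your API entry point):
```shell
vllm bench serve --host <api_node_ip> --port 9100 --model dsv3 \
  --dataset-name random --random-input-len 65536 --random-output-len 3072 \
  --num-prompts 64 --request-rate 1 --save-result --result-dir ./
```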