# DeepSeek-V3.2
## Introduction
DeepSeek-V3.2 is a sparse attention model. Its overall architecture is similar to DeepSeek-V3.1, but it adds a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.
This document walks through the main verification steps for the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, and accuracy and performance evaluation.
## Supported Features
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
Refer to the [feature guide](../../user_guide/feature_guide/index.md) for the configuration of each feature.
## Environment Preparation
### Model Weight
- `DeepSeek-V3.2-Exp-W8A8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8)
- `DeepSeek-V3.2-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)
It is recommended to download the model weight to a directory shared by all nodes, such as `/root/.cache/`.
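For example, you can download the quantized weight with the ModelScope CLI. This is a minimal sketch; the model ID and target directory here are just the ones used in the serving commands later in this tutorial, so adjust them to the variant you actually deploy.

```shell
pip install modelscope
# Download DeepSeek-V3.2-W8A8 into the shared cache directory
modelscope download --model vllm-ascend/DeepSeek-V3.2-W8A8 \
    --local_dir /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8
```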
### Verify Multi-node Communication (Optional)
If you want to deploy a multi-node environment, verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
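As a quick sanity check (a minimal sketch; the linked guide describes the full procedure), you can query each NPU NIC's IP and ping a peer node's NPU NIC with `hccn_tool`. The device range and peer address below are placeholders for your own setup.

```shell
# Print the IP configured on each NPU NIC of the current node (16 devices on A3)
for i in $(seq 0 15); do hccn_tool -i $i -ip -g; done

# Ping a peer node's NPU NIC IP from NPU 0 (replace with the peer's address)
hccn_tool -i 0 -ping -g address 141.61.39.113
```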
### Installation
You can use our official docker image to run `DeepSeek-V3.2` directly.
:::::{tab-set}
:sync-group: install
::::{tab-item} A3 series
:sync: A3
Start the Docker container on each node.
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
::::{tab-item} A2 series
:sync: A2
Start the Docker container on each node.
```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
:::::
In addition, if you don't want to use the Docker image above, you can also build everything from source:
- Install `vllm-ascend` from source, refer to [installation](../../installation.md); a minimal sketch is shown below.
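The following sketch assumes CANN and its environment are already set up as described in the installation guide; exact steps and version pins are documented there.

```shell
# Install vLLM (device-agnostic build) and then vLLM Ascend from source
git clone https://github.com/vllm-project/vllm.git
cd vllm && VLLM_TARGET_DEVICE=empty pip install -v -e . && cd ..

git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend && pip install -v -e . && cd ..
```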
If you want to deploy a multi-node environment, you need to set up the environment on each node.
## Deployment
:::{note}
In this tutorial, we assume you have downloaded the model weight to `/root/.cache/`. Feel free to change it to your own path.
:::
### Single-node Deployment
- Quantized model `DeepSeek-V3.2-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16) node.
Run the following script to start online serving.
```shell
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
In PD-disaggregated deployments, `layer_sharding` is supported only on prefill/P nodes with `kv_role="kv_producer"`. Do not enable it on decode/D nodes or `kv_role="kv_both"` nodes.
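After the server finishes loading, you can confirm it is ready by listing the served models. This is a quick check; the port and served model name follow the command above.

```shell
curl http://localhost:8000/v1/models
# The JSON response should list "deepseek_v3_2" in its "data" array
```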
### Multi-node Deployment
- `DeepSeek-V3.2-w8a8`: requires at least 2 Atlas 800 A2 (64G × 8) nodes.
Run the following scripts on two nodes respectively.
:::::{tab-set}
:sync-group: install
::::{tab-item} A3 series
:sync: A3
**Node0**
```{code-block} bash
:substitutions:
# Obtain the values below via ifconfig.
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
**Node1**
```{code-block} bash
:substitutions:
# Obtain the values below via ifconfig.
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
::::
::::{tab-item} A2 series
:sync: A2
**Node0**
```{code-block} bash
:substitutions:
# Obtain the values below via ifconfig.
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
**Node1**
```{code-block} bash
:substitutions:
# Obtain the values below via ifconfig.
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
::::
:::::
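In the multi-node setup above, only node0 exposes the API server (node1 runs with `--headless`). Once both nodes are up, you can verify the deployment from any machine that can reach node0; this quick check reuses the port from the commands above.

```shell
# Replace <node0_ip> with the node0 address used in the launch script
curl http://<node0_ip>:8077/v1/models
```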
### Prefill-Decode Disaggregation
This section shows how to deploy `DeepSeek-V3.2` in a multi-node environment with 1P1D (prefill-decode disaggregation) for better performance.
Before you start, please:
1. Prepare the script `launch_online_dp.py` on each node:
```python
import argparse
import multiprocessing
import os
import subprocess
import sys
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--dp-size",
type=int,
required=True,
help="Data parallel size."
)
parser.add_argument(
"--tp-size",
type=int,
default=1,
help="Tensor parallel size."
)
parser.add_argument(
"--dp-size-local",
type=int,
default=-1,
help="Local data parallel size."
)
parser.add_argument(
"--dp-rank-start",
type=int,
default=0,
help="Starting rank for data parallel."
)
parser.add_argument(
"--dp-address",
type=str,
required=True,
help="IP address for data parallel master node."
)
parser.add_argument(
"--dp-rpc-port",
type=str,
        default="12345",
help="Port for data parallel master node."
)
parser.add_argument(
"--vllm-start-port",
type=int,
default=9000,
help="Starting port for the engine."
)
return parser.parse_args()
args = parse_args()
dp_size = args.dp_size
tp_size = args.tp_size
dp_size_local = args.dp_size_local
if dp_size_local == -1:
dp_size_local = dp_size
dp_rank_start = args.dp_rank_start
dp_address = args.dp_address
dp_rpc_port = args.dp_rpc_port
vllm_start_port = args.vllm_start_port
def run_command(visible_devices, dp_rank, vllm_engine_port):
command = [
"bash",
"./run_dp_template.sh",
visible_devices,
str(vllm_engine_port),
str(dp_size),
str(dp_rank),
dp_address,
dp_rpc_port,
str(tp_size),
]
subprocess.run(command, check=True)
if __name__ == "__main__":
template_path = "./run_dp_template.sh"
if not os.path.exists(template_path):
print(f"Template file {template_path} does not exist.")
sys.exit(1)
processes = []
num_cards = dp_size_local * tp_size
for i in range(dp_size_local):
dp_rank = dp_rank_start + i
vllm_engine_port = vllm_start_port + i
visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
process = multiprocessing.Process(target=run_command,
args=(visible_devices, dp_rank,
vllm_engine_port))
processes.append(process)
process.start()
for process in processes:
process.join()
```
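For reference, `launch_online_dp.py` invokes `run_dp_template.sh` once per local DP rank and passes the following positional arguments, which the per-node scripts below consume as `$1` to `$7`:

```shell
# $1  visible NPU devices for this DP rank (ASCEND_RT_VISIBLE_DEVICES)
# $2  engine port              (--port)
# $3  data parallel size       (--data-parallel-size)
# $4  data parallel rank       (--data-parallel-rank)
# $5  DP master address        (--data-parallel-address)
# $6  DP RPC port              (--data-parallel-rpc-port)
# $7  tensor parallel size     (--tensor-parallel-size)
```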
2. Prepare the script `run_dp_template.sh` on each node:
1. Prefill node 0
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.105 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 32560 \
--trust-remote-code \
--max-num-seqs 64 \
--gpu-memory-utilization 0.82 \
--quantization ascend \
--enforce-eager \
--no-enable-prefix-caching \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
2. Prefill node 1
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.113 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 32560 \
--trust-remote-code \
--max-num-seqs 64 \
--gpu-memory-utilization 0.82 \
--quantization ascend \
--enforce-eager \
--no-enable-prefix-caching \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
3. Decode node 0
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.117 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
#Mooncake
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 12 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
--max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--async-scheduling \
--quantization ascend \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}' \
--additional-config '{"recompute_scheduler_enable" : true}'
```
4. Decode node 1
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.181 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
#Mooncake
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 12 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
--async-scheduling \
--max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--quantization ascend \
--kv-transfer-config \
'{"kv_connector": "MooncakeLayerwiseConnector",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}' \
--additional-config '{"recompute_scheduler_enable" : true}'
```
Once the preparation is done, you can start the servers with the following commands on each node:
Refer to [Distributed DP Server With Large-Scale Expert Parallelism](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/large_scale_ep.html) for the detailed startup procedure.
1. Prefill node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
2. Prefill node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
3. Decode node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
4. Decode node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
### Request Forwarding
To set up request forwarding, run the following script on any machine. You can get the proxy program in the repository's examples: [load_balance_proxy_layerwise_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)
```shell
unset http_proxy
unset https_proxy
python load_balance_proxy_layerwise_server_example.py \
--port 8000 \
--host 141.61.39.105 \
--prefiller-hosts \
141.61.39.105 \
141.61.39.113 \
--prefiller-ports \
9100 \
9100 \
--decoder-hosts \
141.61.39.117 \
141.61.39.117 \
141.61.39.117 \
141.61.39.117 \
141.61.39.181 \
141.61.39.181 \
141.61.39.181 \
141.61.39.181 \
--decoder-ports \
9100 9101 9102 9103 \
    9100 9101 9102 9103
```
## Functional Verification
Once your server is started, you can query the model with input prompts:
```shell
curl http://<node0_ip>:<port>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3_2",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
```
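Alternatively, you can query the server with the OpenAI Python client. This is a minimal sketch, assuming `pip install openai`; fill in the address, port, and served model name of your own deployment.

```python
from openai import OpenAI

# Point the client at the vLLM server (the API key is not validated by vLLM)
client = OpenAI(base_url="http://<node0_ip>:<port>/v1", api_key="EMPTY")

resp = client.completions.create(
    model="deepseek_v3_2",
    prompt="The future of AI is",
    max_tokens=50,
    temperature=0,
)
print(resp.choices[0].text)
```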
## Accuracy Evaluation
Here are two accuracy evaluation methods.
### Using AISBench
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the accuracy results.
### Using Language Model Evaluation Harness
As an example, take the `gsm8k` dataset as the test dataset and run an accuracy evaluation of `DeepSeek-V3.2-W8A8` in online mode.
1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation.
2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
3. After execution, you can get the accuracy results.
## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
The performance result is:

- **Hardware**: A3-752T, 4 nodes
- **Deployment**: 1P1D, prefill node: DP2 + TP16, decode node: DP8 + TP4
- **Input/Output**: 64k/3k
- **Performance**: 533 TPS, TPOT 32 ms
### Using vLLM Benchmark
Run a performance evaluation of `DeepSeek-V3.2-W8A8` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/) for more details.
There are three `vllm bench` subcommands:
- `latency` : Benchmark the latency of a single batch of requests.
- `serve` : Benchmark the online serving throughput.
- `throughput` : Benchmark offline inference throughput.
Take `serve` as an example and run the command as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
## Function Call
The function call feature is supported from v0.13.0rc1 onwards. Please use the latest version.
Refer to [DeepSeek-V3.2 Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#tool-calling-example) for details.
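As a quick illustration (a minimal sketch; the exact serving flags required for tool calling are described in the guide linked above), a tool-calling request follows the standard OpenAI chat completions format. The tool definition below is a hypothetical example.

```shell
curl http://<node0_ip>:<port>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3_2",
        "messages": [{"role": "user", "content": "What is the weather like in Beijing?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }],
        "tool_choice": "auto"
    }'
```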