# DeepSeek-V3.2

## Introduction

DeepSeek-V3.2 is a sparse attention model. Its architecture is similar to DeepSeek-V3.1, but it adds a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.

This document walks through the main verification steps for the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, and accuracy and performance evaluation.

## Supported Features

Refer to [supported features](../../user_guide/support_matrix/supported_models.md) for the model's supported feature matrix.

Refer to [feature guide](../../user_guide/feature_guide/index.md) for each feature's configuration.

## Environment Preparation
### Model Weight
- `DeepSeek-V3.2-Exp` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-BF16)
- `DeepSeek-V3.2-Exp-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/DeepSeek-V3.2-Exp-w8a8)
- `DeepSeek-V3.2` (BF16 version): requires 2 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800 A2 (64G × 8) nodes. The BF16 model weight is not available for download yet.
- `DeepSeek-V3.2-w8a8` (Quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.2-W8A8/)

It is recommended to download the model weights to a directory shared by all nodes, such as `/root/.cache/`.
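If the weights are not yet available locally, you can fetch them with the ModelScope command-line tool. The snippet below is a minimal sketch: it assumes the `modelscope` CLI is installed (`pip install modelscope`) and uses the quantized weights as an example; adjust the model ID and target directory to your setup.

```shell
# Download the quantized weights into the shared cache directory (path is an example).
modelscope download --model vllm-ascend/DeepSeek-V3.2-W8A8 \
    --local_dir /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8
```
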
### Verify Multi-node Communication (Optional)

If you want to deploy a multi-node environment, verify multi-node communication according to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication).
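The following is a minimal sketch of per-NPU network checks; it assumes `hccn_tool` is available on the host and uses the 16-NPU device range of an Atlas 800 A3 node (use `{0..7}` on A2).

```shell
# Link status (UP/DOWN) of each NPU network port.
for i in {0..15}; do hccn_tool -i $i -link -g; done
# Network health status of each NPU.
for i in {0..15}; do hccn_tool -i $i -net_health -g; done
# IP configured on each NPU port (used for inter-node communication).
for i in {0..15}; do hccn_tool -i $i -ip -g; done
```
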
### Installation
You can use our official docker image to run `DeepSeek-V3.2` directly.
:::::{tab-set}
:sync-group: install
::::{tab-item} A3 series
:sync: A3

Start the docker container on each of your nodes.

```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
::::{tab-item} A2 series
:sync: A2

Start the docker container on each of your nodes.

```{code-block} bash
:substitutions:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```
::::
:::::
In addition, if you don't want to use the docker image above, you can also build everything from source:

- Install `vllm-ascend` from source; refer to [installation](../../installation.md).

If you want to deploy a multi-node environment, you need to set up the environment on each node.

## Deployment
:::{note}
In this tutorial, we assume the model weight has been downloaded to `/root/.cache/`. Feel free to change it to your own path.
:::
### Single-node Deployment
- The quantized model `DeepSeek-V3.2-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16) node.

Run the following script to start online inference.
```shell
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
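
After the server finishes loading, a quick sanity check is to list the served models through the OpenAI-compatible API. This assumes the host and port used in the command above.

```shell
# The response should list "deepseek_v3_2" as an available model.
curl http://localhost:8000/v1/models
```
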
### Multi-node Deployment
- `DeepSeek-V3.2-w8a8`: requires at least 2 Atlas 800 A2 (64G × 8) nodes.

Run the following scripts on the two nodes respectively.
:::::{tab-set}
:sync-group: install
::::{tab-item} A3 series
:sync: A3
**Node0**
```{code-block} bash
:substitutions:
# Obtain these values via ifconfig.
# nic_name is the network interface name corresponding to local_ip on the current node.
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
**Node1**
```{code-block} bash
:substitutions:
# Obtain these values via ifconfig.
# nic_name is the network interface name corresponding to local_ip on the current node.
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
::::
::::{tab-item} A2 series
:sync: A2
**Node0**
```{code-block} bash
:substitutions:
# Obtain these values via ifconfig.
# nic_name is the network interface name corresponding to local_ip on the current node.
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
**Node1**
```{code-block} bash
:substitutions:
# Obtain these values via ifconfig.
# nic_name is the network interface name corresponding to local_ip on the current node.
nic_name="xxx"
local_ip="xxx"
# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3_2 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```
::::
:::::
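
Once both nodes are up, you can run a quick readiness check against the master node. This is a minimal sketch; the port follows the Node0 command above, and `<node0_ip>` is a placeholder for your master node IP.

```shell
# Liveness probe of the API server.
curl http://<node0_ip>:8077/health
# The response should list "deepseek_v3_2" as an available model.
curl http://<node0_ip>:8077/v1/models
```
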
### Prefill-Decode Disaggregation
This section shows how to deploy `DeepSeek-V3.2` in a multi-node environment with 1P1D (one prefill group, one decode group) disaggregation for better performance.

Before you start, please:

1. Prepare the script `launch_online_dp.py` on each node:

```python
import argparse
import multiprocessing
import os
import subprocess
import sys


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dp-size",
        type=int,
        required=True,
        help="Data parallel size."
    )
    parser.add_argument(
        "--tp-size",
        type=int,
        default=1,
        help="Tensor parallel size."
    )
    parser.add_argument(
        "--dp-size-local",
        type=int,
        default=-1,
        help="Local data parallel size."
    )
    parser.add_argument(
        "--dp-rank-start",
        type=int,
        default=0,
        help="Starting rank for data parallel."
    )
    parser.add_argument(
        "--dp-address",
        type=str,
        required=True,
        help="IP address for data parallel master node."
    )
    parser.add_argument(
        "--dp-rpc-port",
        type=str,
        default="12345",
        help="Port for data parallel master node."
    )
    parser.add_argument(
        "--vllm-start-port",
        type=int,
        default=9000,
        help="Starting port for the engine."
    )
    return parser.parse_args()


args = parse_args()
dp_size = args.dp_size
tp_size = args.tp_size
dp_size_local = args.dp_size_local
if dp_size_local == -1:
    dp_size_local = dp_size
dp_rank_start = args.dp_rank_start
dp_address = args.dp_address
dp_rpc_port = args.dp_rpc_port
vllm_start_port = args.vllm_start_port


def run_command(visible_devices, dp_rank, vllm_engine_port):
    # Launch one vLLM engine instance through the per-node template script.
    command = [
        "bash",
        "./run_dp_template.sh",
        visible_devices,
        str(vllm_engine_port),
        str(dp_size),
        str(dp_rank),
        dp_address,
        dp_rpc_port,
        str(tp_size),
    ]
    subprocess.run(command, check=True)


if __name__ == "__main__":
    template_path = "./run_dp_template.sh"
    if not os.path.exists(template_path):
        print(f"Template file {template_path} does not exist.")
        sys.exit(1)
    processes = []
    num_cards = dp_size_local * tp_size
    for i in range(dp_size_local):
        dp_rank = dp_rank_start + i
        vllm_engine_port = vllm_start_port + i
        visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
        process = multiprocessing.Process(target=run_command,
                                          args=(visible_devices, dp_rank,
                                                vllm_engine_port))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()
```
2. Prepare the script `run_dp_template.sh` on each node. Each template receives seven positional arguments from `launch_online_dp.py`: visible devices, engine port, data-parallel size, data-parallel rank, data-parallel master address, data-parallel RPC port, and tensor-parallel size.
1. Prefill node 0
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.105 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 32550 \
--trust-remote-code \
--max-num-seqs 64 \
--gpu-memory-utilization 0.82 \
--quantization ascend \
--enforce-eager \
--no-enable-prefix-caching \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
2. Prefill node 1
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.113 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export ASCEND_RT_VISIBLE_DEVICES=$1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 32550 \
--trust-remote-code \
--max-num-seqs 64 \
--gpu-memory-utilization 0.82 \
--quantization ascend \
--enforce-eager \
--no-enable-prefix-caching \
--additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}'
```
3. Decode node 0
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.117 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
#Mooncake
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 12 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
--max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--async-scheduling \
--quantization ascend \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}' \
--additional-config '{"recompute_scheduler_enable" : true}'
```
4. Decode node 1
```shell
nic_name="enp48s3u1u1" # change to your own nic name
local_ip=141.61.39.181 # change to your own ip
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
#Mooncake
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=256
export ASCEND_AGGREGATE_ENABLE=1
export ASCEND_TRANSPORT_PRINT=1
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300000
export TASK_QUEUE_ENABLE=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
--profiler-config \
'{"profiler": "torch",
"torch_profiler_dir": "./vllm_profile",
"torch_profiler_with_stack": false}' \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 68000 \
--max-num-batched-tokens 12 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3, 6, 9, 12]}' \
--trust-remote-code \
--async-scheduling \
--max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--no-enable-prefix-caching \
--quantization ascend \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 2,
"tp_size": 16
},
"decode": {
"dp_size": 8,
"tp_size": 4
}
}
}' \
--additional-config '{"recompute_scheduler_enable" : true}'
```
Once the preparation is done, you can start the server with the following command on each node:
1. Prefill node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 0 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
2. Prefill node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 2 --tp-size 16 --dp-size-local 1 --dp-rank-start 1 --dp-address 141.61.39.105 --dp-rpc-port 12890 --vllm-start-port 9100
```
3. Decode node 0
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 0 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
4. Decode node 1
```shell
# change ip to your own
python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-start 4 --dp-address 141.61.39.117 --dp-rpc-port 12777 --vllm-start-port 9100
```
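
Before wiring up the request-forwarding proxy, you can optionally confirm that each engine instance is reachable. This is a minimal sketch; the IPs and the engine start port follow the launch commands above, and the served model name is `dsv3` in the template scripts.

```shell
# Prefill node 0, first engine.
curl http://141.61.39.105:9100/v1/models
# Decode node 0, first of its four local engines (ports 9100-9103).
curl http://141.61.39.117:9100/v1/models
```
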
### Request Forwarding
To set up request forwarding, run the following script on any machine. You can get the proxy program from the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
```shell
unset http_proxy
unset https_proxy
python load_balance_proxy_server_example.py \
--port 8000 \
--host 0.0.0.0 \
--prefiller-hosts \
141.61.39.105 \
141.61.39.113 \
--prefiller-ports \
9100 \
9100 \
--decoder-hosts \
141.61.39.117 \
141.61.39.117 \
141.61.39.117 \
141.61.39.117 \
141.61.39.181 \
141.61.39.181 \
141.61.39.181 \
141.61.39.181 \
--decoder-ports \
9100 9101 9102 9103 \
9100 9101 9102 9103
```
## Functional Verification

Once your server is started, you can query the model with input prompts:

```shell
curl http://<node0_ip>:<port>/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek_v3_2",
"prompt": "The future of AI is",
"max_completion_tokens": 50,
"temperature": 0
}'
```
## Accuracy Evaluation
Here are two accuracy evaluation methods.
### Using AISBench
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.

2. After execution, you can get the accuracy results.

### Using Language Model Evaluation Harness
As an example, take the `gsm8k` dataset as the test dataset and run an accuracy evaluation of `DeepSeek-V3.2-W8A8` in online mode.

1. Refer to [Using lm_eval](../../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation.

2. Run `lm_eval` to execute the accuracy evaluation.
```shell
lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./
```
3. After execution, you can get the accuracy results.

## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.

The performance result is:

- **Hardware**: A3-752T, 4 nodes
- **Deployment**: 1P1D, prefill: DP2 + TP16, decode: DP8 + TP4
- **Input/Output**: 64k / 3k
- **Performance**: 533 TPS, TPOT 32 ms

### Using vLLM Benchmark
Run a performance evaluation of `DeepSeek-V3.2-W8A8` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take `serve` as an example. Run the command as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
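
Besides `serve`, an offline throughput run can be launched in a similar way. The snippet below is a sketch only: the input/output lengths and prompt count are illustrative, and you may need to add the same quantization and parallelism flags used in the deployment sections above.

```shell
vllm bench throughput --model /root/.cache/Eco-Tech/DeepSeek-V3.2-w8a8-mtp-QuaRot \
    --dataset-name random --input-len 1024 --output-len 512 --num-prompts 64
```
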
## Function Call

The function call feature is supported since v0.13.0rc1. Please use the latest version.

Refer to [DeepSeek-V3.2 Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#tool-calling-example) for details.
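
As a quick illustration, the request below sends a tool-calling query to the OpenAI-compatible chat endpoint. It is a minimal sketch: the `get_weather` tool, host, and port are placeholders, and the server must be started with tool calling enabled as described in the linked guide.

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3_2",
        "messages": [{"role": "user", "content": "What is the weather like in Beijing?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }]
    }'
```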