diff --git a/docs/source/tutorials/DeepSeek-R1.md b/docs/source/tutorials/DeepSeek-R1.md
new file mode 100644
index 00000000..d198ef0d
--- /dev/null
+++ b/docs/source/tutorials/DeepSeek-R1.md
@@ -0,0 +1,290 @@
+# DeepSeek-R1
+
+## Introduction
+
+DeepSeek-R1 is a high-performance Mixture-of-Experts (MoE) large language model developed by DeepSeek. It excels at complex logical reasoning, mathematical problem-solving, and code generation. By dynamically activating its expert networks, it delivers exceptional performance while maintaining computational efficiency. Building upon R1, DeepSeek-R1-W8A8 is a fully quantized version of the model. It employs 8-bit integer (INT8) quantization for both weights and activations, which significantly reduces the model's memory footprint and computational requirements, enabling more efficient deployment in resource-constrained environments.
+This tutorial takes the DeepSeek-R1-W8A8 version as an example to introduce the deployment of the R1 series models.
+
+## Supported Features
+
+Refer to [supported features](../user_guide/support_matrix/supported_models.md) for the model's supported feature matrix.
+
+Refer to the [feature guide](../user_guide/feature_guide/index.md) for each feature's configuration.
+
+## Environment Preparation
+
+### Model Weight
+
+- `DeepSeek-R1-W8A8` (quantized version): requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes. [Download model weight](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8)
+
+It is recommended to download the model weights to a directory shared across all nodes.
+
+### Verify Multi-node Communication (Optional)
+
+If you want to deploy a multi-node environment, verify multi-node communication first according to [verify multi-node communication environment](../installation.md#verify-multi-node-communication).
+
+### Installation
+
+You can use our official Docker image to run `DeepSeek-R1-W8A8` directly.
+
+Select an image based on your machine type and start it on your node; refer to [using docker](../installation.md#set-up-using-docker).
+
+```{code-block} bash
+  :substitutions:
+# Update --device according to your device (Atlas A2: /dev/davinci[0-7], Atlas A3: /dev/davinci[0-15]).
+# Update the vllm-ascend image according to your environment.
+# Note: you should download the model weights to /root/.cache in advance.
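+# Assumption (not part of the original command below): no weight directory is mounted into the
+# container here. If you downloaded the weights to /root/.cache on the host, you will likely
+# also want to add a bind mount such as
+#   -v /root/.cache:/root/.cache \
+# so the weights are visible inside the container; adjust the path if you stored them elsewhere.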
+# Update the vllm-ascend image
+export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+export NAME=vllm-ascend
+
+# Run the container using the defined variables
+# Note: if you are running a bridge network with Docker, please expose the ports needed for multi-node communication in advance.
+docker run --rm \
+--name $NAME \
+--net=host \
+--shm-size=1g \
+--device /dev/davinci0 \
+--device /dev/davinci1 \
+--device /dev/davinci2 \
+--device /dev/davinci3 \
+--device /dev/davinci4 \
+--device /dev/davinci5 \
+--device /dev/davinci6 \
+--device /dev/davinci7 \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /etc/hccn.conf:/etc/hccn.conf \
+-v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
+-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-it $IMAGE bash
+```
+
+If you want to deploy a multi-node environment, you need to set up the environment on each node.
+
+## Deployment
+
+### Service-oriented Deployment
+
+- `DeepSeek-R1-W8A8`: requires 1 Atlas 800 A3 (64G × 16) node or 2 Atlas 800 A2 (64G × 8) nodes.
+
+:::::{tab-set}
+:sync-group: install
+
+::::{tab-item} DeepSeek-R1-W8A8 A3 series
+
+```shell
+#!/bin/sh
+
+# Obtain these values through ifconfig.
+# nic_name is the network interface name corresponding to local_ip of the current node.
+nic_name="xxxx"
+local_ip="xxxx"
+
+# Use AI Vector (AIV) cores for HCCL operator expansion.
+export HCCL_OP_EXPANSION_MODE="AIV"
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export VLLM_ASCEND_ENABLE_MLAPO=1
+export VLLM_USE_MODELSCOPE=True
+
+vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --data-parallel-size 4 \
+  --tensor-parallel-size 4 \
+  --quantization ascend \
+  --seed 1024 \
+  --served-model-name deepseek_r1 \
+  --enable-expert-parallel \
+  --max-num-seqs 16 \
+  --max-model-len 16384 \
+  --max-num-batched-tokens 4096 \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.92 \
+  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
+  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
+```
+
+::::
+::::{tab-item} DeepSeek-R1-W8A8 A2 series
+
+Run the following scripts on the two nodes respectively.
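+
+These flags spread a single serving instance over the two nodes: `--data-parallel-size 4` combined with `--tensor-parallel-size 4` uses 16 NPUs in total, `--data-parallel-size-local 2` places two data-parallel ranks (each using 4 NPUs for tensor parallelism) on every node, and node 1 runs in headless mode starting from `--data-parallel-start-rank 2`. Before launching, you can optionally confirm that each node exposes enough NPUs; the check below is only a sketch that is not part of the original scripts and assumes the devices appear as `/dev/davinci*` inside the container.
+
+```shell
+# Optional pre-flight check, run on each node before `vllm serve`:
+# each node needs data-parallel-size-local * tensor-parallel-size NPUs.
+DP_LOCAL=2
+TP=4
+NEEDED=$((DP_LOCAL * TP))
+FOUND=$(ls /dev/davinci[0-9]* 2>/dev/null | wc -l)
+echo "NPUs needed per node: ${NEEDED}, visible: ${FOUND}"
+if [ "${FOUND}" -lt "${NEEDED}" ]; then
+  echo "WARNING: fewer NPUs visible than required; check the --device options used to start the container."
+fi
+```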
+
+**Node 0**
+
+```shell
+#!/bin/sh
+
+# Obtain these values through ifconfig.
+# nic_name is the network interface name corresponding to local_ip of the current node.
+nic_name="xxxx"
+local_ip="xxxx"
+
+# Use AI Vector (AIV) cores for HCCL operator expansion.
+export HCCL_OP_EXPANSION_MODE="AIV"
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+export VLLM_ASCEND_ENABLE_MLAPO=1
+export HCCL_INTRA_PCIE_ENABLE=1
+export HCCL_INTRA_ROCE_ENABLE=0
+export VLLM_USE_MODELSCOPE=True
+
+vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --data-parallel-size 4 \
+  --data-parallel-size-local 2 \
+  --data-parallel-address $local_ip \
+  --data-parallel-rpc-port 13389 \
+  --tensor-parallel-size 4 \
+  --quantization ascend \
+  --seed 1024 \
+  --served-model-name deepseek_r1 \
+  --enable-expert-parallel \
+  --max-num-seqs 16 \
+  --max-model-len 16384 \
+  --max-num-batched-tokens 4096 \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.94 \
+  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
+  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
+```
+
+**Node 1**
+
+```shell
+#!/bin/sh
+
+# Obtain these values through ifconfig.
+# nic_name is the network interface name corresponding to local_ip of the current node.
+nic_name="xxxx"
+local_ip="xxxx"
+node0_ip="xxxx" # same as the local_ip of node 0
+
+# Use AI Vector (AIV) cores for HCCL operator expansion.
+export HCCL_OP_EXPANSION_MODE="AIV"
+export HCCL_IF_IP=$local_ip
+export GLOO_SOCKET_IFNAME=$nic_name
+export TP_SOCKET_IFNAME=$nic_name
+export HCCL_SOCKET_IFNAME=$nic_name
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+export VLLM_ASCEND_ENABLE_MLAPO=1
+export HCCL_INTRA_PCIE_ENABLE=1
+export HCCL_INTRA_ROCE_ENABLE=0
+export VLLM_USE_MODELSCOPE=True
+
+vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --headless \
+  --data-parallel-size 4 \
+  --data-parallel-size-local 2 \
+  --data-parallel-start-rank 2 \
+  --data-parallel-address $node0_ip \
+  --data-parallel-rpc-port 13389 \
+  --tensor-parallel-size 4 \
+  --quantization ascend \
+  --seed 1024 \
+  --served-model-name deepseek_r1 \
+  --enable-expert-parallel \
+  --max-num-seqs 16 \
+  --max-model-len 16384 \
+  --max-num-batched-tokens 4096 \
+  --trust-remote-code \
+  --gpu-memory-utilization 0.94 \
+  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
+  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
+```
+
+::::
+:::::
+
+### Prefill-Decode Disaggregation
+
+We recommend using Mooncake for deployment: [Mooncake](./multi_node_pd_disaggregation_mooncake.md).
+
+This solution has been tested and demonstrates excellent performance.
+
+## Functional Verification
+
+Once your server is started, you can query the model with input prompts (replace `<server_ip>` and `<port>` with the address and port of your service):
+
+```shell
+curl http://<server_ip>:<port>/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek_r1",
+    "prompt": "The future of AI is",
+    "max_tokens": 50,
+    "temperature": 0
+  }'
+```
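+
+DeepSeek-R1 is a reasoning model, so you can also exercise the OpenAI-compatible chat endpoint. The request below is a sketch rather than part of the original tutorial: the `<server_ip>`/`<port>` placeholders and the prompt are assumptions, and the model name must match the `--served-model-name` used above.
+
+```shell
+curl http://<server_ip>:<port>/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek_r1",
+    "messages": [{"role": "user", "content": "What is 17 * 24? Think step by step."}],
+    "max_tokens": 256,
+    "temperature": 0
+  }'
+```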
+
+## Accuracy Evaluation
+
+Here are two accuracy evaluation methods.
+
+### Using AISBench
+
+1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
+
+2. After execution, you will get the results. The following results of `DeepSeek-R1-W8A8` on `vllm-ascend:0.11.0rc2` are for reference only.
+
+| dataset | version | metric | mode | vllm-api-general-chat |
+| ----- | ----- | ----- | ----- | ----- |
+| aime2024dataset | - | accuracy | gen | 80.00 |
+| gpqadataset | - | accuracy | gen | 72.22 |
+
+### Using Language Model Evaluation Harness
+
+As an example, take the `gsm8k` dataset as the test dataset and run an accuracy evaluation of `DeepSeek-R1-W8A8` in online mode.
+
+1. Refer to [Using lm_eval](../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation.
+
+2. Run `lm_eval` to execute the accuracy evaluation (replace `<server_ip>` and `<port>` with the address of your running service).
+
+```shell
+lm_eval \
+  --model local-completions \
+  --model_args model=path/DeepSeek-R1-W8A8,base_url=http://<server_ip>:<port>/v1/completions,tokenized_requests=False,trust_remote_code=True \
+  --tasks gsm8k \
+  --output_path ./
+```
+
+3. After execution, you can get the result.
+
+## Performance
+
+### Using AISBench
+
+Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
+
+### Using vLLM Benchmark
+
+The following runs a performance evaluation of `DeepSeek-R1-W8A8` as an example.
+
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
+
+There are three `vllm bench` subcommands:
+- `latency`: Benchmark the latency of a single batch of requests.
+- `serve`: Benchmark the online serving throughput.
+- `throughput`: Benchmark offline inference throughput.
+
+Take `serve` as an example and run the following command.
+
+```shell
+export VLLM_USE_MODELSCOPE=true
+vllm bench serve --model path/DeepSeek-R1-W8A8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+```
+
+After several minutes, you will get the performance evaluation result.
diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index 7309656a..c3032a8a 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -18,6 +18,7 @@ Qwen3-Dense
 multi_npu_qwen3_moe
 multi_npu_quantization
 single_node_300i
+DeepSeek-R1.md
 DeepSeek-V3.1.md
 DeepSeek-V3.2-Exp.md
 Qwen3-235B-A22B.md