[feat] add ascend readme and docker release (#8700)

Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com> Signed-off-by: lichaoran <pkwarcraft@gmail.com> Co-authored-by: Even Zhou <even.y.zhou@outlook.com> Co-authored-by: ronnie_zheng <zl19940307@163.com>
2025-08-13 04:25:42 +08:00
parent 305b27c124
commit 2ecbd8b8bf
7 changed files with 467 additions and 18 deletions
--- a/docs/basic_usage/deepseek.md
+++ b/docs/basic_usage/deepseek.md
@@ -24,6 +24,7 @@ To run DeepSeek V3/R1 models, the requirements are as follows:
 | **Quantized weights (int8)** | 16 x A100/800 |
 | | 32 x L40S |
 | | Xeon 6980P CPU |
+| | 2 x Atlas 800I A3 |

 <style>
 .md-typeset__table {
@@ -64,6 +65,7 @@ Detailed commands for reference:
 - [16 x A100 (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-16-a100a800-with-int8-quantization)
 - [32 x L40S (int8)](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-32-l40s-with-int8-quantization)
 - [Xeon 6980P CPU](../platforms/cpu_server.md#example-running-deepseek-r1)
+- [2 x Atlas 800I A3 (int8)](../platforms/ascend_npu.md#running-deepseek-v3)

 ### Download Weights
 If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded. Please refer to [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#61-inference-with-deepseek-infer-demo-example-only) official guide to download the weights.
--- a/docs/platforms/ascend_npu.md
+++ b/docs/platforms/ascend_npu.md
@@ -1,7 +1,206 @@
-# Ascend NPUs
+# SGLang on Ascend NPUs

-## Install
-TODO
+You can install SGLang using any of the methods below. Please go through `System Settings` section to ensure the clusters are roaring at max performance. Feel free to leave an issue [here at sglang](https://github.com/sgl-project/sglang/issues) if you encounter any issues or have any problems.
+
+## System Settings
+
+### CPU performance power scheme
+
+The default power scheme on Ascend hardware is `ondemand` which could affect performance, changing it to `performance` is recommended.
+
+```shell
+echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+
+# Make sure changes are applied successfully
+cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # shows performance
+```
+
+### Disable NUMA balancing
+
+```shell
+sudo sysctl -w kernel.numa_balancing=0
+
+# Check
+cat /proc/sys/kernel/numa_balancing # shows 0
+```
+
+### Prevent swapping out system memory
+
+```shell
+sudo sysctl -w vm.swappiness=10
+
+# Check
+cat /proc/sys/vm/swappiness # shows 10
+```
+
+## Installing SGLang
+
+### Method 1: Installing from source with prerequisites
+
+#### Python Version
+
+Only `python==3.11` is supported currently. If you don't want to break system pre-installed python, try installing with [conda](https://github.com/conda/conda).
+
+```shell
+conda create --name sglang_npu python=3.11
+conda activate sglang_npu
+```
+
+#### MemFabric Adaptor
+
+_TODO: MemFabric is still a working project yet open sourced til August/September, 2025. We will release it as prebuilt wheel package for now._
+
+_Notice: Prebuilt wheel package is based on `aarch64`, please leave an issue [here at sglang](https://github.com/sgl-project/sglang/issues) to let us know the requests for `amd64` build._
+
+MemFabric Adaptor is a drop-in replacement of Mooncake Transfer Engine that enables KV cache transfer on Ascend NPU clusters.
+
+```shell
+MF_WHL_NAME="mf_adapter-1.0.0-cp311-cp311-linux_aarch64.whl"
+MEMFABRIC_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/${MF_WHL_NAME}"
+wget -O "${MF_WHL_NAME}" "${MEMFABRIC_URL}" && pip install "./${MF_WHL_NAME}"
+```
+
+#### Pytorch and Pytorch Framework Adaptor on Ascend
+
+Only `torch==2.6.0` is supported currently due to NPUgraph and Triton-on-Ascend's limitation, however a more generalized version will be release by the end of September, 2025.
+
+```shell
+PYTORCH_VERSION=2.6.0
+TORCHVISION_VERSION=0.21.0
+pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu
+
+PTA_VERSION="v7.1.0.1-pytorch2.6.0"
+PTA_NAME="torch_npu-2.6.0.post1-cp311-cp311-manylinux_2_28_aarch64.whl"
+PTA_URL="https://gitee.com/ascend/pytorch/releases/download/${PTA_VERSION}/${PTA_WHL_NAME}"
+wget -O "${PTA_NAME}" "${PTA_URL}" && pip install "./${PTA_NAME}"
+```
+
+#### vLLM
+
+vLLM is still a major prerequisite on Ascend NPU. Because of `torch==2.6.0` limitation, only vLLM v0.8.5 is supported.
+
+```shell
+VLLM_TAG=v0.8.5
+git clone --depth 1 https://github.com/vllm-project/vllm.git --branch $VLLM_TAG
+(cd vllm && VLLM_TARGET_DEVICE="empty" pip install -v -e .)
+```
+
+#### Triton on Ascend
+
+_Notice:_ We recommend installing triton-ascend from source due to its rapid development, the version on PYPI can't keep up for know. This problem will be solved on Sep. 2025, afterwards `pip install` would be the one and only installing method.
+
+Please follow Triton-on-Ascend's [installation guide from source](https://gitee.com/ascend/triton-ascend#2%E6%BA%90%E4%BB%A3%E7%A0%81%E5%AE%89%E8%A3%85-triton-ascend) to install the latest `triton-ascend` package.
+
+#### DeepEP-compatible Library
+
+We are also providing a DeepEP-compatible Library as a drop-in replacement of deepseek-ai's DeepEP library, check the [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md).
+
+#### Installing SGLang from source
+
+```shell
+# Use the last release branch
+git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
+cd sglang
+
+pip install --upgrade pip
+pip install -e python[srt_npu]
+```
+
+### Method 2: Using docker
+
+__Notice:__ `--privileged` and `--network=host` are required by RDMA, which is typically needed by Ascend NPU clusters.
+
+__Notice:__ The following docker command is based on Atlas 800I A3 machines. If you are using Atlas 800I A2, make sure only `davinci[0-7]` are mapped into container.
+
+```shell
+# Clone the SGLang repository
+git clone https://github.com/sgl-project/sglang.git
+cd sglang/docker
+
+# Build the docker image
+docker build -t sglang-npu:main -f Dockerfile.npu .
+
+alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
+    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
+    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
+    --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
+    --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
+    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
+    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
+    --volume /etc/ascend_install.info:/etc/ascend_install.info \
+    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'
+
+drun --env "HF_TOKEN=<secret>" \
+    sglang-npu:main \
+    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend --host 0.0.0.0 --port 30000
+```

 ## Examples
-TODO
+
+### Running DeepSeek-V3
+
+Running DeepSeek with PD disaggregation on 2 x Atlas 800I A3.
+Model weights could be found [here](https://modelers.cn/models/State_Cloud/Deepseek-R1-bf16-hfd-w8a8).
+
+Prefill:
+
+```shell
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:<PORT>"
+
+drun sglang-npu:main \
+    python3 -m sglang.launch_server --model-path State_Cloud/DeepSeek-R1-bf16-hfd-w8a8 \
+    --trust-remote-code \
+    --attention-backend ascend \
+    --mem-fraction-static 0.8 \
+    --quantization w8a8_int8 \
+    --tp-size 16 \
+    --dp-size 1 \
+    --nnodes 1 \
+    --node-rank 0 \
+    --disaggregation-mode prefill \
+    --disaggregation-bootstrap-port 6657 \
+    --disaggregation-transfer-backend ascend \
+    --dist-init-addr <PREFILL_HOST_IP>:6688 \
+    --host <PREFILL_HOST_IP> \
+    --port 8000
+```
+
+Decode:
+
+```shell
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:<PORT>"
+export HCCL_BUFFSIZE=200
+export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24
+
+drun sglang-npu:main \
+    python3 -m sglang.launch_server --model-path State_Cloud/DeepSeek-R1-bf16-hfd-w8a8 \
+    --trust-remote-code \
+    --attention-backend ascend \
+    --mem-fraction-static 0.8 \
+    --quantization w8a8_int8 \
+    --enable-deepep-moe \
+    --deepep-mode low_latency \
+    --tp-size 16 \
+    --dp-size 1 \
+    --ep-size 16 \
+    --nnodes 1 \
+    --node-rank 0 \
+    --disaggregation-mode decode \
+    --disaggregation-transfer-backend ascend \
+    --dist-init-addr <DECODE_HOST_IP>:6688 \
+    --host <DECODE_HOST_IP> \
+    --port 8001
+```
+
+Mini_LB:
+
+```shell
+drun sglang-npu:main \
+    python -m sglang.srt.disaggregation.launch_lb \
+    --prefill http://<PREFILL_HOST_IP>:8000 \
+    --decode http://<DECODE_HOST_IP>:8001 \
+    --host 127.0.0.1 --port 5000
+```