From e538fa6f9c4ed97cb48cab8e12a7fe9ffc31b176 Mon Sep 17 00:00:00 2001 From: wangxiyuan Date: Thu, 11 Dec 2025 20:53:13 +0800 Subject: [PATCH] [Doc] Update tutorial index (#4920) Update tutorial index and remove useless doc - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9 Signed-off-by: wangxiyuan --- .../feature_guide/disaggregated_prefill.md | 2 +- docs/source/faqs.md | 2 +- .../{single_node_300i.md => 310p.md} | 2 +- docs/source/tutorials/DeepSeek-R1.md | 2 +- docs/source/tutorials/DeepSeek-V3.1.md | 4 +- ...{DeepSeek-V3.2-Exp.md => DeepSeek-V3.2.md} | 2 +- ...imi-k2-thinking.md => Kimi-K2-Thinking.md} | 2 +- docs/source/tutorials/Qwen-VL-Dense.md | 2 +- .../tutorials/{Qwen2.5.md => Qwen2.5-7B.md} | 2 +- ...ngle_npu_qwen2_audio.md => Qwen2_audio.md} | 2 +- docs/source/tutorials/Qwen3-235B-A22B.md | 4 +- ...ulti_npu_qwen3_moe.md => Qwen3-30B-A3B.md} | 2 +- ...le_npu_qwen3_w4a4.md => Qwen3-32B-W4A4.md} | 2 +- ...qwen3_quantization.md => Qwen3-8B-W4A8.md} | 2 +- docs/source/tutorials/Qwen3-Dense.md | 2 +- ...{multi_npu_qwen3_next.md => Qwen3-Next.md} | 2 +- ..._qwen3_embedding.md => Qwen3_embedding.md} | 2 +- docs/source/tutorials/index.md | 40 ++-- docs/source/tutorials/multi_node.md | 210 ----------------- docs/source/tutorials/multi_node_kimi.md | 155 ------------- docs/source/tutorials/multi_node_qwen3vl.md | 163 -------------- docs/source/tutorials/multi_npu.md | 108 --------- .../tutorials/multi_npu_quantization.md | 138 ------------ ... pd_disaggregation_mooncake_multi_node.md} | 2 +- ...pd_disaggregation_mooncake_single_node.md} | 2 +- .../tutorials/{multi_node_ray.md => ray.md} | 2 +- docs/source/tutorials/single_npu.md | 213 ------------------ docs/source/user_guide/release_notes.md | 2 +- .../support_matrix/supported_models.md | 2 +- 29 files changed, 41 insertions(+), 1034 deletions(-) rename docs/source/tutorials/{single_node_300i.md => 310p.md} (99%) rename docs/source/tutorials/{DeepSeek-V3.2-Exp.md => DeepSeek-V3.2.md} (99%) rename docs/source/tutorials/{multi_npu_kimi-k2-thinking.md => Kimi-K2-Thinking.md} (95%) rename docs/source/tutorials/{Qwen2.5.md => Qwen2.5-7B.md} (99%) rename docs/source/tutorials/{single_npu_qwen2_audio.md => Qwen2_audio.md} (99%) rename docs/source/tutorials/{multi_npu_qwen3_moe.md => Qwen3-30B-A3B.md} (99%) rename docs/source/tutorials/{single_npu_qwen3_w4a4.md => Qwen3-32B-W4A4.md} (99%) rename docs/source/tutorials/{single_npu_qwen3_quantization.md => Qwen3-8B-W4A8.md} (99%) rename docs/source/tutorials/{multi_npu_qwen3_next.md => Qwen3-Next.md} (99%) rename docs/source/tutorials/{single_npu_qwen3_embedding.md => Qwen3_embedding.md} (99%) delete mode 100644 docs/source/tutorials/multi_node.md delete mode 100644 docs/source/tutorials/multi_node_kimi.md delete mode 100644 docs/source/tutorials/multi_node_qwen3vl.md delete mode 100644 docs/source/tutorials/multi_npu.md delete mode 100644 docs/source/tutorials/multi_npu_quantization.md rename docs/source/tutorials/{multi_node_pd_disaggregation_mooncake.md => pd_disaggregation_mooncake_multi_node.md} (99%) rename docs/source/tutorials/{single_node_pd_disaggregation_mooncake.md => pd_disaggregation_mooncake_single_node.md} (98%) rename docs/source/tutorials/{multi_node_ray.md => ray.md} (99%) delete mode 100644 docs/source/tutorials/single_npu.md diff --git a/docs/source/developer_guide/feature_guide/disaggregated_prefill.md b/docs/source/developer_guide/feature_guide/disaggregated_prefill.md index 46d3dbe9..0a8657a0 100644 --- 
a/docs/source/developer_guide/feature_guide/disaggregated_prefill.md
+++ b/docs/source/developer_guide/feature_guide/disaggregated_prefill.md
@@ -19,7 +19,7 @@ vLLM Ascend currently supports two types of connectors for handling KV cache man
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.

For step-by-step deployment and configuration, refer to the following guide:
-[https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html)
+[https://vllm-ascend.readthedocs.io/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://vllm-ascend.readthedocs.io/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html)

---

diff --git a/docs/source/faqs.md b/docs/source/faqs.md
index 824e28e7..a70e6643 100644
--- a/docs/source/faqs.md
+++ b/docs/source/faqs.md
@@ -104,7 +104,7 @@ vllm-ascend is a hardware plugin for vLLM. Basically, the version of vllm-ascend

### 8. Does vllm-ascend support Prefill Disaggregation feature?

-Yes, vllm-ascend supports Prefill Disaggregation feature with Mooncake backend. Take [official tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html) for example.
+Yes, vllm-ascend supports the Prefill Disaggregation feature with the Mooncake backend. See the [official tutorial](https://vllm-ascend.readthedocs.io/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html) for an example.

### 9. Does vllm-ascend support quantization method?

diff --git a/docs/source/tutorials/single_node_300i.md b/docs/source/tutorials/310p.md
similarity index 99%
rename from docs/source/tutorials/single_node_300i.md
rename to docs/source/tutorials/310p.md
index 7619ef16..ad7be5a2 100644
--- a/docs/source/tutorials/single_node_300i.md
+++ b/docs/source/tutorials/310p.md
@@ -1,4 +1,4 @@
-# Single Node (Atlas 300I Series)
+# Atlas 300I

```{note}
1. This Atlas 300I series is currently experimental. In future versions, there may be behavioral changes related to model coverage and performance improvement.
diff --git a/docs/source/tutorials/DeepSeek-R1.md b/docs/source/tutorials/DeepSeek-R1.md
index d198ef0d..3329de55 100644
--- a/docs/source/tutorials/DeepSeek-R1.md
+++ b/docs/source/tutorials/DeepSeek-R1.md
@@ -212,7 +212,7 @@ vllm serve vllm-ascend/DeepSeek-R1-W8A8 \

### Prefill-Decode Disaggregation

-We recommend using Mooncake for deployment: [Mooncake](./multi_node_pd_disaggregation_mooncake.md).
+We recommend using Mooncake for deployment: [Mooncake](./pd_disaggregation_mooncake_multi_node.md).

This solution has been tested and demonstrates excellent performance.

diff --git a/docs/source/tutorials/DeepSeek-V3.1.md b/docs/source/tutorials/DeepSeek-V3.1.md
index ccaf4ce7..ec0ee08d 100644
--- a/docs/source/tutorials/DeepSeek-V3.1.md
+++ b/docs/source/tutorials/DeepSeek-V3.1.md
@@ -1,4 +1,4 @@
-# DeepSeek-V3.1
+# DeepSeek-V3/3.1

## Introduction

@@ -251,7 +251,7 @@ vllm serve /weights/DeepSeek-V3.1_w8a8mix_mtp \

### Prefill-Decode Disaggregation

-We recommend using Mooncake for deployment: [Mooncake](./multi_node_pd_disaggregation_mooncake.md).
+We recommend using Mooncake for deployment: [Mooncake](./pd_disaggregation_mooncake_multi_node.md).

Taking Atlas 800 A3 (64G × 16) as an example, we recommend deploying 2P1D (4 nodes) rather than 1P1D (2 nodes), because there is not enough NPU memory to serve high concurrency in the 1P1D case.
- `DeepSeek-V3.1_w8a8mix_mtp 2P1D Layerwise` requires 4 Atlas 800 A3 (64G × 16) nodes.
diff --git a/docs/source/tutorials/DeepSeek-V3.2-Exp.md b/docs/source/tutorials/DeepSeek-V3.2.md
similarity index 99%
rename from docs/source/tutorials/DeepSeek-V3.2-Exp.md
rename to docs/source/tutorials/DeepSeek-V3.2.md
index 132e7efc..fd53f038 100644
--- a/docs/source/tutorials/DeepSeek-V3.2-Exp.md
+++ b/docs/source/tutorials/DeepSeek-V3.2.md
@@ -1,4 +1,4 @@
-# DeepSeek-V3.2-Exp
+# DeepSeek-V3.2

## Introduction

diff --git a/docs/source/tutorials/multi_npu_kimi-k2-thinking.md b/docs/source/tutorials/Kimi-K2-Thinking.md
similarity index 95%
rename from docs/source/tutorials/multi_npu_kimi-k2-thinking.md
rename to docs/source/tutorials/Kimi-K2-Thinking.md
index 6a776f45..0c1f708d 100644
--- a/docs/source/tutorials/multi_npu_kimi-k2-thinking.md
+++ b/docs/source/tutorials/Kimi-K2-Thinking.md
@@ -1,4 +1,4 @@
-# Multi-NPU (Kimi-K2-Thinking)
+# Kimi-K2-Thinking

## Run with Docker

diff --git a/docs/source/tutorials/Qwen-VL-Dense.md b/docs/source/tutorials/Qwen-VL-Dense.md
index e99d7725..1093aeba 100644
--- a/docs/source/tutorials/Qwen-VL-Dense.md
+++ b/docs/source/tutorials/Qwen-VL-Dense.md
@@ -1,4 +1,4 @@
-# Qwen-VL-Dense
+# Qwen-VL-Dense (Qwen2.5-VL-3B/7B, Qwen3-VL-2B/4B/8B/32B)

## Introduction

diff --git a/docs/source/tutorials/Qwen2.5.md b/docs/source/tutorials/Qwen2.5-7B.md
similarity index 99%
rename from docs/source/tutorials/Qwen2.5.md
rename to docs/source/tutorials/Qwen2.5-7B.md
index 2555e950..2eadefee 100644
--- a/docs/source/tutorials/Qwen2.5.md
+++ b/docs/source/tutorials/Qwen2.5-7B.md
@@ -1,4 +1,4 @@
-# Qwen2.5-7B-Instruct
+# Qwen2.5-7B

## Introduction

diff --git a/docs/source/tutorials/single_npu_qwen2_audio.md b/docs/source/tutorials/Qwen2_audio.md
similarity index 99%
rename from docs/source/tutorials/single_npu_qwen2_audio.md
rename to docs/source/tutorials/Qwen2_audio.md
index e093e845..33accd3c 100644
--- a/docs/source/tutorials/single_npu_qwen2_audio.md
+++ b/docs/source/tutorials/Qwen2_audio.md
@@ -1,4 +1,4 @@
-# Single NPU (Qwen2-Audio-7B)
+# Qwen2-Audio-7B

## Run vllm-ascend on Single NPU

diff --git a/docs/source/tutorials/Qwen3-235B-A22B.md b/docs/source/tutorials/Qwen3-235B-A22B.md
index 43625350..227e0ce4 100644
--- a/docs/source/tutorials/Qwen3-235B-A22B.md
+++ b/docs/source/tutorials/Qwen3-235B-A22B.md
@@ -251,11 +251,11 @@ INFO: Application startup complete.

### Multi-node Deployment with Ray

-- refer to [Multi-Node-Ray (Qwen/Qwen3-235B-A22B)](./multi_node_ray.md).
+- refer to [Ray Distributed (Qwen3-235B-A22B)](./ray.md).
### Prefill-Decode Disaggregation

-- refer to [Prefill-Decode Disaggregation Mooncake Verification (Qwen)](./multi_node_pd_disaggregation_mooncake.md)
+- refer to [Prefill-Decode Disaggregation (Deepseek)](./pd_disaggregation_mooncake_multi_node.md)

## Functional Verification

diff --git a/docs/source/tutorials/multi_npu_qwen3_moe.md b/docs/source/tutorials/Qwen3-30B-A3B.md
similarity index 99%
rename from docs/source/tutorials/multi_npu_qwen3_moe.md
rename to docs/source/tutorials/Qwen3-30B-A3B.md
index 7f86e973..a78971c0 100644
--- a/docs/source/tutorials/multi_npu_qwen3_moe.md
+++ b/docs/source/tutorials/Qwen3-30B-A3B.md
@@ -1,4 +1,4 @@
-# Multi-NPU (Qwen3-30B-A3B)
+# Qwen3-30B-A3B

## Run vllm-ascend on Multi-NPU with Qwen3 MoE

diff --git a/docs/source/tutorials/single_npu_qwen3_w4a4.md b/docs/source/tutorials/Qwen3-32B-W4A4.md
similarity index 99%
rename from docs/source/tutorials/single_npu_qwen3_w4a4.md
rename to docs/source/tutorials/Qwen3-32B-W4A4.md
index 5d6d6a03..6a03af84 100644
--- a/docs/source/tutorials/single_npu_qwen3_w4a4.md
+++ b/docs/source/tutorials/Qwen3-32B-W4A4.md
@@ -1,4 +1,4 @@
-# Single-NPU (Qwen3 32B W4A4)
+# Qwen3-32B-W4A4

## Introduction

diff --git a/docs/source/tutorials/single_npu_qwen3_quantization.md b/docs/source/tutorials/Qwen3-8B-W4A8.md
similarity index 99%
rename from docs/source/tutorials/single_npu_qwen3_quantization.md
rename to docs/source/tutorials/Qwen3-8B-W4A8.md
index 40acff34..cbfdd657 100644
--- a/docs/source/tutorials/single_npu_qwen3_quantization.md
+++ b/docs/source/tutorials/Qwen3-8B-W4A8.md
@@ -1,4 +1,4 @@
-# Single-NPU (Qwen3-8B-W4A8)
+# Qwen3-8B-W4A8

## Run Docker Container
:::{note}
diff --git a/docs/source/tutorials/Qwen3-Dense.md b/docs/source/tutorials/Qwen3-Dense.md
index 395f9dce..b95e9e5c 100644
--- a/docs/source/tutorials/Qwen3-Dense.md
+++ b/docs/source/tutorials/Qwen3-Dense.md
@@ -1,4 +1,4 @@
-# Qwen3-Dense
+# Qwen3-Dense (Qwen3-0.6B/8B/32B)

## Introduction

diff --git a/docs/source/tutorials/multi_npu_qwen3_next.md b/docs/source/tutorials/Qwen3-Next.md
similarity index 99%
rename from docs/source/tutorials/multi_npu_qwen3_next.md
rename to docs/source/tutorials/Qwen3-Next.md
index eeb57f5a..68fa435b 100644
--- a/docs/source/tutorials/multi_npu_qwen3_next.md
+++ b/docs/source/tutorials/Qwen3-Next.md
@@ -1,4 +1,4 @@
-# Multi-NPU (Qwen3-Next)
+# Qwen3-Next

```{note}
Qwen3-Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance improvement.
diff --git a/docs/source/tutorials/single_npu_qwen3_embedding.md b/docs/source/tutorials/Qwen3_embedding.md
similarity index 99%
rename from docs/source/tutorials/single_npu_qwen3_embedding.md
rename to docs/source/tutorials/Qwen3_embedding.md
index 49b0d42f..475dae70 100644
--- a/docs/source/tutorials/single_npu_qwen3_embedding.md
+++ b/docs/source/tutorials/Qwen3_embedding.md
@@ -1,4 +1,4 @@
-# Single NPU (Qwen3-Embedding-8B)
+# Qwen3-Embedding-8B

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This guide describes how to run the model with vLLM Ascend. Note that only vLLM Ascend 0.9.2rc1 and later support the model.
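For quick orientation on the renamed embedding tutorial, a minimal serving sketch follows. The checkpoint name, port, and `--task embed` flag are illustrative assumptions, and exact flags may vary across vLLM versions; the request goes to vLLM's standard OpenAI-compatible `/v1/embeddings` endpoint.

```bash
# Minimal embedding-serving sketch (assumed model and port; adjust as needed).
vllm serve Qwen/Qwen3-Embedding-0.6B --task embed --port 8000

# Once the server logs "Application startup complete", request an embedding:
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-0.6B",
    "input": "The future of AI is"
  }'
```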
diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md index 22767ff3..c983794c 100644 --- a/docs/source/tutorials/index.md +++ b/docs/source/tutorials/index.md @@ -3,30 +3,24 @@ :::{toctree} :caption: Deployment :maxdepth: 1 -single_npu -Qwen-VL-Dense.md -single_npu_qwen2_audio -single_npu_qwen3_embedding -single_npu_qwen3_quantization -single_npu_qwen3_w4a4 -single_node_pd_disaggregation_mooncake -Qwen2.5 -multi_npu_qwen3_next -multi_npu -multi_npu_kimi-k2-thinking +Qwen2.5-Omni.md +Qwen2_audio +Qwen2.5-7B Qwen3-Dense -multi_npu_qwen3_moe -multi_npu_quantization -single_node_300i -DeepSeek-R1.md -DeepSeek-V3.1.md -DeepSeek-V3.2-Exp.md +Qwen-VL-Dense.md +Qwen3-30B-A3B.md Qwen3-235B-A22B.md Qwen3-Coder-30B-A3B -multi_node -multi_node_kimi -multi_node_qwen3vl -multi_node_pd_disaggregation_mooncake -multi_node_ray -Qwen2.5-Omni.md +Qwen3_embedding +Qwen3-8B-W4A8 +Qwen3-32B-W4A4 +Qwen3-Next +DeepSeek-V3.1.md +DeepSeek-V3.2.md +DeepSeek-R1.md +Kimi-K2-Thinking +pd_disaggregation_mooncake_single_node +pd_disaggregation_mooncake_multi_node +ray +310p ::: diff --git a/docs/source/tutorials/multi_node.md b/docs/source/tutorials/multi_node.md deleted file mode 100644 index 5a30c716..00000000 --- a/docs/source/tutorials/multi_node.md +++ /dev/null @@ -1,210 +0,0 @@ -# Multi-Node-DP (DeepSeek) - -## Getting Start -vLLM-Ascend now supports Data Parallel (DP) deployment, enabling model weights to be replicated across multiple NPUs or instances, each processing independent batches of requests. This is particularly useful for scaling throughput across devices while maintaining high resource utilization. - -Each DP rank is deployed as a separate "core engine" process which communicates with front-end process(es) via ZMQ sockets. Data Parallel can be combined with Tensor Parallel, in which case each DP engine owns a number of per-NPU worker processes equal to the TP size. - -For Mixture-of-Experts (MoE) models — especially advanced architectures like DeepSeek that utilize Multi-head Latent Attention (MLA) — a hybrid parallelism approach is recommended: - -- Use **Data Parallel (DP)** for attention layers, which are replicated across devices and handle separate batches. -- Use **Expert or Tensor Parallel (EP/TP)** for expert layers, which are sharded across devices to distribute the computation. - -This division enables attention layers to be replicated across DP ranks, enabling them to process different batches independently. Meanwhile, expert layers are partitioned (sharded) across devices using DP/TP, maximizing hardware utilization and efficiency. - -In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned and expert layers across all ranks are required to synchronize during every forward pass, even if there are fewer requests to be processed than DP ranks. - -For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP `Coordinator` process, which communicates with all of the ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP). - -## Verify Multi-Node Communication Environment - -### Physical Layer Requirements: - -- The physical machines must be located on the same WLAN, with network connectivity. 
-- All NPUs are connected with optical modules, and the connection status must be normal. - -### Verification Process: - -Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`: - -```bash - # Check the remote switch ports - for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done - # Get the link status of the Ethernet ports (UP or DOWN) - for i in {0..7}; do hccn_tool -i $i -link -g ; done - # Check the network health status - for i in {0..7}; do hccn_tool -i $i -net_health -g ; done - # View the network detected IP configuration - for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done - # View gateway configuration - for i in {0..7}; do hccn_tool -i $i -gateway -g ; done - # View NPU network configuration - cat /etc/hccn.conf -``` - -### NPU Interconnect Verification: -#### 1. Get NPU IP Addresses - -```bash -for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done -``` - -#### 2. Cross-Node PING Test - -```bash -# Execute on the target node (replace with actual IP) -hccn_tool -i 0 -ping -g address 10.20.0.20 -``` - -## Run with Docker -Assume you have two Atlas 800 A2 (64G*8) nodes, and want to deploy the `deepseek-v3.1-w8a8` quantitative model across multiple nodes. - -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version| -export NAME=vllm-ascend - -# Run the container using the defined variables -# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance -docker run --rm \ ---name $NAME \ ---net=host \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci1 \ ---device /dev/davinci2 \ ---device /dev/davinci3 \ ---device /dev/davinci4 \ ---device /dev/davinci5 \ ---device /dev/davinci6 \ ---device /dev/davinci7 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /mnt/sfs_turbo/.cache:/root/.cache \ --it $IMAGE bash -``` - -Run the following scripts on two nodes respectively. - -:::{note} -Before launching the inference server, ensure the following environment variables are set for multi-node communication. 
-::: - -**Node 0** - -```shell -#!/bin/sh - -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxxx" -local_ip="xxxx" - -export VLLM_USE_MODELSCOPE=True -export HCCL_IF_IP=$local_ip -export GLOO_SOCKET_IFNAME=$nic_name -export TP_SOCKET_IFNAME=$nic_name -export HCCL_SOCKET_IFNAME=$nic_name -export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export HCCL_BUFFSIZE=1024 - -# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V3.1-W8A8 -# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html -vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \ ---host 0.0.0.0 \ ---port 8004 \ ---data-parallel-size 4 \ ---data-parallel-size-local 2 \ ---data-parallel-address $local_ip \ ---data-parallel-rpc-port 13389 \ ---tensor-parallel-size 4 \ ---seed 1024 \ ---served-model-name deepseek_v3.1 \ ---enable-expert-parallel \ ---max-num-seqs 16 \ ---max-model-len 8192 \ ---quantization ascend \ ---max-num-batched-tokens 8192 \ ---trust-remote-code \ ---no-enable-prefix-caching \ ---gpu-memory-utilization 0.9 -``` - -**Node 1** - -```shell -#!/bin/sh - -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxx" -local_ip="xxx" - -# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) -node0_ip="xxxx" - -export VLLM_USE_MODELSCOPE=True -export HCCL_IF_IP=$local_ip -export GLOO_SOCKET_IFNAME=$nic_name -export TP_SOCKET_IFNAME=$nic_name -export HCCL_SOCKET_IFNAME=$nic_name -export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export HCCL_BUFFSIZE=1024 - -vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \ ---host 0.0.0.0 \ ---port 8004 \ ---headless \ ---data-parallel-size 4 \ ---data-parallel-size-local 2 \ ---data-parallel-start-rank 2 \ ---data-parallel-address $node0_ip \ ---data-parallel-rpc-port 13389 \ ---tensor-parallel-size 4 \ ---seed 1024 \ ---quantization ascend \ ---served-model-name deepseek_v3.1 \ ---max-num-seqs 16 \ ---max-model-len 8192 \ ---max-num-batched-tokens 8192 \ ---enable-expert-parallel \ ---trust-remote-code \ ---no-enable-prefix-caching \ ---gpu-memory-utilization 0.92 -``` - -The deployment view looks like: -![alt text](../assets/multi_node_dp_deepseek.png) - -Once your server is started, you can query the model with input prompts: - -```shell -curl http://{ node0 ip:8004 }/v1/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "deepseek_v3.1", - "prompt": "The future of AI is", - "max_tokens": 50, - "temperature": 0 - }' -``` - -## Run Benchmarks -For details, refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks). 
- -```shell -export VLLM_USE_MODELSCOPE=true -vllm bench serve --model vllm-ascend/DeepSeek-V3.1-W8A8 --served-model-name deepseek_v3.1 \ ---dataset-name random --random-input-len 128 --random-output-len 128 \ ---num-prompts 200 --trust-remote-code --base-url "http://{ node0 ip }:8004" --request-rate 1 -``` diff --git a/docs/source/tutorials/multi_node_kimi.md b/docs/source/tutorials/multi_node_kimi.md deleted file mode 100644 index f37d9bf4..00000000 --- a/docs/source/tutorials/multi_node_kimi.md +++ /dev/null @@ -1,155 +0,0 @@ -# Multi-Node-DP (Kimi-K2) - -## Verify Multi-Node Communication Environment - -Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process). - -## Run with Docker -Assume you have two Atlas 800 A3 (64G*16) or four A2 nodes, and want to deploy the `Kimi-K2-Instruct-W8A8` quantitative model across multiple nodes. - -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version| -export NAME=vllm-ascend - -# Run the container using the defined variables -# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance -docker run --rm \ ---name $NAME \ ---net=host \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci1 \ ---device /dev/davinci2 \ ---device /dev/davinci3 \ ---device /dev/davinci4 \ ---device /dev/davinci5 \ ---device /dev/davinci6 \ ---device /dev/davinci7 \ ---device /dev/davinci8 \ ---device /dev/davinci9 \ ---device /dev/davinci10 \ ---device /dev/davinci11 \ ---device /dev/davinci12 \ ---device /dev/davinci13 \ ---device /dev/davinci14 \ ---device /dev/davinci15 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /mnt/sfs_turbo/.cache:/home/cache \ --it $IMAGE bash -``` - -Run the following scripts on two nodes respectively. - -:::{note} -Before launching the inference server, ensure the following environment variables are set for multi-node communication. 
-::: - -**Node 0** - -```shell -#!/bin/sh - -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxxx" -local_ip="xxxx" - -export HCCL_IF_IP=$local_ip -export GLOO_SOCKET_IFNAME=$nic_name -export TP_SOCKET_IFNAME=$nic_name -export HCCL_SOCKET_IFNAME=$nic_name -export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export HCCL_BUFFSIZE=1024 - -# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/Kimi-K2-Instruct-W8A8 -# If you want to do the quantization manually, please refer to https://vllm-ascend.readthedocs.io/en/latest/user_guide/feature_guide/quantization.html -vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \ ---host 0.0.0.0 \ ---port 8004 \ ---data-parallel-size 4 \ ---api-server-count 2 \ ---data-parallel-size-local 2 \ ---data-parallel-address $local_ip \ ---data-parallel-rpc-port 13389 \ ---seed 1024 \ ---served-model-name kimi \ ---quantization ascend \ ---tensor-parallel-size 8 \ ---enable-expert-parallel \ ---max-num-seqs 16 \ ---max-model-len 8192 \ ---max-num-batched-tokens 8192 \ ---trust-remote-code \ ---no-enable-prefix-caching \ ---gpu-memory-utilization 0.9 -``` - -**Node 1** - -```shell -#!/bin/sh - -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxxx" -local_ip="xxxx" - -# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) -node0_ip="xxxx" - -export HCCL_IF_IP=$local_ip -export GLOO_SOCKET_IFNAME=$nic_name -export TP_SOCKET_IFNAME=$nic_name -export HCCL_SOCKET_IFNAME=$nic_name -export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export HCCL_BUFFSIZE=1024 - -vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \ ---host 0.0.0.0 \ ---port 8004 \ ---headless \ ---data-parallel-size 4 \ ---data-parallel-size-local 2 \ ---data-parallel-start-rank 2 \ ---data-parallel-address $node0_ip \ ---data-parallel-rpc-port 13389 \ ---seed 1024 \ ---tensor-parallel-size 8 \ ---served-model-name kimi \ ---max-num-seqs 16 \ ---max-model-len 8192 \ ---quantization ascend \ ---max-num-batched-tokens 8192 \ ---enable-expert-parallel \ ---trust-remote-code \ ---no-enable-prefix-caching \ ---gpu-memory-utilization 0.92 -``` - -The deployment view looks like: -![alt text](../assets/multi_node_dp_kimi.png) - -Once your server is started, you can query the model with input prompts: - -```shell -curl http://{ node0 ip:8004 }/v1/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "kimi", - "prompt": "The future of AI is", - "max_tokens": 50, - "temperature": 0 - }' -``` diff --git a/docs/source/tutorials/multi_node_qwen3vl.md b/docs/source/tutorials/multi_node_qwen3vl.md deleted file mode 100644 index 48763b70..00000000 --- a/docs/source/tutorials/multi_node_qwen3vl.md +++ /dev/null @@ -1,163 +0,0 @@ -# Multi-Node-DP (Qwen3-VL-235B-A22B) - -:::{note} -Qwen3 VL relies on the newest version of `transformers` (>4.56.2). Please install it from source. -::: - -## Verify Multi-Node Communication Environment - -Refer to [multi_node.md](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node.html#verification-process). - -## Run with Docker -Assume you have Atlas 800 A3 (64G*16) nodes (or 2 * A2), and want to deploy the `Qwen3-VL-235B-A22B-Instruct` model across multiple nodes. 
- -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---name vllm-ascend \ ---net=host \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci1 \ ---device /dev/davinci2 \ ---device /dev/davinci3 \ ---device /dev/davinci4 \ ---device /dev/davinci5 \ ---device /dev/davinci6 \ ---device /dev/davinci7 \ ---device /dev/davinci8 \ ---device /dev/davinci9 \ ---device /dev/davinci10 \ ---device /dev/davinci11 \ ---device /dev/davinci12 \ ---device /dev/davinci13 \ ---device /dev/davinci14 \ ---device /dev/davinci15 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --it $IMAGE bash -``` - -Run the following scripts on two nodes respectively. - -:::{note} -Before launching the inference server, ensure the following environment variables are set for multi-node communication. -::: - -Node 0 - -```shell -#!/bin/sh -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxxx" -local_ip="xxxx" - -export HCCL_IF_IP=$local_ip -export GLOO_SOCKET_IFNAME=$nic_name -export TP_SOCKET_IFNAME=$nic_name -export HCCL_SOCKET_IFNAME=$nic_name -export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export HCCL_BUFFSIZE=1024 - -vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \ ---host 0.0.0.0 \ ---port 8000 \ ---data-parallel-size 2 \ ---api-server-count 2 \ ---data-parallel-size-local 1 \ ---data-parallel-address $local_ip \ ---data-parallel-rpc-port 13389 \ ---seed 1024 \ ---served-model-name qwen3vl \ ---tensor-parallel-size 8 \ ---enable-expert-parallel \ ---max-num-seqs 16 \ ---max-model-len 32768 \ ---max-num-batched-tokens 4096 \ ---trust-remote-code \ ---no-enable-prefix-caching \ ---gpu-memory-utilization 0.8 \ -``` - -Node 1 - -```shell -#!/bin/sh - -# this obtained through ifconfig -# nic_name is the network interface name corresponding to local_ip of the current node -nic_name="xxxx" -local_ip="xxxx" - -# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node) -node0_ip="xxxx" - -export HCCL_IF_IP=$local_ip -export GLOO_SOCKET_IFNAME=$nic_name -export TP_SOCKET_IFNAME=$nic_name -export HCCL_SOCKET_IFNAME=$nic_name -export OMP_PROC_BIND=false -export OMP_NUM_THREADS=10 -export HCCL_BUFFSIZE=1024 - -vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \ ---host 0.0.0.0 \ ---port 8000 \ ---headless \ ---data-parallel-size 2 \ ---data-parallel-size-local 1 \ ---data-parallel-start-rank 1 \ ---data-parallel-address $node0_ip \ ---data-parallel-rpc-port 13389 \ ---seed 1024 \ ---tensor-parallel-size 8 \ ---served-model-name qwen3vl \ ---max-num-seqs 16 \ ---max-model-len 32768 \ ---max-num-batched-tokens 4096 \ ---enable-expert-parallel \ ---trust-remote-code \ ---no-enable-prefix-caching \ ---gpu-memory-utilization 0.8 \ -``` - -If the service starts successfully, the following information will be displayed on node 0: - -```shell -INFO: Started server process [44610] -INFO: Waiting for application startup. 
-INFO: Application startup complete. -INFO: Started server process [44611] -INFO: Waiting for application startup. -INFO: Application startup complete. -``` - -Once your server is started, you can query the model with input prompts: - -```shell -curl http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "qwen3vl", - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": [ - {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}}, - {"type": "text", "text": "What is the text in the illustrate?"} - ]} - ] - }' -``` diff --git a/docs/source/tutorials/multi_npu.md b/docs/source/tutorials/multi_npu.md deleted file mode 100644 index 3dedc972..00000000 --- a/docs/source/tutorials/multi_npu.md +++ /dev/null @@ -1,108 +0,0 @@ -# Multi-NPU (QwQ-32B) - -## Run vllm-ascend on Multi-NPU - -Run docker container: - -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---name vllm-ascend \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci1 \ ---device /dev/davinci2 \ ---device /dev/davinci3 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --it $IMAGE bash -``` - -Set up environment variables: - -```bash -# Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=True - -# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory -export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 -``` - -### Online Inference on Multi-NPU - -Run the following script to start the vLLM server on multi-NPU: - -```bash -vllm serve Qwen/QwQ-32B --max-model-len 4096 --port 8000 -tp 4 -``` - -Once your server is started, you can query the model with input prompts. 
- -```bash -curl http://localhost:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "Qwen/QwQ-32B", - "prompt": "QwQ-32B是什么?", - "max_tokens": "128", - "top_p": "0.95", - "top_k": "40", - "temperature": "0.6" - }' -``` - -### Offline Inference on Multi-NPU - -Run the following script to execute offline inference on multi-NPU: - -```python -import gc - -import torch - -from vllm import LLM, SamplingParams -from vllm.distributed.parallel_state import (destroy_distributed_environment, - destroy_model_parallel) - -def clean_up(): - destroy_model_parallel() - destroy_distributed_environment() - gc.collect() - torch.npu.empty_cache() - -prompts = [ - "Hello, my name is", - "The future of AI is", -] -sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40) -llm = LLM(model="Qwen/QwQ-32B", - tensor_parallel_size=4, - distributed_executor_backend="mp", - max_model_len=4096) - -outputs = llm.generate(prompts, sampling_params) -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") - -del llm -clean_up() -``` - -If you run this script successfully, you can see the info shown below: - -```bash -Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I' -Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the' -``` diff --git a/docs/source/tutorials/multi_npu_quantization.md b/docs/source/tutorials/multi_npu_quantization.md deleted file mode 100644 index 23b183db..00000000 --- a/docs/source/tutorials/multi_npu_quantization.md +++ /dev/null @@ -1,138 +0,0 @@ -# Multi-NPU (QwQ-32B-W8A8) - -## Run Docker Container -:::{note} -w8a8 quantization feature is supported by v0.8.4rc2 and later. 
-::: - -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---name vllm-ascend \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci1 \ ---device /dev/davinci2 \ ---device /dev/davinci3 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --it $IMAGE bash -``` - -## Install modelslim and Convert Model -:::{note} -You can choose to convert the model yourself or use the quantized model we uploaded, -see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8 -::: - -```bash -# (Optional)This tag is recommended and has been verified -git clone https://gitcode.com/Ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001 - -cd msit/msmodelslim -# Install by run this script -bash install.sh -pip install accelerate - -cd example/Qwen -# Original weight path, Replace with your local model path -MODEL_PATH=/home/models/QwQ-32B -# Path to save converted weight, Replace with your local path -SAVE_PATH=/home/models/QwQ-32B-w8a8 - -# In this conversion process, the npu device is not must, you can also set --device_type cpu to have a conversion -python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --calib_file ../common/boolq.jsonl --w_bit 8 --a_bit 8 --device_type npu --anti_method m1 --trust_remote_code True -``` - -## Verify the Quantized Model -The converted model files look like: - -```bash -. -|-- config.json -|-- configuration.json -|-- generation_config.json -|-- quant_model_description.json -|-- quant_model_weight_w8a8.safetensors -|-- README.md -|-- tokenizer.json -`-- tokenizer_config.json -``` - -Run the following script to start the vLLM server with the quantized model: - -:::{note} -The value "ascend" for "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released. You can cherry-pick this commit for now. -::: - -```bash -vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend -``` - -Once your server is started, you can query the model with input prompts - -```bash -curl http://localhost:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "qwq-32b-w8a8", - "prompt": "what is large language model?", - "max_tokens": "128", - "top_p": "0.95", - "top_k": "40", - "temperature": "0.0" - }' -``` - -Run the following script to execute offline inference on multi-NPU with the quantized model: - -:::{note} -To enable quantization for ascend, quantization method must be "ascend". 
-::: - -```python -import gc - -import torch - -from vllm import LLM, SamplingParams -from vllm.distributed.parallel_state import (destroy_distributed_environment, - destroy_model_parallel) - -def clean_up(): - destroy_model_parallel() - destroy_distributed_environment() - gc.collect() - torch.npu.empty_cache() - -prompts = [ - "Hello, my name is", - "The future of AI is", -] -sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40) - -llm = LLM(model="/home/models/QwQ-32B-w8a8", - tensor_parallel_size=4, - distributed_executor_backend="mp", - max_model_len=4096, - quantization="ascend") - -outputs = llm.generate(prompts, sampling_params) -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") - -del llm -clean_up() -``` diff --git a/docs/source/tutorials/multi_node_pd_disaggregation_mooncake.md b/docs/source/tutorials/pd_disaggregation_mooncake_multi_node.md similarity index 99% rename from docs/source/tutorials/multi_node_pd_disaggregation_mooncake.md rename to docs/source/tutorials/pd_disaggregation_mooncake_multi_node.md index d11ea137..b77a9520 100644 --- a/docs/source/tutorials/multi_node_pd_disaggregation_mooncake.md +++ b/docs/source/tutorials/pd_disaggregation_mooncake_multi_node.md @@ -1,4 +1,4 @@ -# Prefill-Decode Disaggregation Mooncake Verification (Deepseek) +# Prefill-Decode Disaggregation (Deepseek) ## Getting Start diff --git a/docs/source/tutorials/single_node_pd_disaggregation_mooncake.md b/docs/source/tutorials/pd_disaggregation_mooncake_single_node.md similarity index 98% rename from docs/source/tutorials/single_node_pd_disaggregation_mooncake.md rename to docs/source/tutorials/pd_disaggregation_mooncake_single_node.md index f86b68d3..7d511d1b 100644 --- a/docs/source/tutorials/single_node_pd_disaggregation_mooncake.md +++ b/docs/source/tutorials/pd_disaggregation_mooncake_single_node.md @@ -1,4 +1,4 @@ -# Prefill-Decode Disaggregation Mooncake Verification (Qwen2.5-VL) +# Prefill-Decode Disaggregation (Qwen2.5-VL) ## Getting Start diff --git a/docs/source/tutorials/multi_node_ray.md b/docs/source/tutorials/ray.md similarity index 99% rename from docs/source/tutorials/multi_node_ray.md rename to docs/source/tutorials/ray.md index 1146bf89..3f700e18 100644 --- a/docs/source/tutorials/multi_node_ray.md +++ b/docs/source/tutorials/ray.md @@ -1,4 +1,4 @@ -# Multi-Node-Ray (Qwen/Qwen3-235B-A22B) +# Ray Distributed (Qwen3-235B-A22B) Multi-node inference is suitable for scenarios where the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. 
To successfully deploy multi-node inference, the following three steps need to be completed: diff --git a/docs/source/tutorials/single_npu.md b/docs/source/tutorials/single_npu.md deleted file mode 100644 index 23a2e34a..00000000 --- a/docs/source/tutorials/single_npu.md +++ /dev/null @@ -1,213 +0,0 @@ -# Single NPU (Qwen3-8B) - -## Run vllm-ascend on Single NPU - -### Offline Inference on Single NPU - -Run docker container: - -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---name vllm-ascend \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --it $IMAGE bash -``` - -Set up environment variables: - -```bash -# Load model from ModelScope to speed up download -export VLLM_USE_MODELSCOPE=True - -# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory -export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 -``` - -:::{note} -`max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [here](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html). -::: - -Run the following script to execute offline inference on a single NPU: - -:::::{tab-set} -:sync-group: inference - -::::{tab-item} Graph Mode -:sync: graph mode - -```{code-block} python - :substitutions: -import os -from vllm import LLM, SamplingParams - -prompts = [ - "Hello, my name is", - "The future of AI is", -] -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) -llm = LLM( - model="Qwen/Qwen3-8B", - max_model_len=26240 -) - -outputs = llm.generate(prompts, sampling_params) -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` - -:::: - -::::{tab-item} Eager Mode -:sync: eager mode - -```{code-block} python - :substitutions: -import os -from vllm import LLM, SamplingParams - -prompts = [ - "Hello, my name is", - "The future of AI is", -] -sampling_params = SamplingParams(temperature=0.8, top_p=0.95) -llm = LLM( - model="Qwen/Qwen3-8B", - max_model_len=26240, - enforce_eager=True -) - -outputs = llm.generate(prompts, sampling_params) -for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") -``` - -:::: -::::: - -If you run this script successfully, you can see the info shown below: - -```bash -Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I' -Prompt: 'The future of AI is', Generated text: ' following you. 
As the technology advances, a new report from the Institute for the' -``` - -### Online Serving on Single NPU - -Run docker container to start the vLLM server on a single NPU: - -:::::{tab-set} -:sync-group: inference - -::::{tab-item} Graph Mode -:sync: graph mode - -```{code-block} bash - :substitutions: -# Update the vllm-ascend image -export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---name vllm-ascend \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --e VLLM_USE_MODELSCOPE=True \ --e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \ --it $IMAGE \ -vllm serve Qwen/Qwen3-8B --max_model_len 26240 -``` - -:::: - -::::{tab-item} Eager Mode -:sync: eager mode - -```{code-block} bash - :substitutions: -export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| -docker run --rm \ ---name vllm-ascend \ ---shm-size=1g \ ---device /dev/davinci0 \ ---device /dev/davinci_manager \ ---device /dev/devmm_svm \ ---device /dev/hisi_hdc \ --v /usr/local/dcmi:/usr/local/dcmi \ --v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ --v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ --v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ --v /etc/ascend_install.info:/etc/ascend_install.info \ --v /root/.cache:/root/.cache \ --p 8000:8000 \ --e VLLM_USE_MODELSCOPE=True \ --e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \ --it $IMAGE \ -vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager -``` - -:::: -::::: - -:::{note} -Add `--max_model_len` option to avoid ValueError that the Qwen2.5-7B model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26240). This will differ with different NPU series based on the HBM size. Please modify the value according to a suitable value for your NPU series. -::: - -If your service start successfully, you can see the info shown below: - -```bash -INFO: Started server process [6873] -INFO: Waiting for application startup. -INFO: Application startup complete. -``` - -Once your server is started, you can query the model with input prompts: - -```bash -curl http://localhost:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "Qwen/Qwen3-8B", - "prompt": "The future of AI is", - "max_tokens": 7, - "temperature": 0 - }' -``` - -If you query the server successfully, you can see the info shown below (client): - -```bash -{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen3-8B","choices":[{"index":0,"text":" here. 
It’s not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}} -``` - -Logs of the vllm server: - -```bash -INFO: 172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK -INFO 02-13 08:34:35 logger.py:39] Received request cmpl-574f00e342904692a73fb6c1c986c521-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None. -``` diff --git a/docs/source/user_guide/release_notes.md b/docs/source/user_guide/release_notes.md index a44b57ff..764f207e 100644 --- a/docs/source/user_guide/release_notes.md +++ b/docs/source/user_guide/release_notes.md @@ -56,7 +56,7 @@ v0.11.0 will be the next official release version of vLLM Ascend. We'll release ### Core - Performance of Qwen3 and Deepseek V3 series models are improved. -- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find tutorial [here](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html). +- Mooncake layerwise connector is supported now [#2602](https://github.com/vllm-project/vllm-ascend/pull/2602). Find tutorial [here](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html). - MTP > 1 is supported now. [#2708](https://github.com/vllm-project/vllm-ascend/pull/2708) - [Experimental] Graph mode `FULL_DECODE_ONLY` is supported now! And `FULL` will be landing in the next few weeks. [#2128](https://github.com/vllm-project/vllm-ascend/pull/2128) - Pooling models, such as bge-m3, are supported now. 
[#3171](https://github.com/vllm-project/vllm-ascend/pull/3171)
diff --git a/docs/source/user_guide/support_matrix/supported_models.md b/docs/source/user_guide/support_matrix/supported_models.md
index a47d39cb..c7077f9a 100644
--- a/docs/source/user_guide/support_matrix/supported_models.md
+++ b/docs/source/user_guide/support_matrix/supported_models.md
@@ -9,7 +9,7 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160
| Model | Support | Note | BF16 | Supported Hardware | W8A8 | Chunked Prefill | Automatic Prefix Cache | LoRA | Speculative Decoding | Async Scheduling | Tensor Parallel | Pipeline Parallel | Expert Parallel | Data Parallel | Prefill-decode Disaggregation | Piecewise AclGraph | Fullgraph AclGraph | max-model-len | MLP Weight Prefetch | Doc |
|-------------------------------|-----------|----------------------------------------------------------------------|------|--------------------|------|-----------------|------------------------|------|----------------------|------------------|-----------------|-------------------|-----------------|---------------|-------------------------------|--------------------|--------------------|---------------|---------------------|-----|
| DeepSeek V3/3.1 | ✅ | |||||||||||||||||||
-| DeepSeek V3.2 EXP | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ❌ | | | 163840 | | [DeepSeek-V3.2-Exp tutorial](../../tutorials/DeepSeek-V3.2-Exp.md) |
+| DeepSeek V3.2 EXP | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ❌ | | | 163840 | | [DeepSeek-V3.2 tutorial](../../tutorials/DeepSeek-V3.2.md) |
| DeepSeek R1 | ✅ | |||||||||||||||||||
| DeepSeek Distill (Qwen/Llama) | ✅ | |||||||||||||||||||
| Qwen3 | ✅ | |||||||||||||||||||