[Doc][Misc] Restructure tutorial documentation (#6501)
### What this PR does / why we need it?

This PR refactors the tutorial documentation by restructuring it into three categories: Models, Features, and Hardware. This improves the organization and navigation of the tutorials, making it easier for users to find relevant information.

- The single `tutorials/index.md` is split into three separate index files:
  - `docs/source/tutorials/models/index.md`
  - `docs/source/tutorials/features/index.md`
  - `docs/source/tutorials/hardwares/index.md`
- Existing tutorial markdown files have been moved into their respective new subdirectories (`models/`, `features/`, `hardwares/`).
- The main `index.md` has been updated to link to these new tutorial sections.

This change makes the documentation structure more logical and scalable for future additions.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the structure and URLs of the tutorial documentation pages. Users following old links to tutorials will encounter broken links. It is recommended to set up redirects if the documentation framework supports them.

### How was this patch tested?

These are documentation-only changes. The documentation should be built and reviewed locally to ensure all links are correct and the pages render as expected.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
This commit is contained in:

docs/source/tutorials/features/index.md (new file, +15)
@@ -0,0 +1,15 @@
# Feature Tutorials

This section provides tutorials for different features of vLLM Ascend.

:::{toctree}
:caption: Feature Tutorials
:maxdepth: 1
pd_colocated_mooncake_multi_instance
pd_disaggregation_mooncake_single_node
pd_disaggregation_mooncake_multi_node
long_sequence_context_parallel_single_node
long_sequence_context_parallel_multi_node
suffix_speculative_decoding
ray
:::
@@ -0,0 +1,371 @@
# Long-Sequence Context Parallel (DeepSeek)

## Getting Started

:::{note}
The context parallel feature is currently only supported on Atlas A3 devices. Atlas A2 support will be added in the future.
:::

vLLM-Ascend now supports long sequences with context parallel options. This guide walks through the steps to verify these features with constrained resources.

Taking the DeepSeek-V3.1-w8a8 model as an example, use 3 Atlas 800T A3 servers to deploy a "1P1D" architecture. The prefill (P) node is deployed across multiple machines, while the decode (D) node is deployed on a single machine. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and that the decoder server is 192.0.0.3 (decoder 1). On each server, use 8 NPUs (16 chips) to deploy one service instance. In this example, we enable the context parallel feature on the P node to improve TTFT. Although enabling the DCP feature on the D node would reduce memory usage, it would also introduce additional communication and small-operator overhead, so we do not enable DCP on the D node.

## Environment Preparation

### Model Weight

- `DeepSeek-V3.1_w8a8mix_mtp` (quantized version with MTP weights): [Download model weight](https://www.modelscope.cn/models/Eco-Tech/DeepSeek-V3.1-w8a8). Please modify `torch_dtype` from `float16` to `bfloat16` in `config.json`.

It is recommended to download the model weight to a directory shared by all nodes, such as `/root/.cache/`.

### Verify Multi-node Communication

Refer to [verify multi-node communication environment](../../installation.md#verify-multi-node-communication) to verify multi-node communication.

### Installation

You can use our official docker image to run `DeepSeek-V3.1` directly.

Select an image based on your machine type and start it on your node; refer to [using docker](../../installation.md#set-up-using-docker).

```{code-block} bash
:substitutions:
# Update the vllm-ascend image according to your environment.
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you are running docker with bridge networking, expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

You need to set up the environment on each node.

## Prefiller/Decoder Deployment

Run the following scripts to launch a server on the prefiller and decoder nodes, respectively. Note that each P/D node occupies ports ranging from `kv_port` to `kv_port + num_chips` to initialize socket listeners, so these port ranges must not conflict. Additionally, ensure that each node's `engine_id` is unique.

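The port-range rule above can be sketched as a quick conflict check. This is an illustration of the rule as stated, not vllm-ascend code; the `kv_port` values and the 16-chip count come from this guide's example configuration:

```python
# Illustration only (not vllm-ascend code): each instance listens on
# ports [kv_port, kv_port + num_chips); ranges on one host must not overlap.
def kv_port_range(kv_port: int, num_chips: int) -> range:
    return range(kv_port, kv_port + num_chips)

def ranges_overlap(a: range, b: range) -> bool:
    # Two half-open ranges overlap iff each starts before the other ends.
    return a.start < b.stop and b.start < a.stop

prefill_ports = kv_port_range(30000, 16)  # kv_port "30000", 16 chips
decode_ports = kv_port_range(30200, 16)   # kv_port "30200", 16 chips
print(ranges_overlap(prefill_ports, decode_ports))  # → False, no conflict
```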
1. Run the following script to execute online 128k inference on the three nodes respectively.

:::::{tab-set}
:sync-group: nodes

::::{tab-item} Prefiller node 1
:sync: prefill node1

```shell
nic_name="eth0" # network card name
local_ip="192.0.0.1"
master_addr="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1

vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--decode-context-parallel-size 8 \
--prefill-context-parallel-size 2 \
--cp-kv-cache-interleave-size 128 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--enforce-eager \
--served-model-name deepseek_v3 \
--seed 1024 \
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
--max-num-seqs 1 \
--max-model-len 136000 \
--max-num-batched-tokens 136000 \
--block-size 128 \
--trust-remote-code \
--gpu-memory-utilization 0.8 \
--nnodes 2 \
--node-rank 0 \
--master-addr $master_addr \
--master-port 7001 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```

::::

::::{tab-item} Prefiller node 2
:sync: prefill node2

```shell
nic_name="eth0" # network card name
local_ip="192.0.0.2"
master_addr="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1

vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--decode-context-parallel-size 8 \
--prefill-context-parallel-size 2 \
--cp-kv-cache-interleave-size 128 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--enforce-eager \
--served-model-name deepseek_v3 \
--seed 1024 \
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
--max-num-seqs 1 \
--max-model-len 136000 \
--max-num-batched-tokens 136000 \
--block-size 128 \
--trust-remote-code \
--gpu-memory-utilization 0.8 \
--nnodes 2 \
--node-rank 1 \
--headless \
--master-addr $master_addr \
--master-port 7001 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "1",
"kv_connector_extra_config": {
"use_ascend_direct": true,
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```

::::

::::{tab-item} Decoder node 1
:sync: decoder node1

```shell
nic_name="eth0" # network card name
local_ip="192.0.0.3"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export HCCL_BUFFSIZE=768
export OMP_PROC_BIND=false
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export OMP_NUM_THREADS=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1

vllm serve /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp \
--host 0.0.0.0 \
--port 8004 \
--api-server-count 1 \
--data-parallel-size 1 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address $local_ip \
--data-parallel-rpc-port 5980 \
--decode-context-parallel-size 1 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--quantization ascend \
--no-enable-prefix-caching \
--distributed-executor-backend mp \
--served-model-name deepseek_v3 \
--seed 1024 \
--max-model-len 136000 \
--max-num-batched-tokens 128 \
--enable-chunked-prefill \
--max-num-seqs 4 \
--trust-remote-code \
--gpu-memory-utilization 0.96 \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4]}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30200",
"engine_id": "3",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 16
},
"decode": {
"dp_size": 1,
"tp_size": 16
}
}
}'
```

::::

:::::

2. Prefill master node `proxy.sh` script:

```shell
python load_balance_proxy_server_example.py \
--port 8005 \
--host 192.0.0.1 \
--prefiller-hosts \
192.0.0.1 \
--prefiller-ports \
8004 \
--decoder-hosts \
192.0.0.3 \
--decoder-ports \
8004
```

3. Run the proxy.

Run a proxy server on the same node as the prefiller service instance. You can get the proxy program from the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash proxy.sh
```
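Once the proxy is up, you can sanity-check the deployment with an OpenAI-style completion request. The sketch below only builds the request payload; the host/port and model name are the example values from this guide (adjust them to your deployment), and the prompt is a hypothetical placeholder:

```python
import json

# Build an OpenAI-style completion request for the proxy started above.
# Host/port (192.0.0.1:8005) and model name ("deepseek_v3") follow this
# guide's example values; adjust to your deployment.
url = "http://192.0.0.1:8005/v1/completions"
payload = {
    "model": "deepseek_v3",        # must match --served-model-name
    "prompt": "The capital of France is",
    "max_tokens": 32,
    "temperature": 0,
}
body = json.dumps(payload)
print(body)
# Send it with e.g.:
#   curl -s http://192.0.0.1:8005/v1/completions \
#     -H 'Content-Type: application/json' -d "$body"
```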

**Notice:**
The parameters are explained as follows:

- `--tensor-parallel-size 16` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 8` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkedPrefill/SplitFuse by default, which means:
  - (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
  - (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
  - Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as the profile run in vLLM), vLLM records the peak GPU memory usage during an inference pass with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
- `--quantization ascend` indicates that quantization is used. To disable quantization, remove this option.
- `--compilation-config` contains configurations related to the aclgraph graph mode. The most significant configurations are "cudagraph_mode" and "cudagraph_capture_sizes", which have the following meanings:
  - "cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
  - "cudagraph_capture_sizes": represents the different capture levels of the graph mode. The default value is [1, 2, 4, 8, 16, 24, 32, 40, ..., `--max-num-seqs`]. In the graph mode, the input for graphs at different levels is fixed, and inputs between levels are automatically padded to the next level. The default setting is recommended; only in some scenarios is it necessary to set this separately to achieve optimal performance.
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that FlashComm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tensor-parallel-size > 1.
- `export VLLM_ASCEND_ENABLE_CONTEXT_PARALLEL=1` indicates that context parallel is enabled. This environment variable is required in the PD-disaggregated architecture but not needed in the PD co-located deployment scenario. It will be removed in the future.

**Notice:**

- `--tensor-parallel-size` needs to be divisible by `--decode-context-parallel-size`.
- `--decode-context-parallel-size` must be less than or equal to `--tensor-parallel-size`.

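The KV-cache budget rule described for `--gpu-memory-utilization` can be sketched as simple arithmetic. This is a simplified illustration of the formula stated above, not vLLM's actual memory accounting, and the numbers are made up for the example:

```python
# Simplified sketch of the rule: available kv_cache =
#   gpu_memory_utilization * HBM size - peak memory during the profile run.
# Illustrative only; real vLLM accounting has more terms.
def kv_cache_budget_gib(hbm_gib: float, gpu_memory_utilization: float,
                        peak_profile_gib: float) -> float:
    return gpu_memory_utilization * hbm_gib - peak_profile_gib

# e.g. a hypothetical 64 GiB chip, utilization 0.8, 30 GiB profile-run peak
print(round(kv_cache_budget_gib(64, 0.8, 30.0), 1))  # → 21.2
```

Raising the utilization grows this budget linearly, which is why a value that is too high leaves no headroom when real inference peaks above the profile run.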
## Accuracy Evaluation

Here are two accuracy evaluation methods.

### Using AISBench

1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.

2. After execution, you can get the result. Here is the result of `DeepSeek-V3.1-w8a8`, for reference only.

| dataset | version | metric | mode | vllm-api-general-chat |
|----------| ----- | ----- | ----- |-----------------------|
| aime2024 | - | accuracy | gen | 86.67 |

## Performance

### Using AISBench

Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.

### Using vLLM Benchmark

Run a performance evaluation of `DeepSeek-V3.1-w8a8` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take `serve` as an example. Run the command as follows.

```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /path_to_weight/DeepSeek-V3.1_w8a8mix_mtp --dataset-name random --random-input 131072 --num-prompt 20 --request-rate 0 --save-result --result-dir ./
```

After several minutes, you can get the performance evaluation result.

| dataset | version | metric | mode | ttft |
|---------| ----- |-------------|------|--------|
| random | - | performance | perf | 20.7s |

@@ -0,0 +1,179 @@
# Long-Sequence Context Parallel (Qwen3-235B-A22B)

## Getting Started

vLLM-Ascend now supports long-sequence context parallel. This guide walks through the steps to verify these features with constrained resources.

Using the `Qwen3-235B-A22B-w8a8` (quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single-node "PD co-located" architecture.

## Environment Preparation

### Model Weight

- `Qwen3-235B-A22B-w8a8` (quantized version): requires 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)

It is recommended to download the model weight to a directory shared by all nodes, such as `/root/.cache/`.

### Run with Docker

Start a Docker container on each node.

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you are running docker with bridge networking, expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```

## Deployment

### Single-node Deployment

`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16).
The quantized version needs to be started with the parameter `--quantization ascend`.

Run the following script to execute online 128k inference.

```shell
#!/bin/sh
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=true
# To reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export TASK_QUEUE_ENABLE=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 8 \
--prefill-context-parallel-size 2 \
--decode-context-parallel-size 2 \
--seed 1024 \
--quantization ascend \
--served-model-name qwen3 \
--max-num-seqs 1 \
--max-model-len 133008 \
--max-num-batched-tokens 133008 \
--enable-expert-parallel \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8]}' \
--async-scheduling
```

**Notice:**

- For vLLM versions below `v0.12.0`, use the parameter `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}'` instead.
- For vLLM version `v0.12.0`, use the parameter `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}'`.

The parameters are explained as follows:

- `--tensor-parallel-size 8` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 2` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables ChunkedPrefill/SplitFuse by default, which means:
  - (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
  - (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
  - Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as the profile run in vLLM), vLLM records the peak GPU memory usage during an inference pass with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
- `--quantization ascend` indicates that quantization is used. To disable quantization, remove this option.
- `--compilation-config` contains configurations related to the aclgraph graph mode. The most significant configurations are "cudagraph_mode" and "cudagraph_capture_sizes", which have the following meanings:
  - "cudagraph_mode": represents the specific graph mode. Currently, "PIECEWISE" and "FULL_DECODE_ONLY" are supported. The graph mode is mainly used to reduce the cost of operator dispatch. Currently, "FULL_DECODE_ONLY" is recommended.
  - "cudagraph_capture_sizes": represents the different capture levels of the graph mode. The default value is [1, 2, 4, 8, 16, 24, 32, 40, ..., `--max-num-seqs`]. In the graph mode, the input for graphs at different levels is fixed, and inputs between levels are automatically padded to the next level. The default setting is recommended; only in some scenarios is it necessary to set this separately to achieve optimal performance.
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that FlashComm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.

**Notice:**

- tp_size needs to be divisible by dcp_size.
- The decode context parallel size must be less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.

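The two DCP constraints above can be sketched as a validity check. This is an illustration of the stated rules only; the `total_num_kv_heads` value used in the example is a hypothetical assumption, so check your model's `config.json` for the real number:

```python
# Sketch of the DCP constraints stated above (illustration, not vLLM code):
#   1) tp_size must be divisible by dcp_size
#   2) dcp_size <= max_dcp_size = tensor_parallel_size // total_num_kv_heads
def max_dcp_size(tensor_parallel_size: int, total_num_kv_heads: int) -> int:
    return tensor_parallel_size // total_num_kv_heads

def dcp_valid(tp_size: int, dcp_size: int, total_num_kv_heads: int) -> bool:
    return (tp_size % dcp_size == 0
            and dcp_size <= max_dcp_size(tp_size, total_num_kv_heads))

# tp=8, dcp=2 as in the script above; 4 KV heads is an assumed example value
print(dcp_valid(8, 2, 4))  # → True
print(dcp_valid(8, 4, 4))  # → False, dcp exceeds max_dcp_size = 2
```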
## Accuracy Evaluation

Here are two accuracy evaluation methods.

### Using AISBench

1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.

2. After execution, you can get the result. Here is the result of `Qwen3-235B-A22B-w8a8`, for reference only.

| dataset | version | metric | mode | vllm-api-general-chat |
|----------| ----- | ----- | ----- |-----------------------|
| aime2024 | - | accuracy | gen | 83.33 |

## Performance

### Using AISBench

Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.

### Using vLLM Benchmark

Run a performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take `serve` as an example. Run the command as follows.

```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompt 1 --request-rate 1 --save-result --result-dir ./
```

After several minutes, you can get the performance evaluation result.

| dataset | version | metric | mode | ttft |
|---------| ----- |-------------|------|--------|
| random | - | performance | perf | 17.36s |

@@ -0,0 +1,343 @@
# PD-Colocated with Mooncake Multi-Instance

## Getting Started

vLLM-Ascend now supports PD-colocated deployment with Mooncake features. This guide provides step-by-step instructions to test these features with constrained resources.

Using the Qwen2.5-72B-Instruct model as an example, this guide demonstrates how to use vllm-ascend v0.11.0 (with vLLM v0.11.0) on two Atlas 800T A2 nodes to deploy two vLLM instances. Each instance occupies 4 NPU cards and uses PD-colocated deployment.
## Verify Multi-Node Communication Environment
|
||||
|
||||
### Physical Layer Requirements
|
||||
|
||||
- The two Atlas 800T A2 nodes must be physically interconnected via a RoCE
|
||||
network. Without RoCE interconnection, cross-node KV Cache access
|
||||
performance will be significantly degraded.
|
||||
- All NPU cards must communicate properly. Intra-node communication uses HCCS,
|
||||
while inter-node communication uses the RoCE network.
|
||||
|
||||
### Verification Process
|
||||
|
||||
The following process serves as a reference example. Please modify parameters
|
||||
such as IP addresses according to your actual environment.
|
||||
|
||||
1. Single Node Verification:
|
||||
|
||||
Execute the following commands sequentially. The results must all be
|
||||
`success` and the status must be `UP`:
|
||||
|
||||
```bash
|
||||
# Check the remote switch ports
|
||||
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
|
||||
# Get the link status of the Ethernet ports (UP or DOWN)
|
||||
for i in {0..7}; do hccn_tool -i $i -link -g ; done
|
||||
# Check the network health status
|
||||
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
|
||||
# View the network detected IP configuration
|
||||
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
|
||||
# View gateway configuration
|
||||
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
|
||||
```
|
||||
|
||||
2. Check NPU HCCN Configuration:
|
||||
|
||||
Ensure that the hccn.conf file exists in the environment. If using Docker,
|
||||
mount it into the container.
|
||||
|
||||
```bash
|
||||
cat /etc/hccn.conf
|
||||
```
|
||||
|
||||
3. Get NPU IP Addresses:
|
||||
|
||||
```bash
|
||||
for i in {0..7}; do hccn_tool -i $i -ip -g; done
|
||||
```
|
||||
|
||||
4. Cross-Node PING Test:
|
||||
|
||||
```bash
|
||||
# Execute the following command on each node, replacing x.x.x.x
|
||||
# with the target node's NPU card address.
|
||||
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
|
||||
```
|
||||
|
||||
5. Check NPU TLS Configuration
|
||||
|
||||
```bash
|
||||
# The tls settings should be consistent across all nodes
|
||||
for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
|
||||
```
|
||||
|
||||
## Run with Docker

Start a Docker container on each node.

```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0
export NAME=vllm-ascend

# Run the container using the defined variables.
# This test uses four NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
```

## (Optional) Install Mooncake

Mooncake is pre-installed and functional in the v0.11.0 image, so the following installation steps are optional.

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.

First, obtain the Mooncake project:

```bash
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive
```

Install MPI:

```bash
apt-get install mpich libmpich-dev -y
```

Install the remaining dependencies (installing Go is not required):

```bash
bash dependencies.sh -y
```

Compile and install:

```bash
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```

After installation, verify that Mooncake is installed correctly:

```bash
python -c "import mooncake; print(mooncake.__file__)"
# Expected output path:
# /usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake/__init__.py
```

## Start Mooncake Master Service

Start the Mooncake master service in one of the node containers:

```bash
docker exec -it vllm-ascend bash
cd /vllm-workspace/Mooncake
mooncake_master --port 50088 \
    --eviction_high_watermark_ratio 0.95 \
    --eviction_ratio 0.05
```

| Parameter                     | Value | Explanation                          |
| ----------------------------- | ----- | ------------------------------------ |
| port                          | 50088 | Port for the master service          |
| eviction_high_watermark_ratio | 0.95  | High watermark ratio (95% threshold) |
| eviction_ratio                | 0.05  | Fraction to evict when full (5%)     |

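The two eviction flags work together: nothing is evicted until usage crosses the high watermark, and then a fixed fraction of capacity is freed. A minimal sketch of that policy (illustrative only; the function is ours, not Mooncake's actual implementation):

```python
def eviction_plan(capacity_bytes: int,
                  used_bytes: int,
                  high_watermark_ratio: float = 0.95,
                  eviction_ratio: float = 0.05) -> int:
    """Return how many bytes would be evicted; 0 while below the watermark.

    Mirrors the two flags passed to mooncake_master above.
    """
    if used_bytes < capacity_bytes * high_watermark_ratio:
        return 0
    # Once the watermark is crossed, evict a fixed fraction of total capacity.
    return int(capacity_bytes * eviction_ratio)

GiB = 1024 ** 3
capacity = 100 * GiB  # matches the 100 GiB global_segment_size used below
print(eviction_plan(capacity, 90 * GiB))  # below the 95% watermark -> 0
print(eviction_plan(capacity, 96 * GiB))  # above the watermark -> 5 GiB
```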
## Create a Mooncake Configuration File Named mooncake.json

The template for the mooncake.json file is as follows:

```json
{
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "use_ascend_direct": true,
    "master_server_address": "<your_server_ip>:50088",
    "global_segment_size": 107374182400
}
```

| Parameter             | Value                             | Explanation                   |
| --------------------- | --------------------------------- | ----------------------------- |
| metadata_server       | P2PHANDSHAKE                      | Point-to-point handshake mode |
| protocol              | ascend                            | Ascend proprietary protocol   |
| use_ascend_direct     | true                              | Enable direct hardware access |
| master_server_address | 90.90.100.188:50088 (for example) | Master server address         |
| global_segment_size   | 107374182400                      | Size per segment (100 GiB)    |

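To avoid hand-editing the byte count, the file can be generated programmatically. A small sketch (the helper name is ours; the field names follow the template above, and `global_segment_size` is 100 GiB expressed in bytes):

```python
import json

def make_mooncake_config(master_ip: str,
                         segment_gib: int = 100,
                         port: int = 50088) -> dict:
    """Build the mooncake.json contents shown in the template above."""
    return {
        "metadata_server": "P2PHANDSHAKE",
        "protocol": "ascend",
        "device_name": "",
        "use_ascend_direct": True,
        "master_server_address": f"{master_ip}:{port}",
        # global_segment_size is in bytes: 100 GiB = 100 * 2**30
        "global_segment_size": segment_gib * 1024 ** 3,
    }

cfg = make_mooncake_config("90.90.100.188")
assert cfg["global_segment_size"] == 107374182400  # the value in the template
with open("mooncake.json", "w") as f:
    json.dump(cfg, f, indent=4)
```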
## vLLM Instance Deployment

Create containers on both Node 1 and Node 2, and launch the Qwen2.5-72B-Instruct model service in each to test the reusability and performance of cross-node, cross-instance KV Cache. Instance 1 uses NPU cards [0-3] on the first Atlas 800T A2 server, while Instance 2 uses cards [0-3] on the second server.

### Deploy Instance 1

Replace file paths, host, and port parameters based on your actual environment configuration.

```bash
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/vllm-workspace/mooncake.json"
# NPU buffer pool, formatted as quantity:size(MB).
# Allocates 4 buffers of 8 MB each for KV transfer.
export ASCEND_BUFFER_POOL=4:8

vllm serve <path_to_your_model>/Qwen2.5-72B-Instruct/ \
    --served-model-name qwen \
    --dtype bfloat16 \
    --max-model-len 25600 \
    --tensor-parallel-size 4 \
    --host <your_server_ip> \
    --port 8002 \
    --max-num-batched-tokens 4096 \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config '{
        "kv_connector": "MooncakeConnectorStoreV1",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "use_layerwise": false,
            "mooncake_rpc_port": "0",
            "load_async": true,
            "register_buffer": true
        }
    }'
```

### Deploy Instance 2

The deployment method for Instance 2 is identical to Instance 1. Simply modify the `--host` and `--port` parameters according to your Instance 2 configuration.

### Configuration Parameters

| Parameter         | Value                    | Explanation                      |
| ----------------- | ------------------------ | -------------------------------- |
| kv_connector      | MooncakeConnectorStoreV1 | Use the StoreV1 connector        |
| kv_role           | kv_both                  | Enable both produce and consume  |
| use_layerwise     | false                    | Transfer entire cache (see note) |
| mooncake_rpc_port | 0                        | Automatic port assignment        |
| load_async        | true                     | Enable asynchronous loading      |
| register_buffer   | true                     | Required for PD-colocated mode   |

**Note on use_layerwise:**

- `false`: Transfer the entire KV Cache (suitable for cross-node transfer with sufficient bandwidth)
- `true`: Layer-by-layer transfer (suitable for single-node memory constraints)

## Benchmark

We recommend the **aisbench** tool for performance assessment. The test uses **Dataset A**, consisting of fully random data, with the following configuration:

- Input/output tokens: 1024/10
- Total requests: 100
- Concurrency: 25

The test procedure consists of three steps:

### Step 1: Baseline (No Cache)

Send Dataset A to Instance 1 on Node 1 and record the Time to First Token (TTFT) as **TTFT1**.

### Preparation for Step 2

Before Step 2, send a fully random Dataset B to Instance 1. Due to the unified HBM/DRAM KV Cache with LRU (Least Recently Used) eviction policy, Dataset B's cache evicts Dataset A's cache from HBM, leaving Dataset A's cache only in Node 1's DRAM.

### Step 2: Local DRAM Hit

Send Dataset A to Instance 1 again to measure the performance when hitting the KV Cache in local DRAM. Record the TTFT as **TTFT2**.

### Step 3: Cross-Node DRAM Hit

Send Dataset A to Instance 2. With the Mooncake KV Cache pool, this results in a cross-node KV Cache hit from Node 1's DRAM. Record the TTFT as **TTFT3**.

**Model Configuration**:

```python
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="<path_to_your_model>/Qwen2.5-72B-Instruct",
        model="qwen",
        request_rate=0,
        retry=2,
        host_ip="<your_server_ip>",
        host_port=8002,
        max_out_len=10,
        batch_size=25,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0,
            ignore_eos=True,
        ),
    )
]
```

**Performance Benchmarking Commands**:

```shell
ais_bench --models vllm_api_stream_chat \
    --datasets gsm8k_gen_0_shot_cot_str_perf \
    --debug --summarizer default_perf --mode perf
```

### Test Results

| Requests | Concurrency | TTFT1 (ms) | TTFT2 (ms) | TTFT3 (ms) |
| -------- | ----------- | ---------- | ---------- | ---------- |
| 100      | 25          | 2322       | 739        | 948        |

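The table above can be read as TTFT speedups relative to the no-cache baseline; a quick check of the arithmetic:

```python
# TTFT values (ms) from the table above
ttft_no_cache = 2322    # Step 1: baseline, no KV Cache reuse
ttft_local_dram = 739   # Step 2: hit in local DRAM
ttft_cross_node = 948   # Step 3: hit in Node 1's DRAM via the Mooncake pool

local_speedup = ttft_no_cache / ttft_local_dram
cross_speedup = ttft_no_cache / ttft_cross_node
print(f"local DRAM hit: {local_speedup:.2f}x faster TTFT")   # ~3.14x
print(f"cross-node hit: {cross_speedup:.2f}x faster TTFT")   # ~2.45x
```

Even the cross-node hit, which pays the RoCE transfer cost, remains far cheaper than recomputing the prefill from scratch.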
# Prefill-Decode Disaggregation (Deepseek)

## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide provides step-by-step instructions for verifying these features with constrained resources.

Taking the Deepseek-r1-w8a8 model as an example, use 4 Atlas 800T A3 servers to deploy the "2P2D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.

## Verify Multi-Node Communication Environment

### Physical Layer Requirements

- The physical machines must be located on the same LAN, with network connectivity.
- All NPUs must be interconnected. Intra-node connectivity is via HCCS, and inter-node connectivity is via RDMA.

### Verification Process

:::::{tab-set}
::::{tab-item} A3

1. Single-node verification:

   Execute the following commands on each node in sequence. Every result must be `success` and every status must be `UP`:

   ```bash
   # Check the remote switch ports
   for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
   # Get the link status of the Ethernet ports (UP or DOWN)
   for i in {0..15}; do hccn_tool -i $i -link -g; done
   # Check the network health status
   for i in {0..15}; do hccn_tool -i $i -net_health -g; done
   # View the network detected IP configuration
   for i in {0..15}; do hccn_tool -i $i -netdetect -g; done
   # View the gateway configuration
   for i in {0..15}; do hccn_tool -i $i -gateway -g; done
   ```

2. Check the NPU HCCN configuration:

   Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.

   ```bash
   cat /etc/hccn.conf
   ```

3. Get the NPU IP addresses:

   ```bash
   # Get the virtual NPU IPs
   for i in {0..15}; do hccn_tool -i $i -vnic -g; done
   ```

4. Get the superpod ID and SDID:

   ```bash
   for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0; npu-smi info -t spod-info -i $i -c 1; done
   ```

5. Cross-node PING test:

   ```bash
   # Execute on the target node (replace 'x.x.x.x' with a virtual NPU IP address)
   for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x; done
   ```

6. Check the NPU TLS configuration:

   ```bash
   # The TLS settings must be consistent across all nodes
   for i in {0..15}; do hccn_tool -i $i -tls -g; done | grep switch
   ```

::::

::::{tab-item} A2

1. Single-node verification:

   Execute the following commands on each node in sequence. Every result must be `success` and every status must be `UP`:

   ```bash
   # Check the remote switch ports
   for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
   # Get the link status of the Ethernet ports (UP or DOWN)
   for i in {0..7}; do hccn_tool -i $i -link -g; done
   # Check the network health status
   for i in {0..7}; do hccn_tool -i $i -net_health -g; done
   # View the network detected IP configuration
   for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
   # View the gateway configuration
   for i in {0..7}; do hccn_tool -i $i -gateway -g; done
   ```

2. Check the NPU HCCN configuration:

   Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.

   ```bash
   cat /etc/hccn.conf
   ```

3. Get the NPU IP addresses:

   ```bash
   for i in {0..7}; do hccn_tool -i $i -ip -g; done
   ```

4. Cross-node PING test:

   ```bash
   # Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
   for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
   ```

5. Check the NPU TLS configuration:

   ```bash
   # The TLS settings must be consistent across all nodes
   for i in {0..7}; do hccn_tool -i $i -tls -g; done | grep switch
   ```

::::

:::::

## Run with Docker

Start a Docker container on each node.

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables.
# Note: If you are running a bridge network with Docker, expose the ports
# needed for multi-node communication in advance.
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci8 \
    --device /dev/davinci9 \
    --device /dev/davinci10 \
    --device /dev/davinci11 \
    --device /dev/davinci12 \
    --device /dev/davinci13 \
    --device /dev/davinci14 \
    --device /dev/davinci15 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /mnt/sfs_turbo/.cache:/root/.cache \
    -it $IMAGE bash
```

## Install Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.

First, obtain the Mooncake project:

```shell
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
```

(Optional) Replace the Go download URL if your network access is poor:

```shell
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
```

Install MPI:

```shell
apt-get install mpich libmpich-dev -y
```

Install the remaining dependencies (installing Go is not required):

```shell
bash dependencies.sh -y
```

Compile and install:

```shell
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```

Set environment variables.

**Note:**

- Adjust the Python path according to your specific Python installation
- Ensure `/usr/local/lib` and `/usr/local/lib64` are in your `LD_LIBRARY_PATH`

```shell
export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
```

## Prefiller/Decoder Deployment

Run the following scripts to launch a server on each prefiller/decoder node. Note that each P/D node occupies ports ranging from kv_port to kv_port + num_chips to initialize socket listeners, so choose kv_port values that cannot collide. Additionally, ensure that each node's engine_id is uniquely assigned.

### kv_port Configuration Guide

On Ascend NPU, Mooncake uses AscendDirectTransport for RDMA data transfer, which randomly allocates ports within the range `[20000, 20000 + npu_per_node × 1000)`. If `kv_port` overlaps with this range, intermittent port conflicts may occur. To avoid this, configure `kv_port` according to the table below:

| NPUs per Node | Reserved Port Range | Recommended kv_port |
| ------------- | ------------------- | ------------------- |
| 8             | 20000 - 27999       | >= 28000            |
| 16            | 20000 - 35999       | >= 36000            |

```{warning}
If you occasionally see `zmq.error.ZMQError: Address already in use` during startup, it may be caused by kv_port conflicting with randomly allocated AscendDirectTransport ports. Increase your kv_port value to avoid the reserved range.
```

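The reserved range in the table above follows directly from the allocation rule; a small sketch (the helper names are ours, not part of vLLM or Mooncake):

```python
def reserved_port_range(npus_per_node: int, base: int = 20000) -> range:
    """Ports AscendDirectTransport may grab: [20000, 20000 + npus * 1000)."""
    return range(base, base + npus_per_node * 1000)

def safe_kv_port(npus_per_node: int) -> int:
    """Smallest kv_port guaranteed to avoid the reserved range."""
    return reserved_port_range(npus_per_node).stop

assert safe_kv_port(8) == 28000    # matches the table: >= 28000
assert safe_kv_port(16) == 36000   # matches the table: >= 36000

# Each instance then listens on [kv_port, kv_port + num_chips), so space
# kv_port values at least num_chips apart across instances, e.g. 36000, 36100.
```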
### launch_online_dp.py

Use [launch_online_dp.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/launch_online_dp.py) to launch the externally managed data-parallel vLLM servers.

### run_dp_template.sh

Modify [run_dp_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh) on each node.

#### Layerwise

:::::{tab-set}
:sync-group: nodes

::::{tab-item} Prefiller node 1
:sync: prefill node1

```shell
nic_name="eth0"  # network card name
local_ip="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name ds_r1 \
    --max-model-len 40000 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 8 \
    --enforce-eager \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
    --additional-config '{"recompute_scheduler_enable": true, "enable_shared_expert_dp": true}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeLayerwiseConnector",
      "kv_role": "kv_producer",
      "kv_port": "36000",
      "engine_id": "0",
      "kv_connector_extra_config": {
          "prefill": {
              "dp_size": 2,
              "tp_size": 8
          },
          "decode": {
              "dp_size": 32,
              "tp_size": 1
          }
      }
    }'
```

::::

::::{tab-item} Prefiller node 2
:sync: prefill node2

```shell
nic_name="eth0"  # network card name
local_ip="192.0.0.2"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name ds_r1 \
    --max-model-len 40000 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 8 \
    --enforce-eager \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
    --additional-config '{"recompute_scheduler_enable": true, "enable_shared_expert_dp": true}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeLayerwiseConnector",
      "kv_role": "kv_producer",
      "kv_port": "36100",
      "engine_id": "1",
      "kv_connector_extra_config": {
          "prefill": {
              "dp_size": 2,
              "tp_size": 8
          },
          "decode": {
              "dp_size": 32,
              "tp_size": 1
          }
      }
    }'
```

::::

::::{tab-item} Decoder node 1
:sync: decoder node1

```shell
nic_name="eth0"  # network card name
local_ip="192.0.0.3"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=600
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name ds_r1 \
    --max-model-len 40000 \
    --max-num-batched-tokens 256 \
    --max-num-seqs 40 \
    --trust-remote-code \
    --gpu-memory-utilization 0.94 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
    --additional-config '{"recompute_scheduler_enable": true, "multistream_overlap_shared_expert": true, "finegrained_tp_config": {"lmhead_tensor_parallel_size": 16}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeLayerwiseConnector",
      "kv_role": "kv_consumer",
      "kv_port": "36200",
      "engine_id": "2",
      "kv_connector_extra_config": {
          "prefill": {
              "dp_size": 2,
              "tp_size": 8
          },
          "decode": {
              "dp_size": 32,
              "tp_size": 1
          }
      }
    }'
```

::::

::::{tab-item} Decoder node 2
:sync: decoder node2

```shell
nic_name="eth0"  # network card name
local_ip="192.0.0.4"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=600
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name ds_r1 \
    --max-model-len 40000 \
    --max-num-batched-tokens 256 \
    --max-num-seqs 40 \
    --trust-remote-code \
    --gpu-memory-utilization 0.94 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
    --additional-config '{"recompute_scheduler_enable": true, "multistream_overlap_shared_expert": true, "finegrained_tp_config": {"lmhead_tensor_parallel_size": 16}}' \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeLayerwiseConnector",
      "kv_role": "kv_consumer",
      "kv_port": "36300",
      "engine_id": "3",
      "kv_connector_extra_config": {
          "prefill": {
              "dp_size": 2,
              "tp_size": 8
          },
          "decode": {
              "dp_size": 32,
              "tp_size": 1
          }
      }
    }'
```

::::

:::::

#### Non-layerwise

:::::{tab-set}
:sync-group: nodes

::::{tab-item} Prefiller node 1
:sync: prefill node1

```shell
nic_name="eth0"  # network card name
local_ip="192.0.0.1"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name ds_r1 \
    --max-model-len 40000 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 8 \
    --enforce-eager \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
    --additional-config '{"recompute_scheduler_enable": true, "enable_shared_expert_dp": true}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_role": "kv_producer",
      "kv_port": "36000",
      "engine_id": "0",
      "kv_connector_extra_config": {
          "prefill": {
              "dp_size": 2,
              "tp_size": 8
          },
          "decode": {
              "dp_size": 32,
              "tp_size": 1
          }
      }
    }'
```

::::

::::{tab-item} Prefiller node 2
:sync: prefill node2

```shell
nic_name="eth0"  # network card name
local_ip="192.0.0.2"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=256
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
    --host 0.0.0.0 \
    --port $2 \
    --data-parallel-size $3 \
    --data-parallel-rank $4 \
    --data-parallel-address $5 \
    --data-parallel-rpc-port $6 \
    --tensor-parallel-size $7 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name ds_r1 \
    --max-model-len 40000 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 8 \
    --enforce-eager \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --speculative-config '{"num_speculative_tokens": 1, "method": "deepseek_mtp"}' \
    --additional-config '{"recompute_scheduler_enable": true, "enable_shared_expert_dp": true}' \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_role": "kv_producer",
      "kv_port": "36100",
      "engine_id": "1",
      "kv_connector_extra_config": {
          "prefill": {
              "dp_size": 2,
              "tp_size": 8
          },
          "decode": {
              "dp_size": 32,
              "tp_size": 1
          }
      }
    }'
```

::::

::::{tab-item} Decoder node 1
|
||||
:sync: decoder node1
|
||||
|
||||
```shell
|
||||
nic_name="eth0" # network card name
|
||||
local_ip="192.0.0.3"
|
||||
export HCCL_IF_IP=$local_ip
|
||||
export GLOO_SOCKET_IFNAME=$nic_name
|
||||
export TP_SOCKET_IFNAME=$nic_name
|
||||
export HCCL_SOCKET_IFNAME=$nic_name
|
||||
export OMP_PROC_BIND=false
|
||||
export OMP_NUM_THREADS=10
|
||||
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
|
||||
export HCCL_BUFFSIZE=600
|
||||
export TASK_QUEUE_ENABLE=1
|
||||
export HCCL_OP_EXPANSION_MODE="AIV"
|
||||
export VLLM_USE_V1=1
|
||||
export ASCEND_RT_VISIBLE_DEVICES=$1
|
||||
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
|
||||
--host 0.0.0.0 \
|
||||
--port $2 \
|
||||
--data-parallel-size $3 \
|
||||
--data-parallel-rank $4 \
|
||||
--data-parallel-address $5 \
|
||||
--data-parallel-rpc-port $6 \
|
||||
--tensor-parallel-size $7 \
|
||||
--enable-expert-parallel \
|
||||
--seed 1024 \
|
||||
--served-model-name ds_r1 \
|
||||
--max-model-len 40000 \
|
||||
--max-num-batched-tokens 256 \
|
||||
--max-num-seqs 40 \
|
||||
--trust-remote-code \
|
||||
--gpu-memory-utilization 0.94 \
|
||||
--quantization ascend \
|
||||
--no-enable-prefix-caching \
|
||||
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
|
||||
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
|
||||
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
|
||||
--kv-transfer-config \
|
||||
'{"kv_connector": "MooncakeConnectorV1",
|
||||
"kv_role": "kv_consumer",
|
||||
"kv_port": "36200",
|
||||
"engine_id": "2",
|
||||
"kv_connector_extra_config": {
|
||||
"prefill": {
|
||||
"dp_size": 2,
|
||||
"tp_size": 8
|
||||
},
|
||||
"decode": {
|
||||
"dp_size": 32,
|
||||
"tp_size": 1
|
||||
}
|
||||
}
|
||||
}'
|
||||
|
||||
::::

::::{tab-item} Decoder node 2
:sync: decoder node2

```shell
nic_name="eth0" # network card name
local_ip="192.0.0.4"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=600
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
vllm serve /path_to_weight/DeepSeek-r1_w8a8_mtp \
--host 0.0.0.0 \
--port $2 \
--data-parallel-size $3 \
--data-parallel-rank $4 \
--data-parallel-address $5 \
--data-parallel-rpc-port $6 \
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
--max-model-len 40000 \
--max-num-batched-tokens 256 \
--max-num-seqs 40 \
--trust-remote-code \
--gpu-memory-utilization 0.94 \
--quantization ascend \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config '{"recompute_scheduler_enable":true,"multistream_overlap_shared_expert": true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":16}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "36200",
"engine_id": "2",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 2,
"tp_size": 8
},
"decode": {
"dp_size": 32,
"tp_size": 1
}
}
}'
```

::::

:::::

### Start the service

```bash
# on 190.0.0.1
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 190.0.0.1 --dp-rpc-port 12321 --vllm-start-port 7100
# on 190.0.0.2
python launch_online_dp.py --dp-size 2 --tp-size 8 --dp-size-local 2 --dp-rank-start 0 --dp-address 190.0.0.2 --dp-rpc-port 12321 --vllm-start-port 7100
# on 190.0.0.3
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address 190.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
# on 190.0.0.4
python launch_online_dp.py --dp-size 32 --tp-size 1 --dp-size-local 16 --dp-rank-start 16 --dp-address 190.0.0.3 --dp-rpc-port 12321 --vllm-start-port 7100
```
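To see which data-parallel ranks a node will host, note that a node launched with `--dp-rank-start S` and `--dp-size-local N` serves ranks `S` through `S + N - 1`. A small sketch based on the launch commands above (the exact semantics are defined by `launch_online_dp.py` itself):

```shell
# Ranks hosted by the node launched with --dp-rank-start 16 --dp-size-local 16
DP_RANK_START=16
DP_SIZE_LOCAL=16
DP_RANK_END=$((DP_RANK_START + DP_SIZE_LOCAL - 1))
echo "This node hosts DP ranks $DP_RANK_START..$DP_RANK_END"
```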

## Example Proxy for Deployment

Run a proxy server on the same node where your prefiller service instance is deployed. You can find the proxy implementations in the repository's examples directory.

We provide two proxy implementations with distinct request-routing behaviors:

- **`load_balance_proxy_layerwise_server_example.py`**: Requests are first routed to the D nodes, which then forward them to the P nodes as needed. This proxy is designed for use with the MooncakeLayerwiseConnector: [load\_balance\_proxy\_layerwise\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py)

- **`load_balance_proxy_server_example.py`**: Requests are first routed to the P nodes, which then forward them to the D nodes for subsequent processing. This proxy is designed for use with the MooncakeConnector: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

:::::{tab-set}

::::{tab-item} Layerwise

```shell
python load_balance_proxy_layerwise_server_example.py \
--port 1999 \
--host 192.0.0.1 \
--prefiller-hosts \
192.0.0.1 \
192.0.0.1 \
192.0.0.2 \
192.0.0.2 \
--prefiller-ports \
7100 7101 7100 7101 \
--decoder-hosts \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
--decoder-ports \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115
```

::::

::::{tab-item} Non-layerwise

```shell
python load_balance_proxy_server_example.py \
--port 1999 \
--host 192.0.0.1 \
--prefiller-hosts \
192.0.0.1 \
192.0.0.1 \
192.0.0.2 \
192.0.0.2 \
--prefiller-ports \
7100 7101 7100 7101 \
--decoder-hosts \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.3 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
192.0.0.4 \
--decoder-ports \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 \
7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115
```

::::

:::::

| Parameter | Meaning |
| --- | --- |
| --port | Proxy service port |
| --host | Proxy service host IP |
| --prefiller-hosts | Hosts of prefiller nodes |
| --prefiller-ports | Ports of prefiller nodes |
| --decoder-hosts | Hosts of decoder nodes |
| --decoder-ports | Ports of decoder nodes |

You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

## Benchmark

We recommend using the [AISBench](https://gitee.com/aisbench/benchmark) tool to assess performance. Execute the following commands to install AISBench:

```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./
```

You need to cancel the HTTP proxy before assessing performance, as follows:

```shell
# unset proxy
unset http_proxy
unset https_proxy
```

- You can place your datasets in the directory `benchmark/ais_bench/datasets`.
- You can change the configuration in the directory `benchmark/ais_bench/benchmark/configs/models/vllm_api`. Take `vllm_api_stream_chat.py` as an example:

```python
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="/root/.cache/ds_r1",
        model="dsr1",
        request_rate = 14,
        retry = 2,
        host_ip = "192.0.0.1", # Proxy service host IP
        host_port = 8000, # Proxy service port
        max_out_len = 10,
        batch_size=768,
        trust_remote_code=True,
        generation_kwargs = dict(
            temperature = 0,
            seed = 1024,
            ignore_eos=False,
        )
    )
]
```

- Take the gsm8k dataset as an example; execute the following command to assess performance:

```shell
ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --mode perf
```

- For more details on AISBench commands and parameters, refer to [AISBench](https://gitee.com/aisbench/benchmark).
## FAQ

### 1. Prefiller nodes need warm-up

Since some NPU operators require several rounds of warm-up to reach their best performance, we recommend preheating the service with some requests before conducting performance tests to achieve the best end-to-end throughput.
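As an illustrative sketch of such preheating (the endpoint, model name, and request count here are assumptions matching the proxy deployment above, not a prescribed procedure):

```shell
# Illustrative warm-up: send a few short requests through the proxy before
# benchmarking. IP, port, and model name must match your own deployment.
WARMUP_REQUESTS=4
for i in $(seq 1 $WARMUP_REQUESTS); do
  curl -s --connect-timeout 1 --max-time 2 http://192.0.0.1:1999/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "ds_r1", "prompt": "warm up", "max_tokens": 8}' > /dev/null || true
  echo "warm-up request $i sent"
done
```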
## Verification

Check service health using the proxy server endpoint.

```shell
curl http://192.0.0.1:1999/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ds_r1",
"prompt": "Who are you?",
"max_completion_tokens": 100,
"temperature": 0
}'
```

# Prefill-Decode Disaggregation (Qwen2.5-VL)

## Getting Started

vLLM-Ascend now supports prefill-decode (PD) disaggregation. This guide takes one-by-one steps to verify these features with constrained resources.

Using the Qwen2.5-VL-7B-Instruct model as an example, use vllm-ascend v0.11.0rc1 (with vLLM v0.11.0) on 1 Atlas 800T A2 server to deploy the "1P1D" architecture. Assume the IP address is 192.0.0.1.

## Verify Communication Environment

### Verification Process

1. Single-Node Verification:

   Execute the following commands in sequence. The results must all be `success` and the status must be `UP`:

   ```bash
   # Check the remote switch ports
   for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
   # Get the link status of the Ethernet ports (UP or DOWN)
   for i in {0..7}; do hccn_tool -i $i -link -g ; done
   # Check the network health status
   for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
   # View the network detected IP configuration
   for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
   # View gateway configuration
   for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
   ```

2. Check NPU HCCN Configuration:

   Ensure that the hccn.conf file exists in the environment. If using Docker, mount it into the container.

   ```bash
   cat /etc/hccn.conf
   ```

3. Get NPU IP Addresses:

   ```bash
   for i in {0..7}; do hccn_tool -i $i -ip -g; done
   ```

4. Cross-Node PING Test:

   ```bash
   # Execute on the target node (replace 'x.x.x.x' with the actual NPU IP address)
   for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x; done
   ```

5. Check NPU TLS Configuration:

   ```bash
   # The TLS settings should be consistent across all nodes
   for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
   ```

## Run with Docker

Start a Docker container.

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```

## Install Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Installation and compilation guide: <https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries>.

First, obtain the Mooncake project:

```shell
git clone -b v0.3.8.post1 --depth 1 https://github.com/kvcache-ai/Mooncake.git
```

(Optional) Replace the Go download URL if your network connection to go.dev is poor:

```shell
cd Mooncake
sed -i 's|https://go.dev/dl/|https://golang.google.cn/dl/|g' dependencies.sh
```

Install MPI:

```shell
apt-get install mpich libmpich-dev -y
```

Install the remaining dependencies. The installation of Go is not required:

```shell
bash dependencies.sh -y
```

Compile and install:

```shell
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON
make -j
make install
```

Set environment variables.

**Note:**

- Adjust the Python path according to your specific Python installation
- Ensure `/usr/local/lib` and `/usr/local/lib64` are in your `LD_LIBRARY_PATH`

```shell
export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
```
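A quick sanity check (the path here is the one assumed above; adjust it to your Python installation) to confirm the Mooncake directory actually ended up on `LD_LIBRARY_PATH`:

```shell
# Verify that the Mooncake library directory is on LD_LIBRARY_PATH
MOONCAKE_DIR=/usr/local/lib64/python3.11/site-packages/mooncake
export LD_LIBRARY_PATH=$MOONCAKE_DIR:$LD_LIBRARY_PATH
case ":$LD_LIBRARY_PATH:" in
  *":$MOONCAKE_DIR:"*) MOONCAKE_ON_PATH=yes ;;
  *) MOONCAKE_ON_PATH=no ;;
esac
echo "mooncake on LD_LIBRARY_PATH: $MOONCAKE_ON_PATH"
```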

## Prefiller/Decoder Deployment

We can run the following scripts to launch a server on the prefiller and decoder NPUs, respectively.

:::::{tab-set}

::::{tab-item} Prefiller

```shell
export ASCEND_RT_VISIBLE_DEVICES=0
export HCCL_IF_IP=192.0.0.1 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10

vllm serve /model/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 13700 \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name qwen25vl \
--max-model-len 40000 \
--max-num-batched-tokens 40000 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_producer",
"kv_port": "30000",
"engine_id": "0",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
}'
```

::::

::::{tab-item} Decoder

```shell
export ASCEND_RT_VISIBLE_DEVICES=1
export HCCL_IF_IP=192.0.0.1 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10

vllm serve /model/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 13701 \
--no-enable-prefix-caching \
--tensor-parallel-size 1 \
--seed 1024 \
--served-model-name qwen25vl \
--max-model-len 40000 \
--max-num-batched-tokens 40000 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_role": "kv_consumer",
"kv_port": "30100",
"engine_id": "1",
"kv_connector_extra_config": {
"prefill": {
"dp_size": 1,
"tp_size": 1
},
"decode": {
"dp_size": 1,
"tp_size": 1
}
}
}'
```

::::

:::::

If you want to run "2P1D", set `ASCEND_RT_VISIBLE_DEVICES` and `--port` to different values for each prefiller process.

## Example Proxy for Deployment

Run a proxy server on the same node as the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
python load_balance_proxy_server_example.py \
--host 192.0.0.1 \
--port 8080 \
--prefiller-hosts 192.0.0.1 \
--prefiller-port 13700 \
--decoder-hosts 192.0.0.1 \
--decoder-ports 13701
```

| Parameter | Meaning |
| --- | --- |
| --port | Port of the proxy |
| --prefiller-port | All ports of the prefiller instances |
| --decoder-ports | All ports of the decoder instances |

## Verification

Check service health using the proxy server endpoint.

```shell
curl http://192.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen25vl",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustration?"}
]}
],
"max_completion_tokens": 100,
"temperature": 0
}'
```

# Ray Distributed (Qwen3-235B-A22B)

Multi-node inference is suitable for scenarios where the model cannot be deployed on a single machine. In such cases, the model can be distributed using tensor parallelism or pipeline parallelism. The specific parallelism strategies will be covered in the following sections. To successfully deploy multi-node inference, the following three steps need to be completed:

* **Verify Multi-Node Communication Environment**
* **Set Up and Start the Ray Cluster**
* **Start the Online Inference Service on Multi-node**

## Verify Multi-Node Communication Environment

### Physical Layer Requirements

* The physical machines must be located on the same LAN, with network connectivity.
* All NPUs are connected with optical modules, and the connection status must be normal.

### Verification Process

Execute the following commands on each node in sequence. The results must all be `success` and the status must be `UP`:

```bash
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
```

### NPU Interconnect Verification

#### 1. Get NPU IP Addresses

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```

#### 2. Cross-Node PING Test

```bash
# Execute on the target node (replace with the actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
```

## Set Up and Start the Ray Cluster

### Setting Up the Basic Container

To ensure a consistent execution environment across all nodes, including the model path and Python environment, it is advised to use Docker images.

For setting up a multi-node inference cluster with Ray, **containerized deployment** is the preferred approach. Containers should be started on both the primary and secondary nodes, with the `--net=host` option to enable proper network connectivity.

Below is an example container setup command, which should be executed on **all nodes**:

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you are running a bridge network with Docker, expose the ports needed for multi-node communication in advance
# IMPORTANT: /path/to/shared/cache (mounted to /root/.cache below) must be a shared directory accessible by all nodes
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /path/to/shared/cache:/root/.cache \
-it $IMAGE bash
```


### Start Ray Cluster

After setting up the containers and installing vllm-ascend on each node, follow the steps below to start the Ray cluster and execute inference tasks.

Choose one machine as the primary node and the others as secondary nodes. Before proceeding, use `ip addr` to check your `nic_name` (network interface name).

Set the `ASCEND_RT_VISIBLE_DEVICES` environment variable to specify the NPU devices to use. For Ray versions above 2.1, also set the `RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES` variable to avoid device recognition issues.

Below are the commands for the primary and secondary nodes:

**Primary node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect.
Updating the environment variables requires restarting the Ray cluster.
:::

```shell
# Head node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --head
```

**Secondary node**:

:::{note}
When starting a Ray cluster for multi-node inference, the environment variables on each node must be set **before** starting the Ray cluster for them to take effect. Updating the environment variables requires restarting the Ray cluster.
:::

```shell
# Worker node
export HCCL_IF_IP={local_ip}
export GLOO_SOCKET_IFNAME={nic_name}
export TP_SOCKET_IFNAME={nic_name}
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ray start --address='{head_node_ip}:6379' --node-ip-address={local_ip}
```

Once the cluster is started on multiple nodes, execute `ray status` and `ray list nodes` to verify the Ray cluster's status. You should see the correct number of nodes and NPUs listed.

After Ray starts successfully, the output confirms that a local Ray instance has started and includes the Dashboard URL (default: <http://localhost:8265>), the node status (CPU/memory resources, number of healthy nodes), and the cluster connection address (used for adding more nodes).

## Start the Online Inference Service on Multi-node

In the container, you can use vLLM as if all NPUs were on a single node. vLLM will utilize NPU resources across all nodes in the Ray cluster.

**You only need to run the vllm command on one node.**

To set up parallelism, the common practice is to set `tensor-parallel-size` to the number of NPUs per node, and `pipeline-parallel-size` to the number of nodes.

For example, with 16 NPUs across 2 nodes (8 NPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:

```shell
vllm serve Qwen/Qwen3-235B-A22B \
--distributed-executor-backend ray \
--pipeline-parallel-size 2 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--max-model-len 8192 \
--max-num-seqs 25 \
--served-model-name qwen \
--trust-remote-code \
--gpu-memory-utilization 0.9
```
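A quick sanity check for any such configuration: tensor parallel size times pipeline parallel size must equal the total number of NPUs in the Ray cluster (the variable names here are illustrative, not vLLM options):

```shell
TP=8            # --tensor-parallel-size
PP=2            # --pipeline-parallel-size
NODES=2
NPUS_PER_NODE=8
WORLD_SIZE=$((TP * PP))
TOTAL_NPUS=$((NODES * NPUS_PER_NODE))
if [ "$WORLD_SIZE" -eq "$TOTAL_NPUS" ]; then
  echo "parallel config OK: world size $WORLD_SIZE"
else
  echo "mismatch: TP*PP=$WORLD_SIZE but the cluster has $TOTAL_NPUS NPUs"
fi
```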

Alternatively, if you want to use only tensor parallelism, set the tensor parallel size to the total number of NPUs in the cluster. For example, with 16 NPUs across 2 nodes, set the tensor parallel size to 16:

```shell
vllm serve Qwen/Qwen3-235B-A22B \
--distributed-executor-backend ray \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--max-model-len 8192 \
--max-num-seqs 25 \
--served-model-name qwen \
--trust-remote-code \
--gpu-memory-utilization 0.9
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"prompt": "tell me how to sleep well",
"max_completion_tokens": 100,
"temperature": 0
}'
```

# Suffix Speculative Decoding

## **Introduction**

Suffix Decoding is an optimization technique for speculative decoding based on pattern matching. It simultaneously retrieves repetitive sequences from both the prompt and the generated content, using frequency statistics to predict the most likely token continuations. Unlike traditional speculative decoding methods, Suffix Decoding runs entirely on the CPU, eliminating the need for additional GPU resources or draft models, which results in superior acceleration for repetitive tasks such as AI agents and code generation.
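As a toy sketch of the idea (not the Arctic Inference implementation used later in this tutorial), the following counts which token historically followed each short suffix, then drafts the most frequent continuation of the longest matching suffix:

```python
from collections import Counter, defaultdict

def propose_draft(tokens, max_suffix=4, num_draft=3):
    """Toy suffix-match drafting: find the longest recent suffix that has
    occurred before, then greedily extend it with the historically most
    frequent next token."""
    # Count which token followed each suffix pattern in the history.
    followers = defaultdict(Counter)
    for n in range(1, max_suffix + 1):
        for i in range(len(tokens) - n):
            followers[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    draft = []
    context = list(tokens)
    for _ in range(num_draft):
        best = None
        # Prefer the longest suffix of the current context seen before.
        for n in range(min(max_suffix, len(context)), 0, -1):
            suffix = tuple(context[-n:])
            if suffix in followers:
                best = followers[suffix].most_common(1)[0][0]
                break
        if best is None:
            break
        draft.append(best)
        context.append(best)
    return draft

# Repetitive history: the pattern "a b c" repeats, so after "... a b" the
# drafter proposes "c" and keeps following the learned pattern.
history = ["a", "b", "c", "a", "b", "c", "a", "b"]
print(propose_draft(history))  # → ['c', 'a', 'b']
```

The drafted tokens are then verified in a single model forward pass, which is where the speed-up comes from on repetitive workloads.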

This document provides step-by-step guidance on how to deploy and benchmark the Suffix Decoding speculative inference technology supported by `vllm-ascend` on Atlas A2 hardware. The setup utilizes a single Atlas 800T A2 node with a 4-card deployment of the Qwen3-32B model instance. Benchmarking is conducted using authentic open-source datasets covering the following categories:

| **Dataset Category** | **Dataset Name** |
| ------------------------------ | ---------------- |
| Code Generation | HumanEval |
| Common Sense Reasoning | ARC |
| Mathematical Reasoning | gsm8k |
| Natural Language Understanding | SuperGLUE_BoolQ |
| Comprehensive Examination | agieval |
| Multi-turn Dialogue | sharegpt |

The benchmarking tool used in this tutorial is AISBench, which supports performance testing for all the datasets listed above. The final section of this tutorial presents a performance comparison between enabling and disabling Suffix Decoding under the condition of satisfying an SLO of TPOT < 50 ms across different datasets and concurrency levels. Validation demonstrates that the Qwen3-32B model achieves a throughput improvement of approximately 20% to 80% on various real-world datasets when Suffix Decoding is enabled.

## **Download vllm-ascend Image**

This tutorial uses the official image, version v0.13.0rc1. Use the following command to download it:

```bash
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
```

## **Run with Docker**

Container startup command:

```bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc1
export NAME=vllm-ascend

# Run the container using the defined variables
# This test uses four Atlas A2 NPU cards to create the container.
# Mount the hccn.conf file from the host node into the container.

docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

## **Install arctic-inference**
Before enabling Suffix Decoding speculative inference on Ascend, the Arctic Inference plugin must be installed. Arctic Inference is an open-source plugin launched by Snowflake specifically to optimize LLM inference speed. For detailed technical principles, please refer to the following article: [Fastest Speculative Decoding in vLLM with Arctic Inference and Arctic Training](https://www.snowflake.com/en/engineering-blog/fast-speculative-decoding-vllm-arctic/). Install it within the container using the following command:
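Conceptually, Suffix Decoding drafts tokens by pattern matching rather than with a draft model: it finds an earlier occurrence of the current suffix of the sequence (Arctic Inference maintains a suffix tree over the prompt and previously generated text for this) and proposes the tokens that followed that occurrence. Below is a toy sketch of the idea using a naive scan instead of a suffix tree; the function name and parameters are illustrative and are not part of Arctic Inference's API:

```python
def propose_draft(token_ids, max_spec_tokens=3, min_match=2):
    """Toy suffix matching: find an earlier occurrence of the current
    suffix of the sequence and propose the tokens that followed it."""
    n = len(token_ids)
    # Try the longest candidate suffix first, down to min_match tokens.
    for match_len in range(min(8, n - 1), min_match - 1, -1):
        suffix = token_ids[n - match_len:]
        # Scan earlier positions (most recent first) for the same pattern.
        for start in range(n - match_len - 1, -1, -1):
            if token_ids[start:start + match_len] == suffix:
                cont = token_ids[start + match_len:start + match_len + max_spec_tokens]
                if cont:
                    return cont
    return []

# The suffix [5, 6, 7] already occurred at the start, followed by [8, 1, 2],
# so those tokens are proposed as the draft.
draft = propose_draft([5, 6, 7, 8, 1, 2, 5, 6, 7])
```

Outputs with repeated structure (code, multi-turn dialogue, chain-of-thought) produce frequent, long matches, which is why such pattern-based drafting can reach high acceptance rates without any draft model.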

```bash
pip install arctic-inference
```

## **vLLM Instance Deployment**
Use the following command to start the vLLM service instance inside the container. Speculative inference is enabled via the `--speculative-config` parameter, where `method` is set to `suffix`. For this test, `num_speculative_tokens` is uniformly set to `3`.

```bash
# Set the NPU device numbers
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
# Enable the AI Vector core to directly schedule RoCE communication
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable MLP prefetch for better performance.
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
# Enable FlashComm_v1 optimization when tensor parallelism is enabled.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /data/Qwen3-32B \
--served-model-name qwen3 \
--trust-remote-code \
--distributed-executor-backend mp \
--tensor-parallel-size 4 \
--max-model-len 5500 \
--max-num-batched-tokens 40960 \
--speculative-config '{"method": "suffix", "num_speculative_tokens": 3}' \
--gpu-memory-utilization 0.9 \
--additional-config '{"pa_shape_list":[48,64,72,80]}' \
--port 8011
```

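The value passed to `--speculative-config` is a JSON string. When generating launch commands programmatically, it can help to build it with `json.dumps` rather than hand-quoting. This is a convenience sketch; the helper name is ours, not part of vLLM:

```python
import json

def suffix_spec_config(num_speculative_tokens: int = 3) -> str:
    """Build the JSON value for vLLM's --speculative-config flag."""
    return json.dumps({"method": "suffix",
                       "num_speculative_tokens": num_speculative_tokens})

# Produces the same string used in the serve command above:
# '{"method": "suffix", "num_speculative_tokens": 3}'
cfg = suffix_spec_config()
```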
## **AISbench Benchmark Testing**
Performance for all open-source datasets is tested using AISbench. For specific instructions, refer to [Using AISBench for performance evaluation](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html#execute-performance-evaluation).
**Model Configuration**:
```python
# "ignore_eos" must be set to False, and "max_out_len" should be large
# enough to let the model finish its output naturally.
from ais_bench.benchmark.models import VLLMCustomAPIChatStream

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="<path_to_your_model>/Qwen3-32B",
        model="qwen3",
        request_rate=0,
        retry=2,
        host_ip="<your_server_ip>",
        host_port=8011,
        max_out_len=4000,
        batch_size=16,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0,
            ignore_eos=False
        )
    )
]
```

**Performance Benchmarking Commands**:
```bash
# Example command to test gsm8k dataset performance using the first
# 100 prompts. Commands for other datasets are similar.
ais_bench --models vllm_api_stream_chat \
--datasets gsm8k_gen_0_shot_cot_str_perf \
--debug --summarizer default_perf --mode perf --num-prompts 100
```

## **Test Results**
Below are the detailed test results for the six open-source datasets in this evaluation. Compared to the baseline, the TPOT and throughput gains from enabling Suffix Decoding vary by dataset and concurrency level. A summary of the results:
| **Dataset Category** | **Typical Representative** | **Throughput Improvement (BS=1-10)** | **SLO TPOT** |
| -------------------- | -------------------------- | ------------------------------------ | ------------ |
| **High Gain** | AGIEval, GSM8K | **> 50%** | < 50ms |
| **Medium-Low Gain** | ARC, ShareGPT | **20% ~ 30%** | < 50ms |

Below are the raw detailed test results:
| Concurrency | Avg Input | Avg Output | Requests | Base TPOT(ms) | Base Throughput(TPS) | Suffix TPOT(ms) | Suffix Throughput(TPS) | Accept Rate | TPOT Gain | TPS Gain |
| ------------------- | --------- | ---------- | -------- | ------------- | -------------------- | --------------- | ---------------------- | ----------- | --------- | -------- |
| **Humaneval** | | | | | | | | | | |
| 1 | 150 | 2700 | 100 | 55.1 | 18.1 | 37.9 | 26.3 | 27.0% | 45.2% | 45.1% |
| 15 | 150 | 2700 | 100 | 61.6 | 233.8 | 45.8 | 318.2 | 27.0% | 34.6% | 36.1% |
| 26 | 150 | 2700 | 100 | 64.7 | 403.8 | 50.9 | 519.2 | 27.0% | 27.2% | 28.6% |
| **ARC** | | | | | | | | | | |
| 1 | 76 | 960 | 100 | 52.8 | 18.9 | 39.5 | 25.4 | 23.9% | 33.7% | 34.6% |
| 8 | 76 | 960 | 100 | 59.1 | 125.4 | 47.0 | 163.1 | 23.9% | 25.7% | 30.0% |
| 15 | 76 | 960 | 100 | 59.8 | 245.8 | 48.9 | 311.7 | 23.9% | 22.3% | 26.8% |
| **GSM8K** | | | | | | | | | | |
| 1 | 67 | 1570 | 100 | 55.5 | 18.0 | 35.7 | 28.5 | 31.1% | 55.6% | 58.4% |
| 17 | 67 | 1570 | 100 | 61.5 | 279.8 | 45.4 | 403.0 | 31.1% | 35.6% | 44.0% |
| 26 | 67 | 1570 | 100 | 63.9 | 396.4 | 50.0 | 527.6 | 31.1% | 27.8% | 33.1% |
| **ShareGPT** | | | | | | | | | | |
| 1 | 666 | 231 | 327 | 54.1 | 18.3 | 39.2 | 24.1 | 23.9% | 37.9% | 31.5% |
| 8 | 666 | 231 | 327 | 58.8 | 125.0 | 46.2 | 153.2 | 23.9% | 27.1% | 22.5% |
| 14 | 666 | 231 | 327 | 61.8 | 227.0 | 49.9 | 273.9 | 23.9% | 23.8% | 20.7% |
| **SuperGLUE_BoolQ** | | | | | | | | | | |
| 1 | 207 | 314 | 100 | 54.1 | 18.4 | 36.1 | 26.8 | 33.4% | 49.8% | 45.6% |
| 16 | 207 | 314 | 100 | 60.0 | 229.7 | 43.5 | 303.9 | 33.4% | 38.0% | 32.3% |
| 32 | 207 | 314 | 100 | 62.7 | 396.4 | 47.8 | 507.5 | 33.4% | 31.3% | 28.0% |
| **Agieval** | | | | | | | | | | |
| 1 | 735 | 1880 | 100 | 53.1 | 18.7 | 31.8 | 34.1 | 50.3% | 66.8% | 81.9% |
| 24 | 735 | 1880 | 100 | 64.0 | 381.2 | 43.3 | 629.0 | 50.3% | 47.8% | 65.0% |
| 34 | 735 | 1880 | 100 | 70.0 | 494.6 | 50.2 | 768.4 | 50.3% | 39.4% | 55.3% |
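The gain columns appear to be ratio-based: `base TPOT / suffix TPOT - 1` for latency (lower is better) and `suffix TPS / base TPS - 1` for throughput (higher is better); small discrepancies against the table come from rounding of the reported values. A quick check on the HumanEval row at concurrency 1, under that assumption:

```python
def ratio_gain(better: float, worse: float) -> float:
    """Relative improvement expressed as a ratio, e.g. 0.45 for ~45%."""
    return better / worse - 1

# HumanEval, concurrency 1 (values from the table above):
tpot_gain = ratio_gain(55.1, 37.9)   # base TPOT / suffix TPOT - 1, ~45%
tps_gain = ratio_gain(26.3, 18.1)    # suffix TPS / base TPS - 1, ~45%
```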