# Atlas 300I DUO

## Running vLLM on Atlas 300I DUO

### Notes

* The current release supports `FULL_DECODE_ONLY` graph mode on Atlas 300I DUO devices, but the following limitations apply due to hardware event-id resource constraints:

  * When multiple Tensor Parallel (TP) ranks are enabled, the number of capturable graphs is limited and depends on the model depth. For example, Qwen3-32B can capture and replay 2 graphs.
  * There is no such limitation when TP=1.
  * We have reached out to the relevant experts for a solution. A software-based fix is considered feasible, but full support will take additional time. Thank you for your understanding.

* Atlas 300I DUO does not support `triton` or `triton-ascend`.

* If installing from source, `vllm` and `vllm-ascend` will automatically pull in `triton` and `triton-ascend` dependencies, which may cause unexpected issues on Atlas 300I DUO. Please run:
```bash
pip uninstall -y triton triton-ascend
# If you still encounter errors mentioning triton, manually remove the remaining triton directory in site-packages,
# as uninstalling triton may leave residual files behind.
# For example: rm -rf /usr/local/python3.11.10/lib/python3.11/site-packages/triton
```
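After uninstalling, a quick check can confirm that neither package is still importable. This is a hedged sketch: the import names `triton` and `triton_ascend` are assumptions, not taken from the vllm-ascend docs.

```bash
# Hypothetical post-uninstall check: each line should report "absent" once the
# packages above have been fully removed (the import names are assumptions).
for pkg in triton triton_ascend; do
  status=$(python3 -c "import importlib.util; print('absent' if importlib.util.find_spec('$pkg') is None else 'still present')")
  echo "$pkg: $status"
done
```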

### Deployment

```{warning}
For Atlas 300I DUO (310P), do not rely on automatic `max-model-len` detection
(that is, do not omit the `--max-model-len` argument), or OOM may occur.

Reason (current 310P attention path):
- `AscendAttentionMetadataBuilder310` passes `model_config.max_model_len` to `AttentionMaskBuilder310`.
- `AttentionMaskBuilder310` builds a full float16 causal mask with shape `[max_model_len, max_model_len]`, and then converts it to FRACTAL_NZ format.
- In the 310P `attention_v1` prefill/chunked-prefill path (`_npu_flash_attention` / `_npu_paged_attention_splitfuse`), this explicit mask tensor is used directly, and there is currently no compressed-mask path.

If automatic parsing resolves to a large context length, allocating this mask
(`O(max_model_len^2)`) may exceed NPU memory and trigger OOM.
Be sure to set an explicit and conservative value, such as `--max-model-len 16384`.
```
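The quadratic mask cost is easy to estimate: a float16 `[n, n]` mask takes `2 * n^2` bytes. A small sketch (plain arithmetic, not part of the vllm-ascend code) shows why an auto-detected long context is dangerous:

```bash
# Approximate float16 causal-mask memory for two max-model-len values.
# 16384  -> 0.50 GiB (manageable)
# 131072 -> 32.00 GiB (an auto-detected long context would OOM on 310P)
for n in 16384 131072; do
  python3 -c "n=$n; print(f'max_model_len={n}: {n*n*2/2**30:.2f} GiB')"
done
```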

Run the Docker container:

```{code-block} bash
:substitutions:

# Use the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0rc1-310p
docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
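Once inside the container, you can sanity-check that every `--device` node above was passed through. This is a minimal sketch that only inspects `/dev`, so it is safe to run anywhere; "missing" lines indicate a device that was not mapped (or a host without that card).

```bash
# Report each expected NPU device node from the docker run flags above.
for i in $(seq 0 7); do
  [ -e "/dev/davinci$i" ] && echo "ok: /dev/davinci$i" || echo "missing: /dev/davinci$i"
done
for dev in /dev/davinci_manager /dev/devmm_svm /dev/hisi_hdc; do
  [ -e "$dev" ] && echo "ok: $dev" || echo "missing: $dev"
done
```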

Run the following steps to start the vLLM service on NPU for the Qwen3 Dense series:

* Prepare the environment

* Obtain model weights
  (`W8A8SC` weights will be uploaded to the Eco-Tech official ModelScope repository later.)

  * This guide requires `W8A8SC` quantized weights for the Qwen3 Dense `8B/14B/32B` models. You need to generate the SC-compressed weights yourself.
  * First, prepare the `W8A8S` weights:

    * Qwen3-8B-w8a8s-310: [https://modelers.cn/models/Eco-Tech/Qwen3-8B-w8a8s-310](https://modelers.cn/models/Eco-Tech/Qwen3-8B-w8a8s-310)
    * Qwen3-14B-w8a8s-310: [https://modelers.cn/models/Eco-Tech/Qwen3-14B-w8a8s-310](https://modelers.cn/models/Eco-Tech/Qwen3-14B-w8a8s-310)
    * Qwen3-32B-w8a8s-310: [https://modelers.cn/models/Eco-Tech/Qwen3-32B-w8a8s-310](https://modelers.cn/models/Eco-Tech/Qwen3-32B-w8a8s-310)

Note: if you want to validate directly with `w8a8s` weights instead of `w8a8sc` weights, the following example shows the serving command for `Qwen3-8B-w8a8s-310`. Performance is slightly lower than with compressed `w8a8sc` weights. Detailed `w8a8sc` testing is covered in the following sections.

```bash
vllm serve Eco-Tech/Qwen3-8B-w8a8s-310 --host 127.0.0.1 --port 8080 \
--tensor-parallel-size 1 --gpu_memory_utilization 0.90 \
--served_model_name qwen --dtype float16 \
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
--quantization ascend --max_model_len 16384
# `--load_format` is required only for the W8A8SC quantized weight format, so it is omitted here.
```
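With the server above running, a minimal request against the OpenAI-compatible endpoint verifies the deployment end to end. The model name `qwen` matches `--served_model_name`; the `curl` call only succeeds while `vllm serve` is listening on 127.0.0.1:8080.

```bash
# Build and locally validate the request body, then send it to the chat
# completions API of the server started above.
PAYLOAD='{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```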

* Compress the weights

  * Uninstall triton (unsupported on 310P):

```bash
pip uninstall -y triton
pip uninstall -y triton-ascend
```

  * Get the compression script:

    * [https://github.com/vllm-project/vllm-ascend/blob/main/examples/save_sharded_state_310.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/save_sharded_state_310.py)

  * Install the compression tool:

    * Repository: [https://gitcode.com/Ascend/msit.git](https://gitcode.com/Ascend/msit.git)
    * Installation guide: [https://gitcode.com/Ascend/msit/blob/master/msmodelslim/docs/安装指南.md#基于atlas-300i-duo-系列产品安装](https://gitcode.com/Ascend/msit/blob/master/msmodelslim/docs/安装指南.md#基于atlas-300i-duo-系列产品安装)

  * Compression command:

```bash
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
export LD_LIBRARY_PATH=/usr/local/python3.11.10/lib/:$LD_LIBRARY_PATH

python save_sharded_state_310.py \
--model /your-load-path/w8a8s-weight \
--tensor-parallel-size 1 \
--output /your-save-path/w8a8sc-weight \
--enable-compress \
--compress-process-num 4 \
--enforce-eager \
--dtype float16 \
--quantization ascend \
--max_model_len 10240
```

Argument notes:

* `--tensor-parallel-size`: `W8A8SC` quantized weights are tightly coupled to the TP size, so you must specify the TP size you plan to use at serving time when running compression.
* `--model`: path to the input `w8a8s` weights.
* `--output`: output path for the compressed `w8a8sc` weights.

* Additional notes

  * The Qwen3-8B model has fewer parameters, so some layers need fallback handling during quantization. It is recommended to download the `qwen3-8B-w8a8sc` weights directly from the Eco-Tech official ModelScope repository once available.

* Examples

  * Qwen3-8B-w8a8sc example

```bash
vllm serve /your-save-path/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1/ \
--host 127.0.0.1 \
--port 8080 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.90 \
--max_num_seqs 32 \
--served_model_name qwen \
--dtype float16 \
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
--quantization ascend \
--max_model_len 16384 \
--no-enable-prefix-caching \
--load_format="sharded_state"
```

  * Qwen3-14B-w8a8sc example

```bash
vllm serve /your-save-path/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1/ \
--host 127.0.0.1 \
--port 8080 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.90 \
--max_num_seqs 16 \
--served_model_name qwen \
--dtype float16 \
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
--quantization ascend \
--max_model_len 16384 \
--no-enable-prefix-caching \
--load_format="sharded_state"
```

  * Qwen3-32B-w8a8sc example

```bash
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

vllm serve /save-path/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4/ \
--host 127.0.0.1 \
--port 8080 \
--tensor-parallel-size 4 \
--gpu_memory_utilization 0.90 \
--max_num_seqs 32 \
--served_model_name qwen \
--dtype float16 \
--additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
--quantization ascend \
--max_model_len 20480 \
--no-enable-prefix-caching \
--load_format="sharded_state"
```

* Closing notes

For early access to Qwen3-MoE, Qwen3-VL, and preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.