
Atlas 300I DUO

Running vLLM on Atlas 300I DUO

Notes

  • The current release supports FULL_DECODE_ONLY graph mode on Atlas 300I DUO devices, but the following limitations apply due to hardware event-id resource constraints:

    • When multiple Tensor Parallel (TP) ranks are enabled, the number of capturable graphs is limited and depends on the model depth. For example, Qwen3-32B can capture and replay 2 graphs.
    • There is no such limitation when TP=1.
    • We have reached out to the relevant experts for a solution. A software-based fix is considered feasible, but full support will take additional time. Thank you for your understanding.
  • Atlas 300I DUO does not support triton or triton-ascend.

  • If installing from source, vllm and vllm-ascend will automatically pull in triton and triton-ascend dependencies, which may cause unexpected issues on Atlas 300I DUO. Please run:

pip uninstall -y triton triton-ascend
# If you still encounter errors mentioning triton, manually remove the remaining triton directory in site-packages,
# as uninstalling triton may leave residual files behind.
# For example: rm -rf /usr/local/python3.11.10/lib/python3.11/site-packages/triton
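
To confirm the uninstall actually removed triton (and to locate any residual directory left in site-packages), you can run a quick check such as the sketch below. This is a hypothetical helper, not part of vllm or vllm-ascend:

# check_no_triton.py - verify that no triton installation remains importable
import importlib.util

spec = importlib.util.find_spec("triton")
if spec is None:
    print("triton: not installed (OK)")
else:
    # pip uninstall can leave residual files behind; remove the directory
    # containing the path printed below manually, e.g. with rm -rf.
    print(f"triton: still importable from {spec.origin}")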

Deployment

For Atlas 300I DUO (310P), do not rely on automatic `max-model-len` detection
(that is, do not omit the `--max-model-len` argument), or OOM may occur.

Reason (current 310P attention path):
- `AscendAttentionMetadataBuilder310` passes `model_config.max_model_len`
  to `AttentionMaskBuilder310`.
- `AttentionMaskBuilder310` builds a full float16 causal mask with shape
  `[max_model_len, max_model_len]`,
  and then converts it to FRACTAL_NZ format.
- In the 310P `attention_v1` prefill/chunked-prefill path
  (`_npu_flash_attention` / `_npu_paged_attention_splitfuse`),
  this explicit mask tensor is used directly, and there is currently
  no compressed-mask path.

If automatic parsing resolves to a large context length, allocating this mask
(`O(max_model_len^2)`) may exceed NPU memory and trigger OOM.
Be sure to set an explicit and conservative value, such as `--max-model-len 16384`.
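
As a rough illustration of the memory cost, the sketch below estimates the size of that mask for a few context lengths (a back-of-the-envelope estimate only; it ignores the additional FRACTAL_NZ copy and alignment overhead):

# Estimate the size of the full float16 causal mask built on the 310P attention path.
def mask_gib(max_model_len: int, bytes_per_elem: int = 2) -> float:
    # [max_model_len, max_model_len] float16 mask
    return max_model_len * max_model_len * bytes_per_elem / 2**30

for n in (16384, 32768, 131072):
    print(f"max-model-len={n}: ~{mask_gib(n):.1f} GiB")
# 16384  -> ~0.5 GiB
# 32768  -> ~2.0 GiB
# 131072 -> ~32.0 GiB, far more than a 310P device can spare for a mask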

Run the Docker container:


# Use the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0rc1-310p
docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
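
Inside the container, you can optionally confirm that the NPUs are visible to PyTorch before starting the service. A minimal sketch, assuming the image ships `torch` and `torch_npu`:

# Quick in-container sanity check for NPU visibility.
import torch
import torch_npu  # noqa: F401  (registers the torch.npu backend)

print("NPU available:", torch.npu.is_available())
print("NPU device count:", torch.npu.device_count())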

Run the following steps to start the vLLM service on NPU for the Qwen3 Dense series:

  • Prepare the environment

    • Obtain model weights (W8A8SC weights will be uploaded to the Eco-Tech official ModelScope repository later).

      Note: to validate directly with W8A8S weights instead of W8A8SC weights, use the serving command below for Qwen3-8B-w8a8s-310 (a smoke-test request is shown after the command). Performance is slightly lower than with the compressed W8A8SC weights; detailed W8A8SC examples follow in the sections below.

      vllm serve Eco-Tech/Qwen3-8B-w8a8s-310 --host 127.0.0.1 --port 8080 \
          --tensor-parallel-size 1 --gpu_memory_utilization 0.90 \
          --served_model_name qwen --dtype float16 \
          --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
          --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
          --quantization ascend --max-model-len 16384
      # `--load_format` is required only for the W8A8SC quantized weight format, so it is omitted here.
      
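      Once the service is up, you can send a quick smoke-test request to the OpenAI-compatible endpoint. A minimal sketch using the `requests` package; the host, port, and served model name match the serving command above:

      # Smoke test against the OpenAI-compatible completions endpoint.
      import requests

      resp = requests.post(
          "http://127.0.0.1:8080/v1/completions",
          json={
              "model": "qwen",  # matches --served_model_name above
              "prompt": "The capital of France is",
              "max_tokens": 32,
              "temperature": 0,
          },
          timeout=60,
      )
      resp.raise_for_status()
      print(resp.json()["choices"][0]["text"])
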
    • Compress the weights

      • Uninstall triton (unsupported on 310P):

        pip uninstall triton
        pip uninstall triton-ascend
        
      • Get the compression script:

      • Install the compression tool

      • Compression command

        export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
        export LD_LIBRARY_PATH=/usr/local/python3.11.10/lib/:$LD_LIBRARY_PATH
        
        python save_sharded_state_310.py \
            --model /your-load-path/w8a8s-weight \
            --tensor-parallel-size 1 \
            --output /your-save-path/w8a8sc-weight \
            --enable-compress \
            --compress-process-num 4 \
            --enforce-eager \
            --dtype float16 \
            --quantization ascend \
            --max-model-len 10240
        

        Argument notes:

          • --tensor-parallel-size: W8A8SC quantized weights are tightly coupled to the TP size, so specify the TP size you plan to use at serving time when running compression.
          • --model: the path to the input W8A8S weights.
          • --output: the output path for the compressed W8A8SC weights.

      • Additional notes

        • The Qwen3-8B model has fewer parameters, so some layers need fallback handling during quantization. It is recommended to download the Qwen3-8B-w8a8sc weights directly from the Eco-Tech official ModelScope repository once available.
  • Examples

    • Qwen3-8B-w8a8sc example

      vllm serve /your-save-path/Qwen3-8B-w8a8sc-310-vllm/TP1/Qwen3-8B-w8a8sc-310-vllm-tp1/ \
          --host 127.0.0.1 \
          --port 8080 \
          --tensor-parallel-size 1 \
          --gpu_memory_utilization 0.90 \
          --max_num_seqs 32 \
          --served_model_name qwen \
          --dtype float16 \
          --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
          --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16,32]}' \
          --quantization ascend \
          --max-model-len 16384 \
          --no-enable-prefix-caching \
          --load_format="sharded_state"
      
    • Qwen3-14B-w8a8sc example

      vllm serve /your-save-path/Qwen3-14B-w8a8sc-310-vllm/TP1/Qwen3-14B-w8a8sc-310-vllm-tp1/ \
          --host 127.0.0.1 \
          --port 8080 \
          --tensor-parallel-size 1 \
          --gpu_memory_utilization 0.90 \
          --max_num_seqs 16 \
          --served_model_name qwen \
          --dtype float16 \
          --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
          --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1,2,4,8,16]}' \
          --quantization ascend \
          --max-model-len 16384 \
          --no-enable-prefix-caching \
          --load_format="sharded_state"
      
    • Qwen3-32B-w8a8sc example

      export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
      
      vllm serve /save-path/Qwen3-32B-w8a8sc-310-vllm/TP4/Qwen3-32B-w8a8sc-310-vllm-tp4/ \
          --host 127.0.0.1 \
          --port 8080 \
          --tensor-parallel-size 4 \
          --gpu_memory_utilization 0.90 \
          --max_num_seqs 32 \
          --served_model_name qwen \
          --dtype float16 \
          --additional-config '{"ascend_compilation_config": {"fuse_norm_quant": false}}' \
          --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [16,32]}' \
          --quantization ascend \
          --max-model-len 20480 \
          --no-enable-prefix-caching \
          --load_format="sharded_state"
      
  • Closing notes

    For early access to Qwen3-MoE, Qwen3-VL, and preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.