* The current release supports `FULL_DECODE_ONLY` graph mode on Atlas 300I DUO devices, but the following limitations apply due to hardware event-id resource constraints:
  * When multiple Tensor Parallel (TP) ranks are enabled, the number of capturable graphs is limited and depends on the model depth. For example, Qwen3-32B can capture and replay only 2 graphs.
  * This limitation does not apply when TP=1.
  * We have reached out to the relevant experts for a solution. A software-based fix is considered feasible, but full support will take additional time. Thank you for your understanding.
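For reference, a minimal sketch of enabling this mode at serving time. It assumes the standard vLLM `--compilation-config` flag and its `cudagraph_mode` field; verify both against the vLLM/vllm-ascend version you have installed. The model path and TP size are placeholders:

```bash
# Hedged sketch: select FULL_DECODE_ONLY graph mode when serving on Atlas 300I DUO.
# The --compilation-config flag and "cudagraph_mode" field are assumptions about
# the installed vLLM / vllm-ascend release; the model path and TP size are placeholders.
vllm serve /path/to/Qwen3-32B \
    --tensor-parallel-size 2 \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
```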
* Atlas 300I DUO does not support `triton` or `triton-ascend`.
* If installing from source, `vllm` and `vllm-ascend` will automatically pull in `triton` and `triton-ascend` dependencies, which may cause unexpected issues on Atlas 300I DUO. Please run:
```bash
pip uninstall -y triton triton-ascend
# If you still encounter errors mentioning triton, manually remove the remaining
# triton directory in site-packages, as uninstalling triton may leave residual files behind.
# For example: rm -rf /usr/local/python3.11.10/lib/python3.11/site-packages/triton
```
Note: If you want to validate directly with `w8a8s` weights instead of `w8a8sc` weights, the example below shows the serving command for `Qwen3-8B-w8a8s-310`. Performance is slightly lower than with the compressed `w8a8sc` weights. Detailed `w8a8sc` testing is covered in the following sections.
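A minimal sketch of that serving command, assuming a local weight path and that `--quantization ascend` is how your vllm-ascend release selects Ascend-quantized weights (the path and port are placeholders, not tested values):

```bash
# Hedged sketch: serve the uncompressed w8a8s weights directly.
# /path/to/Qwen3-8B-w8a8s-310 and the port are placeholders; verify the
# --quantization ascend flag against your installed vllm-ascend documentation.
vllm serve /path/to/Qwen3-8B-w8a8s-310 \
    --quantization ascend \
    --port 8000
```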
Argument notes:
* `--tensor-parallel-size`: `W8A8SC` quantized weights are tightly coupled to the TP size, so when running compression you must specify the TP size you plan to use at serving time.
* `--model`: the path to the input `w8a8s` weights.
* `--output`: the output path for the compressed `w8a8sc` weights.
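Putting these arguments together, a hedged sketch of the compression step. The entry-point name `compress_weights.py` is a placeholder for whatever script your quantization toolkit provides; only the three flags above come from this guide:

```bash
# Hedged sketch: compress w8a8s weights into w8a8sc weights.
# "compress_weights.py" is a hypothetical entry point; the TP size must match
# the --tensor-parallel-size you will use when serving.
python compress_weights.py \
    --model /path/to/Qwen3-8B-w8a8s-310 \
    --output /path/to/Qwen3-8B-w8a8sc-310 \
    --tensor-parallel-size 2
```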
* Additional notes
  * The Qwen3-8B model has fewer parameters, so some layers need fallback handling during quantization. It is recommended to download the `Qwen3-8B-w8a8sc` weights directly from the Eco-Tech official ModelScope repository once available.
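Once the repository is published, a hedged sketch of fetching those weights with the ModelScope CLI (`<org>/Qwen3-8B-w8a8sc` is a placeholder repository id, not the real one):

```bash
# Hedged sketch: download the pre-quantized weights once published.
# <org>/Qwen3-8B-w8a8sc is a placeholder; substitute the actual Eco-Tech
# ModelScope repository id when it becomes available.
modelscope download --model <org>/Qwen3-8B-w8a8sc --local_dir ./Qwen3-8B-w8a8sc
```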
For early access to Qwen3-MoE and Qwen3-VL, and for preview support for Qwen3.5 and Qwen3.6 with performance acceleration, follow #7394 for updated deployment guidance.