From a3f65b938f7df7d28c395463134d5b05871486a7 Mon Sep 17 00:00:00 2001
From: ZYang6263 <50876451+ZYang6263@users.noreply.github.com>
Date: Wed, 24 Dec 2025 14:40:20 +0800
Subject: [PATCH] [Doc] Add pa_shape_list description to qwen dense tutorial (#5225)

### What this PR does / why we need it?

Add pa_shape_list description to qwen dense tutorial.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9

Signed-off-by: ZYang6263
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
---
 docs/source/tutorials/Qwen3-Dense.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/source/tutorials/Qwen3-Dense.md b/docs/source/tutorials/Qwen3-Dense.md
index b95e9e5c..b49053d3 100644
--- a/docs/source/tutorials/Qwen3-Dense.md
+++ b/docs/source/tutorials/Qwen3-Dense.md
@@ -181,6 +181,7 @@ vllm serve /model/Qwen3-32B-W8A8 \
   --max-model-len 5500 \
   --max-num-batched-tokens 40960 \
   --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+  --additional-config '{"pa_shape_list":[48,64,72,80]}' \
   --port 8113 \
   --block-size 128 \
   --gpu-memory-utilization 0.9
@@ -191,6 +192,8 @@

 - If the model is not a quantized model, remove the `--quantization ascend` parameter.

+- **[Optional]** `--additional-config '{"pa_shape_list":[48,64,72,80]}'`: `pa_shape_list` specifies the batch sizes at which to switch to the PA operator. This is a temporary tuning knob. Currently, the attention operator dispatch defaults to the FIA operator, which may perform suboptimally at some batch-size (concurrency) settings. When the runtime batch size matches one of the listed values, vLLM-Ascend replaces FIA with the PA operator to avoid the performance degradation. In the future, FIA will be optimized for these scenarios and this parameter will be removed.
+
 - For best performance, the `cudagraph_capture_sizes` parameter can also be enabled; see [key-optimization-points](./Qwen3-Dense.md#key-optimization-points) and [optimization-highlights](./Qwen3-Dense.md#optimization-highlights). Here is an example for a batch size of 72: `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,8,24,48,60,64,72,76]}'`.

 Once your server is started, you can query the model with input prompts
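For readers applying this patch, the two knobs it documents can be combined in a single launch command. This is a sketch only, using the tutorial's example values: the model path `/model/Qwen3-32B-W8A8`, port `8113`, and the size lists are illustrative, not requirements, and `pa_shape_list` is an Ascend-specific, temporary option as noted above.

```shell
# Sketch: serve the quantized Qwen3-32B model with both tuning knobs from this
# tutorial enabled. cudagraph_capture_sizes pre-captures graphs for the listed
# batch sizes; pa_shape_list switches attention from FIA to PA at those sizes.
vllm serve /model/Qwen3-32B-W8A8 \
  --max-model-len 5500 \
  --max-num-batched-tokens 40960 \
  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,8,24,48,60,64,72,76]}' \
  --additional-config '{"pa_shape_list":[48,64,72,80]}' \
  --port 8113 \
  --block-size 128 \
  --gpu-memory-utilization 0.9
```

Once the server is up, it can be queried through vLLM's OpenAI-compatible API on the chosen port, for example `curl http://localhost:8113/v1/completions` with a JSON body containing the model name and prompt.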