From 1fc7bc056d684f82a452ec2421ce6096b90804c1 Mon Sep 17 00:00:00 2001
From: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
Date: Thu, 23 Apr 2026 19:09:55 +0800
Subject: [PATCH] [0.18.0][Doc] Add NPU soft partitioning + cudagraph.piecewise
 limitation (#8595)

### What this PR does / why we need it?

Document a limitation in the graph mode user guide: NPU soft partitioning
cannot be combined with `CUDAGraphMode.PIECEWISE`.
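
The mismatch behind this limitation can be illustrated with a small sketch. It is not the logic of `update_aclgraph_sizes`; the helper names, the per-graph stream cost, and the container quota below are all hypothetical values chosen for illustration. Only the 2048-stream device total comes from the doc text.

```python
# Stream total for a whole device, as stated in the doc text.
FULL_DEVICE_STREAMS = 2048


def max_graphs_for_budget(stream_budget: int, streams_per_graph: int) -> int:
    """Hypothetical helper: number of graphs that fit in a stream budget."""
    if streams_per_graph <= 0:
        raise ValueError("streams_per_graph must be positive")
    return stream_budget // streams_per_graph


def overcommitted(container_quota: int, streams_per_graph: int) -> bool:
    """Return True when a plan sized for the full device exceeds the quota.

    Assumption: the capture plan is always sized against FULL_DEVICE_STREAMS,
    never against the container's own quota.
    """
    planned_graphs = max_graphs_for_budget(FULL_DEVICE_STREAMS, streams_per_graph)
    return planned_graphs * streams_per_graph > container_quota


if __name__ == "__main__":
    # Illustrative numbers only: a container limited to 512 streams,
    # with each captured graph assumed to need 32 streams.
    print(overcommitted(container_quota=512, streams_per_graph=32))
    # A container that owns the whole device is not overcommitted.
    print(overcommitted(container_quota=2048, streams_per_graph=32))
```

Under these assumed numbers, a plan sized for the full device asks for more streams than a 512-stream container owns, which is the situation the new doc entry warns about.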

### Does this PR introduce _any_ user-facing change?

No. This change only touches documentation.

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>
---
 docs/source/user_guide/feature_guide/graph_mode.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/source/user_guide/feature_guide/graph_mode.md b/docs/source/user_guide/feature_guide/graph_mode.md
index 26f23186..3f7a5a74 100644
--- a/docs/source/user_guide/feature_guide/graph_mode.md
+++ b/docs/source/user_guide/feature_guide/graph_mode.md
@@ -86,3 +86,7 @@ Online example:
 ```shell
 vllm serve someother_model_weight --enforce-eager
 ```
+
+## Common Limitations and Caveats
+
+- NPU soft partitioning combined with `CUDAGraphMode.PIECEWISE` is not supported. With soft-partitioned virtual NPU instances, the device's 2048 streams are shared and divided among containers (see the [virtual instance with vCANN RT description](https://gitcode.com/Ascend/mind-cluster/blob/branch_v26.0.0/docs/zh/scheduling/usage/virtual_instance/virtual_instance_with_vcann_rt/00_description.md)). vLLM Ascend, however, still derives its ACL graph capture limits from a fixed full-device stream budget in `update_aclgraph_sizes` (`vllm_ascend/utils.py`) and does not take the per-container stream quota into account, so the computed limits can exceed what a container actually has available and the combination is incompatible.