[bugfix] fix torchair runtime error caused by configuration mismtaches and file missing (#2532)

### What this PR does / why we need it?
This PR ports #2312 #2506 #2531 to main branch.

Original implementation of torchair caching forces users to make
everything prepared, fix all the configuration and enable
`use_cached_npu_graph`, and it might cause some problems confusing to
understand and tackle for users. It is better to compile the graph twice
instead of reusing the old kvcaches and cached torchair graph. And the
extra duration time is acceptable. Additionally, this pr fixes a
recompilation problem of torchair graph mode caused by
`running_in_graph` variable in `AscendMLATorchairImpl`.

### Does this PR introduce _any_ user-facing change?
If users want to enabling torchair.cache_compile with high compilation
speed, it is recommended to enable both `use_cached_kv_cache_bytes` and
`use_cached_graph` in `torchair_graph_config`. Without
`use_cached_kv_cache_bytes`, we'll compile torchair computation graph
twice to avoid runtime error caused by configuration mismtaches (the
second compilation will be much faster). Additionally, we've made a
change to how the TORCHAIR_CACHE_HOME enviroment variable is utilized to
enhance safety and prevent accidental file deletion by adding a suffix
directory.

### How was this patch tested?
CI and e2e vllm serving pass.


- vLLM version: v0.10.1.1
- vLLM main:
70549c1245

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
This commit is contained in:
linfeng-yuan
2025-09-03 17:56:12 +08:00
committed by GitHub
parent 5889fa1b1c
commit 90a75a90a9
9 changed files with 97 additions and 13 deletions

View File

@@ -35,8 +35,9 @@ import vllm_ascend.envs as envs_ascend
from vllm_ascend.ascend_config import get_ascend_config
from vllm_ascend.platform import NPUPlatform
from vllm_ascend.torchair.utils import (
TorchairCommonAttentionMetadata, check_torchair_cache_exist,
converting_weight_acl_format, register_torchair_model, torchair_ops_patch,
TORCHAIR_CACHE_DIR, TorchairCommonAttentionMetadata,
check_torchair_cache_exist, converting_weight_acl_format,
register_torchair_model, torchair_ops_patch,
torchair_quant_method_register, write_kv_cache_bytes_to_file)
from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_ND, ACL_FORMAT_FRACTAL_NZ,
is_310p)
@@ -52,6 +53,7 @@ class NPUTorchairModelRunner(NPUModelRunner):
self.torchair_compiled_model = None # type: ignore
self.torchair_compiled_models = {} # type: ignore
self.use_cached_npu_graph = ascend_config.torchair_graph_config.use_cached_graph
self.use_cached_kv_cache_bytes = ascend_config.torchair_graph_config.use_cached_kv_cache_bytes
self.torchair_graph_batch_sizes = ascend_config.torchair_graph_config.graph_batch_sizes
if ascend_config.torchair_graph_config.graph_batch_sizes_init:
self.init_torchair_graph_batch_sizes()
@@ -194,14 +196,20 @@ class NPUTorchairModelRunner(NPUModelRunner):
graph_num = len(torchair_graph_batch_sizes)
if self.use_cached_npu_graph and not check_torchair_cache_exist():
# If caching is enabled but does not exist, we will compile the model twice. The first
# time is used to generate the cache, and the second time is used to load the cache to
# skip the overhead caused by Dynamo guard mechanism.
# If caching is enabled but does not exist (either
# use_cached_kv_cache_bytes is disabled or kv_cache_bytes are
# different), we will compile the model twice. The first time is
# used to generate the cache, and the second time is used to load the
# cache to skip the overhead caused by Dynamo guard mechanism.
logger.info(
"Use cached npu graph but cache doesn't exist! Now we compile graph to genetate torchair cache, this usually takes %.1f~%.1f mins.",
"Cache compilation for torchair graph is enabled. Now we compile graph to genetate"
" torchair cache, this usually takes %.1f~%.1f mins.",
0.5 * graph_num, 1.5 * graph_num)
self._compile_torchair_graph(torchair_graph_batch_sizes)
NPUPlatform.synchronize()
# Note: We reset dynamo and reload the compiled torchair cached computation graph below
# that was compiled above. This operation reduces graph launch time by 2-4ms and avoids
# runtime errors caused by configuration mismatches in graph mode.
torch._dynamo.reset()
self.torchair_compiled_models.clear()
if self.use_cached_npu_graph:
@@ -215,7 +223,7 @@ class NPUTorchairModelRunner(NPUModelRunner):
0.5 * graph_num, 1.5 * graph_num)
self._compile_torchair_graph(torchair_graph_batch_sizes)
if self.new_kv_cache_bytes > 0:
if self.use_cached_kv_cache_bytes and self.new_kv_cache_bytes > 0:
write_kv_cache_bytes_to_file(torch.distributed.get_rank(),
self.new_kv_cache_bytes)
@@ -362,6 +370,7 @@ class NPUTorchairModelRunner(NPUModelRunner):
self.model.__dict__[forward_proxy_name],
dynamic=True,
fullgraph=envs_vllm.VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE,
cache_dir=TORCHAIR_CACHE_DIR,
config=config,
ge_cache=False)
return self.torchair_compiled_models[batch_size]