In context parallel scenario (i.e. prefill_context_parallel_size * decode_context_parallel_size > 1), "cudagraph_mode" is not sufficiently supported to be set to "FULL" yet.
This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available on V1 Engine. And only Qwen, DeepSeek series models are well tested from 0.9.0rc1. We will make it stable and generalized in the next release.
From v0.9.1rc1 with V1 Engine, vLLM Ascend will run models in graph mode by default to keep the same behavior with vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to the eager mode temporarily by setting `enforce_eager=True` when initializing the model.
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.1rc1, Qwen and DeepSeek series models are well tested.
- **XliteGraph**: This is the OpenEuler Xlite graph mode. In v0.11.0, only Llama, Qwen dense series models, Qwen MoE series models, and Qwen3-VL are supported.
If you want to run Llama, Qwen dense series models, Qwen MoE series models, or Qwen3-VL with Xlite graph mode, please install xlite, and set xlite_graph_config.