### What this PR does / why we need it?
This PR aims to add note in doc that FULL mode is not supported in PCP
scenario.
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This patch adds support for the Qwen3-VL model in Xlite. For more
details about Xlite, please refer to the following
link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:
### Does this PR introduce _any_ user-facing change?
XLite graph mode supports the Qwen3-VL model.
### How was this patch tested?
vLLM version: v0.12.0
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: lvjunqi <lvjunqi1@huawei.com>
Co-authored-by: lvjunqi <lvjunqi1@huawei.com>
aclgraph is stable and fast now. Let's drop torchair graph mode now.
TODO: some logic to adapt torchair should be cleaned up as well. We'll
do it in the following PR.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
This patch adds support for the xlite graph wrapper to vllm_ascend.
Xlite provides operator implementations of the transformer network on
Ascend hardware. For details about xlite, please refer to the following
link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:
## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison
- aclgraph: main(c4a71fc6)
- xlite-full: main(c4a71fc6) + xlite-full
- xlite-decode-only: main(c4a71fc6) + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph
### Does this PR introduce _any_ user-facing change?
Enable the xlite graph mode by setting xlite_graph_config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' #
Enabled for decode only
--additional-config='{"xlite_graph_config": {"enabled": true,
"full_mode": true}}' # Enabled for prefill and decode
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: lulina <lina.lulina@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Ascend scheduler was added for non chunk prefill case before, since that
the npu ops didn't work well with chunked prefill.
Now the ops with chunked prefill work better, it's time to remove the
ascend scheduler to use vLLM default scheduler.
- vLLM version: v0.11.2
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Correct the mistake in information documents
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.11.0
- vLLM main:
2918c1b49c
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Corrected the errors in the information
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
1. Add the tutorials for qwen3-embedding-8b
2. Remove VLLM_USE_V1=1 in docs, it's useless any more from 0.9.2
- vLLM version: v0.9.2
- vLLM main:
5923ab9524
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Add user doc index to make the user guide more clear
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>