### What this PR does / why we need it?
support torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
5438967fbc
Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
### What this PR does / why we need it?
This PR introduces LMhead tensor model parallel to achieve decreasing of
memory consumption, and TPOT performance improvement. It support both
eager mode and graph mode.
In deepseek r1 w8a8 PD disagregated Decode instance, using pure DP, with
lmhead_tensor_parallel_size = 8, we have 1 ms TPOT optimization, saved
1.48 GB NPU memory per RANK.
performance data:
<img width="1444" height="438" alt="image"
src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0"
/>
### Does this PR introduce _any_ user-facing change?
This PR introduces one new config in `additional_config`.
| Name | Effect | Required | Type | Constraints |
| :---------------------------- |
:--------------------------------------- | :------- | :--- |
:----------------- |
| lmhead_tensor_parallel_size | Split the lm_head matrix along the
column dimension (vocab_size) into lmhead_tensor_parallel_size pieces |
No | int | default value is None, once this value is set, the feature
will be enabled, vocab_size must be divisible by this value. |
example
`--additional_config={"lmhead_tensor_parallel_size": 8}`
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
de533ab2a1
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zhangzihang <zzh_201018@outlook.com>
### What this PR does / why we need it?
Upgrade to multi-node tutorial model to deepseek-v3.1-w8a8
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
de02b07db4
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
The branch `br_release_MindStudio_8.1.RC2_TR5_20260624` is commercial
delivery version of modelslim in Q3, and has been verified available
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
7d67a9d9f9
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Update release version info.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
712d0f88d8
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
### What this PR does / why we need it?
update vllm version in ci
- vLLM version: v0.10.0
- vLLM main:
170e8ea9ea
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add release note for `v0.9.1rc3`.
- vLLM version: v0.10.0
- vLLM main:
53415653ff
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Add cp/sp feature branch
- vLLM version: v0.10.0
- vLLM main:
0c6e40bbaa
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
This patch add the feature branch policy.
After this patch: maintainers are allowed to create a feature branch.
Feature branches are used for collaboration and must include an RFC
link, merge plan and mentor info.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.0
- vLLM main:
7be5d113d8
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Update DOC. Guide users to run LoRA with ACLGraph.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
- vLLM version: v0.10.0
- vLLM main:
de7b67a023
---------
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
enable_shared_pert_dp is currently on by default. This optimization is
currently only valid for deepseek series models. The default opening
affects the accuracy of the qwen series models.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
use parameter --additional_config='{"enable_shared_expert_dp": true}'
- vLLM version: v0.10.0
- vLLM main:
d983769c41
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
To help more developers quickly get started with vLLM, we need to write
clear and easy-to-understand code documentation and technical
interpretations. This will effectively lower the learning curve, attract
more excellent contributors, and collectively build a better developer
community.
Add ModelRunner_prepare_inputs doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Pass CI
- vLLM version: v0.10.0
- vLLM main:
4be02a3776
---------
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
### What this PR does / why we need it?
Fixed the expression of msit for code clone
- vLLM version: v0.10.0
- vLLM main:
afa5b7ca0b
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add Docker export/import guide for air-gapped environments
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
NA
- vLLM version: v0.10.0
- vLLM main:
d16aa3dae4
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
### What this PR does / why we need it?
- update determinitic calculation
- update support device
### Does this PR introduce _any_ user-facing change?
- Users should update ray and protobuf when using ray as distributed
backend
- Users should change to use `export HCCL_DETERMINISTIC=true` when
enabling determinitic calculation
### How was this patch tested?
N/A
- vLLM version: v0.10.0
- vLLM main:
ea1292ad3e
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
1.Shared Expert Sharding Strategy Update: Switched from TP-aligned to
pure DP for shared experts, enabling more efficient execution.
2.O_Proj AllReduce → ReduceScatter: Reduced communication overhead by
using ReduceScatter, made possible by pure DP sharding.
3.AllGather Postponed: Delayed to after QKV down projection to reduce
synchronization impact during prefill.
### How was this patch tested?
Adding ut case in `tests/ut/attention/test_mla_v1.py`
#### How to run
use parameter `--additional_config='{"enable_shared_expert_dp": true}'`
##### a.How to run eager mode
eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true}'
##### b.How to run graph mode
eg:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}'
- vLLM version: v0.10.0
- vLLM main:
9edd1db02b
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
### What this PR does / why we need it?
Release note of v0.10.0rc1
- vLLM version: v0.10.0
- vLLM main:
8e8e0b6af1
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Update user guide for suported models
- vLLM version: v0.10.0
- vLLM main:
4be02a3776
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Add a new single npu quantization tutorial, and using the latest qwen3
model.
- vLLM version: v0.10.0
- vLLM main:
8e8e0b6af1
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
### What this PR does / why we need it?
Update user guide for using lm-eval
1. add using lm-eval on online server
2. add using offline datasets
- vLLM version: v0.10.0
- vLLM main:
9edd1db02b
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Update tutorials for single_npu_audio and single_npu_multimodal
- vLLM version: v0.10.0
- vLLM main:
6b47ef24de
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
In fact, the kimi-k2 model is similar to the deepseek model, and we only
need to make a few changes to support it. what does this pr do:
1. Add kimi-k2-w8a8 deployment doc
2. Update quantization doc
3. Upgrade torchair support list
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
9edd1db02b
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Add release note for v0.9.1rc2
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.10.0
- vLLM main:
c494f96fbc
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Fix cann related urls in installation doc.
### Does this PR introduce _any_ user-facing change?
The users install cann manually could use the correct url after this pr
### How was this patch tested?
N/A
- vLLM version: v0.10.0
- vLLM main:
5bbaf492a6
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
When using deepseek series models generated by the --dynamic parameter,
if torchair graph mode is enabled, we should modify the configuration
file in the CANN package to prevent incorrect inference results.
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
Upgrade CANN to 8.2.rc1
Backport: https://github.com/vllm-project/vllm-ascend/pull/1653
### Does this PR introduce _any_ user-facing change?
Yes, docker image will use 8.2.RC1
### How was this patch tested?
CI passed
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Fix num_hidden_layers when Qwen2-Audio 7B and #1760 :
```
INFO 07-15 04:38:53 [platform.py:174] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
Traceback (most recent call last):
File "/workspace/test1.py", line 58, in <module>
main(audio_count)
File "/workspace/test1.py", line 38, in main
llm = LLM(model="Qwen/Qwen2-Audio-7B-Instruct",
File "/vllm-workspace/vllm/vllm/entrypoints/llm.py", line 271, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 494, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1286, in create_engine_config
config = VllmConfig(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
File "/vllm-workspace/vllm/vllm/config.py", line 4624, in __post_init__
current_platform.check_and_update_config(self)
File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 180, in check_and_update_config
update_aclgraph_sizes(vllm_config)
File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 307, in update_aclgraph_sizes
num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers
File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 211, in __getattribute__
return super().__getattribute__(key)
AttributeError: 'Qwen2AudioConfig' object has no attribute 'num_hidden_layers'
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes: https://github.com/vllm-project/vllm-ascend/issues/1780https://github.com/vllm-project/vllm-ascend/issues/1760https://github.com/vllm-project/vllm-ascend/issues/1276https://github.com/vllm-project/vllm-ascend/issues/359
- vLLM version: v0.10.0
- vLLM main:
7728dd77bb
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
- Upgrade to v0.10.0
- Drop v0.9.2 version compatibility
- Add patch for
`vllm_ascend/patch/worker/patch_common/patch_sampler_gather_logprobs.py`
as workaround of
f3a683b7c9
for v0.10.0 and also add e2e test `test_models_prompt_logprobs`
- Pin transformers<4.54.0 as workaround of
https://github.com/vllm-project/vllm-ascend/issues/2034
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Test locally:
`VLLM_USE_MODELSCOPE=true pytest -sv
tests/e2e/singlecard/test_offline_inference.py::test_models_prompt_logprobs`
- CI passed
- vLLM version: v0.9.2
- vLLM main:
7728dd77bb
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Make clean code
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
29c6fbe58c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
The feature support matrix is out of date. This PR refresh the content.
- vLLM version: v0.9.2
- vLLM main:
107111a859
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR adds a complete Chinese translation for the documentation using
PO files and the gettext toolchain. The goal is to make the
documentation more accessible to Chinese-speaking users and help the
community grow.
### Does this PR introduce any user-facing change?
Yes. This PR introduces Chinese documentation, which users can access
alongside the original English documentation. No changes to the core
code or APIs.
### How was this patch tested?
The translated documentation was built locally using the standard
documentation build process (`make html` or `sphinx-build`). I checked
the generated HTML pages to ensure the Chinese content displays
correctly and matches the original structure. No code changes were made,
so no additional code tests are required.
vLLM version: v0.9.2
vLLM main: vllm-project/vllm@5780121
---
Please review the translation and let me know if any improvements are
needed. I am happy to update the translation based on feedback.
- vLLM version: v0.9.2
- vLLM main:
7ba34b1241
---------
Signed-off-by: aidoczh <aidoczh@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Remove ETP/EP maintained in branch main. We drop this as there is no
relevant scenarios to use ETP now, and we may subsequently advocate
implementing expert tensor parallelism in vLLM to support scenarios
where the expert is needed to be sliced
This is a part of #1422 backport.
Fixes https://github.com/vllm-project/vllm-ascend/issues/1396https://github.com/vllm-project/vllm-ascend/issues/1154
### Does this PR introduce _any_ user-facing change?
We'll not maintain etp/ep in vllm-ascend anymore, and use the tp/ep in
vllm instead.
### How was this patch tested?
CI passed with new added and existing test.
- vLLM version: v0.9.2
- vLLM main:
fe8a2c544a
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Add graph mode and improve on multi_npu_moge.md
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
CI passed with new existing test.
- vLLM version: v0.9.2
- vLLM main:
5a7fb3ab9e
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
Pin docker image to latest release
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
1e9438e0b0
Signed-off-by: wangli <wangli858794774@gmail.com>
There is still issue with pp in some case. such as aclgraph, ray. Remove
the related doc in release note
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
1. Add the tutorials for qwen3-embedding-8b
2. Remove VLLM_USE_V1=1 in docs, it's useless any more from 0.9.2
- vLLM version: v0.9.2
- vLLM main:
5923ab9524
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Add user doc index to make the user guide more clear
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>