### What this PR does / why we need it?
Remove restrictions on mooncake for IPv6
Dependencies: CANN 8.5, mooncake v0.3.8.post1
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
[Doc] Update the DeepSeek-V3.2 tutorial; add single-node and multi-node
deployment guides.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: menogrey <1299267905@qq.com>
### What this PR does / why we need it?
Add a description of the `cudagraph_capture_sizes` setting to guide users
away from the common mistakes frequently made when using EAGLE together
with full graph mode.
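As a minimal illustration of the documented setting (the capture sizes below are placeholders, not tuned recommendations), the flag is passed the same way as in the serve commands shown elsewhere in these notes:
```
# Pin the graph capture sizes explicitly so that common decode batch
# sizes are covered when EAGLE runs together with full graph mode.
vllm serve /path/to/model \
  --compilation-config '{"cudagraph_capture_sizes": [4, 8, 16, 32], "cudagraph_mode": "FULL_DECODE_ONLY"}'
```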
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need for testing
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
---------
Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
Signed-off-by: zhaomingyu13 <zhaomingyu13@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
This PR updates `--max-num-seqs` in the Qwen3-235B single-node deployment
tutorial to ensure graph mode is entered correctly.
- vLLM version: v0.14.0
- vLLM main:
d68209402d
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
update supported features
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Upgrade the CANN version used for wheel building to 8.5.0 and update the docs.
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Install clang in the Dockerfile for triton-ascend.
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Upgrade PTA to 2.9.0
- vLLM version: v0.13.0
- vLLM main:
d68209402d
---------
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
1) Enable MLAPO by default for DeepSeek MLA attention W8A8 models on the
PD disaggregation D (decode) instance, for example DeepSeekV3-W8A8 and
DeepSeek-R1-W8A8.
2) Enable MLAPO by default for DeepSeek SFA attention W8A8 models,
currently DeepSeek-V3.2-W8A8.
### Does this PR introduce _any_ user-facing change?
Users no longer need to manually set `VLLM_ASCEND_ENABLE_MLAPO=1` to
enable the MLAPO feature for DeepSeek W8A8 models.
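For context, a sketch of what was previously required to turn the feature on (the model path is a placeholder):
```
# Before this PR: manual opt-in via the environment variable.
export VLLM_ASCEND_ENABLE_MLAPO=1
vllm serve /path/to/DeepSeek-V3-W8A8
# After this PR: MLAPO is on by default for these models, so the export is unnecessary.
```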
The effect of enabling MLAPO for the SFA model deployed on a single A3 node.
Tested with: `tests/e2e/nightly/single_node/models/test_deepseek_v3_2_exp_w8a8.py`
Dataset: gsm8k-lite, without MTP, FULL GRAPH; throughput improves by about 19%:

Without MLAPO enabled by default:

| Metric | Value |
|----- | -----|
| TTFT | 14055.8836 ms |
| ITL | 66.8171 ms |
| Output Token Throughput | 104.9105 token/s |

With MLAPO enabled by default:

| Metric | Value |
|----- | -----|
| TTFT | 3753.1547 ms |
| ITL | 61.4236 ms |
| Output Token Throughput | 125.2075 token/s |
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Supplement the PD disaggregation parameters for DeepSeek V3.1.
The recommended parameter configuration for DeepSeek V3.1 in the EP32
scenario with PD disaggregation has been adjusted, and the core parameters
are described in detail.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### What this PR does / why we need it?
Update the triton-ascend version in 3.2.0.
- vLLM version: v0.13.0
- vLLM main:
d68209402d
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This is part of
https://github.com/vllm-project/vllm-ascend/issues/4715#issue-3694310762
1. Refactor the npugraph_ex config: the default configuration of the
static kernel is modified, and its new default value is false.
2. Support online inference with the static kernel.
3. Fix an issue where manually modifying FX graphs caused an abnormal
model return type, and remove the related redundant code.
### Does this PR introduce _any_ user-facing change?
Yes, the new npugraph_ex config is as follows:
```
additional_config={
"npugraph_ex_config": {
"enable": True,
"enable_static_kernel": False
}
}
```
### How was this patch tested?
```
vllm serve /data/DeepSeek-V3.1-Terminus-w4a8 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 4 \
--tensor-parallel-size 4 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--max-num-seqs 48 \
--max-model-len 40000 \
--async-scheduling \
--max-num-batched-tokens 9000 \
--trust-remote-code \
--no-enable-prefix-caching \
--speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp","disable_padded_drafter_batch": false}' \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_capture_sizes":[4,32,64,112,160,176,192], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config \
'{"enable_shared_expert_dp": true,"multistream_overlap_shared_expert": true,"npugraph_ex_config":{"enable":true}}'
```
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: chencangtao <chencangtao@huawei.com>
Signed-off-by: ChenCangtao <50493711+ChenCangtao@users.noreply.github.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
### What this PR does / why we need it?
Add docs for Qwen3-VL-Embedding & Qwen3-VL-Reranker.
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
---------
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
### What this PR does / why we need it?
Upgrade vllm commit to releases/v0.14.0
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
#### Documentation Improvements
New configuration: Added the `layer_sharding` parameter to the
DeepSeek-V3.2-W8A8 deployment tutorial. This guides users to include
`["q_b_proj", "o_proj"]` in their prefill node setup for better resource
utilization.
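A minimal sketch of the prefill-node flag described above, mirroring the `layer_sharding` additional-config documented later in these notes (all other serve arguments omitted):
```
# Prefill node: shard the q_b_proj and o_proj layers for better resource utilization.
vllm serve /path/to/DeepSeek-V3.2-W8A8 \
  --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}'
```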
#### CI and Testing Updates
Test config update: Updated the multi-node E2E test configuration file
tests/e2e/nightly/multi_node/config/DeepSeek-V3_2-W8A8-A3-dual-nodes.yaml,
including disabling `FLASHCOMM`, enabling `FULL_DECODE_ONLY`, and updating
the performance baseline.
### Does this PR introduce any user-facing change?
Yes. The documentation now recommends a more optimized startup command
for DeepSeek-V3.2-W8A8. Users following the updated tutorial will see
improved performance in multi-node PD disaggregation environments.
### How was this patch tested?
CI Validation: The updated E2E test configuration has been verified
through the nightly CI pipeline.
Environment:
- vLLM version: v0.13.0
- Base commit: 11b6af5280
- Hardware: Ascend A3/A2 multi-node cluster.
---------
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
### What this PR does / why we need it?
Correct the KV sequence length for GQA prefill, and clarify the description
of block table distribution in the developer guide.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
1. Rename dynamic_ep to default_eplb.
2. Rename dynamic_ep_v2 to swift_balancer.
3. Discard the function compose_expert_update_info_bipartite.
- vLLM version: v0.13.0
- vLLM main:
bde38c11df
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Migrate the torch profiler configuration from deprecated environment
variables (`VLLM_TORCH_PROFILER_DIR`, `VLLM_TORCH_PROFILER_WITH_STACK`,
`VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY`) to the explicit
`ProfilerConfig` object, aligning with vLLM's configuration best
practices.
The profiler environment variable approach is deprecated in vLLM and
will be removed in v0.14.0 or v1.0.0.
### Does this PR introduce _any_ user-facing change?
Yes. Developers who want to capture profiles should use `--profiler-config` instead of `VLLM_TORCH_PROFILER_DIR`.
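A hedged sketch of the new invocation; the JSON keys shown here are assumptions, so check the vLLM `ProfilerConfig` documentation for the exact field names:
```
# Replaces VLLM_TORCH_PROFILER_DIR / _WITH_STACK / _WITH_PROFILE_MEMORY.
# The key names below are illustrative, not verified against ProfilerConfig.
vllm serve /path/to/model \
  --profiler-config '{"torch_profiler_dir": "/tmp/vllm_profile", "with_stack": true}'
```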
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Added legend descriptions, and split redundant tables into core
supported model tables and extended compatible model tables.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
---------
Signed-off-by: herizhen <1270637059@qq.com>
### What this PR does / why we need it?
For CPU-only environments, we should set `SOC_VERSION` to mock different
NPU chips to exercise the different compilation paths.
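For example, on a CPU-only host (the chip value shown is illustrative; use whichever SOC the compilation path under test targets):
```
# Mock an NPU chip so the corresponding compilation path is exercised without hardware.
export SOC_VERSION=ASCEND910B1
```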
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Remove Chinese characters from the icons in the doc.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2c24bc6996
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
### What this PR does / why we need it?
Upgrade vllm commit to releases/v0.14.0
- Re-open cases in `tests/e2e/singlecard/pooling/test_scoring.py`, since
the earlier errors have been fixed by
https://github.com/vllm-project/vllm/pull/32243
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
11b6af5280
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Add tutorials for `Qwen3-VL-30B-A3B-Instruct`.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts, because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb, because we don't need this feature.
5. Move the EPLB config into a dict in additional config.
6. Depends on PR 5817.
### Does this PR introduce _any_ user-facing change?
Before this PR:
`--additional-config '{"dynamic_eplb": true, "num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150, "init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`
After this PR:
`--additional-config '{"eplb_config": {"dynamic_eplb": true, "expert_heat_collection_interval": 4000, "algorithm_execution_interval": 150, "num_redundant_experts": 16, "expert_map_path": "xxx.json"}}'`
### How was this patch tested?
#### Test: Qwen3-235B EPLB with num_redundant_experts=16
Without PR 5817:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |
With PR 5817:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604
This PR refactors vllm_ascend/distributed, moving all kv_transfer-related
code into a dedicated folder, as has already been done in vLLM.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: lty <linhebiwen@gmail.com>
### What this PR does / why we need it?
Upgrade the tutorial doc index to display by category.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.
1. Support Moe model W8A8 Int8 dynamic weight.
2. Specify W4A16 quantization configuration.
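A minimal sketch of serving such weights, assuming vLLM's standard `compressed-tensors` quantization method is what applies on Ascend after this change (the model path is a placeholder):
```
# Serve LLM Compressor output; the quantization scheme (W8A8 dynamic / W4A16)
# is read from the checkpoint's compressed-tensors config.
vllm serve /path/to/model-w8a8-int8-dynamic --quantization compressed-tensors
```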
Co-authored-by: menogrey 1299267905@qq.com
Co-authored-by: kunpengW-code 1289706727@qq.com
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This PR fixes linting issues in the root directory, benchmarks/, tools/,
and docs/ to align with the project's Ruff configuration.
This is part of a gradual effort to enable full linting coverage across
the repository. The corresponding paths have been removed from the
exclude list in pyproject.toml.
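To reproduce the check locally (paths taken from the description; CI may invoke Ruff with additional options):
```
# Lint the newly covered paths using the repository's Ruff configuration.
ruff check benchmarks/ tools/ docs/
```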
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
### What this PR does / why we need it?
This pull request updates documentation to align with vLLM's community
standards.
- Change `Maintainers` to `Committers` to follow vLLM naming:
https://docs.vllm.ai/en/latest/governance/committers/
- Change release branch policy from `vX.Y.Z-dev` to `releases/vX.Y.Z`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
doc ci passed
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Update the PD disaggregation multi-node README: update the environment
check command for A3 and add a TLS check.
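A hedged sketch of the kind of per-NIC check the README describes, assuming the standard Ascend `hccn_tool` utility (device count and flags may vary by hardware and driver version):
```
# Query each HCCN device's IP and TLS status on an A3 node.
for i in $(seq 0 15); do
  hccn_tool -i "$i" -ip -g    # environment / network check
  hccn_tool -i "$i" -tls -g   # TLS status check
done
```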
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: liziyu <liziyu16@huawei.com>
Add `releases` match case for CI jobs and update related doc for v0.13.0
branch
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add GLM4.5 and GLM4.6 docs.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
Signed-off-by: 1092626063 <1092626063@qq.com>
### What this PR does / why we need it?
Fix the title of the document for the layer_sharding feature
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
### What this PR does / why we need it?
1. add PaddleOCR-VL.md in the `docs/source/tutorials/`
2. add PaddleOCR-VL index in `docs/source/tutorials/index.md`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by CI
- vLLM version: v0.13.0
- vLLM main:
7157596103
Signed-off-by: zouyizhou <zouyizhou@huawei.com>
### What this PR does / why we need it?
MLAPO in DeepSeek brings a large performance improvement in decode; this
PR adds PCP & DCP support with MLAPO.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
### What this PR does / why we need it?
Add Qwen3-Omni-30B-A3B-Thinking Tutorials
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
- Delete the environment variable
`VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED`
- Introduce layer_sharding as a configurable feature in
additional_config
- Revise the term "shared weight" to "shard weight."
Configuration: The feature is opt-in via the `additional_config`
argument:
```
--additional-config '{
"layer_sharding": ["o_proj", "q_b_proj"]
}'
```
This is orthogonal to standard tensor parallelism and weight replication
strategies, and is treated as a separate, explicit feature. It can be used
in any scenario, combined with the FlashComm2
(https://github.com/vllm-project/vllm-ascend/pull/3232) feature
or the ShardedCP (#4702) feature, to achieve significant performance gains.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
1. Speed up the e2e light test.
2. Create `2-cards` and `4-cards` folders in multicard.
3. Move ops tests to nightly.
4. Run tests in alphabetical order.
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>