Commit Graph

156 Commits

Author SHA1 Message Date
zhangsicheng5
8ed87dfa84 [doc] Add context parallel user guide (#5358)
1. Add context parallel user guide
2. Add context parallel related message in supported features/models
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
2025-12-26 17:03:47 +08:00
wangxiyuan
29d2fe653d cleanup ascend config (#5296)
1. refresh additional config doc
2. move kv config logic to platform.
3. improve `dump_config` init logic and rename it to `dump_config_path`
this change is user impacted. dump_config is changed from dict to
string.
4. correct `enable_async_exponential` type
5. remove useless `chunked_prefill_for_mla`

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-26 14:07:37 +08:00
Tiger Xu / Zhonghu Xu
cb963c53a5 [Doc] Added deploying on k8s with kthena (#4674)
### What this PR does / why we need it?
[Kthena](https://github.com/volcano-sh/kthena) is a Kubernetes-native
LLM inference platform that transforms how organizations deploy and
manage Large Language Models in production. Built with declarative model
lifecycle management and intelligent request routing, it provides high
performance and enterprise-grade scalability for LLM inference
workloads.

The platform extends Kubernetes with purpose-built Custom Resource
Definitions (CRDs) for managing LLM workloads, supporting multiple
inference engines (vLLM, SGLang, Triton) and advanced serving patterns
like prefill-decode disaggregation.

This pr added a example on deloying llm on Ascend Kubernetes clusters.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
2025-12-23 17:46:04 +08:00
lvjunqi
55beac9c91 [Feat]Xlite Qwen3-vl Support (#5228)
### What this PR does / why we need it?
This patch adds support for the Qwen3-VL model in Xlite. For more
details about Xlite, please refer to the following
link:https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

### Does this PR introduce _any_ user-facing change?
XLite graph mode supports the Qwen3-VL model.

### How was this patch tested?
vLLM version: v0.12.0 

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: lvjunqi <lvjunqi1@huawei.com>
Co-authored-by: lvjunqi <lvjunqi1@huawei.com>
2025-12-22 16:30:52 +08:00
zhangyiming
dc047489c7 [Doc] Fix DeepSeek-V3.2 tutorial. (#5190)
### What this PR does / why we need it?
Fix DeepSeek-V3.2 tutorial.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: menogrey <1299267905@qq.com>
2025-12-22 11:30:17 +08:00
YuhanBai
5d02eed16f [Performance] Add async exponential while model executing (#4501)
### What this PR does / why we need it?
Add a control to enable the exponential distribution operator
overlapping with model executing (default is OFF due to this feature
might not perform well on MOE models, i.e. For Qwen3-30B).
Enable async exponential overlapping will provides performance
improvement.
Also, overlapping the exponential operator with module execution can
cover the performance drop introduced by AICPU-version's exponential
operator.

**UPDATE**: (12/12)
Now our overlap will use the same stream that introduced in this pr:
#4908 .
We move the `do_async_exponential` from `model_runner_v1.py` to
`sampler.py`.
Now we are using `additional_config` to enable async exponential:
Add `"enable_async_exponential": 1` in `addition_config`.
Now we **ONLY** support default exponential/AI-CPU exponential, the old
`"enable_async_exponential": 2` option has been aborted to keep
consistency.

### Does this PR introduce _any_ user-facing change?
**YES**, added a new `additional_config` : `"enable_async_exponential":
1`.
When `enable_async_exponential` is set to 1, we enable the async
exponential and overlap with model runner.
When `enable_async_exponential` is set to 0 (default is 0), we disable
the async exponential, but exponential will still running on a different
stream using stream introduced in #4908.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: YuhanBai <yuhan.bai0830@gmail.com>
Signed-off-by: YuhanBai yuhan.bai0830@gmail.com
2025-12-20 21:23:21 +08:00
zzhxxx
17f2eead99 [Doc]Add the user_guide doc file regarding fine-grained TP. (#5084)
### What this PR does / why we need it?
Add user guide for **Fine-Grained Tensor Parallelism** feature.  
Documents usage, supported components (`embedding`, `lm_head`, `o_proj`,
`mlp`/`dense_ffn`), model compatibility, and deployment guidelines.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-19 16:37:25 +08:00
luluxiu520
bc05a81bf2 Add Qwen3-VL-235B-A22B-Instruct tutorials (#5167)
### What this PR does / why we need it?

This PR provides an introduction to the Qwen3-VL-235B-A22B-Instruct
model, details on the features supported by the model in the current
version, the model deployment process, as well as methods for
performance testing and accuracy testing.

With this document, the deployment and testing of the
Qwen3-VL-235B-A22B-Instruct model can be implemented more easily.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: luluxiu520 <l2625793@outlook.com>
2025-12-19 14:56:17 +08:00
1092626063
f952de93df 【Doc】Deepseekv3.1/R1 doc enhancement (#4827)
### What this PR does / why we need it?

Deepseekv3.1、DeepSeekR1 doc enhancement

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-19 10:52:33 +08:00
weichen
ca6f631cba [2/N][Pangu][MoE] Remove Pangu Related Code (#5130)
### What this PR does / why we need it?
Remove Pangu Related Code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
e2e & ut

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weichen <calvin_zhu0210@outlook.com>
2025-12-19 09:00:07 +08:00
zxr2333
073a3a6e6c [Doc][P/D] Fix MooncakeConnector's name (#5172)
### What this PR does / why we need it?
vLLM community has integrated their MooncakeConnector. The original
scripts will now find this MooncakeConnector instead of the one from
vLLM-Ascend. All scripts that involve using the MooncakeConnector need
to be modified to another name.

### Does this PR introduce _any_ user-facing change?
Yes, users need to use a new name to load vLLM-Ascend MooncakeConnector.

### How was this patch tested?
By CI.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2025-12-18 22:29:19 +08:00
TingW09
879ec2d1c4 [Doc] add qwen3 reranker (#5086)
### What this PR does / why we need it?
add qwen3 reranker tutorials
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0

---------

Signed-off-by: TingW09 <944713709@qq.com>
2025-12-18 10:54:07 +08:00
liziyu
190ae55e9f Add a Mooncake installation tutorial for kv pool and update Mooncake installation tutorial (#5069)
### What this PR does / why we need it?
Add a Mooncake installation tutorial for kv pool and update Mooncake
installation tutorial

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-16 19:53:23 +08:00
wangxiyuan
d11b74a571 Add release note for v0.11.0 (#4918)
Add release note for v0.11.0. We'll release soon.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-16 17:31:45 +08:00
zhaomingyu13
039cc65e58 [Doc] Add user guide of speculative decoding (#5074)
### What this PR does / why we need it?
Add user guide of speculative decoding that includes n-grams, EAGLE,
MTP, and suffix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
2025-12-16 17:01:44 +08:00
UnifiedCacheManager
195eac665b [Core][Worker] Add UCMConnector for KV Cache Offloading (#4411)
### What this PR does / why we need it?

This PR introduces the initial integration of **UCM (Unified Cache
Management)** into the vllm-ascend distributed KV-cache system.

Specifically, it adds:
- A new `UCMConnector` implementation under the distributed KV-transfer
framework.
- Support for offloading KV-cache blocks to external UCM backends (DRAM
/ NFS / Localdisk), depending on UCM configuration).
- Integration with vLLM V1 KV connector interface, including metadata
handling and role registration.

**Why it is needed:**
- UCM provides a unified, high-performance storage layer for KV-cache
externalization.
- This enables vllm-ascend to support out-of-core KV-cache workloads,
improve memory efficiency, and leverage hardware-accelerated storage
paths (RDMA / NFS / hybrid modes).
- This connector is a required component to allow future work on
multi-node inference + UCM-based scaling.

---

### Does this PR introduce _any_ user-facing change?

Yes, but limited:

- A new `kv_connector=UCMConnector` option becomes available through the
configuration interface.
- When selected, vllm-ascend workers may initialize UCM and offload
KV-cache blocks externally.
- No default behaviors are changed. Users must explicitly enable this
connector.

This PR does **not** modify:
- existing APIs,
- default execution paths,
- model runner behavior,
- user workflow unless `UCMConnector` is configured.

---

### How was this patch tested?

---

### Prefix Caching Benchmark

We provide preliminary measurements for TTFT (ms) under VLLM benchmark.
Tests run on 2 * Ascend 910B3, vllm-ascend 0.11.0, Tensor Parallel size
2, with UCM (Localdisk) enabled.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
2025-12-16 10:53:30 +08:00
fluctlux
6de4bedd04 update release note for suffix decoding (#5009)
update release note for suffix decoding

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
2025-12-15 17:22:19 +08:00
Chao Lei
b75bfc58f6 [Doc ] Supplement kvpool user guide (#5013)
### What this PR does / why we need it?
Supplement detailed descriptions for `ASCEND_CONNECT_TIMEOUT` and
`ASCEND_TRANSFER_TIMEOUT` in kvpool.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-15 14:24:39 +08:00
wangxiyuan
42ceaf08a1 add release note for 0.12.0 (#4995)
Add release note for v0.12.0rc1
Update deepseek3.2 tutorial doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-13 22:09:59 +08:00
lilinsiman
31c94b7e7b [doc][main] Correct more doc mistakes (#4958)
### What this PR does / why we need it?
Correct more doc mistakes

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-13 18:36:58 +08:00
lilinsiman
fc818f1509 [doc][main] Correct mistakes in doc (#4945)
### What this PR does / why we need it?
Correct mistakes in doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-12 19:17:10 +08:00
Li Wang
4ae7588c52 [Doc] Upgrade outdated doc (#4957)
### What this PR does / why we need it?
Updated some issues that caused sleep mode document content to be
unavailable due to changes/outdated environment variables.

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-12 15:38:29 +08:00
1092626063
62a9fea7af 【doc】Add model feature matrix (#4950)
### What this PR does / why we need it?

doc tutorials add  model feature matrix:
DeepSeekR1
DeepSeekV3.1
Qwen3-Dense
Qwen3-Moe
Qwen3-Next
Qwen2.5
Qwen2.5-VL
Qwen3-VL

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 1092626063 <1092626063@qq.com>
2025-12-12 15:37:39 +08:00
lidenghui1110
d65fb194d9 [Feat] Add custom Embedding tensor model parallel (#2616)
Similar to #2309 , this PR introduces Embedding tensor model parallel to
achieve decreasing of memory consumption. It support both eager mode and
graph mode.

And this PR refactor module tensor parallel configurations supported in
#2309, #2167, #2120, merge all config into `finegrained_tp_config` in
`additional_config`, including:
`lmhead_tensor_parallel_size`
`oproj_tensor_parallel_size`
`embedding_tensor_parallel_size`
`mlp_tensor_parallel_size`

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-12 14:41:20 +08:00
wangxiyuan
e538fa6f9c [Doc] Update tutorial index (#4920)
Update tutorial index and remove useless doc

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 20:53:13 +08:00
Shanshan Shen
551069e53a [Doc] Update structured output doc with upstream link (#4015)
### What this PR does / why we need it?
Currently, the usage of structured output feature in vllm-ascend is
totally the same as that in vllm.

Thus, IMO, it's better to remove this doc directly to avoid some case
that there are some changes in the upstream doc and we don't update our
doc in time, which can be misleading to users.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-12-11 19:14:29 +08:00
SILONG ZENG
ff7d703192 [Doc]Add tutorial document for qwen-VL-Dense (#3516)
### What this PR does / why we need it?
This document employs the qwen3-vl-8b model and qwen2.5-vl-32b to
demonstrate the primary verification steps for the Qwen-VL series dense
models, including supported features, feature configuration, environment
preparation, NPU deployment, and accuracy and performance evaluation.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2025-12-11 08:55:23 +08:00
wangxiyuan
37db0844f5 Remove COMPILE_CUSTOM_KERNELS env (#4864)
With more and more custom ops merged, disable `COMPILE_CUSTOM_KERNELS `
for vllm ascend seems useless now. Let's enable csrc compile by default.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-10 23:48:03 +08:00
lianyibo
e32014ac1d [Model] Support pooling models (#3122)
### What this PR does / why we need it?

Support pooling models (like `bge-reranker-v2-m3`) in vllm-ascend, this
pr covered the three model types of embed (cls_token, mean_token,
lasttoken).

After this
[commit](17373dcd93),
vllm has provided support for adapting pooling models on the v1 engine.
This PR includes corresponding adaptations on the vllm-ascend side.

Fixes #1960

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lianyibo <lianyibo1@kunlunit.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-12-10 11:37:57 +08:00
wangxiyuan
835b4c8f1d Drop torchair (#4814)
aclgraph is stable and fast now. Let's drop torchair graph mode now.

TODO: some logic to adapt torchair should be cleaned up as well. We'll
do it in the following PR.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-10 09:20:40 +08:00
LuLina
2be0fe2691 [Feat] Add Euler xlite graph wrapper support (#4526)
### What this PR does / why we need it?
This patch adds support for the xlite graph wrapper to vllm_ascend.
Xlite provides operator implementations of the transformer network on
Ascend hardware. For details about xlite, please refer to the following
link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default
aclgraph mode is as follows:

## Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison
- aclgraph: main(c4a71fc6) 
- xlite-full: main(c4a71fc6) + xlite-full
- xlite-decode-only: main(c4a71fc6) + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph


### Does this PR introduce _any_ user-facing change?
Enable the xlite graph mode by setting xlite_graph_config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' #
Enabled for decode only
--additional-config='{"xlite_graph_config": {"enabled": true,
"full_mode": true}}' # Enabled for prefill and decode

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lulina <lina.lulina@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-08 08:27:46 +08:00
liziyu
688b1332da [P/D] check kv extra config and del hccl backend (#4547)
### What this PR does / why we need it?
check kv extra config & del hccl backend


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-07 15:19:42 +08:00
wangxiyuan
3f81c4bb25 fix typo (#4657)
typo fix for release title

- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-03 11:56:47 +08:00
wangxiyuan
9a73c22b1c [Doc] add release note for v0.11.0rc3 (#4646)
Add release note for 0.11.0rc3. We'll release it today.

- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
86e178f7c4

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-03 11:49:44 +08:00
yeyifan
8907010815 [Doc] Add tutorial for Qwen3-Coder-30B-A3B (#4391)
### What this PR does / why we need it?
Add tutorial for Qwen3-Coder-30B-A3B

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: nsdie <yeyifan@huawei.com>
Signed-off-by: herizhen <you@example.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian0 <1184188277@qq.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: herizhen <59841270+herizhen@users.noreply.github.com>
Co-authored-by: herizhen <you@example.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: XiaoxinWang <963372609@qq.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-12-02 16:03:37 +08:00
wangxiyuan
cb33b09179 [Doc]clean up ascend scheduler config from doc (#4612)
clean up ascend scheduler config from doc

- vLLM version: v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-02 14:22:56 +08:00
herizhen
bb1610dc25 add hyperlink (#4588)
### What this PR does / why we need it?
add hyperlink

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.2

---------

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-12-02 14:09:03 +08:00
Mengqing Cao
517fd9272d Revert "drop ascend scheduler" (#4580)
Reverts vllm-project/vllm-ascend#4498
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
2025-11-29 22:20:48 +08:00
wangxiyuan
f10acddb78 drop ascend scheduler (#4498)
Ascend scheduler was added for non chunk prefill case before, since that
the npu ops didn't work well with chunked prefill.

Now the ops with chunked prefill work better, it's time to remove the
ascend scheduler to use vLLM default scheduler.

- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-29 16:18:34 +08:00
fems14
5447a039b9 [Feature][main]reconstruction kvpool connector to ascend connector (#4438)
### What this PR does / why we need it?
1.In short, we renamed the existing MooncakeStoreConnector to
AscendStoreConnector and extracted the storage engine interaction logic
into a new Backend class.
Associated RFC:https://github.com/vllm-project/vllm-ascend/issues/4329
2.Fixed the issue where the number of input parameters for the connector
was incorrect, introduced in vllm 0.11.2
### Does this PR introduce _any_ user-facing change?
change MooncakeStoreConnector to AscendStoreConnector
### How was this patch tested?

- vLLM version: v0.11.2

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-11-28 18:08:37 +08:00
LHXuuu
bdc66972db [Quantization] Support compressed tensors w8a8 static and w8a8 dynamic weight (#4036)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Add AscendCompressedTensorsConfig to replace CompressedTensorsConfig
in vllm.
2. Support CompressedTensorsW8A8 static weight.
- weight: per-channel, int8, symmetric; activation: per-tensor, int8,
symmetric.
4. Support CompressedTensorsW8A8Dynamic weight.
- weight: per-channel, int8, symmetric; activation: per-token, int8,
symmetric, dynamic.
5. Modify the override_quantization_method in AscendQuantConfig.

Co-authored-by: taoqun110 taoqun@huawei.com
Co-authored-by: chenxi-hh chen464822955@163.com

- vLLM version: v0.11.2

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: chenxi-hh <chen464822955@163.com>
Signed-off-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
Co-authored-by: chenxi-hh <chen464822955@163.com>
Co-authored-by: chenxi-hh <32731611+chenxi-hh@users.noreply.github.com>
2025-11-28 14:09:39 +08:00
herizhen
d252e36ae8 Change comment location (#4432)
### What this PR does / why we need it?
When running 'python example.py',connection issues often occur.The
solution is to comment out the first line the code.
Complete the specific names of machines A2 and A3.
Standardize document format,a space should be added after the colon.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM version: v0.11.2

---------

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-11-26 16:13:31 +08:00
Tjh-UKN
00ea61ec88 [feature] vllm-ascend support msprobe (eager mode dump) (#4241)
### What this PR does / why we need it?
vllm-ascend need to dump data during model execution to debug some
precision problems, here msprobe provide the corresponding abilities, so
msprobe will join vllm-ascend to make debug easier

### Does this PR introduce _any_ user-facing change?
```
'dump_config': '/path/to/config.json'
```



- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: Tjh-UKN <2559659915@qq.com>
2025-11-24 21:58:31 +08:00
herizhen
8c87a3b053 Change the first letter to uppercase (#4375)
### What this PR does / why we need it?
 The first letter  of the English title should be capitalized
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: herizhen <you@example.com>
Co-authored-by: herizhen <you@example.com>
2025-11-24 12:18:24 +08:00
wangxiyuan
fff258bce1 [Doc] add release note for v0.11.0rc2 (#4348)
add release note for v0.11.0rc2

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-21 23:03:32 +08:00
whx
a5554b6661 [Feat][Doc] Add a load_balance_dp_proxy in examples and external dp doc. (#4265)
### What this PR does / why we need it?
This PR adds a load-balance dp proxy server which can be used in
external DP scenario without Disaggregated-Prefill enabled. What's more,
add a doc of external dp and load-balance dp proxy server.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
See the new doc.

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-11-21 16:33:23 +08:00
LI SHENGYONG
019c7ded91 eplb redundant expert bugfix (#4291)
### What this PR does / why we need it?
Redundant experts bugfix
### Does this PR introduce _any_ user-facing change?
After configuring the path for experts_map, users do not need to
configure iinit_redundancy_expert.
### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-21 14:24:35 +08:00
pz1116
d43022f3ed [doc]fix readme for kv pool user guide (#4271)
### What this PR does / why we need it?
Add the parameter "register_buffer" for PD Aggregated Scenario in the
given example.


- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2025-11-19 15:57:50 +08:00
lilinsiman
adee9dd3b1 [Info][main] Correct the mistake in information documents (#4157)
### What this PR does / why we need it?
Correct the mistake in information documents

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-11-13 15:53:58 +08:00
wangxiyuan
f811a24bf0 Remove VLLM_USE_V1 (#4086)
Drop VLLM_USE_V1 usage.  This env has been removed from vLLM already.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 15:43:39 +08:00