1176 Commits

Author SHA1 Message Date
offline893
9b0baa1182 [BugFix] Check all expert maps when using multiple instances. (#3576)
### What this PR does / why we need it?
Check all expert maps when using multiple instances.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Qwen 235B in double A3.
case 1: master has an expert map, slave has no expert map.
case 2: master has an expert map, slave has an erroneous expert map.
case 3: master has an expert map, slave has a correct expert map.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-24 17:10:14 +08:00
Mengqing Cao
cea0755b07 [1/N][Refactor] Refactor code to adapt with vllm main (#3612)
### What this PR does / why we need it?
This is step 1 of refactoring the code to adapt to vllm main. This PR is
aligned with
17c540a993

1. refactor deepseek to the latest code arch as of
17c540a993
 
2. bunches of fixes due to vllm changes
- Fix `AscendScheduler` `__post_init__`, caused by
https://github.com/vllm-project/vllm/pull/25075
- Fix `AscendScheduler` init got an unexpected arg `block_size`, caused
by https://github.com/vllm-project/vllm/pull/26296
- Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by
https://github.com/vllm-project/vllm/pull/23485
- Fix `MLAAttention` import, caused by
https://github.com/vllm-project/vllm/pull/25103
- Fix `SharedFusedMoE` import, caused by
https://github.com/vllm-project/vllm/pull/26145
- Fix `LazyLoader` import, caused by
https://github.com/vllm-project/vllm/pull/27022
- Fix `vllm.utils.swap_dict_values` import, caused by
https://github.com/vllm-project/vllm/pull/26990
- Fix `Backend` enum import, caused by
https://github.com/vllm-project/vllm/pull/25893
- Fix `CompilationLevel` renaming to `CompilationMode` issue introduced
by https://github.com/vllm-project/vllm/pull/26355
- Fix fused_moe ops, caused by
https://github.com/vllm-project/vllm/pull/24097
- Fix bert model because of `inputs_embeds`, caused by
https://github.com/vllm-project/vllm/pull/25922
- Fix MRope because of `get_input_positions_tensor` to
`get_mrope_input_positions`, caused by
https://github.com/vllm-project/vllm/pull/24172
- Fix `splitting_ops` changes introduced by
https://github.com/vllm-project/vllm/pull/25845
- Fix multi-modality changes introduced by
https://github.com/vllm-project/vllm/issues/16229
- Fix lora bias dropping issue introduced by
https://github.com/vllm-project/vllm/pull/25807
- Fix structured output break introduced by
https://github.com/vllm-project/vllm/issues/26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
2025-10-24 16:55:08 +08:00
jiangyunfan1
ec9ec78b53 [TEST]Add initial prefix cache case for nightly test (#3709)
### What this PR does / why we need it?
This PR adds the initial prefix cache case to the nightly test for
Qwen3-32b-int8 on A3; it needs to be tested daily.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the test

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-24 16:33:18 +08:00
zzzzwwjj
6be321b95e remove useless code (#3685)
### What this PR does / why we need it?
`vanilla_chunked_prefill_mla` and `vanilla_decode_mla` are unused, so
remove them.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-10-24 16:29:08 +08:00
lio
cd58a643c5 [UT] Fix test_sample_recovered_tokens_pytorch_autoregressive (#3434)
### What this PR does / why we need it?

The 'test_rejection_sampler' unit test contains an error.

> def test_sample_recovered_tokens_pytorch_autoregressive(self):
>       output_token_ids = torch.empty(2, dtype=torch.int32)
>       cu_num_draft_tokens = torch.tensor([1, 1])
>       draft_token_ids = torch.tensor([0, 1])

Since len(draft_token_ids) = 2, cu_num_draft_tokens should be
torch.tensor([1, 2]) or torch.tensor([2, 2]).

This PR sets cu_num_draft_tokens = torch.tensor([1, 2]). The test passes
with the methods both before and after optimization.
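
The inconsistency can be checked in a few lines. This is a minimal sketch that assumes, as the message implies, that `cu_num_draft_tokens` holds the running (cumulative) number of draft tokens per request, so its last entry must equal `len(draft_token_ids)`. The helper name `cumulative_counts` is hypothetical, and plain Python lists stand in for the tensors.

```python
from itertools import accumulate


def cumulative_counts(num_draft_tokens_per_request):
    """Return running totals of per-request draft token counts."""
    return list(accumulate(num_draft_tokens_per_request))


draft_token_ids = [0, 1]  # two draft tokens in total

# Two requests with one draft token each.
cu_num_draft_tokens = cumulative_counts([1, 1])

# The last cumulative value must cover every draft token.
assert cu_num_draft_tokens[-1] == len(draft_token_ids)

# The value used by the original test ends at 1, so it does not cover
# both draft tokens.
old_value = [1, 1]
is_consistent = old_value[-1] == len(draft_token_ids)
```

Under the same assumption, the other value suggested in the message, `[2, 2]`, corresponds to per-request counts of `[2, 0]`.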

### Does this PR introduce _any_ user-facing change?
No 
### How was this patch tested?
NA

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: lio <1983142975@qq.com>
2025-10-24 11:20:57 +08:00
Li Wang
802c574532 [Benchmark] Upgrade benchmark args for new vllm version (#3218)
### What this PR does / why we need it?
Since the newest vllm commit has deprecated the arg `--endpoint-type`,
we should use `--backend` instead
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
test it locally:
```shell
export VLLM_USE_MODELSCOPE=true
export DATASET_PATH=/root/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json

vllm serve Qwen/Qwen2.5-7B-Instruct --load-format dummy

wget -O ${DATASET_PATH} https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

vllm bench serve --model Qwen/Qwen2.5-7B-Instruct --backend vllm --dataset-name sharegpt --dataset-path ${DATASET_PATH}  --num-prompt 200
```
and the result looks good:
```shell
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  20.36
Total input tokens:                      43560
Total generated tokens:                  44697
Request throughput (req/s):              9.82
Output token throughput (tok/s):         2194.88
Peak output token throughput (tok/s):    4676.00
Peak concurrent requests:                200.00
Total Token throughput (tok/s):          4333.93
---------------Time to First Token----------------
Mean TTFT (ms):                          2143.85
Median TTFT (ms):                        2486.17
P99 TTFT (ms):                           2530.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.50
Median TPOT (ms):                        30.75
P99 TPOT (ms):                           309.22
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.15
Median ITL (ms):                         25.42
P99 ITL (ms):                            38.30
==================================================
```
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-24 11:18:19 +08:00
whx
1b270a64bd [MoE][Multistream] Avoid performing communication in extra stream. (#3582)
This PR moves the communication operation of shared experts out of extra
stream because I found that this might cause rtMemcpy related errors
when running shared experts multistream with aclgraph.

Furthermore, I utilize a global variable as extra stream object to avoid
allocating streams for each layer in full-graph mode.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-24 10:44:38 +08:00
LookAround0301
b54d44e664 support cp&dcp (#3260)
### What this PR does / why we need it?
This PR adds the Prefill Context Parallelism (PCP) feature, which
corresponds to DCP. For specific implementation details, please refer to
the RFC https://github.com/vllm-project/vllm/issues/25749.
TL;DR: PCP enhances long-sequence inference capabilities by partitioning
the sequence dimension during the prefill stage.
### Does this PR introduce _any_ user-facing change?
The current implementation primarily includes the following changes:

- Modified ModelRunner.py to add the CP partitioning logic for tokens.
- Modified attention_v1.py and mla_v1.py to adapt the GQA/MLA backend to PCP.
- Modified block_tables.py to extend the KV cache storage based on DCP & PCP.
- Added the necessary command-line arguments to control parallelism for PCP.
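
To illustrate what "partitioning the sequence dimension during the prefill stage" can look like, here is a minimal sketch that splits a token list into contiguous chunks, one per PCP rank. The contiguous split, the function name `partition_sequence` and the parameter `pcp_size` are assumptions made for illustration. The partitioning used in ModelRunner.py may differ (for example, to balance causal-attention work across ranks).

```python
def partition_sequence(token_ids, pcp_size):
    """Split token_ids into pcp_size contiguous chunks, one per rank."""
    if pcp_size <= 0:
        raise ValueError("pcp_size must be positive")
    base, extra = divmod(len(token_ids), pcp_size)
    chunks, start = [], 0
    for rank in range(pcp_size):
        # The first `extra` ranks take one additional token.
        length = base + (1 if rank < extra else 0)
        chunks.append(token_ids[start:start + length])
        start += length
    return chunks


chunks = partition_sequence(list(range(10)), pcp_size=4)
```

Each rank then runs prefill on its own chunk, so the per-rank sequence length shrinks roughly by a factor of `pcp_size`.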
### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: chenjie <chenjie137@huawei.com>
Signed-off-by: Delphine-Nic <tanwenqin@huawei.com>
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Signed-off-by: Feng Liu <liufeng248@huawei.com>
Signed-off-by: gaojc <1055866782@qq.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Signed-off-by: z50049692 <zhangmingwei11@huawei.com>
Co-authored-by: chenjie <chenjie137@huawei.com>
Co-authored-by: Delphine-Nic <tanwenqin@huawei.com>
Co-authored-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Co-authored-by: Feng Liu <liufeng248@huawei.com>
Co-authored-by: gaojc <1055866782@qq.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: z50049692 <zhangmingwei11@huawei.com>
Co-authored-by: w00896881 <wangzixuan40@huawei.com>
2025-10-24 10:32:01 +08:00
fems14
2bcadcb9d5 [main] patch sched_yield (#3648)
### What this PR does / why we need it?
On Arm systems, os.sched_yield() does not take effect, causing the GIL
(Global Interpreter Lock) to remain unrelinquished and resulting in CPU
bound issues. This PR applies a patch to sched_yield in vLLM, making the
process execute time.sleep(0) instead to release the GIL.
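
The idea can be sketched as a small monkey patch. The `fake_vllm_utils` namespace below is a hypothetical stand-in for whichever vLLM module exposes `sched_yield`; the real module path and the way vllm-ascend applies the patch are not shown in this message.

```python
import time
import types

# Hypothetical stand-in for the module that exposes `sched_yield`.
fake_vllm_utils = types.SimpleNamespace(sched_yield=lambda: None)


def sleep_based_yield():
    """Yield by sleeping for zero seconds; time.sleep releases the GIL."""
    time.sleep(0)


def apply_patch(module):
    """Replace module.sched_yield with the sleep-based version."""
    module.sched_yield = sleep_based_yield


apply_patch(fake_vllm_utils)
fake_vllm_utils.sched_yield()  # now goes through time.sleep(0)
```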
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-10-24 00:06:45 +08:00
Wang Yixuan
a7b40b09eb [BugFix]fix deepseek torchair recompile (#3678)
### What this PR does / why we need it?
PR #3624 fixed the precision of deepseek torchair but did not consider
the limitation of torch compile, which results in a recompile. This PR
fixes this problem.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-23 22:53:01 +08:00
Slightwind
3366d47694 [main][bugfix] Add 'layer_type' param to get_pergroup_param() for compatibility (#3682)
Resolves a `TypeError: got an unexpected keyword argument 'layer_type'`.

A recent change (PR #3311) started passing the `layer_type` argument
when calling `get_pergroup_param()`. This specific implementation does
not use this parameter, causing the error.

This patch adds `layer_type=None` to the method signature to maintain
API compatibility and ignore the unused argument.
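
A minimal sketch of this kind of fix is shown below. Both classes and all parameter names other than `layer_type` are hypothetical and serve only to reproduce the `TypeError` and its resolution.

```python
class OldScheme:
    """Hypothetical implementation whose signature lacks layer_type."""

    def get_pergroup_param(self, input_size, output_size, params_dtype):
        return {"shape": (output_size, input_size), "dtype": params_dtype}


class FixedScheme:
    """Same implementation, but accepting and ignoring layer_type."""

    def get_pergroup_param(self, input_size, output_size, params_dtype,
                           layer_type=None):
        # layer_type is accepted only for API compatibility.
        return {"shape": (output_size, input_size), "dtype": params_dtype}


try:
    OldScheme().get_pergroup_param(4, 8, "bfloat16", layer_type="row")
    old_call_failed = False
except TypeError:
    old_call_failed = True

params = FixedScheme().get_pergroup_param(4, 8, "bfloat16", layer_type="row")
```

Defaulting the new parameter to `None` keeps existing callers that omit it working unchanged.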
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-10-23 21:26:33 +08:00
HuaJiaHeng
062257f624 [Test] add a new Qwen3-32b-int8 test case with feature_stack3 (#3676)
### What this PR does / why we need it?
This PR adds a new Qwen3-32b-int8 test case for the nightly test. This test
case mainly tests the performance and accuracy of Qwen3-32b-int8 with a
new feature.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: root <root@hostname-2pbfv.foreman.pxe>
Co-authored-by: root <root@hostname-2pbfv.foreman.pxe>
2025-10-23 20:43:14 +08:00
zhangyiming
ebfd09a075 [Doc] Update the Pangu Pro MoE tutorials. (#3651)
### What this PR does / why we need it?
Update the Pangu Pro MoE tutorials.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: menogrey <1299267905@qq.com>
2025-10-23 20:41:47 +08:00
liziyu
aeddf4261a [Bugfix] fix delay free prefill req & D node support prefix cache (#3607)
### What this PR does / why we need it?
Fix mooncake connector. In scenarios where TP is not equal, when the
prefill TP size is less than the number of key-value heads,
_get_remote_tp_ranks_for_req will return a list of np.arrays. Performing
an operation like int in list of np.arrays will cause an error.
Converting the list of np.arrays into a single np.array resolves this
issue.
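
The failure mode can be reproduced with NumPy alone. In this sketch the rank values are made up, and `np.concatenate` is one assumed way of "converting the list of np.arrays into a single np.array"; the connector code may use a different call.

```python
import numpy as np

# Made-up rank groups standing in for the list of np.arrays.
rank_groups = [np.array([0, 1]), np.array([2, 3])]

try:
    # Comparing an int against each array yields a boolean array whose
    # truth value is ambiguous, so the membership test raises.
    _ = 3 in rank_groups
    membership_failed = False
except ValueError:
    membership_failed = True

# Flatten into a single array; membership now works element-wise.
all_ranks = np.concatenate(rank_groups)
found = 3 in all_ranks
```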

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
qwen235B
P tp16, D tp1
P tp8, D tp1
P tp4, D tp1
P tp8, D tp2


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: underfituu <hzhucong@163.com>
2025-10-23 20:39:14 +08:00
Shanshan Shen
e3c1ac89e5 [Structured Output] Replace apply_grammar_bitmask() method with that in vllm to avoid maintenance (#2524)
### What this PR does / why we need it?
Replace `apply_grammar_bitmask()` method with that in vllm to avoid
maintenance.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: shen-shanshan <467638484@qq.com>
2025-10-23 17:26:27 +08:00
jiangyunfan1
9434f24ded [TEST]Add initial multi modal cases for nightly test and deepseek-r1 tests (#3631)
### What this PR does / why we need it?
This PR adds the initial multi modal model for the nightly test, including 3
cases for the Qwen2.5-vl-7b acc/perf test on A3, which need to be tested daily. It
also includes 8 cases for deepseek-r1-0528-w8a8 func, acc and perf tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
2025-10-23 17:18:49 +08:00
Rui Kang
427b17e2da [Misc] Add a model loader that utilizes HCCL for weight loading (#2888)
### What this PR does / why we need it?

This PR introduces a new model loader called Netloader, which leverages
high-bandwidth P2P direct transfer between NPU cards to achieve weight
loading. Netloader is implemented as a plugin through the newly added
'register_model_loader' function in vLLM 0.10. It facilitates the
process of weight loading by sending weights from a pre-loaded model
(server) to an empty model of a newly started instance (client). The
server operates concurrently with normal inference tasks through
sub-threads and the 'stateless_init_torch_distributed_process_group' in
vLLM. The client initiates a transfer request after verifying that the
model and partitioning method are the same as the server's, and uses
HCCL's collective communication (send/recv) to load the weights in the
order they are stored in the model.

Application Scenarios:
1. Significantly reduces inference instance startup time: by reusing the
weights of already loaded instances and performing high-speed transfers
directly between computing cards, this method reduces model loading
latency compared to traditional remote/local pull methods.
2. Reduces network and storage pressure: it avoids the need to repeatedly
download weight files from remote repositories, reducing the impact on
centralized storage and network traffic, thereby enhancing overall
system stability and service quality.
3. Improves resource utilization and reduces costs: accelerating the
loading process reduces reliance on redundant computing pools, allowing
computing resources to be elastically scaled and reclaimed as needed.
4. Enhances business continuity and high availability: in fault recovery
scenarios, new instances can quickly take over existing services,
avoiding prolonged business interruptions and improving the system's
high availability and user experience.

### Does this PR introduce _any_ user-facing change?

Netloader is activated through the existing --load-format=netloader and
--model-loader-extra-config options. The
model-loader-extra-config value must be passed as a JSON string (as it is
now).

Afterwards, you can check whether the outputs for the same sentence are
consistent when the temperature is set to 0.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
NeverRaR
807686dec9 perf : optimize memory for deepseek mtp (#2713)
### What this PR does / why we need it?
delete the temp tensor to optimize memory for deepseek mtp for torchair
case

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: boying <897013703@qq.com>
2025-10-23 15:52:17 +08:00
Wang Yixuan
2584f97217 [BugFix] fix deepseek torchair precision (#3624)
### What this PR does / why we need it?
The precision of deepseek torchair was broken by #3465, due to the
original patch or rmsnorm in torchair. This PR fixes the precision of
deepseek torchair.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: hust17yixuan <303660421@qq.com>
2025-10-23 15:41:50 +08:00
Crazyang
f06a6cad1b [Doc] Update the modelslim website from gitee to gitcode. (#3615)
### What this PR does / why we need it?

Because the ModelSlim code repository has migrated from gitee to
gitcode, all relevant links in the repository have been updated.

[migration
notice](https://gitee.com/ascend/msit/tree/master/.%E6%9C%AC%E9%A1%B9%E7%9B%AE%E5%B7%B2%E7%BB%8F%E6%AD%A3%E5%BC%8F%E8%BF%81%E7%A7%BB%E8%87%B3%20Gitcode%20%E5%B9%B3%E5%8F%B0)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Crazyang <im.crazyang@gmail.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weichen <calvin_zhu0210@outlook.com>
2025-10-23 15:38:16 +08:00
rjg-lyh
292e213dd2 [main][refactor] refactor SequenceRowParallelOp forward (#3616)
### What this PR does / why we need it?
This PR refactors SequenceRowParallelOp forward. In order to further
expand the operator inclusion scope in dynamic judgment scenarios, this
PR customizes the entire matmul computation and communication as a
custom operator masking. With this refactor, it will support directly
writing code such as common operation fusion into the
`SequenceRowParallelOp` class's member function `matmul_and_reduce`,
without the need to register more redundant custom masking operators.

### How was this patch tested?
CI passed with existing test.

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-10-23 14:41:15 +08:00
Li Wang
ca104ce6f0 [Doc] Upgrade docker run command (#3645)
### What this PR does / why we need it?
Update the docker run command; specifically, add --shm-size=1g.
### Does this PR introduce _any_ user-facing change?
For users/developers who run vllm-ascend with docker, the shared memory of
the container is increased from the default 64MB to 1G.

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-23 11:17:26 +08:00
Ruri
dd7a25063c [Feat] Prefetching Attention QKV Linear Weight With AddRmsNormQuant Custom Op (#3517)
### What this PR does / why we need it?

- `qkv_proj.weight` prefetching has been implemented with `Quant` op,
when `AddRmsNormQuant` is enabled (#3465) `qkv_proj.weight` prefetching
won't work
- Implement `qkv_proj.weight` prefetching with `AddRmsNormQuant`

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Tested on `Qwen3-235B-A22B-W8A8`
<img width="1868" height="109" alt="image"
src="https://github.com/user-attachments/assets/0bc28082-0287-4d5c-b8f6-f907c3134d36"
/>


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
2025-10-23 10:07:37 +08:00
whx
72695c97d0 [BugFix][main] Fix quantization related mtp bug with patch (#3620)
vLLM 0.11.0 does not include PR
(https://github.com/vllm-project/vllm/pull/25805) and is therefore missing the
prefix of mtp's SharedHead. This PR fixes this bug with a patch to
vllm's deepseek_mtp. main also needs this bugfix to support vllm
v0.11.0.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-23 09:54:31 +08:00
Yizhou
4381d296e5 [Fix] Fix attention metadata handling for profiling and MLA (#3636)
### What this PR does / why we need it?
Move the creation of dummy attention metadata to occur after the ACL
graph runtime mode is determined. This ensures the metadata is
initialized with the correct configuration during a profile run.

Additionally, remove the `attn_metadata` existence check before updating
MLA attention parameters. This change prevents the update from being
skipped when metadata is not yet available, ensuring parameters are set
correctly.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-23 09:35:18 +08:00
Yizhou
b13d22bf5a [Fix] Fixes attribute error in MLA implementation (#3618)
### What this PR does / why we need it?
Corrects the attribute access for retrieving the device from `q_a_proj`
to `q_proj`. This prevents an `AttributeError` as `q_a_proj` does not
exist on the class instance.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Need MLAPO tests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-23 09:12:50 +08:00
Zetong Li
179b897b52 [Bugfix][MTP] Fix performance degradation when mtp>1 (#3597)
### What this PR does / why we need it?
This PR aims to fix performance degradation when mtp>1. Since mtp>1 may
result in more tokens (i.e. larger batch size) than acl graph maximum
batch size, this will cause draft model to run in eager mode.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-10-22 22:04:43 +08:00
zouyida2052
55a4b5ac40 unify logic between aclgraph and torchair (#3560)
### What this PR does / why we need it?
unify logic between aclgraph and torchair for mtp spec decode

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2025-10-22 21:52:57 +08:00
wangxiyuan
1ad7ffd647 clean up useless ut test (#3622)
supports_v1 has been removed already by
c18ca62a17

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-22 15:00:08 +08:00
Li Wang
286ae9003d [CI] Multi-Node CI scalable (#3611)
### What this PR does / why we need it?
This PR adds a jinja template for the k8s configuration file, in
preparation for the upcoming 4-node CI.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-10-22 14:18:43 +08:00
kx
bc30874f8b [Feat] add native kvcache offload (#3433)
### What this PR does / why we need it?
This PR is for https://github.com/vllm-project/vllm-ascend/issues/3241,
an in-house solution for offloading KV cache data from GPU
memory to another medium (in particular, CPU memory). Previous solutions
relied on third-party components, which had compatibility issues
between different versions.

### How was this patch tested?
use the following script for testing:

export CUDA_VISIBLE_DEVICES=0
export TP=1
export MODEL_PATH=/model/Qwen3-14B
export MODEL_NAME=Qwen3-14B
export PORT=10000
#export ASCEND_LAUNCH_BLOCKING=1
#export ASCEND_SLOG_PRINT_TO_STDOUT=1

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port
${PORT} --dtype bfloat16 --model ${MODEL_PATH} --served-model-name
${MODEL_NAME} --tensor-parallel-size ${TP} --gpu-memory-utilization 0.7
--max-model-len 32768 --trust-remote-code --disable-log-requests \
    --block-size 128 \
--kv-transfer-config
'{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size":
128, "num_cpu_blocks": 1000, "spec_name":"NPUOffloadingSpec",
"spec_module_path": "vllm_ascend.kv_offload.npu"}}'
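
Because the `--kv-transfer-config` value is a JSON string, it can be easier to build it programmatically than to hand-write the quoting. The sketch below reuses exactly the keys and values from the command above. Generating the argument with `json.dumps` and `shlex.quote` is a suggestion for illustration, not part of the PR.

```python
import json
import shlex

# Same keys and values as in the test command above.
kv_transfer_config = {
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "block_size": 128,
        "num_cpu_blocks": 1000,
        "spec_name": "NPUOffloadingSpec",
        "spec_module_path": "vllm_ascend.kv_offload.npu",
    },
}

# Serialize to JSON and quote it so a shell passes it as one argument.
json_arg = json.dumps(kv_transfer_config)
cli_fragment = "--kv-transfer-config " + shlex.quote(json_arg)
```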


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: HF-001 <1670186653@qq.com>
2025-10-22 14:15:49 +08:00
KyrieWang
60e2be1b36 [Feat] Dynamic Batch Feature (#3490)
See the [RFC](https://github.com/vllm-project/vllm-ascend/issues/3328) for more
details.
This PR adds a dynamic batch feature to the chunked prefill strategy: the token
budget can be refined to achieve better effective throughput and TPOT.

!!! NOTE: only 910B3 is supported till now, we are working on further
improvements.
Additional file for lookup table is required.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Cheng Wang <wangchengkyrie@outlook.com>
2025-10-22 14:13:32 +08:00
wangxiyuan
c18ca62a17 [Misc] clean up useless function (#3348)
Remove the interface which has been removed from vLLM already.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-22 11:53:40 +08:00
NeverRaR
f2dd5f8d08 fix : support chunked_prefill with deepseek_mtp (#2711)
### What this PR does / why we need it?
fix : support chunked_prefill with deepseek_mtp

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
vllm serve $MODEL_PATH \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --no-enforce-eager \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 16384 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --speculative-config '{"num_speculative_tokens":1, "method": "deepseek_mtp"}' \
    --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24],"enable_multistream_mla": true},"ascend_scheduler_config":{"enabled":false},"expert_tensor_parallel_size":16, "chunked_prefill_for_mla":true}' \
   --gpu-memory-utilization 0.95
```

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: boying <897013703@qq.com>
2025-10-22 11:52:27 +08:00
weichen
2f1b9a7a64 Reapply "[MoE] [Refactor] Remove manual memory cleanup (#3365)" (#3483) (#3512)
### What this PR does / why we need it?
1. Replace manual memory cleanup with parameter passing.
2. FusedMoEPrepareAndFinalizeWithMC2 inherits from All2All to avoid duplicated
code.
3. Fix MC2 bug introduced in
https://github.com/vllm-project/vllm-ascend/pull/3365
4. Unify aclgraph & eager in W8A8_dynamic.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-10-22 11:41:30 +08:00
wangxiyuan
6ef62cb427 fix ut (#3608)
Fix `test_torchair_deepseek_v2_decoder_layer` ut failure

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-22 11:30:12 +08:00
offline893
5f3b798e56 [CI]Fix test nightly workflow. (#3603)
### What this PR does / why we need it?
Add the nightly test back; it was deleted by mistake.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-22 10:45:42 +08:00
offline893
e916265b2b [CI]Add EPLB CI. (#3568)
### What this PR does / why we need it?
1. Add eplb CI to check changes to the eplb feature.
2. Add parameter checking for the eplb params.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Qwen in A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-21 22:58:02 +08:00
linfeng-yuan
4c9af353ee Revert "[Feat] Shared expert dp for deepseek and deepseek_mtp (#3495)" (#3586)
### What this PR does / why we need it?
This reverts commit
bf87606932.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E vllm serving with `enable_shared_expert_dp: true` in eager mode as
before.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-21 22:24:30 +08:00
whx
bd11c0054f [BugFix] Fix torchair+mtp bug after deleting deepseek_mtp. (#3590)
This fixes a remaining bug introduced by PR #3561.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-21 22:23:52 +08:00
shaopeng-666
0c83eee9b1 fix vl float model not support NZ format weight error (#3533)
### What this PR does / why we need it?
fix vl float model not support nz mm op
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
Co-authored-by: shaopeng666 <shaopeng666@noreply.gitcode.com>
2025-10-21 22:23:17 +08:00
Icey
6f04b467de [CI] Upgrade manylinux image (#3587)
### What this PR does / why we need it?
Upgrade manylinux image

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Icey <1790571317@qq.com>
2025-10-21 22:22:45 +08:00
xuyexiong
79821106e6 [BugFix]Fix mtp torchair bug caused by #2719 (#3566)
### What this PR does / why we need it?
Fix the mtp torchair bug caused by #2719.
Since FIA needs extra space for padding, we need to enforce
`self.max_num_seqs > self.scheduler_config.max_num_seqs` in KV consumer + MTP.
This means that `self.max_num_seqs` is strictly greater than (**>**) the actual
maximum number of requests (`self.scheduler_config.max_num_seqs`).

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-10-21 22:21:44 +08:00
drslark
534f32d27c [BugFix][main] Fixed a triton kernel bug of layer_norm_fwd_kernel for Qwen3-next (#3549)
### What this PR does / why we need it?
Fixes the triton kernel **layer_norm_fwd_kernel**, as described in
https://github.com/vllm-project/vllm-ascend/issues/3548

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

The environment is the same as in this issue:
https://github.com/vllm-project/vllm-ascend/issues/3548.

Start a vLLM server with:
```shell
vllm serve /home/model/Qwen3-Next-80B-A3B-Instruct   --port 22   --host 0.0.0.0   --served-model-name qwen3_next_mtp_0   --tensor-parallel-size 4   --max-model-len 32000   --gpu-memory-utilization 0.7   --enforce-eager
```

Then, start an aisbench client like:
```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt --dump-eval-details
```

The client config is:
```python
    # a big batch_size and a large max_out_len
    dict(
        abbr='vllm-api-general-chat',
        attr='service',
        batch_size=512,
        generation_kwargs=dict(temperature=0.7, top_k=20, top_p=0.8),
        host_ip='xxx.xxx.xxx.xxx',
        host_port=8881,
        max_out_len=30000,
        model='qwen3_next_mtp_0',
        path='',
        pred_postprocessor=dict(
            type=
            'ais_bench.benchmark.utils.model_postprocessors.extract_non_reasoning_content'
        ),
        request_rate=0,
        retry=2,
        trust_remote_code=False,
        type='ais_bench.benchmark.models.VLLMCustomAPIChat'),
```

**Results:**

```text
(APIServer pid=615544) INFO 10-21 01:44:05 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.1 tokens/s, Running: 7 reqs, Waiting: 1 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:15 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 72.1 tokens/s, Running: 7 reqs, Waiting: 1 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:25 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 71.4 tokens/s, Running: 7 reqs, Waiting: 1 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.6 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 86.1%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:45 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 59.8 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 88.2%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:44:55 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.2 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 88.2%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:45:05 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.8 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 88.2%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 01:45:15 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.4 tokens/s, Running: 6 reqs, Waiting: 2 reqs, GPU KV cache usage: 90.8%, Prefix cache hit rate: 0.0%
```

When we send a bunch of requests and the **KV cache usage reaches
100.0%**, we no longer get a **coreDim=xxx can't be greater than
UINT16_MAX.** exception.

```text
(APIServer pid=615544) INFO 10-21 02:17:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 3 reqs, Waiting: 5 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 02:17:45 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.3 tokens/s, Running: 3 reqs, Waiting: 5 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 02:17:55 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 3 reqs, Waiting: 5 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 02:18:05 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.9 tokens/s, Running: 3 reqs, Waiting: 5 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 02:18:15 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.7 tokens/s, Running: 2 reqs, Waiting: 6 reqs, GPU KV cache usage: 81.9%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO:     141.61.39.105:48568 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=615544) INFO:     141.61.39.105:48580 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```

After a few minutes, these two requests complete.

```text
(APIServer pid=615544) INFO 10-21 03:18:25 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:18:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:18:45 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:18:55 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:19:05 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 40.8%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:19:15 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.2%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO:     141.61.39.105:48712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=615544) INFO 10-21 03:19:25 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=615544) INFO 10-21 03:19:35 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
Finally, all requests are done.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: drslark <slarksblood@qq.com>
2025-10-21 20:20:57 +08:00
wangxiyuan
13e8e75143 [Refactor] refactor patch module (#3555)
### What this PR does / why we need it?
We noticed that `patch_main` is never used. Patches usually apply to all
versions, and when a patch targets a specific version we can use
`vllm_version_is` instead. So let's remove the unused subfolder from the
patch module to keep it clear.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-21 20:19:46 +08:00
Jade Zheng
0c6349610e [Feature] Reduce host memory usage for attention mask generation (#3048)
### What this PR does / why we need it?

Previously, the mask construction process created multiple tensors of
size (max_model_len, max_model_len). When max_model_len reached 128k,
single GPU host memory usage exceeded hundreds of GB, causing process
OOM crashes. This update optimizes the mask generation to significantly
reduce memory consumption.
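
As a rough illustration of the scale involved, the arithmetic below estimates the size of a single dense `(max_model_len, max_model_len)` mask. The element sizes and the reading of "128k" as 131072 are assumptions made for this sketch; the PR description does not state the mask dtype or how many such tensors were created.

```python
# Back-of-the-envelope estimate only; dtype choices are assumptions.
GIB = 1024 ** 3


def dense_mask_bytes(max_model_len: int, bytes_per_element: int) -> int:
    """Bytes needed for one dense (max_model_len, max_model_len) mask."""
    return max_model_len * max_model_len * bytes_per_element


max_model_len = 128 * 1024  # "128k" read as 131072 tokens (assumption)

for name, size in (("1-byte (bool/int8)", 1), ("float16", 2), ("float32", 4)):
    gib = dense_mask_bytes(max_model_len, size) / GIB
    print(f"{name}: {gib:.0f} GiB per mask")
```

Even at one byte per element a single mask is 16 GiB, so several temporaries of that shape quickly add up to hundreds of GB of host memory.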

### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?

CI pass.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-10-21 20:19:04 +08:00
Anion
5f8b1699ae [Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers (#3311)
### What this PR does / why we need it?
**Problem Description:**

The existing implementation for the w4a8-dynamic linear method only
supports the old quantization format from msmodelslim. When attempting
to load models quantized with the new version, vLLM encounters errors
due to mismatched tensor shapes and unprocessed quantization parameters.

Relevant issues:
- https://github.com/vllm-project/vllm-ascend/issues/3192
- https://github.com/vllm-project/vllm-ascend/issues/3152

**Proposed Changes:**
1. Add support for the new w4a8 dynamic format in
`AscendW4A8DynamicLinearMethod` and `TorchairAscendW4A8DynamicLinearMethod`
2. Add unit tests and e2e tests for both new-format and old-format w4a8
dynamic models
<details>
<summary><b>details</b></summary>

1.  **Support for new w4a8-dynamic format:**
* Detects quantization format by reading the "version" field in
quant_description to ensure backward compatibility.
* Handles the new pre-packed weight format (`2x int4` in an `int8`),
which has a halved dimension. It tells the vLLM loader how to unpack it
using `_packed_dim` and `_packed_factor`.
* Supports the new `scale_bias` parameter, setting its shape based on
the layer type, as required by msmodelslim. For API consistency and
future use, the `layer_type` parameter was also added to the other
quantization methods.
* Updates the weight processing logic: new format weights are handled
with `.view(torch.int32)` since they're pre-packed, while old ones are
processed with `npu_convert_weight_to_int4pack`.

2.  **New unit and E2E tests:**
* Added unit tests that verify the logic for both the old and new
formats.
* Split the distributed E2E test to confirm that both old and new format
models work correctly.

</details>
Theoretically, these changes will provide support for all common new
version w4a8(dynamic) models from msmodelslim.
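
To make the "`2x int4` in an `int8`" idea concrete, here is a minimal pure-Python sketch of packing and unpacking signed 4-bit values. The function names and the low-nibble-first order are assumptions chosen for illustration; they are not taken from msmodelslim or from this PR, whose actual layout may differ.

```python
# Illustrative only: nibble order and helper names are hypothetical.
def pack_int4_pairs(values):
    """Pack signed int4 values (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of int4 values")
    packed = []
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed int4")
        # Keep only the low 4 bits of each value (two's complement).
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return packed


def unpack_int4_pairs(packed):
    """Inverse of pack_int4_pairs: recover the signed int4 values."""
    def sign_extend(nibble):
        return nibble - 16 if nibble >= 8 else nibble

    values = []
    for byte in packed:
        values.append(sign_extend(byte & 0xF))
        values.append(sign_extend((byte >> 4) & 0xF))
    return values


weights = [-8, 7, 0, -1, 3, -4]
packed = pack_int4_pairs(weights)
print(len(weights), "->", len(packed))  # the packed dimension is halved
print(unpack_int4_pairs(packed) == weights)
```

The packed list has half as many entries as the input, which is the "halved dimension" a loader has to account for when it is told the packing factor.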

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
I implemented the relevant unit tests and e2e tests and tested the changes
with the following commands:
```bash
# unit tests
python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v

# e2e tests
pytest tests/e2e/singlecard/test_quantization.py -v -s

pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s

```

I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic
format:
```
vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384 
```

All tests mentioned passed locally.

**NOTE: I use a quantized model from my own repo in
test_offline_inference_distributed.py**. Here is the description:
[Anionex/Qwen3-1.7B-W4A8-V1](https://modelscope.cn/models/Anionex/Qwen3-1.7B-W4A8-V1/summary)
(including quantization steps). This should be replaced by a model in
the vllm-ascend CI ModelScope repo.

Thanks for reading!


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Anionex <1005128408@qq.com>
2025-10-21 20:18:39 +08:00
Chao Lei
11f9bccf6b Mooncake store use adxl inferface (#3350)
Use the adxl interface in the Mooncake store. Related Mooncake PR:
https://github.com/kvcache-ai/Mooncake/pull/929

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-10-21 20:18:17 +08:00
Yizhou
ef3fabf399 [Chore] Prevents use of ASCEND_LAUNCH_BLOCKING with ACL Graph (#3574)
### What this PR does / why we need it?
Adds a validation check to prevent running with an incompatible
configuration.

The `ASCEND_LAUNCH_BLOCKING=1` environment variable, used for debugging,
enforces synchronous execution which is incompatible with ACL Graph.

This change raises an explicit error to inform the user about the
conflict and how to resolve it, preventing a more obscure failure later.
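
A validation of this kind might look like the sketch below. It is a hedged illustration, not the code from this PR: the function name `check_launch_blocking`, the `use_acl_graph` flag, and the error wording are hypothetical; only the `ASCEND_LAUNCH_BLOCKING=1` variable comes from the description above.

```python
import os


def check_launch_blocking(use_acl_graph: bool, environ=None) -> None:
    """Fail fast if blocking launches are requested together with ACL Graph.

    `use_acl_graph` stands in for however the caller decides that ACL Graph
    is enabled (hypothetical parameter).
    """
    env = os.environ if environ is None else environ
    if use_acl_graph and env.get("ASCEND_LAUNCH_BLOCKING") == "1":
        # Hypothetical message wording, for illustration only.
        raise ValueError(
            "ASCEND_LAUNCH_BLOCKING=1 enforces synchronous execution, which "
            "is incompatible with ACL Graph. Unset ASCEND_LAUNCH_BLOCKING or "
            "disable ACL Graph."
        )


# Usage: passes silently when the configuration is compatible.
check_launch_blocking(use_acl_graph=True, environ={})
check_launch_blocking(use_acl_graph=False, environ={"ASCEND_LAUNCH_BLOCKING": "1"})
```

Checking at startup turns a late, hard-to-diagnose failure into an immediate error that names both conflicting settings.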

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-21 20:17:33 +08:00
whx
220df60c61 [Model][2/N] Remove deepseek_mtp modeling. (#3561)
This PR is step 2 of deepseek model refactoring and removes
deepseek_mtp.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-10-21 20:17:09 +08:00