267 Commits

Author SHA1 Message Date
Pr0Wh1teGivee
d13fb0766e [Perf] add patch to optimize apply_topk_topp (#1732)
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
### Does this PR introduce _any_ user-facing change?
Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature
### How was this patch tested?
e2e & ut

- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-07-11 15:32:02 +08:00
weiguihua2
aa4240c67f Support pipeline parallel in V1 Engine (#1700)
### What this PR does / why we need it?
This patch supports pipeline parallel in V1 Engine

### Does this PR introduce _any_ user-facing change?
Yes, users can run PP in V1

### How was this patch tested?
Manually tested

- vLLM version: v0.9.2
- vLLM main:
31d5c1797f

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-07-11 15:30:51 +08:00
ttanzhiqiang
ee40d3d850 use npu_moe_gating_top_k_softmax (#1355)
### What this PR does / why we need it?
For non-DeepSeek models, select_experts now uses the fused
npu_moe_gating_top_k_softmax op in place of the separate softmax+topk+to
sequence, which reduces this step from 37 us to 14 us with bf16/fp16 on
qwen3-235b

- vLLM version: v0.9.2
- vLLM main:
1a4f35e2ea

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:55:06 +08:00
ttanzhiqiang
9d16c9982e rm router logits Improve TTOP 3ms (#1407)
### What this PR does / why we need it?

The previous code was:
router_logits, _ = self.gate(hidden_states)
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits = get_dp_group().all_gather(router_logits, 0)
This PR merges the two all_gather calls into one, saving one all_gather
communication, by gathering hidden_states first and computing the router
logits afterwards:
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits, _ = self.gate(hidden_states)
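
The reordering is valid because the gate acts on each token row independently, so gathering first and gating afterwards yields the same router logits as gating locally and gathering both tensors. A minimal sketch in plain Python follows; `all_gather` and `gate` here are hypothetical stand-ins (list concatenation and a tiny linear layer), not the vLLM APIs, and the data is made up:

```python
# Toy stand-ins: each DP rank holds a list of token rows (lists of floats).
def all_gather(per_rank_tensors):
    """Simulate all_gather along dim 0: concatenate every rank's rows."""
    return [row for tensor in per_rank_tensors for row in tensor]

def gate(hidden_states, weight):
    """Row-wise linear gate: one logit per expert for every token."""
    return [[sum(h * w for h, w in zip(row, expert)) for expert in weight]
            for row in hidden_states]

# Hypothetical data: 2 ranks, hidden size 2, 3 experts (one weight row each).
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
rank_hidden = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]

# Before: gate locally, then gather hidden_states and router_logits (2 collectives).
logits_before = all_gather([gate(h, weight) for h in rank_hidden])

# After: gather hidden_states once, then gate the gathered tensor (1 collective).
hidden_after = all_gather(rank_hidden)
logits_after = gate(hidden_after, weight)

print(logits_before == logits_after)
```

The trade-off in this sketch is that every rank runs the gate on all gathered tokens instead of only its own, in exchange for one fewer collective.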

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh

gsm8k accuracy verification
<img width="1809" alt="Screenshot 2025-06-24 21 53 24"
src="https://github.com/user-attachments/assets/47eace3b-a86b-41b4-9de8-773f57fea33b"
/>



- vLLM version: v0.9.2
- vLLM main:
77f77a951e

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-11 08:53:17 +08:00
ApsarasX
0fc9b56d40 [Perf] Improve MLA multistream performance (#1353)
### What this PR does / why we need it?
> Need to merge after PR #1322

According to benchmark results, this PR brings approximately 1%
performance gain.

#### Before Improvement
Profiling
<img width="1147" alt="Screenshot 2025-06-22 14 54 47"
src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c"
/>

Evaluation
```
# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 1536 \
        --num-prompts 200 \
        --ignore-eos \
        --model auto \
        --tokenizer /DeepSeek-R1-W8A8 \
        --port 8006 \
        --request-rate 1 \
        --max-concurrency 24 \
        --save-result \
        --skip-initial-test \
        --metric-percentiles "50,90,99"
```

```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  958.59    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2086    
Output token throughput (tok/s):         320.47    
Total Token throughput (tok/s):          1175.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70    
Median TTFT (ms):                        713.87    
P50 TTFT (ms):                           713.87    
P90 TTFT (ms):                           1363.88   
P99 TTFT (ms):                           2008.73   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96     
Median TPOT (ms):                        69.49     
P50 TPOT (ms):                           69.49     
P90 TPOT (ms):                           70.42     
P99 TPOT (ms):                           70.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96     
Median ITL (ms):                         59.88     
P50 ITL (ms):                            59.88     
P90 ITL (ms):                            61.59     
P99 ITL (ms):                            68.82     
==================================================
```

#### After Improvement
Profiling
<img width="1200" alt="Screenshot 2025-06-22 14 55 42"
src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f"
/>

Evaluation
```
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  948.08    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2110    
Output token throughput (tok/s):         324.02    
Total Token throughput (tok/s):          1188.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25   
Median TTFT (ms):                        714.63    
P50 TTFT (ms):                           714.63    
P90 TTFT (ms):                           1367.31   
P99 TTFT (ms):                           2661.52   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14     
Median TPOT (ms):                        68.68     
P50 TPOT (ms):                           68.68     
P90 TPOT (ms):                           69.33     
P99 TPOT (ms):                           70.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14     
Median ITL (ms):                         59.04     
P50 ITL (ms):                            59.04     
P90 ITL (ms):                            60.93     
P99 ITL (ms):                            66.89     
==================================================
```
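
The "approximately 1%" figure can be checked directly from the two result tables. The sketch below only copies numbers from those tables and computes relative changes; the dictionary keys are names chosen here for illustration:

```python
# Figures copied from the two "Serving Benchmark Result" tables above.
before = {"total_tok_s": 1175.05, "mean_tpot_ms": 68.96, "duration_s": 958.59}
after = {"total_tok_s": 1188.08, "mean_tpot_ms": 68.14, "duration_s": 948.08}

def relative_change(old, new):
    """Signed relative change from old to new, as a fraction."""
    return (new - old) / old

changes = {name: relative_change(before[name], after[name]) for name in before}
for name, value in changes.items():
    # Throughput should rise; TPOT and duration should fall.
    print(f"{name}: {value:+.2%}")
```

Note that mean TTFT moves the other way in these two runs (942.70 ms before, 1019.25 ms after), so the gain shows up in decode-side metrics.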
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?




- vLLM version: v0.9.2
- vLLM main:
65393ee064

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-11 08:51:17 +08:00
Mengqing Cao
cc210f46e6 [AscendScheduler][Bugfix] Remove num_draft_tokens while allocating slots (#1718)
### What this PR does / why we need it?

Now there is no need to calculate `num_draft_tokens` when allocating
slots.

This PR follows the changes in vllm:
https://github.com/vllm-project/vllm/pull/20701

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing tests

- vLLM version: v0.9.2
- vLLM main:
cc876d0f29

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-10 18:47:45 +08:00
Li Wang
c7446438a9 [1/N][CI] Move linting system to pre-commits hooks (#1256)
### What this PR does / why we need it?

Follow the linting approach used by vllm-project/vllm:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml

Enable pre-commit to catch as many low-level errors as possible.

This PR is one step of #1241, whose purpose is to make the linting system
clearer and more convenient. This step moves the following checks to
pre-commit hooks:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.

TODO:
- clang-format (checks csrc with Google style):
the code needs cleanup first, disabled for now
- pymarkdown:
the code needs cleanup first, disabled for now
- shellcheck:
the code needs cleanup first, disabled for now

### Does this PR introduce _any_ user-facing change?

Only developer UX change:

https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally

```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```

### How was this patch tested?

CI passed with newly added and existing tests.

Co-authored-by: Yikun <yikunkero@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 14:17:15 +08:00
ApsarasX
643e6f5486 [Bugfix] Fix accuracy problem caused by mask pollution (#1678)
### What this PR does / why we need it?
If a small batch of short requests is sent first, forming a chunk with a
length <128, it will corrupt the `attn_mask_cache`, causing subsequent
requests that do not form a chunk to have accuracy issues.

The root cause of this problem is the use of in-place multiplication.
Modifying it to use out-of-place multiplication will resolve the
accuracy problem.
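
The difference between the two multiplication styles can be illustrated without any tensor library. In the sketch below a plain Python list stands in for the cached mask tensor, and the function names and the scale value are hypothetical; only the in-place versus out-of-place contrast is taken from the description above:

```python
# A cached mask shared by every request (a plain list stands in for a tensor).
attn_mask_cache = [1.0, 1.0, 1.0, 1.0]

def get_mask_in_place(scale):
    """Buggy pattern: scales the cached object itself."""
    mask = attn_mask_cache          # same object as the cache, not a copy
    for i in range(len(mask)):
        mask[i] *= scale            # in-place update leaks into the cache
    return mask

def get_mask_out_of_place(scale):
    """Fixed pattern: builds a new list and leaves the cache untouched."""
    return [value * scale for value in attn_mask_cache]

fixed = get_mask_out_of_place(-10000.0)
cache_after_fixed = list(attn_mask_cache)    # snapshot: still all ones

buggy = get_mask_in_place(-10000.0)
cache_after_buggy = list(attn_mask_cache)    # snapshot: polluted for later requests

print(cache_after_fixed, cache_after_buggy)
```

Both calls return the same scaled mask; the only difference is whether the shared cache survives the call unchanged.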


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Yes.

- vLLM version: v0.9.2
- vLLM main:
ad6c2e1a0b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 14:06:49 +08:00
ttanzhiqiang
60519c71bd shared_experts+router_experts merge all_reduce(Improve TTOP 5ms) (#1395)
### What this PR does / why we need it?
When all_reduce_merge is enabled, shared_experts skips the all_reduce
inside its mlp. Instead, a single all_reduce runs after both
shared_experts and router_experts have finished.
This helps in both prefill and decode whenever the shared_experts and
router_experts outputs are reduced together.
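
Merging is safe because a sum all_reduce is linear: reducing the locally added outputs gives the same result as reducing each output separately and adding afterwards. A minimal sketch, assuming a sum reduction and using made-up per-rank values (`all_reduce` here is a toy stand-in, not the real collective):

```python
def all_reduce(per_rank):
    """Simulate a sum all_reduce: return the elementwise sum over ranks."""
    return [sum(values) for values in zip(*per_rank)]

# Hypothetical partial outputs on 2 TP ranks (small values keep sums exact).
shared_partial = [[1.0, 2.0], [3.0, 4.0]]   # shared_experts output per rank
routed_partial = [[5.0, 6.0], [7.0, 8.0]]   # router_experts output per rank

# Unmerged: two collectives, then add the reduced results.
separate = [s + r for s, r in zip(all_reduce(shared_partial),
                                  all_reduce(routed_partial))]

# Merged: add locally on each rank, then run a single collective.
local_sum = [[s + r for s, r in zip(s_rank, r_rank)]
             for s_rank, r_rank in zip(shared_partial, routed_partial)]
merged = all_reduce(local_sum)

print(separate, merged)
```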
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main:
977180c912

---------

Signed-off-by: ttanzhiqiang <389825161@qq.com>
2025-07-10 12:07:05 +08:00
ApsarasX
89c1a0f006 [Bugfix] Fix memory-leak caused by dist._functional_collectives.reduce_scatter_tensor (#1380)
### What this PR does / why we need it?
In some cases, `dist._functional_collectives.reduce_scatter_tensor` can
cause its input tensor not to be released immediately after the current
layer ends. Instead, it will only be released when the GPU memory usage
of the current process reaches a certain threshold (approximately every
15 layers each time).

**Before Fix**

<img width="1441" alt="Screenshot 2025-06-24 01 26 13"
src="https://github.com/user-attachments/assets/72d5dbb3-c8c8-4778-bf64-8db7bab8aff0"
/>

**After Fix**
<img width="1475" alt="Screenshot 2025-06-24 01 23 43"
src="https://github.com/user-attachments/assets/6c69cfcd-a469-4ee5-b8c6-210aeb3a5bdf"
/>

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?


- vLLM version: v0.9.1
- vLLM main:
9ff2af6d2b

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-07-10 10:57:24 +08:00
wangxiyuan
b979ee353d [Misc] Code clean up (#1679)
Make model_runner_v1 more readable

- vLLM version: v0.9.2
- vLLM main:
baed180aa0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 14:33:40 +08:00
wangxiyuan
392fd7239b [Misc] Add attention mask (#1673)
Move attention mask from V0 to a common place.
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 09:12:03 +08:00
wangxiyuan
cc1588be50 [Misc] Code clean up (#1674)
Remove useless function
- vLLM version: v0.9.2
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:54:12 +08:00
wangxiyuan
830332ebfc Clean up v0.9.1 code (#1672)
vLLM has released 0.9.2. This PR drops 0.9.1 support.

- vLLM version: v0.9.1
- vLLM main:
b942c094e3

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-09 08:52:24 +08:00
NeverRaR
71de52d3a9 feat: add kv cache memory cache and skip dynamo guard (#1549)
### What this PR does / why we need it?

1. Loading the torchair cache sometimes fails because the available NPU
memory fluctuates, so this PR adds a new cache that saves the previous kv
cache bytes to avoid a possible crash while loading the torchair graph cache.
2. When caching is enabled but no cache exists yet, the first compilation
incurs the overhead of Dynamo guards. In this case, the graph is compiled
twice directly so that the guards are skipped (this improves TPOT by
3-4 ms).

### Does this PR introduce _any_ user-facing change?
Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to
control kv cache floating tolerance
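
One way to picture a floating tolerance is a comparison between the freshly measured KV cache size and the previously saved one. The sketch below is an assumption-heavy illustration, not the logic in this PR: the helper `pick_kv_cache_bytes`, the symmetric comparison, and the default of 64 MB are all hypothetical; only the environment variable name comes from the description above.

```python
import os

MB = 1024 * 1024

def pick_kv_cache_bytes(available_bytes, cached_bytes, tolerance_mb):
    """Hypothetical policy: reuse the saved size if the drift is small."""
    if cached_bytes is None:
        return available_bytes      # nothing saved yet: use the fresh value
    if abs(available_bytes - cached_bytes) <= tolerance_mb * MB:
        return cached_bytes         # small drift: keep the size the graph cache was built with
    return available_bytes          # large drift: fall back to the fresh value

# The default of 64 is an assumption for illustration only.
tolerance_mb = int(
    os.getenv("VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE", "64"))

print(pick_kv_cache_bytes(1000 * MB, 1010 * MB, tolerance_mb=64) // MB)
```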

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:37:14 +08:00
NeverRaR
df84cceca8 perf: use multicast to avoid padding decode request to prefill size (#1555)
### What this PR does / why we need it?
perf: use multicast to avoid padding decode request to prefill size

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:36:03 +08:00
wm901115nwpu
f08c4f15a2 fix spell error (#1654)
Fix spelling errors in the code

- vLLM version: v0.9.1
- vLLM main:
923147b5e8

Signed-off-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
Co-authored-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
2025-07-07 20:24:42 +08:00
Angazenn
18495f44b2 [BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636)
### What this PR does / why we need it?
This PR fixes a bug in the max_num_tokens_across_dp calculation. Earlier
versions computed it as graph_pad_size plus the actual max_num_tokens,
which yields different max_num_tokens_across_dp values on different dp
ranks. When padding is required, this can produce wrong padding.
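
A small numeric sketch shows why adding a per-rank pad size to a global maximum makes ranks disagree. It assumes that graph_pad_size is the gap between a rank's own token count and the graph batch size; that definition, the token counts, and the "consistent alternative" are illustrative assumptions, not necessarily the fix applied in this PR:

```python
# Hypothetical per-rank token counts on 3 DP ranks and one graph batch size.
num_tokens_per_rank = [3, 5, 2]
graph_batch_size = 8

actual_max = max(num_tokens_per_rank)    # the same value on every rank

# Old style (as read from the description): each rank adds its own pad size.
old_values = [(graph_batch_size - n) + actual_max for n in num_tokens_per_rank]

# One consistent alternative: pad each rank first, then take the max over ranks.
padded = [n + (graph_batch_size - n) for n in num_tokens_per_rank]
new_values = [max(padded)] * len(num_tokens_per_rank)

print(old_values, new_values)
```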

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed normally.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-07-07 20:03:02 +08:00
ApsarasX
c58accc15e [Bugfix] Support Qwen3-MOE on aclgraph mode (#1381)
### What this PR does / why we need it?
Fix the shape of the `npu_moe_init_routing` input parameters to support
aclgraph mode on qwen3-moe

In addition to this PR, resolving the `gatherv3` error might be
necessary. See related PR
https://github.com/vllm-project/vllm-ascend/pull/1297
https://github.com/vllm-project/vllm-ascend/pull/1446

Thanks to @yiz-liu  for providing the idea

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on Qwen3-30B-A3B

Closes: https://github.com/vllm-project/vllm-ascend/issues/1368

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 15:29:36 +08:00
Vincent Yuan
eb390545ec [Performance] Disable JIT and nd2nz to improve performance for Altlas 300I series (#1591)
### What this PR does / why we need it?

Since running on Atlas 300I Duo was initially supported in #1333,
this PR disables the JIT compiler for the 310P and changes the data
format to NZ for the weights in the vocabulary embedding and QKV
projection layers, which improves performance.

See #1563 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Test manually:
https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
2025-07-05 16:29:21 +08:00
Mengqing Cao
dd22ac38b2 [CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136)
### What this PR does / why we need it?
1. Run the DeepSeek accuracy UT per PR (multicard CI time increases by 9 min).
2. Run the spec decode e2e test on V1 per PR (singlecard CI time
increases by 3 min; part of it is disabled because it does not work yet).
~~3. Align the output regardless of whether dbo is enabled~~
    The generated results with and without dbo cannot be aligned:

https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136
4. Skip the V0 MTP test due to the failure in
https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816
5. Fix some version conflicts.
### How was this patch tested?
CI passed with newly added tests.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-04 18:05:45 +08:00
wangxiyuan
343955c7ac [CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625)
This commit
78fe77534b
from vllm reverted the change for FusedMoEParallelConfig

This PR does the same to fix the CI error

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-04 17:54:33 +08:00
Angazenn
a5f33590d3 [CORE]initial support for torchair with non-mla backend (#1506)
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM-Ascend only allowed DeepSeek to use torchair. Now it
can also be used with Pangu. In addition, a supported-model list
controls which types of models can use torchair.

### How was this patch tested?
We have tested it with PanguProMoE on both 800IA2 and 300I Duo platforms,
and the model generates answers normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
Angazenn
9fbd8017c0 [Quantization]300I Duo support w8a8 quantization (#1560)
### What this PR does / why we need it?
This pr supports w8a8 on 300I Duo platform. The main change is to use
`npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
offline inference on 310p runs normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:12:46 +08:00
wangxiyuan
a45dfde283 [CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602)
Make CI happy

1. c1909e7e8c changed how moeConfig is initialized.
2. 48fb076cbc changed the input batch logic.

This PR adapts vllm-ascend to these changes.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1600

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-03 18:36:17 +08:00
Li Wang
30bf7014d0 [Bugfix] Add func swap_states to fix MLA attention (#1580)
### What this PR does / why we need it?
MLA attention still uses the `swap_states` attribute of gpu_input_batch, which leads to
the error `AttributeError: 'InputBatch' object has no attribute 'swap_states'`.

This PR fixes the MLA input batch error.
### How was this patch tested?
will be tested by #1136

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 17:42:53 +08:00
Zhu Yi Lin
6b80c5acba Fix W8A8 fused moe bug (#1529)
### What this PR does / why we need it?
1. Drop some useless code for w8a8 fusedmoe.
2. Add an int8 kv cache check.
3. Add more UTs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with newly added tests.

---------

Signed-off-by: zhuyilin <809721801@qq.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-02 16:40:51 +08:00
wangxiyuan
641a4e6092 [CI] Cache sampled token ids in model runner to fix CI error (#1573)
### What this PR does / why we need it?
The vLLM change 7f280d69c9 breaks vllm-ascend.

This PR fixes the broken CI.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/1572

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-02 12:11:14 +08:00
Pleaplusone
0e43813120 [ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546)
### What this PR does / why we need it?

This PR (adapted from
2863befce3)
updates the CachedRequestData definition to use a single instance shared
across all requests in a batch, instead of creating a new instance per
request.

CI was broken by vLLM's model_runner change: `ERROR 07-01 09:53:53
[core.py:521] TypeError: 'CachedRequestData' object is not iterable`.
This PR modifies the model_runner to fix it.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passing CI will verify this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 06:05:21 +08:00
Shanshan Shen
8013634e9c [Structured Output] Remove redundant check for grammar_bitmask (#1459)
### What this PR does / why we need it?
Remove the redundant check, since this is already checked at
https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:39:19 +08:00
whx
f286265791 [BugFix] Address PrefillCacheHit state to fix prefix cache accuracy bug (#1498)
When AscendScheduler is used with prefix-cache enabled and chunk-prefill
disabled, there is an accuracy problem because mla_v1 has no branch
to handle this scenario. This PR fixes it.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-30 16:51:20 +08:00
Li Wang
5f8241c25c [V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
Add support for V1 pooling tasks while changing as little existing code as
possible. Note that `vllm.v1.worker.gpu_input_batch` is moved down into
vllm-ascend to decouple it from the frequently changing upstream
interfaces.
### How was this patch tested?
CI passed with newly added and existing tests. A simple test was also
run locally first, adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, as shown
below:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
yiz-liu
75d05ee200 [Core] Fix block table shape to make Prefix cache work with Ascend scheduler (#1446)
### What this PR does / why we need it?

This fixes the shape of block_table, a problem introduced by hybrid kv
groups several weeks ago.

An error is raised when prefix-cache (eager or not) and Ascend
Scheduler are enabled at the same time. Sending two identical requests
reproduces it.

v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1297

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test manually

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-30 11:25:19 +08:00
Zhu Yi Lin
b308a7a258 support pangumoe w8a8c8 and docs (#1477)
### What this PR does / why we need it?
support pangu moe w8a8c8

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with newly added tests.

Signed-off-by: zhuyilin <809721801@qq.com>
2025-06-28 18:51:07 +08:00
Angazenn
c59d69d9e6 [PERF]support MERRouter (#1421)
### What this PR does / why we need it?
This PR introduces an expert rearrange algorithm for the PanguProMoE model.
Unlike the original grouped topk, it filters out the top experts
that are allocated more tokens. Therefore, fewer experts need to be loaded
when calculating gmm.

We have tested this algorithm with PanguProMoE-72B on the 300I Duo platform
and the 800I A2 platform. On the 300I Duo platform, setting
`num_voted_experts` to 5 achieves both good performance and accuracy. On
800I A2, it is still set to 8 so that the original Pangu grouped topk is used.
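
One possible reading of the voting step is sketched below: count how many tokens pick each expert, keep the `num_voted_experts` most popular ones, and route only to those. This is an interpretation of the description, not the PR's implementation; the helper names, the tie-breaking rule, and the sample routing are hypothetical.

```python
from collections import Counter

def vote_experts(token_topk, num_voted_experts):
    """Keep the experts chosen by the most tokens (hypothetical reading)."""
    votes = Counter(expert for choices in token_topk for expert in choices)
    # Most votes first; ties broken by the smaller expert id.
    ranked = sorted(votes, key=lambda e: (-votes[e], e))
    return sorted(ranked[:num_voted_experts])

def restrict_routing(token_topk, voted):
    """Drop each token's choices that fall outside the voted expert set."""
    allowed = set(voted)
    return [[e for e in choices if e in allowed] for choices in token_topk]

# Made-up top-2 choices for 4 tokens over 4 experts.
token_topk = [[0, 1], [1, 2], [1, 3], [0, 2]]
voted = vote_experts(token_topk, num_voted_experts=3)
routed = restrict_routing(token_topk, voted)

print(voted, routed)
```

With fewer distinct experts in play, fewer expert weights need to be loaded for the grouped matmul, which is the benefit the description points to.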

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:14:49 +08:00
Angazenn
8fa188111d [PERF]support H2P communication optimization for PanguProMoe (#1463)
### What this PR does / why we need it?
In this PR, we support H2P communication optimization when running
PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather`
in place of `all_reduce` to improve performance:

original layer:
input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm
--> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce
now:
input_layernorm --> tp all_gather --> attn --> tp reduce_scatter -->
post_attention_layernorm --> all_rank all_gather --> moe/mlp -->
all_rank reduce_scatter

In addition, because `reduce_scatter` requires a num_tokens that is
divisible by the group size, the sequences need to be padded based on
`max_tokens_across_dp`.
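
The two facts used above, that `reduce_scatter` followed by `all_gather` reproduces `all_reduce`, and that the token count must be padded to a multiple of the group size, can be sketched in plain Python. The collectives below are toy simulations over lists, and the helper `pad_to_multiple` and the sample values are hypothetical:

```python
def pad_to_multiple(num_tokens, group_size):
    """Round num_tokens up to the next multiple of group_size."""
    return -(-num_tokens // group_size) * group_size

def all_reduce(per_rank):
    """Sum all_reduce: every rank receives the full elementwise sum."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def reduce_scatter(per_rank):
    """Sum across ranks, then give each rank one equal contiguous shard."""
    total = [sum(values) for values in zip(*per_rank)]
    shard = len(total) // len(per_rank)      # needs an evenly divisible length
    return [total[i * shard:(i + 1) * shard] for i in range(len(per_rank))]

def all_gather(per_rank):
    """Every rank receives the concatenation of all shards."""
    full = [x for shard in per_rank for x in shard]
    return [list(full) for _ in per_rank]

group_size = 2
num_tokens = 3                               # not divisible by group_size
padded_len = pad_to_multiple(num_tokens, group_size)

# Made-up per-rank partial sums, one value per token, zero-padded at the end.
partials = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
partials = [p + [0.0] * (padded_len - num_tokens) for p in partials]

via_all_reduce = all_reduce(partials)
via_rs_ag = all_gather(reduce_scatter(partials))

print(via_all_reduce == via_rs_ag, padded_len)
```

Splitting the collective this way is what lets other work, such as the layernorm in the pipeline above, run on the scattered shard between the two halves.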

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR has been tested with both offline and online inference using
PanguProMoE-72B.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:10:27 +08:00
Angazenn
5c53cbaf2a [BugFix]Fix bugs when initializing communication groups with dp on 300I Duo (#1478)
### What this PR does / why we need it?
This PR fixes a bug that occurs when broadcast is used with cpu_group
while running dp. The `broadcast310p` patch takes effect for both
cpu_group and device group, but it is only needed for device group.
Hence a wrapper is added so that cpu_group uses the native torch
broadcast, which resolves the bug.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With this PR, DP on 310p runs normally and generates reasonable answers.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:07:52 +08:00
Mengqing Cao
5f4391652f [PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483)
### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to test accuracy
on V1

### Does this PR introduce _any_ user-facing change?
support prompt logprobs output

### How was this patch tested?
CI passed with accuracy test.

Tested with lm_eval, which uses prompt logprobs as output to measure
accuracy:
```bash
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```
After this PR, the accuracy test results of `Qwen/Qwen2.5-7B-Instruct`
on V1 are:
```bash
|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                            |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1043

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 09:38:52 +08:00
Mengqing Cao
d59e7fa095 [CI] Pin transformers<4.53.0 and fix EPLB load_weights to make CI passed (#1482)
### What this PR does / why we need it?

- Fix vLLM EPLB break
e9fd658a73
by recovering load_weights back to [v0.9.1
version](07b8fae219)
temporarily.

- Fix transformers>=4.53.0 image processor break
Related: https://github.com/vllm-project/vllm-ascend/issues/1470

- Mirror torch_npu requirements to pyproject.toml

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-28 00:12:43 +08:00
wangxiyuan
5968dff4e0 [Build] Add build info (#1386)
Add a static build_info py file to show SoC and sleep mode info. It helps
keep the code clean and makes error messages friendlier for
users.

This PR also adds the unit tests for vllm_ascend/utils.py.

This PR also adds the base test class for all UTs in tests/ut/base.py.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-27 09:14:43 +08:00
sdmyzlp
53c2d58ae1 Handle with_prefill_across_dp for multistream mla (#1322)
### What this PR does / why we need it?
After #1094, decode might be executed in non-compiled mode regardless of
`torchair_graph_config.enabled`. This causes multistream MLA to fail,
because it assumes torchair compiled mode for decode whenever
`torchair_graph_config.enabled == True`.
This PR refines that assumption to fix the failure.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested both offline, and by graph mode mla e2e testcase.

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-06-26 09:32:07 +08:00
yiz-liu
2690697caa [Bugfix] Reset all unused positions to prevent out-of-bounds in GatherV3 (#1416)
### What this PR does / why we need it?
Reset all unused positions in `NPUModelRunner` to prevent out-of-bounds
asserts in the `GatherV3` operator.

Currently, in
[`get_splitfuse_attn_mask`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/attention/attention.py#L124),
the `position` tensor may contain values that exceed the dimensions of
the attention mask, triggering a `GatherV3` boundary check failure.
These invalid indices originate from stale “dirty” entries left over in
`position` due to padding logic in the ACL graph. Specifically, in
[`_process_reqs`](https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L989),
the variable `num_input_tokens` is always greater than or equal to
`total_num_scheduled_tokens`, so any positions not explicitly cleared
from a previous batch will persist and cause this sporadic error.

BTW, in the original vLLM implementation, masks are constructed
internally using other args, so these lingering values do not surface.
However, on the Ascend platform—where split-fuse attention requires
externally supplied masks—these residual indices become critical and
lead to this elusive, hard-to-reproduce failure.

The fix is to explicitly reset or zero out all unused entries in the
`position` tensor before passing it to `GatherV3`, ensuring that every
index lies within the valid range of the attention mask.
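
The failure mode and the fix can be illustrated with a minimal sketch in plain Python. This is not the `NPUModelRunner` code: the helper names, the buffer contents, and the bounds-checked gather standing in for `GatherV3` are all assumptions made for illustration.

```python
def reset_unused_positions(positions, num_scheduled_tokens):
    """Zero every slot past the tokens scheduled in this step (in place)."""
    for i in range(num_scheduled_tokens, len(positions)):
        positions[i] = 0
    return positions


def gather_mask_rows(mask, positions):
    """Hypothetical stand-in for a bounds-checked gather: mask[p] per position."""
    for p in positions:
        if not 0 <= p < len(mask):
            raise IndexError(f"position {p} is outside mask of size {len(mask)}")
    return [mask[p] for p in positions]


# A 4x4 causal mask and a persistent buffer padded to 8 slots.
mask = [[1] * (i + 1) + [0] * (3 - i) for i in range(4)]
# Only the first 3 slots belong to this step; the rest are stale leftovers
# from an earlier, longer batch (values chosen arbitrarily).
positions = [0, 1, 2, 517, 518, 519, 520, 521]
total_num_scheduled_tokens = 3

reset_unused_positions(positions, total_num_scheduled_tokens)
rows = gather_mask_rows(mask, positions)  # no out-of-bounds index remains
```

Without the reset, the gather in this sketch raises on the first stale index, which mirrors the boundary check failure described above.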

Closes: https://github.com/vllm-project/vllm-ascend/issues/1038

### Does this PR introduce _any_ user-facing change?
No


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-26 09:27:43 +08:00
Pr0Wh1teGivee
2fda60464c [Perf] Use fused ops npu_top_k_top_p (#1308)
### What this PR does / why we need it?
Use the fused op `torch_npu.npu_top_k_top_p(logits, p, k)` when both p and k
are not None; otherwise fall back to the original implementation. The
replacement takes place automatically when `VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1`.

This patch uses `npu_top_k_top_p`, which requires
torch_npu>=2.5.1.post1.dev20250619
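
The dispatch rule can be sketched in plain Python as follows. This is a simplified single-row illustration, not the patch itself: `top_k_top_p_reference` is a hypothetical fallback written for this example, and `fused_op` is a hypothetical injected callable standing in for the fused operator so that the sketch runs without `torch_npu`.

```python
import math
import os


def top_k_top_p_reference(logits, p, k):
    """Unfused path for one non-empty row: keep the k largest logits, then the
    smallest top-probability set whose mass reaches p; mask the rest to -inf."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if k is not None:
        order = order[:k]
    keep = order
    if p is not None:
        peak = logits[order[0]]
        # Softmax over the surviving candidates, shifted for stability.
        weights = [math.exp(logits[i] - peak) for i in order]
        total = sum(weights)
        keep, mass = [], 0.0
        for i, w in zip(order, weights):
            keep.append(i)
            mass += w / total
            if mass >= p:
                break
    kept = set(keep)
    return [x if i in kept else -math.inf for i, x in enumerate(logits)]


def apply_top_k_top_p(logits, p, k, fused_op=None):
    """Use the fused op only when it is enabled, available, and both p and k
    are given; otherwise fall back to the reference path."""
    enabled = os.environ.get("VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE") == "1"
    if enabled and fused_op is not None and p is not None and k is not None:
        return fused_op(logits, p, k)
    return top_k_top_p_reference(logits, p, k)


filtered = apply_top_k_top_p([2.0, 1.0, 0.0, -1.0], p=0.8, k=3)
```

Passing the fused operator as an argument keeps the fallback testable on machines without the required `torch_npu` build.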

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested with DeepSeek R1; UT passed.

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-06-25 20:59:06 +08:00
yuancaoyaoHW
e7efc7e7e7 [BugFix] Remove unused patch_eagle.py for CI. (#1385)
### What this PR does / why we need it?
This PR aims to address a long-standing **CI bug** and remove unused
code. The specific changes include:

1. **Fixing CI Bug**: Resolves the root cause of CI test failures or
instability. This often stems from incorrect environment configurations,
dependency version conflicts, or flawed test script logic. This fix
ensures the reliability and consistency of the CI pipeline.
2. **Removing `patch_eagle.py`**: Deletes the `patch_eagle.py` file,
which is no longer utilized by the project. This file was likely legacy
code, experimental code, or its functionality has since been replaced by
other modules. Its removal helps reduce codebase complexity, improves
maintainability, and prevents potential confusion.

### Does this PR introduce _any_ user-facing change?
No, this PR primarily focuses on internal CI stability maintenance and
code cleanup. It does not introduce any user-visible changes to APIs,
interfaces, or other behaviors.

### How was this patch tested?
CI passed. Specifically:

1. **Existing CI Pipelines Passed**: After fixing the CI bug, all
existing CI tests and pipelines were verified to run correctly and pass
successfully.
2. **Code Cleanup Verified**: Following the removal of `patch_eagle.py`,
it was ensured that any related functional modules (if applicable)
continue to work as expected, without introducing new regressions. This
was typically verified by running the project's main test suite.

Signed-off-by: yuancaoyaoHW <a2749322671@gmail.com>
2025-06-25 20:36:05 +08:00
sharonyunyun
941269a6c5 adjusting the communication method in graph mode (#1194)
### What this PR does / why we need it?
Communication performance optimization: replace allreduce with
reduce_scatter+all_gather in the MLA layer's TP group, to remove
stridedsliced and all_gather in the MoE layer.
When tp > 1, it is enabled during the decode phase of graph mode
when enable_multistream_moe, MLA, use_v1, and MC2 are used.
According to the end-to-end RL inference test results, this PR brings a
3% gain in the decode stage.
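
The replacement relies on an all-reduce being equivalent to a reduce-scatter followed by an all-gather. A minimal single-process simulation of that identity is sketched below; the rank lists and helper functions are illustrative stand-ins, not the collective calls used in the PR, and the vector length is assumed to be divisible by the number of ranks.

```python
def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' vectors."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Rank r ends up with only the r-th shard of the element-wise sum."""
    world = len(per_rank)
    shard = len(per_rank[0]) // world
    total = [sum(vals) for vals in zip(*per_rank)]
    return [total[r * shard:(r + 1) * shard] for r in range(world)]


def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


# Two simulated TP ranks, each holding a partial result of 4 values.
partials = [[1, 2, 3, 4], [10, 20, 30, 40]]
shards = reduce_scatter(partials)   # each rank holds one reduced shard
gathered = all_gather(shards)       # full reduced vector on every rank
```

Between the two steps each rank holds only its own shard of the reduced result, so work placed there operates on a fraction of the data per rank.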

**Before Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/1bb5dfa1-809b-410a-90c9-c5fd23cff003)
Evaluation

![image](https://github.com/user-attachments/assets/0b8ea0c7-88e7-410f-9ef4-f0cfe910cdc7)

![image](https://github.com/user-attachments/assets/94fde910-c125-4c2e-8de4-88fc3fafc057)

**After Improvement**
Profiling kernel_details

![image](https://github.com/user-attachments/assets/55fac0e0-11f2-4654-8fd4-287949e0b29e)
Evaluation

![image](https://github.com/user-attachments/assets/e923f74b-29c4-4171-9382-40a00cf05df0)

![image](https://github.com/user-attachments/assets/5dba7967-07ea-4926-a8be-804bfd34e3e4)

### Does this PR introduce _any_ user-facing change?
Users need to configure enable_multistream_moe=True

### How was this patch tested?
Add e2e test cases to cover code logic

Signed-off-by: sharonyunyun <zhangying134@huawei.com>
2025-06-25 19:56:49 +08:00
wangxiyuan
ca884ef86d [Misc] Clean up useless code for LLM initialize (#1373)
This PR cleans up useless code in the LLM setup, which makes the code
clearer.
1. remove useless `self.xxx` property
2. change `set_random_seed` to `seed_everything`
3. remove `set_custom_all_reduce`, it's only used for cuda

This is a code cleanup only; no code logic is changed.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-25 16:20:14 +08:00
Mengqing Cao
52317f92cb [DP] Tiny fix of dp and update example (#1273)
### What this PR does / why we need it?
Add `max_num_tokens_across_dp` to AscendMetadata to fix dp

This PR fixes the bug introduced by
https://github.com/vllm-project/vllm-ascend/pull/1229, which added the arg
`max_num_tokens_across_dp` when dp_size > 1.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-25 11:03:04 +08:00
Li Wang
5f5800ba42 [Bugfix] Sync MRotaryEmbedding interface change to recover CI (#1399)
### What this PR does / why we need it?

Sync MRotaryEmbedding interface change to recover main CI
(https://github.com/vllm-project/vllm/pull/19939)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-24 22:56:39 +08:00
wangxiyuan
9cbce423ce [MISC] Remove useless patch (#1366)
### What this PR does / why we need it?
`stateless_init_dp_group` in vLLM now works on non-CUDA platforms.
Remove this useless patch.

The patch was introduced in vllm-ascend by
e74331a1ed
(v0.8.4rc2)
vLLM upstream merged:
3e472d882a
(v0.8.0)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-24 10:05:59 +08:00
lyj-jjj
5177bef87a support fused_moe_allgather_ep (#1335)
### What this PR does / why we need it?
support fused_moe_allgather_ep

### How was this patch tested?
It was tested by UT.

Signed-off-by: lyj-jjj <liuyingjun5@huawei.com>
2025-06-23 22:03:38 +08:00