Commit Graph

489 Commits

Author SHA1 Message Date
Yikun Jiang
e4e9ea02ab Upgrade vLLM version to v0.9.2 (#1652)
### What this PR does / why we need it?

This patch upgrades the vLLM version to v0.9.2. It does not remove the
v0.9.1-compatible code, to keep the review easy.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
14601f5fba
- Accuracy test with 0.9.2:
https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-08 14:18:17 +08:00
NeverRaR
71de52d3a9 feat: add kv cache memory cache and skip dynamo guard (#1549)
### What this PR does / why we need it?

1. Loading the torchair cache can sometimes fail because of fluctuations in
NPU memory, so this PR adds a new cache that saves the old KV cache byte size
to avoid a possible crash while loading the torchair graph cache.
2. When caching is enabled but the cache does not yet exist, the first
compilation introduces the overhead of the Dynamo guard. In this case we
compile directly twice to skip the guards (this brings 3-4 ms of TPOT
optimization).

### Does this PR introduce _any_ user-facing change?
Adds a new environment variable `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to
control the KV cache floating tolerance.
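
For illustration, a minimal sketch of how the new knob might be set, assuming it is read from the environment before the engine starts (the 512 MB value and model name are illustrative only):

```python
import os

# Illustrative only: tolerate up to 512 MB of KV-cache size drift between runs
# before the cached torchair artifacts are treated as invalid.
os.environ["VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE"] = "512"

from vllm import LLM  # import vLLM only after the env var is set

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat")
```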

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:37:14 +08:00
NeverRaR
df84cceca8 perf: use multicast to avoid padding decode request to prefill size (#1555)
### What this PR does / why we need it?
perf: use multicast to avoid padding decode requests to the prefill size

### How was this patch tested?

- vLLM version: v0.9.1
- vLLM main:
1fd471e957

Signed-off-by: boying <897013703@qq.com>
2025-07-07 22:36:03 +08:00
wm901115nwpu
f08c4f15a2 fix spell error (#1654)
Fix spelling errors in the code.

- vLLM version: v0.9.1
- vLLM main:
923147b5e8

Signed-off-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
Co-authored-by: unicorn <unicorn@unicorns-MacBook-Pro.local>
2025-07-07 20:24:42 +08:00
Mengqing Cao
f2a20393a2 [CI] Fix mypy check in CI (#1655)
### What this PR does / why we need it?
Fix mypy check in CI:
https://github.com/vllm-project/vllm-ascend/actions/runs/16115919385/job/45469646509?pr=1654

Mypy failed because of a newer numpy version, so we need to pin
`numpy==1.26.4` in vllm-ascend.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 20:19:16 +08:00
Angazenn
18495f44b2 [BugFix] Fix max_num_tokens_across_dp calculation bugs in attention_v1_torchair (#1636)
### What this PR does / why we need it?
This PR fixes a bug caused by the `max_num_tokens_across_dp` calculation. In
the earlier version, it was computed as `graph_pad_size` plus the actual
`max_num_tokens`, which results in different `max_num_tokens_across_dp`
values across DP ranks. If padding is required, this can lead to incorrect
padding.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed normally.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-07-07 20:03:02 +08:00
Zheng Wengang
9c886d0a1f [EPLB] support deepseek eplb strategy (#1196)
### What this PR does / why we need it?

This PR implements the DeepSeek Expert Parallel Load Balancing (EPLB)
strategy to optimize expert distribution in vllm-ascend. The
implementation:
- Adapts the expert-map format to work with vllm-ascend's architecture
- Provides the DeepSeek mechanism for balancing expert workload across
devices

### Does this PR introduce _any_ user-facing change?

This PR adds a new script that allows users to:
- Generate expert map configurations based on workload analysis
- Optimize expert distribution for their specific use case

### How was this patch tested?

To use this feature:
1. First collect expert heat information during model execution
2. Run the provided script to generate the expert map configuration
3. Apply the generated configuration to your vllm-ascend deployment

User example:

```bash
# expert_load_view.pt:  dumped expert heat info file
python3 examples/eplb/eplb_strategy.py --exp_name 'deepseek_demo' \
    --input_path expert_load_view.pt  --output_path examples/eplb/results/demo \
    --num_nodes 4
```

---------

Signed-off-by: ZhengWG <zwg0606@gmail.com>
2025-07-07 17:22:08 +08:00
wangyanhui-cmss
4e29c5a808 Add ut for test_pooling_model_runner.py (#1640)
### What this PR does / why we need it?
 Add unit tests in test_pooling_model_runner.py

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
 python -m unittest  test_pooling_model_runner.py


- vLLM version: v0.9.1
- vLLM main:
2e610deb72

---------

Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
2025-07-07 17:12:11 +08:00
Yikun Jiang
493768eb30 Record vLLM commit in PR description (#1623)
### What this PR does / why we need it?
This patch enables recording of the vLLM commit and also cleans up the unused
commit message note in the PR.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- Tested on https://github.com/Yikun/vllm-ascend/pull/33 and the vllm commit
is refreshed as expected.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-07 10:20:38 +08:00
Mengqing Cao
7efa4e92fe [CI] Fix oom in chunk prefill (#1622)
### What this PR does / why we need it?
Add resource-clearing logic to fix the OOM issue when testing
`tests/e2e/singlecard/core/ascend_scheduler`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-07 10:14:40 +08:00
ApsarasX
c58accc15e [Bugfix] Support Qwen3-MOE on aclgraph mode (#1381)
### What this PR does / why we need it?
Fix the shape of the `npu_moe_init_routing` input parameters to support
aclgraph mode on qwen3-moe

In addition to this PR, resolving the `gatherv3` error might be
necessary. See related PR
https://github.com/vllm-project/vllm-ascend/pull/1297
https://github.com/vllm-project/vllm-ascend/pull/1446

Thanks to @yiz-liu  for providing the idea

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on Qwen3-30B-A3B

Closes: https://github.com/vllm-project/vllm-ascend/issues/1368

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 15:29:36 +08:00
zhangxinyuehfad
14373f65d7 [Test] Remove V0 accuracy test and enable MoE and VL test on V1 (#1574)
### What this PR does / why we need it?
Update accuracy test
1. remove accuracy report on V0
2. add parallel and execution mode
3. add Qwen/Qwen3-30B-A3B and remove Qwen/Qwen2.5-7B-Instruct


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-06 11:10:19 +08:00
Yikun Jiang
0c1d239df4 Add unit test local cpu guide and enable base testcase (#1566)
### What this PR does / why we need it?
Use the base test and clean up all manual patch code
- Cleanup EPLB config to avoid tmp test file
- Use BaseTest with global cache
- Add license
- Add a doc to setup unit test in local env 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-06 10:42:27 +08:00
Vincent Yuan
eb390545ec [Performance] Disable JIT and nd2nz to improve performance for Atlas 300I series (#1591)
### What this PR does / why we need it?

Since running on Atlas 300I Duo was initially supported in #1333, this PR
disables the JIT compiler for the 310P and changes the weight data format to
NZ in the vocabulary embedding and QKV projection layers, which helps improve
performance.

See #1563 
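
For reference, a rough sketch of the two tweaks using standard torch_npu APIs (`set_compile_mode` and `npu_format_cast`); the format constant and the helper below are illustrative, not the code added by this PR:

```python
import torch
import torch_npu

# Disable JIT compilation on 310P so operators use pre-compiled binaries.
torch_npu.npu.set_compile_mode(jit_compile=False)

ACL_FORMAT_FRACTAL_NZ = 29  # NZ (fractal) layout id used by torch_npu

def to_nz(weight: torch.Tensor) -> torch.Tensor:
    # Cast an ND weight (e.g. vocabulary embedding or QKV projection weight)
    # to the NZ format, which is friendlier to the 310P matmul units.
    return torch_npu.npu_format_cast(weight.npu(), ACL_FORMAT_FRACTAL_NZ)
```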

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Test manually:
https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339

Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
2025-07-05 16:29:21 +08:00
Mengqing Cao
dd22ac38b2 [CI/UT][Refactor] move e2e spec decode and deepseek acc test to per pr (#1136)
### What this PR does / why we need it?
1. run deepseek acc ut per pr --- multicard CI time increased by 9 min
2. run spec decode e2e test on v1 per pr --- singlecard CI time
increased by 3 min (partly disabled because it does not work for now)
~~3. align the output of whether dbo is enabled or not~~
    The generated results with and without dbo cannot be aligned.

https://github.com/vllm-project/vllm-ascend/actions/runs/15822900528/job/44600029405?pr=1136
4. skip V0 mtp test due to failure in
https://github.com/vllm-project/vllm-ascend/actions/runs/16012172833/job/45171988816
5. fix some version conflicts
### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-04 18:05:45 +08:00
wangxiyuan
343955c7ac [CI] Follow vLLM FusedMoEParallelConfig interface change and clean up unused config (#1625)
This commit
78fe77534b
from vllm reverted the change for FusedMoEParallelConfig

This PR does the same to fix the CI error.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-04 17:54:33 +08:00
zhangxinyuehfad
4e910186de [CI/UT] Unify model usage via ModelScope in CI (#1207)
### What this PR does / why we need it?
Unify Model Usage via ModelScope

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-07-04 10:52:17 +08:00
Angazenn
a5f33590d3 [CORE]initial support for torchair with non-mla backend (#1506)
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.

### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM-Ascend only allowed DeepSeek to use torchair. Now it can
also be used with Pangu. Besides, we add a supported-model list to control
which types of models can use torchair.
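
For illustration, a minimal sketch of enabling torchair graph mode for a supported model through vllm-ascend's `additional_config` (the model path is a placeholder):

```python
from vllm import LLM

# Illustrative only: turn on torchair graph mode via additional_config.
llm = LLM(
    model="/path/to/pangu-pro-moe",  # placeholder path to a supported model
    additional_config={"torchair_graph_config": {"enabled": True}},
)
```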

### How was this patch tested?
We have tested it with PanguProMoE on both 800I A2 and 300I Duo platforms,
and the model generates answers normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:21:42 +08:00
Angazenn
9fbd8017c0 [Quantization]300I Duo support w8a8 quantization (#1560)
### What this PR does / why we need it?
This PR supports W8A8 on the 300I Duo platform. The main change is to use
`npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
offline inference on 310p runs normally.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-03 22:12:46 +08:00
Yikun Jiang
6d7cb14a24 Fix lint in examples/offline_embed.py (#1618)
### What this PR does / why we need it?
Fix lint

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-03 21:40:29 +08:00
xleoken
e511ddd67d [Bug] Fix wrong modelscope env set order (#1611)
### What this PR does / why we need it?
The `os.environ["VLLM_USE_MODELSCOPE"] = "True"` must be set before the vLLM
modules are imported.

If it is not, the following error occurs:
```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/xleoken/projects/vllm-ascend/examples/offline_embed.py", line 48, in <module>
    model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1018, in create_engine_config
    model_config = self.create_model_config()
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 910, in create_model_config
    return ModelConfig(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 120, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 528, in __post_init__
    hf_config = get_config(self.hf_config_path or self.model,
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 321, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 266, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 491, in cached_files
    raise OSError(
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[ERROR] 2025-07-03-15:27:10 (PID:333665, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
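
A minimal sketch of the corrected ordering (the key point is that the environment variable is set before any vLLM import; the model name is taken from the example script):

```python
import os

# Must be set before vLLM (and transformers) are imported, otherwise the model
# is resolved against huggingface.co instead of ModelScope.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
```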

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Local.

Signed-off-by: xleoken <xleoken@163.com>
2025-07-03 18:50:53 +08:00
wangxiyuan
a45dfde283 [CI] Fix FusedMoEConfig and input batch failure to recover CI (#1602)
Make CI happy

1.
c1909e7e8c
changed the MoE config initialization.
2.
48fb076cbc
changed input batch logic.

This PR adapts vllm-ascend to these changes.

Closes: https://github.com/vllm-project/vllm-ascend/issues/1600

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-03 18:36:17 +08:00
yupeng
d96da1f00c [DOC] Fix word spelling (#1595)
### What this PR does / why we need it?
Fix word spelling in DOC.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 21:42:39 +08:00
zhanghw0354
9fb3d558e5 [Test]Add unit test for platform.py (#1476)
### What this PR does / why we need it?
According to issue #1298 , this pull request adds unit test code for
platform.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

---------

Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhuyilin <809721801@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Angazenn <92204292+Angazenn@users.noreply.github.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: Zhu Yi Lin <116337067+GDzhu01@users.noreply.github.com>
2025-07-02 17:46:06 +08:00
Li Wang
30bf7014d0 [Bugfix] Add func swap_states to fix MLA attention (#1580)
### What this PR does / why we need it?
MLA attention still uses the `swap_states` attribute of `gpu_input_batch`, which leads to
the error `AttributeError: 'InputBatch' object has no attribute 'swap_states'`.

This PR fixes the MLA input batch error by adding the missing `swap_states` helper.
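
A rough sketch of what such a helper looks like, assuming the input batch keeps per-request tensors indexed by slot (field names are illustrative, not the actual vllm-ascend implementation):

```python
import torch

class InputBatch:
    # Only the fields relevant to the sketch are shown.
    def __init__(self, max_num_reqs: int, max_model_len: int):
        self.req_ids: list = [None] * max_num_reqs
        self.token_ids_cpu = torch.zeros(max_num_reqs, max_model_len, dtype=torch.int32)
        self.num_computed_tokens_cpu = torch.zeros(max_num_reqs, dtype=torch.int32)

    def swap_states(self, i1: int, i2: int) -> None:
        # Swap everything tracked for two request slots so the batch stays
        # consistent after requests are reordered (as MLA attention expects).
        self.req_ids[i1], self.req_ids[i2] = self.req_ids[i2], self.req_ids[i1]
        self.token_ids_cpu[[i1, i2]] = self.token_ids_cpu[[i2, i1]]
        self.num_computed_tokens_cpu[[i1, i2]] = self.num_computed_tokens_cpu[[i2, i1]]
```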
### How was this patch tested?
will be tested by #1136

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 17:42:53 +08:00
Mengqing Cao
59237ea788 [CI/UT] Add test for chunk prefill and prefix cache on v1/AscendScheduler (#1505)
### What this PR does / why we need it?
Add test for chunked prefill and prefix cache on v1/AscendScheduler

Covered scenarios:
- `Qwen/Qwen3-0.6B-Base` and `deepseek-ai/DeepSeek-V2-Lite-Chat` ---
multicard CI time increased by 19 min
- `V1 + default scheduler` vs `V1 + default scheduler + enable prefix
cache`
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable prefix
cache` vs `V1 + Ascend scheduler + enable prefix cache + enable chunked
prefill`
- `Qwen/Qwen3-0.6B-Base` --- singlecard CI time increased by 8 min
- `V1 + Ascend scheduler` vs `V1 + Ascend scheduler + enable chunked
prefill`

should rebase after #1498 and #1446
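
For reference, a minimal sketch of how one of the covered configurations (Ascend scheduler with prefix cache and chunked prefill) could be launched, assuming vllm-ascend's `ascend_scheduler_config` knob (exact field names may differ):

```python
from vllm import LLM

# Illustrative configuration only; field names follow vllm-ascend's additional_config style.
llm = LLM(
    model="Qwen/Qwen3-0.6B-Base",
    enable_prefix_caching=True,
    additional_config={
        "ascend_scheduler_config": {
            "enabled": True,
            "enable_chunked_prefill": True,
        }
    },
)
```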
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:57:03 +08:00
Zhu Yi Lin
6b80c5acba Fix W8A8 fused moe bug (#1529)
### What this PR does / why we need it?
1. Drop some useless code for the W8A8 fused MoE.
2. Add an int8 KV cache check.
3. Add more unit tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: zhuyilin <809721801@qq.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
2025-07-02 16:40:51 +08:00
Agonixiaoxiao
7fc1a98489 add ut for kv transfer module (#1531)
### What this PR does / why we need it?
Test the KV data transfer module, which contains connect, pipe, and buffer.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with new added test.

---------

Signed-off-by: lixudong <lixudong@cmss.chinamobile.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: lixudong <lixudong@cmss.chinamobile.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-07-02 16:14:52 +08:00
Yikun Jiang
aa5fa07478 Only enable single version for wheel pr build (#1571)
### What this PR does / why we need it?
Only enable a single version for the wheel PR build to speed up PR-triggered CI.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 14:50:34 +08:00
yupeng
c3c8c9317c [DOC] add LoRA user guide (#1265)
### What this PR does / why we need it?
Add LoRA user guide to DOC. The content refers to [LoRA
Adapters](https://docs.vllm.ai/en/latest/features/lora.html).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-07-02 14:41:31 +08:00
Li Wang
f39365d2ea [Benchmark] Fix error msg upload in performance benchmark (#1559)
### What this PR does / why we need it?

Make sure that None parameters are not passed in for `--error`
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

CI passed locally

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-02 14:06:08 +08:00
wangxiyuan
641a4e6092 [CI] Cache sampled token ids in model runner to fix CI error (#1573)
### What this PR does / why we need it?
vllm change
7f280d69c9
breaks vllm-ascend.

This PR fixes the broken CI.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
passed

Closes: https://github.com/vllm-project/vllm-ascend/issues/1572

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-02 12:11:14 +08:00
Pleaplusone
0e43813120 [ModelRunner] Use shared CachedRequestData cross request to fix ci (#1546)
### What this PR does / why we need it?

This PR (adapted from
2863befce3)
updates the CachedRequestData definition to use a single instance shared
across all requests in a batch, instead of creating a new instance per
request.

CI was found broken by vLLM's model_runner change: `ERROR 07-01 09:53:53
[core.py:521] TypeError: 'CachedRequestData' object is not iterable`.
This PR modifies the model_runner to fix it.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passing CI will verify this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-07-02 06:05:21 +08:00
Li Wang
6db7dc2c85 [Benchmark] Refactor perf script to use benchmark cli (#1524)
### What this PR does / why we need it?

Since the `vllm bench` CLI is now optimized enough for production use (it
supports more datasets), we no longer need to copy vLLM code; with vLLM
installed, we can easily use the benchmark CLI.
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-06-30 23:42:04 +08:00
leo-pony
53ec583bbb [Docs] Update Atlas 300I series doc and fix CI lint (#1537)
### What this PR does / why we need it?
- Update Atlas 300I series doc: clean up unused parameters and enable
optimized ops
- Fix code spell CI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 23:34:00 +08:00
wangxiyuan
a054f0f4ca [CI] change to new ds model (#1513)
Previously, the DeepSeek V3 pruning weight was not correct and the MoE layer
was not tested. We updated to a new pruning model to enable MoE layer
compute.

This PR fixes the CI to address the new weight.

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-30 19:02:29 +08:00
Shanshan Shen
8013634e9c [Structured Output] Remove redundant check for grammar_bitmask (#1459)
### What this PR does / why we need it?
Remove the redundant check since we already check this at
https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/worker/model_runner_v1.py#L1450.


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:39:19 +08:00
Shanshan Shen
ba577dfc52 [Doc] Add Structured Output guide (#1499)
### What this PR does / why we need it?
Add Structured Output guide.
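
For context, a small hedged example of what structured (guided) output looks like with vLLM's guided-decoding API (the JSON schema and model are illustrative):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain generation to a small JSON schema (illustrative schema).
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))
outputs = llm.generate("Describe Beijing as JSON.", params)
print(outputs[0].outputs[0].text)
```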


Signed-off-by: shen-shanshan <467638484@qq.com>
2025-06-30 17:21:44 +08:00
whx
f286265791 [BugFix] Address PrefillCacheHit state to fix prefix cache accuracy bug (#1498)
When using AscendScheduler with prefix cache enabled and chunked prefill
disabled, there is an accuracy problem because there is no branch in mla_v1
to handle this scenario. This PR fixes it.

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-06-30 16:51:20 +08:00
Li Wang
5f8241c25c [V1][ModelRunner] Support pooling model for v1 engine (#1359)
### What this PR does / why we need it?
Change as little existing code as possible to add V1 pooling task support.
Note that I moved `vllm.v1.worker.gpu_input_batch` down into vllm-ascend:
considering the frequent changes in upstream interfaces, it is kept here to
decouple from them.
### How was this patch tested?
CI passed with newly added/existing tests. A simple test was also run locally
first, adapted from
https://www.modelscope.cn/models/Qwen/Qwen3-Embedding-0.6B, as shown below:
```python
import os

import torch
from vllm import LLM


os.environ["VLLM_USE_MODELSCOPE"]="True"

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]
```
---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangli <858794774@qq.com>
Co-authored-by: wangli <858794774@qq.com>
2025-06-30 16:31:12 +08:00
dependabot[bot]
790c810bf7 Bump actions/github-script from 6 to 7 (#1519)
Bumps [actions/github-script](https://github.com/actions/github-script)
from 6 to 7.
Release notes and commits (condensed from the actions/github-script releases):

- v7.0.0: add `base-url` option (actions/github-script#429), expose async-function argument type (actions/github-script#402), update dependencies and use Node 20 (actions/github-script#425). Full changelog: https://github.com/actions/github-script/compare/v6.4.1...v7.0.0
- v6.4.1: add `@octokit/plugin-request-log` for request debug output (#358), fix input handling (#357), remove unused dependencies (#356), default debug to the current runner debug state (#363). Full changelog: https://github.com/actions/github-script/compare/v6.4.0...v6.4.1
- v6.4.0: bump json5 from 2.1.3 to 2.2.3 (#319), bump minimatch from 3.0.4 to 3.1.2 (#320), add node-fetch (#321). Full changelog: https://github.com/actions/github-script/compare/v6.3.3...v6.4.0
- v6.3.3: update `@actions/glob` to 0.3.0 (#279). Full changelog: https://github.com/actions/github-script/compare/v6.3.2...v6.3.3
- v6.3.2: update `@actions/core` to 1.10.0 (#295).
- Commits: https://github.com/actions/github-script/compare/v6...v7


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=actions/github-script&package-manager=github_actions&previous-version=6&new-version=7)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-30 16:04:41 +08:00
Yikun Jiang
e4df0a4395 Add Pangu MoE Pro for 300I series docs (#1516)
### What this PR does / why we need it?
Add Pangu MoE Pro for 300I series docs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 13:37:22 +08:00
Yikun Jiang
cad4c693c6 Add Pangu MoE Pro docs (#1512)
### What this PR does / why we need it?
This PR adds Pangu MoE Pro 72B docs.

[1] https://gitcode.com/ascend-tribe/pangu-pro-moe-model

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-06-30 12:15:33 +08:00
yiz-liu
75d05ee200 [Core] Fix block table shape to make Prefix cache work with Ascend scheduler (#1446)
### What this PR does / why we need it?

This fixes the shape of `block_table`, which was changed when hybrid KV
groups were introduced several weeks ago.

An error is raised when prefix cache (eager or not) and the Ascend Scheduler
are enabled at the same time; just send two identical requests and it will
reproduce.

v0.9.1: https://github.com/vllm-project/vllm-ascend/pull/1297

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test manually

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-06-30 11:25:19 +08:00
Zhu Yi Lin
b308a7a258 support pangumoe w8a8c8 and docs (#1477)
### What this PR does / why we need it?
support pangu moe w8a8c8

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

Signed-off-by: zhuyilin <809721801@qq.com>
2025-06-28 18:51:07 +08:00
Angazenn
c59d69d9e6 [PERF]support MERRouter (#1421)
### What this PR does / why we need it?
This PR introduces an expert rearrangement algorithm for the PanguProMoE
model. Different from the original grouped top-k, it keeps only the top
experts that are allocated more tokens, so fewer experts need to be loaded
when calculating gmm.

We have tested this algorithm with PanguProMoE-72B on both the 300I Duo and
800I A2 platforms. On 300I Duo we find that setting `num_voted_experts` to 5
achieves both good performance and accuracy, while on 800I A2 we still set it
to 8 to use the original Pangu grouped top-k.
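
A rough sketch of the voting idea, assuming per-token top-k expert ids are already available (names and shapes are illustrative, not the actual kernel):

```python
import torch

def select_voted_experts(topk_ids: torch.Tensor, num_voted_experts: int,
                         num_experts: int) -> torch.Tensor:
    # topk_ids: [num_tokens, k] expert ids chosen by grouped top-k per token.
    # Count how many tokens "vote" for each expert, then keep only the
    # num_voted_experts most-voted experts, so fewer experts are loaded for GMM.
    votes = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    return torch.topk(votes, num_voted_experts).indices

# Example: 8 routed experts, keep the 5 most-voted ones (as on 300I Duo).
topk_ids = torch.randint(0, 8, (16, 2))
print(select_voted_experts(topk_ids, num_voted_experts=5, num_experts=8))
```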

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:14:49 +08:00
Angazenn
8fa188111d [PERF]support H2P communication optimization for PanguProMoe (#1463)
### What this PR does / why we need it?
In this PR, we support the H2P communication optimization when running
PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather`
to replace `all_reduce` to improve performance:

Original layer:
`input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm --> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce`

Now:
`input_layernorm --> tp all_gather --> attn --> tp reduce_scatter --> post_attention_layernorm --> all_rank all_gather --> moe/mlp --> all_rank reduce_scatter`

Besides, because `reduce_scatter` requires the number of tokens to be
divisible by the group size, we need to pad the sequences based on
`max_tokens_across_dp`.
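
A rough sketch of the pad-then-reduce_scatter step with plain `torch.distributed` collectives (illustrative only; the real implementation lives in vllm-ascend's communication layer):

```python
import torch
import torch.distributed as dist

def pad_and_reduce_scatter(hidden: torch.Tensor, max_tokens_across_dp: int,
                           group) -> torch.Tensor:
    # Pad the local tokens up to a common length that is divisible by the group
    # size, since reduce_scatter splits the first dim evenly across ranks.
    world_size = dist.get_world_size(group)
    padded_len = -(-max_tokens_across_dp // world_size) * world_size
    pad = padded_len - hidden.shape[0]
    if pad > 0:
        hidden = torch.nn.functional.pad(hidden, (0, 0, 0, pad))
    out = torch.empty(padded_len // world_size, hidden.shape[1],
                      dtype=hidden.dtype, device=hidden.device)
    dist.reduce_scatter_tensor(out, hidden, group=group)
    return out  # later stages all_gather this back instead of using all_reduce
```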

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR has been tested with both offline and online inference using
PanguProMoE-72B.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:10:27 +08:00
Angazenn
5c53cbaf2a [BugFix]Fix bugs when initializing communication groups with dp on 300I Duo (#1478)
### What this PR does / why we need it?
This PR fixes a bug that occurs when broadcast is used with the cpu_group
while running DP. The `broadcast310p` patch takes effect for both the
cpu_group and the device group, but we only need it for the device group.
Hence a wrapper is added that lets the cpu_group use the native torch
broadcast, which solves the bug.
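
A rough sketch of the dispatch idea (names are illustrative; the 310P-specific path is a placeholder, not the actual patch):

```python
import torch.distributed as dist

_native_broadcast = dist.broadcast  # keep a handle to the original implementation

def _broadcast_310p(tensor, src, group=None, async_op=False):
    # Placeholder for the 310P-specific patched broadcast.
    return _native_broadcast(tensor, src, group=group, async_op=async_op)

def broadcast_wrapper(tensor, src, group=None, async_op=False):
    # Route CPU (gloo) groups to the native torch broadcast; only NPU device
    # groups go through the 310P-specific patched path.
    if group is not None and dist.get_backend(group) == "gloo":
        return _native_broadcast(tensor, src, group=group, async_op=async_op)
    return _broadcast_310p(tensor, src, group=group, async_op=async_op)
```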

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With this PR, DP on 310p runs normally and generates reasonable answers.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
2025-06-28 16:07:52 +08:00
Mengqing Cao
2cf9c4c3a2 [CI/Build] Fix version conflict on transformers (#1490)
### What this PR does / why we need it?
Fix version conflict on transformers:
`pip._vendor.pkg_resources.ContextualVersionConflict: (transformers
4.53.0 (/usr/local/python3.10.17/lib/python3.10/site-packages),
Requirement.parse('transformers<4.53.0'), {'vllm-ascend'})`
Fix
https://github.com/vllm-project/vllm-ascend/actions/runs/15933263325/job/44947231642

### Does this PR introduce _any_ user-facing change?
Fix broken build

### How was this patch tested?
CI passed with new existing test.

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 15:11:04 +08:00
Mengqing Cao
5f4391652f [PromptLogprobs][V1] Support prompt logprobs to fix ceval accuracy in V1 (#1483)
### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to test accuracy
on V1.

### Does this PR introduce _any_ user-facing change?
support prompt logprobs output

### How was this patch tested?
CI passed with accuracy test.

Using lm_eval, which uses prompt logprobs as output to test accuracy:
```python
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```
After this pr, the accuracy test results of `Qwen/Qwen2.5-7B-Instruct`
on V1 is:
```bash
|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                            |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```

Closes: https://github.com/vllm-project/vllm-ascend/issues/1043

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-06-28 09:38:52 +08:00