### What this PR does / why we need it?
Add accuracy ci for DP and EP and TP
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
35514b682a
---------
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Performance optimization for apply_top_k_top_p
### Does this PR introduce _any_ user-facing change?
Use VLLM_ASCEND_ENABLE_TOPK_TOPP_OPTIMIZATION to enable this feature
### How was this patch tested?
e2e & ut
- vLLM version: v0.9.2
- vLLM main:
6a9e6b2abf
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
This patch supports pipeline parallel in V1 Engine
### Does this PR introduce _any_ user-facing change?
Yes, users can run PP in V1
### How was this patch tested?
Manully test
- vLLM version: v0.9.2
- vLLM main:
31d5c1797f
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
The optimization solution for non-deepseek select_experts is to replace
gating_topk_softmax with softmax+topk+to, which is optimized from 37us
to 14us on bf16/fp16 of qwen3-235b
- vLLM version: v0.9.2
- vLLM main:
1a4f35e2ea
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
### What this PR does / why we need it?
The previous code is
router_logits, _ = self.gate(hidden_states)
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits = get_dp_group().all_gather(router_logits, 0)
I want to change the two all_gathers to one, reduce one all_gather
communication, and make it
hidden_states = get_dp_group().all_gather(hidden_states, 0)
router_logits, _ = self.gate(hidden_states)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
gsm8k accuracy verification
<img width="1809" alt="截屏2025-06-24 21 53 24"
src="https://github.com/user-attachments/assets/47eace3b-a86b-41b4-9de8-773f57fea33b"
/>
- vLLM version: v0.9.2
- vLLM main:
77f77a951e
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
### What this PR does / why we need it?
Now there is no need to calculate `num_draft_tokens` when allocating
slots.
This PR follows the changes in vllm:
https://github.com/vllm-project/vllm/pull/20701
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test
- vLLM version: v0.9.2
- vLLM main:
cc876d0f29
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
1. enable lint check for all change
2. only run ut and e2e if it's the code change.
3. only run ut and disable e2e if the change is ut only.
4. disable wheel build for push case
5. run unit test when pr is merged
6. remove useless pytest.ini
- vLLM version: v0.9.2
- vLLM main:
fdfd409f8f
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Add user doc index to make the user guide more clear
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Follow vllm-project/vllm lint way:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml
Enable pre-commit to avoid some low level error AMAP.
This pr is one step of #1241, The purpose is make linting system more
clear and convenient, on this step, Mainly did the following things:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.
TODO:
- clang-format(check for csrc with google style)
need clean code, disable for now
- pymarkdown
need clean code, disable for now
- shellcheck
need clean code, disable for now
### Does this PR introduce _any_ user-facing change?
Only developer UX change:
https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally
```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```
### How was this patch tested?
CI passed with new added/existing test.
Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
If a small batch of short requests is sent first, forming a chunk with a
length <128, it will corrupt the `attn_mask_cache`, causing subsequent
requests that do not form a chunk to have accuracy issues.
The root cause of this problem is the use of in-place multiplication.
Modifying it to use out-of-place multiplication will resolve the
accuracy problem.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Yes.
- vLLM version: v0.9.2
- vLLM main:
ad6c2e1a0b
---------
Signed-off-by: ApsarasX <apsarax@outlook.com>
### What this PR does / why we need it?
When all_reduce_merge is in progress, shared_experts does not do
all_reduce in mlp, but waits until shared_experts+router_experts are
completed before doing all_reduce
In prefill and decode, as long as shared_experts+router_experts are
all_reduce, there will be benefits.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
bash examples/run_dp_attention_etp16.sh
bash examples/run_dp_attention_etp16_benmark.sh
- vLLM version: v0.9.1
- vLLM main:
977180c912
---------
Signed-off-by: ttanzhiqiang <389825161@qq.com>
### What this PR does / why we need it?
In some cases, `dist._functional_collectives.reduce_scatter_tensor` can
cause its input tensor not to be released immediately after the current
layer ends. Instead, it will only be released when the GPU memory usage
of the current process reaches a certain threshold (approximately every
15 layers each time).
**Before Fix**
<img width="1441" alt="截屏2025-06-24 01 26 13"
src="https://github.com/user-attachments/assets/72d5dbb3-c8c8-4778-bf64-8db7bab8aff0"
/>
**After Fix**
<img width="1475" alt="截屏2025-06-24 01 23 43"
src="https://github.com/user-attachments/assets/6c69cfcd-a469-4ee5-b8c6-210aeb3a5bdf"
/>
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
9ff2af6d2b
---------
Signed-off-by: ApsarasX <apsarax@outlook.com>
### What this PR does / why we need it?
add multi node data parallel doc
### Does this PR introduce _any_ user-facing change?
add multi node data parallel doc
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
805d62ca88
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Improve
- Keep the same file name format as v1, `offline_inference_npu_v0.py`,
`offline_inference_npu_v1.py`
- Use `VLLM_USE_V1` = 0/1 clearly in py scripts
- Fix some run errors in `offline_inference_npu_v1.py`, e.g.
`deepseekv3-lite-base-latest` not exists in modescope or hf.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main:
baed180aa0
Signed-off-by: xleoken <xleoken@163.com>
### What this PR does / why we need it?
To solve the error in the CI of long term test:
```bash
modelscope - ERROR - Repo JackFram/llama-68m not exists on either https://www.modelscope.cn/ or https://www.modelscope.ai/
```
Replace the hf model with modelscope model.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
71d1d75b7a
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
vllm has released 0.9.2. This PR drop 0.9.1 support.
- vLLM version: v0.9.1
- vLLM main:
b942c094e3
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This patch upgrade vLLM version to v0.9.2, this patch didn't remove the
v0.9.1 compatible code to easy review.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
14601f5fba
- Accuracy test with 0.9.2:
https://github.com/vllm-project/vllm-ascend/actions/runs/16121612087
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
1、Sometimes loading torchair cache will fail because of the floating of
npu memory, so this pr add a new cache to save the old kv cache bytes to
avoid the possible crash while loading the torchair graph cache.
2、When caching is enabled and does not exist, the first compilation
introduces the overhead of Dynamo Gurad. So in this case, we will
compile them directly twice to skip them (This will bring 3-4 ms of tpot
optimization)
### Does this PR introduce _any_ user-facing change?
Add a new env `VLLM_ASCEND_KV_CACHE_MEGABYTES_FLOATING_TOLERANCE` to
control kv cache floating tolerance
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
1fd471e957
Signed-off-by: boying <897013703@qq.com>
### What this PR does / why we need it?
perf: use multicast to avoid padding decode request to prefill size
### How was this patch tested?
- vLLM version: v0.9.1
- vLLM main:
1fd471e957
Signed-off-by: boying <897013703@qq.com>
### What this PR does / why we need it?
This PR fixes a bug that is caused by max_num_tokens_across_dp
calculation. In earlier version, we compute this by graph_pad_size plus
max_num_tokens(actual). This will result in different
max_num_tokens_across_dp across dp ranks. If padding related is
required, this might cause a wrong padding.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed normally.
Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
### What this PR does / why we need it?
This PR implements the DeepSeek Expert Parallel Load Balancing (EPLB)
strategy to optimize expert distribution in vllm-ascend. The
implementation:
- Adapts the expert-map format to work with vllm-ascend's architecture
- Provides DeepSeek-provided mechanism to balance expert workload across
devices
### Does this PR introduce _any_ user-facing change?
This PR adds a new script that allows users to:
- Generate expert map configurations based on workload analysis
- Optimize expert distribution for their specific use case
### How was this patch tested?
To use this feature:
1. First collect expert heat information during model execution
2. Run the provided script to generate the expert map configuration
3. Apply the generated configuration to your vllm-ascend deployment
User example:
```bash
# expert_load_view.pt: dumped expert heat info file
python3 examples/eplb/eplb_strategy.py --exp_name 'deepseek_demo' \
--input_path expert_load_view.pt --output_path examples/eplb/results/demo \
--num_nodes 4
```
---------
Signed-off-by: ZhengWG <zwg0606@gmail.com>
### What this PR does / why we need it?
Add ut for test_pooling_model_runner.py
### Does this PR introduce _any_ user-facing change? N/A
### How was this patch tested?
python -m unittest test_pooling_model_runner.py
- vLLM version: v0.9.1
- vLLM main:
2e610deb72
---------
Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
This patch enables the vllm commits recording and also cleanup unused
commit msg note in PR.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- CI passed
- Test on https://github.com/Yikun/vllm-ascend/pull/33 and vllm commit
refreshed as expected.
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add the resource clear logic to fix oom issue when testing
`tests/e2e/singlecard/core/ascend_scheduler`.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Update accuracy test
1. remove accuarcy report on V0
2. add parallel and execution mode
3. add Qwen/Qwen3-30B-A3B and remove Qwen/Qwen2.5-7B-Instruct
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Use Base test and cleanup all manaul patch code
- Cleanup EPLB config to avoid tmp test file
- Use BaseTest with global cache
- Add license
- Add a doc to setup unit test in local env
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Since running on Altlas 300I Duo was initial supported after #1333 ,
this PR will disable the JIT compiler for the 310P and changed the data
format to NZ for the weight in the vocabulary embedding and QKV
projection layers, which help improving performance.
See #1563
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test manually:
https://github.com/vllm-project/vllm-ascend/pull/1591#issuecomment-3028352339
Signed-off-by: Vincent Yuan <farawayboat@gmail.com>
This commit
78fe77534b
from vllm reverted the change for FusedMoEParallelConfig
This PR do the same to fix the CI error
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Unify Model Usage via ModelScope
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
This PR supports torchair graph mode with non-mla backend on both 800IA2
and 300I Duo platforms. The main change is to add
`attention_v1_torchair.py` to support specific attention related
operations that are required by torchair.
### Does this PR introduce _any_ user-facing change?
Before this PR, vLLM-Ascend only allows deepseek to use torchair. Now we
can also use it with pangu. Besides, we add a support model list to
control which type of models that can use torchair.
### How was this patch tested?
We have test it with PanguProMoE on both 800IA2 and 300I Duo platforms,
and model generates answer normally.
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
### What this PR does / why we need it?
This pr supports w8a8 on 300I Duo platform. The main change is to use
`npu_quant_grouped_matmul_dequant` to replace `npu_grouped_matmul`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
offline inference on 310p runs normally.
---------
Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
### What this PR does / why we need it?
Fix lint
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
The `os.environ["VLLM_USE_MODELSCOPE"] = "True"` should be placed before
module imports
if not
```
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/xleoken/projects/vllm-ascend/examples/offline_embed.py", line 48, in <module>
model = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 243, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 494, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1018, in create_engine_config
model_config = self.create_model_config()
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 910, in create_model_config
return ModelConfig(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 120, in __init__
s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/config.py", line 528, in __post_init__
hf_config = get_config(self.hf_config_path or self.model,
File "/usr/local/python3.10.17/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 321, in get_config
config_dict, _ = PretrainedConfig.get_config_dict(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
resolved_config_file = cached_file(
File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 266, in cached_file
file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/transformers/utils/hub.py", line 491, in cached_files
raise OSError(
OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[ERROR] 2025-07-03-15:27:10 (PID:333665, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local.
Signed-off-by: xleoken <xleoken@163.com>
### What this PR does / why we need it?
Fix word spelling in DOC.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Signed-off-by: paulyu12 <507435917@qq.com>
### What this PR does / why we need it?
mla attention still using the gpu_input_batch's attr:`swap_states`, which will lead to
an error `AttributeError: 'InputBatch' object has no attribute 'swap_states'`
This PR fixed the mla input patch error
### How was this patch tested?
will be tested by #1136
---------
Signed-off-by: wangli <wangli858794774@gmail.com>