Commit Graph

36 Commits

Author SHA1 Message Date
Li Wang
7fe0469e27 [CI][Misc] Use offline mode for model downloads (#7179)
### What this PR does / why we need it?
1. For all parts of the current test module that download models, add the
`local_files_only` parameter to specify offline mode; this ensures that CI
will not fail due to network instability.
2. Install modelscope from a fixed commit until its next release.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Check whether the env or the `local_files_only` arg works:
1) set the env:
```shell
export HF_HUB_OFFLINE=1
```
2) run the script
```python
from transformers import PretrainedConfig
import huggingface_hub
from modelscope.utils.hf_util import patch_hub

patch_hub()

model="Qwen/Qwen3-0.6B"
kwargs = {}


config_dict, _ = PretrainedConfig.get_config_dict(
    model,
    trust_remote_code=True,
    local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
    **kwargs,
)

print(config_dict)
```
it works well:
```shell
2026-03-06 06:40:12,546 - modelscope - WARNING - We can not confirm the cached file is for revision: master
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
{'architectures': ['Qwen3ForCausalLM'], 'attention_bias': False, 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'head_dim': 128, 'hidden_act': 'silu', 'hidden_size': 1024, 'initializer_range': 0.02, 'intermediate_size': 3072, 'max_position_embeddings': 40960, 'max_window_layers': 28, 'model_type': 'qwen3', 'num_attention_heads': 16, 'num_hidden_layers': 28, 'num_key_value_heads': 8, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000, 'sliding_window': None, 'tie_word_embeddings': True, 'torch_dtype': 'bfloat16', 'transformers_version': '4.51.0', 'use_cache': True, 'use_sliding_window': False, 'vocab_size': 151936, '_commit_hash': None}
```
3) test a model repo that is not cached locally when the env
`HF_HUB_OFFLINE` is true:
```python
from transformers import PretrainedConfig
import huggingface_hub
from modelscope.utils.hf_util import patch_hub

patch_hub()


model="FireRedTeam/FireRed-OCR"
kwargs = {}


config_dict, _ = PretrainedConfig.get_config_dict(
    model,
    trust_remote_code=True,
    local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
    **kwargs,
)

print(config_dict)
```
and the result is as expected:
```shell
  File "/workspace/demo.py", line 12, in <module>
    config_dict, _ = PretrainedConfig.get_config_dict(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 189, in patch_get_config_dict
    model_dir = get_model_dir(pretrained_model_name_or_path,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/utils/hf_util/patcher.py", line 164, in get_model_dir
    model_dir = snapshot_download(
                ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 137, in snapshot_download
    return _snapshot_download(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/modelscope/hub/snapshot_download.py", line 283, in _snapshot_download
    raise ValueError(
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable look-ups and downloads online, set 'local_files_only' to False
```
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-13 08:52:24 +08:00
XiaoxinWang
37d1bd8c50 fixed fia pad logic in graph mode. (#7144)
### What this PR does / why we need it?
Related to vLLM PR #34043, which deletes the function
`relax_for_mixed_batch_cudagraphs`. After that deletion, num_reqs no longer
equals the actual number of requests. Because the fia operator requires that
query_start_loc[-1] equal the total number of computed tokens, the deletion
causes the ifa error.
In full graph mode, set num_reqs_paded = num_reqs to fix the error.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
2026-03-12 14:50:54 +08:00
meihanc
da01a74009 Revert "[CI] fix skiped e2e test when upgrade vllm version (#6654)" (#7166)
This reverts commit f6db47f103.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-11 23:03:15 +08:00
yupeng
830f39dd70 [Bugfix][LoRA] Fix the issue when enable LoRA + tp + fully_sharded_loras (#6650)
### What this PR does / why we need it?
Fixes issue #6143.

### Does this PR introduce _any_ user-facing change?
Allows starting the server with `--enable-lora`, `--fully-sharded-loras`,
and `--tensor_parallel_size 2` together.

### How was this patch tested?
pytest -sv tests/e2e/multicard/2-cards/test_llama32_lora_tp2.py
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: paulyu12 <507435917@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-11 15:43:15 +08:00
meihanc
f6db47f103 [CI] fix skiped e2e test when upgrade vllm version (#6654)
### What this PR does / why we need it?
Fix the skipped test_aclgraph_capture_replay.py when upgrading the vLLM version.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
13397841ab

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-10 09:55:35 +08:00
SILONG ZENG
43df2cb2fc [Lint]Style: Convert test/ to ruff format(Batch #1) (#6738)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `tests/e2e/310p/multicard/test_vl_model_multicard.py` |
| `tests/e2e/310p/singlecard/test_vl_model_singlecard.py` |
| `tests/e2e/310p/test_utils.py` |
| `tests/e2e/conftest.py` |
| `tests/e2e/model_utils.py` |
| `tests/e2e/models/conftest.py` |
| `tests/e2e/models/test_lm_eval_correctness.py` |
| `tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py` |
| `tests/e2e/multicard/2-cards/test_aclgraph_capture_replay.py` |
| `tests/e2e/multicard/2-cards/test_data_parallel.py` |
| `tests/e2e/multicard/2-cards/test_disaggregated_encoder.py` |
| `tests/e2e/multicard/2-cards/test_expert_parallel.py` |
| `tests/e2e/multicard/2-cards/test_external_launcher.py` |
| `tests/e2e/multicard/2-cards/test_full_graph_mode.py` |
| `tests/e2e/multicard/2-cards/test_ilama_lora_tp2.py` |
| `tests/e2e/multicard/2-cards/test_offline_inference_distributed.py` |
| `tests/e2e/multicard/2-cards/test_offline_weight_load.py` |
| `tests/e2e/multicard/2-cards/test_pipeline_parallel.py` |
| `tests/e2e/multicard/2-cards/test_prefix_caching.py` |
| `tests/e2e/multicard/2-cards/test_quantization.py` |
| `tests/e2e/multicard/2-cards/test_qwen3_moe.py` |
| `tests/e2e/multicard/2-cards/test_qwen3_moe_routing_replay.py` |
| `tests/e2e/multicard/2-cards/test_qwen3_performance.py` |
| `tests/e2e/multicard/2-cards/test_shared_expert_dp.py` |
| `tests/e2e/multicard/2-cards/test_single_request_aclgraph.py` |
| `tests/e2e/multicard/2-cards/test_sp_pass.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-03-10 09:52:50 +08:00
zhangxinyuehfad
a6745b8577 [CI] fix test_qwen3_moe_external_launcher_ep_tp2 (#6951)
### What this PR does / why we need it?
Fix test_qwen3_moe_external_launcher_ep_tp2 by using
wait_until_npu_memory_free.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-05 16:43:45 +08:00
whx
16c879cdf7 [Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518)
### What this PR does / why we need it?
Add the muls_add Triton kernel with its related fusion pass. In addition, this
PR refactors `AscendCompilationConfig` and deletes `NpugraphExConfig`.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with the newly added test.


- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-03-02 17:54:25 +08:00
realliujiaxu
5def28dcd3 [Feat]support sequence parallelism by pass for VL models (#5632)
2026-02-27 08:27:41 +08:00
starmountain1997
bc1622338c [CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)
### What this PR does / why we need it?

This version has no divisibility constraint between tp and mtp+1.
However, cudagraph_capture_sizes must be a common multiple of tp and
mtp+1, with a maximum of tp * (mtp+1). Therefore, we fixed
cudagraph_capture_sizes.
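
The constraint above can be illustrated with a minimal sketch. It assumes that the valid sizes are exactly the common multiples of `tp` and `mtp + 1` that do not exceed `tp * (mtp + 1)`; the helper name `valid_capture_sizes` is hypothetical and is not part of vLLM or vllm-ascend.

```python
from math import lcm


def valid_capture_sizes(tp: int, mtp: int) -> list[int]:
    """Hypothetical helper: common multiples of tp and mtp + 1, capped at tp * (mtp + 1)."""
    step = lcm(tp, mtp + 1)   # smallest size divisible by both tp and mtp + 1
    upper = tp * (mtp + 1)    # maximum capture size stated in the description
    return list(range(step, upper + 1, step))


print(valid_capture_sizes(tp=4, mtp=1))  # [4, 8]
print(valid_capture_sizes(tp=8, mtp=2))  # [24]
```

When `tp` and `mtp + 1` share no common factor, the only size that satisfies both conditions is `tp * (mtp + 1)` itself.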

We added a long-sequence test (64k input, 3k output) for the two-node
mixed deployment scenario. Due to the excessive time required for
performance benchmarking, we are only verifying functionality. The
single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.

We also add an aime2025 test to the dual-node DeepSeek-V3.2 nightly test.

### How was this patch tested?

Tested in the nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-26 10:58:50 +08:00
Li-Yongwen
2870f7c8ad [Feat] Support routing replay (#6696)
### What this PR does / why we need it?

[Feat] Support routing replay.
Same as https://github.com/vllm-project/vllm-ascend/pull/6666;
resubmitted because of a DOC failure.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

---------

Signed-off-by: liyongwen <1310439159@qq.com>
Signed-off-by: Li-Yongwen <63399187+Li-Yongwen@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-26 10:22:47 +08:00
jiahao.quan
7221045777 [Attention] add gpt-oss support (#5901)
### What this PR does / why we need it?
Please refer to the following link for the historical conversation
https://github.com/vllm-project/vllm-ascend/pull/4467. We have made
updates in light of the comments from the prior PR review. Given the
refactoring of the attention_v1 component, we have carried out necessary
adjustments to fit the newly revised code.

### Does this PR introduce _any_ user-facing change?

1. Modified the code in the Attention section to adapt to the SWA and
Sink features required by gpt-oss.
2. Modified the code in the MoE section to add support for bias and
swigluoai.

### How was this patch tested?
Please refer to the
https://github.com/vllm-project/vllm-ascend/pull/4467 for performance
tests, on the basis of which the accuracy tests from AIME2024 have been
newly added.

![img_v3_02tu_501e88e3-2217-4565-8edf-b9acf4f43f2g](https://github.com/user-attachments/assets/024f8283-18ab-4d4d-ab12-27917b5d7d06)


- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: mikequan0425 <mikequan0425@foxmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Signed-off-by: pu-zhe <zpuaa@outlook.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: leon_tao <taoyao2@huawei.com>
Co-authored-by: nurxat <738457498@qq.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: mikequan <199741451@qq.com>
Co-authored-by: LI SHENGYONG <49200266+shenchuxiaofugui@users.noreply.github.com>
Co-authored-by: jiangyunfan1 <jiangyunfan1@h-partners.com>
Co-authored-by: pu-zhe <zpuaa@outlook.com>
Co-authored-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Cao Yi <slightwindsec@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: SILONG ZENG <2609716663@qq.com>
2026-02-12 10:55:34 +08:00
wangxiyuan
2a826b5fad [Misc] upgrade to vllm main (#6646)
### What this PR does / why we need it?
This PR upgrades the core vLLM dependency to a newer version from the
main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is
necessary to keep our project up-to-date with the latest features and
fixes from upstream vLLM.

1. ac32e66cf9: pass file is moved.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
2026-02-10 14:08:59 +08:00
wangyu
c63b7a1188 [Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301)
### What this PR does / why we need it?
This PR adds disaggregated encoder  tests for Qwen2.5-VL-7B-Instruct 
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
by running ci

- vLLM version: release/v0.12.0

---------

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
2026-02-06 17:30:17 +08:00
starmountain1997
bfcc372f75 [CI] Add long and short prompt tests for DeepSeek-V3.2 (#6499)
### What this PR does / why we need it?

This PR enhances the test_deepseek3_2_w8a8_pruning_mtp_tp2_ep E2E test
by adding both short and long prompt test cases:
- Short test: Validates basic functionality with minimal input ("Hello
")
- Long test: Validates the model can handle prompts near its maximum
context length (~163K tokens, approaching the max_position_embeddings
limit of 163,840)
Additionally, explicitly sets max_model_len=163840 to ensure the test
properly exercises the model's full context window capability.
### Does this PR introduce _any_ user-facing change?

No. This change only affects internal E2E testing infrastructure.  

### How was this patch tested?

The modified test case will be executed as part of the E2E test suite
and has been validated
[here](https://github.com/vllm-project/vllm-ascend/actions/runs/21620195055/job/62308026205?pr=6499).



- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-04 09:10:50 +08:00
Nengjun Ma
78fad4e348 [Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442)
### What this PR does / why we need it?
Refactor MLP weight prefetch to be consistent with the MoE model's
prefetching in terms of code and usage.
The environment variables VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE are removed; use the following instead:

--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'

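As a small illustration of the usage above, the following sketch builds the same `--additional-config` value from a Python dict, which avoids hand-writing nested JSON. The helper `build_additional_config_flag` is hypothetical; only the keys and values are taken from the example above.

```python
import json
import shlex

# Keys and values copied from the usage example in this PR description.
additional_config = {
    "weight_prefetch_config": {
        "enabled": True,
        "prefetch_ratio": {"mlp": {"gate_up": 1.0, "down": 1.0}},
    }
}


def build_additional_config_flag(config: dict) -> str:
    """Hypothetical helper: render a dict as a shell-safe --additional-config argument."""
    return "--additional-config " + shlex.quote(json.dumps(config))


print(build_additional_config_flag(additional_config))
```

Quoting with `shlex.quote` keeps the JSON intact when the flag is pasted into a shell command.
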
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-04 09:08:18 +08:00
LHXuuu
45a573cff1 [Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Support MoE model W4A8 dynamic weight.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
2026-02-02 16:39:32 +08:00
wjunLu
4970de4242 [CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195)
### What this PR does / why we need it?
Enable the tests that were skipped due to an outdated driver version:
- tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py
- tests/e2e/multicard/4-cards/long_sequence/test_basic.py
- tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py

and some cases in
- tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py
- tests/e2e/multicard/2-cards/test_external_launcher.py
- tests/e2e/multicard/2-cards/test_offline_weight_load.py
- tests/e2e/multicard/2-cards/test_quantization.py
- tests/e2e/multicard/4-cards/test_data_parallel_tp2.py

TODO:
- tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py
- tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-29 22:41:41 +08:00
Qiu
50e0e87646 [bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344)
### What this PR does / why we need it?
PR #5672 attempted to remove the -1 padding for duplicate tokens in the
decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler
slicing approach. However, in the single-ops logic and mixed PD batches,
the decode slot_mapping did not eliminate the -1 and also shared the
slicing method, resulting in incorrect slot_mapping. This PR resolves
this issue, and the logic will be further consolidated in subsequent
refactoring PRs.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-29 16:48:37 +08:00
wangxiyuan
f8e76a49fa [CI] Upgrade trasnformers version (#6307)
Upgrade transformers to >=4.56.4

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-28 14:06:39 +08:00
meihanc
fea197ad50 [Main2Main] Upgrade vllm commit to 0123 (#6169)
### What this PR does / why we need it?
1.  Upgrade vllm commit to: 0115
(8471b27df97c3eb79f891802fc0e858f8f7ac6a0)
Modify import paths due to the refactors:
https://github.com/vllm-project/vllm/pull/32245
https://github.com/vllm-project/vllm/pull/32060
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913
2. Upgrade vllm commit to: 0119
(9a1f16da1e423ede2c2f52a9850cbfbb39cefe96)
Fix `WorkerProc.__init__() missing 1 required positional argument:
'is_driver_worker'` due to
https://github.com/vllm-project/vllm/pull/28506
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569
3. Upgrade vllm commit to:
0120(148117ea2e689cd43df4be6892671a17cdae5833)
1. Add `skip_compiled` param in `set_forward_context` due to
https://github.com/vllm-project/vllm/pull/30385
2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to
https://github.com/vllm-project/vllm/pull/24322
change `self.max_num_tokens =
vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size`
3. Modify UT import paths due to the
refactors:https://github.com/vllm-project/vllm/pull/32060
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946
4. Upgrade vllm commit to:
0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9)
1. vLLM switched `uses_mrope` from target to draft model config, making
`positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's
direct self.positions access and tests missing
`draft_model_config.uses_mrope`.
https://github.com/vllm-project/vllm/pull/32048
2. Moved bs_to_padded_graph_size from CompilationConfig to
CudagraphDispatcher due to the refactor
https://github.com/vllm-project/vllm/pull/30143
3. Remove unused `maybe_setup_kv_connector` due to
https://github.com/vllm-project/vllm/pull/32077
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834
6. Upgrade vllm commit to:
0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5)
Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig
due to https://github.com/vllm-project/vllm/pull/32414
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054
8. Upgrade vllm commit to:
0123(dc917cceb877dfd13f98c538c4c96158047d98bd)
Setting temperature=0.0 due to the removal of the default temperature
value in https://github.com/vllm-project/vllm/pull/32723
Test result:
https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wjunLu <wjunlu217@gmail.com>
2026-01-27 08:44:36 +08:00
Li Wang
c38c838d03 [CI] Decrease Qwen3 dense model output throughput baseline to make ci happy (#6233)
### What this PR does / why we need it?
As
https://github.com/vllm-project/vllm-ascend/actions/runs/21327913593/job/61388195448
shows, I encountered two CI failures. The results consistently pointed
to a reduced throughput result: 1600 -> 1514.

- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-26 09:04:13 +08:00
starmountain1997
6c73b88dd6 [CI] Enable FLASHCOMM1 with layer_sharding and FULL_DECODE_ONLY in ds32 testing (#6115)
### What this PR does / why we need it?

This PR enables FLASHCOMM1 communication optimization with layer
sharding for DeepSeek-V3.2 W8A8 model testing to validate PR #5702. The
changes include:

1. Enable FLASHCOMM1: set VLLM_ASCEND_ENABLE_FLASHCOMM1=1, which
improves performance for distributed inference.
2. Add layer sharding: configure layer_sharding: ["q_b_proj", "o_proj"].
4. Update baselines: adjust performance baselines to reflect the
improvements from FLASHCOMM1 and layer sharding.

### Does this PR introduce _any_ user-facing change?

No. This is a CI/test-only change that enables new communication
optimization features for testing purposes.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-01-23 19:48:37 +08:00
maxmgrdv
ef9d8367f5 [Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method (#5143)
Introduce W4A4 LAOS Quantization for better model compression and
inference efficiency on Ascend devices.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-22 10:34:58 +08:00
wangxiyuan
69740039b7 [CI] Upgrade CANN to 8.5.0 (#6070)
### What this PR does / why we need it?
1. Upgrade CANN to 8.5.0
2. move triton-ascend 3.2.0 to requirements

Note: we skipped the two failing e2e tests; see
https://github.com/vllm-project/vllm-ascend/issues/6076 for more detail.
We'll fix them soon.


### How was this patch tested?
Closes: https://github.com/vllm-project/vllm-ascend/issues/5494

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-22 09:29:50 +08:00
Li Wang
8cf1e8d8a7 [CI] Add wait logic for each individual case (#6036)
### What this PR does / why we need it?
Wait until the NPU memory is clean
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: leo-pony <nengjunma@outlook.com>
2026-01-20 21:05:44 +08:00
Icey
402872050a [Tests] move qwen3 performance test from nightly to e2e (#5980)
### What this PR does / why we need it?
Move the qwen3 performance test from nightly to e2e to intercept
performance degradation.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-20 17:08:43 +08:00
zhangxinyuehfad
4f446aec4c [CI] Add DeepSeek-V3.2-W8A8-Pruning e2e test (#5922)
### What this PR does / why we need it?
1. Fix DeepSeek-V3.2-W8A8-Pruning mtp
2. Add DeepSeek-V3.2-W8A8-Pruning e2e test

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-16 15:49:57 +08:00
LI SHENGYONG
da958ee386 [EPLB]Eplb Config Renaming (#5533)
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depends on pr5817.

### Does this PR introduce _any_ user-facing change?

before this pr:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`

after this pr: 
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`

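The renaming above can be sketched as a small migration function. This is only an illustration under stated assumptions: `migrate_eplb_config` is a hypothetical helper, not part of vllm-ascend, and the set of keys that move into `eplb_config` unchanged is inferred from the before/after example.

```python
# Old top-level key -> new key inside "eplb_config" (from the renaming list above).
RENAMED_KEYS = {
    "num_iterations_eplb_update": "expert_heat_collection_interval",
    "num_wait_worker_iterations": "algorithm_execution_interval",
    "init_redundancy_expert": "num_redundant_experts",
}
# Keys that keep their name but move into "eplb_config" (per the before/after example).
MOVED_KEYS = ("dynamic_eplb", "expert_map_path")
# Keys removed by this PR.
DROPPED_KEYS = ("gate_eplb",)


def migrate_eplb_config(old: dict) -> dict:
    """Hypothetical helper: convert an old-style additional config to the new layout."""
    new = {}
    eplb = {}
    for key, value in old.items():
        if key in DROPPED_KEYS:
            continue  # gate_eplb no longer exists
        if key in RENAMED_KEYS:
            eplb[RENAMED_KEYS[key]] = value
        elif key in MOVED_KEYS:
            eplb[key] = value
        else:
            new[key] = value  # unrelated options stay at the top level
    if eplb:
        new["eplb_config"] = eplb
    return new
```

Applied to the "before" example, this returns the dict shown in the "after" example.
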
### How was this patch tested?

#### test qwen3-235b eplb num_redundant_experts=16

without pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

with pr5817
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-15 10:26:44 +08:00
LHXuuu
0415e694cd [Quantization] Support compressed tensors moe w8a8 int8 dynamic weight (#5718)
### What this PR does / why we need it?
While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Support MoE model W8A8 Int8 dynamic weight.
2. Specify W4A16 quantization configuration.

Co-authored-by: menogrey 1299267905@qq.com
Co-authored-by: kunpengW-code 1289706727@qq.com

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-01-14 09:17:26 +08:00
drslark
48ec97821a [Bugfix] Fixed an accuracy problem of sp with eagle3 (#5816)
### What this PR does / why we need it?
Fixed an accuracy problem when using eagle3 with sp.

The problem is described in
https://github.com/vllm-project/vllm-ascend/issues/5825.

It also adds a much more precise way to determine whether drafter should
use `sp` or not.

Also, it changes the `eager` of drafter to be a real `eager` in frontend
to avoid a `fx-graph` problem.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

For simplicity, we test it as in
https://github.com/vllm-project/vllm-ascend/issues/5825.

We get the same result as `eagle3` with `sp` disabled.

```text
--------------------------------------------------
total_num_output_tokens: 1000
num_drafts: 437
num_draft_tokens: 1311
num_accepted_tokens: 564
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.62
acceptance at token 1: 0.40
acceptance at token 2: 0.27
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
```
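
As a reading aid, the numbers in this output are mutually consistent under one assumption: that the mean acceptance length counts one token produced by the target model per draft round plus the accepted draft tokens. The function name below is hypothetical; the inputs are taken from the output above.

```python
def mean_acceptance_length(num_drafts: int, num_accepted_tokens: int) -> float:
    """Assumed definition: 1 target-model token per draft round plus accepted draft tokens."""
    return 1 + num_accepted_tokens / num_drafts


# Values from the output above.
num_drafts = 437
num_draft_tokens = 1311
num_accepted_tokens = 564

print(f"{mean_acceptance_length(num_drafts, num_accepted_tokens):.2f}")  # 2.29
print(num_draft_tokens // num_drafts)  # 3 draft tokens proposed per round
```

The per-position acceptance rates at tokens 0 to 2 (0.62 + 0.40 + 0.27 = 1.29) also match `num_accepted_tokens / num_drafts` to two decimals.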

* vLLM version: v0.13.0
* vLLM main:
2f4e6548ef

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-14 09:00:37 +08:00
Levi
ecd4232698 [Feat] flashcomm2+oshard Generalized (#4723)
### What this PR does / why we need it?
[FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf)
introduces redundant storage of the o_proj matrix, which imposes
pressure on GPU memory. We propose the FlashComm2+Oshard approach by
integrating the shared linear layer feature (#2931). This approach
distributes weights layer-by-layer to each GPU and accesses the o_proj
of each layer via asynchronous broadcast operations, thereby alleviating
memory pressure while achieving nearly lossless performance compared to
the original FlashComm2. This PR implements a generalized
FlashComm2+Oshard solution.

Use the following environment variable and config to enable FlashComm2 with Oshard:

```shell
export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1
--additional-config '{
  "layer_sharding": ["o_proj"]
}'
```

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2026-01-10 22:57:57 +08:00
ZT-AIA
e11ff8e535 [BufFix]Fix the error when using Ascend custom operators with rank=128 (#5394)
### What this PR does / why we need it?
The custom Ascend operators sgmv_expand and sgmv_shrink apply only
when the rank is 8, 16, 32, or 64. When rank >= 128, the operators
are out of range, causing the model to report an error.
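
The rank restriction can be expressed as a simple guard. This is a hypothetical sketch, not the code in this PR: the names `SUPPORTED_SGMV_RANKS` and `can_use_custom_sgmv` are invented for illustration, and what the caller does for unsupported ranks is not stated in this description.

```python
# Ranks the custom sgmv_expand / sgmv_shrink operators handle, per the description above.
SUPPORTED_SGMV_RANKS = frozenset({8, 16, 32, 64})


def can_use_custom_sgmv(rank: int) -> bool:
    """Hypothetical guard: True only when the LoRA rank is in the supported set."""
    return rank in SUPPORTED_SGMV_RANKS


for rank in (16, 64, 128):
    print(rank, can_use_custom_sgmv(rank))
```

A caller could check this guard before dispatching to the custom operators, so that a rank of 128 or more never reaches them.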
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Depends on this commit https://github.com/vllm-project/vllm/pull/31408 
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
2026-01-09 15:57:43 +08:00
LI SHENGYONG
b69db4ce55 [EPLB][CI] EPLB add aclgraph and redundant expert ci (#5625)
### What this PR does / why we need it?
EPLB currently does not have CI related to aclgraph and redundancy
experts; this PR adds them.
release on #5529

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tested the use cases to be added in this PR.

PASSED

====================================================== warnings summary
==========================================================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type
SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type
SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 1 passed, 2
warnings in 272.24s (0:04:32)
=====================================================

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-08 09:51:48 +08:00
Li Wang
1165b2c863 [1/N][CI] Refactor accuracy test (#5400)
### What this PR does / why we need it?
1. Accuracy testing no longer compares eager and graph modes; instead,
it directly extracts the golden result under the graph mode
configuration (the implicit purpose of this case is to verify whether
modifications affect existing results)
2. Next step: finer-grained supervision of logits/sampler results
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-07 20:58:15 +08:00
wangxiyuan
6f7a81cd9f [CI] cleanup single/multi-card test (#5623)
1. speed up e2e light test.
2. create `2-cards` and `4-cards` folder in multicard
3. move ops to nightly
4. run tests in alphabetical order

- vLLM version: v0.13.0
- vLLM main:
8be6432bda

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-07 14:13:34 +08:00