xc-llm-ascend

Author	SHA1	Message	Date
SILONG ZENG	b804eb12f6	[CI]Nightly test use `main` (#6502 ) ### What this PR does / why we need it? Nightly test use `main` - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-03 15:40:59 +08:00
zhangyiming	41d48cb974	[CI] Update doctest from 0.9.1 to 0.13.0, and copy doc test workflow to nightly CI for better monitor. (#6452 ) ### What this PR does / why we need it? [CI] Update doctest from 0.9.1 to 0.13.0, and copy doc test workflow to nightly CI for better monitor. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: menogrey <1299267905@qq.com>	2026-02-03 15:19:03 +08:00
Feng Liu	03a18ad6fd	[E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149 ) ### What this PR does / why we need it? Add E2E for Prefix Caching cp & Chunked Prefill cp ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: F.Liu <liufeng248@huawei.com> Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com> Co-authored-by: F.Liu <liufeng248@huawei.com>	2026-02-03 15:04:14 +08:00
Shaoxu Cheng	39e77fb9e4	[Feat.]: support 310p w8a8 (#6454 ) ### What this PR does / why we need it? Introduced 310P W8A8 Quantization Support: New modules and methods have been added to enable W8A8 static quantization specifically for the Ascend 310P platform. Platform-Specific Quantization Configuration Loading: The system now dynamically loads the appropriate quantization configurations (AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether the current hardware is an Ascend 310P device. Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization method for 310P is provided, handling the specifics of weight and activation quantization, including input parameter broadcasting and weight data manipulation. Extended AscendModelSlimConfig for 310P: A specialized configuration class for 310P integrates the new W8A8 linear method for both standard linear layers and vocabulary parallel embeddings, ensuring proper quantization application. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com> Signed-off-by: Shaoxu Cheng <2906339855@qq.com>	2026-02-03 14:13:06 +08:00
SILONG ZENG	f4a72f0d16	[CI]Disable early exit to complete all tests (#6482 ) ### What this PR does / why we need it? 1. Disable the feature to exit early upon encountering an error in order to complete all tests. 2. Within each partition, tests are re-sorted by `estimated_time` in ascending order. This allows the CI to cover as many test cases as possible in the early stages. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-03 11:25:51 +08:00
LHXuuu	45a573cff1	[Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889 ) ### What this PR does / why we need it? While using the LLM Compressor quantization tool from the VLLM community to generate quantized weights, the VLLM Ascend engine needs to be adapted to support the compressed tensors quantization format. 1. Support Moe model W4A8 dynamic weight. - vLLM version: v0.13.0 - vLLM main: `bde38c11df` --------- Signed-off-by: LHXuuu <scut_xlh@163.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: menogrey <1299267905@qq.com>	2026-02-02 16:39:32 +08:00
wangxiyuan	eeedf7c503	[Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470 ) ### What this PR does / why we need it? This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This involves: - Updating the `VLLM_TAG` in all `Dockerfile`. - Updating the vLLM version in `docs/source/conf.py`. - Removing conditional code paths specific to `v0.14.1` across the codebase, which simplifies maintenance. - Fix `TypeError: MMEncoderAttention.__init__() got an unexpected keyword argument 'multimodal_config'` due to https://github.com/vllm-project/vllm/pull/31972. - Fix `_shared_experts: 'NoneType' object is not callable` due to https://github.com/vllm-project/vllm/pull/32082 by https://github.com/vllm-project/vllm-ascend/pull/6335. - Fix `ReshapeAndCacheOperation setup failed!` due to https://github.com/vllm-project/vllm/pull/25954 by overriding attention metadata slots. This upgrade is necessary to keep the project aligned with the latest features, bug fixes, and API changes in the vLLM project. ### Does this PR introduce _any_ user-facing change? No, this is an internal dependency update and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with these changes, ensuring that all existing tests are successful with the new vLLM version. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` co-authored-by: shen-shanshan <467638484@qq.com> --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-02 15:57:55 +08:00
wangxiyuan	f7dc7d9b86	[CI] support build wheel and docker image by workflow (#6453 ) Make image and wheel build CI job work with workflow_dispatch way - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:22 +08:00
wangxiyuan	b4aafd4293	[Core][Misc] Clean up ProfileExecuteDuration (#6461 ) ### What this PR does / why we need it? This PR removes the custom `ProfileExecuteDuration` utility and its usages across the codebase. This utility was used for profiling execution duration of different stages in the inference process. It is replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`, which integrates with PyTorch's profiler. This change simplifies the code by removing a custom implementation in favor of an upstream utility, improving maintainability. Associated documentation and tests for `ProfileExecuteDuration` are also removed. ### Does this PR introduce _any_ user-facing change? `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now. ### How was this patch tested? CI passed. The changes are a cleanup and replacement with a standard utility. Existing tests cover the functionality. The removed feature had its own tests which are also removed. Related RFC: #5304 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-01 20:06:01 +08:00
Li Wang	8969b94a14	[Nightly] Correct nightly image build ref (#6420 ) ### What this PR does / why we need it? The underlying tags for nightly image builds have been corrected, and some useless and confusing workflow fields have been removed. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-30 15:55:58 +08:00
zhangxinyuehfad	1d661bb279	[Bugfix] Specify tensorflow version in accuracy test to avoid segmentation fault (#6292 ) ### What this PR does / why we need it? Specify tensorflow version in accuracy test to avoid segmentation fault - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-30 09:28:24 +08:00
CodeCat	b2857de43f	[ST]Add e2e test for Npugraphex_pass (#6388 ) ### What this PR does / why we need it? We found the custom passes of NPUGraphEX have implemented fusion operator features, which still require E2E test case validation and guard. This PR implements E2E test cases for the AddRMSNormQuant and SplitQKVNormRope operator fusions under NPUGraphEX that are already in the codebase. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: cjian <2318164299@qq.com>	2026-01-30 09:14:07 +08:00
wjunLu	4970de4242	[CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195 ) ### What this PR does / why we need it? Enable the tests that were skipped due to an outdated driver version: - tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py - tests/e2e/multicard/4-cards/long_sequence/test_basic.py - tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py and some cases in - tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py - tests/e2e/multicard/2-cards/test_external_launcher.py - tests/e2e/multicard/2-cards/test_offline_weight_load.py - tests/e2e/multicard/2-cards/test_quantization.py - tests/e2e/multicard/4-cards/test_data_parallel_tp2.py TODO: - tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py - tests/e2e/multicard/4-cards/long_sequence/test_mtp.py ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-29 22:41:41 +08:00
Li Wang	e35f304419	[CI] Auto partition for test cases (#6379 ) ### What this PR does / why we need it? This patch add auto-partition feat for tests, for example, before this pr, we are running e2e single card test for 2h40min, after the auto partition, test case is automatically allocated into the required n parts based on its test duration (greedy strategy) and run in parallel. The advantage of doing this is that our overall test duration will become 1/n of the original. ### Does this PR introduce _any_ user-facing change? Before: e2e single card test spend 2h40min After: e2e single card test spend 1h13min ### How was this patch tested? ```shell python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 0 args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=0, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600) +----------------+--------------------+ \| Suite \| Partition \| \|----------------+--------------------\| \| e2e-singlecard \| 1/2 (0-based id=0) \| +----------------+--------------------+ ✅ Enabled 13 test(s) (est total 4020.0s): - tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py (est_time=1800) - tests/e2e/singlecard/test_aclgraph_accuracy.py (est_time=480) - tests/e2e/singlecard/test_guided_decoding.py (est_time=354) - tests/e2e/singlecard/test_batch_invariant.py (est_time=320) - tests/e2e/singlecard/pooling/test_embedding.py (est_time=270) - tests/e2e/singlecard/test_quantization.py (est_time=200) - tests/e2e/singlecard/test_llama32_lora.py (est_time=162) - tests/e2e/singlecard/test_cpu_offloading.py (est_time=132) - tests/e2e/singlecard/pooling/test_classification.py (est_time=120) - tests/e2e/singlecard/test_camem.py (est_time=77) - tests/e2e/singlecard/compile/test_norm_quant_fusion.py (est_time=70) - tests/e2e/singlecard/test_auto_fit_max_mode_len.py (est_time=25) - 
tests/e2e/singlecard/test_profile_execute_duration.py (est_time=10) (base) wangli@Mac-mini vllm-ascend % python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 1 args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=1, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600) +----------------+--------------------+ \| Suite \| Partition \| \|----------------+--------------------\| \| e2e-singlecard \| 2/2 (0-based id=1) \| +----------------+--------------------+ ✅ Enabled 13 test(s) (est total 4025.0s): - tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py (est_time=1500) - tests/e2e/singlecard/pooling/test_scoring.py (est_time=500) - tests/e2e/singlecard/test_aclgraph_batch_invariant.py (est_time=410) - tests/e2e/singlecard/test_vlm.py (est_time=354) - tests/e2e/singlecard/test_models.py (est_time=300) - tests/e2e/singlecard/test_multistream_overlap_shared_expert.py (est_time=200) - tests/e2e/singlecard/test_sampler.py (est_time=200) - tests/e2e/singlecard/test_async_scheduling.py (est_time=150) - tests/e2e/singlecard/test_aclgraph_mem.py (est_time=130) - tests/e2e/singlecard/test_ilama_lora.py (est_time=95) - tests/e2e/singlecard/test_completion_with_prompt_embeds.py (est_time=76) - tests/e2e/singlecard/test_qwen3_multi_loras.py (est_time=65) - tests/e2e/singlecard/test_xlite.py (est_time=45) ``` - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-29 20:28:10 +08:00
Li Wang	8b0a7b6d80	[CI] Nightly tests use `releases/v0.13.0` (#6355 ) ### What this PR does / why we need it? The prerequisite PR is https://github.com/vllm-project/vllm-ascend/pull/6353. This patch moves the nightly tests to `releases/v0.13.0`; all that is needed is to use the image built from that branch for the nightly run. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-28 21:46:13 +08:00
Li Wang	501bb395b1	[CI] Fix image build (#6333 ) Try to fix schedule image build CI - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-28 21:36:44 +08:00
Shaoxu Cheng	857c533e27	[CI]: add production safeguards for 300I (#6343 ) Update 310p files tracker to enable 310p e2e test per PR. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-28 16:43:48 +08:00
wangxiyuan	f8e76a49fa	[CI] Upgrade transformers version (#6307 ) Upgrade transformers to >=4.56.4 - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-28 14:06:39 +08:00
meihanc	fea197ad50	[Main2Main] Upgrade vllm commit to 0123 (#6169 ) ### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27df97c3eb79f891802fc0e858f8f7ac6a0) Modify import paths due to the refactors： https://github.com/vllm-project/vllm/pull/32245 https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16da1e423ede2c2f52a9850cbfbb39cefe96) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to https://github.com/vllm-project/vllm/pull/28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117ea2e689cd43df4be6892671a17cdae5833) 1. Add `skip_compiled` param in `set_forward_context` due to https://github.com/vllm-project/vllm/pull/30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to https://github.com/vllm-project/vllm/pull/24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors：https://github.com/vllm-project/vllm/pull/32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a7c1b61350c5c40ca1115d3bf8cf2b8cc9) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. https://github.com/vllm-project/vllm/pull/32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor https://github.com/vllm-project/vllm/pull/30143 3. 
Remove unused `maybe_setup_kv_connector` due to https://github.com/vllm-project/vllm/pull/32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271bb6d1e7e9b1a55be73d755ef1a57dbbe5) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to https://github.com/vllm-project/vllm/pull/32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cceb877dfd13f98c538c4c96158047d98bd) Setting temperature=0.0 due to the removal of the default temperature value in https://github.com/vllm-project/vllm/pull/32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>	2026-01-27 08:44:36 +08:00
Icey	9780a995e1	[BugFix] Fix wheel package build workflow (#6276 ) ### What this PR does / why we need it? Fixes https://github.com/vllm-project/vllm-ascend/actions/runs/21348357385/job/61440051717 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-26 20:42:17 +08:00
InSec	595b57c4d4	[CI][BugFix] Qwen3-Next nightly test fix. (#6247 ) ### What this PR does / why we need it? Qwen3-Next nightly test fix. Temporarily avoid the accuracy issue in the full graph mode. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: InSec <1790766300@qq.com>	2026-01-26 19:53:53 +08:00
Li Wang	de095c5fed	[CI] Add workflow_dispatch for nightly image build (#6269 ) ### What this PR does / why we need it? Currently, the nightly image is built at 20:00 and 23:00 UTC+8. Due to some timeliness requirements, we need to add a new trigger method for nightly image builds. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 15:56:38 +08:00
dependabot[bot]	1f26f83e34	[CI] Bump actions/checkout from 4 to 6 (#6255 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-01-26 14:21:00 +08:00
dependabot[bot]	ae71c4237e	[CI] Bump actions/setup-python from 6.1.0 to 6.2.0 (#6256 ) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 6.1.0 to 6.2.0. - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-01-26 14:20:14 +08:00
Li Wang	c26ad78f86	[CI][lint] Add rule `codespell` back (#6236 ) ### What this PR does / why we need it? Some time after removing codespell, we discovered that typo had trouble recognizing certain misspelled words, so I suggest adding it back. - vLLM version: v0.14.1 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 14:12:33 +08:00
wangxiyuan	f4abd9b7b5	[CI] Fix 310p image build (#6259 ) Fix 310p docker image build error - vLLM version: v0.14.1 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-26 14:11:56 +08:00
huangning1995	ce11fd49f3	[Feature] Batch invariant torch.compile (#6107 ) ### What this PR does / why we need it? Building upon https://github.com/vllm-project/vllm-ascend/pull/5517 to enable batch-invariant in vllm-ascend, we observed that the performance of BI in eager mode remains suboptimal. This PR further integrates batch-invariant with torch.compile, which improves inference performance by 350% when tested with Qwen3-0.6B. ### Does this PR introduce _any_ user-facing change? Previously, enabling both aclgraph and Batch-Invariant would cause an "ub overflow" error. This occurred because transposed input tensors could produce incorrect stride() values. To fix this, we now call .contiguous() on the input tensors before passing them to Triton kernels. This ensures a contiguous memory layout and prevents transposed tensors from causing incorrect stride calculations. ### Test Plan pytest -sv --durations=0 tests/e2e/singlecard/test_aclgraph_batch_invariant.py ### Test Result ``` ============================================================================ slowest durations ============================================================================ 87.37s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_v1_generation_is_deterministic_across_batch_sizes_with_needle 77.39s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN 74.04s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_logprobs_without_batch_invariance_should_fail 73.59s call tests/e2e/singlecard/test_aclgraph_batch_invariant.py::test_simple_generation (8 durations < 0.005s hidden. Use -vv to show these durations.) 
================================================================ 4 passed, 3 warnings in 312.45s (0:05:12) ================================================================ ``` ### Performance export VLLM_BATCH_INVARIANT=1 vllm serve /home/Qwen3-0.6B \ --served-model-name qwen \ --port 8000 \ --max-num-seqs 256 \ --tensor-parallel-size 1 \ --max-model-len 5500 \ --max-num-batched-tokens 5500 \ --reasoning-parser qwen3 \ --gpu-memory-utilization 0.9 \ --compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,32]}' \ --additional-config '{"ascend_scheduler_config":{"enabled":true},"enable_weight_nz_layout":true}' vllm bench serve --served-model-name qwen --trust-remote-code --backend vllm --model /home/Qwen3-0.6B/ --endpoint /v1/completions --dataset-name random --random-input-len 512 --random-output-len 256 --num-prompts 800 --max-concurrency 8 torch.compile batch invariant performance: ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 477.21 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 1.68 Output token throughput (tok/s): 429.16 Peak output token throughput (tok/s): 472.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 1287.48 ---------------Time to First Token---------------- Mean TTFT (ms): 285.53 Median TTFT (ms): 312.70 P99 TTFT (ms): 324.22 -----Time per Output Token (excl. 
1st token)------ Mean TPOT (ms): 17.59 Median TPOT (ms): 17.50 P99 TPOT (ms): 18.44 ---------------Inter-token Latency---------------- Mean ITL (ms): 17.59 Median ITL (ms): 17.45 P99 ITL (ms): 18.76 ================================================== ``` Eager ``` ============ Serving Benchmark Result ============ Successful requests: 800 Failed requests: 0 Maximum request concurrency: 8 Benchmark duration (s): 1694.70 Total input tokens: 409600 Total generated tokens: 204800 Request throughput (req/s): 0.47 Output token throughput (tok/s): 120.85 Peak output token throughput (tok/s): 136.00 Peak concurrent requests: 16.00 Total token throughput (tok/s): 362.54 ---------------Time to First Token---------------- Mean TTFT (ms): 164.29 Median TTFT (ms): 129.71 P99 TTFT (ms): 1961.66 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 65.81 Median TPOT (ms): 65.15 P99 TPOT (ms): 72.27 ---------------Inter-token Latency---------------- Mean ITL (ms): 65.81 Median ITL (ms): 64.64 P99 ITL (ms): 75.72 ================================================== ``` - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: huangning1995 <huangning12@huawei.com>	2026-01-26 09:15:06 +08:00
Li Wang	63adbedb7a	[Worker] Implement update max_model_len interface for NPUWorker (#6193 ) ### What this PR does / why we need it? This patch adds the `update_max_model_len` interface. - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:03:33 +08:00
Li Wang	ca297eb57f	[CI] Migrate e2e test runner to hk (#5344 ) ### What this PR does / why we need it? This patch adds new runner labels for the HK region and migrates e2e single-card testing to this runner. - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-26 09:00:51 +08:00
wangxiyuan	99bdd7363c	[CI] update vLLM to 0.14.1 (#6222 ) Upgrade vLLM to 0.14.1 - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-25 17:52:16 +08:00
Shaoxu Cheng	fbae41697e	[310P]: refactoring for 310p kvcache and some ops class (#6117 ) ### What this PR does / why we need it? * Refactor the LayerNorm and activation operator classes to decouple the 310P device implementation from the main branch. * Refactor `mm_encoder_attention` on 310P to use the `torch_npu._npu_flash_attention_unpad` operator. * Refactor the QKV inputs in the prefill stage of `attention_v1` on 310P so they are no longer padded to 16× alignment. * Refactor `model_runner` on 310P to align the KV-cache initialization logic with the mainline implementation. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? use the e2e tests. - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: Tflowers-0129 <2906339855@qq.com>	2026-01-24 20:34:29 +08:00
Icey	4fcacca8a6	[BugFix] Fix build wheel (#6218 ) ### What this PR does / why we need it? - Fixes https://github.com/vllm-project/vllm-ascend/actions/runs/21312847954/job/61351587180 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-24 20:08:20 +08:00
Li Wang	af4dbb6b26	[CI] Use nginx for package cache to speed up CI (#6170 ) ### What this PR does / why we need it? Use nginx for package cache to speed up CI - vLLM version: v0.14.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-23 16:56:16 +08:00
zhangxinyuehfad	193acc2c19	[CI] Add nightly ci test for deepseek v3.1 (#5386 ) ### What this PR does / why we need it? Add nightly ci test for deepseek v3.1 - vLLM version: release/v0.13.0 - vLLM main: `bc0a5a0c08` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 14:36:49 +08:00
wjunLu	72ffc00b86	[Bugfix] Fix structured outputs errors: `TypeError: apply_token_bitmask_inplace_cpu()` (#6151 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/5524 - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-23 09:52:55 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
wjunLu	88632cf976	[CI][Doc] Upgrade wheel building's CANN to 8.5.0 and update the Docs (#6145 ) ### What this PR does / why we need it? Upgrade wheel building's CANN to 8.5.0 and update the Docs - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: wjunLu <wjunlu217@gmail.com>	2026-01-22 19:50:54 +08:00
meihanc	e54d294df3	[CI]Install clang in Dockerfile for triton ascend (#4409 ) ### What this PR does / why we need it? Install clang in Dockerfile for triton ascend - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-22 19:01:28 +08:00
Li Wang	484e7c59dc	[CI] optimize lint term (#5986 ) ### What this PR does / why we need it? This patch optimizes the lint check stage. The main idea is to reduce unnecessary installation time. 1. Installing vllm is not required; appending the vllm source path to `PYTHONPATH` is sufficient. 2. Installing `requirements-dev.txt` is not required either; a pre-built image `quay.io/ascend-ci/vllm-ascend:lint` has all the requirements installed in advance. NOTE: image builds are triggered by: 1) a daily scheduled build; 2) a change to the requirements; 3) a manual build. This keeps the dependencies in the image as up to date as possible. 3. `mypy` was separated from the `pre-commit` hook for performance reasons; integrating `mypy` into the `pre-commit` hook resulted in poor performance. 4. Reduce CPU core consumption from 16 to 8. ### Does this PR introduce _any_ user-facing change? The end-to-end lint time dropped from 20 min per PR to 8 min per PR. ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-22 15:46:59 +08:00
maxmgrdv	ef9d8367f5	[Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method (#5143 ) Introduce W4A4 LAOS Quantization for better model compression and inference efficiency on Ascend devices. - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-22 10:34:58 +08:00
wangxiyuan	69740039b7	[CI] Upgrade CANN to 8.5.0 (#6070 ) ### What this PR does / why we need it? 1. Upgrade CANN to 8.5.0 2. move triton-ascend 3.2.0 to requirements note: we skipped the two failed e2e test, see https://github.com/vllm-project/vllm-ascend/issues/6076 for more detail. We'll fix it soon. ### How was this patch tested? Closes: https://github.com/vllm-project/vllm-ascend/issues/5494 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-22 09:29:50 +08:00
meihanc	53bfb38192	[CI]Update triton ascend version in 3.2.0 (#6067 ) ### What this PR does / why we need it? update triton ascend version in 3.2.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-21 16:02:23 +08:00
Li Wang	0c0514579f	[CI][Lint] Show lint diff on failure (#5956 ) ### What this PR does / why we need it? Currently, some lint checks apply automatic code corrections by default but only show which files were modified (without specifying the changes). In a CI environment, a small optimization is to show which lines were modified, giving developers a more specific hint. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: wangli <wangli858794774@gmail.com>	2026-01-20 21:07:01 +08:00
zhangxinyuehfad	750c06c78a	[CI] Add DeepSeek-V3.2-W8A8 nightly ci test (#4633 ) ### What this PR does / why we need it? Add DeepSeek-V3.2-W8A8 nightly ci test： DeepSeek-V3.2-W8A8 1node DP2+TP8 :tests/e2e/nightly/models/test_deepseek_v3_2_w8a8.py ### Does this PR introduce _any_ user-facing change - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-20 21:05:15 +08:00
Icey	402872050a	[Tests] move qwen3 performance test from nightly to e2e (#5980 ) ### What this PR does / why we need it? Move the qwen3 performance test from nightly to e2e to intercept performance degradation. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` --------- Signed-off-by: wxsIcey <1790571317@qq.com>	2026-01-20 17:08:43 +08:00
meihanc	ea57e3e7a4	[Main2Main] Upgrade vllm commit to releases/v0.14.0 (#5988 ) ### What this PR does / why we need it? Upgrade vllm commit to releases/v0.14.0 - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>	2026-01-20 15:10:40 +08:00
LeeWenquan	55b20ac63b	[Ops] Add layernorm for qwen3Next (#5765 ) ### What this PR does / why we need it? Add layernormFn triton op for qwen3Next model for better performance. <img width="248" height="526" alt="image" src="https://github.com/user-attachments/assets/27b47157-5df5-4db1-aa88-1dae799b2bf6" /> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` --------- Signed-off-by: SunnyLee219 <3294305115@qq.com>	2026-01-20 14:43:14 +08:00
Song Zhixin	2b6dc100b5	Eagle3 mm support, enablement on qwen3vl (#4848 ) ### What this PR does / why we need it? follow pr [https://github.com/vllm-project/vllm/pull/20788](https://github.com/vllm-project/vllm/pull/20788) , Eagle3 mm support, enablement on qwen3vl target model [Qwen/Qwen3-VL-8B-Instruct]([https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct]) eagle3 [MNN/Qwen3-VL-8B-Instruct-Eagle3](https://www.modelscope.cn/models/MNN/Qwen3-VL-8B-Instruct-Eagle3) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? pytest ./tests/e2e/singlecard/test_completion_with_prompt_embeds.py -vv vLLM with eagle3 : ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images --speculative-config '{ "method": "eagle3", "model": "/model/hf/Qwen3-VL-8B-Instruct-Eagle3", "num_speculative_tokens": 3 }' ``` vLLM without eagle3 : ```bash vllm serve /model/Qwen3-VL-8B-Instruct --enforce-eager --port 9100 --max-model-len 32768 --max-num-seqs 32 --tensor-parallel-size 2 --allowed-local-media-path /model/gx/images ``` bench: ``` vllm bench serve --backend openai-chat --base-url http://127.0.0.1:9100 --tokenizer /model/Qwen3-VL-8B-Instruct --endpoint /v1/chat/completions --model /model/Qwen3-VL-8B-Instruct --dataset-name random --num-prompts 50 --max-concurrency 5 --temperature 0 --top-p 1.0 --seed 123 ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: jesse <szxfml@gmail.com>	2026-01-19 08:58:07 +08:00
zhangxinyuehfad	4f446aec4c	[CI] Add DeepSeek-V3.2-W8A8-Pruning e2e test (#5922 ) ### What this PR does / why we need it? 1. Fix DeepSeek-V3.2-W8A8-Pruning mtp 2. Add DeepSeek-V3.2-W8A8-Pruning e2e test ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `11b6af5280` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-16 15:49:57 +08:00
wangxiyuan	69b170b8b5	[CI] skip 310 test for full test (#5943 ) Skip 310 test for full test. It's not stable now. - vLLM version: v0.13.0 - vLLM main: `2c24bc6996` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-01-16 10:36:20 +08:00
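
Note on the auto-partition commits above: #6379 describes allocating test files into n parts by estimated duration with a greedy strategy, and #6482 re-sorts each partition by `estimated_time` in ascending order. The following is a minimal Python sketch of that idea, assuming a longest-first greedy assignment to the currently lightest partition. The actual logic in `.github/workflows/scripts/run_suite.py` is not shown in this log and may differ. `SuiteEntry` and `auto_partition` are hypothetical names; the paths and estimates are copied from the #6379 output.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteEntry:
    """Hypothetical record for one test file and its estimated runtime."""
    path: str
    estimated_time: float  # seconds


def auto_partition(entries, size):
    """Split entries into `size` partitions using a greedy longest-first strategy."""
    if size < 1:
        raise ValueError("size must be >= 1")
    # Min-heap of (current load, partition index): the lightest partition is on top.
    heap = [(0.0, i) for i in range(size)]
    partitions = [[] for _ in range(size)]
    # Place the longest tests first; each goes to the least-loaded partition.
    for entry in sorted(entries, key=lambda e: e.estimated_time, reverse=True):
        load, idx = heapq.heappop(heap)
        partitions[idx].append(entry)
        heapq.heappush(heap, (load + entry.estimated_time, idx))
    # Inside each partition, run short tests first (the ordering described in #6482).
    for part in partitions:
        part.sort(key=lambda e: e.estimated_time)
    return partitions


entries = [
    SuiteEntry("tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py", 1800),
    SuiteEntry("tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py", 1500),
    SuiteEntry("tests/e2e/singlecard/pooling/test_scoring.py", 500),
    SuiteEntry("tests/e2e/singlecard/test_aclgraph_accuracy.py", 480),
    SuiteEntry("tests/e2e/singlecard/test_camem.py", 77),
    SuiteEntry("tests/e2e/singlecard/test_xlite.py", 45),
]
parts = auto_partition(entries, size=2)
loads = [sum(e.estimated_time for e in part) for part in parts]
```

With balanced loads, the wall-clock time of the suite approaches the load of the heaviest partition instead of the sum of all estimates, which is the speed-up the commit message reports. Sorting ascending inside a partition only changes execution order, not the load balance.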

608 Commits