xc-llm-ascend

Author	SHA1	Message	Date
wangxiyuan	9c6d031797	[MISC] Clean up useless env USE_OPTIMIZED_MODEL (#6618 ) Clean up uesless env `USE_OPTIMIZED_MODEL` - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-09 15:38:58 +08:00
SILONG ZENG	06aa6036f6	[Lint]Style: Convert `vllm-ascend/` to ruff format(new Batch #8 ) (#6604 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| vllm_ascend/ops/\_\_init\_\_.py \| \| vllm_ascend/ops/activation.py \| \| vllm_ascend/ops/flashcomm2_oshard_manager.py \| \| vllm_ascend/ops/layernorm.py \| \| vllm_ascend/ops/mla.py \| \| vllm_ascend/ops/mm_encoder_attention.py \| \| vllm_ascend/ops/register_custom_ops.py \| \| vllm_ascend/ops/vocab_parallel_embedding.py \| \| vllm_ascend/ops/weight_prefetch.py \| \| vllm_ascend/spec_decode/\_\_init\_\_.py \| \| vllm_ascend/spec_decode/eagle_proposer.py \| \| vllm_ascend/spec_decode/interface.py \| \| vllm_ascend/spec_decode/mtp_proposer.py \| \| vllm_ascend/spec_decode/ngram_proposer.py \| \| vllm_ascend/spec_decode/suffix_proposer.py \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd` Signed-off-by: MrZ20 <2609716663@qq.com>	2026-02-07 09:16:07 +08:00
wangxiyuan	06c0aed124	[CI] Fix broken CI (#6599 ) Revert `4fb3d5e1b2` it breaks E2E Test - vLLM version: v0.15.0 - vLLM main: `d7e17aaacd`	2026-02-06 17:23:58 +08:00
SILONG ZENG	4fb3d5e1b2	[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #8 ) (#6129 ) ### What this PR does / why we need it? Scope of Changes: \| File Path \| \| :--- \| \| vllm_ascend/ops/\_\_init\_\_.py \| \| vllm_ascend/ops/activation.py \| \| vllm_ascend/ops/flashcomm2_oshard_manager.py \| \| vllm_ascend/ops/layernorm.py \| \| vllm_ascend/ops/mla.py \| \| vllm_ascend/ops/mm_encoder_attention.py \| \| vllm_ascend/ops/register_custom_ops.py \| \| vllm_ascend/ops/vocab_parallel_embedding.py \| \| vllm_ascend/ops/weight_prefetch.py \| \| vllm_ascend/spec_decode/\_\_init\_\_.py \| \| vllm_ascend/spec_decode/eagle_proposer.py \| \| vllm_ascend/spec_decode/interface.py \| \| vllm_ascend/spec_decode/mtp_proposer.py \| \| vllm_ascend/spec_decode/ngram_proposer.py \| \| vllm_ascend/spec_decode/suffix_proposer.py \| ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `d68209402d` Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: SILONG ZENG <2609716663@qq.com>	2026-02-06 15:25:08 +08:00
wangxiyuan	eeedf7c503	[Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470 ) ### What this PR does / why we need it? This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This involves: - Updating the `VLLM_TAG` in all `Dockerfile`. - Updating the vLLM version in `docs/source/conf.py`. - Removing conditional code paths specific to `v0.14.1` across the codebase, which simplifies maintenance. - Fix `TypeError: MMEncoderAttention.__init__() got an unexpected keyword argument 'multimodal_config'` due to https://github.com/vllm-project/vllm/pull/31972. - Fix `_shared_experts: 'NoneType' object is not callable` due to https://github.com/vllm-project/vllm/pull/32082 by https://github.com/vllm-project/vllm-ascend/pull/6335. - Fix `ReshapeAndCacheOperation setup failed!` due to https://github.com/vllm-project/vllm/pull/25954 by overriding attention metadata slots. This upgrade is necessary to keep the project aligned with the latest features, bug fixes, and API changes in the vLLM project. ### Does this PR introduce _any_ user-facing change? No, this is an internal dependency update and does not introduce any user-facing changes. ### How was this patch tested? CI is expected to pass with these changes, ensuring that all existing tests are successful with the new vLLM version. - vLLM version: v0.14.1 - vLLM main: `dc917cceb8` co-authored-by: shen-shanshan <467638484@qq.com> --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-02-02 15:57:55 +08:00
Shanshan Shen	76ac688388	[MM][Perf] Parallelize Q/K/V padding in AscendMMEncoderAttention for better performance (#6204 ) ### What this PR does / why we need it? Currently, we pad the last dim of qkv to 128 before flash attention (in `AscendMMEncoderAttention`) to get better performance on Ascend NPU. However, the qkv padding is executed serially, which may lead to more overhead when launching `aclnnConstantPadNd` (launch 3 times). Since the three operations are mutually independent, we stack qkv first and then pad them in one kernel launch. With this optimization, TTFT has been reduced by 3.15%, peak throughput has been increased by 4.20%. --- ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Launch the server: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --dtype bfloat16 \ --limit-mm-per-prompt '{"image": 1}' \ --max-model-len 16384 \ --max-num-batched-tokens 16384 ``` Run benchmark: ```bash vllm bench serve \ --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --backend openai-chat \ --endpoint /v1/chat/completions \ --dataset-name hf \ --hf-split train \ --dataset-path lmarena-ai/vision-arena-bench-v0.1 \ --num-prompts 1000 \ --no-stream ``` Before this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 1000 Failed requests: 0 Benchmark duration (s): 122.33 Total input tokens: 66638 Total generated tokens: 122845 Request throughput (req/s): 8.17 Output token throughput (tok/s): 1004.18 Peak output token throughput (tok/s): 3073.00 Peak concurrent requests: 1000.00 Total token throughput (tok/s): 1548.90 ---------------Time to First Token---------------- Mean TTFT (ms): 51757.16 Median TTFT (ms): 44853.42 P99 TTFT (ms): 110700.14 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 226.06 Median TPOT (ms): 206.85 P99 TPOT (ms): 935.31 ---------------Inter-token Latency---------------- Mean ITL (ms): 208.82 Median ITL (ms): 96.37 P99 ITL (ms): 2183.13 ================================================== ``` After this PR: ``` ============ Serving Benchmark Result ============ Successful requests: 1000 Failed requests: 0 Benchmark duration (s): 121.47 Total input tokens: 66638 Total generated tokens: 122860 Request throughput (req/s): 8.23 Output token throughput (tok/s): 1011.47 Peak output token throughput (tok/s): 3202.00 Peak concurrent requests: 1000.00 Total token throughput (tok/s): 1560.08 ---------------Time to First Token---------------- Mean TTFT (ms): 50125.08 Median TTFT (ms): 46270.85 P99 TTFT (ms): 108107.12 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 227.11 Median TPOT (ms): 205.13 P99 TPOT (ms): 816.08 ---------------Inter-token Latency---------------- Mean ITL (ms): 204.60 Median ITL (ms): 92.66 P99 ITL (ms): 2219.02 ================================================== ``` - vLLM version: v0.14.0 - vLLM main: `d68209402d` Signed-off-by: shen-shanshan <467638484@qq.com>	2026-01-26 10:20:24 +08:00
zhangxinyuehfad	819a4459ce	Drop vLLM 0.13.0 support (#6069 ) ### What this PR does / why we need it? Drop vLLM 0.13.0 support, upgrade to 0.14.0 - vLLM version: v0.13.0 - vLLM main: `d68209402d` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2026-01-23 09:45:08 +08:00
wjunLu	c11a05c4e1	[Main2Main] Upgrade vllm commit to 0113 (#5839 ) ### What this PR does / why we need it? Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9) - Modify import paths due to the refactors https://github.com/vllm-project/vllm/pull/31916 https://github.com/vllm-project/vllm/pull/32054 - Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to https://github.com/vllm-project/vllm/pull/24498 - Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never verified https://github.com/vllm-project/vllm/pull/31998 - Skip some pooling tests, which are caused by https://github.com/vllm-project/vllm/pull/32148 where vllm is also failed https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4 We will reopen those tests when main2main reachs https://github.com/vllm-project/vllm/pull/32243 - Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by https://github.com/vllm-project/vllm/pull/32118 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: `2f4e6548ef` Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com>	2026-01-15 09:48:53 +08:00
Li Wang	58adf7c8ac	[Bugfix] Correctly handle the output shape in multimodal attention (#5443 ) ### What this PR does / why we need it? Fix https://github.com/vllm-project/vllm-ascend/issues/5297, for `AscendMMEncoderAttention` forward, we should keep the output shape consistence with the input - vLLM version: release/v0.13.0 - vLLM main: `81786c8774` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-27 18:42:46 +08:00
Nengjun Ma	3b59f20a28	update to vllm 12-19 (#5223 ) ### What this PR does / why we need it? ### Does this PR introduce _any_ user-facing change? Fix vllm break: 1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement] (https://github.com/vllm-project/vllm/pull/29558) Fix Solution: Add the now-necessary `all2all_backend` parameter. The impact of this parameter on the original `set_splitting_ops_for_v1` implementation is only that graph mode is disabled in `vllm` if `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic. 2.[Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface ] (https://github.com/vllm-project/vllm/pull/30684) Fix Solution: The reason why the GPU does not need to convert qkv to 3D is that the GPU's flash_attention operator is compatible with 3D and 4D (b s h d and s b ( h d)), but the NPU's flash_attention_unpad operator only supports 3D (s b ( h d)). Therefore, we need to introduce the reshape_qkv_to_3d operation. 4.Skip Tencent-Hunyuan/HunyuanOCR test case, as it has following issue in upgrade vllm code: https://github.com/vllm-project/vllm-ascend/issues/5297 ### How was this patch tested? Co-authored-by: zxwang <1476209578@qq.com> - vLLM version: release/v0.13.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: leo-pony <nengjunma@outlook.com> Signed-off-by: zxwang <1476209578@qq.com> Co-authored-by: zxwang <1476209578@qq.com>	2025-12-23 23:52:11 +08:00
Shanshan Shen	6c478531f8	[CustomOp] Register AscendApplyRotaryEmb CustomOp and remove related patch (#4667 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/29873, register `AscendApplyRotaryEmb` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### ✅ Test Qwen2.5-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b02c1ff3415d2462","object":"chat.completion","created":1766129265,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-In struct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is \"TONGYI Qwen.\" The word \"TONGYI\" is writ ten in blue, and \"Qwen\" is written in gray. The text appears to be part of a logo or branding design.","refusal":null,"annotations":null,"audio": null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"tok en_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":129,"completion_tokens":51,"prompt_tokens_d ``` #### ✅ Test Qwen3-VL Run: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-a3a7de5a900a9321","object":"chat.completion","created":1766129586,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-23 10:04:37 +08:00
Shanshan Shen	b84ad8c5d8	[CustomOp] Register AscendMMEncoderAttention CustomOp and remove related patch (#4750 ) ### What this PR does / why we need it? Following https://github.com/vllm-project/vllm/pull/30125, register `AscendMMEncoderAttention` CustomOp and remove related patch. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ✅ Run Qwen2.5-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-b4e3053f30ab2442","object":"chat.completion","created":1764922950,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` ✅ Run Qwen3-VL: ```bash vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \ --max_model_len 16384 ``` Output: ``` {"id":"chatcmpl-97571fbda8267bd1","object":"chat.completion","created":1764923306,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is “TONGYI Qwen”.\n\n### How it looks:\n- “TONGYI” is written in uppercase letters in a bold, modern sans-serif font, colored blue.\n- “Qwen” is written in lowercase letters in a slightly thinner, elegant sans-serif font, colored dark gray.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} ``` - vLLM version: v0.12.0 - vLLM main: `ad32e3e19c` --------- Signed-off-by: shen-shanshan <467638484@qq.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com>	2025-12-22 14:32:53 +08:00

12 Commits