### What this PR does / why we need it?
Upgrade vllm commit to 0106
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Update the BiSheng version to 20260105.
- vLLM version: v0.13.0
- vLLM main:
8be6432bda
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
Remove the incorrectly depicted DCP all_gather operation from the prefill-stage PCP-for-GQA diagram.
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Fixes #3386
- Update Qwen3-30B-A3B.md to use A3-specific image tag
- Update Qwen3-Dense.md to provide both A2 and A3 image options
- Update Qwen3-Next.md to use A3-specific image for Atlas A3
environments
Previously, documentation only mentioned A2 images (vllm-ascend:version)
but Atlas A3 machines require A3-specific images
(vllm-ascend:version-a3). This change ensures users select the correct
image for their hardware.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: hu-qi <huqi1024@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Fixes #2727
- Add NNAL to the software requirements table with version information
- Add note explaining that prebuilt Docker images include NNAL
- Add warning message for manual installation when encountering
libatb.so errors
- Improve visibility of NNAL installation instructions to prevent
runtime errors
This addresses the issue where users encounter 'libatb.so not found'
errors due to missing NNAL installation in their environment.
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>
Signed-off-by: hu-qi <huqi1024@gmail.com>
Co-authored-by: zhangyiming <34808445+menogrey@users.noreply.github.com>
### What this PR does / why we need it?
This documentation provides a comprehensive technical guide for
deploying **vLLM-Ascend** using a **Prefill-Decode (PD) colocated
architecture** integrated with **Mooncake**, a high-performance
distributed KV Cache transfer engine. As Large Language Model (LLM)
serving scales, managing KV Cache efficiently across distributed nodes
is essential for reducing latency and optimizing hardware utilization.
The tutorial focuses on a multi-instance setup using Huawei **Atlas 800T
A2** nodes. By leveraging Mooncake’s distributed memory pooling, vLLM
instances can achieve seamless **cross-node KV Cache reuse**. This
capability allows an instance to retrieve precomputed cache from a
remote node's DRAM via high-speed **RoCE** networks, effectively
bypassing redundant prefill computations.
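A minimal Python sketch of how such a colocated instance might be configured from the vLLM side, assuming vLLM's `KVTransferConfig`; the connector name, role string, and model are placeholders/assumptions rather than the tutorial's exact values:

```python
# Sketch only: the connector name and role are assumptions; consult the
# vllm-ascend + Mooncake tutorial for the exact configuration.
from vllm import LLM
from vllm.config import KVTransferConfig

# Each colocated instance both produces and consumes KV cache, with Mooncake
# pooling DRAM across nodes over RoCE so remote prefill results can be reused.
kv_config = KVTransferConfig(
    kv_connector="MooncakeStoreConnector",  # assumed connector name
    kv_role="kv_both",                      # PD colocated: producer + consumer
)

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",       # placeholder model
    kv_transfer_config=kv_config,
)
```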
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: release/v0.13.0
- vLLM main:
0bfd7484fd
---------
Signed-off-by: zhangmuzhibangde <1037640609@qq.com>
Signed-off-by: zhangmuzhi_yuwan <1037640609@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
1. Refactor the eagle and mtp functions `load_model` and `generate_token_ids`.
2. Remove redundant code in the mtp and eagle files.
3. Refactor the unit tests for these files.
This is part 2/N of the refactor that merges mtp and eagle.
Relational RFC: https://github.com/vllm-project/vllm-ascend/issues/5467
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Unit tests and existing test cases.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Because the BiSheng installation path changed in the new version, the corresponding source path in the environment variables needs to be updated.
- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
This PR makes the following modifications:
1. Delete `user_guide/feature_guide/quantization-llm-compressor.md` and merge it into `user_guide/feature_guide/quantization.md`.
2. Update the content of `user_guide/feature_guide/quantization.md`.
3. Add guidance in `developer_guide/feature_guide/quantization.md` on adapting quantization algorithms and quantized models.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
7157596103
---------
Signed-off-by: IncSec <1790766300@qq.com>
Signed-off-by: InSec <1790766300@qq.com>
### What this PR does / why we need it?
Enable the KV pool decode node to save KV cache. Currently only MLA is supported.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: baxingpiaochong <771405853@qq.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
Fixed a typo in the environment variable name.
`ASCEBD_RT_VISIBLE_DEVICES` -> `ASCEND_RT_VISIBLE_DEVICES`
Fixes #5580
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
By converting the KV cache from ND to NZ format when the decode node receives it, this PR ensures that the KV NZ feature works correctly during the decoding phase in the disaggregated-prefill scenario.
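For illustration, a rough sketch of such a conversion on the receiving (decode) side, assuming torch_npu's `npu_format_cast` and the FRACTAL_NZ format id (29); the actual code path in this PR may differ:

```python
# Sketch only: assumes torch_npu.npu_format_cast and ACL_FORMAT_FRACTAL_NZ == 29.
import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # NZ (fractal) layout id used by CANN

def received_kv_to_nz(kv_block: torch.Tensor) -> torch.Tensor:
    """Convert a KV cache block received in ND format to NZ format on the NPU."""
    return torch_npu.npu_format_cast(kv_block.npu(), ACL_FORMAT_FRACTAL_NZ)
```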
- vLLM version: v0.11.0
- vLLM main:
83f478bb19
---------
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: ghphotoframe <854746559@qq.com>
Co-authored-by: alex101-ops <alex1015718386@gmail.com>
### What this PR does / why we need it?
Upgrade vllm commit to 1230
Affected by https://github.com/vllm-project/vllm/pull/27614 (and the
core PR https://github.com/vllm-project/vllm/pull/26866), we have to
make the following changes:
1. Modify `tests/e2e/multicard/test_aclgraph_capture_replay.py` to stay compatible with both vLLM `v0.13.0` and the latest main commit, now that vLLM enables async scheduling by default
2. Skip `test_guided_decoding.py` due to xgrammar errors
(https://github.com/vllm-project/vllm-ascend/issues/5524)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
---------
Signed-off-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it?
Update new contributors.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
Signed-off-by: menogrey <1299267905@qq.com>
Fixes #3376
- Remove `--task embed` from the vllm serve command in Qwen3_embedding.md
- Remove the `task='embed'` parameter from the LLM constructor in the Python example
The `--task` parameter has been deprecated in recent vLLM versions in favor of automatic model type detection.
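For reference, a minimal sketch of the updated Python usage without the deprecated parameter (the model name is a placeholder; see Qwen3_embedding.md for the actual example):

```python
# Sketch: task='embed' is no longer passed; vLLM detects the model type itself.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Embedding-0.6B")  # placeholder model name
outputs = llm.embed(["Hello, world!"])
print(outputs[0].outputs.embedding[:8])
```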
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: hu-qi <huqi1024@gmail.com>
### What this PR does / why we need it?
Update the triton-ascend version to 1229 and the BiSheng version to 1225.
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
### What this PR does / why we need it?
- Fixes vLLM break:
1. [[BugFix] register quant scale tensors as buffer #31395](https://github.com/vllm-project/vllm/pull/31395)
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
Add release note for v0.13.0rc1
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Modify the PCP tutorial doc. Because some optimization points have been submitted as PRs that are not yet merged, the performance data is updated now and will be refreshed again after those PRs are merged.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
This PR updates the documentation for DeepSeek-V3.1 and DeepSeek-R1 in the prefill-decode disaggregation scenario. It updates some PD-disaggregation-related settings and optimal configurations. The script has been verified.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### What this PR does / why we need it?
Modify the DeepSeek-R1/V3.1 documentation, mainly updating the MTP size and some other configs.
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
This PR updates the Qwen3-235B doc to give a simple recipe for reproducing our latest performance on Atlas A3 servers.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
Currently, `HCCL_OP_EXPANSION_MODE="AIV"` is causing some freezing issues on A2, so we have temporarily removed it from the documentation.
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
Add the long_sequence feature user guide.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: LookAround <lixushi@huawei.com>
### What this PR does / why we need it?
Modify the PCP tutorials: update the PCP performance statistics and add a note that the context parallel feature is currently only supported on Atlas A3 devices and will be supported on Atlas A2 in the future.
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
1. Resolve errors with the KV pool used for KV transfer in PD disaggregation scenarios.
2. Update the KV pool documentation.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Provide sample guidance for running long-sequence DeepSeek across multiple nodes. A practical example is provided to guide users in using the context parallel feature.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Update vllm pin to 12.26
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
Roll back the causal_conv1d_fn op from the Triton version to the torch version to fix hanging issues; meanwhile, update the Qwen3-Next doc.
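For context, a minimal sketch of what a torch-based causal 1D convolution computes (illustrative only; the actual rolled-back op in vllm-ascend may handle conv states and activations differently):

```python
# Illustrative causal depthwise conv1d in torch: left-pad so each position
# only sees the past, then run a grouped (per-channel) convolution.
import torch
import torch.nn.functional as F

def causal_conv1d_torch(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """x: (batch, channels, seqlen); weight: (channels, kernel_size), depthwise."""
    channels, kernel_size = weight.shape
    x = F.pad(x, (kernel_size - 1, 0))           # pad only on the left (causal)
    return F.conv1d(x, weight.unsqueeze(1), groups=channels)
```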
- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
### What this PR does / why we need it?
This PR updates the DeepSeek-R1/V3.1 doc to give a simple recipe for reproducing our latest performance on Atlas A3/A2 servers.
### Does this PR introduce _any_ user-facing change?
No.
Signed-off-by: GDzhu01 <809721801@qq.com>
### What this PR does / why we need it?
Add a developer guide for PCP & DCP.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
1. Refresh the additional config doc.
2. Move the KV config logic to the platform.
3. Improve the `dump_config` init logic and rename it to `dump_config_path`. This change is user-impacting: the value changes from a dict to a string (see the sketch after this list).
4. Correct the `enable_async_exponential` type.
5. Remove the useless `chunked_prefill_for_mla`.
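A minimal sketch of the user-impacting part of item 3, assuming `additional_config` is still passed through the engine args as before (the model and path are placeholders):

```python
# Sketch: dump_config (dict) is replaced by dump_config_path (string).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",                 # placeholder model
    additional_config={
        # old (no longer accepted): "dump_config": {...}
        "dump_config_path": "/tmp/vllm_ascend_dump",  # new: a string path
    },
)
```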
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- Fix vLLM break in the PR:
1. [Drop v0.14 deprecations](https://github.com/vllm-project/vllm/pull/31285)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
### What this PR does / why we need it?
Update the configuration for optimal performance of DeepSeek V3.2 in the usage tutorial.
- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
`VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE` is only used together with `VLLM_ASCEND_ENABLE_PREFETCH_MLP`, which is entirely useless. This PR removes it.
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Add a `pa_shape_list` description to the Qwen dense tutorial.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
Fix vLLM break:
1. [Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement](https://github.com/vllm-project/vllm/pull/29558)
Fix solution: add the now-required `all2all_backend` parameter. Its only impact on the original `set_splitting_ops_for_v1` implementation is that graph mode is disabled in `vllm` when `deepep_high_throughput` is enabled; it has no effect on the `vllm-ascend` logic.
2. [Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface](https://github.com/vllm-project/vllm/pull/30684)
Fix solution: the GPU does not need to convert qkv to 3D because its flash_attention operator accepts both the 4D (b s h d) and 3D (s b (h d)) layouts, but the NPU's flash_attention_unpad operator only supports the 3D (s b (h d)) layout. Therefore, we introduce a reshape_qkv_to_3d operation (see the sketch after this list).
3. Skip the Tencent-Hunyuan/HunyuanOCR test case, as it hits the following issue with the upgraded vLLM code:
https://github.com/vllm-project/vllm-ascend/issues/5297
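A rough sketch of the reshape described in item 2, assuming a (b, s, h, d) source layout (the actual `reshape_qkv_to_3d` helper may operate on fused qkv or a different layout):

```python
# Sketch: collapse heads so the NPU flash_attention_unpad operator, which only
# accepts the 3D (s b (h d)) layout, can consume q/k/v.
import torch

def reshape_qkv_to_3d(x: torch.Tensor) -> torch.Tensor:
    """Reshape a (b, s, h, d) tensor into the 3D (s, b, h*d) layout."""
    b, s, h, d = x.shape
    return x.permute(1, 0, 2, 3).reshape(s, b, h * d)

# q, k, and v would each pass through this before the NPU attention call.
```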
### How was this patch tested?
Co-authored-by: zxwang <1476209578@qq.com>
- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
Signed-off-by: zxwang <1476209578@qq.com>
Co-authored-by: zxwang <1476209578@qq.com>
### What this PR does / why we need it?
[Kthena](https://github.com/volcano-sh/kthena) is a Kubernetes-native
LLM inference platform that transforms how organizations deploy and
manage Large Language Models in production. Built with declarative model
lifecycle management and intelligent request routing, it provides high
performance and enterprise-grade scalability for LLM inference
workloads.
The platform extends Kubernetes with purpose-built Custom Resource
Definitions (CRDs) for managing LLM workloads, supporting multiple
inference engines (vLLM, SGLang, Triton) and advanced serving patterns
like prefill-decode disaggregation.
This PR adds an example of deploying an LLM on Ascend Kubernetes clusters.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
### What this PR does / why we need it?
[Doc] Add new contributors and related scripts.
Usage of scripts:
- `export GITHUB_TOKEN=<your github token>`
- `bash tools/collect_user_first_contribution.sh vllm-project/vllm-ascend <base_sha> <head_sha>` and save the result to a temporary file such as `contributors.txt`
- `python tools/format_contributors.py contributors.txt --start <start
index now>`
- Use the output to update the `contributors.md`
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: menogrey <1299267905@qq.com>