xc-llm-ascend

Author	SHA1	Message	Date
linfeng-yuan	3fc31ee1cb	[1/N][refactor] torchair deepseek modeling refactor (#2384 ) ### What this PR does / why we need it? Move torchair related model arch into torchair moduel to make the code clear. Next step we'll remove all torchair related code outside of torchair moduel. ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.10.0 - vLLM main: `08d5f7113a` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-18 15:00:37 +08:00
Pleaplusone	19fdc9a3f0	[Bugfix] Fix header include issue in rope (#2397 ) ### What this PR does / why we need it? vLLM-Ascend's rope implementaion include several header file that are not supposed to be included by outside users. Current implementation may break when canntoolkits update, this PR remove those not compatible file includes to guarantee the safety of upgrading cann toolkits. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by rope unittest - vLLM version: v0.10.0 - vLLM main: `3e6dd40016` Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-18 14:33:38 +08:00
Chao Lei	03ca2b26ca	[P/D] Mooncake Connector for v1 distributed (#1568 ) ### What this PR does / why we need it? This PR adopt Mooncake TransferEngine for kv cache register and pull_blocks style disaggregate prefill implementation. ### Does this PR introduce any user-facing change? No ### Dependencies 1. Cann Dependencies Using Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher.（see detail Mooncake[#502](https://github.com/kvcache-ai/Mooncake/pull/502)） 2. vllm-ascend This PR depends on changes introduced by #950 (modifications to `model_runner_v1`) and #1361 (updates to `schedule`), both of which have been merged into the `v0.9.1-dev` branch and are expected to land in `main` shortly. ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `1c859a1387` --------- Signed-off-by: leichao.lc <leichao139636@163.com> Co-authored-by: jianzs <zheng.shoujian@outlook.com> Co-authored-by: zzy-ContiLearn <1831242919@qq.com> Co-authored-by: fems14 <1804143737@qq.com> Co-authored-by: Dreamerleader <2270923832@qq.com> Co-authored-by: chris668899 <15105191595@126.com> Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>	2025-08-18 14:30:07 +08:00
CaveNightingale	2bb7e55022	[Bugfix][PD]fix non-working disaggregated prefill (#2374 ) ### What this PR does / why we need it? Mainline vLLM fixes its disaggregated prefill in https://github.com/vllm-project/vllm/pull/22598 . But it is still not working in vllm-ascend. To be concrete, decoder instances crash before vllm's fix and hang after vllm's fix in ascend devices. This patch allows disaggregated prefill to work. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Qwen3-0.6B 1P1D tp=1 dp=1 - vLLM version: v0.10.0 - vLLM main: `0fe85087a9` --------- Signed-off-by: CaveNightingale <cavenightingale@foxmail.com>	2025-08-15 16:59:52 +08:00
22dimensions	1b40665548	[Misc] remove unused file (cache.py) (#2377 ) ### What this PR does / why we need it? cache.py only contains a function that will never be called, so remove it. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.10.0 - vLLM main: `f1f0d2fab8` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-15 10:27:43 +08:00
Mengqing Cao	61866b8ac6	[Quickfix] update CachedRequestState as NewRequestData changed (#2367 ) ### What this PR does / why we need it? 1. update `CachedRequestState` as `NewRequestData` changed in https://github.com/vllm-project/vllm/pull/22570 2. drop maintenance of vllm v0.10.0 in the branch main ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `92ff41abea` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-15 07:35:27 +08:00
Li Wang	2ad7e1251e	[Doc] Fix quant documentation to make it reproducible (#2277 ) ### What this PR does / why we need it? Fixed the expression of msit for code clone - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-14 17:19:47 +08:00
Icey	c721ae6042	[CustomOp] Register RMSNorm instead of overwrite forward_oot (#2284 ) ### What this PR does / why we need it? Use function CustomOp.register_oot to achieve the customop registery ``` from vllm.model_executor.custom_op import CustomOp CustomOp.register_oot(_decorated_op_cls=AscendRMSNorm, name="RMSNorm") ``` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added/existing test. - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-14 17:18:30 +08:00
shiyuan680	e14f2ef669	refactor select_experts of moe module (#2150 ) ### What this PR does / why we need it? this pr refactor select_experts of moe module i merge implementations of quantitative and non-quantitative method in a new class use such as vllm like ExpertsSelector.select_experts ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? test in qwen3-moe and all ut. - vLLM version: v0.10.0 - vLLM main: `e18859298d` Signed-off-by: yangcheng <yangcheng104@huawei.com> Co-authored-by: yangcheng (AJ) <y00806874@china.huawei.com>	2025-08-14 11:50:53 +08:00
Shanshan Shen	103654ccd6	[Misc] Remove redundant imported `envs`, using `envs_ascend` instead (#2193 ) ### What this PR does / why we need it? Remove redundant imported `envs`, using `envs_ascend` instead. ```python import vllm.envs as envs_vllm import vllm_ascend.envs as envs_ascend ``` - vLLM version: v0.10.0 - vLLM main: `71683ca6f6` --------- Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-14 09:33:39 +08:00
Shanshan Shen	55d0790597	[2/N][Refactor] Refactor V1 attention for better extensibility (#1995 ) ### What this PR does / why we need it? Refactor V1 Attention for better extensibility (prepared for torchair attention refactor). Main changes: - Move different kinds of foward into their method respectively, e.g., `_forward_prefill_no_cache()`, `_forward_prefill_cache_hit()`, `_forward_decode_only()`, `_forward_v1_style()`. ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.10.0 - vLLM main: `14a5d903ab` Signed-off-by: shen-shanshan <467638484@qq.com>	2025-08-14 09:32:41 +08:00
Mengqing Cao	8914d5a4b2	[Quickfix] Add the missing `apply_router_weight_on_input` in FusedMoE init (#2348 ) ### What this PR does / why we need it? Add the missing `apply_router_weight_on_input` in FusedMoE init Quick fix on https://github.com/vllm-project/vllm-ascend/pull/2268#discussion_r2265828849 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with existing test. - vLLM version: v0.10.0 - vLLM main: `6807af8f46` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-14 09:17:50 +08:00
zhenghaojiang	0f7492d18e	[Bugfix] fix the oom when chunkprefill with long context like 64k (#2319 ) The attn mask was declared in the mla.py，we don't need the splitfuse mask when mla chunkprefill, and this mask will cause memory problem when long context like 64k or 128k - vLLM version: v0.10.0 - vLLM main: `14a5d903ab` --------- Signed-off-by: haojiangzheng <justineric096@gmail.com>	2025-08-13 17:15:59 +08:00
jack	8bfd16a145	[Doc] Add container image save/load FAQ for offline environments (#2347 ) ### What this PR does / why we need it? Add Docker export/import guide for air-gapped environments ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? NA - vLLM version: v0.10.0 - vLLM main: `d16aa3dae4` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2025-08-13 16:00:43 +08:00
yiz-liu	992271b027	[1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic (#2125 ) ### What this PR does / why we need it? This PR refactors the MoE (Mixture of Experts) communication logic by introducing a strategy pattern. It defines an abstract base class, `MoECommMethod`, which encapsulates different communication strategies for MoE layers. By decoupling the MoE implementation from any single communication method, this change makes it simpler to add, replace, or optimize communication strategies in the future. Plan / Roadmap 1. Introduce `MoECommMethod`, implement `AllGatherImpl`, and adapt ACL Graph handling to cover all scenarios (this PR). 2. Implement `MC2CommImpl` and `AllToAllCommImpl` to optimize performance in specific scenarios. 3. Enable W8A8 / Int8 models to use `unified_fused_experts`. Other notes * Data-parallel (DP) communication currently does not work with vLLM's dispatch/combine mechanisms; an alternative approach is required to resolve this incompatibility. - vLLM version: v0.10.0 - vLLM main: `f7ad6a1eb3` --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>	2025-08-12 21:10:20 +08:00
wangxiyuan	1a70564e7c	[5/N][Refactor] torchair model runner refactor (#2216 ) There is lot of torchair code in model runner leading the code hard for maintenance. We'll create new torchair_model_runner to split torchair related logic. Following the workflow #2203 What's this PR do: create common function `_capture_model` for capture_model - vLLM version: v0.10.0 - vLLM main: `1891a265d3` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-12 14:24:50 +08:00
Mengqing Cao	49ec6c98b7	[Doc] Update faq (#2334 ) ### What this PR does / why we need it? - update determinitic calculation - update support device ### Does this PR introduce _any_ user-facing change? - Users should update ray and protobuf when using ray as distributed backend - Users should change to use `export HCCL_DETERMINISTIC=true` when enabling determinitic calculation ### How was this patch tested? N/A - vLLM version: v0.10.0 - vLLM main: `ea1292ad3e` Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-12 14:12:53 +08:00
Wang Kunpeng	dc585f148a	[main][prefill optimization] Optimize parallel strategies to reduce communication overhead (#2198 ) ### What this PR does / why we need it? 1.Shared Expert Sharding Strategy Update: Switched from TP-aligned to pure DP for shared experts, enabling more efficient execution. 2.O_Proj AllReduce → ReduceScatter: Reduced communication overhead by using ReduceScatter, made possible by pure DP sharding. 3.AllGather Postponed: Delayed to after QKV down projection to reduce synchronization impact during prefill. ### How was this patch tested? Adding ut case in `tests/ut/attention/test_mla_v1.py` #### How to run use parameter `--additional_config='{"enable_shared_expert_dp": true}'` ##### a.How to run eager mode eg: python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true}' ##### b.How to run graph mode eg: python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: SlightwindSec <slightwindsec@gmail.com> Co-authored-by: SlightwindSec <slightwindsec@gmail.com>	2025-08-12 14:12:12 +08:00
Ronald1995	81817908ca	ut: add ci guard for ut coverage (#2317 ) ### What this PR does / why we need it? add ci guard for ut coverage, if ut coverage of patch pr is below 80%, the ci will failed/ ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `458e74eb90` --------- Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-08-12 08:05:01 +08:00
jack	9c6d108330	Configure Gemini (#2298 ) ### What this PR does / why we need it? This PR requests Gemini AI to review PRs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? NA - vLLM version: v0.10.0 - vLLM main: `14a5d903ab` Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>	2025-08-11 22:21:29 +08:00
wangxiyuan	c8b0f5f799	[4/N][Refactor] torchair model runner refactor (#2208 ) There is lot of torchair code in model runner leading the code hard for maintenance. We'll create new torchair_model_runner to split torchair related logic. Following the workflow #2203, this is the first PR. What's this PR do: create common function `_convert_torch_foramt` for initialize_kv_cache - vLLM version: v0.10.0 - vLLM main: `14a5d903ab` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 21:39:24 +08:00
zhenghaojiang	eb43a475f4	[Feat] chunkprefill mla support torchair graph (#1772 ) chunkprefill mla only support eager mode now，we want to optimaze it by support torchair graph, the idea is simple, when all the request is running in decode, use torchair graph to deal with it, else when chunkprefill or prefill only, use the eager mode - vLLM version: v0.10.0 - vLLM main: `ebf7605b0d` Signed-off-by: haojiangzheng <justineric096@gmail.com> Co-authored-by: haojiangzheng <justineric096@gmail.com>	2025-08-11 19:58:59 +08:00
wangxiyuan	881e36d6a9	[3/N][Refactor] torchair model runner refactor (#2207 ) There is lot of torchair code in model runner leading the code hard for maintenance. We'll create new torchair_model_runner to split torchair related logic. Following the workflow #2203, this is the first PR. What's this PR do: create common function `_build_attention_metadata` and `_generate_dummy_run_hidden_states` for dummy_run - vLLM version: v0.10.0 - vLLM main: `ebf7605b0d` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 18:03:19 +08:00
whx	29aaba5f84	[Perf][MTP] Optimize reject sampler in greedy situation. (#2137 ) This PR port optimization in PR #2002 to main and makes it cleaner. - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-08-11 17:37:49 +08:00
dependabot[bot]	ca274001b0	Bump actions/download-artifact from 4 to 5 (#2311 ) Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 5. - vLLM version: v0.10.0 - vLLM main: `ebf7605b0d` Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-08-11 16:02:12 +08:00
Pleaplusone	c0f0b70813	[core] Support capture custom ops into aclgraph (#2113 ) ### What this PR does / why we need it? Thanks to the PR https://github.com/vllm-project/vllm-ascend/pull/426 make vllm-ascend support the aclgraph inference to reduce the host overhead. However, the capability of aclgraph strongly relies on the functionality provided by `torch.compile`, which is the key feature supported in torch 2.x . Therefore, capture custom op into aclgraph is only possible when it can be recognize and captured by `torch.compile`. In this PR, we register the meta implementation of current custom ops to enable the fx graph capture. And by doing that, insert those custom ops into aclgraph become a natural thing to the ascend runtime. ### Does this PR introduce _any_ user-facing change? No user face change. ### How was this patch tested? Tested in unittest, we will integrate the `rotary_embedding` op into a small custom model and use `torch.compile` and aclgraph to capture and replay it to verify its functionality. - vLLM version: v0.10.0 - vLLM main: `1b99028069` --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com>	2025-08-11 15:59:42 +08:00
wangxiyuan	1ab15414bb	[2/N][Refactor] torchair model runner refactor (#2204 ) There is lot of torchair code in model runner leading the code hard for maintenance. We'll create new torchair_model_runner to split torchair related logic. Following the workflow #2203 What's this PR do: move `torchair` related logic into `_get_forward_metadata_across_dp` and override it in torchair model runner - vLLM version: v0.10.0 - vLLM main: `1b99028069` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 14:06:49 +08:00
wangxiyuan	9260910c8d	[CI] Fix broken CI (#2302 ) 1. disable test_eagle_ccorrectness test, we'll reopen it once oom error fixed. 2. drop transformers version limit for main, since vLLM rely on >=4.55.0, see: `65552b476b` 3. fix kv_connector_output bug, see: `796bae07c5` - vLLM version: v0.10.0 - vLLM main: `d1af8b7be9` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-08-11 11:22:32 +08:00
yangqinghao-cmss	ee6f79c44a	Add ut for test_communicator.py (#2293 ) ### What this PR does / why we need it? Add ut for test_communicator.py - vLLM version: v0.10.0 - vLLM main: `e5ebeeba53` Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>	2025-08-09 08:26:04 +08:00
Icey	3e65c406b8	Fix accuracy test create PR (#2274 ) ### What this PR does / why we need it? Fix create PR of accuracy test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Local testing: https://github.com/nv-action/vllm-benchmarks/pull/87 - vLLM version: v0.10.0 - vLLM main: `099c046463` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-08 14:12:11 +08:00
Icey	0bd5ff5299	Fix accuracy test config and add DeepSeek-V2-Lite test (#2261 ) ### What this PR does / why we need it? This PR fix accuracy test related to https://github.com/vllm-project/vllm-ascend/pull/2073, users can now perform accuracy tests on multiple models simultaneously and generate different report files by running: ```bash cd ~/vllm-ascend pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \ --config-list-file ./tests/e2e/models/configs/accuracy.txt ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? <img width="1648" height="511" alt="image" src="https://github.com/user-attachments/assets/1757e3b8-a6b7-44e5-b701-80940dc756cd" /> - vLLM version: v0.10.0 - vLLM main: `766bc8162c` --------- Signed-off-by: Icey <1790571317@qq.com>	2025-08-08 11:09:16 +08:00
Mengqing Cao	ad1083761f	[CI][Quickfix] Fix AscendFusedMoE init error (#2268 ) ### What this PR does / why we need it? Fix AscendFusedMoE init error. Use `super().__init__()` instead of `super(FusedMoE, self).__init__()` to ensure the member variables in base class could be called by the children class ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new existing test. - vLLM version: v0.10.0 - vLLM main: `766bc8162c` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-08 10:20:23 +08:00
huangxialu	dceef080b1	[main] remove torch.cat and replace it by List[0] (#2153 ) ### What this PR does / why we need it? torch_npu.npu_grouped_matmul: https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md According to the document, when `split_item` is 2 or 3, `torch_npu.npu_grouped_matmul` will return a list which has one element. Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is unnecessary. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? ut and e2e covered: `tests/ut/ops/test_fused_ops.py`, `tests/e2e/singlecard/ops/test_fused_moe.py` performance: (qwen3 30B, 2k->20k) base: Total Token throughput (tok/s): 667.76 remove cat: Total Token throughput (tok/s): 680.82 - vLLM version: v0.10.0 - vLLM main: `fa00c5d75b` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-08-07 17:20:19 +08:00
Ronald1995	b2598c3271	enable mm allreduce test (#2192 ) ### What this PR does / why we need it? This PR is to add e2e test for using npu_mm_all_reduce_base fusion kernel. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `5d5d419ca6` Signed-off-by: Ronald1995 <ronaldautomobile@163.com>	2025-08-07 17:19:23 +08:00
Mengqing Cao	4604882a3e	[ReleaseNote] Release note of v0.10.0rc1 (#2225 ) ### What this PR does / why we need it? Release note of v0.10.0rc1 - vLLM version: v0.10.0 - vLLM main: `8e8e0b6af1` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-08-07 14:46:49 +08:00
Yikun Jiang	58c8d4fdcd	Remove transformer pins for v0.9.1-dev (#2234 ) ### What this PR does / why we need it? Remove transformer pins for v0.9.1-dev, because we already release the v0.9.1rc2 with right transformer version ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doctest CI passed - vLLM version: v0.10.0 - vLLM main: `7e6544c797` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-07 14:41:10 +08:00
zhangxinyuehfad	92eebc0c9b	[Doc] Update user guide for suported models (#2263 ) ### What this PR does / why we need it? Update user guide for suported models - vLLM version: v0.10.0 - vLLM main: `4be02a3776` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:39:51 +08:00
22dimensions	440d28a138	[Tutorial] Add qwen3 8b w4a8 tutorial (#2249 ) ### What this PR does / why we need it? Add a new single npu quantization tutorial, and using the latest qwen3 model. - vLLM version: v0.10.0 - vLLM main: `8e8e0b6af1` Signed-off-by: 22dimensions <waitingwind@foxmail.com>	2025-08-07 14:39:38 +08:00
zhangxinyuehfad	bcd0b532f5	[Doc] Update user guide for using lm-eval (#1325 ) ### What this PR does / why we need it? Update user guide for using lm-eval 1. add using lm-eval on online server 2. add using offline datasets - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:15:49 +08:00
zhangxinyuehfad	dbba3cabb0	[Doc] Update tutorials for single_npu_audio and single_npu_multimodal (#2252 ) ### What this PR does / why we need it? Update tutorials for single_npu_audio and single_npu_multimodal - vLLM version: v0.10.0 - vLLM main: `6b47ef24de` Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-08-07 14:08:14 +08:00
Li Wang	205eff2b12	[Bugfix] Disable check vllm init temporary (#2250 ) ### What this PR does / why we need it? For the vllm src https://github.com/vllm-project/vllm/tree/main/vllm/attention/layers do not have `__init__.py`, which will break the python src init check, so we skip it for now ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `6b47ef24de` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-07 10:37:22 +08:00
lbk-sys	c611291661	【main】SP For Qwen3 MoE (#2209 ) ### What this PR does / why we need it? Qwen3 MoE supports SP. In scenarios like AlltoAll, AlltoAllv, and MC2, replacing AllReduce with Reduce-Scatter and AllGather achieves computational benefits in norm operations while saving one AllGather communication. This feature is enabled during the P-phase and delivers notable gains in long-sequence scenarios (e.g., 16k–25k), with performance improvements reaching 5%–10%. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` compilation_config={ "pass_config":{ "enable_sequence_parallelism": True } }, enable_expert_parallel=True, ``` - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: libaokui <libaokui@huawei.com> Co-authored-by: libaokui <libaokui@huawei.com>	2025-08-07 09:15:49 +08:00
Li Wang	57b9f02185	[Bugfix] Fix disaggregated pd error (#2242 ) ### What this PR does / why we need it? Fix `ascend_env has no attr VLLM_ASCEND_ENABLE_CHUNK_MC2`, remove useless lines - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:48:10 +08:00
xuyexiong	26fc36b0e0	[V1] MTP supports torchair (#2145 ) ### What this PR does / why we need it? Support MTP with： - [x] V0 Scheduler - [x] TorchAir - [x] Single DP - [x] Multi DP - [x] Disaggregate PD Known issues： - [ ] Not support V1 Scheduler (chunked prefill), will be supported in a few weeks - [ ] vllm v0.10.0 does not support metrics with `DP > 1` right now, need to comment out the line 171-175 in file `vllm/vllm/v1/metrics/loggers.py` ``` if (len(self.engine_indexes) > 1 and vllm_config.speculative_config is not None): raise NotImplementedError("Prometheus metrics with Spec Decoding " "with >1 EngineCore per AsyncLLM is not " "supported yet.") ``` To start an online server with torchair enabled, here is an example: ``` python -m vllm.entrypoints.openai.api_server \ --model="/weights/DeepSeek-R1_w8a8/" \ --trust-remote-code \ --max-model-len 40000 \ --tensor-parallel-size 4 \ --data_parallel_size 4 \ --max-num-seqs 16 \ --no-enable-prefix-caching \ --enable_expert_parallel \ --served-model-name deepseekr1 \ --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \ --quantization ascend \ --host 0.0.0.0 \ --port 1234 \ --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]},"enable_weight_nz_layout":true}' \ --gpu_memory_utilization 0.9 ``` offline example with torchair enabled ``` from vllm import LLM, SamplingParams prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] # Create a sampling params object. sampling_params = SamplingParams(max_tokens=16, temperature=0) # Create an LLM. llm = LLM( model="/home/data/DeepSeek-R1_w8a8/", tensor_parallel_size=16, max_num_seqs=16, gpu_memory_utilization=0.9, distributed_executor_backend="mp", enable_expert_parallel=True, speculative_config={ "method": "deepseek_mtp", "num_speculative_tokens": 1, }, trust_remote_code=True, enforce_eager=False, max_model_len=2000, additional_config = { 'torchair_graph_config': { 'enabled': True, "graph_batch_sizes": [16], 'enable_multistream_shared_expert': False, }, "ascend_scheduler_config": { "enabled": True }, # 'expert_tensor_parallel_size': 16, } ) # Generate texts from the prompts. # llm.start_profile() outputs = llm.generate(prompts, sampling_params) # llm.stop_profile() for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` - vLLM version: v0.10.0 - vLLM main: `302962e806` --------- Signed-off-by: xuyexiong <xuyexiong@huawei.com>	2025-08-06 19:37:43 +08:00
Li Wang	bf84f2dbfa	[Doc] Support kimi-k2-w8a8 (#2162 ) ### What this PR does / why we need it? In fact, the kimi-k2 model is similar to the deepseek model, and we only need to make a few changes to support it. what does this pr do: 1. Add kimi-k2-w8a8 deployment doc 2. Update quantization doc 3. Upgrade torchair support list ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:28:47 +08:00
huangxialu	875a86cbe9	ut: add example and e2e test for sleepmode in external_launcher (#2152 ) ### What this PR does / why we need it? This pr add e2e testcase to make sure sleep mode in external_launcher is ok. ### Does this PR introduce _any_ user-facing change? not involved ### How was this patch tested? not involved - vLLM version: v0.10.0 - vLLM main: `74333ae2f6` Signed-off-by: huangxialu <huangxialu1@huawei.com>	2025-08-06 11:11:53 +08:00
Wang Kunpeng	8a59367d0c	[main][Feature] Support deepseek w4a8 quantization (#2172 ) ### What this PR does / why we need it? Supports Deepseek-R1 w4a8 quantization. Since R1 w4a8 uses mixed quantization, only the MOE layer uses w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which includes the AscendW4A8DynamicFusedMoEMethod class. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Adding ut case in `tests/ut/quantization/test_w4a8_dynamic.py` and `tests/ut/quantization/test_quantizer.py` Adding e2e case in `tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC` to test deepseek w4a8_dynamic quantized model #### 1.How to get weights using Modelslim ##### Installation steps Use the branch master, the commit id is: 298e175d69b3b855111a1e09bbe2fcd12fdb4e24 git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim bash install.sh ##### The required transformers environment transformers>=4.48.2 ##### Generate w4a8 weights cd /example/DeepSeek Command reference: msmodelslim/example/DeepSeek/README.md Execute the [pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#%E8%BF%90%E8%A1%8C%E5%89%8D%E5%BF%85%E6%A3%80) and [DeepSeek-R1 w4a8 mix quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-%E6%B7%B7%E5%90%88%E9%87%8F%E5%8C%96%E5%89%8D%E4%B8%89%E5%B1%82-mlpw8a8-dynamic-%E9%87%8F%E5%8C%96mla%E5%85%B1%E4%BA%AB%E4%B8%93%E5%AE%B6w8a8%E9%87%8F%E5%8C%96%E8%B7%AF%E7%94%B1%E4%B8%93%E5%AE%B6w4a8-dynamic%E9%87%8F%E5%8C%96) chapter Reference command：python3 quant_deepseek_w4a8.py --model_path {Original weight path} --save_path {Generate weight path} --mindie_format ##### Adapt to vllm-ascend Since mindie_format generates mindie format, some adaptation modifications are needed for vllm-ascend to use it: `quant_model_description_w8a8_dynamic.json` rename to `quant_model_description.json`, and add `"group_size": 256` Modification in `config.json`：`"model_type":deepseekv2` is changed to `"model_type":deepseek_v3`; `quantization_config` is removed; tips:The group_size and weights match. If the w4a8 weights are not generated using msmodelslim, you can check the group_size in quantization_config in config.json. #### 2.How to run w4a8 ##### a.How to run eager mode export VLLM_USE_V1=1 # v1 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager eg: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --max-num-seqs 128 --enforce-eager ##### b.How to run graph mode export VLLM_USE_V1=1 # v1 export HCCL_BUFFSIZE=1024 python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' eg: python -m vllm.entrypoints.openai.api_server --model=/weight/dsr1_w4a8_vllm --trust-remote-code -tp 4 -dp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 5120 --additional_config='{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}' - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-08-06 10:17:44 +08:00
Ruri	e31b31f9c3	[main][Bugfix] Fix unable to load qwen3_moe quantized weights (#2219 ) ### What this PR does / why we need it? Fixes unable to load `qwen3_moe` quantized weights issue due to #1994 ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Add a `qwen3_moe` W8A8 quantized model in `tests/e2e/multicard/test_qwen3_moe.py` - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` --------- Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>	2025-08-06 09:08:36 +08:00
Yikun Jiang	54ace9e12b	Add release note for v0.9.1rc2 (#2188 ) ### What this PR does / why we need it? Add release note for v0.9.1rc2 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed - vLLM version: v0.10.0 - vLLM main: `c494f96fbc` Signed-off-by: Yikun Jiang <yikunkero@gmail.com>	2025-08-06 09:04:46 +08:00
sherie	126cdfc92b	[Test] add rejection sampler ut (#2084 ) ### What this PR does / why we need it? add rejection sampler ut. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT passed - vLLM version: v0.10.0 - vLLM main: `586f286789` Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>	2025-08-05 19:03:36 +08:00

1 2 3 4 5 ...

705 Commits