Commit Graph

18 Commits

Author SHA1 Message Date
Rui Kang
427b17e2da [Misc] Add a model loader that utilizes HCCL for weight loading (#2888)
### What this PR does / why we need it?

This PR introduces a new model loader, Netloader, which uses high-bandwidth
P2P transfers directly between NPU cards for weight loading. Netloader is
implemented as a plugin through the 'register_model_loader' function added in
vLLM 0.10. It loads weights by sending them from an already loaded model (the
server) to the empty model of a newly started instance (the client). The
server runs concurrently with normal inference tasks in sub-threads, using
vLLM's 'stateless_init_torch_distributed_process_group'. The client initiates
a transfer request after verifying that its model and partitioning scheme
match the server's, then uses HCCL send/recv to load the weights in the
order they are stored in the model.
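
A minimal sketch of the ordered weight exchange is shown below, for illustration only: it uses generic torch.distributed point-to-point calls rather than the loader's actual HCCL process group, and the `serve_weights`/`load_weights` helpers are hypothetical names, not functions from this PR.

```python
import torch
import torch.distributed as dist

def serve_weights(model: torch.nn.Module, group, dst_rank: int) -> None:
    """Server side: stream parameters in the order they are stored in the model."""
    for name, param in model.state_dict().items():
        dist.send(param.contiguous(), dst=dst_rank, group=group)

def load_weights(model: torch.nn.Module, group, src_rank: int) -> None:
    """Client side: receive each tensor in the same order into the empty model."""
    for name, param in model.state_dict().items():
        buf = torch.empty_like(param)
        dist.recv(buf, src=src_rank, group=group)
        param.copy_(buf)  # state_dict tensors share storage with the model parameters
```

In the PR itself, the process group is created with vLLM's 'stateless_init_torch_distributed_process_group' over HCCL, and the transfer only starts after the client has verified that the model and partitioning metadata match.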

Application Scenarios:
1. Significantly reduces inference instance startup time. By reusing the
weights of already loaded instances and transferring them at high speed
directly between compute cards, this method reduces model loading latency
compared with traditional remote/local pull methods.
2. Reduces network and storage pressure. It avoids repeatedly downloading
weight files from remote repositories, reducing the load on centralized
storage and network traffic, thereby improving overall system stability and
service quality.
3. Improves resource utilization and reduces costs. A faster loading process
reduces reliance on redundant compute pools, allowing compute resources to be
elastically scaled and reclaimed as needed.
4. Enhances business continuity and high availability. In fault-recovery
scenarios, new instances can quickly take over existing services, avoiding
prolonged business interruptions and improving the system's availability and
user experience.

### Does this PR introduce _any_ user-facing change?

Netloader is activated via the existing --load-format and
--model-loader-extra-config options, with --load-format=netloader. The
--model-loader-extra-config value must be passed as a JSON string (as it is
now).
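
For example, with the offline Python API the activation might look like the sketch below, assuming the CLI flags map onto the engine arguments of the same name; the model name and the extra-config key are placeholders, not options documented by this PR.

```python
from vllm import LLM

# CLI equivalent (flags from this PR):
#   vllm serve <model> --load-format netloader \
#       --model-loader-extra-config '{"server_address": "..."}'
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",                      # placeholder model
    load_format="netloader",                               # selects the Netloader plugin
    model_loader_extra_config={"server_address": "..."},   # hypothetical key
)
```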

Afterwards, you can check whether the outputs for the same prompt are
consistent when the temperature is set to 0.

Signed-off-by: destinysky <kangrui10@126.com>

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
Crazyang
f06a6cad1b [Doc] Update the modelslim website from gitee to gitcode. (#3615)
### What this PR does / why we need it?

Because the ModelSlim code repository has migrated from Gitee to
GitCode, all relevant links in this repository have been updated.

[migration
notice](https://gitee.com/ascend/msit/tree/master/.%E6%9C%AC%E9%A1%B9%E7%9B%AE%E5%B7%B2%E7%BB%8F%E6%AD%A3%E5%BC%8F%E8%BF%81%E7%A7%BB%E8%87%B3%20Gitcode%20%E5%B9%B3%E5%8F%B0)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Crazyang <im.crazyang@gmail.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weichen <calvin_zhu0210@outlook.com>
2025-10-23 15:38:16 +08:00
KyrieWang
60e2be1b36 [Feat] Dynamic Batch Feature (#3490)
See the [RFC](https://github.com/vllm-project/vllm-ascend/issues/3328) for more
details.
This PR adds a dynamic batch feature to the chunked-prefill strategy: the token
budget is refined at runtime to achieve better effective throughput and TPOT.

!!! NOTE: only 910B3 is supported so far; we are working on further
improvements.
An additional lookup-table file is required.
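
A rough illustration of the idea follows; the table contents, bucket scheme, and function name are made up for this example, and the real feature reads a device-specific lookup-table file inside the scheduler.

```python
from typing import Dict

# Hypothetical lookup table: running decode batch size -> chunked-prefill token budget.
TOKEN_BUDGET_TABLE: Dict[int, int] = {0: 8192, 8: 6144, 16: 4096, 32: 2048}

def dynamic_token_budget(num_decode_requests: int, default_budget: int = 8192) -> int:
    """Shrink the prefill token budget as more decode requests are in flight,
    trading some prefill throughput for better TPOT."""
    budget = default_budget
    for bucket, value in sorted(TOKEN_BUDGET_TABLE.items()):
        if num_decode_requests >= bucket:
            budget = value
    return budget

# With 20 decode requests in flight, prefill chunks shrink to 4096 tokens.
assert dynamic_token_budget(20) == 4096
```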

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Cheng Wang <wangchengkyrie@outlook.com>
2025-10-22 14:13:32 +08:00
offline893
e916265b2b [CI]Add EPLB CI. (#3568)
### What this PR does / why we need it?
1. Add an EPLB CI job to check changes to the EPLB feature.
2. Add validation of the EPLB parameters.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Qwen on A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-21 22:58:02 +08:00
offline893
6c9909c861 [Patch]patch of v1 executor when enable eplb. (#3511)
### What this PR does / why we need it?
When dynamic EPLB is enabled, patch the v1 executor to avoid failures when
creating child processes.

### How was this patch tested?
DeepSeek V3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-19 10:54:26 +08:00
offline893
5a3082cd15 [EPLB]Record expert map without dynamic eplb. (#3409)
What this PR does / why we need it?
1. Record the expert map without dynamic EPLB.
2. Add `export PYTHONOPTIMIZE=1` when using dynamic EPLB.
3. Update the EPLB doc.

Does this PR introduce any user-facing change?
How was this patch tested?
Qwen3_moe on A3.

- vLLM version: v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-15 14:21:15 +08:00
Wang Kunpeng
859e861d92 [main][quantization] Support deepseek w4a8 per-channel quantization (#3011)
### What this PR does / why we need it?
1. Support DeepSeek w4a8 per-channel quantization.
2. Eager mode supports converting weights to the NZ format (see the sketch below).
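
As background on item 2, weight conversion to the NZ layout on Ascend is typically done with torch_npu's format cast; the snippet below is an illustrative sketch under that assumption, not code from this PR.

```python
import torch
import torch_npu  # Ascend PyTorch adapter

ACL_FORMAT_FRACTAL_NZ = 29  # standard ACL id for the NZ (fractal) layout

def cast_weight_to_nz(weight: torch.Tensor) -> torch.Tensor:
    """Convert an NPU-resident weight tensor to the NZ format for faster matmuls."""
    return torch_npu.npu_format_cast(weight.npu(), ACL_FORMAT_FRACTAL_NZ)
```
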
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
#### How to get weights using Modelslim

##### Installation steps

git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh

##### Generate w4a8 per-channel weights

cd example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-09-27 21:01:16 +08:00
offline893
5d13bbe796 [BugFix]Modify eplb feature guide. (#3183)
### What this PR does / why we need it?
Revise the EPLB feature guide content. Add the EPLB parameters to the Ascend config.
### Does this PR introduce any user-facing change?
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
52d0cb8458

Co-authored-by: offline0806 <3337230449@qq.com>
2025-09-25 17:01:51 +08:00
offline893
76844eec78 Dynamic Expert Load Balance with Zero-like-overhead (#2956)
### Motivation
Currently, dynamic expert balancing stops the world. Asynchronous expert load
balancing is better because it avoids the following problems:

Host-bound latency: EPLB involves many CPU operations, such as running the
EPLB algorithm, creating p2p ops, and converting log2phy experts, which take
a long CPU time (~1s).
Communication latency: The transfer time is high without NVLink. Because the
weights of an expert may be transferred to multiple new positions, one expert
needs N send/recv operations, resulting in long latency. We measured that
batch_isend_irecv costs more than 100ms to transmit the weights of 16 experts
on an Ascend A2 server.

SwiftBalancer no longer stops the world: in our tests on NPU it costs 1~2ms
per layer while saving 5ms-8ms of decode latency with ep_size = 64.
The following updates have been made:
1. Expert distribution recording with lower cost.
2. Asynchronous CPU computation for the EPLB algorithm and other Python operations.
3. A new EPLB algorithm that rebalances fewer experts with almost the same effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves the following steps (see the code sketch after the list):
1. Record the expert distribution during the forward pass. We use expert_token_num
after dispatch instead of topk_ids, which gives a much smaller tensor shape and
reduces the cost of HBM recording and the add operator.
2. All-gather the expert distribution. All-gather is used instead of all-reduce
because it has lower traffic volume.
3. Wake up the EPLB worker process with the expert distribution every
num_iterations steps, and run the EPLB algorithm in the worker.
4. Generate the p2p send/recv ops and other operations, such as log2phy
conversion, that take long CPU time.
5. Launch batch_isend_irecv on an async stream before the forward pass.
6. After the forward pass, wait for batch_isend_irecv to finish, then update the
expert map and expert weights.
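
A condensed sketch of steps 4-6 follows; it is illustrative only, using plain torch.distributed P2P ops, and the placement format, stream handling, and the commented update hook are assumptions rather than the code in this PR.

```python
import torch
import torch.distributed as dist

def build_eplb_p2p_ops(new_placement, expert_weights, group):
    """Step 4: turn the planned expert moves into batched P2P send/recv ops."""
    rank = dist.get_rank(group)
    ops = []
    for src_rank, dst_rank, expert_id in new_placement:  # assumed (src, dst, expert) triples
        tensor = expert_weights[expert_id]
        if rank == src_rank:
            ops.append(dist.P2POp(dist.isend, tensor, peer=dst_rank, group=group))
        elif rank == dst_rank:
            ops.append(dist.P2POp(dist.irecv, tensor, peer=src_rank, group=group))
    return ops

def forward_with_async_eplb(model, batch, ops, transfer_stream):
    # Step 5: kick off the batched transfer on a side stream before the forward pass.
    with torch.npu.stream(transfer_stream):          # torch.cuda.stream on GPU
        reqs = dist.batch_isend_irecv(ops) if ops else []
    out = model(batch)                                # forward overlaps the weight movement
    # Step 6: wait for the transfer, then swap in the new expert map and weights.
    for req in reqs:
        req.wait()
    torch.npu.current_stream().wait_stream(transfer_stream)
    # update_expert_map_and_weights(...)              # hypothetical hook, not from this PR
    return out
```

The key property is that the expert-weight movement overlaps the forward pass, so the only synchronous work left is swapping in the new expert map and weights afterwards.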
### Co-author
Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com


- vLLM version: v0.10.2
- vLLM main:
567939953b

---------

Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
Li Wang
042605f4b2 [Doc] Add stable modelslim branch (#2545)
### What this PR does / why we need it?
The branch `br_release_MindStudio_8.1.RC2_TR5_20260624` is the commercial
delivery version of ModelSlim for Q3, and it has been verified to be available.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
7d67a9d9f9

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-27 09:05:46 +08:00
yupeng
973a7cfdf0 [DOC] update doc: LoRA with ACLGraph (#2430)
### What this PR does / why we need it?
Update the doc to guide users to run LoRA with ACLGraph.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

- vLLM version: v0.10.0
- vLLM main:
de7b67a023

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-08-21 08:55:55 +08:00
Li Wang
2ad7e1251e [Doc] Fix quant documentation to make it reproducible (#2277)
### What this PR does / why we need it?
Fixed the wording of the msit code-clone instructions.

- vLLM version: v0.10.0
- vLLM main:
afa5b7ca0b

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-14 17:19:47 +08:00
Li Wang
bf84f2dbfa [Doc] Support kimi-k2-w8a8 (#2162)
### What this PR does / why we need it?
The kimi-k2 model is in fact similar to the DeepSeek model, so only a few
changes are needed to support it. What this PR does:
1. Add a kimi-k2-w8a8 deployment doc.
2. Update the quantization doc.
3. Update the torchair support list.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main:
9edd1db02b

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-08-06 19:28:47 +08:00
Wang Kunpeng
e3a2443c3a [main][Doc] add mla pertoken quantization FAQ (#2018)
### What this PR does / why we need it?
When using DeepSeek-series model weights generated with the --dynamic parameter,
if torchair graph mode is enabled, the configuration file in the CANN package
should be modified to prevent incorrect inference results.

- vLLM version: v0.10.0
- vLLM main:
7728dd77bb

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2025-07-27 08:47:51 +08:00
Li Wang
bdfb065b5d [1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)
### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Clean up the code.

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main:
29c6fbe58c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-25 22:16:10 +08:00
wangxiyuan
eb921d2b6f [Doc] Fix 404 error (#1797)
Fix a URL 404 error in the doc.
- vLLM version: v0.9.2
- vLLM main:
9ad0a4588b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-15 11:52:38 +08:00
wangxiyuan
b5b7e0ecc7 [Doc] Add qwen3 embedding 8b guide (#1734)
1. Add a tutorial for qwen3-embedding-8b.
2. Remove VLLM_USE_V1=1 from the docs; it is no longer needed as of 0.9.2.


- vLLM version: v0.9.2
- vLLM main:
5923ab9524

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-11 17:40:17 +08:00
wangxiyuan
3d1e6a5929 [Doc] Update user doc index (#1581)
Add a user doc index to make the user guide clearer.
- vLLM version: v0.9.1
- vLLM main:
49e8c7ea25

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-10 14:26:59 +08:00