xc-llm-ascend

Author	SHA1	Message	Date
Shanshan Shen	303c08aec9	[Doc] Update structured output doc with upstream link (#5058) ### What this PR does / why we need it? Cherry-pick from main https://github.com/vllm-project/vllm-ascend/pull/4015. Currently, the structured output feature in vllm-ascend is used in exactly the same way as in vllm. Thus, IMO, it's better to remove this doc directly, to avoid cases where the upstream doc changes and our doc is not updated in time, which can mislead users. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-16 11:32:53 +08:00
LI SHENGYONG	c94b38c82e	[Readme] EPLB Support Scenarios (#4315 ) ### What this PR does / why we need it? Add information on the scope of EPLB support. --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-11-21 14:25:39 +08:00
zhangxinyuehfad	75de3fa172	[v0.11.0][Doc] Update doc (#3852 ) ### What this PR does / why we need it? Update doc Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-10-29 11:32:12 +08:00
offline893	e916265b2b	[CI]Add EPLB CI. (#3568) ### What this PR does / why we need it? 1. Add EPLB CI to check changes to the EPLB feature. 2. Add parameter checking for the EPLB parameters. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Qwen on A3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-21 22:58:02 +08:00
offline893	6c9909c861	[Patch]patch of v1 executor when enable eplb. (#3511) ### What this PR does / why we need it? When dynamic EPLB is used, patch the v1 executor so that creating the child process does not fail. ### How was this patch tested? deepseek in v3. - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-19 10:54:26 +08:00
offline893	5a3082cd15	[EPLB]Record expert map without dynamic eplb. (#3409) What this PR does / why we need it? 1. Record the expert map without dynamic EPLB. 2. Add `export PYTHONOPTIMIZE=1` when using dynamic EPLB. 3. Update the EPLB doc. Does this PR introduce any user-facing change? How was this patch tested? Qwen3_moe on A3. - vLLM version: v0.11.0 --------- Signed-off-by: offline0806 <3337230449@qq.com> Co-authored-by: offline0806 <3337230449@qq.com>	2025-10-15 14:21:15 +08:00
Wang Kunpeng	859e861d92	[main][quantization] Support deepseek w4a8 per-channel quantization (#3011) ### What this PR does / why we need it? 1. Support deepseek w4a8 per-channel quantization 2. The eager mode supports converting weights to the NZ format ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? #### How to get weights using Modelslim ##### Installation steps `git clone https://gitcode.com/Ascend/msit.git` `cd msit/msmodelslim` `bash install.sh` ##### Generate w4a8 per-channel weights `cd /example/DeepSeek` Command reference: msmodelslim/example/DeepSeek/README.md - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-09-27 21:01:16 +08:00
offline893	5d13bbe796	[BugFix]Modify eplb feature guide. (#3183) ### What this PR does / why we need it? Revise the EPLB feature guide content. Add EPLB params to the ascend config. ### Does this PR introduce any user-facing change? ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `52d0cb8458` Co-authored-by: offline0806 <3337230449@qq.com>	2025-09-25 17:01:51 +08:00
offline893	76844eec78	Dynamic Expert Load Balance with Zero-like-overhead (#2956) ### Motivation Currently, dynamic expert load balancing stops the world. Asynchronous expert load balancing would avoid the following problems: Host-bound latency: EPLB involves many CPU operations, such as the EPLB algorithm, creating p2p ops, and log2phy expert conversion, which take a long time on the CPU (about 1 s). Communication latency: Transfer time is high without NVLink. Because the weights of one expert may be transferred to multiple new positions, one expert can require N send/recv operations, which results in long latency. We measured that batch_isend_irecv takes more than 100 ms to transmit the weights of 16 experts on an Ascend A2 server. SwiftBalancer no longer stops the world: in our test on NPU it costs 1~2 ms per layer while improving decode latency by 5 ms to 8 ms with ep_size = 64. The following updates have been made: 1. Expert distribution recording with lower cost. 2. Asynchronous CPU computation for the EPLB algorithm and other Python operations. 3. A new EPLB algorithm that rebalances fewer experts with almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the vLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor and reduces the cost of HBM recording and the add operator. 2. Do an all-gather of the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume. 3. Wake up the EPLB worker process with the expert distribution when num_iterations is reached. Run the EPLB algorithm in the EPLB worker. 4. Generate the p2p send/recv ops and other operators such as log2phy, which take a long time on the CPU. 5. Launch ibatch_send_recv in async_stream before forward. 6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: `567939953b` --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>	2025-09-17 10:36:43 +08:00
Li Wang	042605f4b2	[Doc] Add stable modelslim branch (#2545) ### What this PR does / why we need it? The branch `br_release_MindStudio_8.1.RC2_TR5_20260624` is the commercial delivery version of modelslim for Q3 and has been verified to be available. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `7d67a9d9f9` Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-27 09:05:46 +08:00
yupeng	973a7cfdf0	[DOC] update doc: LoRA with ACLGraph (#2430 ) ### What this PR does / why we need it? Update DOC. Guide users to run LoRA with ACLGraph. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No. - vLLM version: v0.10.0 - vLLM main: `de7b67a023` --------- Signed-off-by: paulyu12 <507435917@qq.com>	2025-08-21 08:55:55 +08:00
Li Wang	2ad7e1251e	[Doc] Fix quant documentation to make it reproducible (#2277 ) ### What this PR does / why we need it? Fixed the expression of msit for code clone - vLLM version: v0.10.0 - vLLM main: `afa5b7ca0b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-14 17:19:47 +08:00
Li Wang	bf84f2dbfa	[Doc] Support kimi-k2-w8a8 (#2162 ) ### What this PR does / why we need it? In fact, the kimi-k2 model is similar to the deepseek model, and we only need to make a few changes to support it. what does this pr do: 1. Add kimi-k2-w8a8 deployment doc 2. Update quantization doc 3. Upgrade torchair support list ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.10.0 - vLLM main: `9edd1db02b` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-08-06 19:28:47 +08:00
Wang Kunpeng	e3a2443c3a	[main][Doc] add mla pertoken quantization FAQ (#2018 ) ### What this PR does / why we need it? When using deepseek series models generated by the --dynamic parameter, if torchair graph mode is enabled, we should modify the configuration file in the CANN package to prevent incorrect inference results. - vLLM version: v0.10.0 - vLLM main: `7728dd77bb` --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com>	2025-07-27 08:47:51 +08:00
Li Wang	bdfb065b5d	[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011 ) ### What this PR does / why we need it? 1. Enable pymarkdown check 2. Enable python `__init__.py` check for vllm and vllm-ascend 3. Make clean code ### How was this patch tested? - vLLM version: v0.9.2 - vLLM main: `29c6fbe58c` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-07-25 22:16:10 +08:00
wangxiyuan	eb921d2b6f	[Doc] Fix 404 error (#1797 ) Fix url 404 error in doc - vLLM version: v0.9.2 - vLLM main: `9ad0a4588b` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-15 11:52:38 +08:00
wangxiyuan	b5b7e0ecc7	[Doc] Add qwen3 embedding 8b guide (#1734) 1. Add the tutorials for qwen3-embedding-8b 2. Remove VLLM_USE_V1=1 from the docs; it is no longer needed as of 0.9.2 - vLLM version: v0.9.2 - vLLM main: `5923ab9524` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-11 17:40:17 +08:00
wangxiyuan	3d1e6a5929	[Doc] Update user doc index (#1581) Add user doc index to make the user guide clearer - vLLM version: v0.9.1 - vLLM main: `49e8c7ea25` Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-07-10 14:26:59 +08:00

18 Commits
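
Steps 1 and 2 of the workflow described in commit 76844eec78 (record a per-expert token count rather than the raw `topk_ids`, then all-gather the counts) can be illustrated with a minimal pure-Python sketch. This is not the vllm-ascend implementation: the function names, the routing data, and the list-based `all_gather` stand-in are all hypothetical, and the counts are derived from routing choices here only for illustration, whereas the commit obtains `expert_token_num` from the dispatch step.

```python
def count_tokens_per_expert(topk_ids, num_experts):
    """Collapse per-token routing choices into one counter per expert.

    The result has num_experts entries, however many tokens were routed,
    which is why recording counts is cheaper than recording topk_ids.
    """
    counts = [0] * num_experts
    for token_choices in topk_ids:
        for expert_id in token_choices:
            counts[expert_id] += 1
    return counts


def all_gather(per_rank_counts):
    """Hypothetical stand-in for a collective all-gather.

    Every rank ends up with a copy of every rank's count vector.
    """
    return [list(counts) for counts in per_rank_counts]


# Assumed toy setup: 2 ranks, 4 experts, top-2 routing.
NUM_EXPERTS = 4
rank0_topk = [[0, 1], [0, 2], [0, 3]]  # 3 tokens on rank 0
rank1_topk = [[1, 2], [0, 1]]          # 2 tokens on rank 1

rank_counts = [
    count_tokens_per_expert(topk, NUM_EXPERTS)
    for topk in (rank0_topk, rank1_topk)
]
gathered = all_gather(rank_counts)

# Global load per expert, as a load-balancing algorithm might consume it.
global_load = [sum(column) for column in zip(*gathered)]
print(rank_counts)   # per-rank counts
print(global_load)   # summed over ranks
```

The per-rank vectors stay separate after the gather, so a consumer can see both the global load and which rank contributed it.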