# Feature Guide

This section provides detailed usage guides for vLLM Ascend features.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
sleep_mode
structured_output
lora
eplb_swift_balancer
:::