xc-llm-ascend

Author	SHA1	Message	Date
liziyu	464270e4ca	Remove useless PD check in deepseek (#3161 ) ### What this PR does / why we need it? Remove useless PD check in deepseek ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com> Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>	2025-09-24 23:25:47 +08:00
Mengqing Cao	2d885869c5	[KVCache][Bugfix] Fix kv cache initialization error of attention layer (#3113 ) ### What this PR does / why we need it? Fixes #3096 1. Fix kv cache initialization error of attention layer. There are some models with layer name like `attn.attn`, instead of `self_attn`, but the initialization of kv cache tensors only checks for `self_attn` and `attn.attn`, which leads to the error `AssertionError: Some layers are not correctly initialized` 2. Set the default value of the input arg `sampling_metadata` in `compute_logits` for the modeling files in vllm-ascend, thus fixing the error `Qwen3NextForCausalLM.compute_logits() missing 1 required positional argument: 'sampling_metadata'` ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Tested locally with internlm - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-09-24 11:32:34 +08:00
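A minimal sketch of the second fix described in #3113, assuming the default chosen is `None` (the commit message does not state the value). `ToyCausalLM` and its doubling "projection" are hypothetical stand-ins, not vllm-ascend code; the point is only that a defaulted parameter lets both the old and the new calling convention work.

```python
from typing import Optional, Sequence


class ToyCausalLM:
    """Hypothetical stand-in for a modeling class; not vllm-ascend code."""

    def compute_logits(
        self,
        hidden_states: Sequence[float],
        sampling_metadata: Optional[object] = None,  # assumed default
    ) -> list[float]:
        # The metadata argument is accepted but unused, so callers that no
        # longer pass it do not hit "missing 1 required positional argument".
        return [2.0 * h for h in hidden_states]


model = ToyCausalLM()
# Newer call site: no sampling_metadata passed.
logits_new = model.compute_logits([1.0, 2.0])
# Older call site: sampling_metadata still passed.
logits_old = model.compute_logits([1.0, 2.0], sampling_metadata=object())
```

Both calls return the same logits, which is the compatibility property the fix is after.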
weijinqian0	6aa4253798	[Refactor] [SP] The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085 ) What this PR does / why we need it? There are two sets of SP implementations for MoE and dense models. One is called sequence_parallelism, and the other is flashcomm_v1. We did the following things: Merge the two sets of code with the same implementation into one. Remove the implementation of sequence_parallelism, as this solution cannot support aclgraph. Does this PR introduce any user-facing change? No How was this patch tested? e2e&ut - vLLM version: v0.10.2 - vLLM main: `f225ea7dd9` --------- Signed-off-by: weijinqian_v1 <weijinqian@huawei.com> Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>	2025-09-24 11:29:59 +08:00
linfeng-yuan	d01fd1d1c3	[misc][torchair] fix bugs around `deepseek mtp`, `enable_shared_expert_dp` and `use_cached_kv_cache_bytes` (#3074 ) ### What this PR does / why we need it? This miscellaneous PR contains several small fixes: 1) fix initialization and forward bugs of DeepseekMTPLayer with `shared_expert_dp` enabled. 2) fix a tensor shape mismatch after o_proj caused by a workaround change in NPUModelRunner. 3) avoid an unnecessary decline of kv_cache memory (default: 64MB) with `use_cached_kv_cache_bytes` disabled. 4) fall back `fused_moe_state` from `MC2` to `All2All`, since the padding logic of `mc2_mask` is incompatible with the input hidden_states when `shared_expert_dp` is enabled. Once this PR is merged, users can launch disaggregated_prefill deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp` as on the `v0.9.1-dev` branch. The remaining problem of the kv_cache tokens decline compared to `v0.9.1-dev` will be resolved by https://github.com/vllm-project/vllm-ascend/pull/3073. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E vllm serving of deepseek_mtp with torchair graph mode and `enable_shared_expert_dp` with eager mode. Large ep deployments are also tested with this PR. - vLLM version: v0.10.2 - vLLM main: `5aeb925452` --------- Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-23 14:52:42 +08:00
Li Wang	02f89d166f	[CI] Update vllm version to 20250922(5aeb925) (#3091 ) ### What this PR does / why we need it? This PR bumps the vllm commit hash to `5aeb925452` and fixes these issues: 1. https://github.com/vllm-project/vllm/pull/25345 has removed v0 metadata 2. https://github.com/vllm-project/vllm/pull/25332 3. https://github.com/vllm-project/vllm/pull/25334 4. https://github.com/vllm-project/vllm/pull/23558, note that this vllm commit updates the model registration logic, which now checks that all registered models have the `vllm.model_executor.models` path, which breaks our custom registration of the deepseek_v3 model (it doesn't exist in the vllm model path), so I moved the deepseek_v3 model registration to deepseek_v2 as a temporary solution ### How was this patch tested? - vLLM version: v0.10.2 - vLLM main: `9607d5eb44` --------- Signed-off-by: wangli <wangli858794774@gmail.com>	2025-09-22 22:18:13 +08:00
whx	0a526768f5	[Feature] Support moe multi-stream for aclgraph. (#2946 ) This PR puts the calculation of shared experts into a separate stream, overlapping with routing experts. - vLLM version: v0.10.2 - vLLM main: `fbd6523ac0` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2025-09-19 11:06:45 +08:00
linfeng-yuan	8bcc0ccd57	[bugfix] fix shared expert dp with hybrid kvcache (#2964 ) ### What this PR does / why we need it? https://github.com/vllm-project/vllm-ascend/pull/2849 moves the implementation of `shared_expert_dp` to torchair deepseek_modeling. However, the call to `set_forward_context` with `enforce_eager` and `shared_expert_dp` falls back to the implementation of model_runner_v1.py and sets the global attn_metadata as a dictionary. This leads to a RuntimeError when attn_metadata is retrieved from the forward context and used in torchair_deepseek_v2.py. This PR fixes this problem by introducing the transformation of attn_metadata in this file. Note that current E2E testing lacks the case of deepseek with `shared_expert_dp`. We need to add an ST with `shared_expert_dp` to the testing workflow. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? e2e vllm serving with `enable_shared_expert_dp: true` passed. - vLLM version: v0.10.2 - vLLM main: `de3e53a75b` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-09-17 20:01:47 +08:00
offline893	76844eec78	Dynamic Expert Load Balance with Zero-like-overhead (#2956 ) ### Motivation Currently, dynamic expert load balancing stops the world. Asynchronous expert load balancing would be better, as it avoids the following problems: Host-bound latency: EPLB involves many CPU operations, such as the EPLB algorithm, creating p2p ops, and log2phy expert conversion, which take a long CPU time (~1s). Communication latency: the transfer time is high in the situation without nvlink. The weight of an expert may be transferred to multiple new positions, requiring N send/recv operations for one expert and resulting in long latency. We measured that batch_isend_irecv costs more than 100ms to transmit the weights of 16 experts on an Ascend A2 server. SwiftBalancer no longer stops the world; in our test on NPU it costs 1~2ms per layer while improving decode latency by 5ms-8ms with ep_size = 64. The following updates have been made: 1. expert distribution recording with lower cost. 2. async CPU computing for the EPLB algorithm and other Python operators. 3. a new EPLB algorithm with less expert rebalancing and almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the VLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record the expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, which gives a much smaller tensor shape and reduces the cost of HBM recording and the add operator. 2. Do an all-gather of the expert distribution. All-gather is used instead of all-reduce because it has less traffic volume. 3. Wake up the eplb worker process with the expert distribution when num_iterations is reached. Run the eplb algorithm in the eplb worker. 4. Generate p2p send/recv ops and other operators such as log2phy, which cost long CPU time. 5. Launch ibatch_send_recv in async_stream before forward. 6. After forward, wait for ibatch_send_recv to finish, then update the expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: `567939953b` --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>	2025-09-17 10:36:43 +08:00
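The record / gather / wake-worker steps of the #2956 workflow can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: a thread stands in for the eplb worker process, a column-wise sum over per-rank lists stands in for consuming the all-gathered expert distribution, and `rebalance` is a hypothetical greedy placeholder rather than the PR's EPLB algorithm. All names here are invented for the example.

```python
import queue
import threading


def rebalance(loads, num_ranks):
    """Placeholder policy: assign the heaviest expert to the least-loaded rank."""
    rank_load = [0] * num_ranks
    placement = {}
    for expert in sorted(range(len(loads)), key=lambda e: -loads[e]):
        rank = rank_load.index(min(rank_load))
        placement[expert] = rank
        rank_load[rank] += loads[expert]
    return placement


def eplb_worker(requests, results):
    """Runs off the forward path; a None request shuts the worker down."""
    while True:
        item = requests.get()
        if item is None:
            break
        loads, num_ranks = item
        results.put(rebalance(loads, num_ranks))


requests, results = queue.Queue(), queue.Queue()
worker = threading.Thread(target=eplb_worker, args=(requests, results), daemon=True)
worker.start()

# Step 1: each rank records a per-expert token count (one small vector per
# rank) rather than the full topk_ids tensor.
per_rank_counts = [[5, 1, 0, 2], [3, 0, 4, 1]]
# Step 2: after gathering every rank's vector, combine them per expert.
gathered = [sum(col) for col in zip(*per_rank_counts)]
# Step 3: hand the distribution to the worker; forward passes would keep
# running here instead of waiting for the result.
requests.put((gathered, 2))
placement = results.get(timeout=5)

requests.put(None)
worker.join()
```

The forward loop only enqueues a small load vector and later picks up a finished placement, which is the property that keeps rebalancing from stopping the world.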
wyu0-0	eab3635850	[Bugfix] Retrieve num_redundant_experts from eplb_config in torchair qwen3_moe.py (#2857 ) ### What this PR does / why we need it? This PR addresses a configuration retrieval issue related to EPLB (Expert Parallel Load Balancing) settings in qwen3_moe.py. The key change is adjusting the source of num_redundant_experts to correctly fetch from the eplb_config sub-structure within parallel_config, rather than directly from parallel_config. This aligns with the updated configuration hierarchy for EPLB-related parameters. This change references `vllm_ascend/models/qwen3_moe.py` https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/models/qwen3_moe.py#L255-L257 ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? run bash as follows and test pass ``` source /sfs_turbo/humpy/B080/cann_b080/ascend-toolkit/set_env.sh source /sfs_turbo/humpy/B080/cann_b080/nnal/atb/set_env.sh #export HCCL_BUFFSIZE=300 # export HCCL_SOCKET_IFNAME="eth0" # export TP_SOCKET_IFNAME="eth0" # export GLOO_SOCKET_IFNAME="eth0" # export HCCL_IF_IP=33.215.118.231 export VLLM_USE_V1=1 export VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ=1 export TASK_QUEUE_ENABLE=1 # export VLLM_VERSION=0.9.1 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_INTRA_PCIE_ENABLE=1 export HCCL_INTRA_ROCE_ENABLE=0 rm -rf ./.torchair_cache/ rm -rf ./dynamo_* rm -rf /root/ascend/log/debug/plog/* python -m vllm.entrypoints.openai.api_server \ --model=/sfs_turbo/tzq/model/Qwen/Qwen3-235B-A22B/ \ --served-model-name auto \ --port 8006 \ -tp 1 \ -dp 16 \ --enable_expert_parallel \ --max-num-seqs 48 \ --max-model-len 32768 \ --gpu-memory-utilization 0.95 \ --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes_init":false,"graph_batch_sizes":[1, 8, 16, 24, 48]}, "ascend_scheduler_config":{"enabled":false}, "refresh":true}' \ --kv-transfer-config \ '{ "kv_connector": 
"SharedStorageConnector", "kv_buffer_device": "npu", "kv_role": "kv_consumer", "kv_parallel_size": 2, "kv_port": "20002", "engine_id": "decode-'${NODE_RANK}'", "kv_rank": 1, "kv_connector_extra_config": { "prefill": { "dp_size": 1, "tp_size": 16 }, "decode": { "dp_size": 16, "tp_size": 1 } } }' \ 2>&1 disown ``` - vLLM version: main - vLLM main: `0ae43dbf8c` Signed-off-by: wyu0-0 <woshilynn@163.com>	2025-09-11 22:15:19 +08:00
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces Oproj matrix tensor model parallelism to reduce memory consumption. It only supports graph mode in the pure DP scenario. In a deepseek r1 w8a8 PD disaggregated Decode instance, using pure DP with oproj_tensor_parallel_size = 8, we see a 1 ms TPOT increase and save 5.8 GB of NPU memory per rank. We got the best performance with oproj_tensor_parallel_size=4, without a TPOT increase. performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| default value is None, once this value is set, the feature will be enabled, head num * head dim must be divisible by this value. \| example `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
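The constraint stated for `oproj_tensor_parallel_size` in #2167 (unset means disabled; otherwise head num * head dim must be divisible by it) can be sketched as a small check. The function name `oproj_shard_rows` and the head counts in the usage lines are illustrative assumptions, not values or APIs from the PR.

```python
from typing import Optional


def oproj_shard_rows(
    num_heads: int,
    head_dim: int,
    oproj_tensor_parallel_size: Optional[int] = None,
) -> int:
    """Return how many o_proj rows one rank holds (hypothetical helper)."""
    total_rows = num_heads * head_dim
    if oproj_tensor_parallel_size is None:
        # Feature disabled: the rank keeps the full row dimension.
        return total_rows
    if oproj_tensor_parallel_size <= 0 or total_rows % oproj_tensor_parallel_size:
        raise ValueError(
            f"head num * head dim ({total_rows}) must be divisible by "
            f"oproj_tensor_parallel_size ({oproj_tensor_parallel_size})"
        )
    return total_rows // oproj_tensor_parallel_size


# Illustrative numbers only: 128 heads of dimension 128.
rows_split = oproj_shard_rows(128, 128, 8)
rows_full = oproj_shard_rows(128, 128)
```

With these illustrative sizes, setting the value to 8 leaves each rank one eighth of the rows, while a value that does not divide the row dimension is rejected up front.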
Angazenn	e7409e95ee	[1/N][Draft][Refactor]torchair pangu_moe modeling refactor (#2437 ) ### What this PR does / why we need it? 1. Similar to #2384 , this PR adds a torchair-specific modeling for pangu. 2. Fixes a bug introduced by routed_scaling_factor in #2675 . 3. Removes the eager test case for pangu, since there is already a torchair test case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: zengyanjia <z00883269@china.huawei.com> Signed-off-by: Angazenn <supperccell@163.com> Co-authored-by: zengyanjia <z00883269@china.huawei.com>	2025-09-04 10:39:21 +08:00
Wang Yixuan	20a7bc4b71	[3/N][refactor] refactor quantization (#2504 ) ### What this PR does / why we need it? Move the torchair related quantization section into the torchair dir to make the code clear. Next step we'll remove all torchair related code outside of torchair quantization. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: main vLLM main: `ab9f2cfd19` - vLLM version: v0.10.1.1 - vLLM main: `959783fb99` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-27 10:45:50 +08:00
Wang Yixuan	0f81e032f0	[1/N][refactor] torchair fused_moe refactor (#2438 ) ### What this PR does / why we need it? Move torchair related fused_moe section into torchair_fused_moe to make the code clear. Next step we'll remove all torchair related code outside of torchair_fused_moe . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.10.0 vLLM main: `08d5f7113a` - vLLM version: v0.10.1.1 - vLLM main: `170e8ea9ea` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-25 15:46:10 +08:00
Nicholas Tao	7bec1a9b9c	qwen3_moe/qwen25 support torchair graph (#2403 ) ### What this PR does / why we need it? Added support for the TorchAir graph mode in qwen3_moe and qwen2.5 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```python llm = LLM( model=model, tensor_parallel_size=GPUs_per_dp_rank, enforce_eager=False, enable_expert_parallel=True, max_model_len=4096, max_num_seqs=16, trust_remote_code=trust_remote_code, gpu_memory_utilization=0.4, additional_config={ "torchair_graph_config": { "enabled": True, "use_cached_graph": False, "graph_batch_sizes_init": False, "graph_batch_sizes": [16] }, "ascend_scheduler_config": { "enabled": True, "chunked_prefill_enabled":True, }, "refresh": True, }, ) ``` - vLLM version: v0.10.0 - vLLM main: `b87cb97a53` Signed-off-by: taoyuxiang <oui.nicholas.tao@gmail.com>	2025-08-20 11:23:50 +08:00
linfeng-yuan	3fc31ee1cb	[1/N][refactor] torchair deepseek modeling refactor (#2384 ) ### What this PR does / why we need it? Move torchair related model arch into the torchair module to make the code clear. Next step we'll remove all torchair related code outside of the torchair module. ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.10.0 - vLLM main: `08d5f7113a` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-18 15:00:37 +08:00

15 Commits