xc-llm-ascend

Author	SHA1	Message	Date
wyu0-0	eab3635850	[Bugfix] Retrieve num_redundant_experts from eplb_config in torchair qwen3_moe.py (#2857 ) ### What this PR does / why we need it? This PR addresses a configuration retrieval issue related to EPLB (Expert Parallel Load Balancing) settings in qwen3_moe.py. The key change is adjusting the source of num_redundant_experts to correctly fetch from the eplb_config sub-structure within parallel_config, rather than directly from parallel_config. This aligns with the updated configuration hierarchy for EPLB-related parameters. This change references `vllm_ascend/models/qwen3_moe.py` https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/models/qwen3_moe.py#L255-L257 ### Does this PR introduce _any_ user-facing change? NA ### How was this patch tested? run bash as follows and test pass ``` source /sfs_turbo/humpy/B080/cann_b080/ascend-toolkit/set_env.sh source /sfs_turbo/humpy/B080/cann_b080/nnal/atb/set_env.sh #export HCCL_BUFFSIZE=300 # export HCCL_SOCKET_IFNAME="eth0" # export TP_SOCKET_IFNAME="eth0" # export GLOO_SOCKET_IFNAME="eth0" # export HCCL_IF_IP=33.215.118.231 export VLLM_USE_V1=1 export VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ=1 export TASK_QUEUE_ENABLE=1 # export VLLM_VERSION=0.9.1 export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True export HCCL_OP_EXPANSION_MODE="AIV" export HCCL_INTRA_PCIE_ENABLE=1 export HCCL_INTRA_ROCE_ENABLE=0 rm -rf ./.torchair_cache/ rm -rf ./dynamo_* rm -rf /root/ascend/log/debug/plog/* python -m vllm.entrypoints.openai.api_server \ --model=/sfs_turbo/tzq/model/Qwen/Qwen3-235B-A22B/ \ --served-model-name auto \ --port 8006 \ -tp 1 \ -dp 16 \ --enable_expert_parallel \ --max-num-seqs 48 \ --max-model-len 32768 \ --gpu-memory-utilization 0.95 \ --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes_init":false,"graph_batch_sizes":[1, 8, 16, 24, 48]}, "ascend_scheduler_config":{"enabled":false}, "refresh":true}' \ --kv-transfer-config \ '{ "kv_connector": 
"SharedStorageConnector", "kv_buffer_device": "npu", "kv_role": "kv_consumer", "kv_parallel_size": 2, "kv_port": "20002", "engine_id": "decode-'${NODE_RANK}'", "kv_rank": 1, "kv_connector_extra_config": { "prefill": { "dp_size": 1, "tp_size": 16 }, "decode": { "dp_size": 16, "tp_size": 1 } } }' \ 2>&1 disown ``` - vLLM version: main - vLLM main: `0ae43dbf8c` Signed-off-by: wyu0-0 <woshilynn@163.com>	2025-09-11 22:15:19 +08:00
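The configuration change described in #2857 can be illustrated with a minimal sketch. The two dataclasses below are hypothetical stand-ins written for this example, not the real vLLM classes. Only the attribute path `parallel_config.eplb_config.num_redundant_experts` is taken from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class EPLBConfig:
    # Hypothetical stand-in for the EPLB sub-structure named in the commit.
    num_redundant_experts: int = 0


@dataclass
class ParallelConfig:
    # Hypothetical stand-in: EPLB settings are nested under eplb_config,
    # not stored directly on the parallel config.
    eplb_config: EPLBConfig = field(default_factory=EPLBConfig)


def get_num_redundant_experts(parallel_config: ParallelConfig) -> int:
    # Read the value from the nested eplb_config, as the fix describes.
    return parallel_config.eplb_config.num_redundant_experts


config = ParallelConfig(eplb_config=EPLBConfig(num_redundant_experts=2))
redundant = get_num_redundant_experts(config)
```

Reading the value through a single helper keeps the lookup path in one place if the configuration hierarchy changes again.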
lidenghui1110	5a7181569c	[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. (#2167 ) ### What this PR does / why we need it? This PR introduces tensor model parallelism for the o_proj matrix to reduce memory consumption. It only supports graph mode in the pure DP scenario. In a DeepSeek R1 W8A8 PD-disaggregated decode instance using pure DP, setting oproj_tensor_parallel_size = 8 increased TPOT by 1 ms and saved 5.8 GB of NPU memory per rank. The best performance was obtained with oproj_tensor_parallel_size = 4, with no TPOT increase. Performance data: <img width="1442" height="442" alt="image" src="https://github.com/user-attachments/assets/83270fc5-868a-4387-b0a9-fac29b4a376d" /> ### Does this PR introduce _any_ user-facing change? This PR introduces one new config in `additional_config`. \| Name \| Effect \| Required \| Type \| Constraints \| \| :---------------------------- \| :--------------------------------------- \| :------- \| :--- \| :----------------- \| \| oproj_tensor_parallel_size \| Split the o_proj matrix along the row dimension (head num * head dim) into oproj_tensor_parallel_size pieces. \| No \| int \| Default is None. Once this value is set, the feature is enabled. head num * head dim must be divisible by this value. \| Example: `--additional_config={"oproj_tensor_parallel_size": 8}` ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `eddaafc1c7` --------- Signed-off-by: zzhx1 <zzh_201018@outlook.com> Co-authored-by: zzh <zzh_201018@outlook.com>	2025-09-07 10:31:32 +08:00
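The divisibility constraint on `oproj_tensor_parallel_size` in #2167 can be sketched as follows. The helper `split_oproj_rows` is a hypothetical name invented for this illustration and is not part of vllm-ascend. The head count and head dimension are example values, not figures from the PR. The sketch only computes which row range each rank would hold when the row dimension (head num * head dim) is split evenly.

```python
def split_oproj_rows(num_heads: int, head_dim: int, oproj_tp_size: int):
    """Return one (start, end) row range per rank for an even row split.

    Hypothetical helper: it mirrors the documented constraint that
    head num * head dim must be divisible by oproj_tensor_parallel_size.
    """
    rows = num_heads * head_dim
    if oproj_tp_size <= 0 or rows % oproj_tp_size != 0:
        raise ValueError(
            f"{rows} rows cannot be split evenly into {oproj_tp_size} pieces"
        )
    shard = rows // oproj_tp_size
    return [(rank * shard, (rank + 1) * shard) for rank in range(oproj_tp_size)]


# Same key and value as the example in the PR description.
additional_config = {"oproj_tensor_parallel_size": 8}

# Illustrative shapes only (assumed, not taken from the PR).
ranges = split_oproj_rows(
    num_heads=128,
    head_dim=128,
    oproj_tp_size=additional_config["oproj_tensor_parallel_size"],
)
```

Each rank then stores only its own slice of the matrix, which is where the per-rank memory saving reported in the PR would come from.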
Angazenn	e7409e95ee	[1/N][Draft][Refactor]torchair pangu_moe modeling refactor (#2437 ) ### What this PR does / why we need it? 1. Similar to #2384 , this PR adds a torchair-specific modeling for pangu. 2. Fixes a bug introduced by routed_scaling_factor in #2675 . 3. Removes the eager test case for pangu, since a torchair test case already exists. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - vLLM version: v0.10.1.1 - vLLM main: `6997a25ac6` --------- Signed-off-by: zengyanjia <z00883269@china.huawei.com> Signed-off-by: Angazenn <supperccell@163.com> Co-authored-by: zengyanjia <z00883269@china.huawei.com>	2025-09-04 10:39:21 +08:00
Wang Yixuan	20a7bc4b71	[3/N][refactor] refactor quantization (#2504 ) ### What this PR does / why we need it? Move the torchair-related quantization code into the torchair dir to make the code clearer. As a next step, we'll remove all torchair-related code outside of torchair quantization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: main vLLM main: `ab9f2cfd19` - vLLM version: v0.10.1.1 - vLLM main: `959783fb99` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-27 10:45:50 +08:00
Wang Yixuan	0f81e032f0	[1/N][refactor] torchair fused_moe refactor (#2438 ) ### What this PR does / why we need it? Move the torchair-related fused_moe code into torchair_fused_moe to make the code clearer. As a next step, we'll remove all torchair-related code outside of torchair_fused_moe. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.10.0 vLLM main: `08d5f7113a` - vLLM version: v0.10.1.1 - vLLM main: `170e8ea9ea` Signed-off-by: hust17yixuan <303660421@qq.com>	2025-08-25 15:46:10 +08:00
Nicholas Tao	7bec1a9b9c	qwen3_moe/qwen25 support torchair graph (#2403 ) ### What this PR does / why we need it? Added support for the TorchAir graph mode in qwen3_moe and qwen2.5. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ```python llm = LLM( model=model, tensor_parallel_size=GPUs_per_dp_rank, enforce_eager=False, enable_expert_parallel=True, max_model_len=4096, max_num_seqs=16, trust_remote_code=trust_remote_code, gpu_memory_utilization=0.4, additional_config={ "torchair_graph_config": { "enabled": True, "use_cached_graph": False, "graph_batch_sizes_init": False, "graph_batch_sizes": [16] }, "ascend_scheduler_config": { "enabled": True, "chunked_prefill_enabled": True, }, "refresh": True, }, ) ``` - vLLM version: v0.10.0 - vLLM main: `b87cb97a53` Signed-off-by: taoyuxiang <oui.nicholas.tao@gmail.com>	2025-08-20 11:23:50 +08:00
linfeng-yuan	3fc31ee1cb	[1/N][refactor] torchair deepseek modeling refactor (#2384 ) ### What this PR does / why we need it? Move the torchair-related model architecture into the torchair module to make the code clearer. As a next step, we'll remove all torchair-related code outside of the torchair module. ### Does this PR introduce _any_ user-facing change? No. - vLLM version: v0.10.0 - vLLM main: `08d5f7113a` Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-08-18 15:00:37 +08:00

7 Commits