### What this PR does / why we need it?
This PR ports #2312, #2506 and #2531 to the main branch.
The original implementation of torchair caching forced users to prepare
everything in advance, freeze all configuration and enable
`use_cached_npu_graph`, which could cause problems that are confusing for
users to understand and tackle. It is better to compile the graph twice
instead of reusing the old kv caches and the cached torchair graph, and
the extra compilation time is acceptable. Additionally, this PR fixes a
recompilation problem in torchair graph mode caused by the
`running_in_graph` variable in `AscendMLATorchairImpl`.
### Does this PR introduce _any_ user-facing change?
If users want to enable torchair.cache_compile with high compilation
speed, it is recommended to enable both `use_cached_kv_cache_bytes` and
`use_cached_graph` in `torchair_graph_config`. Without
`use_cached_kv_cache_bytes`, we'll compile the torchair computation graph
twice to avoid runtime errors caused by configuration mismatches (the
second compilation will be much faster). Additionally, we've changed how
the TORCHAIR_CACHE_HOME environment variable is used: a suffix directory
is added to enhance safety and prevent accidental file deletion.
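For reference, a minimal offline sketch of enabling both options (the
option names come from `torchair_graph_config` above; the model name is a
placeholder and `additional_config` is assumed to be the usual way
torchair options are passed to vLLM):

```python
from vllm import LLM

# Sketch only: enable torchair graph mode with both caching options so the
# compiled graph and cached kv-cache-bytes can be reused across restarts
# instead of compiling the graph twice.
llm = LLM(
    model="your-model",  # placeholder
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "use_cached_graph": True,
            "use_cached_kv_cache_bytes": True,
        },
    },
)
```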
### How was this patch tested?
CI and e2e vLLM serving tests pass.
- vLLM version: v0.10.1.1
- vLLM main:
70549c1245
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Due to the registration mechanism, torchair ops cannot take effect, so we
have to patch the Ascend ops to adapt them to torchair.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: main
- vLLM main:
7ea22e42d5
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Fix an MTP torchair bug caused by the torchair refactor and the MoE refactor.
Depends on PRs:
- fused MoE fix: https://github.com/vllm-project/vllm-ascend/pull/2627
- torchair multi-DP fix: https://github.com/vllm-project/vllm-ascend/pull/2626
### Does this PR introduce _any_ user-facing change?
When DP is enabled, running the MTP online server requires disabling
server log stats with `--disable-log-stats`, because the current metrics
do not support multi-DP.
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
7c8271cd1e
Signed-off-by: xuyexiong <xuyexiong@huawei.com>
### What this PR does / why we need it?
support torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
5438967fbc
Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
### What this PR does / why we need it?
Move torchair related rotary ops into torchair dir to make the code
clear. Next step we'll remove all torchair related code outside of
torchair rotary ops.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
81eea3d348
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
There is a lot of redundant MoE-related code here, and the structure is
not very clear. We did the following:
- placed the relatively independent `apply_mlp` code into a separate file;
- removed the environment variables for alltoall_buffer and alltoall_seq;
- removed the code related to alltoall_buffer and alltoall_seq, and kept
the sole `TokenDispatcher` subclass.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
e2e & ut
- vLLM version: v0.10.1.1
- vLLM main:
4071c76cf3
---------
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
remove aicpu op for torchair mode
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.10.1.1
vLLM main:
05d839c19e
- vLLM version: v0.10.1.1
- vLLM main:
67c14906aa
Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
### What this PR does / why we need it?
bugfix for torchair graph
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
67c14906aa
Signed-off-by: zhangdepeng <zhangdepeng2@huawei.com>
Co-authored-by: zhangdepeng <zhangdepeng2@huawei.com>
### What this PR does / why we need it?
It is confirmed that `num_input_tokens` must be assigned the value of
`maybe_padded_num_tokens` under all circumstances.
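As a toy illustration of why (hypothetical helper, not the actual
model-runner code): once the batch is padded up to a captured graph size,
every downstream shape has to follow the padded count, so
`num_input_tokens` must take the padded value:

```python
def pick_num_input_tokens(num_scheduled_tokens: int,
                          captured_sizes: list[int]) -> int:
    """Toy sketch: pad the token count up to the nearest captured graph size."""
    maybe_padded_num_tokens = next(
        (s for s in sorted(captured_sizes) if s >= num_scheduled_tokens),
        num_scheduled_tokens,
    )
    # Always use the padded value so input buffers, positions and attention
    # metadata agree with the shape the captured graph expects.
    num_input_tokens = maybe_padded_num_tokens
    return num_input_tokens
```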
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Waiting for daily test for TorchAir.
- vLLM version: v0.10.1.1
- vLLM main:
006477e60b
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
The determination of attention state, padding, and other forward
metadata has been moved to an earlier stage within the input preparation
process. This change enables us to utilize a single all-reduce
operation, maximizing synchronization efficiency as early as possible.
The logic for synchronizing metadata—such as the number of tokens,
prefill status, and DBO status—across data parallel (DP) ranks has now
been unified and simplified.
For performance improvements, the all-reduce operation has been switched
from the `gloo` backend to the `npu` backend, which results in a
reduction of several milliseconds per step (**approximately 10%
performance gain for TPOT!**).
Additionally, the multi-DP server hang issue has been resolved, ensuring
no more hangs occur when `num_requests < dp_size`. At last, a relief.
Finally, the miscalculated memory usage issue has been addressed by
removing the unnecessary `DummyCommImpl`, allowing the system to use the
real communication method when determining available memory.
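To illustrate the approach (a simplified `torch.distributed` sketch, not
the actual vllm-ascend code): the per-rank metadata is packed into one
small device tensor and reduced once on the device-side group, replacing
several `gloo` (CPU) exchanges:

```python
import torch
import torch.distributed as dist

def sync_forward_metadata(num_tokens: int, with_prefill: bool, enable_dbo: bool,
                          dp_group, device: torch.device):
    """Sketch: one all-reduce carries all cross-DP forward metadata."""
    # Pack everything into a single tensor on the device so only one
    # collective is issued per step.
    packed = torch.tensor([num_tokens, int(with_prefill), int(enable_dbo)],
                          dtype=torch.int32, device=device)
    dist.all_reduce(packed, op=dist.ReduceOp.MAX, group=dp_group)
    return int(packed[0]), bool(packed[1]), bool(packed[2])
```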
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Maybe we should add a test case for the multi-DP online server?
@MengqingCao
- vLLM version: v0.10.1.1
- vLLM main:
c5d004aaaf
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
torchair w8a8 and w4a8 Separate from fused_moe due to the refactor and
change for fused_moe
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
69244e67e6
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Move torchair related qunatization section into torchair dir to make the
code clear. Next step we'll remove all torchair related code outside of
torchair quantization.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
vLLM version: main
vLLM main:
ab9f2cfd19
- vLLM version: v0.10.1.1
- vLLM main:
959783fb99
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Fix the bug of an invalid `cos` shape when DP is enabled.
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
1fdc732419
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Move torchair related fused_moe section into torchair_fused_moe to make
the code clear. Next step we'll remove all torchair related code outside
of torchair_fused_moe .
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM version: v0.10.0
vLLM main:
08d5f7113a
- vLLM version: v0.10.1.1
- vLLM main:
170e8ea9ea
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
Fix the MTP mode UT.
### Does this PR introduce _any_ user-facing change?
Nothing
### How was this patch tested?
This can be tested in the same way as a unit test.
- vLLM version: v0.10.0
- vLLM main:
53415653ff
Signed-off-by: 赵江江 <zhaojiangjiang1@h-partners.com>
Co-authored-by: 赵江江 <zhaojiangjiang1@h-partners.com>
### What this PR does / why we need it?
This PR move current unified mla backend to torchair folder and remove
torchair-related code in attention/mla_v1.py (1.3k -> 0.9k).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran eager mode with the MLA backend, and torchair mode with the code
before [#2445](https://github.com/vllm-project/vllm-ascend/pull/2445).
- vLLM version: v0.10.0
- vLLM main:
f571ff8eb6
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Refactor model runner v1
### What this PR does / why we need it?
1. Separate the execute model logic from the prepare input logic
2. Disassemble the torchair logic in model runner v1
- vLLM version: v0.10.0
- vLLM main:
68fcd3fa73
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
Fix some CI issues and refactor the model runner.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.10.0
- vLLM main:
4d9c61993a
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
Co-authored-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
1. Move `torchair_attention` to `torchair` dir.
2. Make `AscendAttentionTorchairBackend` extend `AscendAttentionBackend`
to reduce duplicate methods.
3. Make `AscendTorchairMetadata` extend `AscendMetadata` to reduce
duplicate properties.
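Below is a minimal sketch of the inheritance described in points 2 and 3
(fields and methods are illustrative, not the real definitions):

```python
from dataclasses import dataclass
from typing import Any, Optional
import torch

@dataclass
class AscendMetadata:
    # Illustrative shared fields used by both eager and torchair attention.
    num_actual_tokens: int
    seq_lens: torch.Tensor
    block_tables: torch.Tensor

@dataclass
class AscendTorchairMetadata(AscendMetadata):
    # Only torchair-specific extras are declared here.
    decode_meta: Optional[Any] = None

class AscendAttentionBackend:
    @staticmethod
    def get_name() -> str:
        return "ASCEND"

class AscendAttentionTorchairBackend(AscendAttentionBackend):
    # Inherit the shared helpers; override only what torchair needs.
    @staticmethod
    def get_name() -> str:
        return "ASCEND_TORCHAIR"
```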
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
0933f9d518
---------
Signed-off-by: shen-shanshan <467638484@qq.com>
### What this PR does / why we need it?
Move the torchair-related model arch into the torchair module to make the
code clearer. As the next step we'll remove all torchair-related code
outside of the torchair module.
### Does this PR introduce _any_ user-facing change?
No.
- vLLM version: v0.10.0
- vLLM main:
08d5f7113a
Signed-off-by: linfeng-yuan <1102311262@qq.com>
There is a lot of torchair code in the model runner, which makes the code
hard to maintain. We'll create a new torchair_model_runner to split out
torchair-related logic, following the workflow in #2203.
What this PR does:
create a common function `_capture_model` for capture_model
- vLLM version: v0.10.0
- vLLM main:
1891a265d3
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is a lot of torchair code in the model runner, which makes the code
hard to maintain. We'll create a new torchair_model_runner to split out
torchair-related logic, following the workflow in #2203; this is the first PR.
What this PR does:
create a common function `_convert_torch_foramt` for initialize_kv_cache
- vLLM version: v0.10.0
- vLLM main:
14a5d903ab
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is a lot of torchair code in the model runner, which makes the code
hard to maintain. We'll create a new torchair_model_runner to split out
torchair-related logic, following the workflow in #2203; this is the first PR.
What this PR does:
create common functions `_build_attention_metadata` and
`_generate_dummy_run_hidden_states` for dummy_run
- vLLM version: v0.10.0
- vLLM main:
ebf7605b0d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is a lot of torchair code in the model runner, which makes the code
hard to maintain. We'll create a new torchair_model_runner to split out
torchair-related logic, following the workflow in #2203.
What this PR does:
move the `torchair`-related logic into `_get_forward_metadata_across_dp`
and override it in the torchair model runner
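Schematically, the split is a plain hook-and-override pattern (class
names and internals here are illustrative, not the real signatures):

```python
class NPUModelRunner:
    def _get_forward_metadata_across_dp(self, num_tokens: int,
                                        with_prefill: bool) -> tuple[int, bool]:
        # Common (non-torchair) behaviour: sketch only.
        return num_tokens, with_prefill


class NPUTorchairModelRunner(NPUModelRunner):
    # Hypothetical attribute: batch sizes the torchair graphs were captured for.
    torchair_graph_batch_sizes = [8, 16, 32, 64]

    def _get_forward_metadata_across_dp(self, num_tokens: int,
                                        with_prefill: bool) -> tuple[int, bool]:
        # Torchair graphs need graph-friendly (padded) sizes, so the hook is
        # overridden here instead of branching on torchair flags in common code.
        padded = next((s for s in sorted(self.torchair_graph_batch_sizes)
                       if s >= num_tokens), num_tokens)
        return padded, with_prefill
```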
- vLLM version: v0.10.0
- vLLM main:
1b99028069
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is a lot of torchair code in the model runner, which makes the code
hard to maintain. We'll create a new torchair_model_runner to split out
torchair-related logic, following the workflow in #2203; this is the first PR.
What this PR does:
create the new torchair model runner; more functions will be added later
- vLLM version: v0.10.0
- vLLM main:
586f286789
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR refactors forward_context and model_runner_v1: it adds some
context that is necessary for model inference into forward_context, and
refactors the dummy_run logic to make it more reasonable.
Some details for this PR:
- Add `ascend_forward_context`;
- Update the mc2_v2 op and support the `active_mask` param;
- Update scripts in the examples dir;
- Refactor the `dummy_run` logic;
- Add soc_version for A2 and A3.
### Does this PR introduce _any_ user-facing change?
No user-facing change.
### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main:
57c22e57f9
Signed-off-by: zzzzwwjj <1183291235@qq.com>
There is a lot of torchair-specific logic in common code, which makes the
code hard to maintain. We will create a new torchair module and put
torchair-related logic there. I plan to add 4 PRs:
1. Refactor worker
2. Refactor utils (this PR)
   - a simple change that moves all torchair-related util functions to the
torchair module
3. Refactor model_runner
4. Refactor attention
- vLLM version: v0.9.2
- vLLM main:
8188196a1c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
There is a lot of torchair-specific logic in common code, which makes the
code hard to maintain. We will create a new torchair module and put
torchair-related logic there. I plan to add 4 PRs:
1. Refactor worker (this PR)
   - create the torchair module and move torchair-related code in the
worker to the new module
2. Refactor utils
3. Refactor model_runner
4. Refactor attention
- vLLM version: v0.9.2
- vLLM main:
8188196a1c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>