xc-llm-ascend

Author	SHA1	Message	Date
starkwj	389030a8f8	add env vars & misc	2026-02-11 06:27:58 +00:00
starkwj	739d074b0c	update other platforms' Dockerfile	2026-01-23 03:24:25 +00:00
starkwj	2a571d8bc8	support multi npu partially	2026-01-09 04:36:39 +00:00
starkwj	fa0fb46853	fix reload return value	2026-01-07 07:42:30 +00:00
lumian	074ae28d6e	Update README.md	2026-01-05 20:33:31 +08:00
starkwj	caf0289e1a	add Dockerfile and readme	2026-01-05 11:31:07 +00:00
starkwj	135cc0a505	vllm-ascend vnpu v1	2025-12-26 07:37:35 +00:00
zhangyiming	2f1aed98cc	[Doc] Update version policy to the latest. (#5071 ) ### What this PR does / why we need it? [Doc] Update version policy to the latest. Signed-off-by: menogrey <1299267905@qq.com>	2025-12-16 15:24:46 +08:00
zzzzwwjj	8c41770f1f	[bugfix] fix fp32 trans nz (#5068 ) ### What this PR does / why we need it? fix fp32 trans nz error, disable fp32 dtype trans nz. Signed-off-by: zzzzwwjj <1183291235@qq.com>	2025-12-16 15:04:31 +08:00
wangxiyuan	11e6d6c291	[doc] update developer guide (#5060 ) Update developer doc for v0.11.0-dev. This PR mainly picks developer doc from main to v0.11.0-dev. All related Feature work with 0.11.0 already. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-16 14:09:52 +08:00
zhangyiming	e07abfaa75	[Doc] Add new contributors. (#5066 ) ### What this PR does / why we need it? [Doc] Add new contributors. Signed-off-by: menogrey <1299267905@qq.com>	2025-12-16 12:47:40 +08:00
zhangxinyuehfad	ca0823f238	[0.11.0][Bugfix] fix fastapi version (#5052 ) ### What this PR does / why we need it? fix fastapi version Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-16 11:34:11 +08:00
Shanshan Shen	303c08aec9	[Doc] Update structured output doc with upstream link (#5058 ) ### What this PR does / why we need it? Cherry-pick from main https://github.com/vllm-project/vllm-ascend/pull/4015. Currently, the usage of structured output feature in vllm-ascend is totally the same as that in vllm. Thus, IMO, it's better to remove this doc directly to avoid some case that there are some changes in the upstream doc and we don't update our doc in time, which can be misleading to users. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-12-16 11:32:53 +08:00
Clorist33	2b5b309133	[Bugfix]Fix precision issues in moe_mlp (vllm-ascend v0.11.0-dev) (#5023 ) ### What this PR does / why we need it? Use group_list[0] to replace group_diff[0] in function "cumsum_group_list" (moe_mlp.py). The purpose is to modify it to the correct logic of converting cumsum to count. ### Does this PR introduce _any_ user-facing change? No Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: tanqingshan (A) <50050625@china.huawei.com>	2025-12-16 08:40:03 +08:00
zhangxinyuehfad	87c0cfafa3	[0.11.0][Bugfix] fix fastapi version (#5048 ) ### What this PR does / why we need it? fix fastapi version <0.124.0 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-12-15 23:51:38 +08:00
wangxiyuan	01a13a9b77	fix nz for quantization (#4943 ) quantization ops rely on NZ by force, we should remove the nz check for it. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-12 14:54:41 +08:00
sunchendd	5932abc446	[Bugfix] Fix the Eagle3 inference failure issue. (#4721 ) ### What this PR does / why we need it? Fix the Eagle3 inference failure issue. error message: "EngineCore encountered an issue. See stack trace (above) for the root cause." Fixes https://github.com/vllm-project/vllm-ascend/issues/4323 ### How was this patch tested? `vllm serve /nfs/1_AscendPackage/05_weights_public/Qwen3-32B \ --served-model-name Qwen3-32B \ -tp 4 \ --host "0.0.0.0" \ --port "8000" \ --trust-remote-code \ --speculative-config '{"method":"eagle3","model":"/home/scd/qwen3_32b_eagle3/","num_speculative_tokens":4,"draft_tensor_parallel_size":1}' \ --max-num-batched-tokens 4096 \ --max-model-len 4096` ``` curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen3-32B", "prompt": "hi, where is the capital of France?", "max_tokens": 10, "temperature": 0 }' | python3 -m json.tool ``` vLLM version: v0.11.0 vLLM-ascend version: v0.11.0rc2 Signed-off-by: 17764591921 <sunchend@outlook.com>	2025-12-12 14:52:29 +08:00
Clorist33	4f0dddc9ee	[Bugfix] bugfix for moe_mlp in vllm-ascend/v0.11.0-dev (#4885 ) ### What this PR does / why we need it? This PR fixes a bug in the moe_mlp module by correcting the arguments passed to the torch_npu.npu_dequant_swiglu_quant function. It properly converts group_list from a cumulative sum to counts for the group_index parameter. ### Does this PR introduce _any_ user-facing change? No - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/main --------- Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Signed-off-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: tanqingshan (A) <50050625@china.huawei.com> Co-authored-by: Mercykid-bash <ruanche0218@gmail.com>	2025-12-12 14:51:47 +08:00
Slightwind	9c0ad46c1a	[0.11.0][Bugfix] Remove the ZMQ communication setup on the D node (#4916 ) In the PD separation scenario, the D node does not need to perform get operations, and therefore does not need to create ZeroMQ (ZMQ) communication. --------- Signed-off-by: SlightwindSec <slightwindsec@gmail.com>	2025-12-12 14:37:49 +08:00
1092626063	ceadc2788d	Revert "[refactor]support gatingtopk operator generalization (#4356 )" (#4873 ) This reverts commit `c4a11a745a`. ops npu_gating_top_k caused Qwen3-30B precision problem, so revert it. Signed-off-by: 1092626063 <1092626063@qq.com>	2025-12-10 15:45:20 +08:00
linfeng-yuan	9a144bc7be	[Docs][0.11.0] delete AIV env variables in DSV32 documentation (#4833 ) ### What this PR does / why we need it? Delete wrong configuration in deepseek v3.2 documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? NA. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-09 15:53:53 +08:00
Mercykid-bash	8f45f9ce29	BugFix: Resolve shape mismatch in eplb update and calculation issues in quant_apply_mlp (#4777 ) ## Description This PR addresses two key issues in the MoE module when redundant experts are enabled, and fixes a calculation precision bug in the forward inference of quantized MLP: ### 1. Shape Mismatch in EPLB Expert Map Update - Root Cause: When redundant experts are turned on, a shape inconsistency occurs during the expert map update in `Vllm_apaptor`: - The shape of `self.expert_map_per_layer[layer_id]` is `[num_physical_experts,]` (aligned with physical expert count). - The shape of `updated_expert_map` is `[num_logical_experts,]` (aligned with logical expert count). - Indices in `self.expert_map_per_layer[layer_id]` that exceed the logical expert count cannot be properly mapped, leading to tensor shape mismatch errors. - The same shape mismatch exists in the `log2phy` map update (between `self.log2phy_map_per_layer[layer_id]` and `updated_log2phy_map`). - Fix: - Fix the shape initialization of `expert_map_per_layer` and `log2phy_map_per_layer` to be consistently set to `[num_physical_experts,]` across the module lifecycle. - Align the shape of `updated_expert_map` and `updated_log2phy_map` with the pre-initialized physical-expert-sized tensors during update operations, ensuring shape consistency for index mapping. ### 2. Calculation Precision Issue in Quantized MoE MLP Forward Inference - Root Cause: In the forward pass of `moe_mlp`, the `torch_npu.npu_dequant_swiglu_quant` operator only accepts group lists in Count format as input. However, the group list provided by `quant_apply_mlp` was in Cumsum format, which caused operator input format mismatch and degraded calculation precision. - Fix: - Convert the cumsum-formatted group list from `quant_apply_mlp` to Count format before passing it to `torch_npu.npu_dequant_swiglu_quant`. 
- Ensure the input format of the dequantization operator meets its requirements, restoring the expected calculation precision for quantized MoE MLP layers. ## Impact - Resolves shape mismatch errors in EPLB expert/log2phy map updates when redundant experts are enabled, ensuring stable expert routing. - Fixes quantized MoE MLP forward precision issues on NPU, aligning operator input formats with NPU kernel requirements. - No breaking changes to existing interfaces; the fixes are backward-compatible for scenarios without redundant experts enabled. --------- Signed-off-by: Che Ruan <cr623@ic.ac.uk> Signed-off-by: Mercykid-bash <ruanche0218@gmail.com> Co-authored-by: Che Ruan <cr623@ic.ac.uk> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-09 15:46:58 +08:00
linfeng-yuan	695e5c9ebc	[0.11.0][ops] npu_top_k_top_p supports k and p only (#4153 ) ### What this PR does / why we need it? With CANN 8.3 and corresponding PTA 2.7.1, `npu_top_k_top_p` supports passing only k (1<=k<=1024) and p separately. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? E2E performance test with only `top_k` and `p` separately. This PR improves TPOT by 0.2ms with `batch_size=16`. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-09 15:45:40 +08:00
Li Wang	4588d1f215	[CI] Use arm node for unit tests (#4819 ) ### What this PR does / why we need it? Use arm node for unit tests Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-09 15:45:14 +08:00
linfeng-yuan	e0757dc376	[0.11.0]fix the configuration conflicts in documentation (#4824 ) ### What this PR does / why we need it? Fix configuration errors in our documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? NA. Signed-off-by: linfeng-yuan <1102311262@qq.com>	2025-12-09 15:37:06 +08:00
zhangxinyuehfad	033e3557cc	[cherry-pick]fix qwen3vl mrope op (#4484 ) (#4811 ) ### What this PR does / why we need it? The Qwen2.5-VL mrope precision problem will be solved once this PR is merged. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on G8600 with the textVQA dataset - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: shaopeng-666 <lishaopeng21@huawei.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-12-09 11:07:32 +08:00
Levi	9862a23985	[0.11.0-dev] optimization of kimi-k2 in cann8.3 (#4555 ) ### What this PR does / why we need it? In cann8.3, the npu_moe_gating_top_k operator supports an expert count of 384, so kimi can use the operator to get better performance. --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-09 08:49:15 +08:00
zhangxinyuehfad	0d094531b4	[bugfix] Fixed the bug in retrieving the quantization method for mlp.… (#4797 ) When retrieving the quantization method for MOE (e.g., the quantization file of DeepSeek v3.2 exp does not match the model's naming convention in eager mode), a KeyError is raised: "model.layers.3.mlp.experts.weight not in self.quant_description". However, the quantization file looks like this: ```bash "model.layers.3.mlp.experts.255.gate_proj.weight": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.gate_proj.weight_scale": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.gate_proj.weight_offset": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.down_proj.weight": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.down_proj.weight_scale": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.down_proj.weight_offset": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.up_proj.weight": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.up_proj.weight_scale": "W8A8_DYNAMIC", "model.layers.3.mlp.experts.255.up_proj.weight_offset": "W8A8_DYNAMIC", ``` Co-Authored-By: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>	2025-12-09 08:47:19 +08:00
Levi	4e728f1f40	[Bugfix] fix qwen3-vl-moe shape ERROR during the _prepare_inputs phase under high concurrency. (#4658 ) ### What this PR does / why we need it? Earlier we fixed a similar issue for qwen2.5-vl (https://github.com/vllm-project/vllm-ascend/issues/4430); the multimodal models in vllm v0.11.0 should all have this problem. This PR is a fix specifically for qwen3-vl-moe. --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-08 19:30:16 +08:00
Wang Yixuan	d412565ec9	[Cherry-pick]bmm_transpose to v011dev (#3995 ) ### What this PR does / why we need it? Add a custom op to accelerate the deepseek model. The fused op combines bmm and transpose and is applied to the mla module. Cherry-picked from commit c68ddc11ce53334fc9a17bad58342148cbf14e86 ### Does this PR introduce _any_ user-facing change? No --------- Signed-off-by: hust17yixuan <303660421@qq.com>	2025-12-08 19:22:14 +08:00
Angazenn	6391f0625f	[v0.11.0-dev][bugfix] Add branch for stream up-lifting in `update_attn_params` (#4437 ) ### What this PR does / why we need it? #3985 moved stream context initialization before the for-loops to improve performance. However, we found that this might cause an accuracy drop when used with pd disaggregation. Thus we partly revert this change when using pd disaggregation, and we shall fix this bug in the future. ### Does this PR introduce _any_ user-facing change? No. --------- Signed-off-by: Angazenn <supperccell@163.com>	2025-12-08 08:54:46 +08:00
Li Wang	2598124e67	[Image] Correcting the vllm tag of the openeuler image on the A2 device. (#4745 ) ### What this PR does / why we need it? Corrected the vllm tag, which should have been in v0.11.0 Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-06 10:55:22 +08:00
offline893	350999c4ef	[Bugfix]Fix eplb enable when using mtp float weights. (#4576 ) ### What this PR does / why we need it? Fix eplb enable when using mtp float weights. It will be removed once eplb supports mtp and float weights. ### How was this patch tested? Deepseek-V3 + MTP + EPLB on A3. --------- Signed-off-by: offline0806 <3337230449@qq.com> Signed-off-by: offline893 <158537145+offline893@users.noreply.github.com> Co-authored-by: offline0806 <3337230449@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-05 21:15:32 +08:00
1092626063	c4a11a745a	[refactor]support gatingtopk operator generalization (#4356 ) ### What this PR does / why we need it? This PR is cherry-picked from: https://github.com/vllm-project/vllm-ascend/pull/2958 and https://github.com/vllm-project/vllm-ascend/pull/4340 Past: npu_moe_gating_top_k could only support the 'group_count=256' pattern Now: 1. npu_moe_gating_top_k supports all sizes of group_count 2. the functionality of `torch_npu.npu_moe_gating_top_k_softmax` is included in `torch_npu.npu_moe_gating_top_k` CANN: depends on 8.3.RC1 Performance: 1. GLM4.5-w8a8, TPS improves by 6% 2. Qwen3, the same as before --------- Signed-off-by: 1092626063 <1092626063@qq.com>	2025-12-04 20:10:13 +08:00
LI SHENGYONG	593a96056c	【EPLB】Eplb Redundant Experts Bugfix (#4232 ) ### What this PR does / why we need it? Redundant experts bugfix The calculation logic for redundant experts has been fixed, allowing the correct number of redundant experts to be calculated using the map. Therefore, there is no longer a need to set the redundant expert parameter when passing the map. ### Does this PR introduce _any_ user-facing change? After configuring the path for experts_map, users do not need to configure iinit_redundancy_expert. ### How was this patch tested? The accuracy of EPLB was tested with and without the use of redundant experts. --------- Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-12-03 12:00:05 +08:00
Mengqing Cao	b6d63bbd52	[v0.11.0-dev][CI] Fix ngram lacking of input arg `dummy_compute_logits` error (#4648 ) ### What this PR does / why we need it? Fix ngram lacking of input arg `dummy_compute_logits` error ### How was this patch tested? CI passed with existing test. --------- Signed-off-by: MengqingCao <cmq0113@163.com>	2025-12-03 09:22:07 +08:00
Levi	865f1f7fc8	[Bugfix] Resolve the interface compatibility issue of get_input_embeddings in MM (#4638 ) ### What this PR does / why we need it? Resolve the interface compatibility issue of get_input_embeddings in MM, because the get_input_embeddings function of other models does not have the is_multimodal parameter. --------- Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-02 22:21:47 +08:00
Levi	3b4cb23616	[Bugfix] fix qwen2.5-vl-72b shape ERROR during the _prepare_inputs phase under high concurrency. (#4553 ) ### What this PR does / why we need it? qwen2.5-vl-72b reports a shape ERROR during the _prepare_inputs phase under high concurrency (issue https://github.com/vllm-project/vllm-ascend/issues/4430). This PR fixes it. The related PR in the main branch: https://github.com/vllm-project/vllm-ascend/pull/3612 The related commit in vllm: `17c540a993/vllm/model_executor/models/interfaces.py` (the _get_text_embeddings function has been refactored into interfaces.py in vllm). Signed-off-by: Levi-JQ <yujinqi2@huawei.com> Co-authored-by: Levi-JQ <yujinqi2@huawei.com>	2025-12-02 14:20:45 +08:00
Zetong Li	52abd47f8c	[Bugfix][SHM] Use writer lock by default and remove redundant env (#4117 ) ### What this PR does / why we need it? This PR aims to remove env introduced by #3988 and use lock by default. As described in https://github.com/vllm-project/vllm/issues/27858, we have tested the writer lock method in various scenarios and the performance is almost unaffected. Therefore, we believe that it would be safe to enable the lock by default and remove the redundant env `SHM_BARRIER` now. After discussion, we decide to preserve env and set it as true by default. ### Does this PR introduce _any_ user-facing change? `SHM_BARRIER` is set as true by default. ### How was this patch tested? by ci --------- Signed-off-by: Zetong Li <slippersss@126.com>	2025-12-01 22:27:01 +08:00
Li Wang	76d0ba4342	[Image][Build] Cherry pick #4062 from main (#4506 ) ### What this PR does / why we need it? This patch aims to integrate the mooncake [v0.3.7.2.post2](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.7.post2) to vllm-ascend images Signed-off-by: wangli <wangli858794774@gmail.com>	2025-12-01 11:39:40 +08:00
zouyida2052	2b4f7a5016	[cherry-pick pr-4254] bugfix for mtp>1 when lm_head_tp>1 (#4360 ) ### What this PR does / why we need it? Previously, the dummy run executed compute_logits only once, regardless of num_speculative_tokens. This caused execute_model to hang on compute_logits when lm head tensor parallelism exceeded 1. The fix ensures compute_logits executes correctly during dummy run, matching num_speculative_tokens. Signed-off-by: zouyida2052 <zouyida2002@gmail.com>	2025-12-01 11:11:15 +08:00
LI SHENGYONG	cd9f5c0611	[bugfix] dep ineffective (#4416 ) ### What this PR does / why we need it? The expert mapping table and weights of the dynamic EPLB were not updated, so accuracy was correct but the dynamic EPLB had no effect. This bug has now been fixed. Signed-off-by: shenchuxiaofugui <1311027364@qq.com>	2025-11-29 15:19:11 +08:00
henryxuxu0716	71acc8ddeb	For nz unset in bf16&fp16 (#4495 ) ### What this PR does / why we need it? Disable NZ for the float weight case. This is only a quick fix for the dev branch. For the main branch, we'll consider more cases to make it more general. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? qwen2.5 32B --------- Signed-off-by: 刘哲续 <liuzhexu1@huawei.com> Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>	2025-11-28 17:32:25 +08:00
Zhu Yi Lin	96c362361e	[0.11.0][TEST] Delete Comment (#4428 ) ### What this PR does / why we need it? Delete Chinese comments. Picked from https://github.com/vllm-project/vllm-ascend/pull/4427 ### Does this PR introduce _any_ user-facing change? no Signed-off-by: GDzhu01 <809721801@qq.com>	2025-11-25 21:39:36 +08:00
zhangxinyuehfad	a686f2962a	[0.11.0][Bugfix] fix e2e full test (#4424 ) ### What this PR does / why we need it? Pin the Transformers version to 4.57.1 to fix "'dict' object has no attribute 'model_type'": https://github.com/vllm-project/vllm-ascend/actions/runs/19660859460/job/56306822464 Picked from https://github.com/vllm-project/vllm-ascend/pull/4423 Signed-off-by: hfadzxy <starmoon_zhang@163.com>	2025-11-25 21:21:42 +08:00
Shanshan Shen	cdaf7f4a51	[MM][Bugfix] Minor fix for VL model verification (#4385 ) ### What this PR does / why we need it? To fix the ops test, where `model_config` has been set to `None` and doesn't have the `hf_config` attribute, we have added a check for `model_config` to guarantee it is not `None`. Cherry-picked from main: https://github.com/vllm-project/vllm-ascend/pull/4384. Signed-off-by: shen-shanshan <467638484@qq.com>	2025-11-25 20:36:32 +08:00
wujinyuan1	386a85eccc	[Bugfix]Fix the hang issue of multimodal model when running with DP>1 (#4393 ) ### What this PR does / why we need it? When cudagraph_mode is set to FULL_DECODE_ONLY, if dp > 1, the dummy-run process will be triggered. When calling the update_attn_params function, the num_tokens parameter needs to be passed, and this value is obtained through positions.shape[0]. However, the multimodal model uses mRope (multi-dimensional rotary positional embeddings), which causes positions to be 2-dimensional. As a result, the value obtained from positions.shape[0] is incorrect. We solve this problem by replacing positions.shape[0] with num_tokens. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? vLLM version: v0.11.0rc3 vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: wujinyuan1 <wjy9595@qq.com> Co-authored-by: wujinyuan1 <wjy9595@qq.com>	2025-11-25 09:32:22 +08:00
weichen	a3164ac372	[v0.11.0][Bugfix][MoE] enable force_load_balance in aclgraph (#4367 ) ### What this PR does / why we need it? Enable force_load_balance in aclgraph, solving OOM issues. pick from https://github.com/vllm-project/vllm-ascend/pull/4366 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? e2e & ut Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>	2025-11-25 09:16:57 +08:00
mazhixin000	75452abe1e	[Doc][v11.0-dev][cherry-pick]Add single node PD disaggregation instructions (#4370 ) ### What this PR does / why we need it? add single node PD disaggregation instructions for Qwen 2.5VL model. ### Does this PR introduce _any_ user-facing change? no --------- Signed-off-by: mazhixin <mazhixin7@huawei.com> Signed-off-by: mazhixin000 <mazhixinkorea@163.com> Co-authored-by: mazhixin <mazhixin7@huawei.com>	2025-11-24 17:23:11 +08:00
wangxiyuan	a2e4c3fe78	Revert "[cherry-pick][refactor]support gatingtopk operator generalization (#4050 )" (#4352 ) This reverts commit `c87a77e8b4`. it breaks ops e2e test Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-21 23:03:20 +08:00
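
Several commits above (#5023, #4885, #4777) concern converting the MoE `group_list` from Cumsum format to Count format before it reaches `torch_npu.npu_dequant_swiglu_quant`. The following is a minimal sketch of that conversion in plain Python, not the code from those PRs; the function names and the sample data are hypothetical, and the "buggy" variant is only an assumed reconstruction of the mistake described in #5023 (taking the first entry from the differences instead of from `group_list[0]`).

```python
def cumsum_to_count(group_list):
    """Convert cumulative group sizes into per-group counts (hypothetical helper)."""
    counts = []
    previous = 0
    for total in group_list:
        # Each count is the increase over the previous cumulative total;
        # the first count is therefore group_list[0] itself.
        counts.append(total - previous)
        previous = total
    return counts


def buggy_cumsum_to_count(group_list):
    """Assumed reconstruction of the bug: the first count is taken from the diffs."""
    diffs = [b - a for a, b in zip(group_list, group_list[1:])]
    return [diffs[0]] + diffs


cumsum = [3, 5, 5, 9]  # illustrative data: 4 experts, 9 tokens in total
print(cumsum_to_count(cumsum))
print(buggy_cumsum_to_count(cumsum))
```

In this sketch the correct counts sum to the last cumulative value, while the assumed buggy variant miscounts the first group, which is the kind of mismatch that would feed wrong group sizes to a Count-format operator.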


1275 Commits