xc-llm-ascend/vllm_ascend at 51c8f60eb0b9670ba340b5b7488f25871aef2af8

EngineX/xc-llm-ascend

Jade Zheng 51c8f60eb0 [Bugfix] Resolve MTP > 1 issue when lm head tp > 1 (#4254)

### What this PR does / why we need it?

Previously, the dummy run executed `compute_logits` only once,
regardless of `num_speculative_tokens`. When lm head tensor parallelism
exceeded 1, this caused `execute_model` to hang on `compute_logits`.
The fix makes the number of `compute_logits` calls in the dummy run
match `num_speculative_tokens`.
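
The hang described above can be illustrated with a small stdlib-only
simulation. This is a sketch under assumptions, not the code from this PR:
it assumes `compute_logits` involves a collective step across ranks when lm
head tp > 1, and it models each such step as a wait on a shared barrier.
The helper name `all_ranks_finish` is hypothetical.

```python
import threading


def all_ranks_finish(calls_per_rank, timeout=0.5):
    """Simulate ranks that each perform some number of collective steps.

    Each collective step is modelled as a wait on a shared barrier, so a
    step completes only when every rank reaches it. Returns True if all
    ranks finish, False if any rank is left waiting (a simulated hang).
    """
    barrier = threading.Barrier(len(calls_per_rank))
    finished = [True] * len(calls_per_rank)

    def worker(rank, num_calls):
        try:
            for _ in range(num_calls):
                # Stand-in for the assumed collective inside compute_logits.
                barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # The other rank never arrived: this is the simulated hang.
            finished[rank] = False

    threads = [
        threading.Thread(target=worker, args=(rank, num_calls))
        for rank, num_calls in enumerate(calls_per_rank)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return all(finished)


# Hypothetical setup: rank 0 runs a real step with 3 calls, while rank 1
# runs a dummy step.
print(all_ranks_finish([3, 1]))  # dummy run calls once: rank 0 is stuck
print(all_ranks_finish([3, 3]))  # dummy run matches: both ranks finish
```

In the simulation the mismatch is detected by a timeout; a real collective
without a timeout would simply block, which is consistent with the hang
reported here.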

I set the `non_blocking` argument to False when moving
`exceeds_max_model_len` to the CPU. From what I understand, using
`non_blocking=True` and then immediately reading the tensor on the CPU
can cause accuracy problems, because the copy may not have finished yet.
This issue does not occur when transferring data to a device. ref:
https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/18

- vLLM version: v0.11.0
- vLLM main: 2918c1b49c

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>

2025-12-01 10:22:36 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804)

2025-11-28 18:06:39 +08:00

[OPS] add bmm_transpose ops (#3990)

2025-12-01 09:09:51 +08:00

upgrade to vllm 0.11.2 (#4400)

2025-11-26 11:48:58 +08:00

Revert "drop ascend scheduler" (#4580)

2025-11-29 22:20:48 +08:00

device_allocator

[Misc] Clean up useless import from vllm (#2049)

2025-07-28 16:01:59 +08:00

[Bugfix] Fix kvpool precision synchronization (#4574)

2025-11-30 09:39:07 +08:00

[EPLB][Ops] Integerate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB (#4216)

2025-11-30 22:52:05 +08:00

Drop 0.11.0 support (#4377)

2025-11-24 17:08:20 +08:00

[refact] unified soc_version code (#4359)

2025-11-26 14:28:55 +08:00

Drop 0.11.0 support (#4377)

2025-11-24 17:08:20 +08:00

remove qwen3-next model file (#4573)

2025-11-29 18:37:26 +08:00

[EPLB][Ops] Integerate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB (#4216)

2025-11-30 22:52:05 +08:00

[BugFix] Fix Qwen2.5_Omni vision customized op attr err (#4568)

2025-12-01 09:18:55 +08:00

[EPLB][Ops] Integerate grouped_matmul_swiglu_quant_weight_nz_tensor_list operator into dynamic EPLB (#4216)

2025-11-30 22:52:05 +08:00

[refact] unified soc_version code (#4359)

2025-11-26 14:28:55 +08:00

[Bugfix] Resolve MTP > 1 issue when lm head tp > 1 (#4254)

2025-12-01 10:22:36 +08:00

[Bugfix] Resolve MTP > 1 issue when lm head tp > 1 (#4254)

2025-12-01 10:22:36 +08:00

[Bugfix] Resolve MTP > 1 issue when lm head tp > 1 (#4254)

2025-12-01 10:22:36 +08:00

__init__.py

[Misc][Doc] Add service profiling feature with user guide (#3756)

2025-11-12 09:07:14 +08:00

ascend_config.py

Revert "drop ascend scheduler" (#4580)

2025-11-29 22:20:48 +08:00

ascend_forward_context.py

[Refactor] remove moe type of multicast. (#4224)

2025-11-24 17:32:37 +08:00

cpu_binding.py

[main] support cpu binding (#3546)

2025-10-21 09:17:03 +08:00

envs.py

[refact] unified soc_version code (#4359)

2025-11-26 14:28:55 +08:00

meta_registration.py

Fix the bugs about operator registration by PyTorch Dispatcher (#2786)

2025-09-13 11:58:52 +08:00

platform.py

Revert "drop ascend scheduler" (#4580)

2025-11-29 22:20:48 +08:00

profiling_config.py

Revert "drop ascend scheduler" (#4580)

2025-11-29 22:20:48 +08:00

utils.py

Move mla to ops module (#4575)

2025-11-29 18:36:55 +08:00