Yizhou ff3914e31a [Fix] Refines decode mode padding condition for uniform queries (#5164)
### What this PR does / why we need it?
We cannot use `self.cudagraph_batch_sizes[-1]` here because it is not actually
the maximum number of tokens to pad to in `FULL_DECODE_ONLY` mode; it can be
much larger. The list is only trimmed down to `compilation_cases` right before
capture, which has caused us a lot of trouble.

Updates the logic to ensure padding occurs only when the number of input
tokens falls within a valid uniform decode query range, improving
consistency and avoiding unnecessary padding in specific decode modes.
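The updated condition can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: the function and parameter names (`should_pad_uniform_decode`, `uniform_decode_query_len`, `max_capture_size`) are hypothetical stand-ins for the real attributes in the model runner.

```python
def should_pad_uniform_decode(num_input_tokens: int,
                              num_reqs: int,
                              uniform_decode_query_len: int,
                              max_capture_size: int) -> bool:
    """Decide whether to pad the batch for a full-decode-only graph.

    In uniform decode, every request contributes exactly
    `uniform_decode_query_len` tokens, so a valid batch has
    `num_reqs * uniform_decode_query_len` input tokens. Padding is
    only useful when the token count also fits within the largest
    captured graph size; beyond that there is no graph to hit.
    """
    expected_tokens = num_reqs * uniform_decode_query_len
    return (num_input_tokens == expected_tokens
            and num_input_tokens <= max_capture_size)
```

With a check like this, batches outside the uniform decode query range (e.g. mixed prefill/decode, or token counts above the capture limit) fall through to the unpadded path instead of being padded needlessly.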

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-12-18 21:09:23 +08:00