xc-llm-ascend

Author SHA1 Message Date

Author	SHA1	Message	Date
zhangyiming	1c954ff264	[main2main] upgrade vllm to 0308 (#7213 ) ### What this PR does / why we need it? Update main2main to vllm 0308. breaks: * https://github.com/vllm-project/vllm/pull/30681 * https://github.com/vllm-project/vllm/pull/35552 remove self.cudagraph_batch_sizes * https://github.com/vllm-project/vllm/pull/35158 clear_metadata -> defer_finalize * https://github.com/vllm-project/vllm/pull/36006 remove CacheConfig.cpu_offload_gb * https://github.com/vllm-project/vllm/pull/35472 * https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder * https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens * https://github.com/vllm-project/vllm/pull/28053 - vLLM version: v0.16.0 - vLLM main: `4034c3d32e` --------- Signed-off-by: MrZ20 <2609716663@qq.com> Signed-off-by: menogrey <1299267905@qq.com> Co-authored-by: MrZ20 <2609716663@qq.com>	2026-03-18 09:24:43 +08:00
liuchen2026fly	640ecd1b77	[BugFix] Fix muls_add fusion not working for GLM5 models (#6928 ) ### What this PR does / why we need it? fix: support model-specific routed_scaling_factor in muls_add fusion Previously, MulsAddFusionPass used a hardcoded scale=1.0, which failed to match the x * routed_scaling_factor + y pattern in models like GLM5 that use routed_scaling_factor=2.5. This caused the muls_add fusion to be skipped, leaving unoptimized mul+add operations. This fix reads routed_scaling_factor from model config (defaulting to 1.0 for backward compatibility) and uses it as the pattern scale, enabling correct fusion for GLM5 and other models with custom scaling factors. Fixes: Unoptimized mul+add in GLM5 attention blocks Tested: GLM5-W8A8 with routed_scaling_factor=2.5 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.16.0 - vLLM main: `15d76f74e2` Signed-off-by: liuchenbing <chenliumail@163.com> Co-authored-by: liuchenbing <chenliumail@163.com>	2026-03-05 22:35:54 +08:00
whx	16c879cdf7	[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 ) ### What this PR does / why we need it? Add muls_add triton kernel with related fusion pass. What's more, this PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? CI passed with new added test. - vLLM version: v0.13.0 - vLLM main: `45c1ca1ca1` --------- Signed-off-by: whx-sjtu <2952154980@qq.com>	2026-03-02 17:54:25 +08:00

zhangyiming

1c954ff264

[main2main] upgrade vllm to 0308 (#7213 )

### What this PR does / why we need it?
Update main2main to vllm 0308.
breaks:

* https://github.com/vllm-project/vllm/pull/30681
* https://github.com/vllm-project/vllm/pull/35552 remove
self.cudagraph_batch_sizes
* https://github.com/vllm-project/vllm/pull/35158 clear_metadata ->
defer_finalize
* https://github.com/vllm-project/vllm/pull/36006 remove
CacheConfig.cpu_offload_gb
* https://github.com/vllm-project/vllm/pull/35472
* https://github.com/vllm-project/vllm/pull/34552 attn_metadata_builder
* https://github.com/vllm-project/vllm/pull/30515 profile_seq_lens
* https://github.com/vllm-project/vllm/pull/28053 

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: MrZ20 <2609716663@qq.com>

2026-03-18 09:24:43 +08:00

liuchen2026fly

640ecd1b77

[BugFix] Fix muls_add fusion not working for GLM5 models (#6928 )

### What this PR does / why we need it?
fix: support model-specific routed_scaling_factor in muls_add fusion
Previously, MulsAddFusionPass used a hardcoded scale=1.0, which failed
to match the x * routed_scaling_factor + y pattern in models like GLM5
that use routed_scaling_factor=2.5. This caused the muls_add fusion to
be skipped, leaving unoptimized mul+add operations.

This fix reads routed_scaling_factor from model config (defaulting to
1.0
for backward compatibility) and uses it as the pattern scale, enabling
correct fusion for GLM5 and other models with custom scaling factors.

Fixes: Unoptimized mul+add in GLM5 attention blocks
Tested: GLM5-W8A8 with routed_scaling_factor=2.5
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

Signed-off-by: liuchenbing <chenliumail@163.com>
Co-authored-by: liuchenbing <chenliumail@163.com>

2026-03-05 22:35:54 +08:00

whx

16c879cdf7

[Triton][Config] Add muls_add triton kernel and refactor AscendCompilationConfig (#5518 )

### What this PR does / why we need it?
Add muls_add triton kernel with related fusion pass. What's more, this
PR refactors `AscendCompilationConfig` and delete `NpugraphExConfig`.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with new added test.


- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>

2026-03-02 17:54:25 +08:00

3 Commits