xc-llm-ascend

Author SHA1 Message Date


chenjunyi

c12eb22cbe

[feat] mlapo add bf16 no_quant support (#4852)

### What this PR does / why we need it?
This PR adds mlapo operation support for bf16 no_quant mode.

### Does this PR introduce _any_ user-facing change?
This PR makes quantization-related parameters optional.

### How was this patch tested?
CI passed with newly added and existing tests.

- vLLM version: v0.12.0
- vLLM main: `ad32e3e19c`

---------

Signed-off-by: chenjunyi <isjunyi.chen@gmail.com>

2025-12-11 11:06:56 +08:00

h1074112368

74033999ed

mlapo add qdown output (#4707)

### What this PR does / why we need it?
This PR adds qdown output support to the mlapo operation.

### Does this PR introduce _any_ user-facing change?
The mlapo operation adds an `enable_inner_out` input.

### How was this patch tested?
CI passed with newly added and existing tests.

- vLLM version: v0.12.0
- vLLM main: `ad32e3e19c`

---------

Signed-off-by: h1074112368 <h1074112368@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

2025-12-06 11:18:53 +08:00

Chen Chen

bcc313e8f2

add mla_preprocess kernel (#3226)

### What this PR does / why we need it?

- Adds the `mla_preprocess` custom kernel to provide an optimized
pre-processing operator for Multi-head Latent Attention (MLA) on Ascend
NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can
invoke it directly, cutting Python-side tensor shuffling and memory
copies that previously bottlenecked MLA compilation paths.

### Does this PR introduce any user-facing change?

- No. The change only introduces a low-level kernel; public APIs and
inference behavior remain unchanged.

### How was this patch tested?

- Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.

- vLLM version: v0.11.0

Signed-off-by: Chen Chen <0109chenchen@gmail.com>

2025-10-12 07:39:45 +08:00

3 Commits