xc-llm-ascend

Files

Qi Mao 7372225bcb [FIX] Update _causal_conv1d_update_kernel for Efficient Conv State Handling on NPU (#5322)

Description:

This PR updates the Triton implementation of _causal_conv1d_update_kernel
for deployment on NPU devices. The changes adapt the grid size and memory
handling to NPU hardware limits.

Design Plan:

Grid Calculation: The grid size is now computed dynamically from the
batch and dim sizes so that the number of launched programs does not
exceed the number of vector cores on the NPU. This keeps the cores busy
without oversubscribing the hardware.
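
The grid calculation described above can be sketched in plain Python. This is an illustration only, not the code from the PR: the function name plan_grid, the helper cdiv, and the core count of 40 are assumptions made for the example, and the PR's actual rule for choosing B_TILE and BLOCK_N may differ.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def plan_grid(batch: int, dim: int, block_n: int, num_cores: int):
    """Pick a batch tile (b_tile) and channel block (block_n) so that the
    2-D launch grid never has more programs than num_cores.

    All names here are hypothetical; num_cores stands for the number of
    vector cores and must be supplied by the caller.
    """
    # Cap the number of channel blocks at the core count, widening the
    # block if the requested block_n would create too many of them.
    n_dim_blocks = min(cdiv(dim, block_n), num_cores)
    block_n = cdiv(dim, n_dim_blocks)
    n_dim_blocks = cdiv(dim, block_n)

    # Spend the remaining cores on the batch axis.
    max_batch_programs = max(1, num_cores // n_dim_blocks)
    b_tile = cdiv(batch, max_batch_programs)
    grid = (cdiv(batch, b_tile), n_dim_blocks)
    return b_tile, block_n, grid


# Illustrative values only; 40 is not a hardware specification.
b_tile, block_n, grid = plan_grid(batch=64, dim=4096, block_n=512, num_cores=40)
print(b_tile, block_n, grid)
```

Each program then handles up to b_tile batch entries and block_n channels, so the grid still covers the whole (batch, dim) space while staying within the core budget.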

Data Block Handling: Because on-chip memory (the Unified Buffer, UB) on
Ascend NPUs is limited, this implementation splits large data into
chunks of at most 32k per block. The kernel iterates over these chunks
in a for-loop, which keeps memory usage within the UB and avoids
overflow.
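
The chunked loop can be illustrated with a small reference model in plain Python. It is a sketch under stated assumptions, not the Triton kernel: the names chunk_ranges and conv_update_chunked are hypothetical, the chunking is applied along the channel axis for simplicity, and "32k" is read as an element count because the description does not state the unit.

```python
MAX_CHUNK = 32 * 1024  # assumption: "32k" interpreted as a number of elements


def chunk_ranges(total: int, max_chunk: int = MAX_CHUNK):
    """Yield (start, stop) pairs that cover [0, total) in pieces of at most max_chunk."""
    for start in range(0, total, max_chunk):
        yield start, min(start + max_chunk, total)


def conv_update_chunked(state, x_new, weight, max_chunk: int = MAX_CHUNK):
    """One decode step of a depthwise causal conv, processed chunk by chunk.

    state:  per-channel list of the last (width - 1) inputs, updated in place
    x_new:  per-channel new input value
    weight: per-channel filter of length width (last tap multiplies x_new)
    """
    dim = len(x_new)
    out = [0.0] * dim
    for start, stop in chunk_ranges(dim, max_chunk):
        # Only channels in [start, stop) are touched in this iteration,
        # which bounds the working set of a single loop step.
        for c in range(start, stop):
            window = state[c] + [x_new[c]]
            out[c] = sum(w * v for w, v in zip(weight[c], window))
            state[c] = window[1:]  # slide the window forward by one step
    return out


state = [[1.0, 2.0], [0.0, 1.0], [4.0, 5.0]]
weight = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [1.0, 0.0, -1.0]]
print(conv_update_chunked(state, [3.0, 2.0, 6.0], weight, max_chunk=2))
```

Because every channel is independent, the result does not depend on the chunk size; only the amount of data handled per loop iteration changes.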

Changes Compared to GPU Implementation:

Grid and Block Sizing:

For GPU, the grid and block sizes were determined from the available
thread count and memory size. The NPU version instead adjusts these
parameters dynamically through B_TILE and BLOCK_N to fit the NPU's
architecture.

Memory Chunking:

The original GPU implementation did not require chunking because of the
larger available memory and processing capacity. For the NPU, data is
divided into chunks of 32k or smaller to comply with the device's memory
constraints, and the kernel now processes these chunks inside a loop.

Optimized Thread Usage:

The NPU implementation takes into account the hardware-specific thread
limit (24 threads per vector core), ensuring that the number of active
programs is aligned with the NPU's vector core count, avoiding
over-subscription that would lead to serial processing.
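
A simple way to see why over-subscription hurts is to count execution waves. The model below is an assumption made for illustration (programs beyond the core count wait for a free core, and each program takes equal time); the function name waves and the core count of 40 are hypothetical.

```python
def waves(num_programs: int, num_cores: int) -> int:
    """Number of sequential rounds needed if at most num_cores programs run at once."""
    return -(-num_programs // num_cores)  # ceiling division


# Illustrative core count only, not a hardware specification.
for programs in (40, 41, 80, 81):
    print(programs, "programs ->", waves(programs, 40), "wave(s)")
```

Under this model a single extra program beyond the core count adds a whole extra round, which is why the launch grid is kept at or below the number of vector cores.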

This PR ensures that the operator functions efficiently on Ascend NPU,
considering hardware limitations while maintaining the same
functionality and input parameters as the GPU implementation.


- vLLM version: release/v0.13.0
- vLLM main: 5fbfa8d9ef

Signed-off-by: maoxx241 <maomaoyu870@gmail.com>

2025-12-26 09:12:30 +08:00

__init__.py

[Ops][Triton] Add a triton kernel supporting partial rope. (#4413)

2025-12-02 17:10:19 +08:00

test_causal_conv1d.py

[FIX] Update _causal_conv1d_update_kernel for Efficient Conv State Handling on NPU (#5322)

2025-12-26 09:12:30 +08:00

test_l2norm.py

[Kernel] add l2norm triton kernel (#4595)

2025-12-25 06:06:18 +08:00

test_rejection_sampler.py

fix e2e rejection-sampler error (#5341)

2025-12-25 11:39:38 +08:00

test_rope.py

[Ops][Triton] Add a triton kernel supporting partial rope. (#4413)

2025-12-02 17:10:19 +08:00