### What this PR does / why we need it?
**Optimization:** Replaces the fixed 128x128x128 block size in
`linear_persistent_kernel` with adaptive selection logic that considers:
- Matrix dimensions (M, N, K)
- Device NPU vector core count
- Data type (float32 vs others)
**Why:** Fixed block sizes lead to suboptimal hardware utilization
across different matrix shapes. Adaptive sizing maximizes occupancy and
memory efficiency for varied workload patterns, improving throughput for
batch-invariant linear operations in LLM inference.
**Details** (sketched in code below):
- Small matrices (M < 256): Size-proportional allocation
- Medium matrices (256 ≤ M < 1024): Balanced distribution based on grid
capacity
- Large matrices (M ≥ 1024): Optimized for dominant dimension
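
A minimal sketch of the tiered heuristic, under stated assumptions: the helper name `select_block_sizes`, the parameter `num_vector_cores`, and all concrete thresholds and tile candidates are illustrative, not the exact shipped values.

```python
def _prev_pow2(x: int) -> int:
    """Largest power of two <= x (x >= 1)."""
    return 1 << (x.bit_length() - 1)


def select_block_sizes(M: int, N: int, K: int,
                       num_vector_cores: int,
                       is_fp32: bool) -> tuple[int, int, int]:
    """Pick (BLOCK_M, BLOCK_N, BLOCK_K) in place of a fixed 128x128x128.

    Illustrative sketch only; the real kernel's thresholds may differ.
    """
    # fp32 tiles cost twice the bytes of fp16/bf16, so halve BLOCK_K to
    # keep the working set on-chip (assumed heuristic).
    block_k = 64 if is_fp32 else 128

    if M < 256:
        # Small M: size-proportional allocation, so short problems still
        # spread across cores instead of idling most of them.
        block_m = max(16, _prev_pow2(M))
        block_n = 256 if N >= 2 * M else 128
    elif M < 1024:
        # Medium M: balance the output-tile grid against grid capacity
        # (the number of available vector cores).
        tiles = ((M + 127) // 128) * ((N + 127) // 128)
        block_m = 128 if tiles >= num_vector_cores else 64
        block_n = 128
    else:
        # Large M: widen the tile along the dominant dimension to improve
        # data reuse.
        block_m, block_n = (256, 128) if M >= N else (128, 256)

    return block_m, block_n, block_k
```

For example, `select_block_sizes(512, 4096, 4096, num_vector_cores=48, is_fp32=False)` falls in the medium tier with 128 output tiles, plenty to fill 48 cores, so it keeps the 128x128x128 tiling; a small-M call like `select_block_sizes(64, 4096, 4096, 48, False)` instead shrinks BLOCK_M to 64.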
### Does this PR introduce _any_ user-facing change?
No. This is a performance optimization. The API and numerical results
remain unchanged; only kernel execution efficiency improves.
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0
Signed-off-by: DDCHY <843049740@qq.com>
Signed-off-by: zjchenn <zjchenn@gmail.com>
Co-authored-by: DDCHY <843049740@qq.com>