### What this PR does / why we need it?
- Add explicit `.contiguous()` after `permute`/`view` to ensure a memory-friendly (contiguous) layout
- Replace nested PCP/DCP Python loops with fully vectorized tensor operations
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
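
For reviewers, a minimal sketch of the two patterns listed above. This is not the code changed in this PR: the function names and the round-robin token split across PCP x DCP ranks are hypothetical, chosen only to illustrate an explicit `.contiguous()` after `permute` and a loop-to-vectorized rewrite.

```python
import torch


def to_contiguous_last_dim_first(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: move the last dim to the front, then copy."""
    # permute() only rewrites strides; .contiguous() copies the data into
    # a dense row-major layout for whatever consumes the tensor next.
    return x.permute(2, 0, 1).contiguous()


def local_lens_loop(seq_lens: torch.Tensor, pcp_size: int, dcp_size: int) -> torch.Tensor:
    """Reference version: nested Python loops over (PCP rank, DCP rank)."""
    world = pcp_size * dcp_size
    out = torch.zeros(seq_lens.numel(), pcp_size, dcp_size, dtype=torch.int64)
    for p in range(pcp_size):
        for d in range(dcp_size):
            rank = p * dcp_size + d
            # Assumed round-robin split: the first (len % world) ranks
            # each receive one extra token.
            out[:, p, d] = seq_lens // world + (seq_lens % world > rank)
    return out


def local_lens_vectorized(seq_lens: torch.Tensor, pcp_size: int, dcp_size: int) -> torch.Tensor:
    """Same result as local_lens_loop, computed with broadcasting."""
    world = pcp_size * dcp_size
    # Rank ids laid out as a (1, pcp_size, dcp_size) grid.
    ranks = torch.arange(world, device=seq_lens.device).view(1, pcp_size, dcp_size)
    base = (seq_lens // world).view(-1, 1, 1)
    rem = (seq_lens % world).view(-1, 1, 1)
    # Broadcast (N, 1, 1) against (1, pcp_size, dcp_size).
    return base + (rem > ranks).to(torch.int64)
```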
---------
Signed-off-by: F.Liu <liufeng248@huawei.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>