### What this PR does / why we need it?
Before optimizing,the rmsnorm time in one decoding is 531.5us. After
optimizing,the rmsnorm time in one decoding is 105us.
I closed the previous
PR(https://github.com/vllm-project/vllm-ascend/pull/2456) by mistake and
resubmitted it now
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main:
b1068903fd
---------
Signed-off-by: socrahow <suzihao4@h-partners.com>