### What this PR does / why we need it?
This pr is cherry-pick from :
https://github.com/vllm-project/vllm-ascend/pull/2958 and
https://github.com/vllm-project/vllm-ascend/pull/4340
Past:
npu_moe_gating_top_k can only support 'group_count=256' pattern
Now:
1、npu_moe_gating_top_k support all size of group_count
2、the functionality of `torch_npu.npu_moe_gating_top_k_softmax` are
included in `torch_npu.npu_moe_gating_top_k`
CANN: depends on 8.3.RC1
Performance:
1. GLM4.5-w8a8, TPS improve 6%
2. Qwen3, the same as before
---------
Signed-off-by: 1092626063 <1092626063@qq.com>