### What this PR does / why we need it?
- Following https://github.com/vllm-project/vllm-ascend/issues/807,
this PR adds a custom AscendC kernel for multi-step.
- It also fixes a bug in multi_step_runner.py that we found when using
multi-step on the V0 Engine.
### Does this PR introduce _any_ user-facing change?
No user-facing change.
### How was this patch tested?
We added a unit test file and an offline inference example to test the
custom AscendC kernel. See test/ops/test_multi_step.py and
examples/offline_multi_step.py.
---------
Signed-off-by: wan_danfeng <wonderful199082@126.com>
### What this PR does / why we need it?
Adopt the custom rotary embedding kernel in actual model inference. The
custom rotary_embedding produces contiguous query and key tensors on the
C++ side, which avoids the overhead of two `contiguous` calls and an
`index_select` compared with rotary_embedding in torch_npu. For now,
rotary_embedding only supports the `is_neox = true` case; support for
non-neox RoPE will be added in a follow-up.
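
For reviewers unfamiliar with the neox layout, here is a minimal pure-Python sketch of what `is_neox = true` means for a single head vector. It is an illustrative reference only, not the AscendC kernel or the torch_npu implementation; the function name `rope_neox` and the default `base=10000.0` are assumptions made for this example.

```python
import math


def rope_neox(x, position, base=10000.0):
    """Apply neox-style rotary embedding to one head vector (hypothetical helper).

    In the neox layout, element i is paired with element i + d/2 (first half
    with second half), rather than with its immediate neighbour as in the
    interleaved (non-neox) layout.
    """
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Rotation angle for pair i at this token position.
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = x[i], x[i + half]
        # Standard 2D rotation of the pair (x1, x2) by theta.
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    print(rope_neox(q, position=3))
```

Because each pair is only rotated, the vector norm is preserved, which gives a quick sanity check when comparing a custom kernel against a reference.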
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>