xc-llm-ascend

Files

lhp-deep 0e3186f07c [model_runner_v2]:optimize the performance of the _compute_slot_mappings_kernel (#7575 )

### What this PR does / why we need it?

This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to
improve performance. The key changes include:
- A new Triton kernel implementation (`_compute_slot_mappings_kernel`)
with NPU-specific optimizations, such as using `tl.gather` to handle
non-contiguous memory access and replacing modulo operations.
- A new method `compute_slot_mappings` in `AscendBlockTables` to use
this new kernel.
- An end-to-end test to verify the correctness of the new kernel against
the reference GPU implementation.

The optimization is needed to avoid performance degradation from scalar
computation on Ascend devices.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>

2026-03-24 17:29:14 +08:00

[model_runner_v2]:optimize the performance of the _compute_slot_mappings_kernel (#7575 )

2026-03-24 17:29:14 +08:00

__init__.py

[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878 )

2025-07-19 09:42:32 +08:00

block_table.py

[Hybrid] support prefix cache for Qwen3.5/Next with --mamba-cache-mode align (#7103 )

2026-03-15 09:44:09 +08:00

model_runner_v1.py

[Feat][SP] Suport SP for VL MoE models (#7044 )

2026-03-24 17:16:00 +08:00