xc-llm-ascend

Files

XiaoxinWang 320877d488 move contiguous in fused_sigmoid_gating_delta_rule_update to model_runner_v1 (#5274)

### What this PR does / why we need it?
The contiguous() operation temporarily increases memory usage, which
raises peak GPU memory and requires lowering
gpu_memory_utilization. However, making tensors contiguous in
model_runner_v1 significantly improves operator performance, giving a
net end-to-end benefit for the model despite the memory overhead.
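
A minimal sketch of why `contiguous()` costs memory, assuming plain PyTorch on CPU. The helper `extra_bytes_for_contiguous` and the tensor shapes are hypothetical and are not taken from this PR. Calling `contiguous()` on a non-contiguous view allocates a fresh buffer of the same size, which is where the temporary peak comes from.

```python
import torch


def extra_bytes_for_contiguous(t: torch.Tensor) -> int:
    """Hypothetical helper: bytes of new storage t.contiguous() would allocate.

    An already-contiguous tensor is returned without a copy, so the cost is 0.
    """
    return 0 if t.is_contiguous() else t.numel() * t.element_size()


base = torch.zeros(4, 8, dtype=torch.float32)  # contiguous, 4 * 8 * 4 bytes
view = base.t()                                # transposed view, shares base's storage
copy = view.contiguous()                       # materializes a new, densely packed buffer

print(view.is_contiguous(), copy.is_contiguous())
print(extra_bytes_for_contiguous(view), extra_bytes_for_contiguous(base))
```

Until the original buffer is released, both copies are live, so peak memory is higher than the steady state. Doing the copy once at a single call site rather than inside a frequently called operator is one way to pay that cost only once.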

- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c

Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>

2025-12-26 09:19:47 +08:00

[Refactor] move the metadata from attention_v1 to util (ready for extract common_cp) & realize Ascendmetadata inherit from the parent class. (#5203)

2025-12-23 00:10:52 +08:00

__init__.py

[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)

2025-07-19 09:42:32 +08:00

block_table.py

[feature] support pcp + mtp in full graph (#4572)

2025-12-22 16:13:39 +08:00

model_runner_v1.py

move contiguous in fused_sigmoid_gating_delta_rule_update to model_runner_v1 (#5274)

2025-12-26 09:19:47 +08:00

npu_input_batch.py

Drop 0.12.0 support (#5146)

2025-12-20 09:38:53 +08:00

worker.py

[refactor] refactor weight trans nz and transpose (#4878)

2025-12-19 14:27:24 +08:00