xc-llm-ascend

Files

Shanshan Shen 14d9a64047 [ModelRunner][V1] Optimize V1 attention mask (#442)

### What this PR does / why we need it?
Pre-construct a mask matrix to improve the efficiency of attention mask
construction during inference.

Note that the length of the matrix needs to be carefully balanced: a
matrix that is too large will consume excessive VRAM, while a matrix
that is too small will require dynamic concatenation during inference,
leading to performance degradation.

Therefore, this PR adds an environment variable that sets the size of
the pre-constructed mask matrix, so it can be tuned to the workload.
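
The trade-off described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the environment variable name `EXAMPLE_ATTN_MASK_LEN`, the default of 128, and the `MaskCache` class are all hypothetical, plain Python lists stand in for device tensors, and a full rebuild stands in for the dynamic concatenation mentioned in the description.

```python
import os

# Hypothetical variable name and default, for illustration only;
# the PR description does not name the real environment variable.
ENV_VAR = "EXAMPLE_ATTN_MASK_LEN"
DEFAULT_LEN = 128


def build_causal_mask(n):
    """Return an n x n matrix where True marks a masked (future) position."""
    return [[col > row for col in range(n)] for row in range(n)]


class MaskCache:
    """Builds one mask up front and serves smaller masks by slicing it."""

    def __init__(self):
        # Size is read once, at construction time, from the environment.
        self.size = int(os.environ.get(ENV_VAR, DEFAULT_LEN))
        self.mask = build_causal_mask(self.size)
        self.rebuilds = 0  # counts how often the slow path was taken

    def get(self, seq_len):
        if seq_len > self.size:
            # Slow path: the pre-built matrix is too small, so a larger one
            # must be constructed during inference.
            self.mask = build_causal_mask(seq_len)
            self.size = seq_len
            self.rebuilds += 1
        # Fast path: take the top-left seq_len x seq_len corner.
        return [row[:seq_len] for row in self.mask[:seq_len]]
```

A larger configured size keeps more requests on the fast path at the cost of memory that grows quadratically with the size; a smaller one saves memory but takes the slow path more often.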

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: didongli182 <didongli@huawei.com>

2025-04-02 10:33:53 +08:00

attention

[Core] Make V1 work and enable V1 engine test (#389)

2025-03-28 19:34:23 +08:00

models

FastPatch: Optimized Patch Embedding for Qwen2VL (#345)

2025-03-26 14:28:20 +08:00

ops

[Feature] Implement EP-compatible fused_moe (#121)

2025-03-11 21:08:02 +08:00

quantization

[BugFix] Fix bugs when using ascend quantization (#275)