xc-llm-ascend

Files

Cao Yi 9e2965bae2 [Feature] Support Flash Comm V1 for VL models (with MLA) (#7390 )

## Summary

Flash Comm V1 (flashcomm1) was previously blocked for all VL models.

**Root cause:** For VL models, `inputs_embeds` at layer 0 originates
from the vision encoder as a full `[N, H]` tensor — it has **not** been
reduce-scattered across TP ranks. The original MLA forward path assumed
inputs were already scattered, producing wrong output shapes under TP >
1.

**Fix:**
- Detect at init time (statically, not via runtime shape checks) whether
a layer is the first layer of a VL model (`is_vl_first_layer`) so dynamo
treats the branch as a constant.
- In `AscendMultiHeadLatentAttention.forward`, when `flashcomm1 + TP > 1
+ is_vl_first_layer`, set `need_gather_q_kv=False` and pre-allocate
output as `[N//tp_size, H]`.
- Remove the platform-level assertion that prevented VL models from
enabling Flash Comm V1.

**Other improvements:**
- `is_vl_model()` now uses vllm's canonical detection (`hf_config is not
hf_text_config`) instead of fragile key-name checks, with the old checks
kept as fallback.
- Added `parse_layer_idx(prefix)` utility.
- Added `maybe_chunk_residual` call in `AscendRMSNorm` before the
add-rms-norm op.
- Removed unnecessary CPU/fp32 round-trip in
  `AscendLearnable2DInterpPosEmbDivided_fixed.forward()`.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: LoganJane <loganJane73@hotmail.com>

2026-03-22 21:05:28 +08:00

fused_moe

[refactor] replace scattered business kwargs with typed request objects and explicit stage boundaries (#7024 )

2026-03-20 23:23:57 +08:00

triton

[OPS]add split_qkv_tp_rmsnorm_rope ops (#7376 )

2026-03-19 17:19:18 +08:00

__init__.py

[OPS]add split_qkv_tp_rmsnorm_rope ops (#7376 )

2026-03-19 17:19:18 +08:00

activation.py

[Attention] add gpt-oss support (#5901 )