### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model-adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready sign-off by requiring request-level checks (not startup logs alone)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal, scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model-support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs, using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
# FP8-on-NPU Lessons

## 1) Recommended debug order

- Start with `--load-format dummy` to quickly verify the architecture path.
- Run with real weights to validate weight mapping and load-time stability.
- If blocked by fp8 execution limits on NPU, use the fp8->bf16 dequantization loading path.
- Validate `/v1/models`, then one text request, then one VL request (if multimodal); see the sketch after this list.
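The request-level checks above can be scripted; a minimal sketch is shown below. The host/port, served model name, and image URL are illustrative assumptions, and an OpenAI-compatible vLLM server is assumed to be already running.

```python
# Sketch of a request-level readiness probe (illustrative values, not part of the skill package).
import requests

BASE = "http://localhost:8000"   # assumed serving endpoint
MODEL = "/models/example-model"  # hypothetical served model name

# 1) /v1/models must list the served model.
models = requests.get(f"{BASE}/v1/models", timeout=30).json()
assert any(m["id"] == MODEL for m in models["data"]), models

# 2) One plain text request must return a non-empty completion.
text = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": MODEL,
          "messages": [{"role": "user", "content": "Say hello in one word."}],
          "max_tokens": 16},
    timeout=120,
).json()
assert text["choices"][0]["message"]["content"].strip(), text

# 3) For multimodal models, one VL request with an image URL (hypothetical URL).
vl = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": MODEL,
          "messages": [{"role": "user", "content": [
              {"type": "text", "text": "Describe this image in one sentence."},
              {"type": "image_url",
               "image_url": {"url": "https://example.com/sample.jpg"}},
          ]}],
          "max_tokens": 64},
    timeout=300,
).json()
assert vl["choices"][0]["message"]["content"].strip(), vl
print("request-level gate passed")
```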
## 2) FP8 checkpoint on NPU

Common symptom: `fp8 quantization is currently not supported in npu.`

Recommended pattern:

- do not force fp8 execution kernels on NPU;
- dequantize fp8 weights to bf16 during loading using the paired tensors `*.weight` / `*.weight_scale_inv` (sketched below);
- keep strict unpaired scale/weight checks to avoid silent corruption.
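To make the pattern concrete, here is a hedged sketch of dequantize-on-load. It assumes blockwise fp8 checkpoints in which each `torch.float8_e4m3fn` weight is paired with a `*.weight_scale_inv` tensor of per-block scales; the helper names and the 128x128 block size are assumptions, not the actual vllm-ascend implementation, so check the checkpoint's `quantization_config` before reusing them.

```python
# Sketch: fp8 -> bf16 dequantization at load time using paired scale tensors.
import torch

def dequant_fp8_to_bf16(weight: torch.Tensor,
                        scale_inv: torch.Tensor,
                        block: int = 128) -> torch.Tensor:
    """weight: fp8 [out, in]; scale_inv: per-block scales [ceil(out/block), ceil(in/block)]."""
    w = weight.to(torch.float32)
    s = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    s = s[: w.shape[0], : w.shape[1]]  # trim padding from the last partial blocks
    return (w * s).to(torch.bfloat16)

def dequant_state_dict(state_dict: dict) -> dict:
    """Replace fp8 weights with bf16, refusing to continue on unpaired tensors."""
    out = {}
    for name, tensor in state_dict.items():
        if name.endswith(".weight_scale_inv"):
            continue                          # consumed together with its weight
        if tensor.dtype == torch.float8_e4m3fn:
            scale_name = name + "_scale_inv"  # e.g. "...weight" -> "...weight_scale_inv"
            if scale_name not in state_dict:
                raise ValueError(f"fp8 weight without a scale tensor: {name}")
            out[name] = dequant_fp8_to_bf16(tensor, state_dict[scale_name])
        else:
            out[name] = tensor
    return out
```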
## 3) Typical real-only risks (dummy may not expose)

- missing fp8 scale keys during real shard loading (an index audit sketch follows this list);
- wrong weight remap path only triggered by real checkpoints;
- KV/QK norm sharding mismatch under TP + replicated KV heads.
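The first risk can be surfaced before a long real-weight load by auditing the checkpoint index up front. A hedged sketch follows; it assumes the usual Hugging Face sharded layout with `model.safetensors.index.json`, and the path and pairing suffix are illustrative.

```python
# Sketch: pre-flight audit for unpaired fp8 weight/scale keys in a sharded checkpoint.
import json
from pathlib import Path

def audit_fp8_pairs(model_dir: str) -> None:
    index = json.loads((Path(model_dir) / "model.safetensors.index.json").read_text())
    keys = set(index["weight_map"])

    scales = {k for k in keys if k.endswith(".weight_scale_inv")}
    paired_weights = {k[: -len("_scale_inv")] for k in scales}  # "...weight_scale_inv" -> "...weight"

    missing_weights = sorted(paired_weights - keys)
    if missing_weights:
        print("scales whose paired .weight key is missing:", missing_weights)

    # Weights with no scale are only *candidates* for review: norms, embeddings, and
    # lm_head are typically unquantized, so cross-check against quantization_config.
    unscaled = sorted(k for k in keys
                      if k.endswith(".weight") and k + "_scale_inv" not in keys)
    print(f"{len(scales)} scale tensors, {len(unscaled)} weights without a scale")

audit_fp8_pairs("/models/example-model")  # hypothetical model path
```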
## 4) KV replication + TP pitfalls

Typical symptom:

- shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.

Recommended pattern:

- detect KV-head replication explicitly (see the sketch after this list);
- use the local norm/shard loader path for replicated KV heads;
- avoid assuming uniform divisibility for all head dimensions.
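A hedged sketch of the replication arithmetic is shown below. It is not the actual vllm-ascend loader; the function name and example numbers are illustrative, and the real code must additionally route q/k norm weights to the local shard.

```python
# Sketch: KV-head shard plan under tensor parallelism, handling the replicated case.
def kv_shard_plan(num_kv_heads: int, tp_size: int, head_dim: int) -> dict:
    if tp_size > num_kv_heads:
        # Replicate KV heads so that every rank holds at least one full head.
        assert tp_size % num_kv_heads == 0, "tp_size must be a multiple of num_kv_heads"
        replication = tp_size // num_kv_heads
        local_kv_heads = 1
    else:
        assert num_kv_heads % tp_size == 0
        replication = 1
        local_kv_heads = num_kv_heads // tp_size
    return {
        "replication": replication,
        "local_kv_heads": local_kv_heads,
        # Per-rank k/v projection rows; a naive even split over tp_size is what
        # produces shape mismatches in the replicated case.
        "local_kv_rows": local_kv_heads * head_dim,
    }

# Example: 8-way TP with 4 KV heads and head_dim=64 -> each KV head lives on 2 ranks.
print(kv_shard_plan(num_kv_heads=4, tp_size=8, head_dim=64))
```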
## 5) ACLGraph stability for fp8-origin checkpoints

Recommended pattern:

- prefer `HCCL_OP_EXPANSION_MODE=AIV` when using graph mode (launch sketch after this list);
- keep practical capture sizes and re-test from small, stable shapes;
- use `--enforce-eager` only as a temporary isolation fallback.
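For completeness, a hedged launch sketch showing how these knobs fit together. The model path, port, and TP size are illustrative; `--enforce-eager` and `--tensor-parallel-size` are standard vLLM CLI flags, while `HCCL_OP_EXPANSION_MODE` is the HCCL environment variable mentioned above.

```python
# Sketch: launching the server with graph-mode-friendly settings and an eager fallback.
import os
import subprocess

env = dict(os.environ, HCCL_OP_EXPANSION_MODE="AIV")  # prefer AIV expansion in graph mode

cmd = [
    "vllm", "serve", "/models/example-model",  # hypothetical model path
    "--port", "8000",
    "--tensor-parallel-size", "4",
]

# Temporary isolation fallback: add --enforce-eager only while bisecting graph issues.
if os.environ.get("DEBUG_EAGER") == "1":
    cmd.append("--enforce-eager")

subprocess.run(cmd, env=env, check=True)
```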
## 6) Reporting discipline

Always report both:

- what dummy validated (fast gate), and
- what only real weights validated (mandatory gate).

Do not sign off fp8-on-NPU adaptation with dummy-only evidence.