[Doc][Skill] Introduce AI-assisted model-adaptation workflow for vllm-ascend (#6731)
### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model-adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready results by requiring request-level checks (not startup logs only)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal, scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model-support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs using this skill as the workflow baseline.
### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
# Multimodal + EP + ACLGraph Lessons
This note captures practical patterns that repeatedly matter for VL checkpoints on Ascend.
## 1) Out-of-box feature expectation
Validate the key features by default whenever possible:
- ACLGraph
- MTP
- multimodal (if model supports VL)
- EP (MoE models only)
- flashcomm1 (MoE models only)
If any feature fails, keep logs and explain the reason in the final report.
For non-MoE models, EP/flashcomm1 should be marked not-applicable.
## 2) Validate in this order
1. Single text request success (`/v1/models` + `/v1/chat/completions`).
2. Single text+image request success.
3. Graph evidence (`Replaying aclgraph`) when graph mode is expected.
4. Capacity baseline: `128k` context length with batch size 16 (`bs16`).
5. Concurrency expansion if needed (`32/64` suggested).
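Steps 1–2 above can be sketched as OpenAI-compatible request bodies. This is a minimal illustration only: the model name and image URL below are placeholders, and the server is assumed to expose the standard `/v1/chat/completions` route.

```python
# Sketch of the request bodies for validation steps 1-2.
# Model name and image URL are placeholders, not values from this repo.

def text_payload(model: str, prompt: str) -> dict:
    """Step 1: plain text chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def text_image_payload(model: str, prompt: str, image_url: str) -> dict:
    """Step 2: text + image request (OpenAI multimodal content format)."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": 64,
    }
```

POST each payload to `/v1/chat/completions` and require an HTTP 200 with non-empty content; a clean startup log alone is not sufficient evidence.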
## 3) EP + graph startup expectations
- Startup latency is much higher than in eager mode due to:
- compile warmup
- graph capture rounds
- multimodal encoder profiling
- Do not treat slow startup as failure unless logs show hard errors.
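The "slow is not failed" rule can be encoded as a small log check. A sketch only: the error-marker list below is an assumption to adjust to your actual log format, not an official vllm-ascend pattern set.

```python
# Sketch: decide whether a slow EP+graph startup actually failed.
# Long warmup/capture/profiling phases are normal; only hard error
# lines count as failure. Marker list is an assumption, adjust it
# to match your deployment's logs.

HARD_ERROR_MARKERS = ("Traceback", "ERROR", "RuntimeError")  # assumed patterns

def startup_failed(log_lines: list[str]) -> bool:
    """Return True only when a hard error appears, not when startup is merely slow."""
    return any(marker in line
               for line in log_lines
               for marker in HARD_ERROR_MARKERS)
```

Warmup chatter such as graph-capture progress lines should pass this check; only genuine error traces should fail it.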
## 4) Always distinguish two max lengths
- **Theoretical max**: from model config (`max_position_embeddings`).
- **Practical max**: largest value that actually starts and serves on current hardware + TP/EP settings.
Report both values explicitly.
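One way to make "report both" concrete is a tiny helper that pairs the config-derived value with the measured one. A sketch, assuming a HF-style config dict; `practical_max` must come from an actual serving run, not from any formula.

```python
# Sketch: report theoretical vs practical max length side by side.
# `config` mimics a HF model config dict; `practical_max` is the largest
# --max-model-len that actually started and served on the target
# hardware + TP/EP settings (measured, never derived).

def max_length_report(config: dict, practical_max: int) -> dict:
    theoretical = config["max_position_embeddings"]
    return {
        "theoretical_max": theoretical,
        "practical_max": practical_max,
        "headroom": theoretical - practical_max,
    }
```

A non-zero headroom in the final report signals that the hardware or parallelism settings, not the model config, are the limiting factor.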
## 5) Multimodal testing with temporary layer reduction
- Reducing `num_hidden_layers` can speed up smoke tests.
- This does **not** remove the ViT structure itself.
- Still require one full-layer validation before final sign-off.
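One way to apply the temporary reduction is via vLLM's `--hf-overrides` JSON flag (a sketch; the layer count and max length below are illustrative, and `/models/<model-name>` follows the path convention from the skill package):

```shell
# Smoke test with a reduced-depth model; the ViT structure is untouched.
# Re-run without --hf-overrides for the mandatory full-layer validation.
vllm serve /models/<model-name> \
  --hf-overrides '{"num_hidden_layers": 2}' \
  --max-model-len 8192
```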
## 6) Feature-status semantics
Use four categories:
- ✅ supported and verified
- ❌ framework-level unsupported
- ⚠️ checkpoint missing (weights/config do not provide feature)
- N/A not-applicable (for example EP/flashcomm1 on non-MoE models)
Typical examples:
- flashcomm1 on non-MoE VL models is often N/A or ❌ depending on framework gate.
- MTP may be ⚠️ checkpoint missing even if the framework has the code paths.
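The four categories and the MoE gating rule can be encoded as a small decision helper. A sketch only: the function, flag names, and feature-name set are illustrative, not an API from vllm-ascend.

```python
# Sketch: map a feature check to one of the four status categories.
# Names are illustrative; the MoE-only set mirrors the rule that
# EP/flashcomm1 are not-applicable on non-MoE models.

MOE_ONLY_FEATURES = {"EP", "flashcomm1"}

def classify(feature: str, is_moe: bool,
             framework_supported: bool, checkpoint_provides: bool) -> str:
    if feature in MOE_ONLY_FEATURES and not is_moe:
        return "not-applicable"          # N/A
    if not framework_supported:
        return "unsupported"             # framework-level gate
    if not checkpoint_provides:
        return "checkpoint-missing"      # weights/config lack the feature
    return "supported"                   # verified working
```

Using a fixed category function keeps reports comparable across models: every feature lands in exactly one of the four outcomes, never in free-form prose.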
## 7) Keep docs and defaults aligned with latest success path
- If EP+graph is validated and requested/expected, it should be the default runbook path.
- Eager mode should be documented as fallback/troubleshooting only.