### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model-adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready by requiring request-level checks (not startup logs only)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal, scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model-support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs using this skill as the workflow baseline.
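The two-stage validation strategy can be sketched as a small launch helper. This is a hedged illustration, not code from the PR: the model path and port follow the conventions listed above, and vLLM's `--load-format dummy` flag initializes random weights so Stage A can catch graph and shape errors without a real checkpoint. The helper prints the command rather than executing it, so the gate being run is inspectable in logs.

```shell
# STAGE=dummy  -> Stage A fast gate (random weights, no checkpoint needed)
# STAGE=real   -> Stage B mandatory real-weight gate before sign-off
STAGE="${STAGE:-dummy}"
MODEL='/models/<model-name>'          # placeholder path convention from above
CMD="vllm serve $MODEL --port 8000"
if [ "$STAGE" = dummy ]; then
  CMD="$CMD --load-format dummy"      # Stage A only; drop for Stage B
fi
echo "$CMD"                           # print, don't execute: auditable gate
```

Running with `STAGE=real` emits the same command without the dummy flag, which is the form that must pass before final sign-off.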
### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
## Deliverables

### Required outputs in current repo
- One final signed commit (`git commit -sm ...`) containing the adaptation changes.
- Chinese analysis report (concise but complete), covering:
  - model architecture summary
  - incompatibility root causes
  - code changes and rationale
  - startup and inference verification evidence
  - feature status matrix (supported / unsupported / checkpoint-missing / not-applicable)
  - max model len: config theoretical vs. runtime practical
  - dummy-vs-real validation matrix (what dummy proved / what only real proved)
  - false-ready cases and final resolution path (if any)
  - fallback ladder evidence (which fallback was tried, what changed)
- Chinese compact runbook, covering:
  - how to start the server in `/workspace` (direct command, default `:8000`)
  - how to run OpenAI-compatible validation
  - optional eager fallback command
  - optional `TORCHDYNAMO_DISABLE=1` fallback command (if relevant)
- Test config YAML at `tests/e2e/models/configs/<ModelName>.yaml`, which must include `model_name`, `hardware`, `tasks` with accuracy metrics (name + value), and `num_fewshot`. Use accuracy results from the evaluation to populate metric values. Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`).
- Tutorial doc at `docs/source/tutorials/models/<ModelName>.md`, which must follow the standard template sections: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance. Fill in model-specific details (HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, accuracy table).
- Post the SKILL.md content or an AI-assisted workflow summary as a comment on the originating GitHub issue.
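A hypothetical sketch of the e2e config described above. The field nesting here is an assumption from the stated requirements (`model_name`, `hardware`, `tasks` with metric name + value, `num_fewshot`); the existing configs such as `Qwen3-8B.yaml` remain the authoritative schema, and the task name and values below are placeholders to be replaced with real evaluation output.

```yaml
# tests/e2e/models/configs/<ModelName>.yaml -- illustrative sketch only
model_name: <ModelName>
hardware: Atlas A2          # placeholder; match the hardware actually tested
tasks:
  - name: gsm8k             # example task, not prescribed by this doc
    metrics:
      - name: exact_match
        value: 0.0          # fill in from the real accuracy evaluation
num_fewshot: 5              # placeholder; copy from the evaluation setup
```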
### Commit discipline
- Keep one signed commit for code changes in the current working repo.
- If implementation occurred in `/vllm-workspace/*`, backport the minimal final diff to the current repo before committing.
- Keep the diff scoped to the target model adaptation.
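The backport-then-sign flow above can be demonstrated end to end. This is a self-contained sketch using two throwaway repos as stand-ins for `/vllm-workspace/vllm-ascend` and the delivery repo; all paths, file names, identities, and the commit message are examples, not values from this PR.

```shell
set -e
work=$(mktemp -d); deliver=$(mktemp -d)      # stand-in build tree / delivery repo
for r in "$work" "$deliver"; do
  git -C "$r" init -q
  echo base > "$r/model.py"                  # shared starting state
  git -C "$r" add model.py
  git -C "$r" -c user.email=ci@example.com -c user.name=ci commit -qm base
done
echo "npu fix" >> "$work/model.py"           # implementation happened in the build tree
git -C "$work" diff > /tmp/adapt.patch       # capture the minimal final diff
git -C "$deliver" apply /tmp/adapt.patch     # backport into the delivery repo
git -C "$deliver" add model.py
git -C "$deliver" -c user.email=ci@example.com -c user.name=ci \
  commit -qsm "Adapt <ModelName> for vllm-ascend"   # the one signed commit (-s)
git -C "$deliver" log -1 --pretty=%B         # message ends with a Signed-off-by trailer
```

The `-s` flag is what produces the required `Signed-off-by` trailer; keeping the patch to exactly the files touched by the adaptation is what keeps the diff reviewable.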
### Validation discipline
- Always provide log file paths for key claims.
- Keep docs synchronized with latest successful test mode (do not leave stale command variants as default).
- Final report must include pass/fail reason for each key feature attempt: ACLGraph / EP / flashcomm1 / MTP / multimodal.
- EP and flashcomm1 are MoE-only checks; for non-MoE models, mark them as not-applicable with evidence.
- Final report should include the baseline capacity result (`128k + bs16`) or an explicit reason if it is not feasible.
- Dummy-first can be used to speed up iteration, but the real-weight gate is mandatory before final sign-off.
- Startup-only evidence is insufficient; include first-request smoke results.
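The first-request smoke check above can be sketched as follows. This is a hedged example: the served model name and prompt are placeholders, and the endpoint is the standard OpenAI-compatible `/v1/chat/completions` route that a vLLM server exposes. The live `curl` is shown as a comment (it needs a running server); the executable part only validates that the payload is well-formed.

```shell
# Request body for the first-request smoke check (placeholder model name):
BODY='{"model":"<served-model-name>","messages":[{"role":"user","content":"ping"}],"max_tokens":8}'
# Against the live server started in /workspace (default :8000), run:
#   curl -s http://localhost:8000/v1/chat/completions \
#        -H 'Content-Type: application/json' -d "$BODY"
# and attach the response (not just startup logs) as evidence.
# Offline sanity check that the payload parses as JSON:
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"
```

A 200 response with generated tokens is the minimum bar; startup banners alone do not count as evidence.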
### Suggested final response structure
- What changed
- What went well / what went wrong
- Validation performed
- Commit hash and changed files
- Optional next step