### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model-adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready sign-offs by requiring request-level checks (not startup logs alone)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal, scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model-support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
# vLLM Ascend Model Adapter Skill
Adapt and debug models for vLLM on Ascend NPU — covering both already-supported architectures and new models not yet registered in vLLM.
## What it does
This skill guides an AI agent through a deterministic workflow to:
- Triage a model checkpoint (architecture, quant type, multimodal capability); a triage sketch follows this list.
- Implement minimal code changes in `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`.
- Validate via a two-stage gate (fast dummy gate + mandatory real-weight gate).
- Deliver one signed commit with code, test config, and tutorial doc.
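
As a rough sketch of the triage step, a checkpoint can be inspected through its `config.json`; the model directory below is a placeholder and the fields present vary by checkpoint.

```bash
# Hypothetical triage: read the checkpoint's config.json to identify the
# architecture, quantization method, and whether a vision/multimodal config exists.
MODEL_DIR="/models/<model-name>"   # placeholder: replace with the actual model directory

python3 - "$MODEL_DIR/config.json" <<'EOF'
import json, sys

cfg = json.load(open(sys.argv[1]))
print("architectures:      ", cfg.get("architectures"))
print("quantization method:", cfg.get("quantization_config", {}).get("quant_method"))
print("has vision_config:  ", "vision_config" in cfg)
EOF
```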
## File layout

| File | Purpose |
|---|---|
| `SKILL.md` | Skill definition, constraints, and execution playbook |
| `references/workflow-checklist.md` | Step-by-step commands and templates |
| `references/troubleshooting.md` | Symptom-action pairs for common failures |
| `references/fp8-on-npu-lessons.md` | FP8 checkpoint handling on Ascend |
| `references/multimodal-ep-aclgraph-lessons.md` | VL, EP, and ACLGraph patterns |
| `references/deliverables.md` | Required outputs and commit discipline |
## Quick start

- Open a conversation with the AI agent inside the vllm-ascend dev container.
- Invoke the skill (e.g. `/vllm-ascend-model-adapter`).
- Provide the model path (default `/models/<model-name>`) and the originating issue number.
- The agent follows the playbook in `SKILL.md` and produces a ready-to-merge commit.
## Key constraints

- Never upgrade `transformers`.
- Start `vllm serve` from `/workspace` (direct command, port 8000); a launch sketch follows this list.
- Dummy-only evidence is not sufficient — real-weight validation is mandatory.
- Final delivery is exactly one signed commit in the current repo.
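
A minimal launch sketch consistent with these constraints, assuming the path conventions above; the model directory is a placeholder and model-specific engine flags are omitted.

```bash
# Hypothetical launch: serve directly from /workspace on port 8000.
# "<model-name>" is a placeholder; model-specific flags (tensor parallelism,
# quantization overrides, etc.) depend on the checkpoint and are omitted here.
cd /workspace
vllm serve "/models/<model-name>" --port 8000
```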
## Two-stage validation

- Stage A (dummy): fast architecture / operator / API path check with `--load-format dummy`.
- Stage B (real): real-weight loading, fp8/quant path, KV sharding, runtime stability.

Both stages require request-level verification (`/v1/models` + at least one chat request), not just startup success; a verification sketch follows.
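
As a rough illustration, the request-level checks might look like the following; the model path, prompt, and token limit are placeholders, and Stage A simply appends `--load-format dummy` to the launch command shown under Key constraints.

```bash
# Stage A reuses the launch command from "Key constraints" with --load-format dummy
# appended; Stage B uses the same command with real weights. Both stages require
# the request-level checks below (model path and prompt are placeholders).

# 1. The model must be listed by the server.
curl -s http://localhost:8000/v1/models

# 2. At least one chat request must complete successfully.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/<model-name>",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```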