xc-llm-ascend/.agents/skills/vllm-ascend-model-adapter/README.md
jack 29e3cdde20 [Doc][Skill] Introduce AI-assisted model-adaptation workflow for vllm-ascend (#6731)
### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill
package** for `vllm-ascend`.

The goal is to make model adaptation work (especially for recurring
feature-request issues) **repeatable, auditable, and easier to hand
off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`

2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoids false "ready" claims by requiring request-level checks, not startup logs alone

3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes

4. **Delivery contract**
   - minimal, scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo
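
The two-stage gate above can be sketched as shell commands. This is illustrative only: the model name is a placeholder, the paths and port follow the environment assumptions in this PR, and `--load-format dummy` is the vLLM flag the skill relies on for Stage A.

```shell
# Stage A: fast gate with random weights, exercising architecture/operator/API paths
cd /workspace
vllm serve /models/<model-name> --load-format dummy --port 8000

# Stage B: mandatory real-weight gate before sign-off
vllm serve /models/<model-name> --port 8000

# Request-level check for either stage (startup logs alone do not count)
curl -s http://localhost:8000/v1/models
```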

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up
PRs using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future
model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-02-26 08:48:15 +08:00


# vLLM Ascend Model Adapter Skill

Adapt and debug models for vLLM on Ascend NPU, covering both already-supported architectures and new models not yet registered in vLLM.

## What it does

This skill guides an AI agent through a deterministic workflow to:

1. Triage a model checkpoint (architecture, quant type, multimodal capability).
2. Implement minimal code changes in `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`.
3. Validate via a two-stage gate (fast dummy gate plus mandatory real-weight gate).
4. Deliver one signed commit with code, test config, and tutorial doc.
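
The triage in step 1 typically starts from the checkpoint's `config.json`. A minimal sketch, assuming standard Hugging Face config fields (`architectures`, `quantization_config.quant_method`, `vision_config`); the hints a real triage inspects may go further:

```python
def triage(config: dict) -> dict:
    """Summarize the checkpoint fields the triage step cares about:
    architecture, quantization method, and a multimodal hint."""
    return {
        "architectures": config.get("architectures", []),
        "quant_method": (config.get("quantization_config") or {}).get(
            "quant_method", "none"
        ),
        # multimodal checkpoints typically carry a nested vision_config
        "multimodal": "vision_config" in config,
    }


if __name__ == "__main__":
    import json
    # In the container this would be read from /models/<model-name>/config.json
    sample = {"architectures": ["Qwen2ForCausalLM"],
              "quantization_config": {"quant_method": "fp8"}}
    print(json.dumps(triage(sample), indent=2))
```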

## File layout

| File | Purpose |
| --- | --- |
| `SKILL.md` | Skill definition, constraints, and execution playbook |
| `references/workflow-checklist.md` | Step-by-step commands and templates |
| `references/troubleshooting.md` | Symptom-action pairs for common failures |
| `references/fp8-on-npu-lessons.md` | FP8 checkpoint handling on Ascend |
| `references/multimodal-ep-aclgraph-lessons.md` | VL, EP, and ACLGraph patterns |
| `references/deliverables.md` | Required outputs and commit discipline |

## Quick start

1. Open a conversation with the AI agent inside the vllm-ascend dev container.
2. Invoke the skill (e.g. `/vllm-ascend-model-adapter`).
3. Provide the model path (default `/models/<model-name>`) and the originating issue number.
4. The agent follows the playbook in `SKILL.md` and produces a ready-to-merge commit.

## Key constraints

- Never upgrade `transformers`.
- Start `vllm serve` from `/workspace` (direct command, port 8000).
- Dummy-only evidence is not sufficient; real-weight validation is mandatory.
- Final delivery is exactly one signed commit in the current repo.

## Two-stage validation

- Stage A (dummy): fast architecture / operator / API-path check with `--load-format dummy`.
- Stage B (real): real-weight loading, fp8/quant path, KV sharding, and runtime stability.

Both stages require request-level verification (`/v1/models` plus at least one chat request), not just startup success.
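
The request-level checks reduce to two response-shape tests. Here is a hedged sketch against vLLM's OpenAI-compatible API (the endpoint paths come from the text above; the exact readiness criteria live in the skill's checklist, and the response shapes assumed here follow the OpenAI API format vLLM serves):

```python
import json


def model_listed(models_body: str, served_name: str) -> bool:
    """True if the served model id appears in a /v1/models response body."""
    data = json.loads(models_body)
    return any(m.get("id") == served_name for m in data.get("data", []))


def chat_succeeded(chat_body: str) -> bool:
    """True if a /v1/chat/completions response carries non-empty content."""
    data = json.loads(chat_body)
    choices = data.get("choices") or []
    return bool(choices and (choices[0].get("message") or {}).get("content"))
```

In practice the two bodies would come from `curl`/HTTP calls against the server started in `/workspace`; passing both checks is what "request-level verification" means here.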