From 29e3cdde20944457df4ab45550be1630794954fd Mon Sep 17 00:00:00 2001
From: jack
Date: Thu, 26 Feb 2026 08:48:15 +0800
Subject: [PATCH] [Doc][Skill] Introduce AI-assisted model-adaptation workflow
 for vllm-ascend (#6731)

### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model-adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready results by requiring request-level checks (not startup logs only)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model-support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.
---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
---
 .../vllm-ascend-model-adapter/README.md       |  46 ++++
 .../skills/vllm-ascend-model-adapter/SKILL.md | 140 ++++++++++
 .../references/deliverables.md                |  47 ++++
 .../references/fp8-on-npu-lessons.md          |  57 ++++
 .../multimodal-ep-aclgraph-lessons.md         |  64 +++++
 .../references/troubleshooting.md             | 229 ++++++++++++++++
 .../references/workflow-checklist.md          | 255 ++++++++++++++++++
 7 files changed, 838 insertions(+)
 create mode 100644 .agents/skills/vllm-ascend-model-adapter/README.md
 create mode 100644 .agents/skills/vllm-ascend-model-adapter/SKILL.md
 create mode 100644 .agents/skills/vllm-ascend-model-adapter/references/deliverables.md
 create mode 100644 .agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md
 create mode 100644 .agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md
 create mode 100644 .agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md
 create mode 100644 .agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md

diff --git a/.agents/skills/vllm-ascend-model-adapter/README.md b/.agents/skills/vllm-ascend-model-adapter/README.md
new file mode 100644
index 00000000..0943af2d
--- /dev/null
+++ b/.agents/skills/vllm-ascend-model-adapter/README.md
@@ -0,0 +1,46 @@
+# vLLM Ascend Model Adapter Skill
+
+Adapt and debug models for vLLM on Ascend NPU — covering both already-supported
+architectures and new models not yet registered in vLLM.
+
+## What it does
+
+This skill guides an AI agent through a deterministic workflow to:
+
+1. Triage a model checkpoint (architecture, quant type, multimodal capability).
+2. Implement minimal code changes in `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`.
+3. Validate via a two-stage gate (dummy fast gate + real-weight mandatory gate).
+4. Deliver one signed commit with code, test config, and tutorial doc.
+
+## File layout
+
+| File | Purpose |
+| ---- | ------- |
+| `SKILL.md` | Skill definition, constraints, and execution playbook |
+| `references/workflow-checklist.md` | Step-by-step commands and templates |
+| `references/troubleshooting.md` | Symptom-action pairs for common failures |
+| `references/fp8-on-npu-lessons.md` | FP8 checkpoint handling on Ascend |
+| `references/multimodal-ep-aclgraph-lessons.md` | VL, EP, and ACLGraph patterns |
+| `references/deliverables.md` | Required outputs and commit discipline |
+
+## Quick start
+
+1. Open a conversation with the AI agent inside the vllm-ascend dev container.
+2. Invoke the skill (e.g. `/vllm-ascend-model-adapter`).
+3. Provide the model path (default `/models/`) and the originating issue number.
+4. The agent follows the playbook in `SKILL.md` and produces a ready-to-merge commit.
+
+## Key constraints
+
+- Never upgrade `transformers`.
+- Start `vllm serve` from `/workspace` (direct command, port 8000).
+- Dummy-only evidence is not sufficient — real-weight validation is mandatory.
+- Final delivery is exactly one signed commit in the current repo.
+
+## Two-stage validation
+
+- **Stage A (dummy)**: fast architecture / operator / API path check with `--load-format dummy`.
+- **Stage B (real)**: real-weight loading, fp8/quant path, KV sharding, runtime stability.
+
+Both stages require request-level verification (`/v1/models` + at least one chat request),
+not just startup success.
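+
+A minimal sketch of this request-level check (an illustration, not part of the
+official tooling): it assumes an OpenAI-compatible server already listening on
+`127.0.0.1:8000`, and `my-model` is a placeholder for the actual
+`--served-model-name` value.
+
+```python
+# Minimal sketch of the request-level gate described above.
+# Assumes an OpenAI-compatible server on 127.0.0.1:8000; "my-model"
+# is a placeholder for the actual --served-model-name value.
+import json
+import urllib.request
+
+BASE = "http://127.0.0.1:8000"
+
+
+def post_json(url: str, payload: dict) -> dict:
+    req = urllib.request.Request(
+        url,
+        data=json.dumps(payload).encode(),
+        headers={"Content-Type": "application/json"},
+    )
+    with urllib.request.urlopen(req, timeout=120) as resp:
+        assert resp.status == 200, f"HTTP {resp.status} from {url}"
+        return json.loads(resp.read())
+
+
+# 1) readiness: /v1/models must answer 200
+with urllib.request.urlopen(f"{BASE}/v1/models", timeout=30) as resp:
+    assert resp.status == 200
+    print("served:", [m["id"] for m in json.loads(resp.read())["data"]])
+
+# 2) smoke: one deterministic chat request with non-empty output
+reply = post_json(f"{BASE}/v1/chat/completions", {
+    "model": "my-model",
+    "messages": [{"role": "user", "content": "say hi"}],
+    "temperature": 0,
+    "max_tokens": 16,
+})
+content = reply["choices"][0]["message"]["content"]
+assert content.strip(), "empty output: do not mark the gate as passed"
+print("smoke output:", content)
+```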
diff --git a/.agents/skills/vllm-ascend-model-adapter/SKILL.md b/.agents/skills/vllm-ascend-model-adapter/SKILL.md
new file mode 100644
index 00000000..0000d714
--- /dev/null
+++ b/.agents/skills/vllm-ascend-model-adapter/SKILL.md
@@ -0,0 +1,140 @@
+---
+name: vllm-ascend-model-adapter
+description: "Adapt and debug existing or new models for vLLM on Ascend NPU. Implement in /vllm-workspace/vllm and /vllm-workspace/vllm-ascend, validate via direct vllm serve from /workspace, and deliver one signed commit in the current repo."
+---
+
+# vLLM Ascend Model Adapter
+
+## Overview
+
+Adapt Hugging Face or local models to run on `vllm-ascend` with minimal changes, deterministic validation, and single-commit delivery. This skill covers both already-supported models and new architectures not yet registered in vLLM.
+
+## Read order
+
+1. Start with `references/workflow-checklist.md`.
+2. Read `references/multimodal-ep-aclgraph-lessons.md` (feature-first checklist).
+3. If startup/inference fails, read `references/troubleshooting.md`.
+4. If the checkpoint is fp8-on-NPU, read `references/fp8-on-npu-lessons.md`.
+5. Before handoff, read `references/deliverables.md`.
+
+## Hard constraints
+
+- Never upgrade `transformers`.
+- Primary implementation roots are fixed by the Dockerfile:
+  - `/vllm-workspace/vllm`
+  - `/vllm-workspace/vllm-ascend`
+- Start `vllm serve` from `/workspace` as a direct command by default.
+- The default API port is `8000` unless the user explicitly asks otherwise.
+- Feature-first default: make a best effort to validate ACLGraph / EP / flashcomm1 / MTP / multimodal out of the box.
+- `--enable-expert-parallel` and flashcomm1 checks are MoE-only; for non-MoE models mark them as not-applicable with evidence.
+- If any feature cannot be enabled, keep evidence and explain the reason in the final report.
+- Do not rely on a `PYTHONPATH=<src>:$PYTHONPATH` overlay unless a debugging fallback is strictly needed.
+- Keep code changes minimal and focused on the target model.
+- The final deliverable must be one single signed commit in the current working repo (`git commit -sm ...`).
+- Keep final docs in Chinese and compact.
+- **Dummy-first is encouraged for speed, but dummy is NOT fully equivalent to real weights.**
+- **Never sign off an adaptation using dummy-only evidence; the real-weight gate is mandatory.**
+
+## Execution playbook
+
+### 1) Collect context
+
+- Confirm the model path (default `/models/<model_name>`; if the environment differs, confirm with the user explicitly).
+- Confirm the implementation roots (`/vllm-workspace/vllm`, `/vllm-workspace/vllm-ascend`).
+- Confirm the delivery root (the current git repo where the final commit is expected).
+- Confirm the runtime import path points to the `/vllm-workspace/*` install.
+- Use the default expected feature set: ACLGraph + EP + flashcomm1 + MTP + multimodal (if the model has VL capability).
+- User requirements extend this baseline; they do not replace it.
+
+### 2) Analyze the model first
+
+- Inspect `config.json`, processor files, modeling files, and tokenizer files.
+- Identify the architecture class, attention variant, quantization type, and multimodal requirements.
+- Check state-dict key prefixes (and the safetensors index) to infer mapping needs.
+- Decide whether support already exists in `vllm/model_executor/models/registry.py`.
+
+### 3) Choose an adaptation strategy (new-model capable)
+
+- Reuse an existing vLLM architecture if compatible.
+- If the architecture is missing or incompatible, implement native support:
+  - add a model adapter under `vllm/model_executor/models/`;
+  - add a processor under `vllm/transformers_utils/processors/` when needed;
+  - register the architecture in `vllm/model_executor/models/registry.py`;
+  - implement explicit weight loading/remap rules (including fp8 scale pairing, KV/QK norm sharding, rope variants).
+- If remote code needs newer transformers symbols, do not upgrade the dependency.
+- If unavoidable, copy the required modeling files from a sibling transformers source tree and keep the scope explicit.
+- If the failure is backend-specific (kernel/op/platform), patch the minimal required code in `/vllm-workspace/vllm-ascend`.
+
+### 4) Implement minimal code changes (in implementation roots)
+
+- Touch only the files required for this model adaptation.
+- Keep weight mapping explicit and auditable.
+- Avoid unrelated refactors.
+
+### 5) Two-stage validation on Ascend (direct run)
+
+#### Stage A: dummy fast gate (recommended first)
+
+- Run from `/workspace` with `--load-format dummy`.
+- Goal: quickly validate the architecture path / operator path / API path.
+- Do not treat `Application startup complete` as a pass by itself; a request smoke test is mandatory.
+- Require at least:
+  - startup readiness (`/v1/models` 200),
+  - one text request 200,
+  - if a VL model, one text+image request 200,
+  - ACLGraph evidence where expected.
+
+#### Stage B: real-weight mandatory gate (must pass before sign-off)
+
+- Remove `--load-format dummy` and validate with the real checkpoint.
+- Goal: validate real-only risks:
+  - weight key mapping,
+  - fp8/fp4 dequantization path,
+  - KV/QK norm sharding with real tensor shapes,
+  - load-time/runtime stability.
+- Require HTTP 200 and non-empty output before declaring success.
+- Do not pass Stage B on startup-only evidence.
+
+### 6) Validate inference and features
+
+- Send `GET /v1/models` first.
+- Send at least one OpenAI-compatible text request.
+- For multimodal models, require at least one text+image request.
+- Validate architecture registration and the loader path with logs (no unresolved architecture, no fatal missing-key errors).
+- Try feature-first validation: the EP + ACLGraph path first; the eager path as fallback/isolation.
+- If startup succeeds but the first request crashes (false-ready), treat it as a runtime failure and continue root-cause isolation.
+- For `torch._dynamo` + `interpolate` + `NPU contiguous` failures on VL paths, try `TORCHDYNAMO_DISABLE=1` as a diagnostic/stability fallback.
+- For a multimodal processor API mismatch (for example a `skip_tensor_conversion` signature mismatch), use text-only isolation (`--limit-mm-per-prompt` with image/video/audio set to 0) to separate processor issues from core weight-loading issues.
+- Capacity baseline by default (single machine): `max-model-len=128k` + `max-num-seqs=16`.
+- Then expand concurrency (e.g., 32/64) if requested or feasible.
+
+### 7) Backport, generate artifacts, and commit in the delivery repo
+
+- If implementation happened in `/vllm-workspace/*`, backport the minimal final diff to the current working repo.
+- Generate the test config YAML at `tests/e2e/models/configs/<model_name>.yaml` following the schema of existing configs (must include `model_name`, `hardware`, `tasks` with accuracy metrics, and `num_fewshot`). Use accuracy results from evaluation to populate the metric values.
+- Generate the tutorial markdown at `docs/source/tutorials/models/<model_name>.md` following the standard template (Introduction, Supported Features, Environment Preparation with docker tabs, Deployment with serve script, Functional Verification with curl example, Accuracy Evaluation, Performance). Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, and accuracy table.
+- Update `docs/source/tutorials/models/index.md` to include the new tutorial.
+- Confirm the test config YAML and tutorial doc are included in the staged files.
+- Commit code changes once (a single signed commit).
+
+### 8) Prepare handoff artifacts
+
+- Write a comprehensive Chinese analysis report.
+- Write a compact Chinese runbook for server startup and validation commands.
+- Include a feature status matrix (supported / unsupported / checkpoint-missing / not-applicable).
+- Include a dummy-vs-real validation matrix and explicit non-equivalence notes.
+- Include the changed-file list, key logs, and the final commit hash.
+- Post the SKILL.md content (or a link to it) as a comment on the originating GitHub issue to document the AI-assisted workflow.
+
+## Quality gate before final answer
+
+- The service starts successfully from `/workspace` with a direct command.
+- An OpenAI-compatible inference request succeeds (not startup-only).
+- The key feature set is attempted and reported: ACLGraph / EP / flashcomm1 / MTP / multimodal.
+- The capacity baseline (`128k + bs16`) result is reported, or an explicit reason why it is not feasible.
+- **Dummy-stage evidence is present (if used), and real-weight-stage evidence is present (mandatory).**
+- The test config YAML exists at `tests/e2e/models/configs/<model_name>.yaml` and follows the established schema (`model_name`, `hardware`, `tasks`, `num_fewshot`).
+- The tutorial doc exists at `docs/source/tutorials/models/<model_name>.md` and follows the standard template (Introduction, Supported Features, Environment Preparation, Deployment, Functional Verification, Accuracy Evaluation, Performance).
+- The tutorial index at `docs/source/tutorials/models/index.md` includes the new model entry.
+- Exactly one signed commit contains all code changes in the current working repo.
+- The final response includes the commit hash, file paths, key commands, known limits, and failure reasons where applicable.
diff --git a/.agents/skills/vllm-ascend-model-adapter/references/deliverables.md b/.agents/skills/vllm-ascend-model-adapter/references/deliverables.md
new file mode 100644
index 00000000..361f98e8
--- /dev/null
+++ b/.agents/skills/vllm-ascend-model-adapter/references/deliverables.md
@@ -0,0 +1,47 @@
+# Deliverables
+
+## Required outputs in the current repo
+
+1. One final signed commit (`git commit -sm ...`) containing the adaptation changes.
+2. Chinese analysis report (concise but complete):
+   - model architecture summary
+   - incompatibility root causes
+   - code changes and rationale
+   - startup and inference verification evidence
+   - feature status matrix (supported / unsupported / checkpoint-missing / not-applicable)
+   - max model len: config theoretical vs runtime practical
+   - dummy-vs-real validation matrix (what dummy proved / what only real weights proved)
+   - false-ready cases and the final resolution path (if any)
+   - fallback ladder evidence (which fallback was tried, what changed)
+3. Compact Chinese runbook:
+   - how to start the server in `/workspace` (direct command, default `:8000`)
+   - how to run OpenAI-compatible validation
+   - optional eager fallback command
+   - optional `TORCHDYNAMO_DISABLE=1` fallback command (if relevant)
+4. Test config YAML at `tests/e2e/models/configs/<model_name>.yaml` — must include `model_name`, `hardware`, `tasks` with accuracy metrics (name + value), and `num_fewshot`. Use accuracy results from evaluation to populate the metric values. Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`).
+5. Tutorial doc at `docs/source/tutorials/models/<model_name>.md` — must follow the standard template: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance. Fill in model-specific details (HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, accuracy table).
+6. Post the SKILL.md content or an AI-assisted workflow summary as a comment on the originating GitHub issue.
+
+## Commit discipline
+
+- Keep one signed commit for code changes in the current working repo.
+- If implementation occurred in `/vllm-workspace/*`, backport the minimal final diff to the current repo before committing.
+- Keep the diff scoped to the target model adaptation.
+
+## Validation discipline
+
+- Always provide log file paths for key claims.
+- Keep docs synchronized with the latest successful test mode (do not leave stale command variants as the default).
+- The final report must include a pass/fail reason for each key feature attempt: ACLGraph / EP / flashcomm1 / MTP / multimodal.
+- EP and flashcomm1 are MoE-only checks; for non-MoE models mark them as not-applicable with evidence.
+- The final report should include the baseline capacity result (`128k + bs16`) or an explicit reason if it is not feasible.
+- Dummy-first can be used to speed up iterations, but the real-weight gate is mandatory before final sign-off.
+- Startup-only evidence is insufficient; include first-request smoke results.
+
+## Suggested final response structure
+
+- What changed
+- What went well / what went wrong
+- Validation performed
+- Commit hash and changed files
+- Optional next step
diff --git a/.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md b/.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md
new file mode 100644
index 00000000..59ac7f3d
--- /dev/null
+++ b/.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md
@@ -0,0 +1,57 @@
+# FP8-on-NPU Lessons
+
+## 1) Recommended debug order
+
+1. Start with `--load-format dummy` to quickly verify the architecture path.
+2. Run with real weights to validate weight mapping and load-time stability.
+3. If blocked by fp8 execution limits on the NPU, use the fp8->bf16 dequantization loading path.
+4. Validate `/v1/models`, then one text request, then one VL request (if multimodal).
+
+## 2) FP8 checkpoint on NPU
+
+Common symptom:
+
+- `fp8 quantization is currently not supported in npu`.
+
+Recommended pattern:
+
+- do not force fp8 execution kernels on the NPU;
+- dequantize fp8 weights to bf16 during loading using paired tensors:
+  - `*.weight`
+  - `*.weight_scale_inv`
+- keep strict unpaired scale/weight checks to avoid silent corruption.
+
+## 3) Typical real-only risks (dummy may not expose)
+
+- missing fp8 scale keys during real shard loading;
+- a wrong weight-remap path triggered only by real checkpoints;
+- KV/QK norm sharding mismatch under TP + replicated KV heads.
+
+## 4) KV replication + TP pitfalls
+
+Typical symptom:
+
+- a shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
+
+Recommended pattern:
+
+- detect KV-head replication explicitly;
+- use the local norm/shard loader path for replicated KV heads;
+- avoid assuming uniform divisibility for all head dimensions.
+
+## 5) ACLGraph stability for fp8-origin checkpoints
+
+Recommended pattern:
+
+- prefer `HCCL_OP_EXPANSION_MODE=AIV` when using graph mode;
+- keep practical capture sizes and re-test from small, stable shapes;
+- use `--enforce-eager` only as a temporary isolation fallback.
+
+## 6) Reporting discipline
+
+Always report both:
+
+- what dummy validated (fast gate), and
+- what only real weights validated (mandatory gate).
+
+Do not sign off an fp8-on-NPU adaptation with dummy-only evidence.
diff --git a/.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md b/.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md
new file mode 100644
index 00000000..ae5a04d5
--- /dev/null
+++ b/.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md
@@ -0,0 +1,64 @@
+# Multimodal + EP + ACLGraph Lessons
+
+This note captures practical patterns that repeatedly matter for VL checkpoints on Ascend.
+
+## 1) Out-of-box feature expectation
+
+Make a best effort to validate the key features by default:
+
+- ACLGraph
+- MTP
+- multimodal (if the model supports VL)
+- EP (MoE models only)
+- flashcomm1 (MoE models only)
+
+If any feature fails, keep the logs and explain the reason in the final report.
+For non-MoE models, EP/flashcomm1 should be marked not-applicable.
+
+## 2) Validate in this order
+
+1. Single text request success (`/v1/models` + `/v1/chat/completions`).
+2. Single text+image request success (see the sketch at the end of this note).
+3. Graph evidence (`Replaying aclgraph`) when graph mode is expected.
+4. Capacity baseline: `128k + bs16`.
+5. Concurrency expansion if needed (`32/64` suggested).
+
+## 3) EP + graph startup expectations
+
+- Startup latency is much higher than eager mode due to:
+  - compile warmup
+  - graph capture rounds
+  - multimodal encoder profiling
+- Do not treat slow startup as failure unless the logs show hard errors.
+
+## 4) Always distinguish two max lengths
+
+- **Theoretical max**: from the model config (`max_position_embeddings`).
+- **Practical max**: the largest value that actually starts and serves on the current hardware + TP/EP settings.
+
+Report both values explicitly.
+
+## 5) Multimodal testing with temporary layer reduction
+
+- Reducing `num_hidden_layers` can speed up smoke tests.
+- This does **not** remove the ViT structure itself.
+- Still require one full-layer validation before final sign-off.
+
+## 6) Feature-status semantics
+
+Use four categories:
+
+- ✅ supported and verified
+- ❌ framework-level unsupported
+- ⚠️ checkpoint missing (weights/config do not provide the feature)
+- N/A not-applicable (for example EP/flashcomm1 on non-MoE models)
+
+Typical examples:
+
+- flashcomm1 on non-MoE VL models is often N/A or ❌ depending on the framework gate.
+- MTP may be ⚠️ checkpoint-missing even if the framework has the code paths.
+
+## 7) Keep docs and defaults aligned with the latest success path
+
+- If EP+graph is validated and requested/expected, it should be the default runbook path.
+- Eager mode should be documented as fallback/troubleshooting only.
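+
+As a concrete form of step 2 above, here is a hedged sketch of a single
+text+image smoke request. The endpoint, model name, and image URL are
+placeholders for the actual deployment, not values fixed by this repo.
+
+```python
+# Hedged sketch of the text+image smoke request (step 2 in the
+# validation order above). Model name and image URL are placeholders.
+import json
+import urllib.request
+
+payload = {
+    "model": "my-vl-model",
+    "temperature": 0,
+    "max_tokens": 32,
+    "messages": [{
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Describe this image in one sentence."},
+            {"type": "image_url",
+             "image_url": {"url": "https://example.com/sample.jpg"}},
+        ],
+    }],
+}
+req = urllib.request.Request(
+    "http://127.0.0.1:8000/v1/chat/completions",
+    data=json.dumps(payload).encode(),
+    headers={"Content-Type": "application/json"},
+)
+with urllib.request.urlopen(req, timeout=300) as resp:
+    assert resp.status == 200
+    body = json.loads(resp.read())
+content = body["choices"][0]["message"]["content"]
+assert content.strip(), "empty VL output: treat as failure, not success"
+```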
diff --git a/.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md b/.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md
new file mode 100644
index 00000000..46c30aab
--- /dev/null
+++ b/.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md
@@ -0,0 +1,229 @@
+# Troubleshooting
+
+## Direct run doesn't pick up your code changes
+
+Symptoms:
+
+- `vllm serve` behavior is unchanged after code edits.
+
+Actions:
+
+1. Check the runtime import path:
+   ```bash
+   python - <<'PY'
+   import vllm
+   print(vllm.__file__)
+   PY
+   ```
+2. Ensure edits were made under `/vllm-workspace/vllm` and/or `/vllm-workspace/vllm-ascend`.
+3. Avoid the PYTHONPATH-overlay workflow except as a temporary debugging fallback.
+
+## Server fails to bind on `:8000` or fails with HCCL bind errors
+
+Symptoms:
+
+- Port bind fails on startup.
+- An HCCL error like `Communication_Error_Bind_IP_Port(EJ0003)`.
+
+Actions:
+
+1. Kill stale `vllm serve` processes.
+2. Ensure `:8000` is free.
+3. Retry a clean startup before changing code.
+
+## Startup appears "stuck" in graph mode
+
+Symptoms:
+
+- The process is alive, but `curl /v1/models` is not ready yet.
+- Logs show compile/graph-capture messages for a long time.
+
+Actions:
+
+1. Keep waiting until graph capture completes.
+2. Look for `Capturing CUDA graphs ...` and `Graph capturing finished`.
+3. Only declare failure after an explicit error or a timeout window.
+
+## False-ready: startup succeeds but the first request crashes
+
+Symptoms:
+
+- `Application startup complete` exists.
+- `GET /v1/models` may return 200.
+- The first text or VL request crashes workers/engine.
+
+Actions:
+
+1. Always run at least one text smoke request immediately after ready.
+2. For VL models, always run one text+image smoke request as well.
+3. Treat a first-request crash as a runtime failure (do not mark it as success).
+4. Capture the first runtime error signature and branch to a targeted fallback.
+
+## Architecture not recognized
+
+Symptoms:
+
+- `ValueError` or a log showing an unresolved architecture.
+
+Actions:
+
+1. Verify `architectures` in the model's `config.json`.
+2. Add the mapping to `vllm/model_executor/models/registry.py`.
+3. Ensure module and class names match exactly.
+
+## Remote code import fails on transformers symbols
+
+Symptoms:
+
+- A missing class/function in the current `transformers`.
+
+Actions:
+
+1. Do not upgrade `transformers`.
+2. Prefer a native vLLM implementation.
+3. If unavoidable, copy the required modeling files from a sibling transformers source tree.
+
+## Weight loading key mismatch
+
+Symptoms:
+
+- Missing/unexpected key warnings during load.
+
+Actions:
+
+1. Inspect checkpoint key prefixes.
+2. Add explicit mapping logic.
+3. Keep the mapping minimal and auditable.
+4. Re-test with full shards, not only tiny-layer smoke runs.
+
+## FP8 checkpoint on Ascend A2/A3 (must dequant to bf16)
+
+Symptoms:
+
+- fp8 kernels unsupported or unstable on Ascend A2/A3.
+
+Actions:
+
+1. Do not force fp8 quantization kernels on Ascend.
+2. Use the load-time fp8->bf16 dequantization path (weight + scale pairing).
+3. Add strict unpaired scale/weight checks to avoid silent corruption.
+
+## QK norm mismatch (KV heads / TP / head divisibility)
+
+Symptoms:
+
+- A shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
+- A similar mismatch when the head topology is not cleanly divisible.
+
+Actions:
+
+1. Detect the KV-head replication case.
+2. Use the local `k_norm` shard path for replicated KV heads.
+3. Avoid assuming that all head dimensions split evenly under the current TP.
+4. Validate both normal and edge topology cases explicitly.
+
+## MLA attention runtime failures after ready
+
+Symptoms:
+
+- The first request fails with signatures like `AtbRingMLAGetWorkspaceSize` / `AtbRingMLA`.
+- May also show `aclnnFusedInferAttentionScoreV3 ... error code 561002`.
+
+Actions:
+
+1. Reproduce with one minimal text request (deterministic payload).
+2. Try eager isolation (`--enforce-eager`) once to verify whether the issue is graph-only.
+3. If eager still fails, prioritize the model/backend code-fix path (not runtime flags only).
+4. Check the `vllm-ascend` MLA/rope/platform implementation used by known-good runs.
+
+## VL + TorchDynamo interpolate contiguous failure
+
+Symptoms:
+
+- `torch._dynamo.exc.TorchRuntimeError`.
+- The stack contains `torch.nn.functional.interpolate`.
+- The error contains `NPU contiguous operator only supported contiguous memory format`.
+
+Actions:
+
+1. Add `TORCHDYNAMO_DISABLE=1` and retry with the same serve args.
+2. Validate both text and text+image after startup.
+3. If this stabilizes startup and inference, record it as the current fallback path.
+4. Keep code-level fix exploration as a next step, but do not block delivery if the fallback is accepted.
+
+## Multimodal processor signature mismatch (`skip_tensor_conversion`)
+
+Symptoms:
+
+- Early failure before the engine is ready.
+- `convert_to_tensors() got an unexpected keyword argument 'skip_tensor_conversion'`.
+
+Actions:
+
+1. Identify the processor compatibility mismatch (HF remote processor vs the current transformers API).
+2. Use text-only isolation (`--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`) only to separate layers, not as a final fix.
+3. Expect potential follow-up core failures after bypassing the processor path; keep logs for both layers.
+4. Align with the known-good model dispatch and processor compatibility implementation.
+
+## Text-only isolation triggers meta tensor load errors
+
+Symptoms:
+
+- `NotImplementedError: Cannot copy out of meta tensor; no data!`
+- May occur after disabling multimodal prompt items.
+
+Actions:
+
+1. Treat it as a secondary failure signature (after bypassing the earlier MM-processor failure).
+2. Do not assume text-only isolation is universally safe for all VL models.
+3. Return to the model-specific code-fix path with the captured signatures.
+
+## Config max length works on paper but not at runtime
+
+Symptoms:
+
+- `max_position_embeddings` is large, but the service fails or OOMs with that value.
+
+Actions:
+
+1. Record the config max (theoretical).
+2. Find the practical max via successful startup + serving under the target TP/EP setup.
+3. Report both values explicitly in the docs.
+
+## flashcomm1 / MTP confusion on VL checkpoints
+
+Symptoms:
+
+- flashcomm1 enabled but startup fails.
+- MTP expected but no effect.
+
+Actions:
+
+1. Only validate flashcomm1 for MoE models; for non-MoE, mark it not-applicable.
+2. Verify MTP from both the config and the weight index (`mtp/nextn` keys).
+3. Mark unsupported vs checkpoint-missing clearly.
+
+## ACL graph capture fails (507903)
+
+Symptoms:
+
+- `AclmdlRICaptureEnd ... 507903`
+- `rtStreamEndCapture ... invalidated stream capture sequence`
+
+Actions:
+
+1. Prefer `HCCL_OP_EXPANSION_MODE=AIV` for graph-capture stability.
+2. Reduce shape pressure (`--max-model-len`) and retry.
+3. Temporarily fall back to `--enforce-eager` for isolation.
+
+## API reachable but output quality is odd
+
+Symptoms:
+
+- `/v1/models` works but the output has template artifacts.
+
+Actions:
+
+1. Use a deterministic request (`temperature=0`, bounded `max_tokens`).
+2. Verify that the endpoint (`/v1/chat/completions` vs `/v1/completions`) matches the model template.
+3. Confirm non-empty output and HTTP 200 before declaring success.
diff --git a/.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md b/.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md
new file mode 100644
index 00000000..3472126c
--- /dev/null
+++ b/.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md
@@ -0,0 +1,255 @@
+# Workflow Checklist
+
+## 0) Environment prerequisites
+
+Set these once per session. Defaults match the official vllm-ascend Docker image.
+
+```bash
+# --- configurable paths (adjust if your layout differs) ---
+VLLM_SRC=/vllm-workspace/vllm                # vLLM source root
+VLLM_ASCEND_SRC=/vllm-workspace/vllm-ascend  # vllm-ascend source root
+WORK_DIR=/workspace                          # directory to run vllm serve from
+MODEL_ROOT=/models                           # parent directory of model checkpoints
+```
+
+Expected environment:
+
+- Hardware: Ascend A2 or A3 server
+- Software: official vllm-ascend Docker image (see `./Dockerfile` for full contents)
+- TP=16 typical for A3 (16-NPU), TP=8 typical for A2 (8-NPU)
+
+## 1) Fast triage commands
+
+```bash
+MODEL_PATH=${MODEL_ROOT}/<model_name>
+echo "MODEL_PATH=$MODEL_PATH"
+
+# model inventory
+ls -la "$MODEL_PATH"
+
+# architecture + quant hints
+rg -n "architectures|model_type|quantization_config|torch_dtype|max_position_embeddings|num_nextn_predict_layers|version|num_attention_heads|num_key_value_heads|num_experts" "$MODEL_PATH/config.json"
+
+# state-dict key layout hints (if an index exists)
+ls -la "$MODEL_PATH"/*index*.json 2>/dev/null || true
+
+# model custom code (if it exists)
+ls -la "$MODEL_PATH"/*.py 2>/dev/null || true
+```
+
+## 2) Confirm implementation and delivery roots
+
+```bash
+# implementation roots (fixed by the Dockerfile)
+cd "$VLLM_SRC" && git status -s
+cd "$VLLM_ASCEND_SRC" && git status -s
+
+# runtime import source check (expect a vllm-workspace path)
+python - <<'PY'
+import vllm
+print(vllm.__file__)
+PY
+
+# direct-run working directory
+cd "$WORK_DIR" && pwd
+
+# delivery root (current repo)
+cd <delivery_repo>
+git status -s
+```
+
+## 3) Session hygiene (before rerun)
+
+```bash
+# stop stale servers
+pkill -f "vllm serve|api_server|EngineCore" || true
+
+# confirm port 8000 is free
+netstat -ltnp 2>/dev/null | rg ':8000' || true
+```
+
+When the user explicitly requests a reset:
+
+```bash
+cd "$VLLM_SRC" && git reset --hard && git clean -fd
+cd "$VLLM_ASCEND_SRC" && git reset --hard && git clean -fd
+```
+
+## 4) New model onboarding checklist
+
+```bash
+# architecture mapping check in vLLM
+rg -n "<ArchitectureName>|registry" "$VLLM_SRC"/vllm/model_executor/models/registry.py
+
+# optional: inspect the model config and weight index quickly
+cat "$MODEL_PATH/config.json"
+cat "$MODEL_PATH"/*index*.json 2>/dev/null || true
+```
+
+If the architecture is missing/incompatible, minimally do:
+
+1. Add a model adapter under `$VLLM_SRC/vllm/model_executor/models/<model_name>.py`.
+2. Add a processor under `$VLLM_SRC/vllm/transformers_utils/processors/<model_name>.py` when needed.
+3. Register the architecture in `$VLLM_SRC/vllm/model_executor/models/registry.py` (see the sketch below).
+4. Add explicit loader/remap rules for checkpoint key patterns (qkv/norm/rope/fp8 scales).
+5. Touch `$VLLM_ASCEND_SRC` only when backend-specific errors are confirmed.
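+
+A hedged sketch for step 3: before editing `registry.py` in-tree, the
+architecture mapping can be prototyped with vLLM's public out-of-tree
+registration API. `MyModelForCausalLM` and the module path are placeholders
+for the real architecture name and adapter file.
+
+```python
+# Hedged sketch of architecture registration (step 3 above).
+from vllm import ModelRegistry
+
+# The architecture string must match the "architectures" entry in the
+# checkpoint's config.json. The lazy "module:Class" string form defers
+# importing the model code until it is actually needed.
+ModelRegistry.register_model(
+    "MyModelForCausalLM",
+    "vllm.model_executor.models.my_model:MyModelForCausalLM",
+)
+
+# The in-tree equivalent is one entry in registry.py's model table,
+# roughly: "MyModelForCausalLM": ("my_model", "MyModelForCausalLM").
+```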
+
+## 5) Typical implementation touch points
+
+- `$VLLM_SRC/vllm/model_executor/models/<model_name>.py`
+- `$VLLM_SRC/vllm/transformers_utils/processors/<model_name>.py`
+- `$VLLM_SRC/vllm/model_executor/models/registry.py`
+- `$VLLM_ASCEND_SRC/vllm_ascend/...` (only if backend behavior requires it)
+
+## 6) Syntax sanity checks
+
+```bash
+python -m py_compile \
+  "$VLLM_SRC"/vllm/model_executor/models/<model_name>.py
+
+python -m py_compile \
+  "$VLLM_SRC"/vllm/transformers_utils/processors/<model_name>.py 2>/dev/null || true
+```
+
+## 7) Two-stage serve templates (direct run, default `:8000`)
+
+### Stage A: dummy fast gate (first try)
+
+```bash
+cd "$WORK_DIR"
+MODEL_PATH=${MODEL_ROOT}/<model_name>
+
+HCCL_OP_EXPANSION_MODE=AIV \
+VLLM_ASCEND_ENABLE_FLASHCOMM1=0 \
+vllm serve "$MODEL_PATH" \
+  --served-model-name <served_model_name> \
+  --trust-remote-code \
+  --dtype bfloat16 \
+  --max-model-len <max_model_len> \
+  --tensor-parallel-size <tp_size> \
+  --max-num-seqs 16 \
+  --load-format dummy \
+  --port 8000
+```
+
+### Stage B: real-weight mandatory gate
+
+```bash
+# remove this from Stage A:
+--load-format dummy
+```
+
+> Note: dummy is not equivalent to real weights. The real gate is mandatory before sign-off.
+
+### EP + ACLGraph (feature-first, MoE only)
+
+```bash
+# add to Stage B when the model is MoE and validating EP:
+--enable-expert-parallel
+```
+
+### flashcomm1 check (MoE only)
+
+```bash
+# only evaluate flashcomm1 when the model is MoE
+VLLM_ASCEND_ENABLE_FLASHCOMM1=1
+```
+
+### Eager fallback (isolation)
+
+```bash
+# add to the command for isolation only:
+--enforce-eager
+```
+
+### TorchDynamo fallback (for VL interpolate-contiguous failures)
+
+```bash
+# add this env var when logs contain:
+# torch._dynamo.exc.TorchRuntimeError + interpolate +
+# "NPU contiguous operator only supported contiguous memory format"
+TORCHDYNAMO_DISABLE=1
+```
+
+## 8) Readiness + smoke checks (must verify true-ready)
+
+```bash
+# readiness
+for i in $(seq 1 200); do
+  curl -sf http://127.0.0.1:8000/v1/models >/tmp/models.json && break
+  sleep 3
+done
+
+# text smoke (required)
+curl -s http://127.0.0.1:8000/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"model":"<served_model_name>","messages":[{"role":"user","content":"say hi"}],"temperature":0,"max_tokens":16}'
+
+# VL smoke (required for multimodal models)
+# send one text+image OpenAI-compatible request and require non-empty choices.
+```
+
+> `Application startup complete` alone is not success. If the first request crashes, treat it as a runtime failure (false-ready).
+
+## 9) Feature validation checklist (default out-of-box)
+
+1. `GET /v1/models` returns 200.
+2. A text request returns 200 and non-empty output.
+3. If a VL model: a text+image request returns 200.
+4. ACLGraph evidence exists (`Replaying aclgraph`) where expected.
+5. The EP path is validated only for MoE models; non-MoE must be marked not-applicable.
+6. flashcomm1 is validated only for MoE models; non-MoE must be marked not-applicable.
+7. MTP status is verified from the config + weight index (enabled vs checkpoint-missing).
+8. Dummy-vs-real differences are explicitly reported (if any).
+9. Any false-ready case is explicitly marked as a failure (with its log signature).
+
+## 10) Fallback ladder (recommended order)
+
+1. Keep the same params and reproduce once to ensure a deterministic failure signature.
+2. Add `--enforce-eager` to isolate graph-capture influence.
+3. For VL + dynamo/interpolate/contiguous failures, add `TORCHDYNAMO_DISABLE=1`.
+4. For multimodal-processor suspicion, isolate text-only by:
+   - `--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`
+   - then checking whether the failure moves from the processor layer to the model core.
+5. If the issue persists, map the failure signature to a known-good implementation and patch minimal code.
+
+## 11) Capacity baseline + sweep
+
+- Baseline (single machine): **`max-model-len=128k` + `max-num-seqs=16`**.
+- If the baseline passes, expand to `max-num-seqs=32/64` when requested.
+- If the baseline cannot pass due to hardware/runtime limits, report the explicit root cause.
+
+## 12) Delivery checklist
+
+```bash
+# in the current working repo (delivery root)
+git add <changed_files>
+git commit -sm "<commit message>"
+```
+
+Confirm:
+
+- one signed commit only
+- Chinese analysis + Chinese runbook present
+- feature status matrix included with pass/fail reasons
+- dummy-stage and real-stage validation evidence included
+- false-ready cases (if any) documented with final fallback status
+
+### Test config generation
+
+- Generate `tests/e2e/models/configs/<model_name>.yaml` using accuracy results from evaluation.
+- Must include: `model_name` (HF path), `hardware` (e.g. "Atlas A2 Series"), `tasks` (a list with `name` and `metrics` containing `name` + `value`), `num_fewshot`.
+- Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`).
+
+### Tutorial doc generation
+
+- Generate `docs/source/tutorials/models/<model_name>.md` from the standard template.
+- Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl request, accuracy table.
+- Must include the sections: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance.
+- Update `docs/source/tutorials/models/index.md` to include the new tutorial entry.
+
+### GitHub issue comment
+
+- Post the SKILL.md content or an AI-assisted workflow summary as a comment on the originating GitHub issue.
+
+Confirm both the test config YAML and the tutorial doc are included in the signed commit.
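+
+As a hedged aid for the test-config step, a small sanity check built only
+from the schema fields listed above (`model_name`, `hardware`, `tasks`,
+`num_fewshot`); it assumes PyYAML is available and takes the YAML path as
+its first argument.
+
+```python
+# Hedged sanity check for the generated e2e test config. The required
+# keys mirror the "Test config generation" notes above; nothing else
+# about the schema is assumed.
+import sys
+
+import yaml
+
+with open(sys.argv[1]) as f:
+    cfg = yaml.safe_load(f)
+
+missing = {"model_name", "hardware", "tasks", "num_fewshot"} - cfg.keys()
+assert not missing, f"missing top-level keys: {missing}"
+
+for task in cfg["tasks"]:
+    assert "name" in task, "each task needs a name"
+    for metric in task.get("metrics", []):
+        assert {"name", "value"} <= metric.keys(), "metric needs name + value"
+
+print("schema ok:", cfg["model_name"])
+```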