[Doc][Skill] Introduce AI-assisted model-adaptation workflow for vllm-ascend (#6731)

### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill
package** for `vllm-ascend`.

The goal is to make model adaptation work (especially for recurring
feature-request issues) **repeatable, auditable, and easier to hand
off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
- implementation roots: `/vllm-workspace/vllm` and
`/vllm-workspace/vllm-ascend`
- serving root: `/workspace`
- model path convention: `/models/<model-name>`

2. **Validation strategy**
- Stage A: fast `--load-format dummy` gate
- Stage B: mandatory real-weight gate before sign-off
- avoid false-ready by requiring request-level checks (not startup log
only)

3. **Feature-first verification checklist**
- ACLGraph / EP / flashcomm1 / MTP / multimodal
- explicit `supported / unsupported / not-applicable /
checkpoint-missing` outcomes

4. **Delivery contract**
- minimal scoped code changes
- required artifacts (Chinese report + runbook, e2e config YAML,
tutorial doc)
- one signed commit in delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up
PRs using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future
model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

@@ -0,0 +1,46 @@
# vLLM Ascend Model Adapter Skill
Adapt and debug models for vLLM on Ascend NPU — covering both already-supported
architectures and new models not yet registered in vLLM.
## What it does
This skill guides an AI agent through a deterministic workflow to:
1. Triage a model checkpoint (architecture, quant type, multimodal capability).
2. Implement minimal code changes in `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`.
3. Validate via a two-stage gate (dummy fast gate + real-weight mandatory gate).
4. Deliver one signed commit with code, test config, and tutorial doc.
## File layout
| File | Purpose |
| ---- | ------- |
| `SKILL.md` | Skill definition, constraints, and execution playbook |
| `references/workflow-checklist.md` | Step-by-step commands and templates |
| `references/troubleshooting.md` | Symptom-action pairs for common failures |
| `references/fp8-on-npu-lessons.md` | FP8 checkpoint handling on Ascend |
| `references/multimodal-ep-aclgraph-lessons.md` | VL, EP, and ACLGraph patterns |
| `references/deliverables.md` | Required outputs and commit discipline |
## Quick start
1. Open a conversation with the AI agent inside the vllm-ascend dev container.
2. Invoke the skill (e.g. `/vllm-ascend-model-adapter`).
3. Provide the model path (default `/models/<model-name>`) and the originating issue number.
4. The agent follows the playbook in `SKILL.md` and produces a ready-to-merge commit.
## Key constraints
- Never upgrade `transformers`.
- Start `vllm serve` from `/workspace` (direct command, port 8000).
- Dummy-only evidence is not sufficient — real-weight validation is mandatory.
- Final delivery is exactly one signed commit in the current repo.
## Two-stage validation
- **Stage A (dummy)**: fast architecture / operator / API path check with `--load-format dummy`.
- **Stage B (real)**: real-weight loading, fp8/quant path, KV sharding, runtime stability.
Both stages require request-level verification (`/v1/models` + at least one chat request),
not just startup success.
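A minimal sketch of that request-level check (the served model name is a placeholder):
```bash
# readiness: must return 200 with the model listed
curl -sf http://127.0.0.1:8000/v1/models
# at least one chat request must return 200 with non-empty choices
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"<served-name>","messages":[{"role":"user","content":"say hi"}],"temperature":0,"max_tokens":16}'
```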


@@ -0,0 +1,140 @@
---
name: vllm-ascend-model-adapter
description: "Adapt and debug existing or new models for vLLM on Ascend NPU. Implement in /vllm-workspace/vllm and /vllm-workspace/vllm-ascend, validate via direct vllm serve from /workspace, and deliver one signed commit in the current repo."
---
# vLLM Ascend Model Adapter
## Overview
Adapt Hugging Face or local models to run on `vllm-ascend` with minimal changes, deterministic validation, and single-commit delivery. This skill is for both already-supported models and new architectures not yet registered in vLLM.
## Read order
1. Start with `references/workflow-checklist.md`.
2. Read `references/multimodal-ep-aclgraph-lessons.md` (feature-first checklist).
3. If startup/inference fails, read `references/troubleshooting.md`.
4. If checkpoint is fp8-on-NPU, read `references/fp8-on-npu-lessons.md`.
5. Before handoff, read `references/deliverables.md`.
## Hard constraints
- Never upgrade `transformers`.
- Primary implementation roots are fixed by Dockerfile:
- `/vllm-workspace/vllm`
- `/vllm-workspace/vllm-ascend`
- Start `vllm serve` from `/workspace` with direct command by default.
- Default API port is `8000` unless user explicitly asks otherwise.
- Feature-first default: make a best-effort attempt to validate ACLGraph / EP / flashcomm1 / MTP / multimodal out of the box.
- `--enable-expert-parallel` and flashcomm1 checks are MoE-only; for non-MoE models mark as not-applicable with evidence.
- If any feature cannot be enabled, keep evidence and explain reason in final report.
- Do not rely on `PYTHONPATH=<modified-src>:$PYTHONPATH` unless a debugging fallback is strictly needed.
- Keep code changes minimal and focused on the target model.
- Final deliverable commit must be one single signed commit in the current working repo (`git commit -sm ...`).
- Keep final docs in Chinese and compact.
- **Dummy-first is encouraged for speed, but dummy is NOT fully equivalent to real weights.**
- **Never sign off adaptation using dummy-only evidence; real-weight gate is mandatory.**
## Execution playbook
### 1) Collect context
- Confirm model path (default `/models/<model-name>`; if environment differs, confirm with user explicitly).
- Confirm implementation roots (`/vllm-workspace/vllm`, `/vllm-workspace/vllm-ascend`).
- Confirm delivery root (the current git repo where the final commit is expected).
- Confirm runtime import path points to `/vllm-workspace/*` install.
- Use default expected feature set: ACLGraph + EP + flashcomm1 + MTP + multimodal (if model has VL capability).
- User requirements extend this baseline, not replace it.
### 2) Analyze model first
- Inspect `config.json`, processor files, modeling files, tokenizer files.
- Identify architecture class, attention variant, quantization type, and multimodal requirements.
- Check state-dict key prefixes (and safetensors index) to infer mapping needs.
- Decide whether support already exists in `vllm/model_executor/models/registry.py`.
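One hedged way to make that decision quickly, assuming vLLM's public `ModelRegistry` API is available in this build (the architecture name is a placeholder taken from the checkpoint's `config.json`):
```bash
python - <<'PY'
from vllm import ModelRegistry
# Placeholder: use the value of "architectures" from the model's config.json.
print("MyNewArchForCausalLM" in ModelRegistry.get_supported_archs())
PY
```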
### 3) Choose adaptation strategy (new-model capable)
- Reuse existing vLLM architecture if compatible.
- If architecture is missing or incompatible, implement native support:
- add model adapter under `vllm/model_executor/models/`;
- add processor under `vllm/transformers_utils/processors/` when needed;
- register architecture in `vllm/model_executor/models/registry.py`;
- implement explicit weight loading/remap rules (including fp8 scale pairing, KV/QK norm sharding, rope variants).
- If remote code needs newer transformers symbols, do not upgrade dependency.
- If unavoidable, copy required modeling files from sibling transformers source and keep scope explicit.
- If failure is backend-specific (kernel/op/platform), patch minimal required code in `/vllm-workspace/vllm-ascend`.
### 4) Implement minimal code changes (in implementation roots)
- Touch only files required for this model adaptation.
- Keep weight mapping explicit and auditable.
- Avoid unrelated refactors.
### 5) Two-stage validation on Ascend (direct run)
#### Stage A: dummy fast gate (recommended first)
- Run from `/workspace` with `--load-format dummy`.
- Goal: fast validate architecture path / operator path / API path.
- Do not treat `Application startup complete` as pass by itself; request smoke is mandatory.
- Require at least:
- startup readiness (`/v1/models` 200),
- one text request 200,
- if VL model, one text+image request 200,
- ACLGraph evidence where expected.
#### Stage B: real-weight mandatory gate (must pass before sign-off)
- Remove `--load-format dummy` and validate with real checkpoint.
- Goal: validate real-only risks:
- weight key mapping,
- fp8/fp4 dequantization path,
- KV/QK norm sharding with real tensor shapes,
- load-time/runtime stability.
- Require HTTP 200 and non-empty output before declaring success.
- Do not pass Stage B on startup-only evidence.
### 6) Validate inference and features
- Send `GET /v1/models` first.
- Send at least one OpenAI-compatible text request.
- For multimodal models, require at least one text+image request.
- Validate architecture registration and loader path with logs (no unresolved architecture, no fatal missing-key errors).
- Try feature-first validation: EP + ACLGraph path first; eager path as fallback/isolation.
- If startup succeeds but first request crashes (false-ready), treat as runtime failure and continue root-cause isolation.
- For `torch._dynamo` + `interpolate` + `NPU contiguous` failures on VL paths, try `TORCHDYNAMO_DISABLE=1` as diagnostic/stability fallback.
- For multimodal processor API mismatch (for example `skip_tensor_conversion` signature mismatch), use text-only isolation (`--limit-mm-per-prompt` set image/video/audio to 0) to separate processor issues from core weight loading issues.
- Capacity baseline by default (single machine): `max-model-len=128k` + `max-num-seqs=16`.
- Then expand concurrency (e.g., 32/64) if requested or feasible.
### 7) Backport, generate artifacts, and commit in delivery repo
- If implementation happened in `/vllm-workspace/*`, backport minimal final diff to current working repo.
- Generate test config YAML at `tests/e2e/models/configs/<ModelName>.yaml` following the schema of existing configs (must include `model_name`, `hardware`, `tasks` with accuracy metrics, and `num_fewshot`). Use accuracy results from evaluation to populate metric values.
- Generate tutorial markdown at `docs/source/tutorials/models/<ModelName>.md` following the standard template (Introduction, Supported Features, Environment Preparation with docker tabs, Deployment with serve script, Functional Verification with curl example, Accuracy Evaluation, Performance). Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, and accuracy table.
- Update `docs/source/tutorials/models/index.md` to include the new tutorial.
- Confirm test config YAML and tutorial doc are included in the staged files.
- Commit code changes once (single signed commit).
### 8) Prepare handoff artifacts
- Write comprehensive Chinese analysis report.
- Write compact Chinese runbook for server startup and validation commands.
- Include feature status matrix (supported / unsupported / checkpoint-missing / not-applicable).
- Include dummy-vs-real validation matrix and explicit non-equivalence notes.
- Include changed-file list, key logs, and final commit hash.
- Post the SKILL.md content (or a link to it) as a comment on the originating GitHub issue to document the AI-assisted workflow.
## Quality gate before final answer
- Service starts successfully from `/workspace` with direct command.
- OpenAI-compatible inference request succeeds (not startup-only).
- Key feature set is attempted and reported: ACLGraph / EP / flashcomm1 / MTP / multimodal.
- Capacity baseline (`128k + bs16`) result is reported, or explicit reason why not feasible.
- **Dummy stage evidence is present (if used), and real-weight stage evidence is present (mandatory).**
- Test config YAML exists at `tests/e2e/models/configs/<ModelName>.yaml` and follows the established schema (`model_name`, `hardware`, `tasks`, `num_fewshot`).
- Tutorial doc exists at `docs/source/tutorials/models/<ModelName>.md` and follows the standard template (Introduction, Supported Features, Environment Preparation, Deployment, Functional Verification, Accuracy Evaluation, Performance).
- Tutorial index at `docs/source/tutorials/models/index.md` includes the new model entry.
- Exactly one signed commit contains all code changes in current working repo.
- Final response includes commit hash, file paths, key commands, known limits, and failure reasons where applicable.


@@ -0,0 +1,47 @@
# Deliverables
## Required outputs in current repo
1. One final signed commit (`git commit -sm ...`) containing the adaptation changes.
2. Chinese analysis report (concise but complete):
- model architecture summary
- incompatibility root causes
- code changes and rationale
- startup and inference verification evidence
- feature status matrix (supported / unsupported / checkpoint-missing / not-applicable)
- max model len: config theoretical vs runtime practical
- dummy-vs-real validation matrix (what dummy proved / what only real proved)
- false-ready cases and final resolution path (if any)
- fallback ladder evidence (which fallback was tried, what changed)
3. Chinese compact runbook:
- how to start server in `/workspace` (direct command, default `:8000`)
- how to run OpenAI-compatible validation
- optional eager fallback command
- optional `TORCHDYNAMO_DISABLE=1` fallback command (if relevant)
4. Test config YAML at `tests/e2e/models/configs/<ModelName>.yaml` — must include `model_name`, `hardware`, `tasks` with accuracy metrics (name + value), and `num_fewshot`. Use accuracy results from evaluation to populate metric values. Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`); a schema sketch follows this list.
5. Tutorial doc at `docs/source/tutorials/models/<ModelName>.md` — must follow the standard template: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance. Fill in model-specific details (HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, accuracy table).
6. Post SKILL.md content or AI-assisted workflow summary as a comment on the originating GitHub issue.
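A hypothetical sketch of the item-4 schema (model, task, and metric values are placeholders, not measured results):
```bash
cat > tests/e2e/models/configs/ExampleModel.yaml <<'YAML'
model_name: org/ExampleModel   # HF path (placeholder)
hardware: Atlas A2 Series
tasks:
  - name: gsm8k                # placeholder task
    metrics:
      - name: exact_match      # placeholder metric name
        value: 0.85            # fill from the real accuracy evaluation
num_fewshot: 5
YAML
```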
## Commit discipline
- Keep one signed commit for code changes in the current working repo.
- If implementation occurred in `/vllm-workspace/*`, backport minimal final diff to current repo before commit.
- Keep diff scoped to target model adaptation.
## Validation discipline
- Always provide log file paths for key claims.
- Keep docs synchronized with latest successful test mode (do not leave stale command variants as default).
- Final report must include pass/fail reason for each key feature attempt: ACLGraph / EP / flashcomm1 / MTP / multimodal.
- EP and flashcomm1 are MoE-only checks; for non-MoE models mark as not-applicable with evidence.
- Final report should include baseline capacity result (`128k + bs16`) or explicit reason if not feasible.
- Dummy-first can be used to speed up iterations, but real-weight gate is mandatory before final sign-off.
- Startup-only evidence is insufficient; include first-request smoke results.
## Suggested final response structure
- What changed
- What went well / what went wrong
- Validation performed
- Commit hash and changed files
- Optional next step


@@ -0,0 +1,57 @@
# FP8-on-NPU Lessons
## 1) Recommended debug order
1. Start with `--load-format dummy` to quickly verify architecture path.
2. Run with real weights to validate weight mapping and load-time stability.
3. If blocked by fp8 execution limits on NPU, use fp8->bf16 dequantization loading path.
4. Validate `/v1/models`, then one text request, then one VL request (if multimodal).
## 2) FP8 checkpoint on NPU
Common symptom:
- `fp8 quantization is currently not supported in npu`.
Recommended pattern:
- do not force fp8 execution kernels on NPU;
- dequantize fp8 weights to bf16 during loading using paired tensors (a sketch follows this list):
- `*.weight`
- `*.weight_scale_inv`
- keep strict unpaired scale/weight checks to avoid silent corruption.
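A minimal dequantization sketch, assuming a DeepSeek-style 128x128 block layout where `*.weight_scale_inv` holds one inverse scale per block; verify the layout against the actual checkpoint before reusing:
```bash
python - <<'PY'
import torch

def dequant_fp8_to_bf16(weight_fp8, scale_inv, block=128):
    # Upcast first, then expand each per-block scale over its block x block tile.
    w = weight_fp8.to(torch.float32)
    s = scale_inv.repeat_interleave(block, 0)[: w.shape[0]]
    s = s.repeat_interleave(block, 1)[:, : w.shape[1]]
    return (w * s).to(torch.bfloat16)

# Shape-only smoke with stand-in tensors (real checkpoints store fp8 dtypes).
w, s = torch.randn(256, 384), torch.ones(2, 3)
out = dequant_fp8_to_bf16(w, s)
print(out.shape, out.dtype)
PY
```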
## 3) Typical real-only risks (dummy may not expose)
- missing fp8 scale keys during real shard loading;
- wrong weight remap path only triggered by real checkpoints;
- KV/QK norm sharding mismatch under TP + replicated KV heads.
## 4) KV replication + TP pitfalls
Typical symptom:
- shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
Recommended pattern:
- detect KV-head replication explicitly;
- use local norm/shard loader path for replicated KV heads;
- avoid assuming uniform divisibility for all head dimensions.
## 5) ACLGraph stability for fp8-origin checkpoints
Recommended pattern:
- prefer `HCCL_OP_EXPANSION_MODE=AIV` when using graph mode;
- keep practical capture sizes and re-test from small, stable shapes;
- use `--enforce-eager` only as temporary isolation fallback.
## 6) Reporting discipline
Always report both:
- what dummy validated (fast gate), and
- what only real weights validated (mandatory gate).
Do not sign off fp8-on-NPU adaptation with dummy-only evidence.


@@ -0,0 +1,64 @@
# Multimodal + EP + ACLGraph Lessons
This note captures practical patterns that repeatedly matter for VL checkpoints on Ascend.
## 1) Out-of-box feature expectation
Make a best-effort attempt to validate key features by default:
- ACLGraph
- MTP
- multimodal (if model supports VL)
- EP (MoE models only)
- flashcomm1 (MoE models only)
If any feature fails, keep logs and explain the reason in the final report.
For non-MoE models, EP/flashcomm1 should be marked not-applicable.
## 2) Validate in this order
1. Single text request success (`/v1/models` + `/v1/chat/completions`).
2. Single text+image request success.
3. Graph evidence (`Replaying aclgraph`) when graph mode is expected.
4. Capacity baseline: `128k + bs16`.
5. Concurrency expansion if needed (`32/64` suggested).
## 3) EP + graph startup expectations
- Startup latency is much higher than eager due to:
- compile warmup
- graph capture rounds
- multimodal encoder profiling
- Do not treat slow startup as failure unless logs show hard errors.
## 4) Always distinguish two max lengths
- **Theoretical max**: from model config (`max_position_embeddings`).
- **Practical max**: largest value that actually starts and serves on current hardware + TP/EP settings.
Report both values explicitly.
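A small sketch for the theoretical side (assumes `jq` is available; the practical max must still be found empirically by serving):
```bash
# theoretical max from the checkpoint config
jq '.max_position_embeddings' "$MODEL_PATH/config.json"
# practical max: the largest --max-model-len that actually starts and serves
```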
## 5) Multimodal testing with temporary layer reduction
- Reducing `num_hidden_layers` can speed up smoke tests (one hedged way is sketched below).
- This does **not** remove the ViT structure itself.
- Still require one full-layer validation before final sign-off.
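A possible smoke-only invocation, assuming this vLLM build supports the `--hf-overrides` flag; never sign off on truncated-layer results:
```bash
vllm serve "$MODEL_PATH" \
  --hf-overrides '{"num_hidden_layers": 2}' \
  --port 8000
```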
## 6) Feature-status semantics
Use four categories:
- ✅ supported and verified
- ❌ framework-level unsupported
- ⚠️ checkpoint missing (weights/config do not provide feature)
- N/A not-applicable (for example EP/flashcomm1 on non-MoE models)
Typical examples:
- flashcomm1 on non-MoE VL models is often N/A or ❌ depending on framework gate.
- MTP may be ⚠️ checkpoint missing even if framework has code paths.
## 7) Keep docs and defaults aligned with latest success path
- If EP+graph is validated and requested/expected, it should be the default runbook path.
- Eager mode should be documented as fallback/troubleshooting only.


@@ -0,0 +1,229 @@
# Troubleshooting
## Direct run doesn't pick up your code changes
Symptoms:
- `vllm serve` behavior still old after code edits.
Actions:
1. Check runtime import path:
```bash
python - <<'PY'
import vllm
print(vllm.__file__)
PY
```
2. Ensure edits were made under `/vllm-workspace/vllm` and/or `/vllm-workspace/vllm-ascend`.
3. Avoid PYTHONPATH-overlay workflow unless as temporary debugging fallback.
## Server fails to bind on `:8000` or fails with HCCL bind errors
Symptoms:
- Port bind fail on startup.
- HCCL error like `Communication_Error_Bind_IP_Port(EJ0003)`.
Actions:
1. Kill stale `vllm serve` processes.
2. Ensure `:8000` is free.
3. Retry clean startup before changing code.
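Actions 1 and 2 map to the same hygiene commands as in the workflow checklist:
```bash
# stop stale servers, then confirm :8000 is free
pkill -f "vllm serve|api_server|EngineCore" || true
netstat -ltnp 2>/dev/null | rg ':8000' || true
```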
## Startup appears "stuck" in graph mode
Symptoms:
- Process alive, but `curl /v1/models` not ready yet.
- Logs show compile/graph capture messages for a long time.
Actions:
1. Keep waiting until graph capture completes.
2. Look for `Capturing CUDA graphs ...` and `Graph capturing finished`.
3. Only declare failure after an explicit error or timeout window.
## False-ready: startup succeeds but first request crashes
Symptoms:
- `Application startup complete` exists.
- `GET /v1/models` may return 200.
- First text or VL request crashes workers/engine.
Actions:
1. Always run at least one text smoke request immediately after ready.
2. For VL models, always run one text+image smoke request as well.
3. Treat first-request crash as runtime failure (do not mark as success).
4. Capture first runtime error signature and branch to targeted fallback.
## Architecture not recognized
Symptoms:
- `ValueError` or log shows unresolved architecture.
Actions:
1. Verify `architectures` in model `config.json`.
2. Add mapping to `vllm/model_executor/models/registry.py`.
3. Ensure module and class names exactly match.
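A quick sketch for actions 1 and 2, assuming the `$MODEL_PATH` / `$VLLM_SRC` variables from the workflow checklist (the architecture class is a placeholder):
```bash
# what the checkpoint declares
jq '.architectures' "$MODEL_PATH/config.json"
# whether vLLM's registry already maps it
rg -n "<ArchitectureClass>" "$VLLM_SRC"/vllm/model_executor/models/registry.py
```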
## Remote code import fails on transformers symbols
Symptoms:
- Missing class/function in current `transformers`.
Actions:
1. Do not upgrade `transformers`.
2. Prefer native vLLM implementation.
3. If unavoidable, copy required modeling files from sibling transformers source.
## Weight loading key mismatch
Symptoms:
- Missing/unexpected key warnings during load.
Actions:
1. Inspect checkpoint key prefixes.
2. Add explicit mapping logic.
3. Keep mapping minimal and auditable.
4. Re-test with full shards, not only tiny-layer smoke runs.
## FP8 checkpoint on Ascend A2/A3 (must dequant to bf16)
Symptoms:
- fp8 kernels unsupported or unstable on Ascend A2/A3.
Actions:
1. Do not force fp8 quantization kernels on Ascend.
2. Use load-time fp8->bf16 dequantization path (weight + scale pairing).
3. Add strict unpaired scale/weight checks to avoid silent corruption.
## QK norm mismatch (KV heads / TP / head divisibility)
Symptoms:
- Shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
- Similar mismatch when head topology is not cleanly divisible.
Actions:
1. Detect KV-head replication case.
2. Use local `k_norm` shard path for replicated KV heads.
3. Avoid assumptions that all head dimensions split evenly under current TP.
4. Validate both normal and edge topology cases explicitly.
## MLA attention runtime failures after ready
Symptoms:
- First request fails with signatures like `AtbRingMLAGetWorkspaceSize` / `AtbRingMLA`.
- May also show `aclnnFusedInferAttentionScoreV3 ... error code 561002`.
Actions:
1. Reproduce with one minimal text request (deterministic payload).
2. Try eager isolation (`--enforce-eager`) once to verify whether issue is graph-only.
3. If eager still fails, prioritize model/backend code fix path (not runtime flags only).
4. Check `vllm-ascend` MLA/rope/platform implementation used by known-good runs.
## VL + TorchDynamo interpolate contiguous failure
Symptoms:
- `torch._dynamo.exc.TorchRuntimeError`.
- Stack contains `torch.nn.functional.interpolate`.
- Error contains `NPU contiguous operator only supported contiguous memory format`.
Actions:
1. Add `TORCHDYNAMO_DISABLE=1` and retry with same serve args.
2. Validate both text and text+image after startup.
3. If this stabilizes startup and inference, record it as current fallback path.
4. Keep code-level fix exploration as next step, but do not block delivery if fallback is accepted.
## Multimodal processor signature mismatch (`skip_tensor_conversion`)
Symptoms:
- Early failure before engine ready.
- `convert_to_tensors() got an unexpected keyword argument 'skip_tensor_conversion'`.
Actions:
1. Identify processor compatibility mismatch (HF remote processor vs current transformers API).
2. Use text-only isolation (`--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`) only to separate layers, not as final fix.
3. Expect potential follow-up core failures after bypassing processor path; keep logs for both layers.
4. Align to known-good model dispatch and processor compatibility implementation.
## Text-only isolation triggers meta tensor load errors
Symptoms:
- `NotImplementedError: Cannot copy out of meta tensor; no data!`
- May occur after disabling multimodal prompt items.
Actions:
1. Treat as secondary failure signature (after bypassing earlier MM-processor failure).
2. Do not assume text-only isolation is universally safe for all VL models.
3. Return to model-specific code-fix path with captured signatures.
## Config max length works on paper but not at runtime
Symptoms:
- `max_position_embeddings` is large, but service fails or OOM with that value.
Actions:
1. Record config max (theoretical).
2. Find practical max by successful startup + serving under target TP/EP setup.
3. Report both values explicitly in docs.
## flashcomm1 / MTP confusion on VL checkpoints
Symptoms:
- flashcomm1 enabled but startup fails.
- MTP expected but no effect.
Actions:
1. Only validate flashcomm1 for MoE models; mark non-MoE models as not-applicable.
2. Verify MTP from both config and weight index (`mtp/nextn` keys).
3. Mark unsupported vs checkpoint-missing clearly.
## ACL graph capture fails (507903)
Symptoms:
- `AclmdlRICaptureEnd ... 507903`
- `rtStreamEndCapture ... invalidated stream capture sequence`
Actions:
1. Prefer `HCCL_OP_EXPANSION_MODE=AIV` for graph capture stability.
2. Reduce shape pressure (`--max-model-len`) and retry.
3. Temporarily fall back to `--enforce-eager` for isolation.
## API reachable but output quality odd
Symptoms:
- `/v1/models` works but output has template artifacts.
Actions:
1. Use deterministic request (`temperature=0`, bounded `max_tokens`).
2. Verify endpoint (`/v1/chat/completions` vs `/v1/completions`) matches model template.
3. Confirm non-empty output and HTTP 200 before success declaration.


@@ -0,0 +1,255 @@
# Workflow Checklist
## 0) Environment prerequisites
Set these once per session. Defaults match the official vllm-ascend Docker image.
```bash
# --- configurable paths (adjust if your layout differs) ---
VLLM_SRC=/vllm-workspace/vllm # vLLM source root
VLLM_ASCEND_SRC=/vllm-workspace/vllm-ascend # vllm-ascend source root
WORK_DIR=/workspace # directory to run vllm serve from
MODEL_ROOT=/models # parent directory of model checkpoints
```
Expected environment:
- Hardware: Ascend A2 or A3 server
- Software: official vllm-ascend Docker image (see `./Dockerfile` for full contents)
- TP=16 typical for A3 (16-NPU), TP=8 typical for A2 (8-NPU)
## 1) Fast triage commands
```bash
MODEL_PATH=${MODEL_ROOT}/<model-name>
echo "MODEL_PATH=$MODEL_PATH"
# model inventory
ls -la "$MODEL_PATH"
# architecture + quant hints
rg -n "architectures|model_type|quantization_config|torch_dtype|max_position_embeddings|num_nextn_predict_layers|version|num_attention_heads|num_key_value_heads|num_experts" "$MODEL_PATH/config.json"
# state-dict key layout hints (if index exists)
ls -la "$MODEL_PATH"/*index*.json 2>/dev/null || true
# model custom code (if exists)
ls -la "$MODEL_PATH"/*.py 2>/dev/null || true
```
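If a weight index exists, a hedged way to dump unique state-dict key prefixes (assumes `jq`; strips the trailing tensor suffix to show module paths):
```bash
jq -r '.weight_map | keys[]' "$MODEL_PATH"/*index*.json 2>/dev/null \
  | sed 's/\.[^.]*$//' | sort -u | head -40
```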
## 2) Confirm implementation and delivery roots
```bash
# implementation roots (fixed by Dockerfile)
cd "$VLLM_SRC" && git status -s
cd "$VLLM_ASCEND_SRC" && git status -s
# runtime import source check (expect vllm-workspace path)
python - <<'PY'
import vllm
print(vllm.__file__)
PY
# direct-run working directory
cd "$WORK_DIR" && pwd
# delivery root (current repo)
cd <current-repo>
git status -s
```
## 3) Session hygiene (before rerun)
```bash
# stop stale servers
pkill -f "vllm serve|api_server|EngineCore" || true
# confirm port 8000 is free
netstat -ltnp 2>/dev/null | rg ':8000' || true
```
When user explicitly requests reset:
```bash
cd "$VLLM_SRC" && git reset --hard && git clean -fd
cd "$VLLM_ASCEND_SRC" && git reset --hard && git clean -fd
```
## 4) New model onboarding checklist
```bash
# architecture mapping check in vLLM
rg -n "<ArchitectureClass>|registry" "$VLLM_SRC"/vllm/model_executor/models/registry.py
# optional: inspect model config and weight index quickly
cat "$MODEL_PATH/config.json"
cat "$MODEL_PATH"/*index*.json 2>/dev/null || true
```
If the architecture is missing or incompatible, the minimal steps are:
1. Add model adapter under `$VLLM_SRC/vllm/model_executor/models/<new_model>.py`.
2. Add processor under `$VLLM_SRC/vllm/transformers_utils/processors/<new_model>.py` when needed.
3. Register architecture in `$VLLM_SRC/vllm/model_executor/models/registry.py`.
4. Add explicit loader/remap rules for checkpoint key patterns (qkv/norm/rope/fp8 scales).
5. Touch `$VLLM_ASCEND_SRC` only when backend-specific errors are confirmed.
## 5) Typical implementation touch points
- `$VLLM_SRC/vllm/model_executor/models/<new_model>.py`
- `$VLLM_SRC/vllm/transformers_utils/processors/<new_model>.py`
- `$VLLM_SRC/vllm/model_executor/models/registry.py`
- `$VLLM_ASCEND_SRC/vllm_ascend/...` (only if backend behavior requires it)
## 6) Syntax sanity checks
```bash
python -m py_compile \
"$VLLM_SRC"/vllm/model_executor/models/<new_model>.py
python -m py_compile \
"$VLLM_SRC"/vllm/transformers_utils/processors/<new_model>.py 2>/dev/null || true
```
## 7) Two-stage serve templates (direct run, default `:8000`)
### Stage A: dummy fast gate (first try)
```bash
cd "$WORK_DIR"
MODEL_PATH=${MODEL_ROOT}/<model-name>
HCCL_OP_EXPANSION_MODE=AIV \
VLLM_ASCEND_ENABLE_FLASHCOMM1=0 \
vllm serve "$MODEL_PATH" \
--served-model-name <served-name> \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len <practical-max-len-or-131072> \
--tensor-parallel-size <TP-size> \
--max-num-seqs 16 \
--load-format dummy \
--port 8000
```
### Stage B: real-weight mandatory gate
```bash
# remove this from Stage A:
--load-format dummy
```
> Note: dummy is not equivalent to real weights. Real gate is mandatory before sign-off.
### EP + ACLGraph (feature-first, MoE only)
```bash
# add to Stage B when model is MoE and validating EP:
--enable-expert-parallel
```
### flashcomm1 check (MoE only)
```bash
# only evaluate flashcomm1 when model is MoE
VLLM_ASCEND_ENABLE_FLASHCOMM1=1
```
### Eager fallback (isolation)
```bash
# add to command for isolation only:
--enforce-eager
```
### TorchDynamo fallback (for VL interpolate-contiguous failures)
```bash
# add env var when logs contain:
# torch._dynamo.exc.TorchRuntimeError + interpolate +
# "NPU contiguous operator only supported contiguous memory format"
TORCHDYNAMO_DISABLE=1
```
## 8) Readiness + smoke checks (must verify true-ready)
```bash
# readiness
for i in $(seq 1 200); do
curl -sf http://127.0.0.1:8000/v1/models >/tmp/models.json && break
sleep 3
done
# text smoke (required)
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"<served-name>","messages":[{"role":"user","content":"say hi"}],"temperature":0,"max_tokens":16}'
# VL smoke (required for multimodal models): one text+image request,
# require non-empty choices. The image URL below is a placeholder.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"<served-name>","messages":[{"role":"user","content":[{"type":"text","text":"Describe this image."},{"type":"image_url","image_url":{"url":"https://example.com/test.jpg"}}]}],"temperature":0,"max_tokens":32}'
```
> `Application startup complete` alone is not success. If first request crashes, treat as runtime failure (false-ready).
## 9) Feature validation checklist (default out-of-box)
1. `GET /v1/models` returns 200.
2. Text request returns 200 and non-empty output.
3. If VL model: text+image request returns 200.
4. ACLGraph evidence exists (`Replaying aclgraph`) where expected.
5. EP path is validated only for MoE models; non-MoE must be marked not-applicable.
6. flashcomm1 is validated only for MoE models; non-MoE must be marked not-applicable.
7. MTP status verified from config + weight index (enabled vs checkpoint-missing).
8. Dummy-vs-real differences are explicitly reported (if any).
9. Any false-ready case is explicitly marked as failure (with log signature).
## 10) Fallback ladder (recommended order)
1. Keep same params and reproduce once to ensure deterministic failure signature.
2. Add `--enforce-eager` to isolate graph-capture influence.
3. For VL + dynamo/interpolate/contiguous failures, add `TORCHDYNAMO_DISABLE=1`.
4. For multimodal-processor suspicion, isolate text-only by:
- `--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`
- then check whether failure moves from processor layer to model core.
5. If issue persists, map failure signature to known-good implementation and patch minimal code.
## 11) Capacity baseline + sweep
- Baseline (single machine): **`max-model-len=128k` + `max-num-seqs=16`**.
- If baseline passes, expand to `max-num-seqs=32/64` when requested.
- If the baseline cannot pass due to hardware/runtime limits, report the explicit root cause.
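In flag form (128k = 131072; widen only after the baseline passes):
```bash
# baseline
--max-model-len 131072 --max-num-seqs 16
# expansion, when requested and stable
--max-num-seqs 32   # then 64
```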
## 12) Delivery checklist
```bash
# in current working repo (delivery root)
git add <changed-files>
git commit -sm "<message>"
```
Confirm:
- one signed commit only
- Chinese analysis + Chinese runbook present
- feature status matrix included with pass/fail reason
- dummy stage and real stage validation evidence included
- false-ready cases (if any) documented with final fallback status
### Test config generation
- Generate `tests/e2e/models/configs/<ModelName>.yaml` using accuracy results from evaluation.
- Must include: `model_name` (HF path), `hardware` (e.g. "Atlas A2 Series"), `tasks` (list with `name` and `metrics` containing `name` + `value`), `num_fewshot`.
- Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`).
### Tutorial doc generation
- Generate `docs/source/tutorials/models/<ModelName>.md` from the standard template.
- Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl request, accuracy table.
- Must include sections: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance.
- Update `docs/source/tutorials/models/index.md` to include the new tutorial entry.
### GitHub issue comment
- Post SKILL.md content or AI-assisted workflow summary as a comment on the originating GitHub issue.
Confirm both test config YAML and tutorial doc are included in the signed commit.