[Doc][Skill] Introduce AI-assisted model-adaptation workflow for vllm-ascend (#6731)

### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready by requiring request-level checks (not startup log only)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-02-26 08:48:15 +08:00
parent 3b59d0ebe9
commit 29e3cdde20
7 changed files with 838 additions and 0 deletions
--- a/.agents/skills/vllm-ascend-model-adapter/references/deliverables.md
+++ b/.agents/skills/vllm-ascend-model-adapter/references/deliverables.md
@@ -0,0 +1,47 @@
+# Deliverables
+
+## Required outputs in current repo
+
+1. One final signed commit (`git commit -sm ...`) containing the adaptation changes.
+2. Chinese analysis report (concise but complete):
+   - model architecture summary
+   - incompatibility root causes
+   - code changes and rationale
+   - startup and inference verification evidence
+   - feature status matrix (supported / unsupported / checkpoint-missing / not-applicable)
+   - max model len: config theoretical vs runtime practical
+   - dummy-vs-real validation matrix (what dummy proved / what only real proved)
+   - false-ready cases and final resolution path (if any)
+   - fallback ladder evidence (which fallback was tried, what changed)
+3. Chinese compact runbook:
+   - how to start server in `/workspace` (direct command, default `:8000`)
+   - how to run OpenAI-compatible validation
+   - optional eager fallback command
+   - optional `TORCHDYNAMO_DISABLE=1` fallback command (if relevant)
+4. Test config YAML at `tests/e2e/models/configs/<ModelName>.yaml` — must include `model_name`, `hardware`, `tasks` with accuracy metrics (name + value), and `num_fewshot`. Use accuracy results from evaluation to populate metric values. Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`).
+5. Tutorial doc at `docs/source/tutorials/models/<ModelName>.md` — must follow the standard template: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance. Fill in model-specific details (HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl, accuracy table).
+6. Post SKILL.md content or AI-assisted workflow summary as a comment on the originating GitHub issue.
+
+## Commit discipline
+
+- Keep one signed commit for code changes in the current working repo.
+- If implementation occurred in `/vllm-workspace/*`, backport minimal final diff to current repo before commit.
+- Keep diff scoped to target model adaptation.
+
+## Validation discipline
+
+- Always provide log file paths for key claims.
+- Keep docs synchronized with latest successful test mode (do not leave stale command variants as default).
+- Final report must include pass/fail reason for each key feature attempt: ACLGraph / EP / flashcomm1 / MTP / multimodal.
+- EP and flashcomm1 are MoE-only checks; for non-MoE models mark as not-applicable with evidence.
+- Final report should include baseline capacity result (`128k + bs16`) or explicit reason if not feasible.
+- Dummy-first can be used to speed up iterations, but real-weight gate is mandatory before final sign-off.
+- Startup-only evidence is insufficient; include first-request smoke results.
+
+## Suggested final response structure
+
+- What changed
+- What went well / what went wrong
+- Validation performed
+- Commit hash and changed files
+- Optional next step
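
The required fields for the test config in `deliverables.md` (item 4) can be checked mechanically before committing. The following is a minimal sketch, assuming the YAML has already been parsed into a plain dict and that `tasks` is a list of mappings with `name` and `metrics` entries as described above. The helper name `missing_config_fields`, the example task name, and the placeholder values are hypothetical; the authoritative schema is whatever the existing configs such as `Qwen3-8B.yaml` use.

```python
from typing import Any

# Top-level keys that deliverables.md lists as mandatory.
REQUIRED_TOP_LEVEL = ("model_name", "hardware", "tasks", "num_fewshot")


def missing_config_fields(cfg: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a parsed e2e model config."""
    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in cfg]
    for i, task in enumerate(cfg.get("tasks") or []):
        if not isinstance(task, dict):
            problems.append(f"tasks[{i}]: not a mapping")
            continue
        if "name" not in task:
            problems.append(f"tasks[{i}]: missing name")
        metrics = task.get("metrics") or []
        if not metrics:
            problems.append(f"tasks[{i}]: no metrics")
        for j, metric in enumerate(metrics):
            for field in ("name", "value"):
                if field not in metric:
                    problems.append(f"tasks[{i}].metrics[{j}]: missing {field}")
    return problems


# Placeholder values only; real metric values must come from an evaluation run.
example = {
    "model_name": "org/ExampleModel",  # hypothetical HF path
    "hardware": "Atlas A2 Series",
    "tasks": [{"name": "example_task", "metrics": [{"name": "example_metric", "value": 0.0}]}],
    "num_fewshot": 0,
}
print(missing_config_fields(example))
```

An empty list means the structural requirements are met; it says nothing about whether the metric values are correct.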
--- a/.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md
+++ b/.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md
@@ -0,0 +1,57 @@
+# FP8-on-NPU Lessons
+
+## 1) Recommended debug order
+
+1. Start with `--load-format dummy` to quickly verify architecture path.
+2. Run with real weights to validate weight mapping and load-time stability.
+3. If blocked by fp8 execution limits on NPU, use fp8->bf16 dequantization loading path.
+4. Validate `/v1/models`, then one text request, then one VL request (if multimodal).
+
+## 2) FP8 checkpoint on NPU
+
+Common symptom:
+
+- `fp8 quantization is currently not supported in npu`.
+
+Recommended pattern:
+
+- do not force fp8 execution kernels on NPU;
+- dequantize fp8 weights to bf16 during loading using paired tensors:
+    - `*.weight`
+    - `*.weight_scale_inv`
+- keep strict unpaired scale/weight checks to avoid silent corruption.
+
+## 3) Typical real-only risks (dummy may not expose)
+
+- missing fp8 scale keys during real shard loading;
+- wrong weight remap path only triggered by real checkpoints;
+- KV/QK norm sharding mismatch under TP + replicated KV heads.
+
+## 4) KV replication + TP pitfalls
+
+Typical symptom:
+
+- shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
+
+Recommended pattern:
+
+- detect KV-head replication explicitly;
+- use local norm/shard loader path for replicated KV heads;
+- avoid assuming uniform divisibility for all head dimensions.
+
+## 5) ACLGraph stability for fp8-origin checkpoints
+
+Recommended pattern:
+
+- prefer `HCCL_OP_EXPANSION_MODE=AIV` when using graph mode;
+- keep practical capture sizes and re-test from small, stable shapes;
+- use `--enforce-eager` only as temporary isolation fallback.
+
+## 6) Reporting discipline
+
+Always report both:
+
+- what dummy validated (fast gate), and
+- what only real weights validated (mandatory gate).
+
+Do not sign off fp8-on-NPU adaptation with dummy-only evidence.
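
The "strict unpaired scale/weight checks" from section 2 of `fp8-on-npu-lessons.md` can be illustrated without any tensor library. The sketch below only covers the key-pairing step, assuming the loader already knows which `*.weight` tensors are stored in fp8 (for example by inspecting their dtype while reading shards). The function name `pair_fp8_tensors` and the example key names are hypothetical, and the actual dequantization arithmetic depends on the checkpoint's quantization scheme, so it is intentionally left out.

```python
WEIGHT_SUFFIX = ".weight"
SCALE_SUFFIX = ".weight_scale_inv"


def pair_fp8_tensors(fp8_weight_keys, all_keys):
    """Map each fp8 weight key to its scale key; raise on any unpaired tensor."""
    all_keys = set(all_keys)
    fp8_weight_keys = set(fp8_weight_keys)
    pairs = {}
    # Every fp8 weight must have a matching scale tensor.
    for weight_key in sorted(fp8_weight_keys):
        if not weight_key.endswith(WEIGHT_SUFFIX):
            raise ValueError(f"unexpected fp8 tensor name: {weight_key}")
        scale_key = weight_key[: -len(WEIGHT_SUFFIX)] + SCALE_SUFFIX
        if scale_key not in all_keys:
            raise ValueError(f"fp8 weight without scale: {weight_key}")
        pairs[weight_key] = scale_key
    # Every scale tensor must belong to a known fp8 weight.
    for key in sorted(all_keys):
        if key.endswith(SCALE_SUFFIX):
            weight_key = key[: -len(SCALE_SUFFIX)] + WEIGHT_SUFFIX
            if weight_key not in fp8_weight_keys:
                raise ValueError(f"scale without fp8 weight: {key}")
    return pairs


# Hypothetical key names; the norm weight is not fp8 and needs no scale.
checkpoint_keys = [
    "layers.0.q_proj.weight",
    "layers.0.q_proj.weight_scale_inv",
    "layers.0.norm.weight",
]
print(pair_fp8_tensors(["layers.0.q_proj.weight"], checkpoint_keys))
```

Raising instead of skipping is the point of the check: a silently ignored scale would leave raw fp8 values interpreted as bf16 weights.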
--- a/.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md
+++ b/.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md
@@ -0,0 +1,64 @@
+# Multimodal + EP + ACLGraph Lessons
+
+This note captures practical patterns that repeatedly matter for VL checkpoints on Ascend.
+
+## 1) Out-of-box feature expectation
+
+Make a best effort to validate these key features by default:
+
+- ACLGraph
+- MTP
+- multimodal (if model supports VL)
+- EP (MoE models only)
+- flashcomm1 (MoE models only)
+
+If any feature fails, keep logs and explain the reason in the final report.
+For non-MoE models, EP/flashcomm1 should be marked not-applicable.
+
+## 2) Validate in this order
+
+1. Single text request success (`/v1/models` + `/v1/chat/completions`).
+2. Single text+image request success.
+3. Graph evidence (`Replaying aclgraph`) when graph mode is expected.
+4. Capacity baseline: `128k + bs16`.
+5. Concurrency expansion if needed (`32/64` suggested).
+
+## 3) EP + graph startup expectations
+
+- Startup latency is much higher than in eager mode because of:
+    - compile warmup
+    - graph capture rounds
+    - multimodal encoder profiling
+- Do not treat slow startup as failure unless logs show hard errors.
+
+## 4) Always distinguish two max lengths
+
+- **Theoretical max**: from model config (`max_position_embeddings`).
+- **Practical max**: largest value that actually starts and serves on current hardware + TP/EP settings.
+
+Report both values explicitly.
+
+## 5) Multimodal testing with temporary layer reduction
+
+- Reducing `num_hidden_layers` can speed up smoke tests.
+- This does **not** remove ViT structure itself.
+- Still require one full-layer validation before final sign-off.
+
+## 6) Feature-status semantics
+
+Use four categories:
+
+- ✅ supported and verified
+- ❌ framework-level unsupported
+- ⚠️ checkpoint missing (weights/config do not provide feature)
+- N/A not-applicable (for example EP/flashcomm1 on non-MoE models)
+
+Typical examples:
+
+- flashcomm1 on non-MoE VL models is often N/A or ❌ depending on framework gate.
+- MTP may be ⚠️ checkpoint missing even if framework has code paths.
+
+## 7) Keep docs and defaults aligned with latest success path
+
+- If EP+graph is validated and requested/expected, it should be the default runbook path.
+- Eager mode should be documented as fallback/troubleshooting only.
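
The four status categories in section 6 of `multimodal-ep-aclgraph-lessons.md`, together with the MoE-only rule for EP and flashcomm1, could be seeded in a report script roughly as follows. This is a sketch, not part of the skill package: the `PENDING` member, the function name `initial_matrix`, and the choice to mark `multimodal` as not-applicable for text-only models are assumptions made for illustration.

```python
from enum import Enum


class FeatureStatus(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    CHECKPOINT_MISSING = "checkpoint-missing"
    NOT_APPLICABLE = "not-applicable"
    PENDING = "pending"  # hypothetical placeholder: not yet tested


FEATURES = ("ACLGraph", "MTP", "multimodal", "EP", "flashcomm1")


def initial_matrix(is_moe: bool, is_multimodal: bool) -> dict[str, FeatureStatus]:
    """Seed the feature matrix before any serve attempt is made."""
    matrix = {name: FeatureStatus.PENDING for name in FEATURES}
    if not is_moe:
        # EP and flashcomm1 are MoE-only checks.
        matrix["EP"] = FeatureStatus.NOT_APPLICABLE
        matrix["flashcomm1"] = FeatureStatus.NOT_APPLICABLE
    if not is_multimodal:
        # Assumption: text-only models skip the VL check entirely.
        matrix["multimodal"] = FeatureStatus.NOT_APPLICABLE
    return matrix


for name, status in initial_matrix(is_moe=False, is_multimodal=True).items():
    print(f"{name}: {status.value}")
```

Every entry still marked `pending` at report time would need either a verified result or a logged failure reason.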
--- a/.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md
+++ b/.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md
@@ -0,0 +1,229 @@
+# Troubleshooting
+
+## Direct run doesn't pick up your code changes
+
+Symptoms:
+
+- `vllm serve` still shows the old behavior after code edits.
+
+Actions:
+
+1. Check runtime import path:
+   ```bash
+   python - <<'PY'
+   import vllm
+   print(vllm.__file__)
+   PY
+   ```
+2. Ensure edits were made under `/vllm-workspace/vllm` and/or `/vllm-workspace/vllm-ascend`.
+3. Avoid a PYTHONPATH-overlay workflow except as a temporary debugging fallback.
+
+## Server fails to bind on `:8000` or fails with HCCL bind errors
+
+Symptoms:
+
+- Port bind fails on startup.
+- HCCL error like `Communication_Error_Bind_IP_Port(EJ0003)`.
+
+Actions:
+
+1. Kill stale `vllm serve` processes.
+2. Ensure `:8000` is free.
+3. Retry clean startup before changing code.
+
+## Startup appears "stuck" in graph mode
+
+Symptoms:
+
+- Process alive, but `curl /v1/models` not ready yet.
+- Logs show compile/graph capture messages for a long time.
+
+Actions:
+
+1. Keep waiting until graph capture completes.
+2. Look for `Capturing CUDA graphs ...` and `Graph capturing finished`.
+3. Only declare failure after an explicit error or timeout window.
+
+## False-ready: startup succeeds but first request crashes
+
+Symptoms:
+
+- `Application startup complete` exists.
+- `GET /v1/models` may return 200.
+- First text or VL request crashes workers/engine.
+
+Actions:
+
+1. Always run at least one text smoke request immediately after ready.
+2. For VL models, always run one text+image smoke request as well.
+3. Treat first-request crash as runtime failure (do not mark as success).
+4. Capture first runtime error signature and branch to targeted fallback.
+
+## Architecture not recognized
+
+Symptoms:
+
+- `ValueError` or log shows unresolved architecture.
+
+Actions:
+
+1. Verify `architectures` in model `config.json`.
+2. Add mapping to `vllm/model_executor/models/registry.py`.
+3. Ensure module and class names exactly match.
+
+## Remote code import fails on transformers symbols
+
+Symptoms:
+
+- Missing class/function in current `transformers`.
+
+Actions:
+
+1. Do not upgrade `transformers`.
+2. Prefer native vLLM implementation.
+3. If unavoidable, copy required modeling files from sibling transformers source.
+
+## Weight loading key mismatch
+
+Symptoms:
+
+- Missing/unexpected key warnings during load.
+
+Actions:
+
+1. Inspect checkpoint key prefixes.
+2. Add explicit mapping logic.
+3. Keep mapping minimal and auditable.
+4. Re-test with full shards, not only tiny-layer smoke runs.
+
+## FP8 checkpoint on Ascend A2/A3 (must dequant to bf16)
+
+Symptoms:
+
+- fp8 kernels unsupported or unstable on Ascend A2/A3.
+
+Actions:
+
+1. Do not force fp8 quantization kernels on Ascend.
+2. Use load-time fp8->bf16 dequantization path (weight + scale pairing).
+3. Add strict unpaired scale/weight checks to avoid silent corruption.
+
+## QK norm mismatch (KV heads / TP / head divisibility)
+
+Symptoms:
+
+- Shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
+- Similar mismatch when head topology is not cleanly divisible.
+
+Actions:
+
+1. Detect KV-head replication case.
+2. Use local `k_norm` shard path for replicated KV heads.
+3. Avoid assumptions that all head dimensions split evenly under current TP.
+4. Validate both normal and edge topology cases explicitly.
+
+## MLA attention runtime failures after ready
+
+Symptoms:
+
+- First request fails with signatures like `AtbRingMLAGetWorkspaceSize` / `AtbRingMLA`.
+- May also show `aclnnFusedInferAttentionScoreV3 ... error code 561002`.
+
+Actions:
+
+1. Reproduce with one minimal text request (deterministic payload).
+2. Try eager isolation (`--enforce-eager`) once to verify whether issue is graph-only.
+3. If eager still fails, prioritize model/backend code fix path (not runtime flags only).
+4. Check `vllm-ascend` MLA/rope/platform implementation used by known-good runs.
+
+## VL + TorchDynamo interpolate contiguous failure
+
+Symptoms:
+
+- `torch._dynamo.exc.TorchRuntimeError`.
+- Stack contains `torch.nn.functional.interpolate`.
+- Error contains `NPU contiguous operator only supported contiguous memory format`.
+
+Actions:
+
+1. Add `TORCHDYNAMO_DISABLE=1` and retry with same serve args.
+2. Validate both text and text+image after startup.
+3. If this stabilizes startup and inference, record it as current fallback path.
+4. Keep code-level fix exploration as next step, but do not block delivery if fallback is accepted.
+
+## Multimodal processor signature mismatch (`skip_tensor_conversion`)
+
+Symptoms:
+
+- Early failure before engine ready.
+- `convert_to_tensors() got an unexpected keyword argument 'skip_tensor_conversion'`.
+
+Actions:
+
+1. Identify processor compatibility mismatch (HF remote processor vs current transformers API).
+2. Use text-only isolation (`--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`) only to separate layers, not as final fix.
+3. Expect potential follow-up core failures after bypassing processor path; keep logs for both layers.
+4. Align to known-good model dispatch and processor compatibility implementation.
+
+## Text-only isolation triggers meta tensor load errors
+
+Symptoms:
+
+- `NotImplementedError: Cannot copy out of meta tensor; no data!`
+- May occur after disabling multimodal prompt items.
+
+Actions:
+
+1. Treat as secondary failure signature (after bypassing earlier MM-processor failure).
+2. Do not assume text-only isolation is universally safe for all VL models.
+3. Return to model-specific code-fix path with captured signatures.
+
+## Config max length works on paper but not in runtime
+
+Symptoms:
+
+- `max_position_embeddings` is large, but the service fails to start or runs out of memory (OOM) with that value.
+
+Actions:
+
+1. Record config max (theoretical).
+2. Find practical max by successful startup + serving under target TP/EP setup.
+3. Report both values explicitly in docs.
+
+## flashcomm1 / MTP confusion on VL checkpoints
+
+Symptoms:
+
+- flashcomm1 enabled but startup fails.
+- MTP expected but no effect.
+
+Actions:
+
+1. Only validate flashcomm1 for MoE models; non-MoE mark as not-applicable.
+2. Verify MTP from both config and weight index (`mtp/nextn` keys).
+3. Mark unsupported vs checkpoint-missing clearly.
+
+## ACL graph capture fails (507903)
+
+Symptoms:
+
+- `AclmdlRICaptureEnd ... 507903`
+- `rtStreamEndCapture ... invalidated stream capture sequence`
+
+Actions:
+
+1. Prefer `HCCL_OP_EXPANSION_MODE=AIV` for graph capture stability.
+2. Reduce shape pressure (`--max-model-len`) and retry.
+3. Temporarily fall back to `--enforce-eager` for isolation.
+
+## API reachable but output quality odd
+
+Symptoms:
+
+- `/v1/models` works but output has template artifacts.
+
+Actions:
+
+1. Use deterministic request (`temperature=0`, bounded `max_tokens`).
+2. Verify endpoint (`/v1/chat/completions` vs `/v1/completions`) matches model template.
+3. Confirm non-empty output and HTTP 200 before success declaration.
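
The MTP check in `troubleshooting.md` ("verify MTP from both config and weight index") can be sketched as a small classifier. It assumes the config exposes `num_nextn_predict_layers` (the key already searched for in the triage commands) and that the weight names come from the `weight_map` of a Hugging Face safetensors index file. The function names, the substring match on `mtp`/`nextn`, and the labels other than `checkpoint-missing` are hypothetical; some architectures use different config keys or weight prefixes, so a manual look at the index remains necessary.

```python
import json
from pathlib import Path


def load_weight_names(index_path: str) -> list[str]:
    """Read tensor names from a safetensors index file (its `weight_map` keys)."""
    index = json.loads(Path(index_path).read_text())
    return list(index["weight_map"])


def mtp_evidence(config: dict, weight_names) -> str:
    """Classify MTP evidence from config and weight names only (no runtime check)."""
    declared = int(config.get("num_nextn_predict_layers") or 0) > 0
    in_weights = any(
        "mtp" in name.lower() or "nextn" in name.lower() for name in weight_names
    )
    if in_weights:
        return "weights-present"  # still needs a runtime verification
    if declared:
        return "checkpoint-missing"  # config expects MTP, checkpoint lacks it
    return "not-declared"


# Hypothetical weight names for illustration.
print(mtp_evidence({"num_nextn_predict_layers": 1}, ["model.layers.0.mlp.weight"]))
```

A `weights-present` result is only a precondition: the feature counts as supported once a served request actually exercises it.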
--- a/.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md
+++ b/.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md
@@ -0,0 +1,255 @@
+# Workflow Checklist
+
+## 0) Environment prerequisites
+
+Set these once per session. Defaults match the official vllm-ascend Docker image.
+
+```bash
+# --- configurable paths (adjust if your layout differs) ---
+VLLM_SRC=/vllm-workspace/vllm              # vLLM source root
+VLLM_ASCEND_SRC=/vllm-workspace/vllm-ascend # vllm-ascend source root
+WORK_DIR=/workspace                         # directory to run vllm serve from
+MODEL_ROOT=/models                          # parent directory of model checkpoints
+```
+
+Expected environment:
+
+- Hardware: Ascend A2 or A3 server
+- Software: official vllm-ascend Docker image (see `./Dockerfile` for full contents)
+- TP=16 typical for A3 (16-NPU), TP=8 typical for A2 (8-NPU)
+
+## 1) Fast triage commands
+
+```bash
+MODEL_PATH=${MODEL_ROOT}/<model-name>
+echo "MODEL_PATH=$MODEL_PATH"
+
+# model inventory
+ls -la "$MODEL_PATH"
+
+# architecture + quant hints
+rg -n "architectures|model_type|quantization_config|torch_dtype|max_position_embeddings|num_nextn_predict_layers|version|num_attention_heads|num_key_value_heads|num_experts" "$MODEL_PATH/config.json"
+
+# state-dict key layout hints (if index exists)
+ls -la "$MODEL_PATH"/*index*.json 2>/dev/null || true
+
+# model custom code (if exists)
+ls -la "$MODEL_PATH"/*.py 2>/dev/null || true
+```
+
+## 2) Confirm implementation and delivery roots
+
+```bash
+# implementation roots (fixed by Dockerfile)
+cd "$VLLM_SRC" && git status -s
+cd "$VLLM_ASCEND_SRC" && git status -s
+
+# runtime import source check (expect vllm-workspace path)
+python - <<'PY'
+import vllm
+print(vllm.__file__)
+PY
+
+# direct-run working directory
+cd "$WORK_DIR" && pwd
+
+# delivery root (current repo)
+cd <current-repo>
+git status -s
+```
+
+## 3) Session hygiene (before rerun)
+
+```bash
+# stop stale servers
+pkill -f "vllm serve|api_server|EngineCore" || true
+
+# confirm port 8000 is free
+netstat -ltnp 2>/dev/null | rg ':8000' || true
+```
+
+When user explicitly requests reset:
+
+```bash
+cd "$VLLM_SRC" && git reset --hard && git clean -fd
+cd "$VLLM_ASCEND_SRC" && git reset --hard && git clean -fd
+```
+
+## 4) New model onboarding checklist
+
+```bash
+# architecture mapping check in vLLM
+rg -n "<ArchitectureClass>|registry" "$VLLM_SRC"/vllm/model_executor/models/registry.py
+
+# optional: inspect model config and weight index quickly
+cat "$MODEL_PATH/config.json"
+cat "$MODEL_PATH"/*index*.json 2>/dev/null || true
+```
+
+If architecture is missing/incompatible, minimally do:
+
+1. Add model adapter under `$VLLM_SRC/vllm/model_executor/models/<new_model>.py`.
+2. Add processor under `$VLLM_SRC/vllm/transformers_utils/processors/<new_model>.py` when needed.
+3. Register architecture in `$VLLM_SRC/vllm/model_executor/models/registry.py`.
+4. Add explicit loader/remap rules for checkpoint key patterns (qkv/norm/rope/fp8 scales).
+5. Touch `$VLLM_ASCEND_SRC` only when backend-specific errors are confirmed.
+
+## 5) Typical implementation touch points
+
+- `$VLLM_SRC/vllm/model_executor/models/<new_model>.py`
+- `$VLLM_SRC/vllm/transformers_utils/processors/<new_model>.py`
+- `$VLLM_SRC/vllm/model_executor/models/registry.py`
+- `$VLLM_ASCEND_SRC/vllm_ascend/...` (only if backend behavior requires it)
+
+## 6) Syntax sanity checks
+
+```bash
+python -m py_compile \
+  "$VLLM_SRC"/vllm/model_executor/models/<new_model>.py
+
+python -m py_compile \
+  "$VLLM_SRC"/vllm/transformers_utils/processors/<new_model>.py 2>/dev/null || true
+```
+
+## 7) Two-stage serve templates (direct run, default `:8000`)
+
+### Stage A: dummy fast gate (first try)
+
+```bash
+cd "$WORK_DIR"
+MODEL_PATH=${MODEL_ROOT}/<model-name>
+
+HCCL_OP_EXPANSION_MODE=AIV \
+VLLM_ASCEND_ENABLE_FLASHCOMM1=0 \
+vllm serve "$MODEL_PATH" \
+  --served-model-name <served-name> \
+  --trust-remote-code \
+  --dtype bfloat16 \
+  --max-model-len <practical-max-len-or-131072> \
+  --tensor-parallel-size <TP-size> \
+  --max-num-seqs 16 \
+  --load-format dummy \
+  --port 8000
+```
+
+### Stage B: real-weight mandatory gate
+
+```bash
+# remove this from Stage A:
+--load-format dummy
+```
+
+> Note: dummy is not equivalent to real weights. Real gate is mandatory before sign-off.
+
+### EP + ACLGraph (feature-first, MoE only)
+
+```bash
+# add to Stage B when model is MoE and validating EP:
+--enable-expert-parallel
+```
+
+### flashcomm1 check (MoE only)
+
+```bash
+# only evaluate flashcomm1 when model is MoE
+VLLM_ASCEND_ENABLE_FLASHCOMM1=1
+```
+
+### Eager fallback (isolation)
+
+```bash
+# add to command for isolation only:
+--enforce-eager
+```
+
+### TorchDynamo fallback (for VL interpolate-contiguous failures)
+
+```bash
+# add env var when logs contain:
+# torch._dynamo.exc.TorchRuntimeError + interpolate +
+# "NPU contiguous operator only supported contiguous memory format"
+TORCHDYNAMO_DISABLE=1
+```
+
+## 8) Readiness + smoke checks (must verify true-ready)
+
+```bash
+# readiness: poll every 3 s, up to 200 attempts (about 10 minutes)
+for i in $(seq 1 200); do
+  curl -sf http://127.0.0.1:8000/v1/models >/tmp/models.json && break
+  sleep 3
+done
+
+# text smoke (required)
+curl -s http://127.0.0.1:8000/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"model":"<served-name>","messages":[{"role":"user","content":"say hi"}],"temperature":0,"max_tokens":16}'
+
+# VL smoke (required for multimodal models)
+# send one text+image OpenAI-compatible request and require non-empty choices.
+```
+
+> `Application startup complete` alone is not success. If first request crashes, treat as runtime failure (false-ready).
+
+## 9) Feature validation checklist (default out-of-box)
+
+1. `GET /v1/models` returns 200.
+2. Text request returns 200 and non-empty output.
+3. If VL model: text+image request returns 200.
+4. ACLGraph evidence exists (`Replaying aclgraph`) where expected.
+5. EP path is validated only for MoE models; non-MoE must be marked not-applicable.
+6. flashcomm1 is validated only for MoE models; non-MoE must be marked not-applicable.
+7. MTP status verified from config + weight index (enabled vs checkpoint-missing).
+8. Dummy-vs-real differences are explicitly reported (if any).
+9. Any false-ready case is explicitly marked as failure (with log signature).
+
+## 10) Fallback ladder (recommended order)
+
+1. Keep same params and reproduce once to ensure deterministic failure signature.
+2. Add `--enforce-eager` to isolate graph-capture influence.
+3. For VL + dynamo/interpolate/contiguous failures, add `TORCHDYNAMO_DISABLE=1`.
+4. For multimodal-processor suspicion, isolate text-only by:
+   - `--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`
+   - then check whether failure moves from processor layer to model core.
+5. If issue persists, map failure signature to known-good implementation and patch minimal code.
+
+## 11) Capacity baseline + sweep
+
+- Baseline (single machine): **`max-model-len=128k` + `max-num-seqs=16`**.
+- If baseline passes, expand to `max-num-seqs=32/64` when requested.
+- If baseline cannot pass due to hardware/runtime limits, report explicit root cause.
+
+## 12) Delivery checklist
+
+```bash
+# in current working repo (delivery root)
+git add <changed-files>
+git commit -sm "<message>"
+```
+
+Confirm:
+
+- one signed commit only
+- Chinese analysis + Chinese runbook present
+- feature status matrix included with pass/fail reason
+- dummy stage and real stage validation evidence included
+- false-ready cases (if any) documented with final fallback status
+
+### Test config generation
+
+- Generate `tests/e2e/models/configs/<ModelName>.yaml` using accuracy results from evaluation.
+- Must include: `model_name` (HF path), `hardware` (e.g. "Atlas A2 Series"), `tasks` (list with `name` and `metrics` containing `name` + `value`), `num_fewshot`.
+- Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`).
+
+### Tutorial doc generation
+
+- Generate `docs/source/tutorials/models/<ModelName>.md` from the standard template.
+- Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl request, accuracy table.
+- Must include sections: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance.
+- Update `docs/source/tutorials/models/index.md` to include the new tutorial entry.
+
+### GitHub issue comment
+
+- Post SKILL.md content or AI-assisted workflow summary as a comment on the originating GitHub issue.
+
+Confirm both test config YAML and tutorial doc are included in the signed commit.
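
The readiness and smoke checks from section 8 of `workflow-checklist.md` can also be scripted so that a false-ready server is reported as a failure rather than noticed by eye. Below is a minimal sketch using only the standard library, assuming the server listens on `127.0.0.1:8000` and returns the usual OpenAI-compatible `choices[].message.content` shape. The helper names (`wait_ready`, `text_smoke`, `has_nonempty_choice`) and the timeout values are hypothetical.

```python
import http.client
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def has_nonempty_choice(body: dict) -> bool:
    """True when a chat-completions response carries non-empty text."""
    choices = body.get("choices") or []
    return any((c.get("message") or {}).get("content") for c in choices)


def wait_ready(attempts: int = 200, delay_s: float = 3.0) -> bool:
    """Poll /v1/models until it answers 200 or the attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{BASE_URL}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # server not up yet (URLError is a subclass of OSError)
        time.sleep(delay_s)
    return False


def text_smoke(served_name: str) -> bool:
    """Send one deterministic chat request; startup logs alone are not enough."""
    payload = {
        "model": served_name,
        "messages": [{"role": "user", "content": "say hi"}],
        "temperature": 0,
        "max_tokens": 16,
    }
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return resp.status == 200 and has_nonempty_choice(json.load(resp))
    except (OSError, http.client.HTTPException, ValueError):
        return False  # crash, HTTP error, or malformed JSON: treat as runtime failure
```

A caller would run `wait_ready()` first and then `text_smoke("<served-name>")`, treating a `False` from either as a failed gate. A VL model would need an additional text+image request, which is not shown here.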