### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model-adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready sign-off by requiring request-level checks (not startup logs alone)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal, scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model-support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs, using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
# FP8-on-NPU Lessons

## 1) Recommended debug order

- Start with `--load-format dummy` to quickly verify the architecture path.
- Run with real weights to validate weight mapping and load-time stability.
- If blocked by fp8 execution limits on NPU, use the fp8->bf16 dequantization loading path.
- Validate `/v1/models`, then one text request, then one VL request (if multimodal); see the sketch after this list.
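The request-level checks above can be scripted; a minimal sketch is shown below. The host/port, served model name, and image URL are illustrative assumptions, and an OpenAI-compatible vLLM server is assumed to be already running.

```python
# Sketch of a request-level readiness probe (illustrative values, not part of the skill package).
import requests

BASE = "http://localhost:8000"   # assumed serving endpoint
MODEL = "/models/example-model"  # hypothetical served model name

# 1) /v1/models must list the served model.
models = requests.get(f"{BASE}/v1/models", timeout=30).json()
assert any(m["id"] == MODEL for m in models["data"]), models

# 2) One plain text request must return a non-empty completion.
text = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": MODEL,
          "messages": [{"role": "user", "content": "Say hello in one word."}],
          "max_tokens": 16},
    timeout=120,
).json()
assert text["choices"][0]["message"]["content"].strip(), text

# 3) For multimodal models, one VL request with an image URL (hypothetical URL).
vl = requests.post(
    f"{BASE}/v1/chat/completions",
    json={"model": MODEL,
          "messages": [{"role": "user", "content": [
              {"type": "text", "text": "Describe this image in one sentence."},
              {"type": "image_url",
               "image_url": {"url": "https://example.com/sample.jpg"}},
          ]}],
          "max_tokens": 64},
    timeout=300,
).json()
assert vl["choices"][0]["message"]["content"].strip(), vl
print("request-level gate passed")
```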
## 2) FP8 checkpoint on NPU

Common symptom: `fp8 quantization is currently not supported in npu.`

Recommended pattern:

- do not force fp8 execution kernels on NPU;
- dequantize fp8 weights to bf16 during loading using the paired tensors `*.weight` / `*.weight_scale_inv` (sketched below);
- keep strict unpaired scale/weight checks to avoid silent corruption.
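To make the pattern concrete, here is a hedged sketch of dequantize-on-load. It assumes blockwise fp8 checkpoints in which each `torch.float8_e4m3fn` weight is paired with a `*.weight_scale_inv` tensor of per-block scales; the helper names and the 128x128 block size are assumptions, not the actual vllm-ascend implementation, so check the checkpoint's `quantization_config` before reusing them.

```python
# Sketch: fp8 -> bf16 dequantization at load time using paired scale tensors.
import torch

def dequant_fp8_to_bf16(weight: torch.Tensor,
                        scale_inv: torch.Tensor,
                        block: int = 128) -> torch.Tensor:
    """weight: fp8 [out, in]; scale_inv: per-block scales [ceil(out/block), ceil(in/block)]."""
    w = weight.to(torch.float32)
    s = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    s = s[: w.shape[0], : w.shape[1]]  # trim padding from the last partial blocks
    return (w * s).to(torch.bfloat16)

def dequant_state_dict(state_dict: dict) -> dict:
    """Replace fp8 weights with bf16, refusing to continue on unpaired tensors."""
    out = {}
    for name, tensor in state_dict.items():
        if name.endswith(".weight_scale_inv"):
            continue                          # consumed together with its weight
        if tensor.dtype == torch.float8_e4m3fn:
            scale_name = name + "_scale_inv"  # e.g. "...weight" -> "...weight_scale_inv"
            if scale_name not in state_dict:
                raise ValueError(f"fp8 weight without a scale tensor: {name}")
            out[name] = dequant_fp8_to_bf16(tensor, state_dict[scale_name])
        else:
            out[name] = tensor
    return out
```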
## 3) Typical real-only risks (dummy may not expose)

- missing fp8 scale keys during real shard loading (an index audit sketch follows this list);
- wrong weight remap path only triggered by real checkpoints;
- KV/QK norm sharding mismatch under TP + replicated KV heads.
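The first risk can be surfaced before a long real-weight load by auditing the checkpoint index up front. A hedged sketch follows; it assumes the usual Hugging Face sharded layout with `model.safetensors.index.json`, and the path and pairing suffix are illustrative.

```python
# Sketch: pre-flight audit for unpaired fp8 weight/scale keys in a sharded checkpoint.
import json
from pathlib import Path

def audit_fp8_pairs(model_dir: str) -> None:
    index = json.loads((Path(model_dir) / "model.safetensors.index.json").read_text())
    keys = set(index["weight_map"])

    scales = {k for k in keys if k.endswith(".weight_scale_inv")}
    paired_weights = {k[: -len("_scale_inv")] for k in scales}  # "...weight_scale_inv" -> "...weight"

    missing_weights = sorted(paired_weights - keys)
    if missing_weights:
        print("scales whose paired .weight key is missing:", missing_weights)

    # Weights with no scale are only *candidates* for review: norms, embeddings, and
    # lm_head are typically unquantized, so cross-check against quantization_config.
    unscaled = sorted(k for k in keys
                      if k.endswith(".weight") and k + "_scale_inv" not in keys)
    print(f"{len(scales)} scale tensors, {len(unscaled)} weights without a scale")

audit_fp8_pairs("/models/example-model")  # hypothetical model path
```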
## 4) KV replication + TP pitfalls

Typical symptom:

- shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.

Recommended pattern:

- detect KV-head replication explicitly (see the sketch after this list);
- use the local norm/shard loader path for replicated KV heads;
- avoid assuming uniform divisibility for all head dimensions.
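A hedged sketch of the replication arithmetic is shown below. It is not the actual vllm-ascend loader; the function name and example numbers are illustrative, and the real code must additionally route q/k norm weights to the local shard.

```python
# Sketch: KV-head shard plan under tensor parallelism, handling the replicated case.
def kv_shard_plan(num_kv_heads: int, tp_size: int, head_dim: int) -> dict:
    if tp_size > num_kv_heads:
        # Replicate KV heads so that every rank holds at least one full head.
        assert tp_size % num_kv_heads == 0, "tp_size must be a multiple of num_kv_heads"
        replication = tp_size // num_kv_heads
        local_kv_heads = 1
    else:
        assert num_kv_heads % tp_size == 0
        replication = 1
        local_kv_heads = num_kv_heads // tp_size
    return {
        "replication": replication,
        "local_kv_heads": local_kv_heads,
        # Per-rank k/v projection rows; a naive even split over tp_size is what
        # produces shape mismatches in the replicated case.
        "local_kv_rows": local_kv_heads * head_dim,
    }

# Example: 8-way TP with 4 KV heads and head_dim=64 -> each KV head lives on 2 ranks.
print(kv_shard_plan(num_kv_heads=4, tp_size=8, head_dim=64))
```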
## 5) ACLGraph stability for fp8-origin checkpoints

Recommended pattern:

- prefer `HCCL_OP_EXPANSION_MODE=AIV` when using graph mode (launch sketch after this list);
- keep practical capture sizes and re-test from small, stable shapes;
- use `--enforce-eager` only as a temporary isolation fallback.
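For completeness, a hedged launch sketch showing how these knobs fit together. The model path, port, and TP size are illustrative; `--enforce-eager` and `--tensor-parallel-size` are standard vLLM CLI flags, while `HCCL_OP_EXPANSION_MODE` is the HCCL environment variable mentioned above.

```python
# Sketch: launching the server with graph-mode-friendly settings and an eager fallback.
import os
import subprocess

env = dict(os.environ, HCCL_OP_EXPANSION_MODE="AIV")  # prefer AIV expansion in graph mode

cmd = [
    "vllm", "serve", "/models/example-model",  # hypothetical model path
    "--port", "8000",
    "--tensor-parallel-size", "4",
]

# Temporary isolation fallback: add --enforce-eager only while bisecting graph issues.
if os.environ.get("DEBUG_EAGER") == "1":
    cmd.append("--enforce-eager")

subprocess.run(cmd, env=env, check=True)
```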
## 6) Reporting discipline

Always report both:

- what dummy validated (fast gate), and
- what only real weights validated (mandatory gate).

Do not sign off fp8-on-NPU adaptation with dummy-only evidence.