### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model-adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready states by requiring request-level checks (not startup logs alone)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal, scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model-support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
# Troubleshooting

## Direct run doesn't pick up your code changes
Symptoms:
- `vllm serve` behavior still old after code edits.
Actions:
- Check the runtime import path:

  ```bash
  python - <<'PY'
  import vllm
  print(vllm.__file__)
  PY
  ```

- Ensure edits were made under `/vllm-workspace/vllm` and/or `/vllm-workspace/vllm-ascend`.
- Avoid the PYTHONPATH-overlay workflow except as a temporary debugging fallback.
## Server fails to bind on :8000 or fails with HCCL bind errors
Symptoms:
- Port bind failure on startup.
- HCCL errors like `Communication_Error_Bind_IP_Port (EJ0003)`.
Actions:
- Kill stale `vllm serve` processes.
- Ensure `:8000` is free.
- Retry a clean startup before changing code (cleanup sketch below).
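A minimal cleanup sketch, assuming a standard Linux container with `pkill` and `ss` available; the process pattern and port follow this doc's conventions.

```bash
# Stop stale vllm serve processes, then confirm :8000 is free before retrying.
pkill -f 'vllm serve' || true
sleep 3
if ss -ltn | grep -q ':8000 '; then
  echo "port 8000 still in use; investigate before restarting"
else
  echo "port 8000 free; safe to retry clean startup"
fi
```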
## Startup appears "stuck" in graph mode
Symptoms:
- Process is alive, but `curl /v1/models` is not ready yet.
- Logs show compile/graph-capture messages for a long time.
Actions:
- Keep waiting until graph capture completes.
- Look for `Capturing CUDA graphs ...` and `Graph capturing finished`.
- Only declare failure after an explicit error or a timeout window (polling sketch below).
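A readiness-polling sketch, assuming the server listens on `localhost:8000`; the 30-minute window is an arbitrary example, tune it to your graph-capture times.

```bash
# Poll /v1/models until ready; only give up after the explicit timeout window.
for i in $(seq 1 180); do                       # 180 x 10 s = 30 min window
  if curl -sf http://localhost:8000/v1/models >/dev/null; then
    echo "server ready after ~$((i * 10))s"
    exit 0
  fi
  sleep 10
done
echo "timeout: check logs for an explicit error before declaring failure"
exit 1
```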
## False-ready: startup succeeds but first request crashes
Symptoms:
- `Application startup complete` appears in the logs.
- `GET /v1/models` may return 200.
- First text or VL request crashes the workers/engine.
Actions:
- Always run at least one text smoke request immediately after ready (sketches below).
- For VL models, always run one text+image smoke request as well.
- Treat first-request crash as runtime failure (do not mark as success).
- Capture first runtime error signature and branch to targeted fallback.
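Minimal smoke-request sketches, assuming an OpenAI-compatible server on `localhost:8000`; `<model-name>` and the image URL are placeholders.

```bash
# Text smoke request: must return HTTP 200 with non-empty content.
curl -sf http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "temperature": 0, "max_tokens": 16,
       "messages": [{"role": "user", "content": "Say OK."}]}'

# VL smoke request: text + image in one prompt.
curl -sf http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "temperature": 0, "max_tokens": 32,
       "messages": [{"role": "user", "content": [
         {"type": "text", "text": "Describe this image."},
         {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}}]}]}'
```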
## Architecture not recognized
Symptoms:
- `ValueError`, or the log shows an unresolved architecture.
Actions:
- Verify `architectures` in the model's `config.json` (check sketch below).
- Add a mapping to `vllm/model_executor/models/registry.py`.
- Ensure module and class names match exactly.
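A quick check of the checkpoint's declared architecture, following this skill's `/models/<model-name>` path convention.

```bash
python - <<'PY'
# Print the architectures list that vLLM's registry must resolve.
import json
with open('/models/<model-name>/config.json') as f:   # path convention from this skill
    print(json.load(f).get('architectures'))
PY
```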
## Remote code import fails on transformers symbols
Symptoms:
- Missing class/function in the current `transformers`.
Actions:
- Do not upgrade `transformers`.
- Prefer a native vLLM implementation.
- If unavoidable, copy the required modeling files from a sibling transformers source tree.
## Weight loading key mismatch
Symptoms:
- Missing/unexpected key warnings during load.
Actions:
- Inspect checkpoint key prefixes (prefix-count sketch below).
- Add explicit mapping logic.
- Keep mapping minimal and auditable.
- Re-test with full shards, not only tiny-layer smoke runs.
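A sketch for inspecting checkpoint key prefixes, assuming a sharded safetensors checkpoint with a `model.safetensors.index.json`.

```bash
python - <<'PY'
# Count top-level key prefixes to compare against vLLM's expected parameter names.
import json, collections
with open('/models/<model-name>/model.safetensors.index.json') as f:
    weight_map = json.load(f)['weight_map']
prefixes = collections.Counter(k.split('.')[0] for k in weight_map)
print(prefixes.most_common())
PY
```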
## FP8 checkpoint on Ascend A2/A3 (must dequant to bf16)
Symptoms:
- fp8 kernels unsupported or unstable on Ascend A2/A3.
Actions:
- Do not force fp8 quantization kernels on Ascend.
- Use a load-time fp8->bf16 dequantization path (weight + scale pairing), as sketched below.
- Add strict unpaired scale/weight checks to avoid silent corruption.
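A minimal dequantization sketch; the `.weight`/`.weight_scale` naming and per-tensor scales are assumptions, and real checkpoints may use per-channel or per-block scales.

```bash
python - <<'PY'
# Load-time fp8 -> bf16 dequant with a strict unpaired-scale check.
import torch

def dequant_fp8_to_bf16(tensors: dict) -> dict:
    out = {}
    for name, w in tensors.items():
        if not name.endswith('.weight') or w.dtype != torch.float8_e4m3fn:
            continue
        scale_name = name.replace('.weight', '.weight_scale')  # assumed naming convention
        if scale_name not in tensors:
            raise ValueError(f'unpaired fp8 weight: {name}')   # fail loudly, no silent corruption
        out[name] = (w.to(torch.float32)
                     * tensors[scale_name].to(torch.float32)).to(torch.bfloat16)
    return out
PY
```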
## QK norm mismatch (KV heads / TP / head divisibility)
Symptoms:
- Shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
- Similar mismatches when the head topology is not cleanly divisible.
Actions:
- Detect the KV-head replication case (check sketch below).
- Use the local `k_norm` shard path for replicated KV heads.
- Avoid assuming that all head dimensions split evenly under the current TP.
- Validate both normal and edge topology cases explicitly.
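A sketch of the replication check; the variable names are illustrative, not the actual vllm-ascend code.

```bash
python - <<'PY'
# When tp_size > num_key_value_heads, KV heads are replicated across TP ranks,
# so per-rank norm shards must use the local (replicated) head count.
tp_size, num_key_value_heads = 16, 8   # example edge topology from the symptom
kv_replicated = tp_size > num_key_value_heads
local_kv_heads = 1 if kv_replicated else num_key_value_heads // tp_size
print(f'replicated={kv_replicated}, local_kv_heads={local_kv_heads}')
PY
```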
## MLA attention runtime failures after ready
Symptoms:
- First request fails with signatures like `AtbRingMLAGetWorkspaceSize` / `AtbRingMLA`.
- May also show `aclnnFusedInferAttentionScoreV3 ... error code 561002`.
Actions:
- Reproduce with one minimal text request (deterministic payload).
- Try eager isolation (
--enforce-eager) once to verify whether issue is graph-only. - If eager still fails, prioritize model/backend code fix path (not runtime flags only).
- Check
vllm-ascendMLA/rope/platform implementation used by known-good runs.
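An isolation sketch; `<model-name>` is a placeholder, and all other serve args should stay identical to the failing run.

```bash
# Same serve args as the failing run, plus --enforce-eager (one-shot isolation).
vllm serve /models/<model-name> --enforce-eager   # keep all other args identical

# Deterministic minimal reproduction once ready.
curl -sf http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "temperature": 0, "max_tokens": 8,
       "messages": [{"role": "user", "content": "ping"}]}'
```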
## VL + TorchDynamo interpolate contiguous failure
Symptoms:
- `torch._dynamo.exc.TorchRuntimeError`.
- Stack contains `torch.nn.functional.interpolate`.
- Error contains `NPU contiguous operator only supported contiguous memory format`.
Actions:
- Add `TORCHDYNAMO_DISABLE=1` and retry with the same serve args (see below).
- Validate both text and text+image after startup.
- If this stabilizes startup and inference, record it as current fallback path.
- Keep code-level fix exploration as next step, but do not block delivery if fallback is accepted.
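The fallback retry, with serve args unchanged; `TORCHDYNAMO_DISABLE=1` is PyTorch's standard Dynamo kill switch.

```bash
# Same serve command as before, with Dynamo disabled at the environment level.
TORCHDYNAMO_DISABLE=1 vllm serve /models/<model-name>   # keep all other args identical
```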
## Multimodal processor signature mismatch (skip_tensor_conversion)
Symptoms:
- Early failure before the engine is ready.
- `convert_to_tensors() got an unexpected keyword argument 'skip_tensor_conversion'`.
Actions:
- Identify processor compatibility mismatch (HF remote processor vs current transformers API).
- Use text-only isolation (`--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`) only to separate layers, not as the final fix (command below).
- Expect potential follow-up core failures after bypassing the processor path; keep logs for both layers.
- Align to known-good model dispatch and processor compatibility implementation.
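The text-only isolation run from the action above; this only separates the multimodal-processor layer from the core engine, it is not a fix.

```bash
# Bypass the multimodal processor path entirely; expect possible secondary failures.
vllm serve /models/<model-name> \
  --limit-mm-per-prompt '{"image": 0, "video": 0, "audio": 0}'
```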
## Text-only isolation triggers meta tensor load errors
Symptoms:
- `NotImplementedError: Cannot copy out of meta tensor; no data!`
- May occur after disabling multimodal prompt items.
Actions:
- Treat as secondary failure signature (after bypassing earlier MM-processor failure).
- Do not assume text-only isolation is universally safe for all VL models.
- Return to model-specific code-fix path with captured signatures.
## Config max length works on paper but not in runtime
Symptoms:
- `max_position_embeddings` is large, but the service fails or OOMs at that value.
Actions:
- Record config max (theoretical).
- Find the practical max by successful startup + serving under the target TP/EP setup (probe sketch below).
- Report both values explicitly in docs.
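A probe sketch for the practical max; `start_and_smoke.sh` is a hypothetical helper that starts the server with the given `--max-model-len`, waits for ready, runs one smoke request, and exits non-zero on any failure.

```bash
# Walk candidate lengths downward; the first one that starts AND serves is the practical max.
for LEN in 131072 65536 32768 16384; do
  if ./start_and_smoke.sh "$LEN"; then    # hypothetical helper, see lead-in
    echo "practical max-model-len: $LEN (config max may be higher on paper)"
    break
  fi
done
```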
## flashcomm1 / MTP confusion on VL checkpoints
Symptoms:
- flashcomm1 enabled but startup fails.
- MTP expected but no effect.
Actions:
- Only validate flashcomm1 for MoE models; mark non-MoE models as not-applicable.
- Verify MTP from both the config and the weight index (`mtp`/`nextn` keys), as sketched below.
- Mark unsupported vs checkpoint-missing clearly.
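A sketch for the weight-index side of the MTP check, assuming a sharded safetensors checkpoint; the `mtp`/`nextn` substrings come from this doc.

```bash
python - <<'PY'
# Distinguish "unsupported" from "checkpoint-missing": look for MTP weights in the index.
import json
with open('/models/<model-name>/model.safetensors.index.json') as f:
    keys = json.load(f)['weight_map']
hits = [k for k in keys if 'mtp' in k.lower() or 'nextn' in k.lower()]
print(f'{len(hits)} mtp/nextn keys' if hits else 'no mtp/nextn keys: checkpoint-missing')
PY
```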
## ACL graph capture fails (507903)
Symptoms:
- `AclmdlRICaptureEnd ... 507903`
- `rtStreamEndCapture ... invalidated stream capture sequence`
Actions:
- Prefer `HCCL_OP_EXPANSION_MODE=AIV` for graph-capture stability (retry sketch below).
- Reduce shape pressure (`--max-model-len`) and retry.
- Temporarily fall back to `--enforce-eager` for isolation.
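A retry sketch combining the three actions; the length value is an example, not a recommendation.

```bash
# Retry 1: stabilize capture and reduce shape pressure (length value is an example).
HCCL_OP_EXPANSION_MODE=AIV vllm serve /models/<model-name> --max-model-len 8192

# Retry 2 (isolation only, not a fix): eager mode to bypass graph capture entirely.
vllm serve /models/<model-name> --enforce-eager
```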
## API reachable but output quality odd
Symptoms:
- `/v1/models` works, but the output has template artifacts.
Actions:
- Use a deterministic request (`temperature=0`, bounded `max_tokens`), as shown below.
- Verify that the endpoint (`/v1/chat/completions` vs `/v1/completions`) matches the model template.
- Confirm non-empty output and HTTP 200 before declaring success.
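A deterministic quality probe, assuming a chat-template model; if the output still looks templated, compare against the raw `/v1/completions` endpoint.

```bash
# Deterministic probe; last output line is the HTTP code, body must be non-empty.
curl -s -w '\n%{http_code}\n' http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "temperature": 0, "max_tokens": 32,
       "messages": [{"role": "user", "content": "Reply with the single word OK."}]}'
```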