jack 29e3cdde20 [Doc][Skill] Introduce AI-assisted model-adaptation workflow for vllm-ascend (#6731)
### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill
package** for `vllm-ascend`.

The goal is to make model adaptation work (especially for recurring
feature-request issues) **repeatable, auditable, and easier to hand
off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`

2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready states by requiring request-level checks, not startup logs alone

3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes

4. **Delivery contract**
   - minimal, scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in the delivery repo
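The two-stage validation gate above can be sketched as command builders; the paths follow the conventions in item 1, and every flag other than `--load-format dummy` is a placeholder, not a prescribed invocation:

```python
def stage_a_cmd(model="/models/<model-name>"):
    """Stage A: fast gate with dummy weights; only proves the model
    graph builds and the server starts."""
    return ["vllm", "serve", model, "--load-format", "dummy"]


def stage_b_cmd(model="/models/<model-name>"):
    """Stage B: mandatory real-weight gate; must be followed by
    request-level smoke checks, not just a clean startup log."""
    return ["vllm", "serve", model]
```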

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up
PRs using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future
model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-02-26 08:48:15 +08:00


Troubleshooting

Direct run doesn't pick up your code changes

Symptoms:

  • vllm serve behavior is unchanged after code edits.

Actions:

  1. Check runtime import path:
    python - <<'PY'
    import vllm
    print(vllm.__file__)
    PY
    
  2. Ensure edits were made under /vllm-workspace/vllm and/or /vllm-workspace/vllm-ascend.
  3. Avoid the PYTHONPATH-overlay workflow except as a temporary debugging fallback.

Server fails to bind on :8000 or fails with HCCL bind errors

Symptoms:

  • Port bind failure on startup.
  • HCCL error like Communication_Error_Bind_IP_Port(EJ0003).

Actions:

  1. Kill stale vllm serve processes.
  2. Ensure :8000 is free.
  3. Retry clean startup before changing code.
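Step 2 can be automated with a local bind probe before retrying; this is a minimal sketch, assuming a plain TCP bind test is an acceptable proxy for "port is free":

```python
import socket


def port_free(port, host="127.0.0.1"):
    """Return True if we can bind the port locally, i.e. no stale
    vllm serve (or other process) is still holding it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```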

Startup appears "stuck" in graph mode

Symptoms:

  • Process alive, but curl /v1/models not ready yet.
  • Logs show compile/graph capture messages for a long time.

Actions:

  1. Keep waiting until graph capture completes.
  2. Look for Capturing CUDA graphs ... and Graph capturing finished.
  3. Only declare failure after an explicit error or timeout window.
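The wait-with-a-bounded-window rule in steps 1-3 amounts to a timed poll; in this sketch `probe` stands in for a real readiness call (e.g. GET /v1/models returning 200), and the default timeout is an assumption you should size to your model:

```python
import time


def wait_until_ready(probe, timeout_s=1800.0, interval_s=5.0):
    """Poll `probe` until the server reports ready or the timeout
    window elapses; never infer failure from a long silence alone."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```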

False-ready: startup succeeds but first request crashes

Symptoms:

  • Application startup complete appears in the logs.
  • GET /v1/models may return 200.
  • First text or VL request crashes workers/engine.

Actions:

  1. Always run at least one text smoke request immediately after ready.
  2. For VL models, always run one text+image smoke request as well.
  3. Treat first-request crash as runtime failure (do not mark as success).
  4. Capture first runtime error signature and branch to targeted fallback.
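Steps 1-2 need a deterministic request; the sketch below builds one against the OpenAI-compatible chat schema that vllm serve exposes, with arbitrary prompt text and token limits:

```python
def smoke_payload(model, image_url=None):
    """Build one deterministic /v1/chat/completions request body.
    Pass image_url to exercise the text+image path for VL models."""
    content = [{"type": "text", "text": "Reply with the single word: ready"}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,   # deterministic
        "max_tokens": 16,   # bounded
    }
```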

Architecture not recognized

Symptoms:

  • ValueError or log shows unresolved architecture.

Actions:

  1. Verify architectures in model config.json.
  2. Add mapping to vllm/model_executor/models/registry.py.
  3. Ensure module and class names exactly match.
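The mapping in step 2 follows vLLM's architecture-string to (module, class) convention; the dict name and entry below are illustrative stand-ins, not the registry's actual contents:

```python
# Hypothetical sketch of the registry pattern in
# vllm/model_executor/models/registry.py: the architectures value from
# config.json resolves to a (module name, class name) pair.
_ILLUSTRATIVE_MODELS = {
    "MyModelForCausalLM": ("my_model", "MyModelForCausalLM"),  # hypothetical
}


def resolve_architecture(arch):
    """Mirror the failure mode above: unknown architectures raise."""
    if arch not in _ILLUSTRATIVE_MODELS:
        raise ValueError(f"Model architecture {arch!r} is not recognized")
    return _ILLUSTRATIVE_MODELS[arch]
```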

Remote code import fails on transformers symbols

Symptoms:

  • Missing class/function in current transformers.

Actions:

  1. Do not upgrade transformers.
  2. Prefer native vLLM implementation.
  3. If unavoidable, copy the required modeling files from a sibling transformers source tree.

Weight loading key mismatch

Symptoms:

  • Missing/unexpected key warnings during load.

Actions:

  1. Inspect checkpoint key prefixes.
  2. Add explicit mapping logic.
  3. Keep mapping minimal and auditable.
  4. Re-test with full shards, not only tiny-layer smoke runs.
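A minimal, auditable prefix-mapping helper in the spirit of steps 2-3; the prefix names in the example are hypothetical:

```python
def remap_keys(checkpoint_keys, prefix_map):
    """Rewrite checkpoint key prefixes to the names the model expects.
    Keys matching no prefix pass through unchanged, so the mapping
    stays minimal and easy to audit."""
    remapped = {}
    for key in checkpoint_keys:
        for src, dst in prefix_map.items():
            if key.startswith(src):
                remapped[key] = dst + key[len(src):]
                break
        else:
            remapped[key] = key
    return remapped
```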

FP8 checkpoint on Ascend A2/A3 (must dequant to bf16)

Symptoms:

  • fp8 kernels unsupported or unstable on Ascend A2/A3.

Actions:

  1. Do not force fp8 quantization kernels on Ascend.
  2. Use load-time fp8->bf16 dequantization path (weight + scale pairing).
  3. Add strict unpaired scale/weight checks to avoid silent corruption.
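The strict pairing check in steps 2-3 might look like the following; the weight_scale_inv suffix is an assumption about the checkpoint's naming convention (common in FP8 checkpoints) and may differ for your model:

```python
def pair_fp8_tensors(keys):
    """Pair each '<name>.weight' with its '<name>.weight_scale_inv'
    before dequantizing to bf16; fail loudly on any unpaired scale
    to avoid silently corrupting weights."""
    weights = {k for k in keys if k.endswith(".weight")}
    scales = {k for k in keys if k.endswith(".weight_scale_inv")}
    pairs = {}
    for scale in scales:
        weight = scale[: -len("_scale_inv")]  # strip suffix -> '<name>.weight'
        if weight not in weights:
            raise ValueError(f"unpaired scale tensor: {scale}")
        pairs[weight] = scale
    return pairs
```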

QK norm mismatch (KV heads / TP / head divisibility)

Symptoms:

  • Shape mismatch like 128 vs 64 when tp_size > num_key_value_heads.
  • Similar mismatch when head topology is not cleanly divisible.

Actions:

  1. Detect KV-head replication case.
  2. Use local k_norm shard path for replicated KV heads.
  3. Avoid assumptions that all head dimensions split evenly under current TP.
  4. Validate both normal and edge topology cases explicitly.
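The replication case in steps 1-3 can be made explicit with a small topology helper; this is a sketch of the decision only, not the sharding code itself:

```python
def kv_head_layout(num_kv_heads, tp_size):
    """Describe how KV heads map onto TP ranks. When tp_size exceeds
    num_kv_heads, each head is replicated and k_norm must use the
    local shard instead of assuming an even split."""
    if tp_size >= num_kv_heads:
        if tp_size % num_kv_heads != 0:
            raise ValueError("tp_size must be a multiple of num_kv_heads")
        return {"replicated": True,
                "local_kv_heads": 1,
                "replicas_per_head": tp_size // num_kv_heads}
    if num_kv_heads % tp_size != 0:
        raise ValueError("num_kv_heads must be a multiple of tp_size")
    return {"replicated": False,
            "local_kv_heads": num_kv_heads // tp_size,
            "replicas_per_head": 1}
```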

MLA attention runtime failures after ready

Symptoms:

  • First request fails with signatures like AtbRingMLAGetWorkspaceSize / AtbRingMLA.
  • May also show aclnnFusedInferAttentionScoreV3 ... error code 561002.

Actions:

  1. Reproduce with one minimal text request (deterministic payload).
  2. Try eager isolation (--enforce-eager) once to verify whether issue is graph-only.
  3. If eager still fails, prioritize model/backend code fix path (not runtime flags only).
  4. Check vllm-ascend MLA/rope/platform implementation used by known-good runs.

VL + TorchDynamo interpolate contiguous failure

Symptoms:

  • torch._dynamo.exc.TorchRuntimeError.
  • Stack contains torch.nn.functional.interpolate.
  • Error contains NPU contiguous operator only supported contiguous memory format.

Actions:

  1. Add TORCHDYNAMO_DISABLE=1 and retry with same serve args.
  2. Validate both text and text+image after startup.
  3. If this stabilizes startup and inference, record it as current fallback path.
  4. Keep code-level fix exploration as next step, but do not block delivery if fallback is accepted.

Multimodal processor signature mismatch (skip_tensor_conversion)

Symptoms:

  • Early failure before engine ready.
  • convert_to_tensors() got an unexpected keyword argument 'skip_tensor_conversion'.

Actions:

  1. Identify processor compatibility mismatch (HF remote processor vs current transformers API).
  2. Use text-only isolation (--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}') only to separate layers, not as final fix.
  3. Expect potential follow-up core failures after bypassing processor path; keep logs for both layers.
  4. Align to known-good model dispatch and processor compatibility implementation.

Text-only isolation triggers meta tensor load errors

Symptoms:

  • NotImplementedError: Cannot copy out of meta tensor; no data!
  • May occur after disabling multimodal prompt items.

Actions:

  1. Treat as secondary failure signature (after bypassing earlier MM-processor failure).
  2. Do not assume text-only isolation is universally safe for all VL models.
  3. Return to model-specific code-fix path with captured signatures.

Config max length works on paper but not in runtime

Symptoms:

  • max_position_embeddings is large, but service fails or OOM with that value.

Actions:

  1. Record config max (theoretical).
  2. Find practical max by successful startup + serving under target TP/EP setup.
  3. Report both values explicitly in docs.
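Step 2's practical-max search can be done by walking candidate lengths downward; here `try_serve` stands in for a full startup-plus-smoke-request attempt under the target TP/EP setup:

```python
def practical_max_len(candidates, try_serve):
    """Return the largest candidate --max-model-len that actually
    starts and serves, or None if none does. Report this alongside
    the theoretical max_position_embeddings."""
    for length in sorted(candidates, reverse=True):
        if try_serve(length):
            return length
    return None
```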

flashcomm1 / MTP confusion on VL checkpoints

Symptoms:

  • flashcomm1 enabled but startup fails.
  • MTP expected but no effect.

Actions:

  1. Only validate flashcomm1 for MoE models; mark non-MoE models as not-applicable.
  2. Verify MTP from both config and weight index (mtp/nextn keys).
  3. Mark unsupported vs checkpoint-missing clearly.
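Steps 2-3 amount to cross-checking two sources; in this sketch, num_nextn_predict_layers is an assumed config field name used by some MTP-capable checkpoints, and the key substrings are illustrative:

```python
def classify_mtp(config, index_keys):
    """Cross-check config against the weight index: MTP counts as
    'supported' only when the config declares it AND mtp/nextn weights
    exist; a declaration without weights is 'checkpoint-missing'."""
    cfg_declares = config.get("num_nextn_predict_layers", 0) > 0
    ckpt_has = any("mtp" in k or "nextn" in k for k in index_keys)
    if cfg_declares and ckpt_has:
        return "supported"
    if cfg_declares:
        return "checkpoint-missing"
    return "not-applicable"
```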

ACL graph capture fails (507903)

Symptoms:

  • AclmdlRICaptureEnd ... 507903
  • rtStreamEndCapture ... invalidated stream capture sequence

Actions:

  1. Prefer HCCL_OP_EXPANSION_MODE=AIV for graph capture stability.
  2. Reduce shape pressure (--max-model-len) and retry.
  3. Temporarily fall back to --enforce-eager for isolation.
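Steps 1 and 3 combined as a launch sketch; the serve arguments other than the named env var and flag are placeholders, not recommended values:

```python
import os


def capture_stable_env():
    """Step 1: prefer AIV expansion mode for graph-capture stability."""
    env = dict(os.environ)
    env["HCCL_OP_EXPANSION_MODE"] = "AIV"
    return env


def isolation_cmd(model="/models/<model-name>", eager=False):
    """Build a retry command; the --max-model-len value is illustrative
    (step 2: reduce shape pressure and retry)."""
    cmd = ["vllm", "serve", model, "--max-model-len", "8192"]
    if eager:
        cmd.append("--enforce-eager")  # step 3: eager fallback for isolation
    return cmd
```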

API reachable but output quality odd

Symptoms:

  • /v1/models works but output has template artifacts.

Actions:

  1. Use deterministic request (temperature=0, bounded max_tokens).
  2. Verify endpoint (/v1/chat/completions vs /v1/completions) matches model template.
  3. Confirm non-empty output and HTTP 200 before success declaration.
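Steps 1-3 collapse into a single check over the decoded chat response; the artifact markers below are examples of chat-template tokens, not an exhaustive list:

```python
def check_chat_response(status, body):
    """Fail unless the response is HTTP 200, non-empty, and free of
    obvious chat-template artifacts; return the text on success."""
    if status != 200:
        raise AssertionError(f"unexpected HTTP status {status}")
    text = body["choices"][0]["message"]["content"]
    if not text or not text.strip():
        raise AssertionError("empty output")
    for marker in ("<|im_start|>", "<|im_end|>", "<|endoftext|>"):
        if marker in text:
            raise AssertionError(f"template artifact in output: {marker}")
    return text
```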