230 lines
6.5 KiB
Markdown
230 lines
6.5 KiB
Markdown
|
|
# Troubleshooting
|
||
|
|
|
||
|
|
## Direct run doesn't pick your code changes
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- `vllm serve` behavior still old after code edits.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Check runtime import path:
|
||
|
|
```bash
|
||
|
|
python - <<'PY'
|
||
|
|
import vllm
|
||
|
|
print(vllm.__file__)
|
||
|
|
PY
|
||
|
|
```
|
||
|
|
2. Ensure edits were made under `/vllm-workspace/vllm` and/or `/vllm-workspace/vllm-ascend`.
|
||
|
|
3. Avoid PYTHONPATH-overlay workflow unless as temporary debugging fallback.
|
||
|
|
|
||
|
|
## Server fails to bind on `:8000` or fails with HCCL bind errors
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- Port bind fail on startup.
|
||
|
|
- HCCL error like `Communication_Error_Bind_IP_Port(EJ0003)`.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Kill stale `vllm serve` processes.
|
||
|
|
2. Ensure `:8000` is free.
|
||
|
|
3. Retry clean startup before changing code.
|
||
|
|
|
||
|
|
## Startup appears "stuck" in graph mode
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- Process alive, but `curl /v1/models` not ready yet.
|
||
|
|
- Logs show compile/graph capture messages for a long time.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Keep waiting until graph capture completes.
|
||
|
|
2. Look for `Capturing CUDA graphs ...` and `Graph capturing finished`.
|
||
|
|
3. Only declare failure after an explicit error or timeout window.
|
||
|
|
|
||
|
|
## False-ready: startup succeeds but first request crashes
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- `Application startup complete` exists.
|
||
|
|
- `GET /v1/models` may return 200.
|
||
|
|
- First text or VL request crashes workers/engine.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Always run at least one text smoke request immediately after ready.
|
||
|
|
2. For VL models, always run one text+image smoke request as well.
|
||
|
|
3. Treat first-request crash as runtime failure (do not mark as success).
|
||
|
|
4. Capture first runtime error signature and branch to targeted fallback.
|
||
|
|
|
||
|
|
## Architecture not recognized
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- `ValueError` or log shows unresolved architecture.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Verify `architectures` in model `config.json`.
|
||
|
|
2. Add mapping to `vllm/model_executor/models/registry.py`.
|
||
|
|
3. Ensure module and class names exactly match.
|
||
|
|
|
||
|
|
## Remote code import fails on transformers symbols
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- Missing class/function in current `transformers`.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Do not upgrade `transformers`.
|
||
|
|
2. Prefer native vLLM implementation.
|
||
|
|
3. If unavoidable, copy required modeling files from sibling transformers source.
|
||
|
|
|
||
|
|
## Weight loading key mismatch
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- Missing/unexpected key warnings during load.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Inspect checkpoint key prefixes.
|
||
|
|
2. Add explicit mapping logic.
|
||
|
|
3. Keep mapping minimal and auditable.
|
||
|
|
4. Re-test with full shards, not only tiny-layer smoke runs.
|
||
|
|
|
||
|
|
## FP8 checkpoint on Ascend A2/A3 (must dequant to bf16)
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- fp8 kernels unsupported or unstable on Ascend A2/A3.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Do not force fp8 quantization kernels on Ascend.
|
||
|
|
2. Use load-time fp8->bf16 dequantization path (weight + scale pairing).
|
||
|
|
3. Add strict unpaired scale/weight checks to avoid silent corruption.
|
||
|
|
|
||
|
|
## QK norm mismatch (KV heads / TP / head divisibility)
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- Shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
|
||
|
|
- Similar mismatch when head topology is not cleanly divisible.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Detect KV-head replication case.
|
||
|
|
2. Use local `k_norm` shard path for replicated KV heads.
|
||
|
|
3. Avoid assumptions that all head dimensions split evenly under current TP.
|
||
|
|
4. Validate both normal and edge topology cases explicitly.
|
||
|
|
|
||
|
|
## MLA attention runtime failures after ready
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- First request fails with signatures like `AtbRingMLAGetWorkspaceSize` / `AtbRingMLA`.
|
||
|
|
- May also show `aclnnFusedInferAttentionScoreV3 ... error code 561002`.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Reproduce with one minimal text request (deterministic payload).
|
||
|
|
2. Try eager isolation (`--enforce-eager`) once to verify whether issue is graph-only.
|
||
|
|
3. If eager still fails, prioritize model/backend code fix path (not runtime flags only).
|
||
|
|
4. Check `vllm-ascend` MLA/rope/platform implementation used by known-good runs.
|
||
|
|
|
||
|
|
## VL + TorchDynamo interpolate contiguous failure
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- `torch._dynamo.exc.TorchRuntimeError`.
|
||
|
|
- Stack contains `torch.nn.functional.interpolate`.
|
||
|
|
- Error contains `NPU contiguous operator only supported contiguous memory format`.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Add `TORCHDYNAMO_DISABLE=1` and retry with same serve args.
|
||
|
|
2. Validate both text and text+image after startup.
|
||
|
|
3. If this stabilizes startup and inference, record it as current fallback path.
|
||
|
|
4. Keep code-level fix exploration as next step, but do not block delivery if fallback is accepted.
|
||
|
|
|
||
|
|
## Multimodal processor signature mismatch (`skip_tensor_conversion`)
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- Early failure before engine ready.
|
||
|
|
- `convert_to_tensors() got an unexpected keyword argument 'skip_tensor_conversion'`.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Identify processor compatibility mismatch (HF remote processor vs current transformers API).
|
||
|
|
2. Use text-only isolation (`--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`) only to separate layers, not as final fix.
|
||
|
|
3. Expect potential follow-up core failures after bypassing processor path; keep logs for both layers.
|
||
|
|
4. Align to known-good model dispatch and processor compatibility implementation.
|
||
|
|
|
||
|
|
## Text-only isolation triggers meta tensor load errors
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- `NotImplementedError: Cannot copy out of meta tensor; no data!`
|
||
|
|
- May occur after disabling multimodal prompt items.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Treat as secondary failure signature (after bypassing earlier MM-processor failure).
|
||
|
|
2. Do not assume text-only isolation is universally safe for all VL models.
|
||
|
|
3. Return to model-specific code-fix path with captured signatures.
|
||
|
|
|
||
|
|
## Config max length works on paper but not in runtime
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- `max_position_embeddings` is large, but service fails or OOM with that value.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Record config max (theoretical).
|
||
|
|
2. Find practical max by successful startup + serving under target TP/EP setup.
|
||
|
|
3. Report both values explicitly in docs.
|
||
|
|
|
||
|
|
## flashcomm1 / MTP confusion on VL checkpoints
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- flashcomm1 enabled but startup fails.
|
||
|
|
- MTP expected but no effect.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Only validate flashcomm1 for MoE models; non-MoE mark as not-applicable.
|
||
|
|
2. Verify MTP from both config and weight index (`mtp/nextn` keys).
|
||
|
|
3. Mark unsupported vs checkpoint-missing clearly.
|
||
|
|
|
||
|
|
## ACL graph capture fails (507903)
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- `AclmdlRICaptureEnd ... 507903`
|
||
|
|
- `rtStreamEndCapture ... invalidated stream capture sequence`
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Prefer `HCCL_OP_EXPANSION_MODE=AIV` for graph capture stability.
|
||
|
|
2. Reduce shape pressure (`--max-model-len`) and retry.
|
||
|
|
3. Temporarily fallback `--enforce-eager` for isolation.
|
||
|
|
|
||
|
|
## API reachable but output quality odd
|
||
|
|
|
||
|
|
Symptoms:
|
||
|
|
|
||
|
|
- `/v1/models` works but output has template artifacts.
|
||
|
|
|
||
|
|
Actions:
|
||
|
|
|
||
|
|
1. Use deterministic request (`temperature=0`, bounded `max_tokens`).
|
||
|
|
2. Verify endpoint (`/v1/chat/completions` vs `/v1/completions`) matches model template.
|
||
|
|
3. Confirm non-empty output and HTTP 200 before success declaration.
|