### What this PR does / why we need it

This PR introduces the **first AI-assisted model-adaptation skill package** for `vllm-ascend`. The goal is to make model adaptation work (especially for recurring feature-request issues) **repeatable, auditable, and easier to hand off**.

### Scope in this PR

This PR adds only skill/workflow assets under:

- `.agents/skills/vllm-ascend-model-adapter/SKILL.md`
- `.agents/skills/vllm-ascend-model-adapter/references/workflow-checklist.md`
- `.agents/skills/vllm-ascend-model-adapter/references/troubleshooting.md`
- `.agents/skills/vllm-ascend-model-adapter/references/multimodal-ep-aclgraph-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/fp8-on-npu-lessons.md`
- `.agents/skills/vllm-ascend-model-adapter/references/deliverables.md`

### Workflow improvements

The skill standardizes:

1. **Environment assumptions** used in our Docker setup
   - implementation roots: `/vllm-workspace/vllm` and `/vllm-workspace/vllm-ascend`
   - serving root: `/workspace`
   - model path convention: `/models/<model-name>`
2. **Validation strategy**
   - Stage A: fast `--load-format dummy` gate
   - Stage B: mandatory real-weight gate before sign-off
   - avoid false-ready by requiring request-level checks (not startup log only)
3. **Feature-first verification checklist**
   - ACLGraph / EP / flashcomm1 / MTP / multimodal
   - explicit `supported / unsupported / not-applicable / checkpoint-missing` outcomes
4. **Delivery contract**
   - minimal scoped code changes
   - required artifacts (Chinese report + runbook, e2e config YAML, tutorial doc)
   - one signed commit in delivery repo

### What this PR does NOT do

- No runtime/kernel/model patch is included in this PR.
- No direct model support claim is made by this PR alone.
- Model-specific adaptation/fix work should be submitted in follow-up PRs using this skill as the workflow baseline.

### Why this matters for maintainers

This gives the repo a shared, explicit AI-assistance protocol, so future model-adaptation PRs are easier to review, compare, and reproduce.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
# Workflow Checklist
## 0) Environment prerequisites
Set these once per session. Defaults match the official vllm-ascend Docker image.
```bash
# --- configurable paths (adjust if your layout differs) ---
VLLM_SRC=/vllm-workspace/vllm               # vLLM source root
VLLM_ASCEND_SRC=/vllm-workspace/vllm-ascend # vllm-ascend source root
WORK_DIR=/workspace                         # directory to run vllm serve from
MODEL_ROOT=/models                          # parent directory of model checkpoints
```
Expected environment:
- Hardware: Ascend A2 or A3 server
- Software: official vllm-ascend Docker image (see `./Dockerfile` for full contents)
- TP=16 typical for A3 (16-NPU), TP=8 typical for A2 (8-NPU)
## 1) Fast triage commands
```bash
MODEL_PATH=${MODEL_ROOT}/<model-name>
echo "MODEL_PATH=$MODEL_PATH"

# model inventory
ls -la "$MODEL_PATH"

# architecture + quant hints
rg -n "architectures|model_type|quantization_config|torch_dtype|max_position_embeddings|num_nextn_predict_layers|version|num_attention_heads|num_key_value_heads|num_experts" "$MODEL_PATH/config.json"

# state-dict key layout hints (if index exists)
ls -la "$MODEL_PATH"/*index*.json 2>/dev/null || true

# model custom code (if exists)
ls -la "$MODEL_PATH"/*.py 2>/dev/null || true
```
## 2) Confirm implementation and delivery roots
```bash
# implementation roots (fixed by Dockerfile)
cd "$VLLM_SRC" && git status -s
cd "$VLLM_ASCEND_SRC" && git status -s

# runtime import source check (expect vllm-workspace path)
python - <<'PY'
import vllm
print(vllm.__file__)
PY

# direct-run working directory
cd "$WORK_DIR" && pwd

# delivery root (current repo)
cd <current-repo>
git status -s
```
## 3) Session hygiene (before rerun)
```bash
# stop stale servers
pkill -f "vllm serve|api_server|EngineCore" || true

# confirm port 8000 is free
netstat -ltnp 2>/dev/null | rg ':8000' || true
```
When the user explicitly requests a reset:
cd "$VLLM_SRC" && git reset --hard && git clean -fd
cd "$VLLM_ASCEND_SRC" && git reset --hard && git clean -fd
## 4) New model onboarding checklist
```bash
# architecture mapping check in vLLM
rg -n "<ArchitectureClass>|registry" "$VLLM_SRC"/vllm/model_executor/models/registry.py

# optional: inspect model config and weight index quickly
cat "$MODEL_PATH/config.json"
cat "$MODEL_PATH"/*index*.json 2>/dev/null || true
```
If the architecture is missing/incompatible, minimally do:

- Add a model adapter under `$VLLM_SRC/vllm/model_executor/models/<new_model>.py`.
- Add a processor under `$VLLM_SRC/vllm/transformers_utils/processors/<new_model>.py` when needed.
- Register the architecture in `$VLLM_SRC/vllm/model_executor/models/registry.py`.
- Add explicit loader/remap rules for checkpoint key patterns (qkv/norm/rope/fp8 scales); see the sketch after this list.
- Touch `$VLLM_ASCEND_SRC` only when backend-specific errors are confirmed.
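Before writing new remap rules, it can help to study a known-good implementation. A minimal sketch, assuming the stock `qwen2.py` model file is present (the file choice is illustrative, not mandated):

```bash
# Inspect how an existing model maps checkpoint keys (stacked qkv/gate-up
# params) before writing remap rules for <new_model>.
rg -n "stacked_params_mapping|def load_weights" \
  "$VLLM_SRC"/vllm/model_executor/models/qwen2.py
```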
## 5) Typical implementation touch points

- `$VLLM_SRC/vllm/model_executor/models/<new_model>.py`
- `$VLLM_SRC/vllm/transformers_utils/processors/<new_model>.py`
- `$VLLM_SRC/vllm/model_executor/models/registry.py`
- `$VLLM_ASCEND_SRC/vllm_ascend/...` (only if backend behavior requires it)
## 6) Syntax sanity checks
```bash
python -m py_compile \
  "$VLLM_SRC"/vllm/model_executor/models/<new_model>.py
python -m py_compile \
  "$VLLM_SRC"/vllm/transformers_utils/processors/<new_model>.py 2>/dev/null || true
```
## 7) Two-stage serve templates (direct run, default :8000)
### Stage A: dummy fast gate (first try)
cd "$WORK_DIR"
MODEL_PATH=${MODEL_ROOT}/<model-name>
HCCL_OP_EXPANSION_MODE=AIV \
VLLM_ASCEND_ENABLE_FLASHCOMM1=0 \
vllm serve "$MODEL_PATH" \
--served-model-name <served-name> \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len <practical-max-len-or-131072> \
--tensor-parallel-size <TP-size> \
--max-num-seqs 16 \
--load-format dummy \
--port 8000
### Stage B: real-weight mandatory gate
```bash
# remove this from Stage A:
--load-format dummy
```
Note: dummy is not equivalent to real weights. Real gate is mandatory before sign-off.
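For reference, the full Stage B command is Stage A with the dummy flag dropped (all placeholders unchanged):

```bash
cd "$WORK_DIR"
MODEL_PATH=${MODEL_ROOT}/<model-name>

HCCL_OP_EXPANSION_MODE=AIV \
VLLM_ASCEND_ENABLE_FLASHCOMM1=0 \
vllm serve "$MODEL_PATH" \
  --served-model-name <served-name> \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len <practical-max-len-or-131072> \
  --tensor-parallel-size <TP-size> \
  --max-num-seqs 16 \
  --port 8000
```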
### EP + ACLGraph (feature-first, MoE only)
```bash
# add to Stage B when model is MoE and validating EP:
--enable-expert-parallel
```
### flashcomm1 check (MoE only)
```bash
# only evaluate flashcomm1 when model is MoE
VLLM_ASCEND_ENABLE_FLASHCOMM1=1
```
### Eager fallback (isolation)
```bash
# add to command for isolation only:
--enforce-eager
```
### TorchDynamo fallback (for VL interpolate-contiguous failures)
```bash
# add env var when logs contain:
#   torch._dynamo.exc.TorchRuntimeError + interpolate +
#   "NPU contiguous operator only supported contiguous memory format"
TORCHDYNAMO_DISABLE=1
```
## 8) Readiness + smoke checks (must verify true-ready)
```bash
# readiness
for i in $(seq 1 200); do
  curl -sf http://127.0.0.1:8000/v1/models >/tmp/models.json && break
  sleep 3
done

# text smoke (required)
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"<served-name>","messages":[{"role":"user","content":"say hi"}],"temperature":0,"max_tokens":16}'

# VL smoke (required for multimodal models)
# send one text+image OpenAI-compatible request and require non-empty choices.
```
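A minimal text+image smoke sketch (the image URL is a placeholder; any reachable test image works):

```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<served-name>",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "describe this image"},
      {"type": "image_url", "image_url": {"url": "<reachable-image-url>"}}
    ]}],
    "temperature": 0, "max_tokens": 32
  }'
# require non-empty choices in the response, not just HTTP 200
```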
`Application startup complete` alone is not success. If the first request crashes, treat it as a runtime failure (false-ready).
## 9) Feature validation checklist (default out-of-box)
- `GET /v1/models` returns 200.
- Text request returns 200 and non-empty output.
- If VL model: text+image request returns 200.
- ACLGraph evidence exists (`Replaying aclgraph` in the serve log) where expected; see the log check after this list.
- EP path is validated only for MoE models; non-MoE must be marked not-applicable.
- flashcomm1 is validated only for MoE models; non-MoE must be marked not-applicable.
- MTP status verified from config + weight index (enabled vs checkpoint-missing).
- Dummy-vs-real differences are explicitly reported (if any).
- Any false-ready case is explicitly marked as failure (with log signature).
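One way to collect the ACLGraph evidence above, assuming serve output was redirected to a log file (the path is a placeholder):

```bash
SERVE_LOG=/tmp/vllm-serve.log   # hypothetical; adjust to your redirect target
rg -n "Replaying aclgraph" "$SERVE_LOG" || echo "no ACLGraph replay evidence"
```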
## 10) Fallback ladder (recommended order)
- Keep same params and reproduce once to ensure a deterministic failure signature.
- Add `--enforce-eager` to isolate graph-capture influence.
- For VL + dynamo/interpolate/contiguous failures, add `TORCHDYNAMO_DISABLE=1`.
- For multimodal-processor suspicion, isolate text-only via `--limit-mm-per-prompt '{"image":0,"video":0,"audio":0}'`, then check whether the failure moves from the processor layer to the model core.
- If the issue persists, map the failure signature to a known-good implementation and patch minimal code.
## 11) Capacity baseline + sweep
- Baseline (single machine): `max-model-len=128k` + `max-num-seqs=16`.
- If the baseline passes, expand to `max-num-seqs=32/64` when requested; a sweep sketch follows this list.
- If the baseline cannot pass due to hardware/runtime limits, report the explicit root cause.
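An illustrative sweep sketch; each run reuses the full Stage B flag set where `...` appears, and reruns the section 8 smoke checks before recording a result:

```bash
for SEQS in 16 32 64; do
  echo "=== max-num-seqs=$SEQS ==="
  # vllm serve "$MODEL_PATH" ... --max-num-seqs "$SEQS" --port 8000
  # rerun readiness + smoke checks, record pass/fail,
  # then stop the server (pkill as in section 3)
done
```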
## 12) Delivery checklist
```bash
# in current working repo (delivery root)
git add <changed-files>
git commit -sm "<message>"
```
Confirm:
- one signed commit only
- Chinese analysis + Chinese runbook present
- feature status matrix included with pass/fail reason
- dummy stage and real stage validation evidence included
- false-ready cases (if any) documented with final fallback status
## Test config generation
- Generate `tests/e2e/models/configs/<ModelName>.yaml` using accuracy results from evaluation; a hypothetical example follows this list.
- Must include: `model_name` (HF path), `hardware` (e.g. "Atlas A2 Series"), `tasks` (list with `name` and `metrics` containing `name` + `value`), `num_fewshot`.
- Follow the schema of existing configs (e.g. `Qwen3-8B.yaml`).
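A hypothetical example following that schema (the task name, metric name, and value are placeholders for the real evaluation results):

```bash
cat > tests/e2e/models/configs/<ModelName>.yaml <<'YAML'
model_name: <org>/<ModelName>   # HF path (placeholder)
hardware: "Atlas A2 Series"
tasks:
  - name: gsm8k                 # example task, not a requirement
    metrics:
      - name: exact_match
        value: 0.00             # fill in from the evaluation run
num_fewshot: 5
YAML
```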
## Tutorial doc generation
- Generate `docs/source/tutorials/models/<ModelName>.md` from the standard template; a skeleton sketch follows this list.
- Fill in model-specific details: HF path, hardware requirements, TP size, max-model-len, served-model-name, sample curl request, accuracy table.
- Must include sections: Introduction, Supported Features, Environment Preparation (with docker tabs for A2/A3), Deployment (with serve script), Functional Verification (with curl example), Accuracy Evaluation, Performance.
- Update `docs/source/tutorials/models/index.md` to include the new tutorial entry.
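A skeleton sketch of the required section layout (headings only, to be filled per model):

```bash
cat > docs/source/tutorials/models/<ModelName>.md <<'MD'
# <ModelName>

## Introduction
## Supported Features
## Environment Preparation
<!-- docker tabs for A2 / A3 -->
## Deployment
<!-- serve script -->
## Functional Verification
<!-- curl example -->
## Accuracy Evaluation
## Performance
MD
```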
## GitHub issue comment
- Post the SKILL.md content or an AI-assisted workflow summary as a comment on the originating GitHub issue (see the sketch below).
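A sketch using the GitHub CLI, assuming `gh` is installed and authenticated (the issue number is a placeholder):

```bash
gh issue comment <issue-number> \
  --body-file .agents/skills/vllm-ascend-model-adapter/SKILL.md
```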
Confirm both test config YAML and tutorial doc are included in the signed commit.