# FP8-on-NPU Lessons

## 1) Recommended debug order

1. Start with `--load-format dummy` to quickly verify architecture path.
2. Run with real weights to validate weight mapping and load-time stability.
3. If blocked by fp8 execution limits on NPU, use fp8->bf16 dequantization loading path.
4. Validate `/v1/models`, then one text request, then one VL request (if multimodal).

## 2) FP8 checkpoint on NPU

Common symptom:

- `fp8 quantization is currently not supported in npu`.

Recommended pattern:

- do not force fp8 execution kernels on NPU;
- dequantize fp8 weights to bf16 during loading using paired tensors:
    - `*.weight`
    - `*.weight_scale_inv`
- keep strict unpaired scale/weight checks to avoid silent corruption.

## 3) Typical real-only risks (dummy may not expose)

- missing fp8 scale keys during real shard loading;
- wrong weight remap path only triggered by real checkpoints;
- KV/QK norm sharding mismatch under TP + replicated KV heads.

## 4) KV replication + TP pitfalls

Typical symptom:

- shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.

Recommended pattern:

- detect KV-head replication explicitly;
- use local norm/shard loader path for replicated KV heads;
- avoid assuming uniform divisibility for all head dimensions.

## 5) ACLGraph stability for fp8-origin checkpoints

Recommended pattern:

- prefer `HCCL_OP_EXPANSION_MODE=AIV` when using graph mode;
- keep practical capture sizes and re-test from small, stable shapes;
- use `--enforce-eager` only as temporary isolation fallback.

## 6) Reporting discipline

Always report both:

- what dummy validated (fast gate), and
- what only real weights validated (mandatory gate).

Do not sign off fp8-on-NPU adaptation with dummy-only evidence.