58 lines
1.7 KiB
Markdown
58 lines
1.7 KiB
Markdown
|
|
# FP8-on-NPU Lessons
|
||
|
|
|
||
|
|
## 1) Recommended debug order
|
||
|
|
|
||
|
|
1. Start with `--load-format dummy` to quickly verify architecture path.
|
||
|
|
2. Run with real weights to validate weight mapping and load-time stability.
|
||
|
|
3. If blocked by fp8 execution limits on NPU, use fp8->bf16 dequantization loading path.
|
||
|
|
4. Validate `/v1/models`, then one text request, then one VL request (if multimodal).
|
||
|
|
|
||
|
|
## 2) FP8 checkpoint on NPU
|
||
|
|
|
||
|
|
Common symptom:
|
||
|
|
|
||
|
|
- `fp8 quantization is currently not supported in npu`.
|
||
|
|
|
||
|
|
Recommended pattern:
|
||
|
|
|
||
|
|
- do not force fp8 execution kernels on NPU;
|
||
|
|
- dequantize fp8 weights to bf16 during loading using paired tensors:
|
||
|
|
- `*.weight`
|
||
|
|
- `*.weight_scale_inv`
|
||
|
|
- keep strict unpaired scale/weight checks to avoid silent corruption.
|
||
|
|
|
||
|
|
## 3) Typical real-only risks (dummy may not expose)
|
||
|
|
|
||
|
|
- missing fp8 scale keys during real shard loading;
|
||
|
|
- wrong weight remap path only triggered by real checkpoints;
|
||
|
|
- KV/QK norm sharding mismatch under TP + replicated KV heads.
|
||
|
|
|
||
|
|
## 4) KV replication + TP pitfalls
|
||
|
|
|
||
|
|
Typical symptom:
|
||
|
|
|
||
|
|
- shape mismatch like `128 vs 64` when `tp_size > num_key_value_heads`.
|
||
|
|
|
||
|
|
Recommended pattern:
|
||
|
|
|
||
|
|
- detect KV-head replication explicitly;
|
||
|
|
- use local norm/shard loader path for replicated KV heads;
|
||
|
|
- avoid assuming uniform divisibility for all head dimensions.
|
||
|
|
|
||
|
|
## 5) ACLGraph stability for fp8-origin checkpoints
|
||
|
|
|
||
|
|
Recommended pattern:
|
||
|
|
|
||
|
|
- prefer `HCCL_OP_EXPANSION_MODE=AIV` when using graph mode;
|
||
|
|
- keep practical capture sizes and re-test from small, stable shapes;
|
||
|
|
- use `--enforce-eager` only as temporary isolation fallback.
|
||
|
|
|
||
|
|
## 6) Reporting discipline
|
||
|
|
|
||
|
|
Always report both:
|
||
|
|
|
||
|
|
- what dummy validated (fast gate), and
|
||
|
|
- what only real weights validated (mandatory gate).
|
||
|
|
|
||
|
|
Do not sign off fp8-on-NPU adaptation with dummy-only evidence.
|