[Docs] Add vLLM-Kunlun New Model Adaptation Manual and Update Model Support (#211)
* [Docs] Fix app.readthedocs building
* [Docs] Add vLLM-Kunlun New Model Adaptation Manual and Update Model Support

Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
docs/source/developer_guide/developer_guide.md | 760 lines (new file)

# 📖 vLLM-Kunlun New Model Adaptation Manual

> Based on in-depth analysis of the [baidu/vLLM-Kunlun](https://github.com/baidu/vLLM-Kunlun) and [vllm-project/vllm](https://github.com/vllm-project/vllm) repositories.
>
> Applicable versions: vLLM v0.15.1+ / vLLM-Kunlun main branch

---

## Table of Contents

- [I. Understanding the Overall Architecture](#i-understanding-the-overall-architecture)
  - [1.1 Plugin System](#11-plugin-system)
  - [1.2 Startup Process](#12-startup-process)
  - [1.3 Import Hook Mechanism](#13-import-hook-mechanism)
  - [1.4 Code Architecture](#14-code-architecture)
- [II. New Model Adaptation Step-by-Step](#ii-new-model-adaptation-step-by-step)
  - [Step 0: Pre-assessment](#step-0-pre-assessment)
  - [Step 1: Implement Model Files](#step-1-implement-model-files)
  - [Step 2: Register the Model](#step-2-register-the-model)
  - [Step 3: Verify Registration](#step-3-verify-registration)
  - [Step 4: Testing](#step-4-testing)
- [III. Adaptation Guide for Special Model Types](#iii-adaptation-guide-for-special-model-types)
  - [3.1 MoE Models](#31-moe-models-eg-qwen3-moe-deepseek-v3)
  - [3.2 MLA Models](#32-mla-multi-latent-attention-models-eg-deepseek-v3)
  - [3.3 Multi-modal Models](#33-multi-modal-models-eg-qwen2-vl-internvl)
  - [3.4 Hybrid Attention Models](#34-hybrid-attention-models-eg-qwen3-next)
- [IV. Quantized Model Adaptation](#iv-quantized-model-adaptation)
  - [4.1 Supported Quantization Methods](#41-supported-quantization-methods)
  - [4.2 Special Handling for Quantization](#42-special-handling-for-quantization)
- [V. Custom Operators](#v-custom-operators-if-new-low-level-ops-are-needed)
- [VI. Common Pitfalls Checklist](#vi-common-pitfalls-checklist)
- [VII. Reference Template Quick Look-up](#vii-reference-template-quick-look-up)
- [VIII. Debugging Tips](#viii-debugging-tips)
- [IX. Environment Variables Cheat Sheet](#ix-environment-variables-cheat-sheet)
- [X. PR Submission Standards](#x-pr-submission-standards)

---

## I. Understanding the Overall Architecture

### 1.1 Plugin System

vLLM-Kunlun uses the **OOT (Out-of-Tree) Plugin** approach to integrate with vLLM, primarily registered via `entry_points` in `setup.py`:

```python
# setup.py
entry_points={
    'vllm.platform_plugins': ["kunlun = vllm_kunlun:register"],     # Platform plugin
    'vllm.general_plugins': [
        "kunlun_model = vllm_kunlun:register_model",                # Model registration
        "kunlun_quant = vllm_kunlun:register_quant_method"          # Quantization method
    ],
    "console_scripts": [
        "vllm_kunlun = vllm_kunlun.entrypoints.main:main"
    ]
}
```

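For orientation, here is a minimal sketch of what the two plugin entry points above might look like. It follows vLLM's documented plugin contract (a platform plugin function returns the fully qualified platform class path); the function bodies are assumptions for illustration, not the repository's verbatim code:

```python
# vllm_kunlun/__init__.py -- illustrative sketch, not the repository's verbatim code
def register() -> str:
    """Platform plugin hook: returns the fully qualified Platform class path."""
    return "vllm_kunlun.platforms.kunlun.KunlunPlatform"

def register_model() -> None:
    """General plugin hook: registers all Kunlun-adapted model classes."""
    from vllm_kunlun.models import register_model as _do_register
    _do_register()
```
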
### 1.2 Startup Process

```
vllm Startup
├─ 1. Discover platform_plugin → Call vllm_kunlun:register()
│  ├─ Register KunlunPlatform (defines Attention Backend, Worker, etc.)
│  ├─ Apply import hook (module redirection)
│  └─ Register custom operators (custom_op)
├─ 2. Discover general_plugin → Call vllm_kunlun:register_model()
│  └─ Register all Kunlun-adapted models via ModelRegistry.register_model()
└─ 3. Model Loading → Match registered model classes based on the architectures field in config.json
```

### 1.3 Import Hook Mechanism

vLLM-Kunlun uses a custom import hook to **transparently replace** certain vLLM modules with Kunlun-customized versions:

```python
# vllm_kunlun/__init__.py
def _custom_import(module_name, globals=None, locals=None, fromlist=(), level=0):
    try:
        module_mappings = {
            "vllm.compilation.wrapper": "vllm_kunlun.compilation.wrapper",
            "vllm.v1.worker.utils": "vllm_kunlun.v1.worker.utils",
            "vllm.model_executor.model_loader.bitsandbytes_loader": "vllm_kunlun.models.model_loader.bitsandbytes_loader",
            "vllm.v1.sample.ops.topk_topp_sampler": "vllm_kunlun.v1.sample.ops.topk_topp_sampler",
            "vllm.model_executor.layers.sampler": "vllm_kunlun.ops.sample.sampler",
            "vllm.v1.sample.rejection_sampler": "vllm_kunlun.v1.sample.rejection_sampler",
            "vllm.attention.ops.merge_attn_states": "vllm_kunlun.ops.attention.merge_attn_states",
        }

        if module_name in module_mappings:
            if module_name in sys.modules:
                return sys.modules[module_name]
            target_module = module_mappings[module_name]
            module = importlib.import_module(target_module)
            sys.modules[module_name] = module
            sys.modules[target_module] = module
    except Exception:
        pass

    return OLD_IMPORT_HOOK(module_name, globals=globals, locals=locals, fromlist=fromlist, level=level)
```

> **⚠️ Understanding this mechanism is crucial**: even if you write `from vllm.xxx import YYY` in your model code, what you actually get might be `vllm_kunlun.xxx.YYY`.

### 1.4 Code Architecture

```
vllm_kunlun/
├── __init__.py                      # Plugin entry: register() + import_hook()
├── platforms/kunlun.py              # KunlunPlatform: defines Attention Backend, Worker, etc.
├── models/                          # ⭐ Model implementation directory (where you add files)
│   ├── __init__.py                  # ⭐ Model registration entry
│   ├── deepseek_v2.py               # DeepSeek V2/V3 reference implementation
│   ├── deepseek_mtp.py              # DeepSeek MTP (speculative decoding)
│   ├── qwen3.py                     # Qwen3 reference implementation (dense model)
│   ├── qwen3_moe.py                 # Qwen3 MoE reference implementation
│   ├── qwen3_next.py                # Qwen3-Next (hybrid attention)
│   ├── qwen3_vl.py                  # Qwen3 VL (multi-modal)
│   ├── qwen3_vl_moe.py              # Qwen3 VL MoE (multi-modal + MoE)
│   ├── qwen2_vl.py                  # Qwen2 VL
│   ├── qwen2_5_vl.py                # Qwen2.5 VL
│   ├── internlm2.py                 # InternLM2 reference implementation
│   ├── internvl.py                  # InternVL (multi-modal)
│   ├── interns1.py                  # InternS1
│   ├── seed_oss.py                  # SeedOss
│   ├── gpt_oss.py                   # GptOss
│   └── mimo_v2_flash.py             # MiMo-V2-Flash
├── ops/                             # Kunlun custom operators
│   ├── _kunlun_ops.py               # KunlunOps: paged_attention, rms_norm, silu...
│   ├── _custom_ops.py               # vLLM custom_op registration
│   ├── activation.py                # Activation functions like SiluAndMul, GeluAndMul
│   ├── attention/                   # Attention operators
│   │   ├── layer.py                 # Attention layer wrapper
│   │   └── backends/kunlun_attn.py  # KunlunAttentionBackend + KunlunAttentionImpl
│   ├── quantization/                # Quantization: AWQ, GPTQ, CompressedTensors...
│   ├── vocab_parallel_embedding.py  # Custom embedding
│   └── rotary_embedding.py          # Split_Norm_Rope (QKNorm + RoPE fusion)
├── v1/attention/backends/           # Attention backends for the v1 engine
│   ├── kunlun_attn.py               # Standard attention
│   └── mla/                         # MLA (Multi-Latent Attention) implementation
│       ├── flashmla.py
│       ├── flashmla_sparse.py
│       └── common.py
├── compilation/wrapper.py           # torch.compile wrapper
├── config/                          # Model configuration overrides
│   └── model.py                     # Patches attributes like is_deepseek_mla
├── distributed/                     # Communication
│   └── kunlun_communicator.py       # Kunlun device communication
└── csrc/                            # C++ extensions
    └── utils.cpp
```

---

## II. New Model Adaptation Step-by-Step

### Step 0: Pre-assessment

Before starting, confirm which scenario your model falls into:

| Scenario | Description | Effort |
|------|------|--------|
| **Case A: vLLM already supports the model** | Only need to replace Attention / Activation with Kunlun versions | ⭐ Minimal |
| **Case B: vLLM does not support it; a new architecture is needed** | Requires a full implementation of the model class + registration | ⭐⭐⭐ High |
| **Case C: MoE variant of an existing model** | Add an MoE layer on top of the dense version | ⭐⭐ Medium |
| **Case D: Multi-modal model** | Language model + vision encoder + projector | ⭐⭐⭐⭐ Highest |

**Recommended Workflow:**

1. Check the [vLLM Supported Models List](https://docs.vllm.ai/en/stable/models/supported_models.html) to see if the model is already there.
2. If yes → copy the corresponding file from `vllm/model_executor/models/` to `vllm_kunlun/models/` and perform the replacements (see the sketch below).
3. If no → refer to the [vLLM Adding a New Model documentation](https://docs.vllm.ai/en/stable/contributing/model/) to understand the principles first, then follow this manual.
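
The "copy, then replace" path in step 2 might look like the following; `VLLM_SRC` and the file name are placeholders, not fixed conventions:

```bash
# Start from the upstream vLLM implementation (paths are illustrative)
cp "$VLLM_SRC/vllm/model_executor/models/my_new_model.py" vllm_kunlun/models/my_new_model.py

# Then swap the CUDA-bound imports for their Kunlun counterparts,
# following the comparison table in Step 1 (Attention, SiluAndMul, ...).
```
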
---

### Step 1: Implement Model Files

Create a model file in the `vllm_kunlun/models/` directory, e.g., `my_new_model.py`.

#### 1.1 Key Replacement Comparison Table

| Component | vLLM Native Import | vLLM-Kunlun Replacement Import | Required? |
|------|-----------------|------------------------|---------|
| **Attention Layer** | `from vllm.attention import Attention` | `from vllm_kunlun.ops.attention.layer import Attention` | ✅ **Yes** |
| **SiluAndMul** | `from vllm.model_executor.layers.activation import SiluAndMul` | `from vllm_kunlun.ops.activation import SiluAndMul` | ✅ **Yes** |
| **GeluAndMul** | `...activation import GeluAndMul` | `from vllm_kunlun.ops.activation import GeluAndMul` | ⚠️ As needed |
| **QuickGELU** | `...activation import QuickGELU` | `from vllm_kunlun.ops.activation import QuickGELU` | ⚠️ As needed |
| **VocabParallelEmbedding** | `from vllm...vocab_parallel_embedding import VocabParallelEmbedding` | `from vllm_kunlun.ops.vocab_parallel_embedding import VocabParallelEmbedding` | ⚠️ Some models |
| **ParallelLMHead** | Same as above | `from vllm_kunlun.ops.vocab_parallel_embedding import ParallelLMHead` | ⚠️ Some models |
| **RoPE (special)** | `from vllm...rotary_embedding import get_rope` | `from vllm_kunlun.ops.rotary_embedding import Split_Norm_Rope` | ⚠️ MoE + QKNorm |
| **Linear / RMSNorm, etc.** | Use vLLM native directly | **No replacement needed** | — |

> 💡 **Core Principle**: any component involving **CUDA kernel calls** (Attention, activation, sampling) must be replaced with the Kunlun version; pure PyTorch components (Linear, RMSNorm, RoPE, etc.) can use vLLM native directly.

#### 1.2 Standard Dense Decoder-Only Model Template

Refer to `qwen3.py` or `internlm2.py`:

```python
"""Inference-only MyNewModel compatible with HuggingFace weights."""
from collections.abc import Iterable
from typing import Optional, Union

import torch
from torch import nn
from transformers import MyNewModelConfig  # HuggingFace config

# ==========================================
# ⭐ Key Replacement 1: Use Kunlun-customized Attention
# ==========================================
# Do not use: from vllm.attention import Attention
from vllm_kunlun.ops.attention.layer import Attention

# ==========================================
# ⭐ Key Replacement 2: Use Kunlun-customized activation
# ==========================================
# Do not use: from vllm.model_executor.layers.activation import SiluAndMul
from vllm_kunlun.ops.activation import SiluAndMul

# Other layers can use vLLM native directly
from vllm.compilation.decorators import support_torch_compile
from vllm.config import CacheConfig, VllmConfig
from vllm.distributed import get_pp_group, get_tensor_model_parallel_world_size
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.layers.linear import (
    QKVParallelLinear, RowParallelLinear, MergedColumnParallelLinear
)
from vllm.model_executor.layers.logits_processor import LogitsProcessor
from vllm.model_executor.layers.quantization import QuantizationConfig
from vllm.model_executor.layers.rotary_embedding import get_rope
from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
from vllm.sequence import IntermediateTensors
from vllm.model_executor.models.interfaces import SupportsPP, SupportsLoRA
from vllm.model_executor.models.utils import (
    AutoWeightsLoader, PPMissingLayer, extract_layer_index,
    is_pp_missing_parameter, make_empty_intermediate_tensors_factory,
    make_layers, maybe_prefix
)


# ============================
# 1. MLP Layer
# ============================
class MyNewModelMLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size, hidden_act,
                 quant_config=None, prefix=""):
        super().__init__()
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2,
            bias=False, quant_config=quant_config,
            prefix=f"{prefix}.gate_up_proj",
        )
        self.down_proj = RowParallelLinear(
            intermediate_size, hidden_size,
            bias=False, quant_config=quant_config,
            prefix=f"{prefix}.down_proj",
        )
        self.act_fn = SiluAndMul()  # ⭐ Use the Kunlun version

    def forward(self, x):
        gate_up, _ = self.gate_up_proj(x)
        x = self.act_fn(gate_up)
        x, _ = self.down_proj(x)
        return x
```

#### 1.3 Key Implementation Requirements

- **All modules must include the `prefix` parameter**, passed in `__init__()`.
- The **`@support_torch_compile` decorator** must be added to the main model class (e.g., `MyNewModel`).
- The **`load_weights()` method** must correctly handle weight name mapping (`stacked_params_mapping`); see the sketch below.
- **Pipeline parallelism (PP)** requires tools like `PPMissingLayer`, `is_pp_missing_parameter`, etc.
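
A condensed sketch of the `load_weights()` pattern used across vLLM model files, shown here for the hypothetical `MyNewModel`; the exact mapping entries depend on your checkpoint's parameter names, and `Iterable` / `default_weight_loader` come from the template imports above:

```python
# Condensed load_weights() sketch for the hypothetical MyNewModelForCausalLM.
# PP-missing checks, KV-scale remapping, and MoE expert mappings are omitted.
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):
    # (vLLM merged param name, checkpoint param name, shard id)
    stacked_params_mapping = [
        ("qkv_proj", "q_proj", "q"),
        ("qkv_proj", "k_proj", "k"),
        ("qkv_proj", "v_proj", "v"),
        ("gate_up_proj", "gate_proj", 0),
        ("gate_up_proj", "up_proj", 1),
    ]
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        for param_name, weight_name, shard_id in stacked_params_mapping:
            if weight_name not in name:
                continue
            # The checkpoint stores q/k/v separately; vLLM fuses them into qkv_proj
            param = params_dict[name.replace(weight_name, param_name)]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            # Regular (non-stacked) parameter
            param = params_dict[name]
            weight_loader = getattr(param, "weight_loader", default_weight_loader)
            weight_loader(param, loaded_weight)
```
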
---

### Step 2: Register the Model

Add registration code in `vllm_kunlun/models/__init__.py`:

```python
# vllm_kunlun/models/__init__.py

from vllm import ModelRegistry

def register_model():
    # ... Existing model registrations ...

    # ⭐ Add your new model (using the lazy-loading string format)
    ModelRegistry.register_model(
        "MyNewModelForCausalLM",  # ← Must match architectures in config.json
        "vllm_kunlun.models.my_new_model:MyNewModelForCausalLM"  # ← Module path:class name
    )
```

**⚠️ Key Considerations:**

1. The **first parameter** of `register_model()` is the model's `architecture` identifier, which **must exactly match the `"architectures"` field in the HuggingFace model's `config.json`**.
2. Use the **string format** for the module path (`"module:class"`) to implement **lazy loading**, avoiding CUDA initialization conflicts (`RuntimeError: Cannot re-initialize CUDA in forked subprocess`).
3. If the model already exists in vLLM (e.g., `Qwen3ForCausalLM`), the Kunlun version will **overwrite** the original vLLM version upon registration.

---

### Step 3: Verify Registration

#### Case A: Overwriting an Existing vLLM Model Architecture

If your model architecture name (e.g., `"Qwen3ForCausalLM"`) already exists in vLLM, vLLM will output the following log during registration:

```
WARNING [...] Model architecture Qwen3ForCausalLM is already registered,
and will be overwritten by the new model class
vllm_kunlun.models.qwen3:Qwen3ForCausalLM.
```

Seeing this log indicates a successful overwrite ✅.

#### Case B: Brand-New Model Architecture

If you are registering an architecture that does not exist in vLLM, there is no default log confirmation. It is recommended to verify manually during the debugging phase:

```python
from vllm import ModelRegistry

assert "MyNewModelForCausalLM" in ModelRegistry.get_supported_archs()
print("✅ Model registration successful!")
```

---

### Step 4: Testing

#### 4.1 Offline Inference Test

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/MyNewModel",
    trust_remote_code=True,
    dtype="float16",
    tensor_parallel_size=1,  # Verify with a single card first
)

outputs = llm.generate(
    ["Hello, please introduce yourself."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
for output in outputs:
    print(output.outputs[0].text)
```

#### 4.2 Online Service Test

```bash
XPU_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8888 \
    --model /path/to/MyNewModel \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --block-size 64
```

#### 4.3 Accuracy Verification

It is recommended to compare results with HuggingFace Transformers CPU/GPU inference:

```python
# Transformers reference output
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("/path/to/MyNewModel", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("/path/to/MyNewModel")

inputs = tokenizer("Hello, please introduce yourself.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Compare this reference output against the vLLM output above.
```

---

## III. Adaptation Guide for Special Model Types

### 3.1 MoE Models (e.g., Qwen3-MoE, DeepSeek-V3)

**Reference Files:**

- `vllm_kunlun/models/qwen3_moe.py`
- `vllm_kunlun/models/deepseek_v2.py`

**Additional Points:**

- Use `vllm.model_executor.layers.fused_moe.layer.FusedMoE`; Kunlun has already replaced the underlying kernel via the import hook.
- MoE's `load_weights()` is more complex, requiring an expert parameter mapping:

```python
expert_params_mapping = FusedMoE.make_expert_params_mapping(
    ckpt_gate_proj_name="gate_proj",
    ckpt_down_proj_name="down_proj",
    ckpt_up_proj_name="up_proj",
    num_experts=config.n_routed_experts,
)
```

- Recommended environment variables:

```bash
export KUNLUN_USE_MOE_FFN_BLOCK=True
export XPU_USE_MOE_SORTED_THRES=120
```

### 3.2 MLA (Multi-Latent Attention) Models (e.g., DeepSeek-V3)

**Reference File:** `vllm_kunlun/models/deepseek_v2.py`

**MLA Special Handling:**

- KV compression dimensions: `kv_lora_rank`, `qk_nope_head_dim`, `qk_rope_head_dim`.
- The platform layer automatically selects `FlashMLABackend`:

```python
# vllm_kunlun/platforms/kunlun.py
if use_mla:
    if use_sparse:
        return "vllm_kunlun.v1.attention.backends.mla.flashmla_sparse.FlashMLASparseBackend"
    return "vllm_kunlun.v1.attention.backends.mla.flashmla.FlashMLABackend"
```

- `block_size` usually needs to be set to **64**.
- Recommended setting: `export USE_ORI_ROPE=1`.

### 3.3 Multi-modal Models (e.g., Qwen2-VL, InternVL)

**Reference Files:**

- `vllm_kunlun/models/qwen3_vl.py`
- `vllm_kunlun/models/internvl.py`
- `vllm_kunlun/models/interns1.py`

**Additional Components to Implement:**

| Component | Description |
|------|------|
| `SupportsMultiModal` interface | Declares that the model supports multi-modal input |
| Vision encoder | Usually `InternVisionModel` or a custom ViT |
| Projector | Vision → language mapping (e.g., an MLP) |
| `@MULTIMODAL_REGISTRY.register_processor(...)` | Registers the multi-modal processor |
| `BaseMultiModalProcessor` | Handles multi-modal input |
| `BaseProcessingInfo` | Provides processing info |
| `BaseDummyInputsBuilder` | Builds dummy inputs for the profiling phase |

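A skeleton showing how the pieces in this table usually hang together in a vLLM model file; the `MyVL*` class names are hypothetical placeholders, the bodies are elided, and the import paths follow recent vLLM layouts (they may shift between versions):

```python
from torch import nn
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.processing import BaseMultiModalProcessor, BaseProcessingInfo
from vllm.multimodal.profiling import BaseDummyInputsBuilder
from vllm.model_executor.models.interfaces import SupportsMultiModal, SupportsPP

class MyVLProcessingInfo(BaseProcessingInfo):
    ...  # HF processor lookup, supported modality limits

class MyVLDummyInputsBuilder(BaseDummyInputsBuilder[MyVLProcessingInfo]):
    ...  # dummy text + images for the memory-profiling pass

class MyVLMultiModalProcessor(BaseMultiModalProcessor[MyVLProcessingInfo]):
    ...  # maps raw images to model inputs / placeholder tokens

@MULTIMODAL_REGISTRY.register_processor(
    MyVLMultiModalProcessor,
    info=MyVLProcessingInfo,
    dummy_inputs=MyVLDummyInputsBuilder,
)
class MyVLForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsPP):
    ...  # vision encoder + projector + language model, per the table above
```
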
### 3.4 Hybrid Attention Models (e.g., Qwen3-Next)

**Reference File:** `vllm_kunlun/models/qwen3_next.py`

This model contains both **linear attention** and **full attention** layer types:

```python
# Select different attention calculations based on layer_type
if self.layer_type == "linear_attention":
    self.linear_attn(hidden_states=hidden_states, output=self_attention_output)
elif self.layer_type == "full_attention":
    self.self_attn(hidden_states=hidden_states, output=self_attention_output, positions=positions)
```

Note:

- Linear attention uses `GatedDeltaNet` or a similar implementation.
- You need to register a custom `custom_op` (e.g., `vllm.gdn_attention`) for `splitting_ops` in `torch.compile`.

---

## IV. Quantized Model Adaptation

### 4.1 Supported Quantization Methods

| Quantization Method | Adaptation File | Status |
|---------|---------|------|
| **INT8 Dynamic (W8A8)** | `ops/quantization/kernels/kunlun_scale_mm.py` | ✅ Recommended |
| **AWQ (INT4)** | `ops/quantization/awq.py` | ✅ Supported |
| **GPTQ (INT4)** | `ops/quantization/gptq.py` | ✅ Supported |
| **CompressedTensors (INT8 MoE)** | `ops/quantization/compressed_tensors/` | ✅ Supported |
| **FP8** | — | ⚠️ Partial support |
| **bfloat16** | — | ⚠️ Double-VRAM bug |

### 4.2 Special Handling for Quantization

Kunlun chips use the **max value** for scale calculation instead of vLLM's default absmax:

```python
# ops/quantization/kernels/kunlun_scale_mm.py
class KunlunScaledMMLinearKernel(CutlassScaledMMLinearKernel):
    def process_weights_after_loading(self, layer):
        super().process_weights_after_loading(layer)
        # ⭐ Key: multiply the scale by 127.0 to convert it to max format
        with torch.no_grad():
            getattr(layer, self.w_s_name).mul_(127.0)
```

INT4 weights need to be **repacked** into the Kunlun layout order:

```python
# AWQ repack example
# unpacked_awq holds the int4 values already unpacked from the packed AWQ tensor
AWQ_TO_KUNLUN_ORDER_NORMAL = [4, 0, 5, 1, 6, 2, 7, 3]
unpacked_kunlun = unpacked_awq[..., AWQ_TO_KUNLUN_ORDER_NORMAL]
```

---

## V. Custom Operators (if new low-level Ops are needed)

If your model requires new low-level operators:

### 5.1 Wrap kunlun_ops calls in `_kunlun_ops.py`

```python
# vllm_kunlun/ops/_kunlun_ops.py
class KunlunOps:

    @staticmethod
    def my_new_op(input, weight, out):
        """Call the underlying kunlun_ops implementation."""
        kunlun_ops.my_new_op(input, weight, out=out)
```

### 5.2 Register to vLLM in `_custom_ops.py`

Follow the **three-piece pattern**:

```python
# vllm_kunlun/ops/_custom_ops.py
import torch

from vllm.utils import direct_register_custom_op  # import location may vary by vLLM version

from vllm_kunlun.ops._kunlun_ops import KunlunOps

# 1. Define the actual implementation of the op
def my_new_op_impl(input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(input)
    KunlunOps.my_new_op(input, weight, output)
    return output

# 2. Define the fake-tensor implementation (for torch.compile)
def my_new_op_fake(input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return torch.empty_like(input)

# 3. Register
direct_register_custom_op(
    op_name="my_new_op",
    op_func=my_new_op_impl,
    mutates_args=[],
    fake_impl=my_new_op_fake,
)
```

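Once registered, the op is dispatched through PyTorch's custom-op machinery. Assuming vLLM's default `vllm` op library (worth double-checking against your vLLM version), a call site would look like:

```python
# Call the registered op via torch.ops so torch.compile can trace it
output = torch.ops.vllm.my_new_op(input_tensor, weight_tensor)
```
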
---

## VI. Common Pitfalls Checklist

Before submitting a PR, please check each item:

- [ ] **Attention** uses `vllm_kunlun.ops.attention.layer.Attention`?
- [ ] **Activation functions** use `vllm_kunlun.ops.activation.SiluAndMul`, etc.?
- [ ] All submodules in `__init__()` have the `prefix` parameter passed?
- [ ] `load_weights()` correctly handles weight name mapping (`stacked_params_mapping`)?
- [ ] The `@support_torch_compile` decorator is added to the main model class?
- [ ] The first parameter of `ModelRegistry.register_model()` exactly matches `architectures` in `config.json`?
- [ ] No logic depends on the `VLLM_USE_V1` environment variable (deprecated; v0.15.1 is V1-only)?
- [ ] Type annotations use `Optional[T]` instead of `T | None` (to avoid `infer_schema` failures)?
- [ ] Quantized model scales are correctly multiplied by `127.0`?
- [ ] Pipeline parallelism is supported (using `PPMissingLayer`, `is_pp_missing_parameter`)?
- [ ] Ran `pre-commit` format checks?
- [ ] Commits are signed with `-s` (DCO compliance)?

---

## VII. Reference Template Quick Look-up

| Model Type | Best Reference File | Features |
|---------|------------|------|
| Standard dense LLM | `qwen3.py` | Simplest; recommended for beginners |
| Dense LLM (custom embedding) | `seed_oss.py`, `internlm2.py` | Custom VocabParallelEmbedding |
| MoE LLM | `qwen3_moe.py` | FusedMoE + EP + shared expert |
| MLA + MoE (DeepSeek) | `deepseek_v2.py` | MLA attention + MoE + indexer |
| Hybrid attention | `qwen3_next.py` | Linear + full attention |
| Multi-modal (VL) | `qwen3_vl.py`, `internvl.py` | ViT + projector + LLM |
| Speculative decoding (MTP) | `deepseek_mtp.py` | Multi-Token Prediction |

---

## VIII. Debugging Tips

### 8.1 Startup Failure

- **`ModuleNotFoundError`**: check whether the import hook mapping table in `__init__.py` covers the corresponding module.
- **`circular import`**: check whether your new code pulls in heavy dependencies during the `register()` phase.
- **`Model architecture XXX is not supported`**: check whether the first parameter of `register_model()` matches `config.json`.

### 8.2 Abnormal Output

- **Garbage output**: compare with HF Transformers output on CPU (see the sketch below); this is usually an operator precision issue or a weight-loading mapping error.
- **Repeated tokens**: check whether `rotary_embedding` is applied correctly and whether the `is_neox_style` parameter is right.
- **Truncated output**: check the `max_model_len` settings and whether the KV cache is sufficient.

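A minimal comparison harness for the garbage-output case, assuming greedy decoding on both sides so the outputs are directly comparable; the model path and prompt are placeholders, and it may be safer to run the two halves in separate processes so the plugin hooks don't interfere:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

path, prompt = "/path/to/MyNewModel", "Hello, please introduce yourself."

# HF reference (CPU; float32 keeps the reference numerically clean)
tok = AutoTokenizer.from_pretrained(path)
hf_model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
hf_ids = hf_model.generate(**tok(prompt, return_tensors="pt"),
                           max_new_tokens=64, do_sample=False)
print("HF :", tok.decode(hf_ids[0], skip_special_tokens=True))

# vLLM-Kunlun side, also greedy (temperature=0)
llm = LLM(model=path, trust_remote_code=True, dtype="float16")
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=64))
print("XPU:", out[0].outputs[0].text)
```
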
### 8.3 VRAM Issues

- Use `--dtype float16` (avoid bfloat16 due to the double-VRAM bug).
- Set `VLLM_KUNLUN_ENABLE_INT8_BMM=1` (saves ~0.1 GB).
- Lower `--gpu-memory-utilization` (the default is 0.9).
- Use INT8 quantized models.

### 8.4 Weight Loading Failure

```python
# Debugging method: print parameter names for comparison
params_dict = dict(self.named_parameters())
print("=== Model params ===")
for k in sorted(params_dict.keys()):
    print(f"  {k}: {params_dict[k].shape}")

# Print inside load_weights
for name, loaded_weight in weights:
    if name not in params_dict:
        print(f"  ⚠️ Skipped: {name}")
```

### 8.5 Kunlun Graph Failure

Confirm that `splitting_ops` in `compilation-config` includes your attention op name:

```json
{
    "splitting_ops": [
        "vllm.unified_attention",
        "vllm.unified_attention_with_output",
        "vllm.unified_attention_with_output_kunlun",
        "vllm.sparse_attn_indexer_vllm_kunlun"
    ],
    "cudagraph_mode": "PIECEWISE"
}
```

---

## IX. Environment Variables Cheat Sheet

```bash
# === Required ===
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # Kunlun cards to use
export VLLM_HOST_IP=$(hostname -i)           # IP for distributed communication

# === Recommended ===
export XMLIR_FORCE_USE_XPU_GRAPH=1           # Enable XPU Graph acceleration
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false # Disable mock compile
export XMLIR_CUDNN_ENABLED=1                 # Enable cuDNN-equivalent acceleration
export XPU_USE_DEFAULT_CTX=1                 # Default context
export BKCL_FORCE_SYNC=1                     # BKCL forced sync (multi-card stability)

# === Model Specific ===
export USE_ORI_ROPE=1                        # DeepSeek series uses original RoPE
export XFT_USE_FAST_SWIGLU=1                 # Fast SwiGLU activation
export XPU_USE_FAST_SWIGLU=1                 # Same as above (some versions)
export XPU_USE_MOE_SORTED_THRES=120          # MoE sorting threshold
export KUNLUN_USE_MOE_FFN_BLOCK=True         # MoE FFN block optimization

# === Optional Tuning ===
export VLLM_KUNLUN_ENABLE_INT8_BMM=1         # Enable INT8 BMM (saves ~0.1 GB)
```

---

## X. PR Submission Standards

### 10.1 Branch Naming

```
feature/add-my-new-model
bugfix/fix-attention-output
```

### 10.2 Commit Message Prefix

| Prefix | Description |
|------|------|
| `[Feature]` | New functionality / new model |
| `[Bugfix]` | Bug fix |
| `[CI/Build]` | CI / build related |
| `[Doc]` | Documentation update |
| `[Misc]` | Others |

### 10.3 Before Submission

```bash
# 1. Install pre-commit
pre-commit install

# 2. Run checks
pre-commit run --all-files

# 3. Signed commit (DCO compliance)
git commit -s -m "[Feature] Add MyNewModel support for Kunlun"
```

### 10.4 PR Checklist

- [ ] Code passes `pre-commit` checks.
- [ ] Single-card offline inference test passed.
- [ ] Multi-card TP test passed (if applicable).
- [ ] Quantized model test passed (if applicable).
- [ ] Updated the registration in `vllm_kunlun/models/__init__.py`.
- [ ] Updated the supported models list in the README (if applicable).

---

## Appendix: Standard Startup Command Templates

### A. Standard Dense Model (Single Card)

```bash
XPU_VISIBLE_DEVICES=0 \
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8888 \
    --model /path/to/model \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 8192 \
    --block-size 64
```

### B. MoE Model (8-card TP)

```bash
XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
XMLIR_FORCE_USE_XPU_GRAPH=1 \
KUNLUN_USE_MOE_FFN_BLOCK=True \
XPU_USE_MOE_SORTED_THRES=120 \
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8888 \
    --model /path/to/moe-model-int8 \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --max_num_seqs 4 \
    --block-size 64 \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --no-enable-prefix-caching
```

### C. DeepSeek-V3 (MLA + MoE, W8A8)

```bash
XMLIR_ENABLE_MOCK_TORCH_COMPILE=false \
USE_ORI_ROPE=1 \
XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8806 \
    --model /path/to/DeepSeek-V3-w8a8 \
    --gpu-memory-utilization 0.98 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --dtype float16 \
    --max_num_seqs 4 \
    --block-size 64 \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --no-enable-prefix-caching
```

---

> 📝 **Document Maintenance**: If you have questions or suggestions, please provide feedback in [GitHub Issues](https://github.com/baidu/vLLM-Kunlun/issues).