# Adding a New Model
This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to read the
[vLLM official doc: Adding a New Model](https://docs.vllm.ai/en/stable/contributing/model/) first.
## Step 1: Implementing Models with `torch` and `torch_npu`
This section provides instructions for implementing new models compatible with vLLM and vllm-ascend.
**Before starting:**
- Verify whether your model already exists in vLLM's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory.
- Use existing models' implementations as templates to accelerate your development.
### Method 1: Implementing New Models from Scratch
Follow vLLM's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance.
**Key implementation requirements:**
1. Place model files in the `vllm_ascend/models/` directory.
2. Standard module structure for decoder-only LLMs (please check out vLLM's implementations for other kinds of models):
- `*ModelForCausalLM` (top-level wrapper)
- `*Model` (main architecture)
- `*DecoderLayer` (transformer block)
- `*Attention` and `*MLP` (specific computation units)
:::{note}
`*` denotes your model's unique identifier.
:::
3. Critical implementation details:
All modules must include a `prefix` argument in `__init__()`.
**Required interfaces:**
| Module Type | Required Methods |
| :------------------- | :---------------------------------------- |
| `*ModelForCausalLM` | `get_input_embeddings`, `compute_logits`, `load_weights` |
| `*Model` | `get_input_embeddings`, `load_weights` |
4. Attention backend integration:
Importing attention via `from vllm.attention import Attention` automatically leverages the attention backend routing of vllm-ascend (see `get_attn_backend_cls()` in `vllm_ascend/platform.py`).
5. Tensor parallelism:
Use vLLM's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations (RMSNorm, VocabParallelEmbedding, etc.) are implemented in the `vllm_ascend/ops/` directory.
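How these parallel layers shard weights can be illustrated with a plain numpy sketch (a toy example with two hypothetical ranks, not vLLM code): a column-parallel layer splits its weight along the output dimension, a row-parallel layer splits along the input dimension, and the row-parallel partial results are summed across ranks (the all-reduce).

```python
import numpy as np

# Toy illustration of tensor parallelism across 2 hypothetical ranks.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4))
w1 = rng.standard_normal((4, 8))  # column-parallel: shard the output dim
w2 = rng.standard_normal((8, 4))  # row-parallel: shard the input dim

# Single-device reference computation.
full = x @ w1 @ w2

# Rank i holds w1[:, shard_i] and the matching rows w2[shard_i, :].
h0, h1 = x @ w1[:, :4], x @ w1[:, 4:]
# Row-parallel partial outputs are summed across ranks (the all-reduce).
partial = h0 @ w2[:4, :] + h1 @ w2[4:, :]

assert np.allclose(full, partial)
```

Because the column shards feed directly into the matching row shards, no communication is needed between the two layers; only one all-reduce is required at the end.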
**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`):
```python
from collections.abc import Iterable
from typing import Optional, Union

import torch
from torch import nn
from vllm.attention import Attention
from vllm.config import VllmConfig
from vllm.model_executor.sampling_metadata import SamplingMetadata
from vllm.sequence import IntermediateTensors


class CustomAttention(nn.Module):

    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.attn = Attention(prefix=f"{prefix}.attn")

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Implement attention logic
        ...


class CustomDecoderLayer(nn.Module):

    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn")

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Implement decoder layer
        ...


class CustomModel(nn.Module):

    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.layers = nn.ModuleList([
            CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}")
            for i in range(vllm_config.model_config.hf_config.num_hidden_layers)
        ])

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...

    def load_weights(self,
                     weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        ...


class CustomModelForCausalLM(nn.Module):

    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        self.model = CustomModel(vllm_config, prefix=f"{prefix}.model")

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        ...

    def compute_logits(self,
                       hidden_states: torch.Tensor,
                       sampling_metadata: SamplingMetadata) -> torch.Tensor:
        ...

    def load_weights(self,
                     weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        ...
```
### Method 2: Customizing Existing vLLM Models
For most use cases, extending existing implementations is preferable. Below, we demonstrate how to inherit from vLLM's base class and implement a custom DeepSeek model (assumed path: `vllm_ascend/models/deepseek_v2.py`).
```python
from typing import List, Optional, Union

import torch
from vllm.attention import AttentionMetadata
from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM
from vllm.sequence import IntermediateTensors


class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM):
    # Define merged weights for quantization/efficiency
    packed_modules_mapping = {
        "gate_up_proj": ["gate_proj", "up_proj"],
        "experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"]
    }

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        kv_caches: Optional[List[torch.Tensor]] = None,
        attn_metadata: Optional[AttentionMetadata] = None,
        intermediate_tensors: Optional[IntermediateTensors] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, IntermediateTensors]:
        # Custom forward logic
        hidden_states = self.model(
            input_ids,
            positions,
            kv_caches,
            attn_metadata,
            intermediate_tensors,
            inputs_embeds
        )
        return hidden_states
```
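What `packed_modules_mapping` describes can be sketched with plain numpy (a toy illustration with hypothetical sizes, not vLLM's loader code): the separate `gate_proj` and `up_proj` checkpoint weights are copied into disjoint slices of a single fused `gate_up_proj` parameter, so both projections run as one matmul.

```python
import numpy as np

# Toy illustration of fused weight loading (hypothetical sizes).
hidden, inter = 4, 8
gate_w = np.ones((inter, hidden))     # gate_proj weight from checkpoint
up_w = np.full((inter, hidden), 2.0)  # up_proj weight from checkpoint

# The fused parameter stacks both shards along the output dimension.
gate_up_w = np.empty((2 * inter, hidden))
gate_up_w[:inter] = gate_w            # shard 0 <- gate_proj
gate_up_w[inter:] = up_w              # shard 1 <- up_proj

x = np.ones((1, hidden))
fused_out = x @ gate_up_w.T           # one matmul instead of two
gate_out, up_out = fused_out[:, :inter], fused_out[:, inter:]
assert np.allclose(gate_out, x @ gate_w.T)
assert np.allclose(up_out, x @ up_w.T)
```

The mapping tells the weight loader which checkpoint tensors belong to which slice of the fused parameter, which is what makes quantization and efficient loading work for merged layers.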
:::{note}
For a complete implementation reference, see `vllm_ascend/models/deepseek_v2.py`.
:::
## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM
vLLM provides a plugin mechanism for registering externally implemented models without modifying the codebase.
To integrate your implemented model from the `vllm_ascend/models/` directory:
1. Import your model implementation in `vllm_ascend/models/__init__.py` using relative imports.
2. Register the model wrapper class via the `vllm.ModelRegistry.register_model()` function.
**Reference Registration Template** (an example of registering new models in `vllm_ascend/models/__init__.py`):
```python
from vllm import ModelRegistry


def register_model():
    from .custom_model import CustomModelForCausalLM  # New custom model
    from .deepseek_v2 import CustomDeepseekV2ForCausalLM  # Customized DeepSeek

    # For NEW architectures: Register with unique name
    ModelRegistry.register_model(
        "CustomModelForCausalLM",  # Must match config.json's 'architectures'
        "vllm_ascend.models.custom_model:CustomModelForCausalLM"
    )

    # For MODIFIED architectures: Use original name
    ModelRegistry.register_model(
        "DeepseekV2ForCausalLM",  # Original architecture identifier in vLLM
        "vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM"
    )
```
:::{note}
The first argument of `vllm.ModelRegistry.register_model()` is the unique architecture identifier, which must match the `architectures` field in the model's `config.json`.
```json
{
"architectures": [
"CustomModelForCausalLM"
]
}
```
:::
## Step 3: Verification
### Case 1: Overriding Existing vLLM Model Architectures
If you're registering a customized model architecture based on vLLM's existing implementation (overriding vLLM's original class), you'll observe a warning log similar to the following from `vllm/model_executor/models/registry.py` when executing vLLM offline/online inference (with any model).
```bash
Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
```
### Case 2: Registering New Model Architectures
If you're registering a novel model architecture not present in vLLM (creating a completely new class), the logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/model_executor/models/registry.py`.
```python
logger.info(f"model_arch: {model_arch} has been registered here!")
```
After adding this line, you will see a confirmation log like the one below when running vLLM offline/online inference (with any model).
```bash
model_arch: CustomModelForCausalLM has been registered here!
```
This log output confirms your novel model architecture has been successfully registered in vLLM.
## Step 4: Testing
After adding a new model, you should run basic functional tests (offline/online inference), an accuracy test, and a performance benchmark for the model.
Find more details at:
- [Accuracy test guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/evaluation/index.html)
- [Performance benchmark guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/performance/performance_benchmark.html)
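A minimal offline-inference smoke test can look like the sketch below. The model path is a hypothetical placeholder, and running it requires an Ascend environment with vllm and vllm-ascend installed; a coherent completion indicates the model loads and runs end to end.

```python
from vllm import LLM, SamplingParams

# Hypothetical path to the checkpoint of the newly added model.
llm = LLM(model="/path/to/custom-model")
prompts = ["Hello, my name is"]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))
for output in outputs:
    print(output.outputs[0].text)
```

Greedy sampling (`temperature=0.0`) makes the output deterministic, which is convenient when comparing against a reference run on another backend.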
## Step 5: Updating Supported Models Doc
At last, once all the steps above are completed, add the new model to our [Supported Models](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html) doc.