diff --git a/docs/source/developer_guide/modeling/adding_a_new_model.md b/docs/source/developer_guide/modeling/adding_a_new_model.md new file mode 100644 index 0000000..d883458 --- /dev/null +++ b/docs/source/developer_guide/modeling/adding_a_new_model.md @@ -0,0 +1,258 @@ +# Adding a New Model + +This guide demonstrates how to integrate a novel or customized model into vllm-ascend. For foundational concepts, it is highly recommended to refer to +[vllm official doc: Adding a New Model](https://docs.vllm.ai/en/stable/contributing/model/) first. + +## Step 1: Implementing Models with `torch` and `torch_npu` + +This section provides instructions for implementing new models compatible with vllm and vllm-ascend. + +**Before starting:** + +- Verify whether your model already exists in vllm's [models](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) directory. +- Use existing models' implementation as templates to accelerate your development. + +### Method 1: Implementing New Models from Scratch + +Follow vllm's [OPT model adaptation](https://docs.vllm.ai/en/stable/contributing/model/basic.html) example for guidance. + +**Key implementation requirements:** + +1. Place model files in `vllm_ascend/models/` directory. + +2. Standard module structure for decoder-only LLMs (please checkout vllm's implementations for other kinds of model): + +- `*ModelForCausalLM` (top-level wrapper) +- `*Model` (main architecture) +- `*DecoderLayer` (transformer block) +- `*Attention` and `*MLP` (specific computation unit) + +:::{note} +`*` denotes your model's unique identifier. +::: + +3. Critical Implementation Details: + +All modules must include a `prefix` argument in `__init__()`. + +**Required interfaces:** + +| Module Type | Required Methods | +| :------------------- | :---------------------------------------- | +| `*ModelForCausalLM` | `get_input_embeddings`, `compute_logits`, `load_weights` | +| `*Model` | `get_input_embeddings`, `load_weights` | + +4. Attention Backend Integration: + +Importing attention via `from vllm.attention import Attention` can automatically leverage the attention backend routing of vllm-ascend (see: `get_attn_backend_cls()` in `vllm_ascend/platform.py`). + +5. Tensor Parallelism: + +Use vllm's parallel layers (`ColumnParallelLinear`, `VocabParallelEmbedding`, etc.) to implement models supporting tensor parallelism. Note that Ascend-specific customizations are implemented in `vllm_ascend/ops/` directory (RMSNorm, VocabParallelEmbedding, etc.). + +**Reference Implementation Template** (assumed path: `vllm_ascend/models/custom_model.py`): + +```python +from collections.abc import Iterable +from typing import Optional, Union + +import torch +from torch import nn +from vllm.attention import Attention +from vllm.config import VllmConfig +from vllm.sequence import IntermediateTensors +from vllm.model_executor.sampling_metadata import SamplingMetadata + +class CustomAttention(nn.Module): + def __init__(self, vllm_config: VllmConfig, prefix: str): + super().__init__() + self.attn = Attention(prefix=f"{prefix}.attn") + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + # Implement attention logic + ... + +class CustomDecoderLayer(nn.Module): + def __init__(self, vllm_config: VllmConfig, prefix: str): + super().__init__() + self.self_attn = CustomAttention(vllm_config, prefix=f"{prefix}.self_attn") + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + # Implement decoder layer + ... + +class CustomModel(nn.Module): + def __init__(self, vllm_config: VllmConfig, prefix: str): + super().__init__() + self.layers = nn.ModuleList([ + CustomDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") + for i in range(vllm_config.model_config.hf_config.num_hidden_layers) + ]) + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + ... + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + ... + + def load_weights(self, + weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: + ... + +class CustomModelForCausalLM(nn.Module): + def __init__(self, vllm_config: VllmConfig, prefix: str = ""): + super().__init__() + self.model = CustomModel(vllm_config, prefix=f"{prefix}.model") + + def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor: + ... + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + ... + + def compute_logits(self, + hidden_states: torch.Tensor, + sampling_metadata: SamplingMetadata) -> torch.Tensor: + ... + + def load_weights(self, + weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: + ... +``` + +### Method 2: Customizing Existing vLLM Models + +For most use cases, extending existing implementations is preferable. We demonstrate an example to inherit from base classes and implement a custom deepseek model below (assumed path: `vllm_ascend/models/deepseek_v2.py`). + +```python +from typing import List, Optional +import torch +from vllm.attention import AttentionMetadata +from vllm.model_executor.models.deepseek_v2 import DeepseekV2ForCausalLM +from vllm.sequence import IntermediateTensors + +class CustomDeepseekV2ForCausalLM(DeepseekV2ForCausalLM): + # Define merged weights for quantization/efficiency + packed_modules_mapping = { + "gate_up_proj": ["gate_proj", "up_proj"], + "experts": ["experts.0.gate_proj", "experts.0.up_proj", "experts.0.down_proj"] + } + + def forward( + self, + input_ids: torch.Tensor, + positions: torch.Tensor, + kv_caches: Optional[List[torch.Tensor]] = None, + attn_metadata: Optional[AttentionMetadata] = None, + intermediate_tensors: Optional[IntermediateTensors] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ) -> Union[torch.Tensor, IntermediateTensors]: + # Custom forward logic + hidden_states = self.model( + input_ids, + positions, + kv_caches, + attn_metadata, + intermediate_tensors, + inputs_embeds + ) + return hidden_states +``` + +:::{note} +For a complete implementation reference, see: `vllm_ascend/models/deepseek_v2.py`. +::: + +## Step 2: Registering Custom Models using ModelRegistry Plugins in vLLM + +vllm provides a plugin mechanism for registering externally implemented models without modifying its codebase. + +To integrate your implemented model from `vllm_ascend/models/` directory: + +1. Import your model implementation in `vllm_ascend/models/__init__.py` using relative imports. +2. Register the model wrapper class via `vllm.ModelRegistry.register_model()` function. + +**Reference Registration Template** (an example of registering new models in `vllm_ascend/models/__init__.py`): + +```python +from vllm import ModelRegistry + +def register_model(): + from .custom_model import CustomModelForCausalLM # New custom model + from .deepseek_v2 import ModifiedDeepseekV2ForCausalLM # Customized Deepseek + + # For NEW architectures: Register with unique name + ModelRegistry.register_model( + "CustomModelForCausalLM", # Must match config.json's 'architectures' + "vllm_ascend.models.custom_model:CustomModelForCausalLM" + ) + + # For MODIFIED architectures: Use original name + ModelRegistry.register_model( + "DeepseekV2ForCausalLM", # Original architecture identifier in vLLM + "vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM " + ) +``` + +:::{note} +The first argument of `vllm.ModelRegistry.register_model()` indicates the unique architecture identifier which must match `architectures` in `config.json` of the model. + +```json +{ + "architectures": [ + "CustomModelForCausalLM" + ], +} +``` +::: + +## Step 3: Verification + +### Case 1: Overriding Existing vLLM Model Architecture + +If you're registering a customized model architecture based on vllm's existing implementation (overriding vllm's original class), when executing vllm offline/online inference (using any model), you'll observe warning logs similar to the following output from `vllm/models_executor/models/registry.py`. + +```bash +Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend/models/deepseek_v2:CustomDeepseekV2ForCausalLM. +``` + +### Case 2: Registering New Model Architecture + +If you're registering a novel model architecture not present in vllm (creating a completely new class), current logs won't provide explicit confirmation by default. It's recommended to add the following logging statement at the end of the `register_model` method in `vllm/models_executor/models/registry.py`. + +```python +logger.info(f"model_arch: {model_arch} has been registered here!") +``` + +After adding this line, you will see confirmation logs shown below when running vllm offline/online inference (using any model). + +```bash +model_arch: CustomModelForCausalLM has been registered here! +``` + +This log output confirms your novel model architecture has been successfully registered in vllm. + +## Step 4: Testing + +After adding a new model, we should do basic functional test (offline/online inference), accuracy test and performance benchmark for the model. + +Find more details at: + +- [Accuracy test guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/evaluation/index.html) +- [Performance benchmark guide](https://vllm-ascend.readthedocs.io/en/latest/developer_guide/performance/performance_benchmark.html) + +## Step 5: Updating Supported Models Doc + +At last, if all the steps above are completed, you should add the new model into our [Supported Models](https://vllm-ascend.readthedocs.io/en/latest/user_guide/supported_models.html) doc. diff --git a/docs/source/developer_guide/modeling/adding_a_new_multimodal_model.md b/docs/source/developer_guide/modeling/adding_a_new_multimodal_model.md new file mode 100644 index 0000000..03fadf3 --- /dev/null +++ b/docs/source/developer_guide/modeling/adding_a_new_multimodal_model.md @@ -0,0 +1,3 @@ +# Adding a New Multi-Modal Model + +**_Comming soon ..._** diff --git a/docs/source/developer_guide/modeling/index.md b/docs/source/developer_guide/modeling/index.md new file mode 100644 index 0000000..96eeb50 --- /dev/null +++ b/docs/source/developer_guide/modeling/index.md @@ -0,0 +1,10 @@ +# Modeling + +This section provides tutorials of how to implement and register a new model into vllm-ascend. + +:::{toctree} +:caption: Modeling +:maxdepth: 1 +adding_a_new_model +adding_a_new_multimodal_model +::: diff --git a/docs/source/index.md b/docs/source/index.md index 3a02092..d9756d7 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -61,6 +61,7 @@ developer_guide/contribution/index developer_guide/feature_guide/index developer_guide/evaluation/index developer_guide/performance/index +developer_guide/modeling/index ::: % How to involve vLLM Ascend