[doc] update developer guide (#5060)

Update the developer docs for v0.11.0-dev. This PR mainly picks the developer docs from main to v0.11.0-dev. All related features already work with 0.11.0.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
This commit is contained in:
wangxiyuan
2025-12-16 14:09:52 +08:00
committed by GitHub
parent e07abfaa75
commit 11e6d6c291
13 changed files with 549 additions and 3 deletions


BIN
docs/source/assets/eplb.png Normal file


View File

@@ -0,0 +1,83 @@
# KV Cache Pool
## Why KV Cache Pool?
Prefix caching is an important feature in LLM inference that can reduce prefill computation time drastically.
However, the performance gain from prefix caching is highly dependent on cache hit rate, while cache hit rate can be limited if one only uses HBM for kv cache storage.
Hence, the KV Cache Pool is proposed to utilize various types of storage, including HBM, DRAM, and SSD, to form a pool for KV cache storage, while making request prefixes visible across all nodes, which increases the cache hit rate for all requests.
vLLM Ascend currently supports [MooncakeStore](https://github.com/kvcache-ai/Mooncake), one of the most recognized KV cache storage engines.
While one can use Mooncake Store in the vLLM V1 engine by setting it as a remote backend of LMCache with GPU (see [Tutorial](https://github.com/LMCache/LMCache/blob/dev/examples/kv_cache_reuse/remote_backends/mooncakestore/README.md)), we find it better to integrate a connector that directly supports Mooncake Store and can adapt the data transfer strategy to one that best fits Huawei NPU hardware.
Hence, we propose to integrate Mooncake Store with a brand new **MooncakeStoreConnectorV1**, which is largely inspired by **LMCacheConnectorV1** (see the `How is MooncakestoreConnectorV1 Implemented?` section).
## Usage
vLLM Ascend currently supports Mooncake Store for the KV Cache Pool. To enable Mooncake Store, one needs to configure `kv-transfer-config` and choose `MooncakeStoreConnector` as the KV connector.
For step-by-step deployment and configuration, please refer to the [KV Pool User Guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html).
## How it works?
The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) through a connector-based architecture.
Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
### 1. Combining KV Cache Pool with HBM Prefix Caching
Prefix Caching with HBM is already supported by the vLLM V1 Engine.
By introducing KV Connector V1, users can seamlessly combine HBM-based Prefix Caching with Mooncake-backed KV Pool.
The user can enable both features simply by enabling Prefix Caching, which is on by default in vLLM V1 unless the `--no_enable_prefix_caching` flag is set, and setting up the KV connector for the KV Pool (e.g. the MooncakeStoreConnector).
**Workflow**:
1. The engine first checks for prefix hits in the HBM cache.
2. After getting the number of hit tokens on HBM, it queries the KV Pool via the connector. If there are additional hits in the KV Pool, we get the **additional blocks only** from the KV Pool and get the rest of the blocks directly from HBM to minimize data transfer latency.
3. After the KV caches in the KV Pool are loaded into HBM, the remaining process is the same as Prefix Caching in HBM.
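The two-tier hit combination in steps 1-3 can be sketched as follows (a simplified illustration with assumed helper names, not the actual connector code):

```python
def plan_kv_loads(num_prompt_blocks, hbm_hit_blocks, pool_hit_blocks):
    """Return (blocks served from HBM, extra blocks to fetch from the pool).

    Only blocks beyond the HBM hit prefix are fetched from the pool, which
    minimizes data transfer latency.
    """
    hbm_hit_blocks = min(hbm_hit_blocks, num_prompt_blocks)
    pool_hit_blocks = min(pool_hit_blocks, num_prompt_blocks)
    extra_from_pool = max(pool_hit_blocks - hbm_hit_blocks, 0)
    return hbm_hit_blocks, extra_from_pool

# Example: 10 prompt blocks, 4 hit in HBM, 7 hit in the pool ->
# serve 4 blocks from HBM and fetch only the 3 additional blocks from the pool.
```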
### 2. Combining KV Cache Pool with Mooncake PD Disaggregation
When used together with Mooncake PD (Prefill-Decode) Disaggregation, the KV Cache Pool can further decouple prefill and decode stages across devices or nodes.
Currently, we only perform put and get operations on the KV Pool for **Prefill nodes**; Decode nodes get their KV cache from the Mooncake P2P KV connector, i.e. MooncakeConnector.
The key benefit of this is that Prefill nodes keep the performance gain of computing less, thanks to Prefix Caching from HBM and the KV Pool, while the data transfer efficiency between Prefill and Decode nodes is not sacrificed, since the P2P KV connector transfers KV caches between NPU devices directly.
To enable this feature, we need to set up both the Mooncake Connector and the Mooncake Store Connector with a Multi Connector, a KV connector class provided by vLLM that can call multiple KV connectors in a specific order.
For details, please also refer to the Mooncake Connector Store Deployment Guide.
## How is MooncakestoreConnectorV1 Implemented?
**MooncakestoreConnectorV1** inherits from the KV Connector V1 base class in vLLM V1: by implementing the required methods defined in the KV Connector V1 base class, one can integrate a third-party KV cache transfer/storage backend into the vLLM framework.
MooncakeStoreConnectorV1 is also largely inspired by LMCacheConnectorV1 in terms of the `Lookup Engine`/`Lookup Client` design for looking up KV cache keys, the `ChunkedTokenDatabase` class for processing tokens into prefix-aware hashes, and other hashing-related designs. On top of this, we have also added our own designs, including a `KVTransferThread` that allows async `get` and `put` of KV caches with multi-threading, and NPU-related data transfer optimizations such as removing the `LocalBuffer` in LMCache to eliminate redundant data transfer.
The KV Connector methods that need to be implemented can be categorized into scheduler-side methods that are called in V1 scheduler and worker-side methods that are called in V1 worker, namely:
### KV Connector Scheduler-Side Methods:
`get_num_new_matched_tokens`: Get prefix cache hit in number of tokens through looking up into the KV pool.
`update_states_after_alloc`: Update KVConnector state after temporary buffer alloc.
`build_connector_meta`: Attach the connector metadata to the request object.
`request_finished`: Once a request is finished, determine whether request blocks should be freed now or will be sent asynchronously and freed later.
### Connector Worker-Side Methods:
`register_kv_caches`: Register KV cache buffers needed for KV cache transfer.
`start_load_kv`: Perform KV cache load operation that transfers KV cache from storage to device.
`wait_for_layer_load`: Optional; Wait for layer load in layerwise + async KV load scenario.
`save_kv_layer`: Optional; do layerwise KV cache put into the KV Pool.
`wait_for_save`: Wait for KV Save to finish if async KV cache save/put.
`get_finished`: Get requests that have finished KV transfer: `done_sending` if `put` finished, `done_recving` if `get` finished.
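The scheduler-side / worker-side split above can be sketched as a skeleton (class name, signatures, and return shapes here are simplified assumptions; the real methods follow vLLM's KV Connector V1 base class):

```python
class MooncakeStoreConnectorSketch:
    """Sketch of the scheduler-side / worker-side split described above."""

    # ---- scheduler side ----
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        # Look up the KV pool and report extra prefix-hit tokens.
        return 0

    def update_states_after_alloc(self, request, blocks):
        # Record which blocks were allocated for the pending load.
        pass

    def build_connector_meta(self, scheduler_output):
        # Attach connector metadata consumed by the worker side.
        return {"loads": [], "saves": []}

    def request_finished(self, request):
        # Return True if blocks are sent asynchronously and freed later.
        return False

    # ---- worker side ----
    def register_kv_caches(self, kv_caches):
        self.kv_caches = kv_caches

    def start_load_kv(self, metadata):
        # Trigger async get() from the pool into device memory.
        pass

    def wait_for_save(self):
        # Block until async put() completes.
        pass

    def get_finished(self):
        return set(), set()  # (done_sending, done_recving)
```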
## DFX
1. When looking up a key in the KV Pool, if we cannot find the key, there is no cache hit for this specific block; we return no hit for this block and do not look up further blocks for the current request.
2. Similarly, when we try to put a block into the KV Pool and fail, we do not put further blocks (subject to change).
## Limitation
1. Currently, Mooncake Store for vLLM Ascend only supports DRAM as the storage for the KV cache pool.
2. For now, if we successfully looked up a key and found that it exists, but failed to get it when calling the KV Pool's get function, we only output a log indicating that the get operation failed and keep going; hence, the accuracy of that specific request may be affected. We will handle this situation by falling back the request and re-computing everything assuming there is no prefix cache hit (or, even better, reverting only one block and keeping the prefix caches before it).

View File

@@ -55,7 +55,6 @@ There are mainly three types of variables.
**Note**: Both of these tables come from the `_update_states` method before **preparing inputs**. You can take a look if you need more inspiration.
### Tips
What is `Token ID`?
Simply put, a `token ID` is an **integer** (usually `int32`), which represents a token.
Example of `Token ID`:

View File

@@ -0,0 +1,109 @@
# Multi Token Prediction (MTP)
## Why We Need MTP
MTP boosts inference performance by parallelizing the prediction of multiple tokens, shifting from single-token to multi-token generation. This approach significantly increases generation throughput and achieves multiplicative acceleration in inference speed—all without compromising output quality.
## How to Use MTP
To enable MTP for DeepSeek-V3 models, add the following parameter when starting the service:
--speculative_config '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}'
- `num_speculative_tokens`: The number of speculative tokens, which enables the model to predict multiple tokens at once. It defaults to the number in the draft model config if present; otherwise it is required.
- `disable_padded_drafter_batch`: Disable input padding for speculative decoding. If set to `true`, speculative input batches can contain sequences of different lengths, which may only be supported by certain attention backends. This currently only affects the MTP method of speculation; the default is `false`.
## How It Works
### Module Architecture
```
vllm_ascend
├── sample
│ ├── rejection_sample.py
├── spec_decode
│ ├── mtp_proposer.py
└───────────
```
**1. sample**
- *rejection_sample.py*: During decoding, the main model processes the previous round's output token and the predicted tokens together (computing 1+k tokens simultaneously). The first token is always correct, while the second token, referred to as the **bonus token**, is uncertain since it is derived from speculative prediction; thus we employ a **Greedy Strategy** and a **Rejection Sampling Strategy** to determine whether the bonus token should be accepted. The module structure consists of an `AscendRejectionSampler` class with a forward method that implements the specific sampling logic.
```
rejection_sample.py
├── AscendRejectionSampler
│ ├── forward
```
**2. spec_decode**
This section encompasses the model preprocessing for spec-decode, primarily structured as follows: it includes loading the model, executing a dummy run, and generating token IDs. These steps collectively form the model data construction and forward invocation for a single spec-decode operation.
- *mtp_proposer.py*: Configures vLLM Ascend to use speculative decoding where proposals are generated by the DeepSeek MTP layer.
```
mtp_proposer.py
├── Proposer
│ ├── load_model
│ ├── dummy_run
│ ├── generate_token_ids
│ ├── _prepare_inputs
│ ├── _propose
```
### Algorithm
**1. Reject_Sample**
- *Greedy Strategy*
Verify whether the token generated by the main model matches the speculative token predicted by MTP in the previous round. If they match exactly, accept the bonus token; otherwise, reject it and any subsequent tokens derived from that speculation.
- *Rejection Sampling Strategy*
This method introduces stochasticity in rejection sampling.
For each draft token, acceptance is determined by verifying whether the inequality `P_target / P_draft ≥ U` holds, where `P_target` represents the probability assigned to the current draft token by the target model, `P_draft` denotes the probability assigned by the draft model, and `U` is a random number sampled uniformly from the interval [0, 1).
The decision logic for each draft token is as follows: if the inequality `P_target / P_draft ≥ U` holds, the draft token is accepted as output; conversely, if `P_target / P_draft < U`, the draft token is rejected.
When a draft token is rejected, a recovery sampling process is triggered where a "recovered token" is resampled from the adjusted probability distribution defined as `Q = max(P_target - P_draft, 0)`. In the current MTP implementation, since `P_draft` is not provided and defaults to 1, the formulas simplify such that token acceptance occurs when `P_target ≥ U`, and the recovery distribution becomes `Q = max(P_target - 1, 0)`.
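The acceptance rule above can be sketched as follows (an assumed helper, not the `AscendRejectionSampler` API; the seeded generator is only for a deterministic illustration):

```python
import random

def accept_draft_tokens(p_target, p_draft=None, rng=None):
    """Accept each draft token while P_target / P_draft >= U (U uniform in
    [0, 1)); stop at the first rejection, discarding all later tokens."""
    rng = rng or random.Random(0)
    accepted = 0
    for i, p_t in enumerate(p_target):
        # In the current MTP implementation P_draft is not provided and
        # defaults to 1, so the rule simplifies to P_target >= U.
        p_d = p_draft[i] if p_draft is not None else 1.0
        u = rng.random()
        if p_t / p_d >= u:
            accepted += 1
        else:
            break  # reject this token and all subsequent speculative tokens
    return accepted
```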
**2. Performance**
If the bonus token is accepted, the MTP model performs inference for (num_speculative_tokens + 1) tokens, including the original main model output token and the bonus token. If rejected, inference is performed for fewer tokens, depending on how many tokens were accepted.
## DFX
### Method Validation
- Currently, the spec_decode scenario only supports methods such as ngram, eagle, eagle3, and mtp. If an incorrect parameter is passed for the method, the code will raise an error to alert the user that an incorrect method was provided.
```
def get_spec_decode_method(method, vllm_config, device, runner):
    if method == "ngram":
        return NgramProposer(vllm_config, device, runner)
    elif method in ["eagle", "eagle3"]:
        return EagleProposer(vllm_config, device, runner)
    elif method == 'mtp':
        return MtpProposer(vllm_config, device, runner)
    else:
        raise ValueError("Unknown speculative decoding method: "
                         f"{method}")
```
### Integer Validation
- The current npu_fused_infer_attention_score operator only supports integers less than 16 per decode round. Therefore, the maximum supported value for MTP is 15. If a value greater than 15 is provided, the code will raise an error and alert the user.
```
if self.speculative_config:
    spec_token_num = self.speculative_config.num_speculative_tokens
    self.decode_threshold += spec_token_num
    assert self.decode_threshold <= 16, f"decode_threshold exceeded \
        npu_fused_infer_attention_score TND layout's limit of 16, \
        got {self.decode_threshold}"
```
## Limitation
- Due to the fact that only a single layer of weights is exposed in DeepSeek's MTP, the accuracy and performance are not effectively guaranteed in scenarios where MTP > 1 (especially MTP ≥ 3). Moreover, due to current operator limitations, MTP supports a maximum of 15.
- In the fullgraph mode with MTP > 1, the capture size of each aclgraph must be an integer multiple of (num_speculative_tokens + 1).

View File

@@ -0,0 +1,25 @@
# Adding a custom aclnn operation
This document describes how to add a custom aclnn operation to vllm-ascend.
## How custom aclnn operation works in vllm-ascend?
Custom aclnn operations are built and installed into `vllm_ascend/cann_ops_custom` directory during the build process of vllm-ascend. Then the aclnn operators are bound to `torch.ops._C_ascend` module, enabling users to invoke them in vllm-ascend python code.
To enable custom operations, use the following code:
```python
from vllm_ascend.utils import enable_custom_op
enable_custom_op()
```
## How to add a custom aclnn operation?
- Create a new operation folder under `csrc` directory
- Create `op_host` and `op_kernel` directories for host and kernel source code
- Add build options in `csrc/build_aclnn.sh` for supported SOC. Note that multiple ops should be separated with `;`, i.e. `CUSTOM_OPS=op1;op2;op3`
- Bind aclnn operators to torch.ops._C_ascend module in `csrc/torch_binding.cpp`
- Write a meta implementation in `csrc/torch_binding_meta.cpp` so that the op can be captured into an aclgraph
After a successful build of vllm-ascend, the custom aclnn operation can be invoked in python code.

View File

@@ -0,0 +1,103 @@
# Disaggregated-prefill
## Why disaggregated-prefill?
This feature addresses the need to optimize the **Time Per Output Token (TPOT)** and **Time To First Token (TTFT)** in large-scale inference tasks. The motivation is two-fold:
1. **Adjusting Parallel Strategy and Instance Count for P and D Nodes**
Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**.
2. **Optimizing TPOT**
Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system's **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens.
---
## Usage
vLLM Ascend currently supports two types of connectors for handling KV cache management:
- **MooncakeConnector**: D nodes pull KV cache from P nodes.
- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes in a layered manner.
For step-by-step deployment and configuration, refer to the following guide:
[https://vllm-ascend.readthedocs.io/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://vllm-ascend.readthedocs.io/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html)
---
## How It Works
### 1. Design Approach
Under disaggregated-prefill, a global proxy receives external requests, forwarding prefill to P nodes and decode to D nodes; the KV cache (key-value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication.
### 2. Implementation Design
Our design diagram is shown below, illustrating the pull and push schemes respectively.
![alt text](../../assets/disaggregated_prefill_pull.png)
![alt text](../../assets/disaggregated_prefill_push.png)
#### Mooncake Connector:
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
3. After the P node's scheduler finishes prefill, `update_from_output` invokes the scheduler connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
#### Mooncake Layerwise Connector:
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`.
5. During processing, the P node's scheduler pushes the KV cache layer-wise; once all layers have been pushed, it releases the request and notifies the D node to begin decoding.
6. The D node performs decoding and returns the result.
### 3. Interface Design
Taking MooncakeConnector as an example, the system is organized into three primary classes:
- **MooncakeConnector**: Base class that provides core interfaces.
- **MooncakeConnectorScheduler**: Interface for scheduling the connectors within the engine core, responsible for managing KV cache transfer requirements and completion.
- **MooncakeConnectorWorker**: Interface for managing KV cache registration and transfer in worker processes.
### 4. Specifications Design
This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving both equal and unequal TP setups across multiple P and D nodes.
| Feature | Status |
|-------------------------------|----------------|
| A2 | 🟢 Functional |
| A3 | 🟢 Functional |
| equal TP configuration | 🟢 Functional |
| unequal TP configuration | 🟢 Functional |
| MLA | 🟢 Functional |
| GQA | 🟢 Functional |
- 🟢 Functional: Fully operational, with ongoing optimizations.
- 🔵 Experimental: Experimental support, interfaces and functions may change.
- 🚧 WIP: Under active development, will be supported soon.
- 🟡 Planned: Scheduled for future implementation (some may have open PRs/RFCs).
- 🔴 NO plan/Deprecated: No plan or deprecated by vLLM.
---
## DFX Analysis
### 1. Config Parameter Validation
Validate KV transfer config by checking whether the kv_connector type is supported and whether kv_connector_module_path exists and is loadable. On transfer failures, emit clear error logs for diagnostics.
### 2. Port Conflict Detection
Before startup, perform a port-usage check on configured ports (e.g., rpc_port, metrics_port, http_port/metaserver) by attempting to bind. If a port is already in use, fail fast and log an error.
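A minimal sketch of the bind-based port check described above (the helper name and host are assumptions, not the actual startup code):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Try to bind the port; if the bind raises, the port is already in use,
    so the caller should fail fast and log an error."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```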
### 3. PD Ratio Validation
Under non-symmetric PD scenarios, validate the P-to-D tp ratio against expected and scheduling constraints to ensure correct and reliable operation.
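The P-to-D tp ratio validation described above can be sketched as follows (an assumed helper reflecting the constraint stated in the Limitations section, not the actual validation code):

```python
def validate_pd_tp_ratio(p_tp, d_tp):
    """P_tp must be >= D_tp and an integer multiple of D_tp; return the
    number of P shards mapped to each D shard."""
    if p_tp <= 0 or d_tp <= 0:
        raise ValueError("tp sizes must be positive")
    if p_tp < d_tp or p_tp % d_tp != 0:
        raise ValueError(f"unsupported PD tp ratio: P_tp={p_tp}, D_tp={d_tp}")
    return p_tp // d_tp
```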
---
## Limitations
- Heterogeneous P and D nodes are not supported—for example, running P nodes on A2 and D nodes on A3.
- In non-symmetric TP configurations, only cases where the P nodes have a higher TP degree than the D nodes and the P TP count is an integer multiple of the D TP count are supported (i.e., P_tp > D_tp and P_tp % D_tp = 0).

View File

@@ -0,0 +1,222 @@
# Expert Parallelism Load Balancer (EPLB)
## Why We Need EPLB?
When using Expert Parallelism (EP), different experts are assigned to different NPUs. Given that the load of various experts may vary depending on the current workload, it is crucial to maintain balanced loads across different NPUs. We adopt a redundant experts strategy by duplicating heavily-loaded experts. Then, we heuristically pack these duplicated experts onto NPUs to ensure load balancing across them. Moreover, thanks to the group-limited expert routing used in MoE models, we also attempt to place experts of the same group on the same node to reduce inter-node data traffic, whenever possible.
To facilitate reproduction and deployment, vLLM Ascend provides the deployed EP load balancing algorithm in `vllm_ascend/eplb/core/policy`. The algorithm computes a balanced expert replication and placement plan based on the estimated expert loads. Note that the exact method for predicting expert loads is outside the scope of this repository; a common method is to use a moving average of historical statistics.
![eplb](../../assets/eplb.png)
## How to Use EPLB?
Please refer to the EPLB section of the user guide for detailed information: [How to Use EPLB](../../user_guide/feature_guide/eplb_swift_balancer.md)
## How It Works?
**EPLB Module Architecture**
```
vllm_ascend
├── eplb
│ ├── adaptor
│ │ ├── abstract_adaptor.py
│ │ ├── vllm_adaptor.py
│ ├── core
│ │ ├── policy
│ │ │ ├── policy_abstract.py
│ │ │ ├── policy_dynamic_ep.py
│ │ │ ├── policy_dynamic_ep_v2.py
│ │ │ ├── policy_factory.py
│ │ │ ├── policy_flashlb.py
│ │ ├── eplb_device_transfer_loader.py
│ │ ├── eplb_utils.py
│ │ ├── eplb_worker.py
│ ├── eplb_updator.py
│ ├── utils.py
└───────────
```
**1. Adaptor Module**
*Handles registration and adaptation for different MoE model types*
- `abstract_adaptor.py`
Abstract base class defining unified registration interfaces for EPLB adapters
- `vllm_adaptor.py`
Implementation supporting Qwen3-MoE and DeepSeek models, standardizing parameter handling for policy algorithms
**2. Core Module**
*Implements core algorithms, updates, and asynchronous processing*
- **Policy Submodule**
*Load balancing algorithms with factory pattern instantiation*
- `policy_abstract.py`
Abstract class for load balancing strategy interfaces
- `policy_dynamic_ep.py`
Default implementation of open-source EPLB paper algorithm
- `policy_dynamic_ep_v2.py`
Enhanced version optimizing expert swaps for low-bandwidth devices (e.g., A2)
- `policy_flashlb.py`
Threshold-based adjustment reducing operational costs through layer-wise fluctuation detection
- `policy_factory.py`
Strategy factory for automatic algorithm instantiation
- `eplb_device_transfer_loader.py`
Manages expert table/weight transmission and updates
- `eplb_utils.py`
Utilities for expert table initialization and mapping
- `eplb_worker.py`
Asynchronous algorithm orchestration and result processing
**3. System Components**
- `eplb_updator.py`
Central coordinator for load balancing during inference workflows
- `utils.py`
General utilities for EPLB interface registration
### Default Algorithm
#### Hierarchical Load Balancing
When the number of server nodes evenly divides the number of expert groups, we use the hierarchical load balancing policy to leverage group-limited expert routing. We first pack the expert groups onto nodes evenly, ensuring balanced loads across different nodes. Then, we replicate the experts within each node. Finally, we pack the replicated experts onto individual NPUs to ensure load balancing across them. The hierarchical load balancing policy can be used in the prefilling stage with a smaller expert-parallel size.
#### Global Load Balancing
In other cases, we use the global load balancing policy, which replicates experts globally regardless of expert groups, and packs the replicated experts onto individual NPUs. This policy can be adopted in the decoding stage with a larger expert-parallel size.
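As a toy illustration of the global policy described above (a simplified sketch under assumed inputs, not the algorithm in `policy_dynamic_ep.py`): replicate the heaviest experts, then greedily pack the replicas onto NPUs so per-NPU load stays balanced.

```python
import heapq

def global_balance(expert_loads, num_npus, num_redundant):
    """Return {npu: [expert_id, ...]} for one layer (assumed data shapes)."""
    n = len(expert_loads)
    # Give each extra replica to the expert whose per-replica load is highest.
    counts = [1] * n
    for _ in range(num_redundant):
        e = max(range(n), key=lambda i: expert_loads[i] / counts[i])
        counts[e] += 1
    # Greedy packing: place each replica (heaviest first) on the least-loaded NPU.
    replicas = sorted(
        ((expert_loads[e] / counts[e], e) for e in range(n) for _ in range(counts[e])),
        reverse=True,
    )
    heap = [(0.0, npu) for npu in range(num_npus)]
    heapq.heapify(heap)
    placement = {npu: [] for npu in range(num_npus)}
    for load, e in replicas:
        total, npu = heapq.heappop(heap)
        placement[npu].append(e)
        heapq.heappush(heap, (total + load, npu))
    return placement
```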
### Add a New EPLB Policy
If you want to add a new eplb policy to vllm_ascend, you must follow these steps:
1. Inherit the `EplbPolicy` abstract class in `policy_abstract.py` and override the `rebalance_experts` interface, keeping the input parameters (`current_expert_table`, `expert_workload`) and the return type (`newplacement`) consistent.
For example:
```python
class RandomLoadBalance(EplbPolicy):
    def __init__(self, config: DynamicConfig):
        super().__init__(config)

    def rebalance_experts(self, current_expert_table, expert_workload):
        new_table = copy.deepcopy(current_expert_table)
        num_layers = len(current_expert_table)
        for i in range(num_layers):
            # randomly choose two cards
            # indices = random.sample(range(num_card), 2)
            indices = [3, 1]
            # swap redundant experts
            expert_id_to_exchange = new_table[i][indices[0]][-1].clone()
            new_table[i][indices[0]][-1] = new_table[i][indices[1]][-1]
            new_table[i][indices[1]][-1] = expert_id_to_exchange
        return 1, [-i for i in range(num_layers)], new_table
```
2. To add a new EPLB algorithm, include the policy type and its corresponding implementation class in the `PolicyFactory` of `policy_factory.py`.
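A minimal sketch of the factory-pattern step above (the actual `PolicyFactory` in `policy_factory.py` may use different method names; the `register`/`create` names here are assumptions):

```python
class PolicyFactory:
    _registry = {}

    @classmethod
    def register(cls, policy_type, policy_cls):
        # Map a policy type id to its implementation class.
        cls._registry[policy_type] = policy_cls

    @classmethod
    def create(cls, policy_type, config=None):
        # Instantiate the registered policy for the requested type.
        if policy_type not in cls._registry:
            raise ValueError(f"Unknown EPLB policy type: {policy_type}")
        return cls._registry[policy_type](config)

# Usage (illustrative): register the new policy, then instantiate by type id:
# PolicyFactory.register("random", RandomLoadBalance)
# policy = PolicyFactory.create("random", config)
```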
### Add a New MoE Model
**Implementation Guide for Model Integration**
1. **Adapter File Modification**
- Inherit or modify `vllm_ascend/eplb/adaptor/vllm_adaptor.py`
- Add processing logic for key parameters:
- `num_dense_layers`
- `global_expert_num`
- `num_moe_layers`
- Ensure parameter synchronization in the `model_register` function.
For example:
Modify `__init__` of `vllm_adaptor.py` to add EPLB params for a new MoE model:
```python
if self.model.config.model_type == "qwen3_moe":
self.num_dense_layers = 0
self.global_expert_num = self.model.config.num_experts
```
Modify `model_register` of `vllm_adaptor.py` to register EPLB params for the new MoE model:
```python
if config.model_type == "qwen3_moe":
model.num_moe_layers = config.num_hidden_layers
```
2. **MoE Feature Integration**
- Extend `vllm_ascend/eplb/utils.py` with MoE-specific methods
- Implement required functionality for expert routing or weight management
3. **Registration Logic Update**
- Add patch logic within the `model_register` function
- Maintain backward compatibility with existing model types
4. **Validation & Testing**
- Verify parameter consistency across layers
- Test cross-device communication for expert tables
- Benchmark against baseline implementations (e.g., Qwen3-MoE)
*Key Implementation Notes:*
- Preserve existing interface contracts in abstract classes
- Use decorators for non-intrusive patch integration
- Leverage `eplb_utils.py` for shared expert mapping operations
## DFX
### Parameter Validation
#### Integer Parameters
All integer input parameters must explicitly specify their maximum and minimum values and be subject to valid value validation. For example, `num_iterations_eplb_update` must be greater than 0:
```python
@staticmethod
def check_iterations(iterations):
    if not isinstance(iterations, int):
        raise TypeError(f"iterations must be an int, got {iterations!r}.")
    if iterations <= 0:
        raise ValueError(
            f"iterations must be greater than 0, got {iterations}.")
    if iterations > sys.maxsize:
        raise ValueError(
            f"iterations can not be larger than {sys.maxsize}, got {iterations}.")
```
#### File Path
The file path for EPLB must be checked for legality, such as whether the file path is valid and whether it has appropriate read and write permissions. For example:
```python
@staticmethod
def check_expert_map_path(expert_map):
    if expert_map is None:
        return
    if not isinstance(expert_map, str):
        raise TypeError("The expert_map path is not a str.")
    if not expert_map.strip():
        raise ValueError("The expert_map path is empty.")
    _, ext = os.path.splitext(expert_map)
    if ext.lower() != ".json":
        raise TypeError("The expert_map file is not a .json file.")
    if not os.path.exists(expert_map):
        raise ValueError(f"The expert_map path {expert_map} does not exist.")
    try:
        # Open in read mode: opening with "w" would truncate the file.
        with open(expert_map, "r", encoding="utf-8") as f:
            f.read()
    except Exception as e:
        raise IOError(
            f"Failed to read expert info from {expert_map}, please check "
            f"the read permission of {expert_map}: {e}")
```
### Function Specifications
#### Initialization Function
All EPLB parameters must be initialized by default during initialization, with specified parameter types and default values for proper handling.
#### General Functions
All method arguments must specify parameter types and default values, and functions must include default return value handling for default arguments. It is recommended to use `try-except` blocks to handle the function body, specifying the type of exception captured and the failure handling (e.g., logging exceptions or returning a failure status).
### Consistency
#### Expert Map
The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified which ranks have inconsistent maps.
During the update process, if only a few layers or the expert table of a certain rank has been changed, the updated expert table must be synchronized with the EPLB's context to ensure global consistency.
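The cross-rank consistency check described above can be sketched as follows (an assumed illustration: in the real system the digests would be exchanged via distributed communication, e.g. an all-gather; here the gathered digests are plain function inputs):

```python
import hashlib
import json
from collections import Counter

def expert_map_digest(expert_map):
    # Canonical JSON serialization so identical maps hash identically.
    return hashlib.sha256(
        json.dumps(expert_map, sort_keys=True).encode()).hexdigest()

def find_inconsistent_ranks(rank_digests):
    """Given {rank: digest}, report the ranks that disagree with the
    majority, so the user can be told which ranks have inconsistent maps."""
    majority, _ = Counter(rank_digests.values()).most_common(1)[0]
    return sorted(r for r, d in rank_digests.items() if d != majority)
```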
#### Expert Weight
When updating expert weights, ensure that the memory allocated for the expert weights has been released, or that the expert (referring to the old version) is no longer in use.
## Limitation
Before using EPLB, add `export DYNAMIC_EPLB="true"` to the startup script.
Before performing load data collection (or performance data collection), add `export EXPERT_MAP_RECORD="true"` to the startup script.

View File

@@ -7,5 +7,10 @@ This section provides an overview of the features implemented in vLLM Ascend. De
:maxdepth: 1
patch
ModelRunner_prepare_inputs
disaggregated_prefill
eplb_swift_balancer.md
Multi_Token_Prediction
ACL_Graph
KV_Cache_Pool_Guide
add_custom_aclnn_op
:::

View File

@@ -38,7 +38,7 @@ Before writing a patch, following the principle above, we should patch the least
1. Decide which version of vLLM we should patch. For example, after analysis, here we want to patch both `0.10.0` and `main` of vLLM.
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:
```python
@@ -51,7 +51,7 @@ Before writing a patch, following the principle above, we should patch the least
vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_distributed` into `vllm_ascend/patch/platform/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```