[Doc][Misc] Improve readability and fix typos in documentation (#8340)
### What this PR does / why we need it?
This PR improves the readability of the documentation by fixing typos, correcting command extensions, and fixing broken links in the Chinese README.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Documentation changes only.

---------

Signed-off-by: sunshine202600 <sunshine202600@163.com>
@@ -26,7 +26,7 @@ The KV Cache Pool integrates multiple memory tiers (HBM, DRAM, SSD, etc.) throug
Each connector implements a unified interface for storing, retrieving, and transferring KV blocks between tiers, depending on access frequency and hardware bandwidth.
When combined with vLLM's Prefix Caching mechanism, the pool enables efficient caching both locally (in HBM) and globally (via Mooncake), ensuring that frequently used prefixes remain hot while less frequently accessed KV data can spill over to lower-cost memory.
### 1. Combining KV Cache Pool with HBM Prefix Caching
@@ -8,7 +8,7 @@ This feature addresses the need to optimize the **Time Per Output Token (TPOT)**
Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**.
2. **Optimizing TPOT**
Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which results in inefficiencies and delays. Disaggregated-prefill solves this by allowing for better control over the system's **TPOT**. By managing chunked prefill tasks effectively, the system avoids the challenge of determining the optimal chunk size and provides more reliable control over the time taken for generating output tokens.
---
@@ -28,7 +28,7 @@ For step-by-step deployment and configuration, refer to the following guide:
### 1. Design Approach
Under the disaggregated-prefill strategy, a global proxy receives external requests, forwarding prefill requests to P nodes and decode requests to D nodes; the KV cache (key-value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication.
### 2. Implementation Design
@@ -38,19 +38,19 @@ Our design diagram is shown below, illustrating the pull and push schemes respec
#### Mooncake Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
3. After the P node's scheduler finishes prefill, `update_from_output` invokes the schedule connector's `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns to the Proxy.
4. The Proxy calls `select_decoder` to choose a D node and forwards the request.
5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result.
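The five steps of the pull scheme can be sketched end to end. `handle_completions`, the node objects, and the round-robin selector below are simplified stand-ins for the real proxy code; only the `kv_transfer_params` handoff follows the description above:

```python
# Illustrative sketch of the Mooncake Connector pull scheme (steps 1-5).
# All names are stand-ins for the actual proxy implementation.
import itertools


def round_robin(nodes):
    """A stand-in for the proxy's select_prefiller / select_decoder."""
    cycle = itertools.cycle(nodes)
    return lambda: next(cycle)


def handle_completions(request, select_prefiller, select_decoder):
    # Step 2: forward to a P node with remote-decode flags, so the P node
    # runs prefill only (a single token is enough to finish the request).
    request["kv_transfer_params"] = {
        "do_remote_decode": True,
        "max_completion_tokens": 1,
        "min_tokens": 1,
    }
    prefill_out = select_prefiller().generate(request)

    # Step 3: the P node defers KV release and returns transfer metadata
    # (do_remote_prefill=True plus whatever the D node needs to pull KV).
    request["kv_transfer_params"] = prefill_out["kv_transfer_params"]

    # Steps 4-5: the chosen D node pulls the remote KV cache, notifies the
    # P node to release it, then decodes and returns the result.
    return select_decoder().generate(request)
```

A usage sketch: wrap your P/D node clients in `round_robin(...)` selectors and call `handle_completions(request, select_p, select_d)` per incoming request.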
#### Mooncake Layerwise Connector
1. The request is sent to the Proxy's `_handle_completions` endpoint.
2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint.
3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete.
4. The Proxy's `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_completion_tokens=1`, and `min_tokens=1`.
5. During processing, the P node's scheduler pushes KV cache layer-wise; once all layers have been pushed, it releases the request and notifies the D node to begin decoding.
6. The D node performs decoding and returns the result.
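The layer-wise push in step 5 can be sketched as below; the function and callback names are hypothetical, and in the real connector the per-layer sends overlap with the prefill of subsequent layers:

```python
# Minimal sketch of the P node's layer-wise KV push (step 5 above).
# send_layer and notify_decoder are hypothetical callbacks.

def push_layerwise(kv_cache_per_layer, send_layer, notify_decoder):
    """kv_cache_per_layer: one KV entry per transformer layer, in order."""
    for layer_idx, layer_kv in enumerate(kv_cache_per_layer):
        # Each layer is sent as soon as it is ready, rather than waiting
        # for the whole prefill to finish.
        send_layer(layer_idx, layer_kv)
    # Only after every layer has been pushed may the D node start decoding.
    notify_decoder()
```

The design choice here is latency hiding: transferring layer `i` while layer `i+1` is still being computed shortens the gap between the end of prefill and the start of decode.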
### 3. Interface Design
@@ -63,7 +63,7 @@ Taking MooncakeConnector as an example, the system is organized into three prima
### 4. Specifications Design
This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving equal TP setups and certain unequal TP setups across multiple P and D nodes.
| Feature | Status |
|-------------------------------|----------------|
@@ -236,7 +236,7 @@ All method arguments must specify parameter types and default values, and functi
#### Expert Map
The expert map must be globally unique during initialization and update. In a multi-node scenario during initialization, distributed communication should be used to verify the consistency of expert maps across each rank. If they are inconsistent, the user should be notified of which ranks have inconsistent maps.
During the update process, if only a few layers or the expert table of a certain rank has been changed, the updated expert table must be synchronized with the EPLB's context to ensure global consistency.
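The initialization-time consistency check can be sketched as follows. In practice the maps would be gathered across ranks with something like `torch.distributed.all_gather_object`; here a plain list stands in for the gathered result, and the function name is illustrative:

```python
# Sketch of the expert-map consistency check described above: after
# gathering every rank's expert map, report which ranks disagree with
# rank 0 so the user can be told exactly where the mismatch is.

def find_inconsistent_ranks(gathered_maps):
    """gathered_maps[r] is rank r's expert map (e.g., nested lists/dicts).
    Returns the list of ranks whose map differs from rank 0's."""
    reference = gathered_maps[0]
    return [rank for rank, expert_map in enumerate(gathered_maps)
            if expert_map != reference]
```

An empty result means the maps are globally consistent and initialization can proceed; a non-empty result should be surfaced to the user with the offending rank IDs.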
#### Expert Weight
@@ -2,18 +2,18 @@
## How Does It Work?
This is an optimization based on FX graphs, which can be considered an acceleration solution for the aclgraph mode.
You can find its source code at [torchair](https://gitcode.com/Ascend/torchair).
## Default FX Graph Optimization
### FX Graph pass
- For the intermediate nodes of the model, replace the non-in-place operators contained in the nodes with in-place operators to reduce memory movement during computation and improve performance.
- For the original input parameters of the model, if they include in-place operators, Dynamo's Functionalize process will replace the in-place operators with a form of non-in-place operators + copy operators. npugraph_ex will reverse this process, restoring the in-place operators and reducing memory movement.
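As a toy illustration of the first pass, the sketch below swaps non-in-place ops for in-place counterparts, but only when the op's input is an intermediate value rather than a graph input (overwriting a graph input would be observable to the caller). It operates on a simplified node list, not real FX `GraphModule`s:

```python
# Toy version of the in-place rewrite pass: intermediate nodes whose op
# has an in-place counterpart get rewritten so the result reuses the
# input buffer, cutting memory movement. PyTorch-style op names are used
# for flavor, but the pass runs over plain dicts, not an FX graph.

INPLACE_EQUIVALENT = {"add": "add_", "mul": "mul_", "relu": "relu_"}


def rewrite_inplace(nodes, graph_inputs):
    """nodes: list of {"op": str, "args": [value names]};
    graph_inputs: set of value names that are original model inputs."""
    for node in nodes:
        # Only rewrite when the would-be-overwritten argument is an
        # intermediate, so graph inputs are never mutated.
        safe = node["args"][0] not in graph_inputs
        if safe and node["op"] in INPLACE_EQUIVALENT:
            node["op"] = INPLACE_EQUIVALENT[node["op"]]
    return nodes
```

The same safety condition is why the second pass exists: for ops on original inputs, Dynamo's functionalization has already rewritten in-place ops into non-in-place + copy pairs, and npugraph_ex reverses that instead.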
### FX fusion pass
npugraph_ex now provides three default operator fusion passes, and more will be added in the future.
@@ -73,4 +73,4 @@ Before writing a patch, following the principle above, we should patch the least
## Limitations
1. In the V1 Engine, vLLM starts three kinds of processes: the Main process, the EngineCore process, and the Worker process. By default, vLLM Ascend can only patch code in the Main process and Worker process. If you want to patch code running in the EngineCore process, you must patch the EngineCore process entirely during setup. The full code lives in `vllm.v1.engine.core`; please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
2. If you are running edited vLLM code, the version of vLLM may be changed automatically. For example, if you run edited vLLM based on v0.9.n, the version of vLLM may be changed to v0.9.nxxx. The patch for v0.9.n in vLLM Ascend would then not work as expected, because vLLM Ascend can't distinguish the version of vLLM you're using. To resolve this, set the environment variable `VLLM_VERSION` to the version of vLLM you're using, and the patch for that version (e.g., v0.9.n) will then be applied.
@@ -24,7 +24,7 @@ The `embedding` method is generally not implemented for quantization, focusing o
The `create_weights` method is used for weight initialization; the `process_weights_after_loading` method is used for weight post-processing, such as transposition, format conversion, data type conversion, etc.; the `apply` method is used to perform activation quantization and quantized matrix multiplication calculations during the forward process.
We need to implement the `create_weights`, `process_weights_after_loading`, and `apply` methods for different **layers** (**attention**, **mlp**, **MoE (Mixture of Experts)**).
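A skeleton of the three-method interface might look like the following for a small linear (attention/mlp) layer. The class name and pure-Python buffers are illustrative assumptions; the real implementation works on tensors and NPU kernels:

```python
# Illustrative skeleton of the create_weights / process_weights_after_loading
# / apply contract for per-channel W8A8-style linear quantization.
# Pure Python lists stand in for device tensors.

class W8A8LinearMethod:
    def create_weights(self, out_features, in_features):
        # Weight initialization: an INT8 weight matrix plus one FP scale
        # per output channel (per-channel quantization).
        self.int8_weight = [[0] * in_features for _ in range(out_features)]
        self.weight_scale = [1.0] * out_features

    def process_weights_after_loading(self):
        # Post-processing after checkpoint load: here a transpose, standing
        # in for layout/format/dtype conversion for the matmul kernel.
        self.weight_t = list(map(list, zip(*self.int8_weight)))

    def apply(self, x):
        # Forward pass: quantized matmul, then per-channel dequantization.
        out = []
        for o in range(len(self.weight_scale)):
            acc = sum(x[i] * self.weight_t[i][o] for i in range(len(x)))
            out.append(acc * self.weight_scale[o])
        return out
```

The split mirrors the lifecycle described above: allocation at model build time, one-time post-processing after weights are loaded, and the hot-path computation in `apply`.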
**Supplement**: When loading the model, the quantized model's description file **quant_model_description.json** needs to be read. This file describes the quantization configuration and parameters for each part of the model weights, for example:
@@ -107,7 +107,7 @@ vLLM Ascend supports multiple quantization algorithms. The following table provi
| `W8A8_DYNAMIC` | INT8 | INT8 | Per-Channel | Per-Token | Dynamic | Dynamic activation quantization with per-token scaling factor calculation |
| `W4A8_DYNAMIC` | INT4 | INT8 | Per-Group | Per-Token | Dynamic | Supports both direct per-channel quantization to 4-bit and two-step quantization (per-channel to 8-bit then per-group to 4-bit) |
| `W4A4_FLATQUANT_DYNAMIC` | INT4 | INT4 | Per-Channel | Per-Token | Dynamic | Uses FlatQuant for activation distribution smoothing before 4-bit dynamic quantization, with additional matrix multiplications for precision preservation |
| `W8A8_MIX` | INT8 | INT8 | Per-Channel | Per-Tensor/Token | Mixed | We support two deployment modes: PD Colocation (dynamic quantization for both P and D) and PD Disaggregation (dynamic-quant P and static-quant D) |
**Static vs Dynamic:** Static quantization uses pre-computed scaling factors, which gives better performance, while dynamic quantization computes scaling factors on the fly for each token/activation tensor, which gives higher precision.
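The trade-off can be made concrete with a small sketch (symmetric INT8, illustrative only): a static scale is fixed once from calibration data, so a single outlier inflates the range for every future token, while a dynamic scale tracks each token's actual range:

```python
# Static vs dynamic activation scaling for symmetric INT8 quantization.
# Toy calculation on Python lists, for illustration only.

INT8_MAX = 127


def static_scale(calibration_tensors):
    """One scale for all future activations, from calibration max-abs."""
    peak = max(abs(v) for t in calibration_tensors for v in t)
    return peak / INT8_MAX


def dynamic_scale(token_activation):
    """A fresh scale per token: tighter range, hence higher precision."""
    return max(abs(v) for v in token_activation) / INT8_MAX


def quantize(token_activation, scale):
    """Round to INT8 with clamping to the symmetric range."""
    return [max(-INT8_MAX, min(INT8_MAX, round(v / scale)))
            for v in token_activation]
```

With a calibration outlier of 254, the static scale is 2.0 and a typical token's values quantize coarsely; the dynamic scale for that same token uses the full INT8 range, at the cost of computing a scale per token at runtime.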