[Doc][Misc] Correcting the document and uploading the model deployment template (#8287)
### What this PR does / why we need it?
Corrects the document and uploads the model deployment template.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
@@ -1,4 +1,4 @@
-# Fine-Grained Tensor Parallelism (Finegrained TP)
+# Fine-Grained Tensor Parallelism (Fine-grained TP)
 
 ## Overview
@@ -8,12 +8,12 @@ This capability supports heterogeneous parallelism strategies within a single mo
 
 ---
 
-## Benefits of Finegrained TP
+## Benefits of Fine-grained TP
 
 Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
 
 - **Reduced Per-Device Memory Footprint**:
-  Fine-grained TP shards large weight matrices(e.g., LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
+  Fine-grained TP shards large weight matrices (e.g., LM Head, o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
 
 - **Faster Memory Access in GEMMs**:
   In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
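The per-device memory saving described in the hunk above can be illustrated with a rough back-of-the-envelope sketch. The weight shapes and dtype below are hypothetical examples, not values from the document:

```python
# Rough illustration (hypothetical sizes): per-device bytes for an LM Head
# weight of shape [vocab_size, hidden_size] when sharded evenly across
# tp devices, as fine-grained TP does for large matrices.
def lm_head_bytes_per_device(vocab_size: int, hidden_size: int,
                             bytes_per_param: int, tp: int) -> int:
    total = vocab_size * hidden_size * bytes_per_param
    return total // tp  # each device holds 1/tp of the matrix

full = lm_head_bytes_per_device(129_280, 7_168, 2, 1)   # unsharded copy
shard = lm_head_bytes_per_device(129_280, 7_168, 2, 8)  # fine-grained TP = 8
print(f"full: {full / 1e9:.2f} GB, per-device shard: {shard / 1e9:.2f} GB")
```

The same 1/tp factor applies to per-device weight fetch volume, which is why the sharding also helps memory-bound GEMMs in decode.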
@@ -53,7 +53,7 @@ The Fine-Grained TP size for any component must:
 
 ---
 
-## How to Use Finegrained TP
+## How to Use Fine-grained TP
 
 ### Configuration Format
@@ -72,6 +72,7 @@ For best results, if you run inside a docker container, which `systemctl` is likely
 - **Stop `irqbalance` service**:
 
   For example, on Ubuntu system, you can run the following command to stop irqbalance:
+
   ```bash
   sudo systemctl stop irqbalance
   ```
@@ -56,12 +56,12 @@ All related code is under `vllm/distributed/ec_transfer`.
   * *Scheduler role* – checks cache existence and schedules loads.
   * *Worker role* – loads the embeddings into memory.
 
-* **EPD Load Balance Proxy** -
+* **EPD Load Balancing Proxy** -
   * *Multi-Path Scheduling Strategy* - dynamically routes multimodal or text requests to the corresponding inference path.
   * *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
 
 We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the KV transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
-[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
+[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
 
 For the PD disaggregation part, when using MooncakeLayerwiseConnector: the request first enters the Decoder instance; the Decoder then triggers a remote prefill task in reverse via the Metaserver. The Prefill node executes inference and pushes the KV cache layer by layer to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with subsequent token generation.
 `docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about disaggregated prefill.
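The least-loaded, priority-queue dispatch described in the hunk above can be sketched with Python's `heapq`. This is a simplified illustration of the strategy, not vllm-ascend's actual proxy code; the instance names and the active-token load metric are hypothetical:

```python
import heapq

class LeastLoadedDispatcher:
    """Toy sketch: dispatch each request to the instance with the
    fewest active tokens, tracked in a min-heap priority queue."""

    def __init__(self, instances):
        # Heap entries are (active_tokens, instance_name); the root is
        # always the least-loaded instance.
        self.heap = [(0, name) for name in instances]
        heapq.heapify(self.heap)

    def dispatch(self, num_tokens: int) -> str:
        load, name = heapq.heappop(self.heap)       # least-loaded instance
        heapq.heappush(self.heap, (load + num_tokens, name))
        return name

d = LeastLoadedDispatcher(["inst-0", "inst-1"])
first = d.dispatch(100)   # lands on one idle instance
second = d.dispatch(10)   # lands on the other, now less-loaded one
```

In a real proxy the load counter would also be decremented as requests finish; the heap invariant keeps each dispatch O(log n) in the number of instances.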
@@ -4,7 +4,7 @@ For larger-scale deployments especially, it can make sense to handle the orchest
 
 In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
 
-## Getting Start
+## Getting Started
 
 The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) is already natively supported by vLLM. In vllm-ascend we provide two enhanced functionalities:
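An external router of the kind this hunk describes can be as simple as picking the endpoint whose telemetry reports the lowest load. A minimal sketch, where the endpoint URLs and the in-flight-request metric are hypothetical placeholders rather than a real vLLM API:

```python
def pick_endpoint(telemetry: dict) -> str:
    """Route the next request to the DP-rank endpoint reporting the
    fewest in-flight requests (telemetry maps endpoint URL -> load)."""
    return min(telemetry, key=telemetry.get)

# Each DP rank runs as its own vLLM server with its own HTTP endpoint;
# the router would refresh these numbers from real-time telemetry.
telemetry = {
    "http://dp-rank-0:8000": 12,
    "http://dp-rank-1:8000": 3,
}
target = pick_endpoint(telemetry)
```

A production router would add health checks and tie-breaking, but the routing decision itself reduces to this comparison.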
@@ -163,7 +163,7 @@ export ASCEND_ENABLE_USE_FABRIC_MEM=1
 #A2
 #export HCCL_INTRA_ROCE_ENABLE=1
 
-#Minimum retransmission timeout of the RDMA,equals 4.096 μs * 2 ^ timeout.
+#Minimum retransmission timeout of the RDMA, equals 4.096 μs * 2 ^ timeout.
 #Needs to satisfy the equation: ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where 7 is the default number of retries for RDMA transfer.
 #HCCL_RDMA_TIMEOUT also affects collective communication behavior and should be configured carefully.
 export HCCL_RDMA_TIMEOUT=17
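The timeout relationship stated in the comments above can be checked numerically. The sketch below just evaluates the stated formula; the `ASCEND_TRANSFER_TIMEOUT` value is a hypothetical placeholder, not a recommended setting:

```python
# Minimum RDMA retransmission timeout: 4.096 us * 2 ** HCCL_RDMA_TIMEOUT.
HCCL_RDMA_TIMEOUT = 17
rdma_timeout_s = 4.096e-6 * 2 ** HCCL_RDMA_TIMEOUT  # ~0.54 s for timeout=17

# Constraint from the comments: ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7,
# where 7 is the default number of RDMA retries. Checked here against a
# hypothetical transfer-timeout value.
ascend_transfer_timeout_s = 10.0
assert ascend_transfer_timeout_s > rdma_timeout_s * 7
print(f"per-retry timeout: {rdma_timeout_s:.3f} s, "
      f"7 retries: {rdma_timeout_s * 7:.3f} s")
```

With `HCCL_RDMA_TIMEOUT=17` the per-retry timeout is about 0.537 s, so the transfer timeout must exceed roughly 3.76 s.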
@@ -1,6 +1,6 @@
 # Distributed DP Server With Large-Scale Expert Parallelism
 
-## Getting Start
+## Getting Started
 
 vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale **Expert Parallelism (EP)** scenario. To achieve better performance, the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of P and D nodes, thereby enabling more flexible model deployment. \
 Take the DeepSeek model as an example, deployed across 8 Atlas 800T A3 servers. Assume the server IPs run from 192.0.0.1 to 192.0.0.8. The first 4 servers serve as prefiller nodes and the last 4 as decoder nodes; each prefiller node is deployed as an independent master node, while the decoder nodes use the 192.0.0.5 node as the master node.
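The node layout described above can be written out explicitly. The snippet is only a sketch of the described topology; the role-assignment code is illustrative, not part of vllm-ascend:

```python
# 8 Atlas 800T A3 servers with IPs 192.0.0.1 .. 192.0.0.8, split into
# 4 prefiller nodes and 4 decoder nodes as in the example deployment.
servers = [f"192.0.0.{i}" for i in range(1, 9)]

prefillers = servers[:4]      # each prefiller is its own independent master
decoders = servers[4:]        # decoders form one group
decode_master = decoders[0]   # the group uses 192.0.0.5 as its master node

for ip in prefillers:
    print(f"prefiller {ip}: master={ip}")
for ip in decoders:
    print(f"decoder   {ip}: master={decode_master}")
```
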