[Doc][Misc] Correcting the document and uploading the model deployment template (#8287)
### What this PR does / why we need it?

Correcting the document and uploading the model deployment template.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
@@ -1,4 +1,4 @@
-# Fine-Grained Tensor Parallelism (Finegrained TP)
+# Fine-Grained Tensor Parallelism (Fine-grained TP)
 
 ## Overview
 
@@ -8,12 +8,12 @@ This capability supports heterogeneous parallelism strategies within a single mo
 
 ---
 
-## Benefits of Finegrained TP
+## Benefits of Fine-grained TP
 
 Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
 
 - **Reduced Per-Device Memory Footprint**:
-  Fine-grained TP shards large weight matrices(e.g., LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
+  Fine-grained TP shards large weight matrices (e.g., LM Head, o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
 
 - **Faster Memory Access in GEMMs**:
   In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
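The memory-footprint benefit described in the hunk above can be sketched with some quick arithmetic. This is a hypothetical illustration, not vLLM code: the function name and the model dimensions (vocab size, hidden size) are made up for the example, and it assumes the LM Head weight is sharded evenly along one dimension across the TP group.

```python
def lm_head_bytes_per_device(vocab_size: int, hidden_size: int,
                             bytes_per_param: int, tp_size: int) -> int:
    """Per-device memory for an LM Head weight of shape
    (vocab_size, hidden_size) sharded evenly across tp_size devices.

    Hypothetical helper for illustration only; assumes an even,
    single-dimension shard with no replication.
    """
    total = vocab_size * hidden_size * bytes_per_param
    # Each device holds 1/tp_size of the weight matrix.
    return total // tp_size

# Example: a 152k-vocab, 8192-hidden LM Head in bf16 (2 bytes/param).
full = lm_head_bytes_per_device(152_000, 8192, 2, tp_size=1)
sharded = lm_head_bytes_per_device(152_000, 8192, 2, tp_size=8)
print(f"unsharded: {full / 2**30:.2f} GiB, sharded over 8: {sharded / 2**30:.2f} GiB")
```

With 8-way sharding each device stores an eighth of the weight, which is also why per-device DRAM traffic for the corresponding GEMM drops: each device only fetches the shard it owns.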
@@ -53,7 +53,7 @@ The Fine-Grained TP size for any component must:
 
 ---
 
-## How to Use Finegrained TP
+## How to Use Fine-grained TP
 
 ### Configuration Format
 