[main][Docs] Fix typos across documentation (#6728)
## Summary
Fix typos and improve grammar consistency across 50 documentation files.
### Changes include:
- Spelling corrections (e.g., "Facotory" → "Factory", "certainty" →
"determinism")
- Grammar improvements (e.g., "multi-thread" → "multi-threaded",
"re-routed" → "re-run")
- Punctuation fixes (semicolon consistency in filter parameters)
- Code style fixes (correct flag name `--num-prompts` instead of
`--num-prompt`)
- Capitalization consistency (e.g., "python" → "Python", "ascend" →
"Ascend")
- vLLM version: v0.15.0
- vLLM main: 9562912cea
---------
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
@@ -2,7 +2,7 @@
## Overview
-Fine-Grained Tensor Parallelism (Finegrained TP) extends standard tensor parallelism by enabling **independent tensor parallel sizes for different model components**. Instead of applying a single global `tensor_parallel_size` to all layers, Finegrained TP allows users to configure separate TP size for key modules—such as embedding, language model head (lm_head), attention output projection (oproj), and MLP blocks—via the `finegrained_tp_config` parameter.
+Fine-Grained Tensor Parallelism (Fine-grained TP) extends standard tensor parallelism by enabling **independent tensor-parallel sizes for different model components**. Instead of applying a single global `tensor_parallel_size` to all layers, Fine-grained TP allows users to configure separate TP sizes for key modules—such as embedding, language model head (lm_head), attention output projection (o_proj), and MLP blocks—via the `finegrained_tp_config` parameter.
This capability supports heterogeneous parallelism strategies within a single model, providing finer control over weight distribution, memory layout, and communication patterns across devices. The feature is compatible with standard dense transformer architectures and integrates seamlessly into vLLM’s serving pipeline.
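As a rough illustration of what an independent TP size changes, the sketch below compares per-device shard shapes for the lm_head weight under a global TP size versus a larger component-specific one. The dimensions and the vocab-dimension sharding are assumptions for the example, not values taken from this feature's documentation:

```python
# Sketch of how an independent TP size changes per-device shard shapes.
# Dimensions below are assumed for illustration, not from a specific model.
hidden_size = 7168
vocab_size = 129280
global_tp = 4        # a single tensor_parallel_size applied everywhere
lm_head_tp = 8       # a larger, independent TP size just for lm_head

# Assuming the lm_head weight is (vocab_size, hidden_size) and is
# sharded along the vocab dimension, as is typical for TP.
rows_global = vocab_size // global_tp   # rows held per device with global TP
rows_fine = vocab_size // lm_head_tp    # rows held per device with fine-grained TP

print(rows_global, rows_fine)
```

Doubling the lm_head TP size halves that component's per-device shard without touching the TP layout of any other layer.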
@@ -12,10 +12,10 @@ This capability supports heterogeneous parallelism strategies within a single mo
Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
-- **Reduced Per-Device Memory Footprint**:
-Finegrained TP shards large weight matrices(如 LM Head、o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
+- **Reduced Per-Device Memory Footprint**:
+Fine-grained TP shards large weight matrices (e.g., LM Head, o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
-- **Faster Memory Access in GEMMs**:
+- **Faster Memory Access in GEMMs**:
In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
Together, these effects allow practitioners to better balance memory, communication, and compute—particularly in high-concurrency serving scenarios—while maintaining compatibility with standard dense transformer models.
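The memory arithmetic behind both bullets can be sketched as follows. The dimensions and weight dtype are illustrative assumptions, not measured values from the doc:

```python
# Illustrative estimate of per-device lm_head weight bytes under TP sharding.
# Dimensions and dtype are assumptions for the sketch, not measured values.
hidden_size = 7168
vocab_size = 129280
bytes_per_param = 1          # e.g., INT8 weights in a W8A8-style deployment
tp_size = 8

full_bytes = vocab_size * hidden_size * bytes_per_param
shard_bytes = full_bytes // tp_size

# In a memory-bound decode GEMM each device also fetches only its own shard,
# so per-device DRAM traffic for these weights drops by the same factor.
print(f"full: {full_bytes / 2**20:.1f} MiB, per-device: {shard_bytes / 2**20:.1f} MiB")
```

Under these assumed sizes, an 8-way shard cuts both the resident weight footprint and the per-step weight fetch for this layer by 8x on each device.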
@@ -26,7 +26,7 @@ Together, these effects allow practitioners to better balance memory, communicat
### Models
-Finegrained TP is **model-agnostic** and supports all standard dense transformer architectures, including Llama, Qwen, DeepSeek (base/dense variants), and others.
+Fine-grained TP is **model-agnostic** and supports all standard dense transformer architectures, including Llama, Qwen, DeepSeek (base/dense variants), and others.
### Component & Execution Mode Support
@@ -57,7 +57,7 @@ The Fine-Grained TP size for any component must:
### Configuration Format
-Finegrained TP is controlled via the `finegrained_tp_config` field inside `--additional-config`.
+Fine-grained TP is controlled via the `finegrained_tp_config` field inside `--additional-config`.
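Since `--additional-config` takes a JSON string on the command line, a minimal sketch of building such a value programmatically is shown below. The key names inside `finegrained_tp_config` are placeholders for this sketch, not the documented schema:

```python
import json

# Placeholder sketch: the keys accepted by `finegrained_tp_config` are not
# reproduced here, so "lm_head" / "oproj" are assumed illustrative names.
finegrained_tp_config = {
    "lm_head": 8,   # hypothetical: independent TP size for the LM head
    "oproj": 8,     # hypothetical: independent TP size for o_proj
}

# `--additional-config` expects a single JSON string as its value.
cli_value = json.dumps({"finegrained_tp_config": finegrained_tp_config})
print(cli_value)
```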
```bash
--additional-config '{
@@ -90,7 +90,7 @@ vllm serve deepseek-ai/DeepSeek-R1 \
## Experimental Results
-To evaluate the effectiveness of fine-grained TP in large-scale service scenarios, we use the model **DeepSeek-R1-W8A8**, deploy PD-separated decode instances in the environment of 32 cards Ascend 910B*64G (A2), with parallel configuration as DP32+EP32, and fine-grained TP size of 8, the performance data is as follows.
+To evaluate the effectiveness of fine-grained TP in large-scale serving scenarios, we use the **DeepSeek-R1-W8A8** model and deploy PD-separated decode instances on 32 Ascend 910B*64G (A2) cards, with a DP32+EP32 parallel configuration and a fine-grained TP size of 8; the performance data is as follows.
| Module | Memory Savings | TPOT Impact (batch=24) |
| ---------------- | -------------- | ------------------------- |