[Documentation][Configuration] Server args and documentation of PD-Multiplexing. (#11427)
docs/advanced_features/pd_multiplexing.md
Normal file
@@ -0,0 +1,56 @@
# PD Multiplexing

## Server Arguments

| Argument                     | Type/Default               | Description                                                                         |
|------------------------------|----------------------------|-------------------------------------------------------------------------------------|
| `--enable-pdmux`             | flag; default: disabled    | Enable PD-Multiplexing (prefill and decode running on green-context streams).       |
| `--pdmux-config-path <path>` | string path; default: none | Path to the PD-Multiplexing YAML config file.                                       |

### YAML Configuration

Example configuration for an H200 (132 SMs):

```yaml
# Number of SM groups to divide the GPU into.
# Includes two default groups:
#   - Group 0: all SMs for prefill
#   - Last group: all SMs for decode
# The number of manual divisions must be (sm_group_num - 2).
sm_group_num: 8

# Optional manual divisions of SMs.
# Each entry contains:
#   - prefill_sm: number of SMs allocated for prefill
#   - decode_sm: number of SMs allocated for decode
#   - decode_bs_threshold: minimum decode batch size to select this group
#
# The sum of `prefill_sm` and `decode_sm` must equal the total number of SMs.
# If provided, the number of entries must equal (sm_group_num - 2).
manual_divisions:
  - [112, 20, 1]
  - [104, 28, 5]
  - [96, 36, 10]
  - [80, 52, 15]
  - [64, 68, 20]
  - [56, 76, 25]

# Divisor for default stream index calculation.
# Used when manual_divisions are not provided.
# Formula:
#   stream_idx = max(
#       1,
#       min(sm_group_num - 2,
#           decode_bs * (sm_group_num - 2) // decode_bs_divisor
#       )
#   )
decode_bs_divisor: 36

# Maximum token budget for split_forward in the prefill stage.
# Determines how many layers are executed per split_forward.
# Formula:
#   forward_count = max(1, split_forward_token_budget // extend_num_tokens)
split_forward_token_budget: 65536
```
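
The constraints and formulas stated in the YAML comments can be checked with a short Python sketch. This is not SGLang's implementation: the function names (`validate_pdmux_config`, `default_stream_idx`, `forward_count`) and the `EXAMPLE_CONFIG` dict are hypothetical, the config is mirrored as a plain dict rather than parsed from the YAML file, and `extend_num_tokens` is assumed to be a positive integer.

```python
# Illustrative sketch only; names below are hypothetical, not SGLang APIs.

# Plain-dict mirror of the example YAML config for an H200 (132 SMs).
EXAMPLE_CONFIG = {
    "sm_group_num": 8,
    "manual_divisions": [
        [112, 20, 1],
        [104, 28, 5],
        [96, 36, 10],
        [80, 52, 15],
        [64, 68, 20],
        [56, 76, 25],
    ],
    "decode_bs_divisor": 36,
    "split_forward_token_budget": 65536,
}


def validate_pdmux_config(cfg: dict, total_sms: int) -> None:
    """Check the two constraints stated in the YAML comments."""
    expected_entries = cfg["sm_group_num"] - 2
    divisions = cfg.get("manual_divisions") or []
    # If manual divisions are given, there must be (sm_group_num - 2) of them.
    if divisions and len(divisions) != expected_entries:
        raise ValueError(
            f"expected {expected_entries} manual divisions, got {len(divisions)}"
        )
    # prefill_sm + decode_sm must equal the total SM count for every entry.
    for prefill_sm, decode_sm, _decode_bs_threshold in divisions:
        if prefill_sm + decode_sm != total_sms:
            raise ValueError(
                f"{prefill_sm} + {decode_sm} does not equal {total_sms} SMs"
            )


def default_stream_idx(decode_bs: int, sm_group_num: int, decode_bs_divisor: int) -> int:
    """Stream index formula used when manual_divisions are not provided."""
    n = sm_group_num - 2
    return max(1, min(n, decode_bs * n // decode_bs_divisor))


def forward_count(split_forward_token_budget: int, extend_num_tokens: int) -> int:
    """Formula from the split_forward_token_budget comment.

    Assumes extend_num_tokens > 0.
    """
    return max(1, split_forward_token_budget // extend_num_tokens)


if __name__ == "__main__":
    validate_pdmux_config(EXAMPLE_CONFIG, total_sms=132)
    for bs in (1, 12, 36, 100):
        print(bs, default_stream_idx(bs, 8, 36))
    print(forward_count(65536, 8192))
```

With `sm_group_num: 8` and `decode_bs_divisor: 36`, the default index is clamped to the range 1 to 6, so a decode batch size of 36 or more always maps to index 6. Likewise, `forward_count` never drops below 1, even when a single extend batch exceeds the token budget.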
@@ -44,6 +44,7 @@ The core features include:
advanced_features/quantization.md
advanced_features/lora.ipynb
advanced_features/pd_disaggregation.md
advanced_features/pd_multiplexing.md
advanced_features/vlm_query.ipynb
advanced_features/router.md
advanced_features/observability.md