[doc](cp) correct the prefill of GQA and adjust desc of block table. (#5697)
### What this PR does / why we need it?
correct the seq length of KV for prefill of GQA and clarify the desc of
block table distribution in developer guide.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
This commit is contained in:
Binary file not shown.
|
Before Width: | Height: | Size: 194 KiB After Width: | Height: | Size: 289 KiB |
BIN
docs/source/assets/cp/device_world.png
Normal file
BIN
docs/source/assets/cp/device_world.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 61 KiB |
Binary file not shown.
|
Before Width: | Height: | Size: 284 KiB After Width: | Height: | Size: 285 KiB |
@@ -25,20 +25,29 @@ Please refer to the [context parallel user guide](../../user_guide/feature_guide
|
||||
|
||||
## How It Works?
|
||||
|
||||
### Device Distribution
|
||||
|
||||
We introduce new communication domains for PCP and reuse TP for DCP, and this is the new layout of devices for PCP2, DCP2, and TP4.
|
||||

|
||||
|
||||
### Block Table
|
||||
|
||||
CP performs sequence sharding on the KV cache storage. To facilitate efficient storage and access, tokens are stored in an interleaved manner across devices, with the interleaving granularity determined by `cp_kv_cache_interleave_size`.
|
||||
CP performs sequence sharding on the KV cache storage. To facilitate efficient storage and access, tokens are stored in an interleaved manner across devices, with the interleaving granularity determined by `cp_kv_cache_interleave_size`, whose default value is `cp_kv_cache_interleave_size=1`, a.k.a. 'token interleave'.
|
||||
|
||||
As illustrated, a virtual block is defined in the block table, where blocks within the same CP device group form a virtual block. The virtual block size is `virtual_block_size = block_size * pcp_size * dcp_size`.
|
||||
Given that PCP and DCP behave similarly for KV cache sharding, we refer to them collectively as CP. Specifically, `cp_size = pcp_size * dcp_size`, and `cp_rank = pcp_rank * dcp_size + dcp_rank`.
|
||||
|
||||
For any token `x`, its (virtual) block index is `x // virtual_block_size`, and the offset within the virtual block is `x % virtual_block_size`. The local block index is `offset_within_virtual_block // cp_kv_cache_interleave_size`, and the device number is `local_block_index % (pcp_size * dcp_size)`. The offset within the local block is `(local_block_index // (pcp_size * dcp_size)) * cp_kv_cache_interleave_size + offset_within_virtual_block % cp_kv_cache_interleave_size`.
|
||||
As illustrated, a virtual block is defined in the block table, where blocks within the same CP device group form a virtual block. The virtual block size is `virtual_block_size = block_size * cp_size`.
|
||||
|
||||
For any token `x`, referencing the folloing figure, its (virtual) block index is `x // virtual_block_size`, and the offset within the virtual block is `offset_within_virtual_block = x % virtual_block_size`.
|
||||
The local block index is `local_block_index = offset_within_virtual_block // cp_kv_cache_interleave_size`, and the device number is `target_rank = local_block_index % cp_size`.
|
||||
The offset within the local block is `(local_block_index // cp_size) * cp_kv_cache_interleave_size + offset_within_virtual_block % cp_kv_cache_interleave_size`.
|
||||
|
||||

|
||||
|
||||
Based on the logic above, the `slot_mapping` calculation process is adjusted, and the `slot_mapping` values on each device are modified to ensure the KV cache is sharded along the sequence dimension and stored across different devices as expected.
|
||||
|
||||
The current implementation requires that `block_size % cp_kv_cache_interleave_size == 0`.
|
||||
|
||||

|
||||
|
||||
### Decode Context Parallel (DCP)
|
||||
|
||||
As mentioned above, the primary function of DCP is to shard the KV cache along the sequence dimension for storage. Its impact lies in the logic of the decode and chunked prefill phases.
|
||||
|
||||
Reference in New Issue
Block a user