[feat] support customized and separated hccl_buffer_size for process group initialization (#3073)
### What this PR does / why we need it?

Currently, users have to set `HCCL_BUFFSIZE` to 512~1024 to run mc2 operators (dispatch and combine) when serving MoE models with large `ep_size` and `batch_size`. This environment variable not only affects the VRAM allocated for the mc2 group, but also increases the VRAM allocated for the dp, tp & ep groups, leading to significant drops in kvcache and free_memory.

This PR automatically calculates and sets `hccl_buffer_size` for each process group **(except the mc2 group)** separately when users set `HCCL_BUFFSIZE` for the mc2 group. This can significantly reduce the buffer size wasted on the dp, tp & ep groups.

Note that current mc2 operators can only partition their communication space based on the `HCCL_BUFFSIZE` configuration. Once they support `hccl_buffer_size` configuration through `pg_options` at process group initialization, we'll calculate the required buffer size and users will no longer need to set `HCCL_BUFFSIZE` themselves.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

We performed E2E serving with deepseek_r1, initializing the DP/TP/EP/MC2 process groups, and observed a significant increase in kv_cache and free_memory.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
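The selection rule described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the helper name `buffer_size_for_group`, the group-name strings, and both size constants are hypothetical placeholders.

```python
import os
from typing import Optional

# Hypothetical placeholder sizes; the values actually computed by the PR
# are not shown in this commit message.
ASSUMED_DEFAULT_SIZE = 200
ASSUMED_DP_SIZE = 64


def buffer_size_for_group(group_name: str) -> Optional[int]:
    """Return a per-group buffer size, or None to defer to HCCL_BUFFSIZE."""
    if os.environ.get("HCCL_BUFFSIZE") is None:
        # The user did not raise the global size, so nothing is overridden.
        return None
    if group_name == "mc2":
        # mc2 operators partition their space from HCCL_BUFFSIZE itself,
        # so this group keeps the global setting.
        return None
    if group_name == "dp":
        return ASSUMED_DP_SIZE
    # tp, ep and any other group get a smaller, separate size.
    return ASSUMED_DEFAULT_SIZE
```

With `HCCL_BUFFSIZE` set for mc2, only the mc2 group keeps the large global value, while the other groups receive their own smaller sizes.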
@@ -29,7 +29,7 @@ class TestPatchDistributed(TestBase):
         self.mock_group_ranks = [[0, 1]]
         self.mock_local_rank = 0
         self.mock_backend = "hccl"
-        self.mock_use_device_comm = True
+        self.mock_use_device_comm = False
 
         patcher_get_rank = patch("torch.distributed.get_rank", return_value=0)
         patcher_new_group = patch("torch.distributed.new_group",
@@ -39,16 +39,24 @@ class TestPatchDistributed(TestBase):
         patcher_device_comm_cls = patch(
             "vllm.distributed.parallel_state.resolve_obj_by_qualname",
             return_value=MagicMock())
+        patcher_calculate_dp_buffer = patch(
+            "vllm_ascend.utils.calculate_dp_buffer_size", return_value=64)
+        patcher_npu_current_device = patch("torch.npu.current_device",
+                                           return_value=MagicMock())
 
         self.mock_get_rank = patcher_get_rank.start()
         self.mock_new_group = patcher_new_group.start()
         self.mock_is_cuda_alike = patcher_is_cuda_alike.start()
         self.mock_resolve_obj = patcher_device_comm_cls.start()
+        self.mock_calculate_dp_buffer = patcher_calculate_dp_buffer.start()
+        self.mock_npu_current_device = patcher_npu_current_device.start()
 
         self.addCleanup(patcher_get_rank.stop)
         self.addCleanup(patcher_new_group.stop)
         self.addCleanup(patcher_is_cuda_alike.stop)
         self.addCleanup(patcher_device_comm_cls.stop)
+        self.addCleanup(patcher_calculate_dp_buffer.stop)
+        self.addCleanup(patcher_npu_current_device.stop)
 
         self.group_coordinator = GroupCoordinatorPatch(
             group_ranks=self.mock_group_ranks,
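The new mocks in this test follow the standard `unittest.mock` pattern of `patch(...)`, `.start()`, and `addCleanup(patcher.stop)`. A minimal standalone sketch of that pattern is shown below. It patches `os.getcwd` purely as a stand-in target, and the class name `ExampleTest` is hypothetical and not part of this repository.

```python
import os
import unittest
from unittest.mock import patch


class ExampleTest(unittest.TestCase):

    def setUp(self):
        # Create the patcher, activate it, and keep the mock for assertions.
        patcher = patch("os.getcwd", return_value="/mocked")
        self.mock_getcwd = patcher.start()
        # addCleanup undoes the patch even if setUp or the test fails later.
        self.addCleanup(patcher.stop)

    def test_uses_mock(self):
        self.assertEqual(os.getcwd(), "/mocked")
        self.mock_getcwd.assert_called_once()
```

Registering `patcher.stop` with `addCleanup` instead of calling it in `tearDown` means the patch is still undone when a later line in `setUp` raises.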