[Feat] Add custom Embedding tensor model parallel (#2616)
Similar to #2309, this PR introduces embedding tensor model parallelism to
reduce memory consumption. It supports both eager mode and graph mode.
This PR also refactors the module tensor parallel configurations added in
#2309, #2167, and #2120, merging them all into `finegrained_tp_config` in
`additional_config`. The merged options are:
- `lmhead_tensor_parallel_size`
- `oproj_tensor_parallel_size`
- `embedding_tensor_parallel_size`
- `mlp_tensor_parallel_size`
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c
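
As a rough illustration of the merged layout, the four options now sit under a single `finegrained_tp_config` key. The sketch below only builds that dictionary and serializes it. The values are copied from the documentation example in this commit. How the resulting JSON is handed to vLLM (for instance through a JSON-valued command-line option) is an assumption, not something this commit shows.

```python
import json

# Values copied from the documentation example in this commit. They assume a
# deployment whose data_parallel_size is a multiple of 8, because the new
# config class rejects sizes that do not divide data_parallel_size.
additional_config = {
    "finegrained_tp_config": {
        "lmhead_tensor_parallel_size": 8,
        "oproj_tensor_parallel_size": 8,
        "embedding_tensor_parallel_size": 8,
        "mlp_tensor_parallel_size": 8,
    },
}

# Assumption: a JSON string is a convenient form for passing the config around.
additional_config_json = json.dumps(additional_config)
print(additional_config_json)
```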
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
@@ -27,14 +27,13 @@ The following table lists additional configuration options available in vLLM Asc
|
|||||||
| Name | Type | Default | Description |
|
| Name | Type | Default | Description |
|
||||||
|-------------------------------------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------|
|
|-------------------------------------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------|
|
||||||
| `xlite_graph_config` | dict | `{}` | Configuration options for xlite graph mode |
|
| `xlite_graph_config` | dict | `{}` | Configuration options for xlite graph mode |
|
||||||
|
| `finegrained_tp_config` | dict | `{}` | Configuration options for module tensor parallelism |
|
||||||
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
|
| `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
|
||||||
| `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test case. |
|
| `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test case. |
|
||||||
| `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
|
| `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
|
||||||
| `kv_cache_dtype` | str | `None` | When using the KV cache quantization method, KV cache dtype needs to be set, currently only int8 is supported. |
|
| `kv_cache_dtype` | str | `None` | When using the KV cache quantization method, KV cache dtype needs to be set, currently only int8 is supported. |
|
||||||
| `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
|
| `enable_shared_expert_dp` | bool | `False` | When the expert is shared in DP, it delivers better performance but consumes more memory. Currently only DeepSeek series models are supported. |
|
||||||
| `lmhead_tensor_parallel_size` | int | `None` | The custom tensor parallel size of lmhead. |
|
| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effects on MoE models with shared experts. |
|
||||||
| `oproj_tensor_parallel_size` | int | `None` | The custom tensor parallel size of oproj. |
|
|
||||||
| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multistream shared expert. This option only takes effect on MoE models with shared experts. |
|
|
||||||
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
|
| `dynamic_eplb` | bool | `False` | Whether to enable dynamic EPLB. |
|
||||||
| `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
|
| `num_iterations_eplb_update` | int | `400` | Forward iterations when EPLB begins. |
|
||||||
| `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
|
| `gate_eplb` | bool | `False` | Whether to enable EPLB only once. |
|
||||||
@@ -58,6 +57,15 @@ The details of each configuration option are as follows:
|
|||||||
| `enabled` | bool | `False` | Whether to enable weight prefetch. |
|
| `enabled` | bool | `False` | Whether to enable weight prefetch. |
|
||||||
| `prefetch_ratio` | dict | `{"attn": {"qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}}` | Prefetch ratio of each weight. |
|
| `prefetch_ratio` | dict | `{"attn": {"qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}}` | Prefetch ratio of each weight. |
|
||||||
|
|
||||||
|
**finegrained_tp_config**
|
||||||
|
|
||||||
|
| Name | Type | Default | Description |
|
||||||
|
| ---- | ---- | ------- | ----------- |
|
||||||
|
| `lmhead_tensor_parallel_size` | int | `0` | The custom tensor parallel size of lmhead. |
|
||||||
|
| `oproj_tensor_parallel_size` | int | `0` | The custom tensor parallel size of oproj. |
|
||||||
|
| `embedding_tensor_parallel_size` | int | `0` | The custom tensor parallel size of embedding. |
|
||||||
|
| `mlp_tensor_parallel_size` | int | `0` | The custom tensor parallel size of mlp. |
|
||||||
|
|
||||||
### Example
|
### Example
|
||||||
|
|
||||||
An example of additional configuration is as follows:
|
An example of additional configuration is as follows:
|
||||||
@@ -76,6 +84,12 @@ An example of additional configuration is as follows:
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
},
|
},
|
||||||
|
"finegrained_tp_config": {
|
||||||
|
"lmhead_tensor_parallel_size": 8,
|
||||||
|
"oproj_tensor_parallel_size": 8,
|
||||||
|
"embedding_tensor_parallel_size": 8,
|
||||||
|
"mlp_tensor_parallel_size": 8,
|
||||||
|
},
|
||||||
"multistream_overlap_shared_expert": True,
|
"multistream_overlap_shared_expert": True,
|
||||||
"refresh": False,
|
"refresh": False,
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -12,15 +12,17 @@ from vllm_ascend.distributed.parallel_state import (
|
|||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def parallel_config():
|
def parallel_config():
|
||||||
return ParallelConfig(data_parallel_size=2,
|
return ParallelConfig(
|
||||||
tensor_parallel_size=2,
|
data_parallel_size=2,
|
||||||
pipeline_parallel_size=2)
|
tensor_parallel_size=4,
|
||||||
|
pipeline_parallel_size=2,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def mock_distributed():
|
def mock_distributed():
|
||||||
with patch('torch.distributed.is_initialized', return_value=True), \
|
with patch('torch.distributed.is_initialized', return_value=True), \
|
||||||
patch('torch.distributed.get_world_size', return_value=8), \
|
patch('torch.distributed.get_world_size', return_value=16), \
|
||||||
patch('torch.distributed.get_backend', return_value='nccl'), \
|
patch('torch.distributed.get_backend', return_value='nccl'), \
|
||||||
patch('vllm_ascend.distributed.parallel_state.get_world_group') as mock_group, \
|
patch('vllm_ascend.distributed.parallel_state.get_world_group') as mock_group, \
|
||||||
patch('vllm_ascend.distributed.parallel_state.get_tp_group') as mock_tp_group, \
|
patch('vllm_ascend.distributed.parallel_state.get_tp_group') as mock_tp_group, \
|
||||||
@@ -36,8 +38,9 @@ def mock_distributed():
|
|||||||
|
|
||||||
def test_init_ascend_model_parallel(mock_distributed, parallel_config):
|
def test_init_ascend_model_parallel(mock_distributed, parallel_config):
|
||||||
mock_ascend_config = MagicMock()
|
mock_ascend_config = MagicMock()
|
||||||
mock_ascend_config.lmhead_tensor_parallel_size = 2
|
mock_ascend_config.finegrained_tp_config.lmhead_tensor_parallel_size = 2
|
||||||
mock_ascend_config.oproj_tensor_parallel_size = 2
|
mock_ascend_config.finegrained_tp_config.oproj_tensor_parallel_size = 2
|
||||||
|
mock_ascend_config.finegrained_tp_config.embedding_tensor_parallel_size = 2
|
||||||
mock_ascend_config.flashcomm2_oproj_tensor_parallel_size = 2
|
mock_ascend_config.flashcomm2_oproj_tensor_parallel_size = 2
|
||||||
mock_ascend_config.pd_tp_ratio = 2
|
mock_ascend_config.pd_tp_ratio = 2
|
||||||
mock_ascend_config.num_head_replica = 0
|
mock_ascend_config.num_head_replica = 0
|
||||||
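
The test changes above rely on `MagicMock` creating nested attributes on demand, which is why `mock_ascend_config.finegrained_tp_config.<name> = 2` works without building a real config object. The sketch below is a minimal, standalone illustration. The helper names `make_mock_ascend_config` and `is_enabled` are hypothetical and do not exist in the repository. It also shows why each size that the code compares with `> 0` probably needs an explicit value: an unset attribute is itself a `MagicMock`, and ordering comparisons against an `int` raise `TypeError`.

```python
from unittest.mock import MagicMock


def make_mock_ascend_config(**sizes):
    """Hypothetical helper: build a mock config with explicit TP sizes."""
    cfg = MagicMock()
    for name, value in sizes.items():
        # finegrained_tp_config is auto-created by MagicMock on first access.
        setattr(cfg.finegrained_tp_config, name, value)
    return cfg


def is_enabled(size):
    """Mirror the `size > 0` check; return None if the comparison fails."""
    try:
        return size > 0
    except TypeError:
        return None


cfg = make_mock_ascend_config(embedding_tensor_parallel_size=2)
explicit = cfg.finegrained_tp_config.embedding_tensor_parallel_size
implicit = cfg.finegrained_tp_config.mlp_tensor_parallel_size  # never set
```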
|
|||||||
@@ -1,4 +1,3 @@
|
|||||||
import os
|
|
||||||
import unittest
|
import unittest
|
||||||
from unittest import mock
|
from unittest import mock
|
||||||
from unittest.mock import MagicMock, patch
|
from unittest.mock import MagicMock, patch
|
||||||
@@ -26,7 +25,8 @@ class BaseLinearTest(unittest.TestCase):
|
|||||||
parallel_state._OTP = self.mock_group
|
parallel_state._OTP = self.mock_group
|
||||||
|
|
||||||
self.mock_ascend_config = MagicMock()
|
self.mock_ascend_config = MagicMock()
|
||||||
self.mock_ascend_config.oproj_tensor_parallel_size = 2
|
self.mock_ascend_config.finegrained_tp_config.oproj_tensor_parallel_size = 2
|
||||||
|
self.mock_ascend_config.finegrained_tp_config.mlp_tensor_parallel_size = 2
|
||||||
|
|
||||||
self.patches = [
|
self.patches = [
|
||||||
patch("vllm_ascend.ascend_config.get_ascend_config",
|
patch("vllm_ascend.ascend_config.get_ascend_config",
|
||||||
@@ -81,7 +81,11 @@ class TestAscendUnquantizedLinearMethod(TestBase):
|
|||||||
class TestAscendRowParallelLinear(BaseLinearTest):
|
class TestAscendRowParallelLinear(BaseLinearTest):
|
||||||
|
|
||||||
def test_mlp_optimize(self):
|
def test_mlp_optimize(self):
|
||||||
os.environ["VLLM_ASCEND_ENABLE_MLP_OPTIMIZE"] = "1"
|
|
||||||
|
ascend_config._ASCEND_CONFIG = MagicMock()
|
||||||
|
ascend_config._ASCEND_CONFIG.recompute_scheduler_enable = False
|
||||||
|
ascend_config._ASCEND_CONFIG.finegrained_tp_config.mlp_tensor_parallel_size = 2
|
||||||
|
ascend_config._ASCEND_CONFIG.ascend_scheduler_config.enabled = False
|
||||||
|
|
||||||
linear = AscendRowParallelLinear(
|
linear = AscendRowParallelLinear(
|
||||||
input_size=16,
|
input_size=16,
|
||||||
@@ -98,8 +102,9 @@ class TestAscendRowParallelLinear(BaseLinearTest):
|
|||||||
config._current_vllm_config = MagicMock()
|
config._current_vllm_config = MagicMock()
|
||||||
|
|
||||||
ascend_config._ASCEND_CONFIG = MagicMock()
|
ascend_config._ASCEND_CONFIG = MagicMock()
|
||||||
ascend_config._ASCEND_CONFIG.oproj_tensor_parallel_size = 2
|
|
||||||
ascend_config._ASCEND_CONFIG.recompute_scheduler_enable = False
|
ascend_config._ASCEND_CONFIG.recompute_scheduler_enable = False
|
||||||
|
ascend_config._ASCEND_CONFIG.finegrained_tp_config.oproj_tensor_parallel_size = 2
|
||||||
|
ascend_config._ASCEND_CONFIG.ascend_scheduler_config.enabled = False
|
||||||
|
|
||||||
linear = AscendRowParallelLinear(
|
linear = AscendRowParallelLinear(
|
||||||
input_size=16,
|
input_size=16,
|
||||||
@@ -115,7 +120,11 @@ class TestAscendRowParallelLinear(BaseLinearTest):
|
|||||||
class TestAscendMergedColumnParallelLinear(BaseLinearTest):
|
class TestAscendMergedColumnParallelLinear(BaseLinearTest):
|
||||||
|
|
||||||
def test_merged_mlp_tp_init(self):
|
def test_merged_mlp_tp_init(self):
|
||||||
os.environ["VLLM_ASCEND_ENABLE_MLP_OPTIMIZE"] = "1"
|
|
||||||
|
ascend_config._ASCEND_CONFIG = MagicMock()
|
||||||
|
ascend_config._ASCEND_CONFIG.recompute_scheduler_enable = False
|
||||||
|
ascend_config._ASCEND_CONFIG.finegrained_tp_config.mlp_tensor_parallel_size = 2
|
||||||
|
ascend_config._ASCEND_CONFIG.ascend_scheduler_config.enabled = False
|
||||||
|
|
||||||
linear = AscendMergedColumnParallelLinear(
|
linear = AscendMergedColumnParallelLinear(
|
||||||
input_size=16,
|
input_size=16,
|
||||||
|
|||||||
@@ -14,11 +14,12 @@
|
|||||||
# Adapted from vllm/tests/lora/test_layers.py
|
# Adapted from vllm/tests/lora/test_layers.py
|
||||||
|
|
||||||
import unittest
|
import unittest
|
||||||
|
from unittest import mock
|
||||||
from unittest.mock import MagicMock, patch
|
from unittest.mock import MagicMock, patch
|
||||||
|
|
||||||
import torch
|
import torch
|
||||||
|
|
||||||
from vllm_ascend.ascend_config import init_ascend_config
|
from vllm_ascend.distributed import parallel_state
|
||||||
from vllm_ascend.ops.vocab_parallel_embedding import (
|
from vllm_ascend.ops.vocab_parallel_embedding import (
|
||||||
AscendLogitsProcessor, AscendParallelLMHead, AscendVocabParallelEmbedding)
|
AscendLogitsProcessor, AscendParallelLMHead, AscendVocabParallelEmbedding)
|
||||||
|
|
||||||
@@ -32,9 +33,33 @@ class TestCustomVocabParallelEmbedding(unittest.TestCase):
|
|||||||
self.embedding_dim = 10
|
self.embedding_dim = 10
|
||||||
self.org_num_embeddings = 40
|
self.org_num_embeddings = 40
|
||||||
self.padding_size = 8
|
self.padding_size = 8
|
||||||
|
|
||||||
|
self.mock_group = mock.MagicMock()
|
||||||
|
self.mock_group.world_size = 2
|
||||||
|
self.mock_group.rank_in_group = 0
|
||||||
|
|
||||||
|
parallel_state._MLP_TP = self.mock_group
|
||||||
|
parallel_state._OTP = self.mock_group
|
||||||
|
|
||||||
mock_vllm_config = MagicMock()
|
mock_vllm_config = MagicMock()
|
||||||
mock_vllm_config.additional_config = {}
|
mock_vllm_config.additional_config = {}
|
||||||
init_ascend_config(mock_vllm_config)
|
self.mock_ascend_config = MagicMock()
|
||||||
|
self.mock_ascend_config.finegrained_tp_config.lmhead_tensor_parallel_size = 2
|
||||||
|
self.mock_ascend_config.finegrained_tp_config.embedding_tensor_parallel_size = 2
|
||||||
|
|
||||||
|
self.patches = [
|
||||||
|
patch("vllm_ascend.utils.get_ascend_config",
|
||||||
|
return_value=self.mock_ascend_config),
|
||||||
|
patch("vllm_ascend.distributed.parallel_state.get_lmhead_tp_group",
|
||||||
|
return_value=self.mock_group),
|
||||||
|
patch(
|
||||||
|
"vllm.distributed.parallel_state.get_tp_group",
|
||||||
|
return_value=self.mock_group,
|
||||||
|
),
|
||||||
|
]
|
||||||
|
|
||||||
|
for p in self.patches:
|
||||||
|
p.start()
|
||||||
|
|
||||||
def _create_layer(self):
|
def _create_layer(self):
|
||||||
# Patch methods and dependencies for VocabParallelEmbedding
|
# Patch methods and dependencies for VocabParallelEmbedding
|
||||||
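
The `setUp` change above starts a list of patches by hand. The hunk does not show where they are stopped, so the following is only a sketch of one common way to pair each `start()` with a `stop()`, using `addCleanup`. The patched target (`os.getcwd`) and the class name are placeholders chosen so that the example runs without vLLM installed.

```python
import os
import unittest
from unittest.mock import patch


class PatchLifecycleExample(unittest.TestCase):

    def setUp(self):
        # Placeholder target; the real test patches vllm_ascend helpers.
        self.patches = [patch("os.getcwd", return_value="/mocked")]
        for p in self.patches:
            p.start()
            # Registering stop() here undoes the patch even if the test fails.
            self.addCleanup(p.stop)

    def test_patched(self):
        self.assertEqual(os.getcwd(), "/mocked")


def run_example():
    """Run the example test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        PatchLifecycleExample)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With this pattern the patched function is restored once each test finishes, so later tests see the real implementation.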
|
|||||||
@@ -67,6 +67,11 @@ class AscendConfig:
|
|||||||
self.ascend_compilation_config = AscendCompilationConfig(
|
self.ascend_compilation_config = AscendCompilationConfig(
|
||||||
**ascend_compilation_config)
|
**ascend_compilation_config)
|
||||||
|
|
||||||
|
finegrained_tp_config = additional_config.get("finegrained_tp_config",
|
||||||
|
{})
|
||||||
|
self.finegrained_tp_config = FinegrainedTPConfig(
|
||||||
|
finegrained_tp_config, vllm_config)
|
||||||
|
|
||||||
# Dump / PrecisionDebugger configuration
|
# Dump / PrecisionDebugger configuration
|
||||||
dump_config_path = additional_config.get("dump_config", None)
|
dump_config_path = additional_config.get("dump_config", None)
|
||||||
self.dump_config = DumpConfig(dump_config_path)
|
self.dump_config = DumpConfig(dump_config_path)
|
||||||
@@ -103,34 +108,6 @@ class AscendConfig:
|
|||||||
"multistream_overlap_shared_expert", False)
|
"multistream_overlap_shared_expert", False)
|
||||||
self.recompute_scheduler_enable = additional_config.get(
|
self.recompute_scheduler_enable = additional_config.get(
|
||||||
"recompute_scheduler_enable", False)
|
"recompute_scheduler_enable", False)
|
||||||
self.lmhead_tensor_parallel_size = additional_config.get(
|
|
||||||
"lmhead_tensor_parallel_size", None)
|
|
||||||
if self.lmhead_tensor_parallel_size is not None:
|
|
||||||
logger.info(
|
|
||||||
f"Enable lmhead_tensor_parallel_size={self.lmhead_tensor_parallel_size} in pure DP scenario"
|
|
||||||
)
|
|
||||||
if vllm_config.parallel_config.tensor_parallel_size != 1:
|
|
||||||
raise AssertionError(
|
|
||||||
"lmhead_tensor_parallel_size is only supported in the pure DP scenario"
|
|
||||||
)
|
|
||||||
self.oproj_tensor_parallel_size = additional_config.get(
|
|
||||||
"oproj_tensor_parallel_size", None)
|
|
||||||
if self.oproj_tensor_parallel_size is not None:
|
|
||||||
logger.info(
|
|
||||||
f"Enable oproj_tensor_parallel_size={self.oproj_tensor_parallel_size} in pure DP scenario"
|
|
||||||
)
|
|
||||||
if vllm_config.parallel_config.tensor_parallel_size != 1:
|
|
||||||
raise AssertionError(
|
|
||||||
"oproj_tensor_parallel_size is only supported in the pure DP scenario"
|
|
||||||
)
|
|
||||||
if vllm_config.model_config.enforce_eager is True:
|
|
||||||
raise AssertionError(
|
|
||||||
"oproj_tensor_parallel_size is only supported in graph mode"
|
|
||||||
)
|
|
||||||
if vllm_config.kv_transfer_config is None or not vllm_config.kv_transfer_config.is_kv_consumer:
|
|
||||||
raise AssertionError(
|
|
||||||
"oproj_tensor_parallel_size is only supported in pd scenario and can only be used in D node."
|
|
||||||
)
|
|
||||||
self.enable_cpu_binding = additional_config.get(
|
self.enable_cpu_binding = additional_config.get(
|
||||||
"enable_cpu_binding", False)
|
"enable_cpu_binding", False)
|
||||||
|
|
||||||
@@ -181,6 +158,61 @@ class AscendConfig:
|
|||||||
kv_cfg._engine_id_patched = True
|
kv_cfg._engine_id_patched = True
|
||||||
|
|
||||||
|
|
||||||
|
class FinegrainedTPConfig:
|
||||||
|
"""
|
||||||
|
Configuration Object for finegrained_tp_config from additional_config
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, finegrained_tp_config: dict, vllm_config):
|
||||||
|
self.oproj_tensor_parallel_size = finegrained_tp_config.get(
|
||||||
|
"oproj_tensor_parallel_size", 0)
|
||||||
|
self.lmhead_tensor_parallel_size = finegrained_tp_config.get(
|
||||||
|
"lmhead_tensor_parallel_size", 0)
|
||||||
|
self.embedding_tensor_parallel_size = finegrained_tp_config.get(
|
||||||
|
"embedding_tensor_parallel_size", 0)
|
||||||
|
self.mlp_tensor_parallel_size = finegrained_tp_config.get(
|
||||||
|
"mlp_tensor_parallel_size", 0)
|
||||||
|
|
||||||
|
enabled_configs = []
|
||||||
|
if self.oproj_tensor_parallel_size > 0:
|
||||||
|
enabled_configs.append(
|
||||||
|
f"oproj_tensor_parallel_size={self.oproj_tensor_parallel_size}"
|
||||||
|
)
|
||||||
|
# dummy_run does not run the entire attention module in eager mode, so the o_proj tp split can only be used in graph mode.
|
||||||
|
if vllm_config.model_config.enforce_eager is True:
|
||||||
|
raise AssertionError(
|
||||||
|
"oproj_tensor_parallel_size is only supported in graph mode"
|
||||||
|
)
|
||||||
|
if vllm_config.kv_transfer_config is None or not vllm_config.kv_transfer_config.is_kv_consumer:
|
||||||
|
raise AssertionError(
|
||||||
|
"oproj_tensor_parallel_size is only supported in pd scenario and can only be used in D node."
|
||||||
|
)
|
||||||
|
if self.lmhead_tensor_parallel_size > 0:
|
||||||
|
enabled_configs.append(
|
||||||
|
f"lmhead_tensor_parallel_size={self.lmhead_tensor_parallel_size}"
|
||||||
|
)
|
||||||
|
if self.embedding_tensor_parallel_size > 0:
|
||||||
|
enabled_configs.append(
|
||||||
|
f"embedding_tensor_parallel_size={self.embedding_tensor_parallel_size}"
|
||||||
|
)
|
||||||
|
if self.mlp_tensor_parallel_size > 0:
|
||||||
|
enabled_configs.append(
|
||||||
|
f"mlp_tensor_parallel_size={self.mlp_tensor_parallel_size}")
|
||||||
|
module_tp_sizes = [
|
||||||
|
self.oproj_tensor_parallel_size,
|
||||||
|
self.lmhead_tensor_parallel_size,
|
||||||
|
self.embedding_tensor_parallel_size,
|
||||||
|
self.mlp_tensor_parallel_size,
|
||||||
|
]
|
||||||
|
for module_tp_size in module_tp_sizes:
|
||||||
|
if module_tp_size > 0 and vllm_config.parallel_config.data_parallel_size % module_tp_size != 0:
|
||||||
|
raise AssertionError(
|
||||||
|
"module tp sizes must divide data_parallel_size")
|
||||||
|
if any(size > 0 for size in module_tp_sizes) and enabled_configs:
|
||||||
|
logger.info(
|
||||||
|
f"finegrained_tp_config enabled: {', '.join(enabled_configs)}")
|
||||||
|
|
||||||
|
|
||||||
class AscendCompilationConfig:
|
class AscendCompilationConfig:
|
||||||
"""
|
"""
|
||||||
Configuration for controlling the behavior of Ascend graph optimization.
|
Configuration for controlling the behavior of Ascend graph optimization.
|
||||||
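
The checks performed by `FinegrainedTPConfig` can be summarized in a standalone sketch. The function below is hypothetical (`check_finegrained_tp` is not part of the repository) and takes plain values instead of a `vllm_config` object, but it follows the same rules as the hunk above: a size of `0` means the option is disabled, `oproj_tensor_parallel_size` requires graph mode and a KV consumer (D node), and every enabled size must divide `data_parallel_size`.

```python
TP_KEYS = (
    "oproj_tensor_parallel_size",
    "lmhead_tensor_parallel_size",
    "embedding_tensor_parallel_size",
    "mlp_tensor_parallel_size",
)


def check_finegrained_tp(cfg, data_parallel_size, enforce_eager=False,
                         is_kv_consumer=True):
    """Hypothetical re-statement of the validation in FinegrainedTPConfig."""
    sizes = {key: cfg.get(key, 0) for key in TP_KEYS}  # 0 means disabled

    if sizes["oproj_tensor_parallel_size"] > 0:
        if enforce_eager:
            raise AssertionError(
                "oproj_tensor_parallel_size is only supported in graph mode")
        if not is_kv_consumer:
            raise AssertionError(
                "oproj_tensor_parallel_size can only be used in D node")

    for size in sizes.values():
        if size > 0 and data_parallel_size % size != 0:
            raise AssertionError(
                "module tp sizes must divide data_parallel_size")

    # Same "name=value" strings that the commit logs for enabled options.
    return [f"{key}={size}" for key, size in sizes.items() if size > 0]


enabled = check_finegrained_tp({"embedding_tensor_parallel_size": 2},
                               data_parallel_size=4)
```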
|
|||||||
@@ -7,69 +7,27 @@ from vllm.distributed.parallel_state import (GroupCoordinator, get_dp_group,
|
|||||||
get_world_group,
|
get_world_group,
|
||||||
init_model_parallel_group)
|
init_model_parallel_group)
|
||||||
|
|
||||||
import vllm_ascend.envs as envs_ascend
|
|
||||||
from vllm_ascend.ascend_config import get_ascend_config
|
from vllm_ascend.ascend_config import get_ascend_config
|
||||||
from vllm_ascend.utils import (enable_sp, flashcomm2_enable,
|
from vllm_ascend.utils import (enable_sp, flashcomm2_enable,
|
||||||
flashcomm2_o_shared_enabled)
|
flashcomm2_o_shared_enabled)
|
||||||
|
|
||||||
# Currently, mc2 op need their own group coordinator.
|
# Currently, mc2 op need their own group coordinator.
|
||||||
_MC2: Optional[GroupCoordinator] = None
|
_MC2: Optional[GroupCoordinator] = None
|
||||||
|
|
||||||
|
# Module specific tensor parallel groups
|
||||||
_MLP_TP: Optional[GroupCoordinator] = None
|
_MLP_TP: Optional[GroupCoordinator] = None
|
||||||
_OTP: Optional[GroupCoordinator] = None
|
_OTP: Optional[GroupCoordinator] = None
|
||||||
_LMTP: Optional[GroupCoordinator] = None
|
_LMTP: Optional[GroupCoordinator] = None
|
||||||
_P_TP: Optional[GroupCoordinator] = None
|
_EMBED_TP: Optional[GroupCoordinator] = None
|
||||||
|
|
||||||
|
# flashcomm2 specific groups
|
||||||
_FLASHCOMM2_OTP: Optional[GroupCoordinator] = None
|
_FLASHCOMM2_OTP: Optional[GroupCoordinator] = None
|
||||||
_FLASHCOMM2_ODP: Optional[GroupCoordinator] = None
|
_FLASHCOMM2_ODP: Optional[GroupCoordinator] = None
|
||||||
|
|
||||||
|
# shared_weight across rank groups
|
||||||
_SHARED_WEIGHT: Optional[GroupCoordinator] = None
|
_SHARED_WEIGHT: Optional[GroupCoordinator] = None
|
||||||
|
|
||||||
|
_P_TP: Optional[GroupCoordinator] = None
|
||||||
def get_mc2_group() -> GroupCoordinator:
|
|
||||||
assert _MC2 is not None, ("mc2 group is not initialized")
|
|
||||||
return _MC2
|
|
||||||
|
|
||||||
|
|
||||||
def get_otp_group() -> GroupCoordinator:
|
|
||||||
assert _OTP is not None, (
|
|
||||||
"output tensor parallel group is not initialized")
|
|
||||||
return _OTP
|
|
||||||
|
|
||||||
|
|
||||||
def get_lmhead_tp_group() -> GroupCoordinator:
|
|
||||||
assert _LMTP is not None, (
|
|
||||||
"lm head tensor parallel group is not initialized")
|
|
||||||
return _LMTP
|
|
||||||
|
|
||||||
|
|
||||||
def get_flashcomm2_otp_group() -> GroupCoordinator:
|
|
||||||
return _FLASHCOMM2_OTP
|
|
||||||
|
|
||||||
|
|
||||||
def get_flashcomm2_odp_group() -> GroupCoordinator:
|
|
||||||
assert _FLASHCOMM2_ODP is not None, (
|
|
||||||
"output data parallel group for flashcomm2 is not initialized")
|
|
||||||
return _FLASHCOMM2_ODP
|
|
||||||
|
|
||||||
|
|
||||||
def get_shared_weight_group() -> GroupCoordinator:
|
|
||||||
assert _SHARED_WEIGHT is not None, (
|
|
||||||
"output shared weight parallel group for flashcomm2 is not initialized"
|
|
||||||
)
|
|
||||||
return _SHARED_WEIGHT
|
|
||||||
|
|
||||||
|
|
||||||
def get_mlp_tp_group() -> GroupCoordinator:
|
|
||||||
assert _MLP_TP is not None, ("mlp group is not initialized")
|
|
||||||
return _MLP_TP
|
|
||||||
|
|
||||||
|
|
||||||
def get_p_tp_group() -> GroupCoordinator:
|
|
||||||
assert _P_TP is not None, (
|
|
||||||
"distributed prefill tensor parallel group is not initialized")
|
|
||||||
return _P_TP
|
|
||||||
|
|
||||||
|
|
||||||
def model_parallel_initialized():
|
|
||||||
return (_MC2 is not None)
|
|
||||||
|
|
||||||
|
|
||||||
def init_ascend_model_parallel(parallel_config: ParallelConfig, ):
|
def init_ascend_model_parallel(parallel_config: ParallelConfig, ):
|
||||||
@@ -79,14 +37,16 @@ def init_ascend_model_parallel(parallel_config: ParallelConfig, ):
|
|||||||
world_size = torch.distributed.get_world_size()
|
world_size = torch.distributed.get_world_size()
|
||||||
backend = torch.distributed.get_backend(get_world_group().device_group)
|
backend = torch.distributed.get_backend(get_world_group().device_group)
|
||||||
vllm_config = get_current_vllm_config()
|
vllm_config = get_current_vllm_config()
|
||||||
|
global_tp_size = parallel_config.tensor_parallel_size
|
||||||
|
global_dp_size = parallel_config.data_parallel_size
|
||||||
|
global_pp_size = parallel_config.pipeline_parallel_size
|
||||||
|
|
||||||
# The layout of all ranks: ExternalDP * EP
|
# The layout of all ranks: ExternalDP * EP
|
||||||
# ExternalDP is the data parallel group that is not part of the model,
|
# ExternalDP is the data parallel group that is not part of the model,
|
||||||
# every dp rank can generate independently (in verl integration).
|
# every dp rank can generate independently (in verl integration).
|
||||||
all_ranks = torch.arange(world_size).reshape(
|
all_ranks = torch.arange(world_size).reshape(
|
||||||
-1, parallel_config.data_parallel_size *
|
-1, global_dp_size * parallel_config.prefill_context_parallel_size *
|
||||||
parallel_config.prefill_context_parallel_size *
|
global_tp_size)
|
||||||
parallel_config.tensor_parallel_size)
|
|
||||||
|
|
||||||
pd_tp_ratio = get_ascend_config().pd_tp_ratio
|
pd_tp_ratio = get_ascend_config().pd_tp_ratio
|
||||||
pd_head_ratio = get_ascend_config().pd_head_ratio
|
pd_head_ratio = get_ascend_config().pd_head_ratio
|
||||||
@@ -98,13 +58,13 @@ def init_ascend_model_parallel(parallel_config: ParallelConfig, ):
|
|||||||
if pd_head_ratio > 1 and get_current_vllm_config(
|
if pd_head_ratio > 1 and get_current_vllm_config(
|
||||||
).kv_transfer_config.is_kv_producer:
|
).kv_transfer_config.is_kv_producer:
|
||||||
num_head_replica = get_ascend_config().num_head_replica
|
num_head_replica = get_ascend_config().num_head_replica
|
||||||
remote_tp_size = parallel_config.tensor_parallel_size // pd_tp_ratio
|
remote_tp_size = global_tp_size // pd_tp_ratio
|
||||||
if num_head_replica <= 1:
|
if num_head_replica <= 1:
|
||||||
group_ranks = all_ranks.view(
|
group_ranks = all_ranks.view(
|
||||||
-1, prefill_tensor_model_parallel_size).unbind(0)
|
-1, prefill_tensor_model_parallel_size).unbind(0)
|
||||||
else:
|
else:
|
||||||
group_ranks = all_ranks.clone().view(
|
group_ranks = all_ranks.clone().view(
|
||||||
parallel_config.data_parallel_size, -1,
|
global_dp_size, -1,
|
||||||
num_head_replica) # [DP_size, num_head, num_head_replica]
|
num_head_replica) # [DP_size, num_head, num_head_replica]
|
||||||
group_ranks = group_ranks.permute(0, 2, 1)
|
group_ranks = group_ranks.permute(0, 2, 1)
|
||||||
group_ranks = group_ranks.reshape(
|
group_ranks = group_ranks.reshape(
|
||||||
@@ -112,8 +72,7 @@ def init_ascend_model_parallel(parallel_config: ParallelConfig, ):
|
|||||||
group_ranks.size(-1)) # [DP_size * num_head_replica, num_head]
|
group_ranks.size(-1)) # [DP_size * num_head_replica, num_head]
|
||||||
alltoall_group_size = group_ranks.size(-1) // remote_tp_size
|
alltoall_group_size = group_ranks.size(-1) // remote_tp_size
|
||||||
group_ranks = group_ranks.unsqueeze(-1).view(
|
group_ranks = group_ranks.unsqueeze(-1).view(
|
||||||
parallel_config.data_parallel_size, num_head_replica, -1,
|
global_dp_size, num_head_replica, -1, alltoall_group_size
|
||||||
alltoall_group_size
|
|
||||||
) # [DP_size, num_head_replica, num_alltoall_group, alltoall_group_size]
|
) # [DP_size, num_head_replica, num_alltoall_group, alltoall_group_size]
|
||||||
group_ranks = group_ranks.reshape(-1,
|
group_ranks = group_ranks.reshape(-1,
|
||||||
alltoall_group_size).unbind(0)
|
alltoall_group_size).unbind(0)
|
||||||
@@ -135,54 +94,72 @@ def init_ascend_model_parallel(parallel_config: ParallelConfig, ):
|
|||||||
get_world_group().local_rank,
|
get_world_group().local_rank,
|
||||||
backend,
|
backend,
|
||||||
group_name="mc2")
|
group_name="mc2")
|
||||||
if envs_ascend.VLLM_ASCEND_ENABLE_MLP_OPTIMIZE:
|
|
||||||
global _MLP_TP
|
|
||||||
assert _MLP_TP is None, (
|
|
||||||
"mlp tensor model parallel group is already initialized")
|
|
||||||
|
|
||||||
mlp_tp = parallel_config.data_parallel_size
|
# Initialize specialized tensor parallel (TP) process groups for fine-grained model parallelism
|
||||||
|
# on Ascend hardware. This enables independent TP configurations for four critical components:
|
||||||
|
|
||||||
all_ranks_mlp_head = torch.arange(world_size).reshape(
|
# 1. ** LM Head **:
|
||||||
-1, mlp_tp, parallel_config.pipeline_parallel_size, 1) # noqa
|
# The final linear layer that maps hidden states to vocabulary logits.
|
||||||
group_ranks = all_ranks_mlp_head.view(-1, mlp_tp).unbind(0)
|
# Controlled by `lmhead_tensor_parallel_size`.
|
||||||
group_ranks = [x.tolist() for x in group_ranks]
|
|
||||||
|
|
||||||
# message queue broadcaster is only used in tensor model parallel group
|
# 2. ** o_proj **:
|
||||||
_MLP_TP = init_model_parallel_group(group_ranks,
|
# The output projection in attention blocks (e.g., in Multi-Head Attention).
|
||||||
get_world_group().local_rank,
|
# Controlled by `oproj_tensor_parallel_size`.
|
||||||
backend,
|
|
||||||
group_name="mlp_tp")
|
|
||||||
|
|
||||||
# If oproj tensor parallel size is set, we will create a group for it.
|
# 3. ** Embedding **:
|
||||||
otp_size = get_ascend_config().oproj_tensor_parallel_size
|
# The token embedding table at the input and/or output of the model.
|
||||||
if otp_size is not None:
|
# Controlled by `embedding_tensor_parallel_size`.
|
||||||
group_ranks = []
|
|
||||||
global _OTP
|
|
||||||
num_oproj_tensor_parallel_groups: int = (world_size // otp_size)
|
|
||||||
for i in range(num_oproj_tensor_parallel_groups):
|
|
||||||
ranks = list(range(i * otp_size, (i + 1) * otp_size))
|
|
||||||
group_ranks.append(ranks)
|
|
||||||
_OTP = init_model_parallel_group(group_ranks,
|
|
||||||
get_world_group().local_rank,
|
|
||||||
backend,
|
|
||||||
group_name="otp")
|
|
||||||
|
|
||||||
lmhead_tensor_parallel_size = get_ascend_config(
|
# 4. ** MLP **:
|
||||||
).lmhead_tensor_parallel_size
|
# The feed-forward network layers within transformer blocks.
|
||||||
if lmhead_tensor_parallel_size is not None:
|
# Controlled by `mlp_tensor_parallel_size`.
|
||||||
group_ranks = []
|
|
||||||
global _LMTP
|
_group_cache = {}
|
||||||
num_lmhead_tensor_parallel_groups: int = (world_size //
|
|
||||||
lmhead_tensor_parallel_size)
|
def _create_or_get_group(group_size: int,
|
||||||
for i in range(num_lmhead_tensor_parallel_groups):
|
group_name: str) -> GroupCoordinator:
|
||||||
ranks = list(
|
if group_size is None:
|
||||||
range(i * lmhead_tensor_parallel_size,
|
return None
|
||||||
(i + 1) * lmhead_tensor_parallel_size))
|
if group_size not in _group_cache:
|
||||||
group_ranks.append(ranks)
|
|
||||||
_LMTP = init_model_parallel_group(group_ranks,
|
rank_grid = torch.arange(world_size).reshape(
|
||||||
get_world_group().local_rank,
|
global_pp_size, global_dp_size, global_tp_size)
|
||||||
backend,
|
num_chunks = global_dp_size // group_size
|
||||||
group_name="lmheadtp")
|
group_ranks = []
|
||||||
|
for pp_idx in range(global_pp_size):
|
||||||
|
stage_ranks = rank_grid[pp_idx] # (dp, tp)
|
||||||
|
for chunk in range(num_chunks):
|
||||||
|
for tp_idx in range(global_tp_size):
|
||||||
|
group = stage_ranks[chunk * group_size:(chunk + 1) *
|
||||||
|
group_size, tp_idx].tolist()
|
||||||
|
group_ranks.append(group)
|
||||||
|
pg = init_model_parallel_group(group_ranks,
|
||||||
|
get_world_group().local_rank,
|
||||||
|
backend,
|
||||||
|
group_name=group_name)
|
||||||
|
_group_cache[group_size] = pg
|
||||||
|
|
||||||
|
return _group_cache[group_size]
|
||||||
|
|
||||||
|
otp_size = get_ascend_config(
|
||||||
|
).finegrained_tp_config.oproj_tensor_parallel_size
|
||||||
|
lmhead_tp_size = get_ascend_config(
|
||||||
|
).finegrained_tp_config.lmhead_tensor_parallel_size
|
||||||
|
embedding_tp_size = get_ascend_config(
|
||||||
|
).finegrained_tp_config.embedding_tensor_parallel_size
|
||||||
|
mlp_tp_size = get_ascend_config(
|
||||||
|
).finegrained_tp_config.mlp_tensor_parallel_size
|
||||||
|
|
||||||
|
global _OTP, _LMTP, _EMBED_TP, _MLP_TP
|
||||||
|
|
||||||
|
if otp_size > 0:
|
||||||
|
_OTP = _create_or_get_group(otp_size, "otp")
|
||||||
|
if lmhead_tp_size > 0:
|
||||||
|
_LMTP = _create_or_get_group(lmhead_tp_size, "lmheadtp")
|
||||||
|
if embedding_tp_size > 0:
|
||||||
|
_EMBED_TP = _create_or_get_group(embedding_tp_size, "emtp")
|
||||||
|
if mlp_tp_size > 0:
|
||||||
|
_MLP_TP = _create_or_get_group(mlp_tp_size, "mlptp")
|
||||||
|
|
||||||
def _create_shared_weight_group(group_name: str) -> GroupCoordinator:
|
def _create_shared_weight_group(group_name: str) -> GroupCoordinator:
|
||||||
#This communication domain is used for asynchronous broadcasting, so we will create a new communication group to avoid interference
|
#This communication domain is used for asynchronous broadcasting, so we will create a new communication group to avoid interference
|
||||||
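The rank slicing in `_create_or_get_group` can be hard to picture from the diff alone. The following is a minimal sketch, not the project's code: it replaces `torch.arange(...).reshape(...)` with plain index arithmetic, and the helper name `build_group_ranks` and its divisibility check are assumptions added for illustration. It shows that each fine-grained TP group collects `group_size` consecutive DP ranks that share the same TP index within one pipeline stage.

```python
from typing import List


def build_group_ranks(pp_size: int, dp_size: int, tp_size: int,
                      group_size: int) -> List[List[int]]:
    """Hypothetical helper mirroring the slicing in `_create_or_get_group`.

    Ranks are laid out as a row-major (pp, dp, tp) grid, so the rank at
    (pp_idx, dp_idx, tp_idx) is
    pp_idx * dp_size * tp_size + dp_idx * tp_size + tp_idx.
    """
    # Assumption for this sketch only: the diff does not show such a check.
    if group_size <= 0 or dp_size % group_size != 0:
        raise ValueError("group_size must be positive and divide dp_size")

    num_chunks = dp_size // group_size
    group_ranks: List[List[int]] = []
    for pp_idx in range(pp_size):
        for chunk in range(num_chunks):
            for tp_idx in range(tp_size):
                # Take group_size consecutive DP rows at a fixed TP column.
                dp_range = range(chunk * group_size, (chunk + 1) * group_size)
                group_ranks.append([
                    pp_idx * dp_size * tp_size + dp_idx * tp_size + tp_idx
                    for dp_idx in dp_range
                ])
    return group_ranks


groups = build_group_ranks(pp_size=1, dp_size=4, tp_size=2, group_size=2)
print(groups)
```

With one pipeline stage, four DP ranks and two TP ranks, every global rank lands in exactly one group, and each group spans two DP ranks with the same TP index.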
@@ -265,14 +242,58 @@ def init_ascend_model_parallel(parallel_config: ParallelConfig, ):
|
|||||||
_SHARED_WEIGHT = _create_shared_weight_group("flashcomm2_o_shared")
|
_SHARED_WEIGHT = _create_shared_weight_group("flashcomm2_o_shared")
|
||||||
|
|
||||||
|
|
||||||
def get_mlp_tensor_model_parallel_world_size():
|
def model_parallel_initialized():
|
||||||
"""Return world size for the tensor model parallel group."""
|
return (_MC2 is not None)
|
||||||
return get_mlp_tp_group().world_size
|
|
||||||
|
|
||||||
|
|
||||||
def get_mlp_tensor_model_parallel_rank():
|
def get_mc2_group() -> GroupCoordinator:
|
||||||
"""Return world size for the tensor model parallel group."""
|
assert _MC2 is not None, ("mc2 group is not initialized")
|
||||||
return get_mlp_tp_group().rank_in_group
|
return _MC2
|
||||||
|
|
||||||
|
|
||||||
|
def get_mlp_tp_group() -> GroupCoordinator:
|
||||||
|
assert _MLP_TP is not None, ("mlp group is not initialized")
|
||||||
|
return _MLP_TP
|
||||||
|
|
||||||
|
|
||||||
|
def get_otp_group() -> GroupCoordinator:
|
||||||
|
assert _OTP is not None, (
|
||||||
|
"output tensor parallel group is not initialized")
|
||||||
|
return _OTP
|
||||||
|
|
||||||
|
|
||||||
|
def get_lmhead_tp_group() -> GroupCoordinator:
|
||||||
|
assert _LMTP is not None, (
|
||||||
|
"lm head tensor parallel group is not initialized")
|
||||||
|
return _LMTP
|
||||||
|
|
||||||
|
|
||||||
|
def get_embed_tp_group() -> GroupCoordinator:
|
||||||
|
assert _EMBED_TP is not None, ("emtp group is not initialized")
|
||||||
|
return _EMBED_TP
|
||||||
|
|
||||||
|
|
||||||
|
def get_flashcomm2_otp_group() -> GroupCoordinator:
|
||||||
|
return _FLASHCOMM2_OTP
|
||||||
|
|
||||||
|
|
||||||
|
def get_flashcomm2_odp_group() -> GroupCoordinator:
|
||||||
|
assert _FLASHCOMM2_ODP is not None, (
|
||||||
|
"output data parallel group for flashcomm2 is not initialized")
|
||||||
|
return _FLASHCOMM2_ODP
|
||||||
|
|
||||||
|
|
||||||
|
def get_shared_weight_group() -> GroupCoordinator:
|
||||||
|
assert _SHARED_WEIGHT is not None, (
|
||||||
|
"output shared weight parallel group for flashcomm2 is not initialized"
|
||||||
|
)
|
||||||
|
return _SHARED_WEIGHT
|
||||||
|
|
||||||
|
|
||||||
|
def get_p_tp_group() -> GroupCoordinator:
|
||||||
|
assert _P_TP is not None, (
|
||||||
|
"distributed prefill tensor parallel group is not initialized")
|
||||||
|
return _P_TP
|
||||||
|
|
||||||
|
|
||||||
def destroy_ascend_model_parallel():
|
def destroy_ascend_model_parallel():
|
||||||
@@ -291,6 +312,11 @@ def destroy_ascend_model_parallel():
|
|||||||
_LMTP.destroy()
|
_LMTP.destroy()
|
||||||
_LMTP = None
|
_LMTP = None
|
||||||
|
|
||||||
|
global _EMBED_TP
|
||||||
|
if _EMBED_TP:
|
||||||
|
_EMBED_TP.destroy()
|
||||||
|
_EMBED_TP = None
|
||||||
|
|
||||||
global _OTP
|
global _OTP
|
||||||
if _OTP:
|
if _OTP:
|
||||||
_OTP.destroy()
|
_OTP.destroy()
|
||||||
|
|||||||
@@ -118,10 +118,6 @@ env_variables: Dict[str, Callable[[], Any]] = {
|
|||||||
# However, there might be hidden issues, and it is currently recommended to prioritize its use with dense models.
|
# However, there might be hidden issues, and it is currently recommended to prioritize its use with dense models.
|
||||||
"VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE":
|
"VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE":
|
||||||
lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE", '0'))),
|
lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE", '0'))),
|
||||||
# Whether to enable mlp optimize when tensor parallel is enabled.
|
|
||||||
# this feature in eager mode will get better performance.
|
|
||||||
"VLLM_ASCEND_ENABLE_MLP_OPTIMIZE":
|
|
||||||
lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_MLP_OPTIMIZE", '0'))),
|
|
||||||
# Whether to enable msMonitor tool to monitor the performance of vllm-ascend.
|
# Whether to enable msMonitor tool to monitor the performance of vllm-ascend.
|
||||||
"MSMONITOR_USE_DAEMON":
|
"MSMONITOR_USE_DAEMON":
|
||||||
lambda: bool(int(os.getenv("MSMONITOR_USE_DAEMON", '0'))),
|
lambda: bool(int(os.getenv("MSMONITOR_USE_DAEMON", '0'))),
|
||||||
|
|||||||
@@ -30,8 +30,9 @@ from vllm.model_executor.layers.vocab_parallel_embedding import (
|
|||||||
VocabParallelEmbedding, pad_vocab_size)
|
VocabParallelEmbedding, pad_vocab_size)
|
||||||
from vllm.model_executor.utils import set_weight_attrs
|
from vllm.model_executor.utils import set_weight_attrs
|
||||||
|
|
||||||
from vllm_ascend.distributed.parallel_state import get_lmhead_tp_group
|
from vllm_ascend.distributed.parallel_state import (get_embed_tp_group,
|
||||||
from vllm_ascend.utils import lmhead_tp_enable
|
get_lmhead_tp_group)
|
||||||
|
from vllm_ascend.utils import embedding_tp_enable, lmhead_tp_enable
|
||||||
|
|
||||||
|
|
||||||
class AscendVocabParallelEmbedding(VocabParallelEmbedding):
|
class AscendVocabParallelEmbedding(VocabParallelEmbedding):
|
||||||
@@ -50,9 +51,12 @@ class AscendVocabParallelEmbedding(VocabParallelEmbedding):
|
|||||||
quant_config: Optional[QuantizationConfig] = None,
|
quant_config: Optional[QuantizationConfig] = None,
|
||||||
prefix: str = ""):
|
prefix: str = ""):
|
||||||
nn.Module.__init__(self)
|
nn.Module.__init__(self)
|
||||||
|
self.forward_type = None
|
||||||
if lmhead_tp_enable() and prefix.find("head") != -1:
|
if lmhead_tp_enable() and "head" in prefix:
|
||||||
self.comm_group = get_lmhead_tp_group()
|
self.comm_group = get_lmhead_tp_group()
|
||||||
|
elif embedding_tp_enable() and "embed_tokens" in prefix:
|
||||||
|
self.comm_group = get_embed_tp_group()
|
||||||
|
self.forward_type = "embed_tp"
|
||||||
else:
|
else:
|
||||||
self.comm_group = get_tp_group()
|
self.comm_group = get_tp_group()
|
||||||
|
|
||||||
@@ -146,6 +150,28 @@ class AscendVocabParallelEmbedding(VocabParallelEmbedding):
|
|||||||
return input_, ~vocab_mask
|
return input_, ~vocab_mask
|
||||||
|
|
||||||
def forward(self, input_):
|
def forward(self, input_):
|
||||||
|
if self.forward_type == "embed_tp":
|
||||||
|
return self._forward_embed_tp(input_)
|
||||||
|
else:
|
||||||
|
return self._forward_origin(input_)
|
||||||
|
|
||||||
|
def _forward_embed_tp(self, input_):
|
||||||
|
complete_input = self.comm_group.all_gather(input_, dim=0)
|
||||||
|
masked_input, input_mask = self._get_masked_input_and_mask(
|
||||||
|
complete_input, self.shard_indices.org_vocab_start_index,
|
||||||
|
self.shard_indices.org_vocab_end_index,
|
||||||
|
self.shard_indices.num_org_vocab_padding,
|
||||||
|
self.shard_indices.added_vocab_start_index,
|
||||||
|
self.shard_indices.added_vocab_end_index)
|
||||||
|
# Get the embeddings.
|
||||||
|
output_parallel = self.quant_method.embedding(self,
|
||||||
|
masked_input.long())
|
||||||
|
output_parallel.masked_fill_(input_mask.unsqueeze(-1), 0)
|
||||||
|
output = self.comm_group.reduce_scatter(output_parallel, dim=0)
|
||||||
|
output = output.view(input_.shape[0], -1)
|
||||||
|
return output
|
||||||
|
|
||||||
|
def _forward_origin(self, input_):
|
||||||
if self.tp_size > 1:
|
if self.tp_size > 1:
|
||||||
# Build the mask.
|
# Build the mask.
|
||||||
masked_input, input_mask = self._get_masked_input_and_mask(
|
masked_input, input_mask = self._get_masked_input_and_mask(
|
||||||
|
|||||||
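The `_forward_embed_tp` path above gathers token ids from every rank, looks them up in the local vocabulary shard, zeroes rows that belong to other shards, and then reduce-scatters the partial results. A small single-process simulation may help confirm why this reproduces an ordinary embedding lookup. It is a sketch under simplifying assumptions: plain Python lists stand in for tensors, the vocabulary divides evenly across ranks, every rank holds the same number of tokens, and vocabulary padding and added-vocab ranges are ignored. The function names are hypothetical.

```python
from typing import List

Vector = List[float]


def masked_lookup(shard: List[Vector], start: int,
                  tokens: List[int]) -> List[Vector]:
    """Look up tokens in one vocab shard; out-of-shard tokens give zero rows."""
    hidden = len(shard[0])
    end = start + len(shard)
    return [
        list(shard[t - start]) if start <= t < end else [0.0] * hidden
        for t in tokens
    ]


def embed_tp_forward(table: List[Vector],
                     per_rank_tokens: List[List[int]]) -> List[List[Vector]]:
    """Simulate all_gather -> masked lookup -> reduce_scatter for all ranks."""
    group_size = len(per_rank_tokens)
    rows_per_rank = len(table) // group_size
    tokens_per_rank = len(per_rank_tokens[0])

    # all_gather along dim 0: every rank sees the tokens of all ranks.
    complete_input = [t for tokens in per_rank_tokens for t in tokens]

    # Each rank only owns a contiguous slice of the embedding table.
    partials = [
        masked_lookup(table[r * rows_per_rank:(r + 1) * rows_per_rank],
                      r * rows_per_rank, complete_input)
        for r in range(group_size)
    ]

    # reduce_scatter along dim 0: sum the partial outputs element-wise,
    # then hand rank r the rows that correspond to its own tokens.
    summed = [[sum(vals) for vals in zip(*(p[i] for p in partials))]
              for i in range(len(complete_input))]
    return [
        summed[r * tokens_per_rank:(r + 1) * tokens_per_rank]
        for r in range(group_size)
    ]


table = [[float(v), float(v) * 10] for v in range(8)]  # vocab=8, hidden=2
per_rank_tokens = [[5, 0, 3], [7, 2, 2]]               # two ranks
outputs = embed_tp_forward(table, per_rank_tokens)
expected = [[table[t] for t in toks] for toks in per_rank_tokens]
```

Because exactly one shard contributes a non-zero row per token, the sum in the reduce step recovers the full-table embedding, while each rank stores only its slice of the table.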
@@ -715,15 +715,23 @@ def get_ascend_device_type():
|
|||||||
|
|
||||||
|
|
||||||
def lmhead_tp_enable() -> bool:
|
def lmhead_tp_enable() -> bool:
|
||||||
return get_ascend_config().lmhead_tensor_parallel_size is not None
|
return get_ascend_config(
|
||||||
|
).finegrained_tp_config.lmhead_tensor_parallel_size > 0
|
||||||
|
|
||||||
|
|
||||||
|
def embedding_tp_enable() -> bool:
|
||||||
|
return get_ascend_config(
|
||||||
|
).finegrained_tp_config.embedding_tensor_parallel_size > 0
|
||||||
|
|
||||||
|
|
||||||
def oproj_tp_enable() -> bool:
|
def oproj_tp_enable() -> bool:
|
||||||
return get_ascend_config().oproj_tensor_parallel_size is not None
|
return get_ascend_config(
|
||||||
|
).finegrained_tp_config.oproj_tensor_parallel_size > 0
|
||||||
|
|
||||||
|
|
||||||
def mlp_tp_enable() -> bool:
|
def mlp_tp_enable() -> bool:
|
||||||
return envs_ascend.VLLM_ASCEND_ENABLE_MLP_OPTIMIZE
|
return get_ascend_config(
|
||||||
|
).finegrained_tp_config.mlp_tensor_parallel_size > 0
|
||||||
|
|
||||||
|
|
||||||
def matmul_allreduce_enable() -> bool:
|
def matmul_allreduce_enable() -> bool:
|
||||||
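The helpers in this hunk switch from `is not None` checks and an environment variable to `> 0` checks on `finegrained_tp_config`. A rough sketch of that pattern follows. The class `FinegrainedTPSketch`, its default of `0` for every size, and the unknown-key check are assumptions made for illustration; only the four key names and the `additional_config` nesting come from the commit message.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class FinegrainedTPSketch:
    """Hypothetical stand-in for the parsed `finegrained_tp_config`."""
    lmhead_tensor_parallel_size: int = 0     # assumed default: disabled
    oproj_tensor_parallel_size: int = 0
    embedding_tensor_parallel_size: int = 0
    mlp_tensor_parallel_size: int = 0

    @classmethod
    def from_additional_config(
            cls, additional_config: Dict[str, Any]) -> "FinegrainedTPSketch":
        raw = additional_config.get("finegrained_tp_config", {})
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            # Assumption: reject typos instead of silently ignoring them.
            raise ValueError(
                f"unknown finegrained_tp_config keys: {sorted(unknown)}")
        return cls(**raw)

    def enabled(self, module: str) -> bool:
        """Same shape as `embedding_tp_enable()` etc.: size > 0 means on."""
        return getattr(self, f"{module}_tensor_parallel_size") > 0


additional_config = {
    "finegrained_tp_config": {
        "embedding_tensor_parallel_size": 2,
        "lmhead_tensor_parallel_size": 2,
    }
}
cfg = FinegrainedTPSketch.from_additional_config(additional_config)
```

Keeping all four sizes in one nested dict means a single predicate shape covers every module, rather than a mix of `None` checks and environment variables.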
@@ -971,7 +979,7 @@ def get_flashcomm2_config_and_validate(ascend_config, vllm_config):
|
|||||||
logger.warning_once(
|
logger.warning_once(
|
||||||
"It is recommended to enable FLASHCOMM1 simultaneously when starting FLASHCOMM2 for optimal performance."
|
"It is recommended to enable FLASHCOMM1 simultaneously when starting FLASHCOMM2 for optimal performance."
|
||||||
)
|
)
|
||||||
if ascend_config.oproj_tensor_parallel_size is not None:
|
if ascend_config.finegrained_tp_config.oproj_tensor_parallel_size > 0:
|
||||||
raise AssertionError(
|
raise AssertionError(
|
||||||
"flashcomm2_oproj_tensor_parallel_size cannot be enabled simultaneously with oproj_tensor_parallel_size"
|
"flashcomm2_oproj_tensor_parallel_size cannot be enabled simultaneously with oproj_tensor_parallel_size"
|
||||||
)
|
)
|
||||||
|
|||||||