Drop torchair (#4814)
aclgraph is stable and fast now, so let's drop torchair graph mode.
TODO: some logic added to adapt torchair should be cleaned up as well. We'll
do that in a follow-up PR.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
```diff
@@ -251,7 +251,6 @@ This will reproduce the E2E test. See [vllm_ascend_test.yaml](https://github.com
 - Offline test example: [`tests/e2e/singlecard/test_offline_inference.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_offline_inference.py)
 - Online test examples: [`tests/e2e/singlecard/test_prompt_embedding.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_prompt_embedding.py)
 - Correctness test example: [`tests/e2e/singlecard/test_aclgraph.py`](https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_aclgraph.py)
-- Reduced Layer model test example: [test_torchair_graph_mode.py - DeepSeek-V3-Pruning](https://github.com/vllm-project/vllm-ascend/blob/20767a043cccb3764214930d4695e53941de87ec/tests/e2e/multicard/test_torchair_graph_mode.py#L48)

 The CI resource is limited, and you might need to reduce the number of layers of a model. Below is an example of how to generate a reduced-layer model:

 1. Fork the original model repo in ModelScope. All the files in the repo except for the weights are required.
```
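A minimal sketch of the layer-reduction step mentioned above, assuming a Hugging Face-style `config.json` with a `num_hidden_layers` field (the field names and values here are illustrative, not taken from this PR):

```python
import json
import tempfile
from pathlib import Path

# Illustrative config; a real DeepSeek-V3 config.json carries many more fields.
config = {"model_type": "deepseek_v3", "num_hidden_layers": 61, "hidden_size": 7168}

cfg_path = Path(tempfile.mkdtemp()) / "config.json"
cfg_path.write_text(json.dumps(config))

# Shrink the layer count so the pruned model fits the limited CI resources.
cfg = json.loads(cfg_path.read_text())
cfg["num_hidden_layers"] = 2
cfg_path.write_text(json.dumps(cfg, indent=2))

print(json.loads(cfg_path.read_text())["num_hidden_layers"])
```

The remaining weight files would then need to be regenerated to match the reduced depth; this snippet only covers the config edit.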
```diff
@@ -6,7 +6,7 @@ MTP boosts inference performance by parallelizing the prediction of multiple tok
 ## How to Use MTP

 To enable MTP for DeepSeek-V3 models, add the following parameter when starting the service:

---speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}'
+--speculative_config '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}'

 - `num_speculative_tokens`: The number of speculative tokens, which enables the model to predict multiple tokens at once. Defaults to the number in the draft model config if present; otherwise it is required.
 - `disable_padded_drafter_batch`: Disables input padding for speculative decoding. If set to true, speculative input batches may contain sequences of different lengths, which may only be supported by certain attention backends. This currently only affects the MTP method; defaults to false.
```
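Note that the value handed to `--speculative_config` is parsed as JSON, so booleans should be the lowercase JSON `false`/`true` rather than Python's `False`/`True`. A quick sanity check, illustrative and not part of the commit:

```python
import json

# The string passed to --speculative_config must parse as valid JSON.
spec_config = '{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false}'

parsed = json.loads(spec_config)
print(parsed["method"], parsed["num_speculative_tokens"])
```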
````diff
@@ -74,21 +74,18 @@ If the bonus token is accepted, the MTP model performs inference for (num_specul

 ### Method Validation

-- Currently, the spec_decode scenario only supports the ngram, eagle, eagle3, and deepseek_mtp methods. If an incorrect method is passed, the code raises an error to alert the user.
+- Currently, the spec_decode scenario only supports the ngram, eagle, eagle3, and mtp methods. If an incorrect method is passed, the code raises an error to alert the user.

 ```
 def get_spec_decode_method(method,
                            vllm_config,
                            device,
-                           runner,
-                           is_torchair_graph=False):
+                           runner):
     if method == "ngram":
         return NgramProposer(vllm_config, device, runner)
     elif method in ["eagle", "eagle3"]:
         return EagleProposer(vllm_config, device, runner)
-    elif method == 'deepseek_mtp':
-        if is_torchair_graph:
-            return TorchairMtpProposer(vllm_config, device, runner)
+    elif method == 'mtp':
         return MtpProposer(vllm_config, device, runner)
     else:
         raise ValueError("Unknown speculative decoding method: "
                          f"{method}")
 ```
````
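The simplified dispatch can be exercised standalone. The sketch below substitutes stub classes for the real `NgramProposer`/`EagleProposer`/`MtpProposer` (which take a full vLLM config, device, and model runner) to show only the routing behavior once the torchair branch is gone:

```python
# Stubs standing in for the real proposer classes; only routing is shown.
class NgramProposer:
    def __init__(self, vllm_config, device, runner):
        pass

class EagleProposer:
    def __init__(self, vllm_config, device, runner):
        pass

class MtpProposer:
    def __init__(self, vllm_config, device, runner):
        pass

def get_spec_decode_method(method, vllm_config, device, runner):
    # Dispatch on the method name; anything unrecognized is an error.
    if method == "ngram":
        return NgramProposer(vllm_config, device, runner)
    elif method in ["eagle", "eagle3"]:
        return EagleProposer(vllm_config, device, runner)
    elif method == "mtp":
        return MtpProposer(vllm_config, device, runner)
    else:
        raise ValueError(f"Unknown speculative decoding method: {method}")

# In this sketch the old name falls through to the error branch.
assert isinstance(get_spec_decode_method("mtp", None, None, None), MtpProposer)
```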
||||