[MTP] follow custom deepseek modeling changes to support graph mode (#636)
<!-- Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?

The custom deepseek modeling was changed to support graph mode in https://github.com/vllm-project/vllm-ascend/pull/585, so this PR applies the matching changes to the custom deepseek_mtp modeling. It also adds some modifications for k>1 that were not carried over by https://github.com/vllm-project/vllm-ascend/pull/429.

To better maintain the MTP feature in the vllm-ascend repository, I added test cases for graph mode (torchair). They are skipped for now because torchair cannot correctly clean up memory in vllmrunner. I also added cases for MTP quantization weights. They are skipped as well because the test weights are not ready, and I will enable them once the quantized test weights are available.

https://github.com/vllm-project/vllm-ascend/pull/648 did not completely fix the sample change issue (https://github.com/vllm-project/vllm-ascend/issues/660), so I added the relevant changes.

### Does this PR introduce _any_ user-facing change?

You can now use MTP with deepseek v3/r1 float or quant weights in eager mode:

```python
from vllm import LLM

llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    enforce_eager=True,
    trust_remote_code=True,
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

or use MTP with deepseek v3/r1 float or quant weights in graph mode (torchair):

```python
from vllm import LLM

llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    additional_config={
        'enable_graph_mode': True,
    },
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

Notes:

1. k>1 is now supported, so you can set `num_speculative_tokens` > 1 if there is sufficient redundant computing power.
2. MTP is not supported in V1. We will support it when vLLM does, see https://github.com/vllm-project/vllm/issues/13500.
3. If running MTP fails with a `segmentation fault`, you can follow the v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236, file `vllm_ascend/patch/patch_metrics.py`, method `__npu_async_metrics_collector_init__`.

### How was this patch tested?

Tested locally and by CI.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
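The two configurations described in this message differ only in how the execution mode is selected, and k is the only speculative parameter. A minimal sketch of that choice follows, assuming a hypothetical helper named `build_mtp_kwargs` (not part of vLLM or vllm-ascend) that only assembles the keyword arguments shown in the PR description. Setting `VLLM_USE_V1=0` mirrors the CI environment of this commit and is an assumption about how a user would pin the V0 engine locally.

```python
import os
from typing import Any, Dict

# Assumption: pin the V0 engine before vLLM is imported, since MTP is
# reported as unsupported in V1. The CI step in this commit sets the same value.
os.environ["VLLM_USE_V1"] = "0"


def build_mtp_kwargs(num_speculative_tokens: int = 1,
                     graph_mode: bool = False) -> Dict[str, Any]:
    """Hypothetical helper: assemble LLM(...) keyword arguments for MTP.

    The keys and values are copied from the PR description; nothing here
    calls into vLLM itself.
    """
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens (k) must be at least 1")

    kwargs: Dict[str, Any] = {
        "model": "wemaster/deepseek_mtp_main_random_bf16",
        "tensor_parallel_size": 2,
        "speculative_config": {"num_speculative_tokens": num_speculative_tokens},
        "trust_remote_code": True,
        "disable_log_stats": False,
        "gpu_memory_utilization": 0.8,
        "max_model_len": 64,
    }
    if graph_mode:
        # Graph mode (torchair) is enabled through additional_config.
        kwargs["additional_config"] = {"enable_graph_mode": True}
    else:
        # Eager mode is selected with enforce_eager.
        kwargs["enforce_eager"] = True
    return kwargs
```

With vLLM and vllm-ascend installed, the result could be passed as `LLM(**build_mtp_kwargs(2, graph_mode=True))` to try k=2 in graph mode, provided the hardware has the spare compute mentioned in note 1.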
11
.github/workflows/vllm_ascend_test.yaml
vendored
@@ -138,13 +138,18 @@ jobs:
  speculative_tests_changed:
    - "tests/singlecard/spec_decode/**"
    - "tests/multicard/spec_decode_e2e/**"
    - "vllm_ascend/worker/worker.py"
    - "vllm_ascend/worker/model_runner.py"
    - "vllm_ascend/worker/multi_step_runner.py"
    - "vllm_ascend/worker/multi_step_worker.py"
    - "vllm_ascend/patch/patch_rejection_sampler.py"
    - "vllm_ascend/patch/patch_spec_decode_worker.py"
    - "vllm_ascend/patch/patch_multi_step_worker.py"
    - "vllm_ascend/worker/draft_model_runner.py"
    - "vllm_ascend/patch/worker/patch_common/patch_metrics.py"
    - "vllm_ascend/patch/worker/patch_common/patch_spec_decode_worker.py"
    - "vllm_ascend/patch/worker/patch_common/patch_multi_step_worker.py"

- name: Run vllm-project/vllm-ascend Speculative Decode test
  env:
    VLLM_USE_V1: 0
  if: steps.filter_spec_decode.outputs.speculative_tests_changed == 'true'
  run: |
    if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
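The hunk above gates the speculative decode job on whether any changed file matches the `speculative_tests_changed` patterns. A rough sketch of that decision is given below, assuming Python's `fnmatch` as a stand-in for the workflow's real glob matcher (whose exact semantics may differ) and a hypothetical function name chosen to echo the filter output.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Patterns copied from the workflow hunk above.
SPEC_DECODE_PATTERNS = [
    "tests/singlecard/spec_decode/**",
    "tests/multicard/spec_decode_e2e/**",
    "vllm_ascend/worker/worker.py",
    "vllm_ascend/worker/model_runner.py",
    "vllm_ascend/worker/multi_step_runner.py",
    "vllm_ascend/worker/multi_step_worker.py",
    "vllm_ascend/patch/patch_rejection_sampler.py",
    "vllm_ascend/patch/patch_spec_decode_worker.py",
    "vllm_ascend/patch/patch_multi_step_worker.py",
    "vllm_ascend/worker/draft_model_runner.py",
    "vllm_ascend/patch/worker/patch_common/patch_metrics.py",
    "vllm_ascend/patch/worker/patch_common/patch_spec_decode_worker.py",
    "vllm_ascend/patch/worker/patch_common/patch_multi_step_worker.py",
]


def speculative_tests_changed(changed_files: Iterable[str]) -> bool:
    """Hypothetical approximation: True if any changed path matches a pattern.

    fnmatch lets '*' cross '/' boundaries, so '**' here simply means
    'anything under this directory'.
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in SPEC_DECODE_PATTERNS
    )
```

For example, a change touching only `README.md` would leave the speculative decode step skipped, while a change to `vllm_ascend/worker/draft_model_runner.py` would trigger it.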