[MTP] follow custom deepseek modeling changes to support graph mode (#636)

### What this PR does / why we need it?

The custom deepseek modeling was changed to support graph mode in https://github.com/vllm-project/vllm-ascend/pull/585, so this PR applies the same changes to the custom deepseek_mtp modeling.

Some modifications for k>1 were not carried over by https://github.com/vllm-project/vllm-ascend/pull/429; this PR adds them.

To better maintain the MTP feature in the vllm-ascend repository, I added test cases for graph mode (torchair), but they are skipped for now because torchair cannot correctly clean up memory in vllmrunner. I also added cases for MTP quantization weights; they are skipped because the test weights are not ready, and I will enable them once the quantized test weights are available.

https://github.com/vllm-project/vllm-ascend/pull/648 did not completely fix the sample change issue (https://github.com/vllm-project/vllm-ascend/issues/660), so I added the relevant changes.

### Does this PR introduce _any_ user-facing change?

You can now use MTP with DeepSeek V3/R1 float or quantized weights in eager mode:

```python
llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    enforce_eager=True,
    trust_remote_code=True,
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

or in graph mode (torchair):

```python
llm = LLM(
    model="wemaster/deepseek_mtp_main_random_bf16",
    tensor_parallel_size=2,
    speculative_config={
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    additional_config={
        'enable_graph_mode': True,
    },
    disable_log_stats=False,
    gpu_memory_utilization=0.8,
    max_model_len=64,
)
```

Notes:

1. k>1 is now supported, so you can set num_speculative_tokens > 1 if there is sufficient redundant computing power.
2. MTP is not supported in V1; we will support it when vLLM does in https://github.com/vllm-project/vllm/issues/13500.
3. If running MTP fails with a `segmentation fault`, follow the v0.7.3 patch https://github.com/vllm-project/vllm-ascend/pull/236: file `vllm_ascend/patch/patch_metrics.py`, method `__npu_async_metrics_collector_init__`.

### How was this patch tested?

Tested locally and by CI.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
2025-04-28 21:18:53 +08:00
parent be9e3e8545
commit 54c0e63df7
15 changed files with 288 additions and 39 deletions
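
The torchair cache clean-up that this commit adds to `tests/singlecard/spec_decode/e2e/conftest.py` can be tried in isolation. The following is a minimal sketch, not the code from the commit: the `base_dir` parameter, the boolean return value, and the `graph_mode_enabled` helper are hypothetical additions made for illustration so the logic can run against a temporary directory instead of the current working directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def clean_torchair_cache(base_dir: Optional[Path] = None) -> bool:
    """Remove `<base_dir>/.torchair_cache` if it is a directory.

    `base_dir` and the return value are assumptions for this sketch;
    the helper in the commit always looks under Path.cwd() and
    returns nothing.
    """
    cache_path = (base_dir or Path.cwd()) / ".torchair_cache"
    # is_dir() is False for a missing path, so no separate exists() check.
    if cache_path.is_dir():
        shutil.rmtree(cache_path)
        return True
    return False


def graph_mode_enabled(llm_kwargs: Dict[str, Any]) -> bool:
    """Read `additional_config["enable_graph_mode"]`, defaulting to False.

    Hypothetical helper: it mirrors the lookup done inline in
    run_equality_correctness_test, tolerating a missing or None config.
    """
    additional_config = llm_kwargs.get("additional_config") or {}
    return bool(additional_config.get("enable_graph_mode", False))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / ".torchair_cache").mkdir()
        kwargs = {"additional_config": {"enable_graph_mode": True}}
        if graph_mode_enabled(kwargs):
            clean_torchair_cache(base)
        print((base / ".torchair_cache").exists())
```

Normalizing the flag to a `bool` means a missing or `None` `additional_config` simply disables the clean-up, which matches the intent of the inline lookup in the test helper.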
--- a/tests/singlecard/spec_decode/e2e/conftest.py
+++ b/tests/singlecard/spec_decode/e2e/conftest.py
@@ -17,7 +17,9 @@
 # limitations under the License.
 #

+import shutil
 from itertools import cycle
+from pathlib import Path
 from typing import List, Optional, Sequence, Tuple, Union

 import pytest
@@ -177,6 +179,12 @@ def _check_logprobs_when_output_disabled(
        assert spec_pos_logprob_token_id in baseline_pos_logprobs


+def _clean_torchair_cache():
+    cache_path = Path.cwd() / '.torchair_cache'
+    if cache_path.exists() and cache_path.is_dir():
+        shutil.rmtree(cache_path)
+
+
 def run_equality_correctness_test(
        vllm_runner,
        common_llm_kwargs,
@@ -219,10 +227,20 @@ def run_equality_correctness_test(
                                     logprobs=logprobs,
                                     prompt_logprobs=prompt_logprobs)

+    # TODO: torchair graph mode currently requires a clean torchair cache;
+    # if the cache is not removed first, it raises an error.
+    additional_config = common_llm_kwargs.get("additional_config")
+    enable_graph_mode = additional_config.get(
+        "enable_graph_mode") if additional_config else False
+
    with vllm_runner(**org_args) as vllm_model:
+        if enable_graph_mode:
+            _clean_torchair_cache()
        org_outputs = vllm_model.generate_w_logprobs(prompts, sampling_params)

    with vllm_runner(**sd_args) as vllm_model:
+        if enable_graph_mode:
+            _clean_torchair_cache()
        if ensure_all_accepted or expected_acceptance_rate is not None:
            # Force log interval to be 0 to catch all metrics.
            stat_logger = vllm_model.model.llm_engine.stat_loggers[