xc-llm-ascend/vllm_ascend/spec_decode/suffix_proposer.py

import torch
from vllm.config import CUDAGraphMode
from vllm.v1.spec_decode.suffix_decoding import \
    SuffixDecodingProposer as VllmSuffixDecodingProposer

from vllm_ascend.spec_decode.interface import Proposer, SpecDcodeType


class SuffixDecodingProposer(VllmSuffixDecodingProposer, Proposer):

    def __init__(self, vllm_config, device, runner):
        super().__init__(vllm_config)
        self.name = SpecDcodeType.SUFFIX
        self.device = device
        self.runner = runner

    def load_model(self, *args, **kwargs):
        # No model to load.
        pass

    @torch.inference_mode()
    def dummy_run(self,
                  num_tokens,
                  with_prefill=None,
                  in_graph_capturing=None,
                  num_reqs=None,
                  num_tokens_across_dp=None,
                  aclgraph_runtime_mode: CUDAGraphMode = CUDAGraphMode.NONE,
                  batch_descriptor=None,
                  dummy_compute_logits=lambda hidden_states: None,
                  is_profile=False):
        # No draft model to run, so there is nothing to warm up or capture.
        pass

    def generate_token_ids(self,
                           valid_sampled_token_ids,
                           sampling_metadata=None,
                           scheduler_output=None,
                           spec_decode_metadata=None,
                           positions=None,
                           num_scheduled_tokens=None,
                           hidden_states=None,
                           aux_hidden_states=None) -> list[list[int]]:
        # Only the input batch and the sampled tokens are used here; the
        # other parameters are ignored.
        draft_token_ids = self.propose(self.runner.input_batch,
                                       valid_sampled_token_ids)
        return draft_token_ids

[Feature] Integrate Suffix Spec Decoding (#4045)

### What this PR does / why we need it?

This PR integrates suffix decoding (https://arxiv.org/abs/2411.04975) from vLLM (https://github.com/vllm-project/vllm/pull/25784).

Suffix Decoding is a dynamic n-gram matching method that:

1. Uses suffix trees to generate speculative tokens quickly using branch frequency counts.
2. Can keep a history of prior model responses, which tends to work very well with repetitive agentic use cases.
3. Can be dynamically updated with newly generated tokens, with FIFO eviction of older requests.

### Does this PR introduce _any_ user-facing change?

This feature is opt-in and does not affect users who do not enable suffix speculative decoding. Users who wish to enable it must first install arctic-inference: `pip install arctic-inference`

After installation, suffix speculative decoding can be enabled with the following speculative config: `--speculative_config '{"method": "suffix", "num_speculative_tokens": 5}'`

### How was this patch tested?

This PR is currently being tested on vLLM main: https://github.com/vllm-project/vllm/commit/83f478bb19489b41e9d208b47b4bb5a95ac171ac with PR https://github.com/vllm-project/vllm/pull/25784

In our previous testing, suffix decoding achieved a 13%-30% throughput improvement over n-gram on the sonnet dataset, tested on vllm-ascend v0.9.1 with concurrency ranging from 2 to 40.

- vLLM version: v0.11.2

Signed-off-by: fluctlux <38945811+fluctlux@users.noreply.github.com>
2025-12-01 18:41:42 +08:00
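
The frequency-count idea in the commit description above can be illustrated with a toy sketch. This is not the arctic-inference or vLLM implementation and it does not build a suffix tree: it swaps in a plain dictionary of n-gram continuation counts. The class `ToySuffixDrafter`, its `update` method, and the `max_context` parameter are hypothetical names chosen for illustration only.

```python
from collections import Counter, defaultdict


class ToySuffixDrafter:
    """Hypothetical toy drafter: counts which token follows each short context."""

    def __init__(self, max_context: int = 3):
        self.max_context = max_context
        # Maps a context tuple to a Counter of tokens seen right after it.
        self.next_counts: dict[tuple[int, ...], Counter] = defaultdict(Counter)

    def update(self, tokens: list[int]) -> None:
        """Record every (context -> next token) pair from a finished response."""
        for i in range(1, len(tokens)):
            for n in range(1, self.max_context + 1):
                if i - n < 0:
                    break
                self.next_counts[tuple(tokens[i - n:i])][tokens[i]] += 1

    def propose(self, tokens: list[int],
                num_speculative_tokens: int) -> list[int]:
        """Greedily extend `tokens`, preferring the longest matching context."""
        history = list(tokens)
        draft: list[int] = []
        for _ in range(num_speculative_tokens):
            next_token = None
            for n in range(min(self.max_context, len(history)), 0, -1):
                counts = self.next_counts.get(tuple(history[-n:]))
                if counts:
                    # Pick the most frequently observed continuation.
                    next_token = counts.most_common(1)[0][0]
                    break
            if next_token is None:
                break  # No context matched, so stop drafting early.
            draft.append(next_token)
            history.append(next_token)
        return draft


drafter = ToySuffixDrafter()
for response in ([1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 5]):
    drafter.update(response)
print(drafter.propose([9, 1, 2], num_speculative_tokens=3))  # [3, 4, 5]
```

In this toy version, token 5 wins over token 6 after the context `2, 3, 4` because it was seen twice rather than once, which is the branch-frequency idea in miniature. A real suffix tree avoids storing every context separately and supports eviction of old requests, both of which this sketch leaves out.
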
			`from vllm.config import CUDAGraphMode`
			`from vllm.v1.spec_decode.suffix_decoding import \`
			`SuffixDecodingProposer as VllmSuffixDecodingProposer`

			`from vllm_ascend.spec_decode.interface import Proposer, SpecDcodeType`


			`class SuffixDecodingProposer(VllmSuffixDecodingProposer, Proposer):`

			`def __init__(self, vllm_config, device, runner):`
			`super().__init__(vllm_config)`
			`self.name = SpecDcodeType.SUFFIX`
			`self.device = device`
			`self.runner = runner`

			`def load_model(self, args, *kwargs):`
			`# No model to load.`
			`pass`

			`@torch.inference_mode()`
			`def dummy_run(self,`
			`num_tokens,`
			`with_prefill=None,`

[Bugfix] Fix the attn_metadata is None (#5038)

### What this PR does / why we need it?

Fixes the error "TypeError: 'NoneType' object is not iterable" raised in vllm_ascend/compilation/acl_graph.py. The cause is that attn_metadata is None in the dummy_run of MTP.

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9

Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
2025-12-16 09:14:05 +08:00

[Bugfix] Fix in_profile_run in mtp_proposer dummy_run (#5165)

### What this PR does / why we need it?

This PR fixes the failure of `enable_force_load_balance` caused by the missing `in_profile_run` in the `dummy_run` of mtp_proposer.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

By CI.

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9

Signed-off-by: Zetong Li <slippersss@126.com>
2025-12-18 22:27:47 +08:00