[Attention][Kernel]moe support for llama4 and mllama4 (#740)

### What this PR does / why we need it? moe support for llama4 and mllama4 in vllm-ascend ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? start sever: python -m vllm.entrypoints.openai.api_server --model /data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct \ --max-num-seqs=256 \ --max-model-len=8192 \ --tensor-parallel-size=8 \ --block-size=128 \ --dtype bfloat16 \ --host=0.0.0.0 \ --port=8000 \ --gpu-memory-utilization=0.9 \ --trust-remote-code client: python online_server.py --model-path /data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct --image-path /data/nfs/w60040464/cherry_blossom.jpg --docker-ip 7.242.108.253 --served-port 8000 --text "what is the content of this image?" result: {'id': 'chatcmpl-2b709a5d2e1a4017991ec4ba8248686a', 'object': 'chat.completion', 'created': 1747056823, 'model': '/data/nfs/benchmark/tokenizer/Llama-4-Scout-17B-16E-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'reasoning_content': None, 'content': 'The image depicts a tower, likely Tokyo Skytree, framed by branches of a cherry blossom tree. The tower is white and has a distinctive shape, with a large sphere at the top and a long, thin spire extending from it. The branches of the cherry blossom tree are in the foreground, with pink flowers blooming on them. The background is a clear blue sky.\n\n**Key Features:**\n\n* **Tower:** White, spherical shape at the top, long thin spire\n', 'tool_calls': []}, 'logprobs': None, 'finish_reason': 'length', 'stop_reason': None}], 'usage': {'prompt_tokens': 2340, 'total_tokens': 2440, 'completion_tokens': 100, 'prompt_tokens_details': None}, 'prompt_logprobs': None} Signed-off-by: chenxu <chenxu68@huawei.com> Co-authored-by: chenxu <chenxu68@huawei.com> Co-authored-by: evian <eviantai@u.nus.edu>
2025-05-13 19:12:40 +08:00
parent 217211d8a3
commit e564470338
4 changed files with 34 additions and 12 deletions
--- a/vllm_ascend/attention/attention.py
+++ b/vllm_ascend/attention/attention.py
@@ -708,6 +708,7 @@ class AscendAttentionBackendImpl(AttentionImpl):
        blocksparse_params: Optional[Dict[str, Any]] = None,
        logits_soft_cap: Optional[float] = None,
        attn_type: str = AttentionType.DECODER,
+        use_irope: bool = False,
    ) -> None:
        self.num_heads = num_heads
        self.head_size = head_size
--- a/vllm_ascend/attention/attention_v1.py
+++ b/vllm_ascend/attention/attention_v1.py
@@ -174,6 +174,7 @@ class AscendAttentionBackendImpl(AttentionImpl):
        blocksparse_params: Optional[Dict[str, Any]] = None,
        logits_soft_cap: Optional[float] = None,
        attn_type: str = AttentionType.DECODER,
+        use_irope: bool = False,
    ) -> None:
        self.num_heads = num_heads
        self.head_size = head_size