Files
xc-llm-ascend/vllm_ascend/worker
drslark 5666ce03f5 [bugfix] Fixed an accuracy problem of gdn layer in graph (#6822)
### What this PR does / why we need it?

There will be random ouputs if we run model with GDN attention in graph
mode:

```python
prompts = [
    "1. Who are you?",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
            tensor_parallel_size=4,

            distributed_executor_backend="mp",
            gpu_memory_utilization=0.7,
            speculative_config={
                "method": "qwen3_next_mtp",
                "num_speculative_tokens": 3,
            },
            
            compilation_config={
                "cudagraph_mode": "FULL_DECODE_ONLY",
                "cudagraph_capture_sizes": [8],
            },
            
            max_model_len=4096, 
            enable_prefix_caching=False)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{output.prompt_token_ids=}")
    print(f"{output.outputs[0].token_ids=}")
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Before appling this change, the outputs was:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 323, 279, 1112, 279]
Prompt: '1. Who are you?', Generated text: ' What and the... the'
```

After applying this change, the output is:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 374, 697, 829, 30]
Prompt: '1. Who are you?', Generated text: ' What is your name?'
```

**Why does this change sovle the problem?**

Now, `query_start_loc` is padded because of `fia`.

But, for `gdn-attention`, padded version of `query_start_loc` will cause
accuracy problem.

So, we need an unpadded version of `query_start_loc` named
`gdn_query_start_loc` and use it in `gdn-attention`, it works fine.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

As described aboved.

- vLLM version: v0.15.0
- vLLM main:
83b47f67b1

Signed-off-by: drslark <slarksblood@qq.com>
2026-02-28 08:57:53 +08:00
..