xc-llm-ascend


drslark 5666ce03f5 [bugfix] Fixed an accuracy problem of gdn layer in graph (#6822)

### What this PR does / why we need it?

Models with GDN attention produce random outputs when run in graph
mode:

```python
from vllm import LLM, SamplingParams

prompts = [
    "1. Who are you?",
]
# Greedy decoding (temperature=0.0) keeps the output deterministic,
# so runs before and after the fix can be compared directly.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=5)
llm = LLM(
    model="/home/model/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    gpu_memory_utilization=0.7,
    # MTP speculative decoding: 3 draft tokens per step.
    speculative_config={
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": 3,
    },
    # Graph mode for decode only; this is where the problem shows up.
    compilation_config={
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": [8],
    },
    max_model_len=4096,
    enable_prefix_caching=False,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"{output.prompt_token_ids=}")
    print(f"{output.outputs[0].token_ids=}")
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Before applying this change, the output was:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 323, 279, 1112, 279]
Prompt: '1. Who are you?', Generated text: ' What and the... the'
```

After applying this change, the output is:

```text
output.prompt_token_ids=[16, 13, 10479, 525, 498, 30]
output.outputs[0].token_ids=[3555, 374, 697, 829, 30]
Prompt: '1. Who are you?', Generated text: ' What is your name?'
```

**Why does this change solve the problem?**

Currently, `query_start_loc` is padded because of `fia`.

For `gdn-attention`, however, the padded version of `query_start_loc`
causes an accuracy problem.

So this PR adds an unpadded version of `query_start_loc`, named
`gdn_query_start_loc`, and uses it in `gdn-attention`. With it, the
outputs are correct.
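
To make the difference concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names (`build_query_start_loc`, `pad_query_start_loc`, `seq_lens_from`), the padding scheme, and the numbers are all assumptions chosen for illustration. It only shows why a consumer that derives per-request lengths from the offsets can be misled by padded entries.

```python
from itertools import accumulate


def build_query_start_loc(query_lens):
    """Cumulative offsets: entry i is where request i's tokens start."""
    return [0] + list(accumulate(query_lens))


def pad_query_start_loc(query_start_loc, padded_num_tokens, tokens_per_req):
    """Hypothetical padding scheme (an assumption, not the real one):
    append phantom requests until the offsets cover padded_num_tokens."""
    padded = list(query_start_loc)
    while padded[-1] < padded_num_tokens:
        padded.append(min(padded[-1] + tokens_per_req, padded_num_tokens))
    return padded


def seq_lens_from(query_start_loc):
    """Per-request token counts as seen by a consumer of the offsets."""
    return [end - start for start, end in zip(query_start_loc, query_start_loc[1:])]


# Illustrative numbers: one real request carrying 4 tokens,
# padded up to an assumed graph size of 8 tokens.
query_lens = [4]
gdn_query_start_loc = build_query_start_loc(query_lens)        # unpadded
query_start_loc = pad_query_start_loc(gdn_query_start_loc, 8, 4)  # padded

print(seq_lens_from(gdn_query_start_loc))  # only the real request
print(seq_lens_from(query_start_loc))      # real request plus a phantom one
```

In this sketch the padded offsets describe a second request that does not exist. A layer that keeps per-request state would plausibly process that phantom request as real data, whereas the unpadded offsets describe only the real request.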

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

As described above.

- vLLM version: v0.15.0
- vLLM main: 83b47f67b1

Signed-off-by: drslark <slarksblood@qq.com>

2026-02-28 08:57:53 +08:00

[Feature] adapt to uva buffer and main2main (#6657)

2026-02-12 10:36:31 +08:00

__init__.py

[Misc][V0 Deprecation] Remove Cache Engine Used for V0 Worker (#1878)

2025-07-19 09:42:32 +08:00

block_table.py

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #7) (#6023)

2026-02-06 14:56:53 +08:00

model_runner_v1.py

[bugfix] Fixed an accuracy problem of gdn layer in graph (#6822)

2026-02-28 08:57:53 +08:00