xc-llm-ascend

Files

Song Zhixin 6995a7bc5b [Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)

### What this PR does / why we need it?
When copying the sampled valid token ids from device to host, avoid
calling `tolist` directly on a device tensor, since that triggers a
device-wide stream sync. Instead, use a non-blocking copy followed by an
explicit event sync, so that only the copy itself is waited on.
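
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `TokenIdCopier` and its parameters are hypothetical names, and `torch.cuda` is used as a stand-in for the device API. On Ascend the equivalent event API would presumably come from `torch_npu`, which is an assumption here. The sketch falls back to a plain copy when no device is available.

```python
import torch


class TokenIdCopier:
    """Hypothetical helper: device-to-host copy of sampled token ids
    that waits on a single event instead of a device-wide sync."""

    def __init__(self, max_rows: int, num_cols: int = 1) -> None:
        # Assumption: torch.cuda stands in for the NPU device API.
        self.use_device = torch.cuda.is_available()
        # Pinned host memory is what allows the device-to-host copy
        # to be issued asynchronously.
        self.host_buf = torch.empty(
            (max_rows, num_cols),
            dtype=torch.int64,
            pin_memory=self.use_device,
        )
        self.event = torch.cuda.Event() if self.use_device else None

    def to_list(self, token_ids: torch.Tensor) -> list:
        # Reuse a row slice of the preallocated host buffer.
        host = self.host_buf[: token_ids.shape[0]]
        # Enqueue the copy on the current stream without blocking the host.
        host.copy_(token_ids, non_blocking=True)
        if self.event is not None and token_ids.is_cuda:
            # Record an event right after the copy and wait only on it,
            # leaving work on other streams undisturbed.
            self.event.record()
            self.event.synchronize()
        # The data is now on the host, so tolist() needs no device sync.
        return host.tolist()


copier = TokenIdCopier(max_rows=8)
ids = torch.tensor([[11], [22], [33]])
print(copier.to_list(ids))
```

Calling `tolist()` on the host-side buffer is safe only after the event has completed; skipping the `synchronize()` call would risk reading stale data from the pinned buffer.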

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Bring up the vLLM server:
```bash
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-14B-Instruct --disable-log-requests -tp 8 --max-num-seqs 64 --no-enable-prefix-caching --max_num_batched_tokens=8000
```
## Before

![76218085a0cde9b2a73214e35fb7fc08](https://github.com/user-attachments/assets/38cbd02d-d380-47f8-a111-4bd859102eb1)
## After

![6c2111136673332244d3ce11060f4048](https://github.com/user-attachments/assets/957f9bf1-ec50-4f49-9318-f4876b3e3691)

As the two screenshots show, TTFT decreased after the change.


- vLLM version: v0.10.2
- vLLM main: 9607d5eb44

---------

Signed-off-by: jesse <szxfml@gmail.com>

2025-09-24 11:21:58 +08:00

test_input_batch.py

[New model] Qwen3-next support (#2917)

2025-09-16 01:17:42 +08:00

test_model_runner_v1.py

[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)

2025-09-24 11:21:58 +08:00

test_worker_v1.py

[BugFix] Async scheduling and PP compatibility with DP (#2796)

2025-09-19 11:29:50 +08:00