[V1][PP] Support pp with ray backend in V1 (#1800)

### What this PR does / why we need it?
Support pipeline parallelism (PP) with the Ray backend in V1Engine. Fixes #1751

### Does this PR introduce _any_ user-facing change?
Users can now specify Ray as the distributed backend when running inference with PP.

### How was this patch tested?
CI passed with the newly added test.

- vLLM version: v0.9.2
- vLLM main: 32142b3c62

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
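
A minimal sketch of the user-facing change, not taken from this PR: it assumes the standard vLLM `LLM` entry point with its `pipeline_parallel_size` and `distributed_executor_backend` arguments. The helper names `build_pp_ray_engine_kwargs` and `run_demo` are hypothetical, the model name is an arbitrary example, and a running Ray environment with at least `pp_size` devices is assumed.

```python
def build_pp_ray_engine_kwargs(model: str, pp_size: int = 2) -> dict:
    """Collect engine arguments for pipeline parallelism on the Ray backend.

    Hypothetical helper for illustration; it only assembles keyword
    arguments and does not import or start vLLM.
    """
    if pp_size < 1:
        raise ValueError("pp_size must be at least 1")
    return {
        "model": model,
        # Number of pipeline stages the model is split into.
        "pipeline_parallel_size": pp_size,
        # Select Ray as the distributed executor backend.
        "distributed_executor_backend": "ray",
    }


def run_demo() -> None:
    """Start an engine with the arguments above and generate one completion.

    Assumes vLLM (with vllm-ascend) and Ray are installed and that enough
    devices are available; the import is local so that the helper above
    stays usable without vLLM present.
    """
    from vllm import LLM, SamplingParams

    # The model name is an arbitrary example, not one named in this PR.
    llm = LLM(**build_pp_ray_engine_kwargs("Qwen/Qwen2.5-0.5B-Instruct"))
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)
```

Calling `run_demo()` on a suitable machine would exercise the PP plus Ray path. Keeping the argument assembly separate from engine start-up makes the configuration checkable without hardware.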
2025-07-23 14:52:52 +08:00
parent 9a3bdf2162
commit 3aa3b46bfe
5 changed files with 32 additions and 18 deletions
--- a/vllm_ascend/attention/attention_v1.py
+++ b/vllm_ascend/attention/attention_v1.py
@@ -396,8 +396,10 @@ class AscendAttentionBackendImpl(AttentionImpl):
                if self.head_size == 192:
                    cu_seqlen_q = [0] + attn_metadata.query_lens.tolist()
                    cu_seqlen_k = [0] + attn_metadata.seq_lens.tolist()
-                    cu_seqlen_q = torch.tensor(cu_seqlen_q, device="npu")
-                    cu_seqlen_k = torch.tensor(cu_seqlen_k, device="npu")
+                    cu_seqlen_q = torch.tensor(cu_seqlen_q,
+                                               device=query.device)
+                    cu_seqlen_k = torch.tensor(cu_seqlen_k,
+                                               device=query.device)
                    cu_seqlen_q = torch.cumsum(cu_seqlen_q, dim=0)
                    cu_seqlen_k = torch.cumsum(cu_seqlen_k, dim=0)
                    max_seqlen_q = torch.max(attn_metadata.query_lens)