xc-llm-ascend/vllm_ascend at c40a387f63bdc451c198bfa036c723c162545278

EngineX/xc-llm-ascend

zouyida2052 c40a387f63 [bugfix]fix extra npu context in device 0 (#8041 )

### What this PR does / why we need it?
When we launch a PD-disaggregated process and send requests, an
additional process appears on NPU 0. This happens because, when a
thread has a primary CUDA context, a child thread it creates does not
automatically inherit that context. See
https://forums.developer.nvidia.com/t/when-a-thread-has-a-primary-cuda-context-does-the-child-thread-it-creates-automatically-inherit-the-cuda-context/362810.
vLLM has fixed this issue in
[pr-37449](https://github.com/vllm-project/vllm/pull/37449), but version
0.18.0 does not include the fix. Therefore, we need to patch it.
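
The behaviour described above can be illustrated with a minimal, pure-Python sketch. It is not the code from pr-37449 or from this patch: `current_device`, `set_device`, and `spawn_with_parent_device` are hypothetical stand-ins that simulate a per-thread "current device" with `threading.local`, assuming that a thread which never selects a device falls back to device 0.

```python
import threading

# Hypothetical stand-in for a runtime's per-thread "current device".
# Assumption: a thread that never selects a device falls back to device 0.
_state = threading.local()
DEFAULT_DEVICE = 0


def current_device() -> int:
    """Return the device selected in the calling thread (0 if none was set)."""
    return getattr(_state, "device", DEFAULT_DEVICE)


def set_device(index: int) -> None:
    """Select a device for the calling thread only."""
    _state.device = index


def spawn_with_parent_device(target, *args) -> threading.Thread:
    """Start a thread that first re-selects the parent's device."""
    parent_device = current_device()  # read in the parent thread

    def runner():
        set_device(parent_device)  # runs in the child thread
        target(*args)

    thread = threading.Thread(target=runner)
    thread.start()
    return thread


def demo() -> dict:
    """Compare a plain child thread with one that re-selects the device."""
    set_device(3)  # the parent works on device 3
    seen = {}

    def record(key):
        seen[key] = current_device()

    plain = threading.Thread(target=record, args=("plain",))
    plain.start()
    plain.join()
    spawn_with_parent_device(record, "patched").join()
    return seen
```

In this simulation the plain child thread reports device 0 while the wrapped one reports the parent's device, which mirrors the extra context showing up on NPU 0. A real fix would call the device runtime's own set-device function inside the child thread instead of these stand-ins.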

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

---------

Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>

2026-04-08 23:35:52 +08:00

..

[310P]: add torch chunk gated delta rule and 910b parity ut (#7594 )

2026-03-25 16:46:43 +08:00

_cann_ops_custom

[Kernel] add custom op GmmSwigluQuantWeightNzTensorList (#3804 )

2025-11-28 18:06:39 +08:00

[v0.18.0]feat(quant): add C8 INT8 KV cache support for GQA attention models (#7474 ) (#8007 )

2026-04-08 10:51:58 +08:00

[Feat][SP] Suport SP for VL MoE models (#7044 )

2026-03-24 17:16:00 +08:00

[v0.18.0][Misc] Recompute scheduler upgrade to vLLM 0.18.0 (#7720 )

2026-03-27 18:24:53 +08:00

[A5][bugfix] Fix fused MoE A5 MXFP8 scale normalization, load-balance routing and gating_topk ops (#7573 )

2026-03-25 17:20:28 +08:00

device_allocator

[Lint]Style: Convert vllm-ascend/ to ruff format(Batch #2 ) (#5977 )

2026-01-19 08:59:46 +08:00

[v0.18.0][BugFix][P/D]Fix layerwise connector out of memory during large buffer transfer (#7752 )

2026-03-31 22:16:53 +08:00

[V0.18.0][EPLB][BugFix] Fix moe_load precision in allgather (#7890 )

2026-04-02 09:20:31 +08:00

upgrade to 0.18.0 (#7502 )

2026-03-21 16:05:38 +08:00

[Bugfix][LoRA] Fix the bug when runs Qwen3-Reranker-0.6B with LoRA. (#7156 )

2026-03-15 17:55:42 +08:00

[ModelLoader][Feature] Add rfork support for fast model loading (#7392 )

2026-03-25 16:40:30 +08:00

[BugFix] fix qwen3-next compilation error (#7977 )

2026-04-03 20:03:39 +08:00

[bugfix]fix extra npu context in device 0 (#8041 )

2026-04-08 23:35:52 +08:00

[v0.18.0]feat(quant): add C8 INT8 KV cache support for GQA attention models (#7474 ) (#8007 )

2026-04-08 10:51:58 +08:00

[releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties (#7794 )

2026-03-31 19:01:51 +08:00

[v0.18.0][Bugfix][EAGLE] Fix FIA pad bug under max concurrency (#7754 )

2026-03-29 12:23:44 +08:00

[BugFix][0.18.0][KV Pool] Fix KV Pool not putting kv cache for vllm v0.18.0 (#7874 )

2026-04-02 10:57:09 +08:00

Main2main upgrade to vllm 0317 afternoon (#7409 )

2026-03-18 23:24:27 +08:00

__init__.py

[ModelLoader][Feature] Add rfork support for fast model loading (#7392 )

2026-03-25 16:40:30 +08:00

ascend_config.py

[feat] support dispatch_v2/combine_v2 hierarchy communication (#7698 )

2026-03-27 09:20:16 +08:00

ascend_forward_context.py

[Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp (#7297 )

2026-03-23 17:05:02 +08:00

batch_invariant.py

[CI] Add pre-commit check for patch logger (#7446 )

2026-03-19 16:53:20 +08:00

cpu_binding.py

[CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (#6945 )

2026-03-03 17:20:52 +08:00

envs.py

[Misc] Drop Prefetch MLP Env (#7357 )

2026-03-19 14:27:27 +08:00

flash_common3_context.py

[Lint]Style: Convert vllm-ascend/compilation to ruff format (#5912 )

2026-01-16 20:57:46 +08:00

meta_registration.py

[Ops][Refactor] Remove custom rotary_embedding operator (#6523 )

2026-02-07 09:24:05 +08:00

platform.py

[releases/v0.18.0][BugFix] Fix server init error when set max_num_seqs not a multiple of tp while FLASHCOMM is on (#7832 )

2026-03-30 20:24:52 +08:00

profiling_config.py

[Core][Misc] Clean up ProfileExecuteDuration (#6461 )

2026-02-01 20:06:01 +08:00

utils.py

[Performance]Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder (#7737 )

2026-03-31 14:49:29 +08:00