xc-llm-ascend


clrs97 cd1ffbb6cd [1/N][Feat] Cut down memory usage for o_proj in DeepSeek (#2931)

### What this PR does / why we need it?
To cut down the memory usage of large weight matrices, we usually choose
among several linear implementations, each with a drawback:
- `ReplicatedLinear`: Stores the entire matrix on every device, consuming
excessive memory.
- `RowParallelLinear`: Requires an `all_reduce` to merge partial results,
introducing additional communication overhead and potential accuracy
loss. Each token is processed across multiple devices rather than on a
single device, which is undesirable in sequence parallelism (SP) scenarios.
- ...
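
To illustrate why `RowParallelLinear` needs an `all_reduce`, here is a minimal single-process sketch in plain Python. It is not vLLM code: `matmul` and `row_parallel_matmul` are hypothetical helpers, the "ranks" are simulated in a loop, and an element-wise sum stands in for the collective.

```python
def matmul(x, w):
    """Multiply a length-K row vector by a K x N matrix (list of rows)."""
    return [sum(x[k] * w[k][n] for k in range(len(x))) for n in range(len(w[0]))]


def row_parallel_matmul(x, w, world_size):
    """Simulate a row-parallel linear layer on `world_size` ranks.

    The K dimension is sharded: each simulated rank sees only a slice of
    the input and the matching rows of the weight, so it can only produce
    a partial output.
    """
    k = len(x)
    assert k % world_size == 0, "sketch assumes K divides evenly"
    shard = k // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        partials.append(matmul(x[lo:hi], w[lo:hi]))
    # Summing the partial outputs stands in for all_reduce(SUM).
    return [sum(vals) for vals in zip(*partials)]


x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
full = matmul(x, w)
sharded = row_parallel_matmul(x, w, world_size=2)
```

No simulated rank holds the full result until the partial outputs are summed, which is the extra communication step described above. With low-precision dtypes, the summation order can also change rounding.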

Furthermore, in multi-way Data Parallelism (DP) configurations, layers
typically store redundant weight copies.

This PR introduces a shared-weight plugin for layers inheriting from
`LinearBase`. It offers the following advantages:
- It evenly distributes a set of layers with identical structures across
devices. Each layer keeps its complete, unsharded weights, and redundant
copies are eliminated.
- It supports asynchronous broadcasting to prefetch weights for upcoming
layers.
- It preserves the custom `process_weights_after_loading()` method, which
makes it possible to keep weights in NZ format.
- It is compatible with any linear class that inherits from
`LinearBase`, thereby preserving all the features of the original linear
implementation.
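
The layout described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the plugin's implementation: round-robin ownership, the helper names (`owner_rank`, `layers_per_rank`, `prefetch_schedule`), and the `depth` parameter are all hypothetical, and the schedule only records which broadcasts would be started rather than issuing any communication.

```python
def owner_rank(layer_idx, world_size):
    """Assumed policy: layer i's full weight lives on rank i % world_size."""
    return layer_idx % world_size


def layers_per_rank(num_layers, world_size):
    """Count how many full layer weights each rank would store."""
    counts = [0] * world_size
    for layer_idx in range(num_layers):
        counts[owner_rank(layer_idx, world_size)] += 1
    return counts


def prefetch_schedule(num_layers, world_size, depth=1):
    """For each layer being computed, list the (layer, source_rank)
    broadcasts that would be started asynchronously for upcoming layers."""
    schedule = []
    for current in range(num_layers):
        last = min(current + 1 + depth, num_layers)
        upcoming = [(nxt, owner_rank(nxt, world_size))
                    for nxt in range(current + 1, last)]
        schedule.append((current, upcoming))
    return schedule


counts = layers_per_rank(num_layers=8, world_size=4)
plan = prefetch_schedule(num_layers=3, world_size=2)
```

Under this assumed policy, each rank stores roughly `num_layers / world_size` full weights instead of all of them, and the weight for the next layer is already in flight while the current layer is being computed.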

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM main: f4a948f33f

- vLLM version: v0.10.2
- vLLM main: f225ea7dd9

---------

Signed-off-by: clrs97 <524936896@qq.com>
Co-authored-by: CalvinXKY <kyxiezju@163.com>

2025-09-24 17:16:41 +08:00

attention

[Refactor] [SP] The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085)

2025-09-24 11:29:59 +08:00

compilation

[Refactor][Graph] Move graph parameter logic to acl_graph module (#3101)

2025-09-22 22:23:14 +08:00

core

[CORE] concurrent partial prefills (#2372)

2025-09-24 17:12:55 +08:00

device_allocator

[Misc] Clean up useless import from vllm (#2049)