xc-llm-ascend/vllm_ascend
clrs97 cd1ffbb6cd [1/N][Feat] Cut down memory usage for o_proj in DeepSeek (#2931)
### What this PR does / why we need it?
To cut down the memory usage of large weight matrices, we often rely on
various linear operations:
- `ReplicatedLinear`: Stores the entire matrix, consuming excessive
memory.
- `RowParallelLinear`: Requires an `all_reduce` to merge the partial
results, introducing additional communication overhead and potential
accuracy loss. Each token is processed across multiple devices rather
than on a single device, which is undesirable in SP scenarios.
- ...
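
To illustrate why `RowParallelLinear` needs an `all_reduce`, here is a minimal pure-Python sketch. It is not the vLLM implementation: `matvec` and `row_parallel_matvec` are hypothetical helper names, and the final summation only stands in for a real `all_reduce`. The weight is split along its input dimension, so each rank produces only a partial output for every token.

```python
def matvec(x, w):
    """Compute y = x @ w for a vector x (length K) and a K x N matrix w."""
    return [sum(x[k] * w[k][n] for k in range(len(x))) for n in range(len(w[0]))]


def row_parallel_matvec(x, w, world_size):
    """Simulate a row-parallel linear layer on `world_size` ranks.

    Assumption: K is divisible by world_size. Each simulated rank holds a
    contiguous slice of the K rows of w and the matching slice of x.
    """
    k = len(x)
    assert k % world_size == 0, "input dimension must split evenly"
    shard = k // world_size

    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each rank sees only its slice, so its output is a partial sum.
        partials.append(matvec(x[lo:hi], w[lo:hi]))

    # Stand-in for all_reduce(sum): every rank would need this combined result.
    return [sum(p[n] for p in partials) for n in range(len(w[0]))]


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]
full = matvec(x, w)
sharded = row_parallel_matvec(x, w, world_size=2)
```

With integer inputs the two results match exactly. With floating-point weights, the changed summation order can produce slightly different values, which is one plausible source of the accuracy loss mentioned above.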

Furthermore, in multi-way Data Parallelism (DP) configurations, layers
typically store redundant weight copies.

This PR introduces a shared-weight plugin for layers inheriting from
`LinearBase`. It offers the following advantages:
- It evenly distributes a set of layers with identical structures across
devices. Each layer retains its complete weights, eliminating redundant
memory usage.
- It supports asynchronous broadcasting to prefetch weights for upcoming
layers.
- It preserves the custom `process_weights_after_loading()` method, which
makes it possible to keep the NZ format.
- It is compatible with any linear class that inherits from
`LinearBase`, thereby preserving all the features of the original linear
implementation.
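
The distribution and prefetching idea can be sketched as a small pure-Python simulation. It assumes a round-robin placement and a one-layer prefetch window; the plugin's actual placement policy and API are not shown in this description, so `assign_owners`, `resident_layers_per_rank`, and `prefetch_schedule` are hypothetical names. A real implementation would presumably issue an asynchronous collective (for example `torch.distributed.broadcast` with `async_op=True`) where this sketch only records which rank would send the weights.

```python
def assign_owners(num_layers, world_size):
    """Assumed policy: layer i is stored in full on rank i % world_size."""
    return [layer % world_size for layer in range(num_layers)]


def resident_layers_per_rank(num_layers, world_size):
    """Count how many full layer weights each rank keeps resident."""
    counts = [0] * world_size
    for owner in assign_owners(num_layers, world_size):
        counts[owner] += 1
    return counts


def prefetch_schedule(num_layers, world_size, prefetch_depth=1):
    """For each layer being computed, list the layer to prefetch and its owner.

    Returns a list of (compute_layer, (prefetch_layer, owner_rank) or None).
    """
    owners = assign_owners(num_layers, world_size)
    schedule = []
    for layer in range(num_layers):
        nxt = layer + prefetch_depth
        # While `layer` runs, the owner of `nxt` would broadcast its weights.
        prefetch = (nxt, owners[nxt]) if nxt < num_layers else None
        schedule.append((layer, prefetch))
    return schedule


owners = assign_owners(num_layers=6, world_size=4)
counts = resident_layers_per_rank(num_layers=6, world_size=4)
schedule = prefetch_schedule(num_layers=3, world_size=2)
```

Under these assumptions each rank holds roughly `num_layers / world_size` full weights instead of all of them, and the broadcast of the next layer overlaps with computation of the current one.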

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM main: f4a948f33f

- vLLM version: v0.10.2
- vLLM main: f225ea7dd9

---------

Signed-off-by: clrs97 <524936896@qq.com>
Co-authored-by: CalvinXKY <kyxiezju@163.com>
2025-09-24 17:16:41 +08:00