### What this PR does / why we need it?
To reduce the memory footprint of large weight matrices, we usually
choose among several linear layer implementations, each with drawbacks:
- `ReplicatedLinear`: Stores the entire matrix on every device,
consuming excessive memory.
- `RowParallelLinear`: Requires an `all_reduce` to merge partial
results, introducing additional communication overhead and potential
accuracy loss. Each token is processed across multiple devices rather
than on a single device, which is undesirable in SP scenarios.
- ...
Furthermore, in multi-way Data Parallelism (DP) configurations, layers
typically store redundant weight copies.
This PR introduces a shared-weight plugin for layers inheriting from
`LinearBase`. It offers the following advantages:
- It evenly distributes a set of layers with identical structures across
devices. Each layer retains its complete weights, eliminating redundant
memory usage.
- It supports asynchronous broadcasting to prefetch weights for upcoming
layers.
- It preserves the custom `process_weights_after_loading()` method,
which makes it possible to keep the NZ format.
- It is compatible with any linear class that inherits from
`LinearBase`, thereby preserving all the features of the original linear
implementation.
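
The placement and prefetch ideas above can be illustrated with a small,
pure-Python simulation. This is only a sketch under stated assumptions,
not the code in this PR: round-robin placement, the names `owner_rank`,
`SimulatedGroup`, and `run_forward`, and the use of a worker thread in
place of an asynchronous device broadcast are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def owner_rank(layer_idx: int, world_size: int) -> int:
    """Round-robin placement (an assumption made for illustration)."""
    return layer_idx % world_size


class SimulatedGroup:
    """Stand-in for a device group: one dict of full layer weights per rank."""

    def __init__(self, num_layers: int, world_size: int):
        self.world_size = world_size
        self.storage = [dict() for _ in range(world_size)]
        for i in range(num_layers):
            # Each layer's complete (dummy) weights live on exactly one rank.
            self.storage[owner_rank(i, world_size)][i] = [float(i)] * 4

    def broadcast(self, layer_idx: int) -> list:
        """Return a copy of the full weights held by the owning rank."""
        src = owner_rank(layer_idx, self.world_size)
        return list(self.storage[src][layer_idx])


def run_forward(group: SimulatedGroup, num_layers: int, x: float) -> float:
    """Run layers in order, prefetching layer i+1 while layer i computes."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(group.broadcast, 0)
        for i in range(num_layers):
            weights = pending.result()  # wait for this layer's weights
            if i + 1 < num_layers:
                # Start fetching the next layer before computing this one.
                pending = pool.submit(group.broadcast, i + 1)
            x = x + sum(weights)  # placeholder for the real linear op
    return x


group = SimulatedGroup(num_layers=8, world_size=4)
per_rank = [len(s) for s in group.storage]  # layers stored on each rank
result = run_forward(group, 8, 0.0)
```

In this toy setup each of the 4 ranks stores 2 of the 8 layers instead
of all 8, and the fetch for the next layer overlaps with the current
layer's computation.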
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
vLLM main: f4a948f33f
- vLLM version: v0.10.2
- vLLM main: f225ea7dd9
---------
Signed-off-by: clrs97 <524936896@qq.com>
Co-authored-by: CalvinXKY <kyxiezju@163.com>