[Lint]Style: reformat markdown files via markdownlint (#5884)

### What this PR does / why we need it?
reformat markdown files via markdownlint

- vLLM version: v0.13.0
- vLLM main: bde38c11df

---------

Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: MrZ20 <2609716663@qq.com>
Co-authored-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Commit 4811ba62e0 (parent 96edd4673f) by SILONG ZENG, committed by GitHub on 2026-01-15 09:06:01 +08:00
75 changed files with 711 additions and 308 deletions

@@ -5,12 +5,14 @@
**Layer Shard Linear** is a memory-optimization feature designed for large language model (LLM) inference. It addresses the high memory pressure caused by **repeated linear operators across many layers** that share identical structure but have distinct weights.
Instead of replicating all weights on every device, **Layer Shard Linear shards the weights of a "series" of such operators across the NPU devices in a communication group**:
- The **i-th layer's linear weight** is stored **only on device `i % K`**, where `K` is the number of devices in the group.
- Other devices hold a lightweight **shared dummy tensor** during initialization and fetch the real weight **on-demand via asynchronous broadcast** during the forward pass.
As illustrated in the figure below, this design hides weight fetches behind computation: while the current layer (e.g., MLA or MoE) is being computed, the system **asynchronously broadcasts the next layer's weight** in the background. Because the MLA attention computation takes long enough to hide the transfer, the weight broadcast for `o_proj` is **fully overlapped with computation**, making the communication **latency-free from the perspective of end-to-end inference**.
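For concreteness, below is a minimal sketch of the ownership and prefetch pattern described above, written against plain `torch.distributed`; the helper names (`owner_rank`, `prefetch_weight`, `run_layer_i`, `buffers`) are illustrative and not the actual vllm-ascend implementation:

```python
import torch
import torch.distributed as dist

def owner_rank(layer_idx: int) -> int:
    # Layer i's real weight lives only on rank i % K (K = group size);
    # every other rank keeps a dummy buffer of the same shape.
    return layer_idx % dist.get_world_size()

def prefetch_weight(layer_idx: int, weight_buf: torch.Tensor):
    # Start an asynchronous broadcast from the owning rank. The returned
    # handle lets the caller overlap the transfer with computation and
    # only wait() right before the weight is actually consumed.
    return dist.broadcast(weight_buf, src=owner_rank(layer_idx), async_op=True)

# Per-layer forward pattern:
#   handle = prefetch_weight(i + 1, buffers[i + 1])  # begin fetching layer i+1's weight
#   y = run_layer_i(x)                               # MLA / MoE compute for layer i
#   handle.wait()                                     # layer i+1's weight is ready when reached
```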
This approach **preserves exact computational semantics** while **significantly reducing NPU memory footprint** (see the sizing sketch after this list), especially critical for:
- Extremely deep architectures (e.g., DeepSeek-V3/R1 with 61 layers);
- Models using **[DSA-CP](https://github.com/vllm-project/vllm-ascend/pull/4702)** or **[FlashComm2](https://github.com/vllm-project/vllm-ascend/pull/4188)**, where the full `O` (output) projection matrix must reside in memory per layer;
- Scenarios where **attention computation latency fully overlaps** (hides) the communication cost of weight broadcasting.
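To make the savings concrete, here is a back-of-the-envelope calculation; the shapes and group size are assumptions for illustration (DeepSeek-V3-like `o_proj`, bf16, 16 NPUs), not measurements from this feature:

```python
# Rough sizing (all shapes are assumptions, not taken from this document):
# o_proj of shape (16384, 7168), 61 layers, bf16 (2 bytes/param), K = 16 NPUs.
layers, bytes_per_param, k = 61, 2, 16
o_proj_params = 16384 * 7168                       # ~117M params per layer
replicated_gib = layers * o_proj_params * bytes_per_param / 2**30
sharded_gib = replicated_gib / k                   # each NPU stores ~61/K layers' weights
print(f"replicated: {replicated_gib:.1f} GiB/NPU, sharded: {sharded_gib:.2f} GiB/NPU")
# replicated: 13.3 GiB/NPU, sharded: 0.83 GiB/NPU (plus transient broadcast buffers)
```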
@@ -18,6 +20,7 @@ This approach **preserves exact computational semantics** while **significantly
---
### Flowchart
![layer shard](./images/layer_sharding.png)
> **Figure.** Layer Shard Linear workflow: weights are sharded by layer across devices (top), and during forward execution (bottom), asynchronous broadcast pre-fetches the next layer's weight while the current layer computes—enabling zero-overhead weight loading.
@@ -68,4 +71,4 @@ vllm serve \
--additional-config '{
"layer_sharding": ["q_b_proj", "o_proj"]
}'
```
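
For offline inference, the same option can be passed programmatically. A minimal sketch follows; the model path and parallelism setting are placeholders, and it assumes `additional_config` is forwarded to the Ascend backend the same way the `--additional-config` CLI flag is:

```python
from vllm import LLM

# Offline-inference counterpart of the serve command above (a sketch).
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",       # placeholder model path
    tensor_parallel_size=16,                # placeholder parallel setting
    additional_config={"layer_sharding": ["q_b_proj", "o_proj"]},
)
```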