[Doc][Misc] Correcting the document and uploading the model deployment template (#8287)
### What this PR does / why we need it?
Corrects the document and uploads the model deployment template.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
@@ -1,4 +1,4 @@
-# Fine-Grained Tensor Parallelism (Finegrained TP)
+# Fine-Grained Tensor Parallelism (Fine-grained TP)
 
 ## Overview
@@ -8,12 +8,12 @@ This capability supports heterogeneous parallelism strategies within a single mo
 
 ---
 
-## Benefits of Finegrained TP
+## Benefits of Fine-grained TP
 
 Fine-Grained Tensor Parallelism delivers two primary performance advantages through targeted weight sharding:
 
 - **Reduced Per-Device Memory Footprint**:
-  Fine-grained TP shards large weight matrices(e.g., LM Head, o_proj)across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
+  Fine-grained TP shards large weight matrices (e.g., LM Head, o_proj) across devices, lowering peak memory usage and enabling larger batches or deployment on memory-limited hardware—without quantization.
 
 - **Faster Memory Access in GEMMs**:
   In decode-heavy workloads, GEMM performance is often memory-bound. Weight sharding reduces per-device weight fetch volume, cutting DRAM traffic and improving bandwidth efficiency—especially for latency-sensitive layers like LM Head and o_proj.
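The per-device memory saving described in the hunk above can be illustrated with a rough back-of-the-envelope sketch. The weight shapes and dtype below are hypothetical examples, not values from the document:

```python
# Rough illustration (hypothetical sizes): per-device bytes for an LM Head
# weight of shape [vocab_size, hidden_size] when sharded evenly across
# tp devices, as fine-grained TP does for large matrices.
def lm_head_bytes_per_device(vocab_size: int, hidden_size: int,
                             bytes_per_param: int, tp: int) -> int:
    total = vocab_size * hidden_size * bytes_per_param
    return total // tp  # each device holds 1/tp of the matrix

full = lm_head_bytes_per_device(129_280, 7_168, 2, 1)   # unsharded copy
shard = lm_head_bytes_per_device(129_280, 7_168, 2, 8)  # fine-grained TP = 8
print(f"full: {full / 1e9:.2f} GB, per-device shard: {shard / 1e9:.2f} GB")
```

The same 1/tp factor applies to per-device weight fetch volume, which is why the sharding also helps memory-bound GEMMs in decode.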
@@ -53,7 +53,7 @@ The Fine-Grained TP size for any component must:
 
 ---
 
-## How to Use Finegrained TP
+## How to Use Fine-grained TP
 
 ### Configuration Format
@@ -72,6 +72,7 @@ For best results, if you run inside a docker container, which `systemctl` is likely
 - **Stop `irqbalance` service**:
 
   For example, on Ubuntu system, you can run the following command to stop irqbalance:
+
   ```bash
   sudo systemctl stop irqbalance
   ```
@@ -56,12 +56,12 @@ All related code is under `vllm/distributed/ec_transfer`.
   * *Scheduler role* – checks cache existence and schedules loads.
   * *Worker role* – loads the embeddings into memory.
 
-* **EPD Load Balance Proxy** -
+* **EPD Load Balancing Proxy** -
   * *Multi-Path Scheduling Strategy* - dynamically routes multimodal or text requests to the corresponding inference path.
   * *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
 
 We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the KV transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
-[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
+[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
 
 For the PD disaggregation part, when using MooncakeLayerwiseConnector: the request first enters the Decoder instance; the Decoder then triggers a remote prefill task in reverse via the Metaserver. The Prefill node executes inference and pushes the KV cache layer by layer to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with subsequent token generation.
 `docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about disaggregated prefill.
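The least-loaded, priority-queue dispatch described in the hunk above can be sketched with Python's `heapq`. This is a simplified illustration of the strategy, not vllm-ascend's actual proxy code; the instance names and the active-token load metric are hypothetical:

```python
import heapq

class LeastLoadedDispatcher:
    """Toy sketch: dispatch each request to the instance with the
    fewest active tokens, tracked in a min-heap priority queue."""

    def __init__(self, instances):
        # Heap entries are (active_tokens, instance_name); the root is
        # always the least-loaded instance.
        self.heap = [(0, name) for name in instances]
        heapq.heapify(self.heap)

    def dispatch(self, num_tokens: int) -> str:
        load, name = heapq.heappop(self.heap)       # least-loaded instance
        heapq.heappush(self.heap, (load + num_tokens, name))
        return name

d = LeastLoadedDispatcher(["inst-0", "inst-1"])
first = d.dispatch(100)   # lands on one idle instance
second = d.dispatch(10)   # lands on the other, now less-loaded one
```

In a real proxy the load counter would also be decremented as requests finish; the heap invariant keeps each dispatch O(log n) in the number of instances.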
@@ -4,7 +4,7 @@ For larger-scale deployments especially, it can make sense to handle the orchest
 
 In this case, it's more convenient to treat each DP rank like a separate vLLM deployment, with its own endpoint, and have an external router balance HTTP requests between them, making use of appropriate real-time telemetry from each server for routing decisions.
 
-## Getting Start
+## Getting Started
 
 The functionality of [external DP](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/?h=external#external-load-balancing) is already natively supported by vLLM. In vllm-ascend we provide two enhanced functionalities:
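An external router of the kind this hunk describes can be as simple as picking the endpoint whose telemetry reports the lowest load. A minimal sketch, where the endpoint URLs and the in-flight-request metric are hypothetical placeholders rather than a real vLLM API:

```python
def pick_endpoint(telemetry: dict) -> str:
    """Route the next request to the DP-rank endpoint reporting the
    fewest in-flight requests (telemetry maps endpoint URL -> load)."""
    return min(telemetry, key=telemetry.get)

# Each DP rank runs as its own vLLM server with its own HTTP endpoint;
# the router would refresh these numbers from real-time telemetry.
telemetry = {
    "http://dp-rank-0:8000": 12,
    "http://dp-rank-1:8000": 3,
}
target = pick_endpoint(telemetry)
```

A production router would add health checks and tie-breaking, but the routing decision itself reduces to this comparison.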
@@ -163,7 +163,7 @@ export ASCEND_ENABLE_USE_FABRIC_MEM=1
 #A2
 #export HCCL_INTRA_ROCE_ENABLE=1
 
-#Minimum retransmission timeout of the RDMA,equals 4.096 μs * 2 ^ timeout.
+#Minimum retransmission timeout of the RDMA, equals 4.096 μs * 2 ^ timeout.
 #Needs to satisfy the equation: ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7, where 7 is the default number of retries for RDMA transfer.
 #HCCL_RDMA_TIMEOUT also affects collective communication behavior and should be configured carefully.
 export HCCL_RDMA_TIMEOUT=17
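The timeout relationship stated in the comments above can be checked numerically. The sketch below just evaluates the stated formula; the `ASCEND_TRANSFER_TIMEOUT` value is a hypothetical placeholder, not a recommended setting:

```python
# Minimum RDMA retransmission timeout: 4.096 us * 2 ** HCCL_RDMA_TIMEOUT.
HCCL_RDMA_TIMEOUT = 17
rdma_timeout_s = 4.096e-6 * 2 ** HCCL_RDMA_TIMEOUT  # ~0.54 s for timeout=17

# Constraint from the comments: ASCEND_TRANSFER_TIMEOUT > RDMA_TIMEOUT * 7,
# where 7 is the default number of RDMA retries. Checked here against a
# hypothetical transfer-timeout value.
ascend_transfer_timeout_s = 10.0
assert ascend_transfer_timeout_s > rdma_timeout_s * 7
print(f"per-retry timeout: {rdma_timeout_s:.3f} s, "
      f"7 retries: {rdma_timeout_s * 7:.3f} s")
```

With `HCCL_RDMA_TIMEOUT=17` the per-retry timeout is about 0.537 s, so the transfer timeout must exceed roughly 3.76 s.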
@@ -1,6 +1,6 @@
 # Distributed DP Server With Large-Scale Expert Parallelism
 
-## Getting Start
+## Getting Started
 
 vLLM-Ascend now supports prefill-decode (PD) disaggregation in the large-scale **Expert Parallelism (EP)** scenario. To achieve better performance, the distributed DP server is applied in vLLM-Ascend. In the PD separation scenario, different optimization strategies can be implemented based on the distinct characteristics of P and D nodes, thereby enabling more flexible model deployment. \
 Take the DeepSeek model as an example, deployed across 8 Atlas 800T A3 servers. Assume the server IPs run from 192.0.0.1 to 192.0.0.8. The first 4 servers serve as prefiller nodes and the last 4 as decoder nodes; each prefiller node is deployed as an independent master node, while the decoder nodes use the 192.0.0.5 node as the master node.
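The node layout described above can be written out explicitly. The snippet is only a sketch of the described topology; the role-assignment code is illustrative, not part of vllm-ascend:

```python
# 8 Atlas 800T A3 servers with IPs 192.0.0.1 .. 192.0.0.8, split into
# 4 prefiller nodes and 4 decoder nodes as in the example deployment.
servers = [f"192.0.0.{i}" for i in range(1, 9)]

prefillers = servers[:4]      # each prefiller is its own independent master
decoders = servers[4:]        # decoders form one group
decode_master = decoders[0]   # the group uses 192.0.0.5 as its master node

for ip in prefillers:
    print(f"prefiller {ip}: master={ip}")
for ip in decoders:
    print(f"decoder   {ip}: master={decode_master}")
```
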