[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
herizhen
2026-04-09 15:37:57 +08:00
committed by GitHub
parent c40a387f63
commit 0d1424d81a
71 changed files with 1295 additions and 1296 deletions


@@ -6,30 +6,28 @@ A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in
1. **Independent, fine-grained scaling**
* Vision encoders are lightweight, while language models are orders of magnitude larger.
* The language model can be parallelised without affecting the encoder fleet.
* Encoder nodes can be added or removed independently.
2. **Lower time-to-first-token (TTFT)**
* Language-only requests bypass the vision encoder entirely.
* Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
3. **Cross-process reuse and caching of encoder outputs**
* In-process encoders confine reuse to a single worker.
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation; a minimal sketch of this idea follows the list.
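To make the cross-process reuse concrete, here is a minimal sketch of a shared embedding cache keyed by a content hash of the image. It is illustrative only; the real mechanism lives under `vllm/distributed/ec_transfer`, and the names used here (`SharedEmbeddingCache`, `get_or_compute`, the stand-in encoder) are hypothetical.

```python
# Illustrative sketch only: a shared store for vision-encoder outputs, keyed by
# a content hash so any worker can look up embeddings computed elsewhere.
# The real mechanism lives under vllm/distributed/ec_transfer; these names are
# hypothetical.
import hashlib

import numpy as np


class SharedEmbeddingCache:
    """Stand-in for a remote shared store; a plain dict keeps the sketch simple."""

    def __init__(self) -> None:
        self._store: dict[str, np.ndarray] = {}

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, encode_fn) -> np.ndarray:
        k = self.key(image_bytes)
        if k not in self._store:
            # Cache miss: run the vision encoder exactly once for this image.
            self._store[k] = encode_fn(image_bytes)
        # Cache hit (or freshly filled): other workers skip redundant encoding.
        return self._store[k]


cache = SharedEmbeddingCache()
fake_encoder = lambda b: np.zeros((16, 1024))  # placeholder for the real encoder
embedding = cache.get_or_compute(b"raw image bytes", fake_encoder)
```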
Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
---
## Usage
The current reference pathway is **ExampleConnector**.
The ready-to-run scripts below show the workflow:
1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`
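Once the 1E + 1PD setup is running, a client can send requests through the standard OpenAI-compatible API. The sketch below is hedged: the proxy address, model name, and image URL are assumptions for illustration, not values fixed by the example scripts.

```python
# Hedged client-side sketch: the proxy address, model name, and image URL are
# assumptions, not values prescribed by the disagg_1e1pd scripts.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # any served multimodal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```

A text-only request to the same endpoint bypasses the encoder instance entirely, which is how the TTFT benefit described above materializes.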
@@ -45,7 +43,7 @@ Below ready-to-run scripts shows the workflow:
Disaggregated encoding is implemented by running two parts:
* **Encoder instance** - a vLLM instance that performs vision encoding.
* **Prefill/Decode (PD) instance(s)** - run language pre-fill and decode.
* PD can run either as a single combined instance (E + PD) or as disaggregated instances (E + P + D)
@@ -62,11 +60,11 @@ All related code is under `vllm/distributed/ec_transfer`.
* *Multi-Path Scheduling Strategy* - dynamically routes multimodal and text-only requests to the corresponding inference path
* *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances (a minimal sketch of this strategy follows below)
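The least-loaded dispatch can be pictured with a small priority queue, as in the sketch below. The instance names and the token-based load accounting are assumptions for illustration, not the proxy's actual code.

```python
# Minimal sketch of least-loaded dispatch with a priority queue (heap). A real
# proxy would also decrement an instance's load when its requests complete;
# that bookkeeping is omitted here.
import heapq


class LeastLoadedDispatcher:
    def __init__(self, instances: list[str]) -> None:
        # Heap entries are (active_token_load, instance_id).
        self._heap = [(0, inst) for inst in instances]
        heapq.heapify(self._heap)

    def dispatch(self, request_tokens: int) -> str:
        load, inst = heapq.heappop(self._heap)      # least-loaded instance wins
        heapq.heappush(self._heap, (load + request_tokens, inst))
        return inst


dispatcher = LeastLoadedDispatcher(["pd-0", "pd-1", "pd-2"])
print(dispatcher.dispatch(512))  # -> pd-0
print(dispatcher.dispatch(128))  # -> pd-1 (pd-0 now carries 512 active tokens)
```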
We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the KV transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
For the PD disaggregation part, when using the MooncakeLayerwiseConnector, the request first enters the Decoder instance; the Decoder then triggers a remote prefill task in reverse via the Metaserver. The Prefill node executes inference and pushes the KV cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with subsequent token generation, as sketched below.
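The overlap of computation and transmission can be pictured with the conceptual sketch below. The names (`send_kv`, `prefill_and_stream`) and timings are invented for illustration and are not the MooncakeLayerwiseConnector API.

```python
# Conceptual sketch of layer-wise KV streaming: each layer's KV cache is pushed
# to the Decoder as soon as it is computed, so transfer overlaps with the
# remaining prefill compute. Names and timings are illustrative.
import asyncio


async def send_kv(layer_idx: int, kv_block: bytes) -> None:
    await asyncio.sleep(0.01)  # stands in for the network push to the Decoder


async def prefill_and_stream(num_layers: int) -> None:
    in_flight = []
    for layer in range(num_layers):
        kv_block = b"..."  # this layer's KV cache, produced by prefill compute
        # Start the transfer immediately; the next layer's compute proceeds
        # while earlier layers are still on the wire.
        in_flight.append(asyncio.create_task(send_kv(layer, kv_block)))
    await asyncio.gather(*in_flight)  # the Decoder resumes generation after this


asyncio.run(prefill_and_stream(num_layers=32))
```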
`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
## Limitations