[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)
**What this PR does / why we need it?**

This pull request performs a comprehensive cleanup of the vLLM Ascend documentation. It fixes numerous typos, grammatical errors, and phrasing issues across community guidelines, developer documents, hardware tutorials, and feature guides. Key improvements include correcting hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code examples (removing duplicate flags and trailing commas), and improving the clarity of technical explanations. These changes are necessary to ensure the documentation is professional, accurate, and easy for users to follow.

**Does this PR introduce any user-facing change?**

No, this PR contains documentation-only updates.

**How was this patch tested?**

The changes were manually reviewed for accuracy and grammatical correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
@@ -6,30 +6,28 @@ A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in
1. **Independent, fine-grained scaling**

   * Vision encoders are lightweight, while language models are orders of magnitude larger.
   * The language model can be parallelised without affecting the encoder fleet.
   * Encoder nodes can be added or removed independently.
2. **Lower time-to-first-token (TTFT)**

   * Language-only requests bypass the vision encoder entirely.
   * Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
3. **Cross-process reuse and caching of encoder outputs**

   * In-process encoders confine reuse to a single worker.
   * A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.
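The reuse argument above can be made concrete with a minimal sketch of a content-addressed embedding cache. All names here are hypothetical, and a real deployment would use a remote shared store rather than an in-process dict; the point is only that keying by image content lets any worker reuse embeddings computed elsewhere:

```python
import hashlib

class SharedEncoderCache:
    """Toy stand-in for a remote, shared embedding cache.

    Keys are content hashes of the raw image bytes, so an identical image
    submitted through any worker maps to the same cached embedding.
    """

    def __init__(self):
        self._store = {}   # content hash -> embedding
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, encode_fn):
        k = self.key(image_bytes)
        if k in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[k] = encode_fn(image_bytes)  # run the vision encoder once
        return self._store[k]

cache = SharedEncoderCache()
fake_encode = lambda b: [float(len(b))]  # placeholder for a real encoder
emb1 = cache.get_or_compute(b"same-image", fake_encode)
emb2 = cache.get_or_compute(b"same-image", fake_encode)  # served from cache
```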
-Design doc: <
-<https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
->
+Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
---

## Usage

The current reference pathway is **ExampleConnector**.
-Below ready-to-run scripts shows the workflow:
+The ready-to-run scripts below show the workflow:

1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`
@@ -45,7 +43,7 @@ Below ready-to-run scripts shows the workflow:

Disaggregated encoding is implemented by running two parts:

-* **Encoder instance** – a vLLM instance to performs vision encoding.
+* **Encoder instance** – a vLLM instance to perform vision encoding.
* **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode.
* PD can be either a single normal instance (E + PD) or disaggregated instances (E + P + D).
@@ -62,11 +60,11 @@ All related code is under `vllm/distributed/ec_transfer`.

* *Multi-Path Scheduling Strategy* - dynamically routes multimodal or text-only requests to the corresponding inference path
* *Instance-Level Dynamic Load Balancing* - dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
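The least-loaded dispatch described above can be sketched with a min-heap keyed by each instance's active token load. This is a deliberate simplification (the names are hypothetical, and it never decrements load when a request completes, unlike a real balancer):

```python
import heapq

class LeastLoadedDispatcher:
    """Sketch of instance-level load balancing with a priority queue.

    The heap holds (active_tokens, instance_name) pairs; each request is
    sent to the least-loaded instance, whose load then grows by the
    request's token count.
    """

    def __init__(self, instances):
        self._heap = [(0, name) for name in instances]
        heapq.heapify(self._heap)

    def dispatch(self, num_tokens: int) -> str:
        load, name = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (load + num_tokens, name))
        return name

d = LeastLoadedDispatcher(["enc-0", "enc-1"])
# A heavy request loads enc-0, so the next two go to enc-1.
targets = [d.dispatch(n) for n in (100, 10, 10)]
```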
-We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and referred to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
+We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the KV transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
-[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html)
+[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
For the PD disaggregation part, when using MooncakeLayerwiseConnector, the request first enters the Decoder instance, and the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes the KV cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with subsequent token generation.

-`docs/source/developer_guide/feature_guide/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
+`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` gives a brief overview of disaggregated prefill.
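The layer-wise overlap can be illustrated with a producer/consumer sketch. This is a toy stand-in for the connector's actual transport, using strings in place of KV tensors and a thread-safe queue in place of the network: pushing each layer as soon as it is computed lets transfer of layer *i* overlap with computation of layer *i+1*.

```python
import queue
import threading

def prefill_and_push(num_layers: int, kv_queue: queue.Queue) -> None:
    """Prefill side: push each layer's KV as soon as it is computed."""
    for layer in range(num_layers):
        kv = f"kv-layer-{layer}"  # placeholder for the real per-layer KV tensors
        kv_queue.put(kv)
    kv_queue.put(None)  # sentinel: prefill finished

def decode_receive(kv_queue: queue.Queue) -> list:
    """Decoder side: consume layers as they arrive, instead of waiting
    for one bulk transfer after the whole prefill completes."""
    received = []
    while (kv := kv_queue.get()) is not None:
        received.append(kv)
    return received

q = queue.Queue()
producer = threading.Thread(target=prefill_and_push, args=(4, q))
producer.start()
layers = decode_receive(q)  # runs concurrently with the producer
producer.join()
```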
## Limitations