xc-llm-ascend/docs/source/user_guide/feature_guide/epd_disaggregation.md

# Disaggregated-encoder

## Why disaggregated-encoder?

A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits:

1. **Independent, fine-grained scaling**  

   * Vision encoders are lightweight, while language models are orders of magnitude larger.  
   * The language model can be parallelised without affecting the encoder fleet.  
   * Encoder nodes can be added or removed independently.

2. **Lower time-to-first-token (TTFT)**

   * Language-only requests bypass the vision encoder entirely.  
   * Encoder output is injected only at required attention layers, shortening the pre-fill critical path.  

3. **Cross-process reuse and caching of encoder outputs**

   * In-process encoders confine reuse to a single worker.  
   * A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.

Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>

---

## Usage

The current reference pathway is **ExampleConnector**.
The ready-to-run scripts below show the workflow:

1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`

1 Encoder instance + 1 Prefill instance + 1 Decode instance:
`examples/online_serving/disaggregated_encoder/disagg_1e1p1d/`

---

## Development

![alt text](<./images/epd_disaggregation.jpg>)

Disaggregated encoding is implemented by running two parts:

* **Encoder instance** – a vLLM instance to perform vision encoding.
* **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode.
    * PD can be in either a single normal instance with (E + PD) or in disaggregated instances with (E + P + D)

A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.  
All related code is under `vllm/distributed/ec_transfer`.

## Key abstractions

* **ECConnector** – interface for retrieving EC caches produced by the encoder.  
    * *Scheduler role* – checks cache existence and schedules loads.  
    * *Worker role* – loads the embeddings into memory.

* **EPD Load Balance Proxy** -
    * *Multi-Path Scheduling Strategy* - dynamically diverts the multimodal request or text requests to the corresponding inference path
    * *Instance-Level Dynamic Load Balancing* -  dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
  
We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:  
[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)

For the PD disaggregation part, when using MooncakeLayerwiseConnector: The request first enters the Decoder instance,the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes KV Cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with the subsequent token generation.
`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.

## Limitations

* Disable `--mm-processor-cache-gb 0` if you want to use cross-process caching

* For the PD disaggregation part, refer to the limitations of PD decomposition
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
+								# Disaggregated-encoder
 								## Why disaggregated-encoder?
 								A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits:
 . **Independent, fine-grained scaling**
-												[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
											
										
										
											2026-04-09 15:37:57 +08:00
+								   * Vision encoders are lightweight, while language models are orders of magnitude larger.
 								   * The language model can be parallelised without affecting the encoder fleet.
 								   * Encoder nodes can be added or removed independently.
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
 . **Lower time-to-first-token (TTFT)**
-												[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
											
										
										
											2026-04-09 15:37:57 +08:00
+								   * Language-only requests bypass the vision encoder entirely.
 								   * Encoder output is injected only at required attention layers, shortening the pre-fill critical path.
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
 . **Cross-process reuse and caching of encoder outputs**
-												[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
											
										
										
											2026-04-09 15:37:57 +08:00
+								   * In-process encoders confine reuse to a single worker.
 								   * A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
-												[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
											
										
										
											2026-04-09 15:37:57 +08:00
+								Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
 								---
 								## Usage
 								The current reference pathway is **ExampleConnector**.
-												[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
											
										
										
											2026-04-09 15:37:57 +08:00
+								The ready-to-run scripts below show the workflow:
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
 Encoder instance + 1 PD instance:
 								`examples/online_serving/disaggregated_encoder/disagg_1e1pd/`
 Encoder instance + 1 Prefill instance + 1 Decode instance:
 								`examples/online_serving/disaggregated_encoder/disagg_1e1p1d/`
 								---
 								## Development
 								![alt text](<./images/epd_disaggregation.jpg>)
 								Disaggregated encoding is implemented by running two parts:
-												[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
											
										
										
											2026-04-09 15:37:57 +08:00
+								* **Encoder instance** – a vLLM instance to perform vision encoding.
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
+								* **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode.
 								    * PD can be in either a single normal instance with (E + PD) or in disaggregated instances with (E + P + D)
 								A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
 								All related code is under `vllm/distributed/ec_transfer`.
 								## Key abstractions
 								* **ECConnector** – interface for retrieving EC caches produced by the encoder.
 								    * *Scheduler role* – checks cache existence and schedules loads.
 								    * *Worker role* – loads the embeddings into memory.
 								* **EPD Load Balance Proxy** -
 								    * *Multi-Path Scheduling Strategy* - dynamically diverts the multimodal request or text requests to the corresponding inference path
 								    * *Instance-Level Dynamic Load Balancing* -  dispatches multimodal requests based on a least-loaded strategy, using a priority queue to balance the active token workload across instances.
-												[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
											
										
										
											2026-04-09 15:37:57 +08:00
+								We create the example setup with the **MooncakeLayerwiseConnector** from `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` and refer to the `examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py` to facilitate the kv transfer between P and D. For step-by-step deployment and configuration of Mooncake, refer to the following guide:
 								[https://docs.vllm.ai/projects/ascend/en/latest/tutorials/pd_disaggregation_mooncake_multi_node.html](https://docs.vllm.ai/projects/ascend/en/latest/tutorials/features/pd_disaggregation_mooncake_multi_node.html)
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
 								For the PD disaggregation part, when using MooncakeLayerwiseConnector: The request first enters the Decoder instance,the Decoder triggers a remote prefill task in reverse via the Metaserver. The Prefill node then executes inference and pushes KV Cache layer-wise to the Decoder, overlapping computation with transmission. Once the transfer is complete, the Decoder seamlessly continues with the subsequent token generation.
-												[Doc][Misc] Comprehensive documentation cleanup and grammatical fixes (#8073)

What this PR does / why we need it?
This pull request performs a comprehensive cleanup of the vLLM Ascend
documentation. It fixes numerous typos, grammatical errors, and phrasing
issues across community guidelines, developer documents, hardware
tutorials, and feature guides. Key improvements include correcting
hardware names (e.g., Atlas 300I), fixing broken links, cleaning up code
examples (removing duplicate flags and trailing commas), and improving
the clarity of technical explanations. These changes are necessary to
ensure the documentation is professional, accurate, and easy for users
to follow.

Does this PR introduce any user-facing change?
No, this PR contains documentation-only updates.

How was this patch tested?
The changes were manually reviewed for accuracy and grammatical
correctness. No functional code changes were introduced.

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
											
										
										
											2026-04-09 15:37:57 +08:00
+								`docs/source/developer_guide/Design_Documents/disaggregated_prefill.md` shows the brief idea about the disaggregated prefill.
-												[Doc] EPD doc and load-balance proxy example (#6221)

Add EPD doc and load-balance proxy example

- vLLM version: v0.14.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/d68209402ddab3f54a09bc1f4de9a9495a283b60

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
											
										
										
											2026-03-12 16:17:17 +08:00
 								## Limitations
 								* Disable `--mm-processor-cache-gb 0` if you want to use cross-process caching
 								* For the PD disaggregation part, refer to the limitations of PD decomposition