From a398fa6a0b024f59aaa823c483529bcf2357540f Mon Sep 17 00:00:00 2001
From: ZRJ026 <52122544+ZRJ026@users.noreply.github.com>
Date: Tue, 10 Mar 2026 10:11:35 +0800
Subject: [PATCH] [Bugfix]: correct streaming content-type in load balance
 proxy server (#6985)

Set the proper 'text/event-stream; charset=utf-8' media type for streaming
requests instead of the hardcoded 'application/json'.

### What this PR does / why we need it?

This PR fixes an issue in the disaggregated prefill proxy server where streaming requests (`"stream": true`) were always returned with a hardcoded `Content-Type: application/json`, even when the backend vLLM servers correctly returned Server-Sent Events (SSE) with `Content-Type: text/event-stream; charset=utf-8`.

Specifically, the proxy used `StreamingResponse` with a fixed `media_type` of `application/json`, which caused FastAPI to override the response headers and break proper SSE semantics. As a result, clients (e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not reliably receive token-by-token streaming output.

In addition, the incorrect response type causes compatibility issues with benchmarking and load-testing tools such as **EvalScope**. When streaming is enabled, these tools expect SSE-formatted responses in order to correctly parse token usage information. With the incorrect `application/json` content type, EvalScope fails to parse the response and reports errors similar to:

```
2025-12-15 09:27:56 - evalscope - ERROR: Failed to parse usage from response: list index out of range. Response: []
```

This PR updates the proxy to:
- Detect whether the incoming request is a streaming request (`stream=true`)
- Use `text/event-stream; charset=utf-8` for streaming responses
- Preserve `application/json` for non-streaming responses

This aligns the proxy behavior with the native vLLM prefill/decoder servers and the OpenAI-compatible streaming API contract, and fixes the incorrect streaming response headers that prevented proper real-time token delivery.
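The media-type selection described above can be sketched as follows. This is a minimal illustration, not the exact code from the patch: the helper name `pick_media_type` is invented here, while the patch itself computes the same ternary inline from its existing `stream_flag` variable before constructing `StreamingResponse`.

```python
import json

def pick_media_type(raw_body: bytes) -> str:
    """Choose the response media type from the request's "stream" flag.

    Illustrative helper: the actual patch evaluates the same ternary
    inline when building the StreamingResponse.
    """
    try:
        stream_flag = bool(json.loads(raw_body).get("stream", False))
    except (ValueError, AttributeError):
        # Malformed or non-object bodies fall back to non-streaming.
        stream_flag = False
    # SSE clients (curl -i, EventSource, OpenAI-compatible SDKs) need
    # text/event-stream to parse token-by-token output incrementally.
    return "text/event-stream; charset=utf-8" if stream_flag else "application/json"

print(pick_media_type(b'{"stream": true}'))   # text/event-stream; charset=utf-8
print(pick_media_type(b'{"prompt": "hi"}'))   # application/json
```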
### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

This change was tested manually using a disaggregated prefill + decode setup with the proxy server.

### Test Steps

1. Start the prefiller and decoder vLLM servers:

   ```bash
   vllm serve --host 0.0.0.0 --port 8001 ...
   vllm serve --host 0.0.0.0 --port 8002 ...
   ```

2. Start the proxy server:

   ```bash
   python load_balance_proxy_server_example.py \
       --host 127.0.0.1 --port 8000 \
       --prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \
       --decoder-hosts 127.0.0.1 --decoder-ports 8002
   ```

3. Send a streaming completion request through the proxy:

   ```bash
   curl -i -X POST http://127.0.0.1:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "test",
       "prompt": "hello",
       "max_tokens": 3,
       "stream": true
     }'
   ```

4. Verify the following:
   - The response header is `Content-Type: text/event-stream; charset=utf-8`
   - Tokens are streamed incrementally as SSE `data:` events
   - Non-streaming requests still return `application/json`

No automated tests were added because this change affects an example proxy server and is limited to HTTP response headers. The behavior is directly verifiable using standard SSE-compatible clients.
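Beyond inspecting headers with `curl -i`, the streamed body can be checked client-side. The sketch below assumes the standard OpenAI-compatible SSE framing (each event is a `data: <json>` line, terminated by `data: [DONE]`); the `sample` payload is fabricated for illustration:

```python
import json

def parse_sse(body: str) -> list:
    """Extract JSON payloads from OpenAI-style SSE output.

    Each event is a 'data: <json>' line; the stream ends with 'data: [DONE]'.
    """
    events = []
    for line in body.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines and comments
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        events.append(json.loads(payload))
    return events

# Fabricated sample of a streamed /v1/completions response
sample = (
    'data: {"choices": [{"text": "he"}]}\n\n'
    'data: {"choices": [{"text": "llo"}]}\n\n'
    'data: [DONE]\n\n'
)
tokens = [e["choices"][0]["text"] for e in parse_sse(sample)]
print("".join(tokens))  # hello
```

A parser like this only works when the proxy forwards the response as `text/event-stream`; with the old hardcoded `application/json`, SSE-aware clients had no reason to apply this framing, which is the failure mode EvalScope reported.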
- vLLM version: v0.16.0
- vLLM main: https://github.com/vllm-project/vllm/commit/15d76f74e2fdb12a95ea00f0ca283acf6219a2b7

Signed-off-by: zrj026
Co-authored-by: zrj026
---
 .../load_balance_proxy_server_example.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py b/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py
index aef8095a..dd47bee5 100644
--- a/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py
+++ b/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py
@@ -814,7 +814,9 @@ async def _handle_completions(api: str, request: Request):
             # After streaming done, release tokens
             proxy_state.release_decoder(instance_info.decoder_idx, instance_info.decoder_score)
-        return StreamingResponse(generate_stream(), media_type="application/json")
+        # Determine the correct media type based on stream flag
+        media_type = "text/event-stream; charset=utf-8" if stream_flag else "application/json"
+        return StreamingResponse(generate_stream(), media_type=media_type)
     except Exception as e:
         import traceback