Set proper 'text/event-stream; charset=utf-8' media type for streaming
responses instead of hardcoded 'application/json'
### What this PR does / why we need it?
This PR fixes an issue in the disaggregated prefill proxy server where
streaming requests (`"stream": true`) were always returned with a
hardcoded `Content-Type: application/json`, even when the backend vLLM
servers correctly returned Server-Sent Events (SSE) with `Content-Type:
text/event-stream; charset=utf-8`.
Specifically, the proxy used `StreamingResponse` with a fixed
`media_type` of `application/json`, which caused FastAPI to override the
response headers and break proper SSE semantics. As a result, clients
(e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not
reliably receive token-by-token streaming output.
In addition, this incorrect response type causes compatibility issues
with benchmarking and load-testing tools such as **EvalScope**. When
streaming is enabled, these tools expect SSE-formatted responses to
correctly parse token usage information. With the incorrect
`application/json` content type, EvalScope fails to parse the response
and reports errors similar to:`2025-12-15 09:27:56 - evalscope - ERROR:
Failed to parse usage from response: list index out of range. Response:
[]`
This PR updates the proxy to:
- Detect whether the incoming request is a streaming request
(`stream=true`)
- Use `text/event-stream; charset=utf-8` for streaming responses
- Preserve `application/json` for non-streaming responses
This aligns the proxy behavior with native vLLM prefill/decoder servers
and the OpenAI-compatible streaming API contract.
Fixes incorrect streaming response headers that prevented proper
real-time token delivery.
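As a sketch of the approach (the helper name `select_media_type` is hypothetical; the proxy's actual code may structure this differently), the media type can be derived from the request body's `stream` flag and passed to `StreamingResponse` instead of a hardcoded value:

```python
import json


def select_media_type(request_body: bytes) -> str:
    """Pick the response media type from the request's "stream" flag.

    Hypothetical helper illustrating the fix; falls back to
    application/json for non-streaming or unparseable requests.
    """
    try:
        payload = json.loads(request_body)
    except json.JSONDecodeError:
        payload = {}
    if isinstance(payload, dict) and payload.get("stream", False):
        return "text/event-stream; charset=utf-8"
    return "application/json"


# Usage in a FastAPI handler (sketch):
# body = await request.body()
# return StreamingResponse(generate(), media_type=select_media_type(body))
```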
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
This change was tested manually using a disaggregated prefill + decode
setup with the proxy server.
### Test Steps
1. Start prefiller and decoder vLLM servers:
```bash
vllm serve --host 0.0.0.0 --port 8001 ...
vllm serve --host 0.0.0.0 --port 8002 ...
```
2. Start the proxy server:
```bash
python load_balance_proxy_server_example.py \
--host 127.0.0.1 --port 8000 \
--prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \
--decoder-hosts 127.0.0.1 --decoder-ports 8002
```
3. Send a streaming completion request through the proxy:
```bash
curl -i -X POST http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "test",
"prompt": "hello",
"max_tokens": 3,
"stream": true
}'
```
4. Verify the following:
- The response header is `Content-Type: text/event-stream; charset=utf-8`
- Tokens are streamed incrementally as SSE `data:` events
- Non-streaming requests still return `Content-Type: application/json`
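To illustrate what step 4 checks for, a compliant SSE body carries one JSON payload per `data:` line, terminated by an OpenAI-style `[DONE]` sentinel. A minimal parser sketch (the sample payload below is hypothetical, not captured from a real run):

```python
import json


def parse_sse_data_events(raw_body: str) -> list:
    """Collect the JSON payloads carried by "data:" lines in an SSE body."""
    events = []
    for line in raw_body.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank lines and non-data fields
        data = line[len("data: "):]
        if data == "[DONE]":  # OpenAI-compatible end-of-stream sentinel
            break
        events.append(json.loads(data))
    return events


# Hypothetical two-token completion stream:
sample = (
    'data: {"choices": [{"text": "he"}]}\n\n'
    'data: {"choices": [{"text": "llo"}]}\n\n'
    "data: [DONE]\n\n"
)
print(len(parse_sse_data_events(sample)))  # 2 events before [DONE]
```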
No automated tests were added because this change affects an example
proxy server and is limited to HTTP response headers. The behavior is
directly verifiable using standard SSE-compatible clients.
- vLLM version: v0.16.0
- vLLM main:
15d76f74e2
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
Co-authored-by: zrj026 <zhangrunjiang026@gmail.com>