[Bugfix]: correct streaming content-type in load balance proxy server (#6985)
Set proper 'text/event-stream; charset=utf-8' media type for streaming
requests instead of hardcoded 'application/json'

### What this PR does / why we need it?

This PR fixes an issue in the disaggregated prefill proxy server where
streaming requests (`"stream": true`) were always returned with a
hardcoded `Content-Type: application/json`, even when the backend vLLM
servers correctly returned Server-Sent Events (SSE) with `Content-Type:
text/event-stream; charset=utf-8`.

Specifically, the proxy used `StreamingResponse` with a fixed
`media_type` of `application/json`, which caused FastAPI to override the
response headers and break proper SSE semantics. As a result, clients
(e.g. `curl -i`, EventSource, or OpenAI-compatible SDKs) could not
reliably receive token-by-token streaming output.
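
Roughly, the problematic pattern looked like this (a minimal sketch of a FastAPI handler, not the exact code in the example script; `forward_chunks()` is a placeholder for the proxy's backend-forwarding generator):

```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/v1/completions")
async def completions(request: Request):
    async def forward_chunks():
        # Placeholder: the real proxy forwards chunks from the
        # prefiller/decoder backends here.
        yield b""

    # Hardcoded media type: streaming (SSE) responses from the backend
    # are relabeled as application/json by the proxy.
    return StreamingResponse(forward_chunks(), media_type="application/json")
```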

In addition, this incorrect response type causes compatibility issues
with benchmarking and load-testing tools such as **EvalScope**. When
streaming is enabled, these tools expect SSE-formatted responses to
correctly parse token usage information. With the incorrect
`application/json` content type, EvalScope fails to parse the response
and reports errors similar to: `2025-12-15 09:27:56 - evalscope - ERROR:
Failed to parse usage from response: list index out of range. Response:
[]`

This PR updates the proxy to:
- Detect whether the incoming request is a streaming request
(`stream=true`)
- Use `text/event-stream; charset=utf-8` for streaming responses
- Preserve `application/json` for non-streaming responses
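
A minimal sketch of the selection logic, assuming the handler already parses the JSON request body (names are illustrative, not the exact diff):

```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/v1/completions")
async def completions(request: Request):
    body = await request.json()
    # "stream" defaults to False when the field is absent.
    is_streaming = bool(body.get("stream", False))

    async def forward_chunks():
        # Placeholder for forwarding chunks from the backend servers.
        yield b""

    media_type = (
        "text/event-stream; charset=utf-8" if is_streaming else "application/json"
    )
    return StreamingResponse(forward_chunks(), media_type=media_type)
```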

This aligns the proxy behavior with native vLLM prefill/decoder servers
and the OpenAI-compatible streaming API contract.
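
With the corrected header, OpenAI-compatible clients can consume the stream end to end. For illustration, a hedged example using the `openai` Python SDK (v1-style API) against the proxy; the model name and endpoint are taken from the test steps below and may differ in your setup:

```python
from openai import OpenAI

# Point the SDK at the proxy; the API key value is a placeholder,
# since vLLM's OpenAI-compatible server does not validate it by default.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="test",
    prompt="hello",
    max_tokens=3,
    stream=True,
)

# Each chunk arrives as an SSE event; with the old application/json
# header some clients buffered the response or failed to parse it.
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```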

Fixes incorrect streaming response headers that prevented proper
real-time token delivery.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

This change was tested manually using a disaggregated prefill + decode
setup with the proxy server.

### Test Steps

1. Start prefiller and decoder vLLM servers:
```bash
vllm serve --host 0.0.0.0 --port 8001 ...
vllm serve --host 0.0.0.0 --port 8002 ...
```

2. Start the proxy server:
```bash
python load_balance_proxy_server_example.py \
  --host 127.0.0.1 --port 8000 \
  --prefiller-hosts 127.0.0.1 --prefiller-ports 8001 \
  --decoder-hosts 127.0.0.1 --decoder-ports 8002
```
3. Send a streaming completion request through the proxy:
```bash
curl -i -X POST http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "test",
        "prompt": "hello",
        "max_tokens": 3,
        "stream": true
      }'
```
4. Verify the following:

- The response header is `Content-Type: text/event-stream; charset=utf-8`
- Tokens are streamed incrementally as SSE `data:` events
- Non-streaming requests still return `application/json`

No automated tests were added because this change affects an example proxy
server and is limited to HTTP response headers. The behavior is directly
verifiable using standard SSE-compatible clients.
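
For a quick manual check outside of curl, a small script like the following (hypothetical, not part of this PR) can confirm both the header and the incremental SSE delivery, assuming the `requests` package is installed:

```python
import requests

payload = {"model": "test", "prompt": "hello", "max_tokens": 3, "stream": True}

with requests.post(
    "http://127.0.0.1:8000/v1/completions", json=payload, stream=True
) as resp:
    # Expect: text/event-stream; charset=utf-8
    print("Content-Type:", resp.headers.get("Content-Type"))
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("data:"):
            print(line)  # each SSE event, ending with "data: [DONE]"
```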

- vLLM version: v0.16.0
- vLLM main: 15d76f74e2

Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
Co-authored-by: zrj026 <zhangrunjiang026@gmail.com>