Sync from v0.13

Commit 5aef6c175a (parent b2ef04d792), 2026-01-19 10:38:50 +08:00
3714 changed files with 854317 additions and 89342 deletions

# EPD Correctness Test
This test verifies that EPD (Encoder-Prefill-Decode) disaggregation produces identical outputs to a baseline single instance.
## What It Tests
- **Baseline**: Single vLLM instance serving a multimodal model
- **EPD (1E+1PD)**: 1 Encoder + 1 Prefill-Decode instance
- **Baseline (1P+1D)**: 1 Prefill + 1 Decode instance
- **EPD (1E+1P+1D)**: 1 Encoder + 1 Prefill + 1 Decode instance
The test ensures that disaggregated encoding produces **identical** outputs to the baseline.
Note that a PD disaggregation setup may currently give slightly different results from a single instance, so the 1P+1D output serves as the baseline for 1E+1P+1D.
Please refer to [Disaggregated Encoder Feature](../../../docs/features/disagg_encoder.md) for a detailed explanation of the EPD feature.
## Files
- `run_epd_correctness_test.sh` - Main test script (starts all instances and runs tests)
- `test_epd_correctness.py` - Python test script (compares outputs)
## Usage
### Multimodal Prompts (Default)
```bash
cd vllm
./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```
This runs the test with actual multimodal (image) prompts.
### Text-Only Prompts
```bash
cd vllm
USE_MM_PROMPTS=0 ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```
This runs a quick test with text-only prompts to verify the setup works.
### Custom Configuration
```bash
# Use specific GPUs
GPU_E=0 GPU_PD=1 GPU_P=1 GPU_D=2 bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
# Use specific ports
ENDPOINT_PORT=10001 bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
# Use specific model
MODEL="Qwen/Qwen2.5-VL-3B-Instruct" bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
# Use specific storage path
EC_SHARED_STORAGE_PATH="/tmp/my_ec_cache" bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```
## How It Works
### Step 1: Baseline
1. Start single vLLM instance on GPU
2. Run test prompts (multimodal or text-only)
3. Save outputs to the baseline file (`/tmp/vllm_baseline.txt` by default)
4. Shutdown instance
### Step 2: EPD (1E + 1PD)
1. Clear encoder cache storage
2. Start instances and proxy
3. Run same test prompts
4. Assert outputs match baseline exactly
5. Shutdown instances
### Step 3: EPD (1E + 1P + 1D)
1. Clear encoder cache storage
2. Start instances and proxy
3. Run same test prompts
4. Assert outputs match baseline exactly
5. Shutdown instances
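
Each phase follows the same save-then-compare pattern: the baseline run writes its outputs to a JSON file, and the disagg run reloads that file and diffs against it. A minimal standalone sketch of the round-trip (paths and keys here are illustrative, not the test script itself):

```python
import json
import os
import tempfile


def save_baseline(outputs: dict[str, str], path: str) -> None:
    """Persist prompt -> output pairs, as the baseline run does."""
    with open(path, "w") as f:
        json.dump(outputs, f, indent=4)


def diff_against_baseline(outputs: dict[str, str], path: str) -> list[str]:
    """Return the prompt keys whose output differs from the saved baseline."""
    with open(path) as f:
        baseline = json.load(f)
    assert baseline.keys() == outputs.keys(), "prompt sets must match"
    return [k for k in baseline if baseline[k] != outputs[k]]


path = os.path.join(tempfile.mkdtemp(), "baseline.json")
save_baseline({"Single image query": "A stop sign."}, path)
print(diff_against_baseline({"Single image query": "A stop sign."}, path))  # -> []
```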
## Test Scenarios
### Multimodal Prompts (`--use_mm_prompts`)
Tests encoder cache transfer:
- Single image query
- Multiple images in one request
- Mixed image and text
- Image with detailed questions
### Text-Only Prompts (Python script default)
Quick sanity check:
- Simple text queries
- Text-only explanations
- Verifies proxy routing works
## Expected Behavior
### ✅ Test Passes When
- All disagg outputs match baseline outputs exactly
- No errors during instance startup
- Encoder cache is properly saved and loaded
- Proxy correctly routes requests
### ❌ Test Fails When
- Outputs differ between baseline and disagg
- Server startup fails
- Encoder cache not found (should fall back to local execution)
- Proxy routing errors
## Notes
- The test uses deterministic generation (`temperature=0.0`, `seed=42`)
- Encoder cache should enable exact output reproduction
- Test cleans up all instances and cache files after completion
- Safe to run multiple times (idempotent)
- The PD disagg part is set up with NixlConnector. See `examples/online_serving/disaggregated_encoder/README.md` for details about EPD
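
The deterministic settings map onto the chat-completions request like this (a sketch of the payload only; field names follow the OpenAI chat-completions schema the test uses):

```python
import json


def deterministic_request(model: str, messages: list[dict]) -> dict:
    """Build a chat-completions payload pinned for reproducible outputs."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": 256,   # MAX_OUTPUT_LEN in test_epd_correctness.py
        "temperature": 0.0,  # greedy decoding, no sampling noise
        "seed": 42,          # fixed seed for any remaining nondeterminism
    }


body = deterministic_request(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    [{"role": "user", "content": "What is the capital of France?"}],
)
print(json.dumps(body, indent=2))
```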
## Requirements
- Multiple GPUs (3 for 1E+1P+1D, 2 for 1E+1PD, 1 for baseline)
- 1E+1P+1D can also run on 2 GPUs by assigning E and P to the same GPU
- Multimodal model (e.g., Qwen2.5-VL-3B-Instruct)
- Internet access (to download the vLLM test images)
## Debugging
### Check Logs
Logs and baseline outputs are saved under `/tmp/` by default.
Customize the locations with the `LOG_PATH`, `BASELINE_FILE`, and `BASELINE_PD_FILE` environment variables.
### Check Encoder Cache
```bash
# Verify cache files are created
ls -la $EC_SHARED_STORAGE_PATH/
# Should see directories with mm_hash names
# Each containing encoder_cache.safetensors
```
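
The same check can be scripted. This sketch assumes only the layout shown above (one `mm_hash`-named directory per multimodal input, each containing `encoder_cache.safetensors`):

```python
import os


def find_cached_hashes(storage_path: str) -> list[str]:
    """List mm_hash directories that contain an encoder_cache.safetensors file."""
    if not os.path.isdir(storage_path):
        return []
    return sorted(
        name
        for name in os.listdir(storage_path)
        if os.path.isfile(
            os.path.join(storage_path, name, "encoder_cache.safetensors")
        )
    )


# With the script defaults, the cache lives under /tmp/ec_cache_test.
print(find_cached_hashes(os.environ.get("EC_SHARED_STORAGE_PATH", "/tmp/ec_cache_test")))
```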
### Manual Testing
Run individual components:
```bash
# Baseline only
python test_epd_correctness.py \
--service_url http://localhost:8000 \
--model_name Qwen/Qwen2.5-VL-3B-Instruct \
--mode baseline \
--baseline_file test_output.txt \
--use_mm_prompts
# Disagg only (requires baseline output file!)
python test_epd_correctness.py \
--service_url http://localhost:8000 \
--model_name Qwen/Qwen2.5-VL-3B-Instruct \
--mode disagg \
--baseline_file test_output.txt \
--use_mm_prompts
```

(Binary image file added, 821 KiB — not shown.)
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# EPD (Encoder-Prefill-Decode) Correctness Test
#
# This script tests that EPD disaggregation produces the same outputs as baseline.
# It runs:
# 1. Baseline: Single vLLM instance
# 2. EPD: 1E + 1PD setup
# 3. PD baseline: 1P + 1D disaggregated vLLM instances
# 4. EPD: 1E + 1P + 1D setup
# Uncomment for command tracing:
# set -xe
# Find the git repository root directory
GIT_ROOT=$(git rev-parse --show-toplevel)
# Model to test
MODEL="${MODEL:-Qwen/Qwen2.5-VL-3B-Instruct}"
# Set 1 to use multimodal prompts; else to use text-only
USE_MM_PROMPTS="${USE_MM_PROMPTS:-1}"
MM_FLAG=""
if [ "$USE_MM_PROMPTS" = "1" ]; then
MM_FLAG="--use_mm_prompts"
fi
# GPU configuration
GPU_E="${GPU_E:-0}"
GPU_P="${GPU_P:-1}"
GPU_D="${GPU_D:-2}"
GPU_SINGLE="${GPU_SINGLE:-$GPU_P}"
GPU_PD="${GPU_PD:-$GPU_P}"
# Port
ENCODE_PORT="${ENCODE_PORT:-19534}"
PREFILL_PORT="${PREFILL_PORT:-19535}"
DECODE_PORT="${DECODE_PORT:-19536}"
PREFILL_DECODE_PORT="${PREFILL_DECODE_PORT:-19537}"
ENDPOINT_PORT="${ENDPOINT_PORT:-10001}"
# Storage path for encoder cache
EC_SHARED_STORAGE_PATH="${EC_SHARED_STORAGE_PATH:-/tmp/ec_cache_test}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-600}"
# Output file for baseline comparison and logs
LOG_PATH="${LOG_PATH:-/tmp}"
BASELINE_FILE="${BASELINE_FILE:-/tmp/vllm_baseline.txt}"
BASELINE_PD_FILE="${BASELINE_PD_FILE:-/tmp/vllm_epd_baseline.txt}"
mkdir -p "$LOG_PATH"
# Trap the SIGINT signal (triggered by Ctrl+C)
trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT
# Wait for server to be ready
wait_for_server() {
local port=$1
timeout "$TIMEOUT_SECONDS" bash -c "
until curl -s localhost:${port}/v1/chat/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}
# Cleanup function
cleanup_instances() {
echo "Cleaning up any running vLLM instances..."
pkill -f "vllm serve" || true
pkill -f "disagg_epd_proxy.py" || true
sleep 2
}
# Function to run baseline (single instance)
run_baseline() {
echo "================================"
echo "Running BASELINE (single instance)"
echo "================================"
cleanup_instances
rm -rf "$EC_SHARED_STORAGE_PATH"
local PORT=$ENDPOINT_PORT
# Start baseline instance
echo "Starting baseline instance on GPU $GPU_SINGLE, port $PORT"
CUDA_VISIBLE_DEVICES="$GPU_SINGLE" vllm serve "$MODEL" \
--port $PORT \
--enforce-eager \
--gpu-memory-utilization 0.7 \
--max-num-seqs 128 \
--allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
> $LOG_PATH/baseline.log 2>&1 &
local BASELINE_PID=$!
# Wait for baseline to start
echo "Waiting for baseline instance to start..."
wait_for_server $PORT
curl http://127.0.0.1:$PORT/v1/models
echo ""
# Run test in baseline mode
echo "Running baseline..."
python "${GIT_ROOT}/tests/v1/ec_connector/integration/test_epd_correctness.py" \
--service_url "http://localhost:$PORT" \
--model_name "$MODEL" \
--mode baseline \
--baseline_file "$BASELINE_FILE" \
$MM_FLAG
# Cleanup baseline
echo "Stopping baseline instance..."
kill $BASELINE_PID 2>/dev/null || true
sleep 2
cleanup_instances
}
# Function to run EPD with 1E + 1PD
run_epd_1e_1pd() {
echo "================================"
echo "Running EPD (1E + 1PD)"
echo "================================"
cleanup_instances
rm -rf "$EC_SHARED_STORAGE_PATH"
mkdir -p "$EC_SHARED_STORAGE_PATH"
local ENCODE_PORT=$ENCODE_PORT
local PREFILL_DECODE_PORT=$PREFILL_DECODE_PORT
local PROXY_PORT=$ENDPOINT_PORT
declare -a PIDS=()
# Start encoder instance
echo "Starting encoder instance on GPU $GPU_E, port $ENCODE_PORT"
CUDA_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
--port $ENCODE_PORT \
--enforce-eager \
--gpu-memory-utilization 0.01 \
--enable-request-id-headers \
--no-enable-prefix-caching \
--max-num-batched-tokens 114688 \
--max-num-seqs 128 \
--allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
--ec-transfer-config '{
"ec_connector": "ECExampleConnector",
"ec_role": "ec_producer",
"ec_connector_extra_config": {
"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
}
}' \
> $LOG_PATH/1e1pd_encoder.log 2>&1 &
PIDS+=($!)
# Start prefill+decode instance
echo "Starting PD instance on GPU $GPU_PD, port $PREFILL_DECODE_PORT"
CUDA_VISIBLE_DEVICES="$GPU_PD" vllm serve "$MODEL" \
--port $PREFILL_DECODE_PORT \
--enforce-eager \
--gpu-memory-utilization 0.7 \
--enable-request-id-headers \
--max-num-seqs 128 \
--allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
--ec-transfer-config '{
"ec_connector": "ECExampleConnector",
"ec_role": "ec_consumer",
"ec_connector_extra_config": {
"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
}
}' \
> $LOG_PATH/1e1pd_pd.log 2>&1 &
PIDS+=($!)
# Wait for instances to start
echo "Waiting for encoder instance..."
wait_for_server $ENCODE_PORT
echo "Waiting for PD instance..."
wait_for_server $PREFILL_DECODE_PORT
# Start proxy
echo "Starting EPD proxy on port $PROXY_PORT"
python "${GIT_ROOT}/examples/online_serving/disaggregated_encoder/disagg_epd_proxy.py" \
--host "0.0.0.0" \
--port $PROXY_PORT \
--encode-servers-urls "http://localhost:$ENCODE_PORT" \
--prefill-servers-urls "disable" \
--decode-servers-urls "http://localhost:$PREFILL_DECODE_PORT" \
> $LOG_PATH/1e1pd_proxy.log 2>&1 &
PIDS+=($!)
# Wait for proxy
echo "Waiting for proxy..."
wait_for_server $PROXY_PORT
curl http://127.0.0.1:$PROXY_PORT/v1/models
curl http://127.0.0.1:$PROXY_PORT/health
echo ""
echo "All EPD (1E+1PD) services are up!"
# Run test in disagg mode
echo "Running EPD (1E+1PD) correctness test..."
python "${GIT_ROOT}/tests/v1/ec_connector/integration/test_epd_correctness.py" \
--service_url "http://localhost:$PROXY_PORT" \
--model_name "$MODEL" \
--mode disagg \
--baseline_file "$BASELINE_FILE" \
$MM_FLAG
# Cleanup
echo "✓✓ 1E+1PD Correctness Test finished"
echo "Stopping EPD (1E+1PD) instances..."
for pid in "${PIDS[@]}"; do
kill $pid 2>/dev/null || true
done
sleep 2
cleanup_instances
}
# Function to run baseline for 1E + 1P + 1D (PD disagg)
run_baseline_1p_1d() {
echo "================================"
echo "Running PD BASELINE (1P + 1D)"
echo "================================"
cleanup_instances
rm -rf "$EC_SHARED_STORAGE_PATH"
mkdir -p "$EC_SHARED_STORAGE_PATH"
local PREFILL_PORT=$PREFILL_PORT
local DECODE_PORT=$DECODE_PORT
local PROXY_PORT=$ENDPOINT_PORT
declare -a PIDS=()
# Start prefill instance
echo "Starting prefill instance on GPU $GPU_P, port $PREFILL_PORT"
CUDA_VISIBLE_DEVICES="$GPU_P" \
VLLM_NIXL_SIDE_CHANNEL_PORT=5559 \
vllm serve "$MODEL" \
--port $PREFILL_PORT \
--enforce-eager \
--gpu-memory-utilization 0.7 \
--enable-request-id-headers \
--max-num-seqs 128 \
--allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
--kv-transfer-config '{
"kv_connector": "NixlConnector",
"kv_role": "kv_producer"
}' \
> $LOG_PATH/1p1d_prefill.log 2>&1 &
PIDS+=($!)
# Start decode instance
echo "Starting decode instance on GPU $GPU_D, port $DECODE_PORT"
CUDA_VISIBLE_DEVICES="$GPU_D" \
VLLM_NIXL_SIDE_CHANNEL_PORT=6000 \
vllm serve "$MODEL" \
--port $DECODE_PORT \
--enforce-eager \
--gpu-memory-utilization 0.7 \
--enable-request-id-headers \
--max-num-seqs 128 \
--allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
--kv-transfer-config '{
"kv_connector": "NixlConnector",
"kv_role": "kv_consumer"
}' \
> $LOG_PATH/1p1d_decode.log 2>&1 &
PIDS+=($!)
# Wait for instances to start
echo "Waiting for prefill instance..."
wait_for_server $PREFILL_PORT
echo "Waiting for decode instance..."
wait_for_server $DECODE_PORT
# Start proxy
echo "Starting EPD proxy on port $PROXY_PORT"
python "${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py" \
--host "0.0.0.0" \
--port $PROXY_PORT \
--prefiller-ports $PREFILL_PORT \
--decoder-ports $DECODE_PORT \
> $LOG_PATH/1p1d_proxy.log 2>&1 &
PIDS+=($!)
# Wait for proxy
echo "Waiting for proxy..."
wait_for_server $PROXY_PORT
curl http://127.0.0.1:$PROXY_PORT/healthcheck
echo ""
echo "All PD (1P+1D) services are up!"
# Run test in baseline mode
echo "Running PD disagg baseline..."
python "${GIT_ROOT}/tests/v1/ec_connector/integration/test_epd_correctness.py" \
--service_url "http://localhost:$PROXY_PORT" \
--model_name "$MODEL" \
--mode baseline_pd \
--baseline_file "$BASELINE_PD_FILE" \
$MM_FLAG
# Cleanup
echo "Stopping PD (1P+1D) instances..."
for pid in "${PIDS[@]}"; do
kill $pid 2>/dev/null || true
done
sleep 2
cleanup_instances
}
# Function to run EPD with 1E + 1P + 1D
run_epd_1e_1p_1d() {
echo "================================"
echo "Running EPD (1E + 1P + 1D)"
echo "================================"
cleanup_instances
rm -rf "$EC_SHARED_STORAGE_PATH"
mkdir -p "$EC_SHARED_STORAGE_PATH"
local ENCODE_PORT=$ENCODE_PORT
local PREFILL_PORT=$PREFILL_PORT
local DECODE_PORT=$DECODE_PORT
local PROXY_PORT=$ENDPOINT_PORT
declare -a PIDS=()
# Start encoder instance
echo "Starting encoder instance on GPU $GPU_E, port $ENCODE_PORT"
CUDA_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
--port $ENCODE_PORT \
--enforce-eager \
--gpu-memory-utilization 0.01 \
--enable-request-id-headers \
--no-enable-prefix-caching \
--max-num-batched-tokens 114688 \
--max-num-seqs 128 \
--allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
--ec-transfer-config '{
"ec_connector": "ECExampleConnector",
"ec_role": "ec_producer",
"ec_connector_extra_config": {
"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
}
}' \
> $LOG_PATH/1e1p1d_encoder.log 2>&1 &
PIDS+=($!)
# Start prefill instance
echo "Starting prefill instance on GPU $GPU_P, port $PREFILL_PORT"
CUDA_VISIBLE_DEVICES="$GPU_P" \
VLLM_NIXL_SIDE_CHANNEL_PORT=5559 \
vllm serve "$MODEL" \
--port $PREFILL_PORT \
--enforce-eager \
--gpu-memory-utilization 0.7 \
--enable-request-id-headers \
--max-num-seqs 128 \
--allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
--ec-transfer-config '{
"ec_connector": "ECExampleConnector",
"ec_role": "ec_consumer",
"ec_connector_extra_config": {
"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
}
}' \
--kv-transfer-config '{
"kv_connector": "NixlConnector",
"kv_role": "kv_producer"
}' \
> $LOG_PATH/1e1p1d_prefill.log 2>&1 &
PIDS+=($!)
# Start decode instance
echo "Starting decode instance on GPU $GPU_D, port $DECODE_PORT"
CUDA_VISIBLE_DEVICES="$GPU_D" \
VLLM_NIXL_SIDE_CHANNEL_PORT=6000 \
vllm serve "$MODEL" \
--port $DECODE_PORT \
--enforce-eager \
--gpu-memory-utilization 0.7 \
--enable-request-id-headers \
--max-num-seqs 128 \
--allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
--kv-transfer-config '{
"kv_connector": "NixlConnector",
"kv_role": "kv_consumer"
}' \
> $LOG_PATH/1e1p1d_decode.log 2>&1 &
PIDS+=($!)
# Wait for instances to start
echo "Waiting for encoder instance..."
wait_for_server $ENCODE_PORT
echo "Waiting for prefill instance..."
wait_for_server $PREFILL_PORT
echo "Waiting for decode instance..."
wait_for_server $DECODE_PORT
# Start proxy
echo "Starting EPD proxy on port $PROXY_PORT"
python "${GIT_ROOT}/examples/online_serving/disaggregated_encoder/disagg_epd_proxy.py" \
--host "0.0.0.0" \
--port $PROXY_PORT \
--encode-servers-urls "http://localhost:$ENCODE_PORT" \
--prefill-servers-urls "http://localhost:$PREFILL_PORT" \
--decode-servers-urls "http://localhost:$DECODE_PORT" \
> $LOG_PATH/1e1p1d_proxy.log 2>&1 &
PIDS+=($!)
# Wait for proxy
echo "Waiting for proxy..."
wait_for_server $PROXY_PORT
curl http://127.0.0.1:$PROXY_PORT/v1/models
curl http://127.0.0.1:$PROXY_PORT/health
echo ""
echo "All EPD (1E+1P+1D) services are up!"
# Run test in disagg mode
echo "Running EPD (1E+1P+1D) correctness test..."
python "${GIT_ROOT}/tests/v1/ec_connector/integration/test_epd_correctness.py" \
--service_url "http://localhost:$PROXY_PORT" \
--model_name "$MODEL" \
--mode disagg \
--baseline_file "$BASELINE_PD_FILE" \
$MM_FLAG
# Cleanup
echo "✓✓ 1E+1P+1D Correctness Test finished"
echo "Stopping EPD (1E+1P+1D) instances..."
for pid in "${PIDS[@]}"; do
kill $pid 2>/dev/null || true
done
sleep 2
cleanup_instances
}
# Main execution
echo "================================"
echo "EPD Correctness Test Suite"
echo "Model: $MODEL"
echo "================================"
# Step 1: Run baseline
run_baseline
# Step 2: Test 1E + 1PD
run_epd_1e_1pd
# Step 3: Test baseline 1P + 1D
run_baseline_1p_1d
# Step 4: Test 1E + 1P + 1D
run_epd_1e_1p_1d
# Cleanup output file
rm -f "$BASELINE_FILE"
rm -f "$BASELINE_PD_FILE"
echo "================================"
echo "✓✓ All EPD correctness tests finished!"
echo "================================"

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
EPD Correctness Test

Tests that EPD (Encoder-Prefill-Decode) disaggregation produces the same
outputs as a baseline single instance.

Usage:
    # Baseline mode (saves outputs):
    python test_epd_correctness.py \
        --service_url http://localhost:8000 \
        --model_name Qwen/Qwen2.5-VL-3B-Instruct \
        --mode baseline \
        --baseline_file .vllm_epd_baseline.txt

    # Disagg mode (compares outputs):
    python test_epd_correctness.py \
        --service_url http://localhost:8000 \
        --model_name Qwen/Qwen2.5-VL-3B-Instruct \
        --mode disagg \
        --baseline_file .vllm_epd_baseline.txt
"""

import argparse
import json
import os
import time

import openai
import requests

from vllm.assets.image import ImageAsset
from vllm.multimodal.utils import encode_image_base64

MAX_OUTPUT_LEN = 256

# Sample prompts with multimodal content
image_1 = ImageAsset("stop_sign").pil_image.resize((1280, 720))
image_2 = ImageAsset("cherry_blossom").pil_image.resize((1280, 720))
image_local_path = f"{os.path.dirname(os.path.abspath(__file__))}/hato.jpg"

SAMPLE_PROMPTS_MM: list[dict] = [
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image;base64,{encode_image_base64(image_1)}"
                        },
                    },
                    {"type": "text", "text": "What's in this image?"},
                ],
            }
        ],
        "description": "Single image query",
    },
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image;base64,{encode_image_base64(image_2)}"
                        },
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"file://{image_local_path}"},
                    },
                    {"type": "text", "text": "Describe these 2 images in detail."},
                ],
            }
        ],
        "description": "2 images with detailed query",
    },
]

# Text-only prompts for mixed testing
SAMPLE_PROMPTS_TEXT: list[dict] = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "description": "Simple text-only query",
    },
    {
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "description": "Text-only explanation request",
    },
]


def check_vllm_server(url: str, timeout=5, retries=10) -> bool:
    """Check if the vLLM server is ready.

    Args:
        url: The URL to check (usually /health or /healthcheck endpoint)
        timeout: Timeout in seconds for each request
        retries: Number of retries if the server is not ready

    Returns:
        True if the server is ready, False otherwise
    """
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                print(f"Server is ready at {url}")
                return True
            else:
                print(
                    f"Attempt {attempt + 1}/{retries}: Server returned "
                    f"status code {response.status_code}"
                )
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1}/{retries}: Error connecting: {e}")
        time.sleep(2)  # Wait before retrying
    return False


def run_chat_completion(
    base_url: str,
    model_name: str,
    messages: list,
    max_tokens: int = MAX_OUTPUT_LEN,
) -> str:
    """Run a chat completion request.

    Args:
        base_url: Base URL of the vLLM server
        model_name: Name of the model
        messages: Messages for chat completion
        max_tokens: Maximum tokens to generate

    Returns:
        Generated text content
    """
    client = openai.OpenAI(api_key="EMPTY", base_url=base_url)
    completion = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=max_tokens,
        temperature=0.0,
        seed=42,
    )
    return completion.choices[0].message.content


def main():
    """Main test function."""
    parser = argparse.ArgumentParser(
        description="EPD correctness test - compare disagg vs baseline"
    )
    parser.add_argument(
        "--service_url",
        type=str,
        required=True,
        help="The vLLM service URL (e.g., http://localhost:8000)",
    )
    parser.add_argument(
        "--model_name",
        type=str,
        required=True,
        help="Model name",
    )
    parser.add_argument(
        "--mode",
        type=str,
        default="baseline",
        choices=["baseline", "baseline_pd", "disagg"],
        help="Mode: baseline/baseline_pd (saves outputs) or disagg (compares outputs)",
    )
    parser.add_argument(
        "--baseline_file",
        type=str,
        default=".vllm_epd_baseline.txt",
        help="File to save/load baseline outputs",
    )
    parser.add_argument(
        "--use_mm_prompts",
        action="store_true",
        help="Use multimodal prompts (default: use text-only for quick testing)",
    )
    args = parser.parse_args()

    print(f"Service URL: {args.service_url}")
    print(f"Model: {args.model_name}")
    print(f"Mode: {args.mode}")
    print(f"Output file: {args.baseline_file}")
    print(f"Use MM prompts: {args.use_mm_prompts}")

    # Determine health check endpoint
    if args.mode == "baseline":
        health_check_url = f"{args.service_url}/health"
    elif args.mode == "baseline_pd":
        # The NIXL toy proxy uses /healthcheck
        health_check_url = f"{args.service_url}/healthcheck"
    else:
        # Disagg EPD proxy uses /health
        health_check_url = f"{args.service_url}/health"

    if args.mode == "disagg" and not os.path.exists(args.baseline_file):
        raise ValueError(
            f"In disagg mode, the baseline file {args.baseline_file} "
            "does not exist. Run baseline mode first."
        )

    # Check if server is ready
    if not check_vllm_server(health_check_url):
        raise RuntimeError(f"vLLM server at {args.service_url} is not ready!")

    # Select prompts to use
    if args.use_mm_prompts:
        test_prompts = SAMPLE_PROMPTS_MM
        print("Using multimodal prompts")
    else:
        test_prompts = SAMPLE_PROMPTS_TEXT
        print("Using text-only prompts for quick testing")

    # Run completions
    service_url = f"{args.service_url}/v1"
    output_strs = {}
    for i, prompt_data in enumerate(test_prompts):
        print(
            f"\nRunning prompt {i + 1}/{len(test_prompts)}: "
            f"{prompt_data['description']}"
        )
        output_str = run_chat_completion(
            base_url=service_url,
            model_name=args.model_name,
            messages=prompt_data["messages"],
            max_tokens=MAX_OUTPUT_LEN,
        )
        # Use description as key for comparison
        key = prompt_data["description"]
        output_strs[key] = output_str
        print(f"Output: {output_str}")

    if args.mode in ("baseline", "baseline_pd"):
        # Baseline mode: Save outputs
        print(f"\nSaving baseline outputs to {args.baseline_file}")
        try:
            with open(args.baseline_file, "w") as json_file:
                json.dump(output_strs, json_file, indent=4)
            print("✅ Baseline outputs saved successfully")
        except OSError as e:
            print(f"Error writing to file: {e}")
            raise
    else:
        # Disagg mode: Load and compare outputs
        print(f"\nLoading baseline outputs from {args.baseline_file}")
        try:
            with open(args.baseline_file) as json_file:
                baseline_outputs = json.load(json_file)
        except OSError as e:
            print(f"Error reading from file: {e}")
            raise

        # Verify outputs match
        print("\nComparing disagg outputs with baseline...")
        assert isinstance(baseline_outputs, dict), "Baseline outputs should be a dict"
        assert len(baseline_outputs) == len(output_strs), (
            f"Length mismatch: baseline has {len(baseline_outputs)}, "
            f"disagg has {len(output_strs)}"
        )
        all_match = True
        for key, baseline_output in baseline_outputs.items():
            assert key in output_strs, f"{key} not in disagg outputs"
            disagg_output = output_strs[key]
            if baseline_output == disagg_output:
                print(f"{key}: MATCH")
            else:
                print(f"{key}: MISMATCH")
                print(f"  Baseline: {baseline_output}")
                print(f"  Disagg: {disagg_output}")
                all_match = False
        assert all_match, "❌❌ Disagg outputs do not match baseline! ❌❌"
        print("\n✅ All outputs match! Test PASSED")


if __name__ == "__main__":
    main()