Sync from v0.13
This commit is contained in:
171
tests/v1/ec_connector/integration/README.md
Normal file
@@ -0,0 +1,171 @@
# EPD Correctness Test

This test verifies that EPD (Encoder-Prefill-Decode) disaggregation produces identical outputs to a baseline single instance.

## What It Tests

- **Baseline**: Single vLLM instance serving a multimodal model
- **EPD (1E+1PD)**: 1 Encoder + 1 Prefill-Decode instance
- **Baseline (1P+1D)**: 1 Prefill + 1 Decode instance
- **EPD (1E+1P+1D)**: 1 Encoder + 1 Prefill + 1 Decode instance

The test ensures that disaggregated encoding produces **identical** outputs to the baseline.

Note that the current PD disaggregation setup may give slightly different results from a single instance. Therefore, the result from the 1P+1D run is used as the baseline for 1E+1P+1D.

Please refer to [Disaggregated Encoder Feature](../../../docs/features/disagg_encoder.md) for a detailed explanation of the EPD features.

## Files

- `run_epd_correctness_test.sh` - Main test script (starts all instances and runs tests)
- `test_epd_correctness.py` - Python test script (compares outputs)

## Usage

### Multimodal Prompts (Default)

```bash
cd vllm
./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```

This runs the test with actual multimodal (image) prompts.

### Text-Only Prompts

```bash
cd vllm
USE_MM_PROMPTS=0 ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```

This runs a quick test with text-only prompts to verify the setup works.

### Custom Configuration

```bash
# Use specific GPUs
GPU_E=0 GPU_PD=1 GPU_P=1 GPU_D=2 bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh

# Use a specific port
ENDPOINT_PORT=10001 bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh

# Use a specific model
MODEL="Qwen/Qwen2.5-VL-3B-Instruct" bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh

# Use a specific storage path
EC_SHARED_STORAGE_PATH="/tmp/my_ec_cache" bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
```

## How It Works

### Step 1: Baseline

1. Start a single vLLM instance on one GPU
2. Run test prompts (multimodal or text-only)
3. Save outputs to the baseline file
4. Shut down the instance

### Step 2: EPD (1E + 1PD)

1. Clear encoder cache storage
2. Start instances and proxy
3. Run the same test prompts
4. Assert outputs match the baseline exactly
5. Shut down the instances

### Step 3: PD Baseline (1P + 1D)

1. Clear encoder cache storage
2. Start prefill and decode instances and the proxy
3. Run the same test prompts
4. Save outputs as the PD baseline
5. Shut down the instances

### Step 4: EPD (1E + 1P + 1D)

1. Clear encoder cache storage
2. Start instances and proxy
3. Run the same test prompts
4. Assert outputs match the PD baseline exactly
5. Shut down the instances
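The steps above hinge on a simple file-based protocol: a baseline run serializes its outputs as JSON keyed by prompt description, and a disagg run reloads that file and demands exact string equality. A minimal sketch of that roundtrip (the prompt keys and outputs below are made-up placeholders, not the actual test prompts):

```python
import json
import tempfile

# Baseline phase: save outputs keyed by prompt description.
baseline = {
    "Single image query": "A red stop sign at an intersection.",
    "Simple text-only query": "The capital of France is Paris.",
}
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    json.dump(baseline, f, indent=4)
    path = f.name

# Disagg phase: reload the baseline and require exact matches.
with open(path) as f:
    expected = json.load(f)

disagg = {
    "Single image query": "A red stop sign at an intersection.",
    "Simple text-only query": "The capital of France is Paris.",
}
assert expected.keys() == disagg.keys(), "prompt sets differ"
mismatches = [k for k in expected if expected[k] != disagg[k]]
assert not mismatches, f"Outputs diverged for: {mismatches}"
print("all outputs match")
```

Because the comparison is plain string equality, any nondeterminism anywhere in the pipeline shows up as a hard failure rather than a silent quality regression.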
## Test Scenarios

### Multimodal Prompts (`--use_mm_prompts`)

Tests encoder cache transfer:

- Single image query
- Multiple images in one request
- Mixed image and text
- Image with detailed questions

### Text-Only Prompts (default)

Quick sanity check:

- Simple text queries
- Text-only explanations
- Verifies proxy routing works

## Expected Behavior

### ✅ Test Passes When

- All disagg outputs match the baseline outputs exactly
- No errors occur during instance startup
- Encoder cache is properly saved and loaded
- Proxy correctly routes requests

### ❌ Test Fails When

- Outputs differ between baseline and disagg
- Server startup fails
- Encoder cache is not found (should fall back to local execution)
- Proxy routing errors occur

## Notes

- The test uses deterministic generation (`temperature=0.0`, `seed=42`)
- The encoder cache should enable exact output reproduction
- The test cleans up all instances and cache files after completion
- Safe to run multiple times (idempotent)
- The PD disagg part is set up with NixlConnector. Please read the details about EPD in `examples/online_serving/disaggregated_encoder/README.md`
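Greedy decoding is what makes exact comparison viable: with `temperature=0.0` every token is the argmax, and `seed=42` pins down any residual sampler state. As a sketch, each request the test sends amounts to a standard chat-completions payload along these lines (the message text here is illustrative):

```python
# Sampling parameters the test uses for reproducible outputs.
payload = {
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 256,   # MAX_OUTPUT_LEN in the test script
    "temperature": 0.0,  # greedy decoding: no sampling randomness
    "seed": 42,          # pins any remaining sampler nondeterminism
}
```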
## Requirements

- Multiple GPUs (3 for 1E+1P+1D, 2 for 1E+1PD, 1 for baseline)
  - 1E+1P+1D can now run on 2 GPUs by assigning E and P to the same GPU
- A multimodal model (e.g., Qwen2.5-VL-3B-Instruct)
- Internet access (for downloading the vLLM test images)

## Debugging

### Check Logs

Logs and the baseline output file are saved in `/tmp/` by default.
The locations can be customized via environment variables.

### Check Encoder Cache

```bash
# Verify cache files are created
ls -la $EC_SHARED_STORAGE_PATH/

# Should see directories with mm_hash names,
# each containing encoder_cache.safetensors
```
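To script that check, a small helper can walk the cache directory and flag entries missing their tensor file. A sketch assuming the layout described above (one directory per mm_hash, each holding `encoder_cache.safetensors`); the directory names below are made up for the demonstration:

```python
import os
import tempfile

def missing_cache_entries(root: str) -> list[str]:
    """Return mm_hash directories that lack encoder_cache.safetensors."""
    missing = []
    for name in sorted(os.listdir(root)):
        entry = os.path.join(root, name)
        if os.path.isdir(entry) and not os.path.exists(
            os.path.join(entry, "encoder_cache.safetensors")
        ):
            missing.append(name)
    return missing

# Simulate a cache with one complete and one incomplete entry.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "hash_aaa"))
open(os.path.join(root, "hash_aaa", "encoder_cache.safetensors"), "w").close()
os.makedirs(os.path.join(root, "hash_bbb"))

print(missing_cache_entries(root))  # only the incomplete entry is reported
```

An empty result after a test run means every encoder output the producer wrote was fully materialized on the shared storage path.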
### Manual Testing

Run individual components:

```bash
# Baseline only
python test_epd_correctness.py \
    --service_url http://localhost:8000 \
    --model_name Qwen/Qwen2.5-VL-3B-Instruct \
    --mode baseline \
    --baseline_file test_output.txt \
    --use_mm_prompts

# Disagg only (requires the baseline output file!)
python test_epd_correctness.py \
    --service_url http://localhost:8000 \
    --model_name Qwen/Qwen2.5-VL-3B-Instruct \
    --mode disagg \
    --baseline_file test_output.txt \
    --use_mm_prompts
```
BIN
tests/v1/ec_connector/integration/hato.jpg
Normal file
476
tests/v1/ec_connector/integration/run_epd_correctness_test.sh
Normal file
@@ -0,0 +1,476 @@
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
#
# EPD (Encoder-Prefill-Decode) Correctness Test
#
# This script tests that EPD disaggregation produces the same outputs as baseline.
# It runs:
# 1. Baseline: Single vLLM instance
# 2. EPD: 1E + 1PD setup
# 3. Baseline for (E + P + D): 1P + 1D vLLM instances disagg
# 4. EPD: 1E + 1P + 1D setup

# For GPU usage

# set -xe

# Find the git repository root directory
GIT_ROOT=$(git rev-parse --show-toplevel)

# Model to test
MODEL="${MODEL:-Qwen/Qwen2.5-VL-3B-Instruct}"

# Set to 1 to use multimodal prompts; otherwise text-only prompts are used
USE_MM_PROMPTS="${USE_MM_PROMPTS:-1}"
MM_FLAG=""
if [ "$USE_MM_PROMPTS" = "1" ]; then
    MM_FLAG="--use_mm_prompts"
fi

# GPU configuration
GPU_E="${GPU_E:-0}"
GPU_P="${GPU_P:-1}"
GPU_D="${GPU_D:-2}"
GPU_SINGLE="${GPU_SINGLE:-$GPU_P}"
GPU_PD="${GPU_PD:-$GPU_P}"

# Ports
ENCODE_PORT="${ENCODE_PORT:-19534}"
PREFILL_PORT="${PREFILL_PORT:-19535}"
DECODE_PORT="${DECODE_PORT:-19536}"
PREFILL_DECODE_PORT="${PREFILL_DECODE_PORT:-19537}"
ENDPOINT_PORT="${ENDPOINT_PORT:-10001}"

# Storage path for encoder cache
EC_SHARED_STORAGE_PATH="${EC_SHARED_STORAGE_PATH:-/tmp/ec_cache_test}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-600}"

# Output files for baseline comparison and logs
LOG_PATH="${LOOG_PATH:-/tmp}"
BASELINE_FILE="${BASELINE_FILE:-/tmp/vllm_baseline.txt}"
BASELINE_PD_FILE="${BASELINE_PD_FILE:-/tmp/vllm_epd_baseline.txt}"

mkdir -p "$LOG_PATH"

# Kill background jobs on Ctrl+C (SIGINT), SIGTERM, and exit
trap 'kill $(jobs -pr)' SIGINT SIGTERM EXIT

# Wait for server to be ready
wait_for_server() {
    local port=$1
    timeout "$TIMEOUT_SECONDS" bash -c "
        until curl -s localhost:${port}/v1/chat/completions > /dev/null; do
            sleep 1
        done" && return 0 || return 1
}

# Cleanup function
cleanup_instances() {
    echo "Cleaning up any running vLLM instances..."
    pkill -f "vllm serve" || true
    pkill -f "disagg_epd_proxy.py" || true
    sleep 2
}

# Run the baseline (single instance)
run_baseline() {
    echo "================================"
    echo "Running BASELINE (single instance)"
    echo "================================"

    cleanup_instances
    rm -rf "$EC_SHARED_STORAGE_PATH"

    local PORT=$ENDPOINT_PORT

    # Start baseline instance
    echo "Starting baseline instance on GPU $GPU_SINGLE, port $PORT"
    CUDA_VISIBLE_DEVICES="$GPU_SINGLE" vllm serve "$MODEL" \
        --port $PORT \
        --enforce-eager \
        --gpu-memory-utilization 0.7 \
        --max-num-seqs 128 \
        --allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
        > $LOG_PATH/baseline.log 2>&1 &

    local BASELINE_PID=$!

    # Wait for baseline to start
    echo "Waiting for baseline instance to start..."
    wait_for_server $PORT

    curl http://127.0.0.1:$PORT/v1/models
    echo ""

    # Run the test in baseline mode
    echo "Running baseline..."

    python "${GIT_ROOT}/tests/v1/ec_connector/integration/test_epd_correctness.py" \
        --service_url "http://localhost:$PORT" \
        --model_name "$MODEL" \
        --mode baseline \
        --baseline_file "$BASELINE_FILE" \
        $MM_FLAG

    # Cleanup baseline
    echo "Stopping baseline instance..."
    kill $BASELINE_PID 2>/dev/null || true
    sleep 2
    cleanup_instances
}

# Run EPD with 1E + 1PD
run_epd_1e_1pd() {
    echo "================================"
    echo "Running EPD (1E + 1PD)"
    echo "================================"

    cleanup_instances
    rm -rf "$EC_SHARED_STORAGE_PATH"
    mkdir -p "$EC_SHARED_STORAGE_PATH"

    local ENCODE_PORT=$ENCODE_PORT
    local PREFILL_DECODE_PORT=$PREFILL_DECODE_PORT
    local PROXY_PORT=$ENDPOINT_PORT

    declare -a PIDS=()

    # Start encoder instance
    echo "Starting encoder instance on GPU $GPU_E, port $ENCODE_PORT"
    CUDA_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
        --port $ENCODE_PORT \
        --enforce-eager \
        --gpu-memory-utilization 0.01 \
        --enable-request-id-headers \
        --no-enable-prefix-caching \
        --max-num-batched-tokens 114688 \
        --max-num-seqs 128 \
        --allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
        --ec-transfer-config '{
            "ec_connector": "ECExampleConnector",
            "ec_role": "ec_producer",
            "ec_connector_extra_config": {
                "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
            }
        }' \
        > $LOG_PATH/1e1pd_encoder.log 2>&1 &
    PIDS+=($!)

    # Start prefill+decode instance
    echo "Starting PD instance on GPU $GPU_PD, port $PREFILL_DECODE_PORT"
    CUDA_VISIBLE_DEVICES="$GPU_PD" vllm serve "$MODEL" \
        --port $PREFILL_DECODE_PORT \
        --enforce-eager \
        --gpu-memory-utilization 0.7 \
        --enable-request-id-headers \
        --max-num-seqs 128 \
        --allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
        --ec-transfer-config '{
            "ec_connector": "ECExampleConnector",
            "ec_role": "ec_consumer",
            "ec_connector_extra_config": {
                "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
            }
        }' \
        > $LOG_PATH/1e1pd_pd.log 2>&1 &
    PIDS+=($!)

    # Wait for instances to start
    echo "Waiting for encoder instance..."
    wait_for_server $ENCODE_PORT
    echo "Waiting for PD instance..."
    wait_for_server $PREFILL_DECODE_PORT

    # Start proxy
    echo "Starting EPD proxy on port $PROXY_PORT"
    python "${GIT_ROOT}/examples/online_serving/disaggregated_encoder/disagg_epd_proxy.py" \
        --host "0.0.0.0" \
        --port $PROXY_PORT \
        --encode-servers-urls "http://localhost:$ENCODE_PORT" \
        --prefill-servers-urls "disable" \
        --decode-servers-urls "http://localhost:$PREFILL_DECODE_PORT" \
        > $LOG_PATH/1e1pd_proxy.log 2>&1 &
    PIDS+=($!)

    # Wait for proxy
    echo "Waiting for proxy..."
    wait_for_server $PROXY_PORT

    curl http://127.0.0.1:$PROXY_PORT/v1/models
    curl http://127.0.0.1:$PROXY_PORT/health
    echo ""

    echo "All EPD (1E+1PD) services are up!"

    # Run the test in disagg mode
    echo "Running EPD (1E+1PD) correctness test..."

    python "${GIT_ROOT}/tests/v1/ec_connector/integration/test_epd_correctness.py" \
        --service_url "http://localhost:$PROXY_PORT" \
        --model_name "$MODEL" \
        --mode disagg \
        --baseline_file "$BASELINE_FILE" \
        $MM_FLAG

    # Cleanup
    echo "✓✓ 1E+1PD Correctness Test finished"
    echo "Stopping EPD (1E+1PD) instances..."
    for pid in "${PIDS[@]}"; do
        kill $pid 2>/dev/null || true
    done
    sleep 2
    cleanup_instances
}

# Run the baseline for 1E + 1P + 1D (PD disagg)
run_baseline_1p_1d() {
    echo "================================"
    echo "Running PD BASELINE (1P + 1D)"
    echo "================================"

    cleanup_instances
    rm -rf "$EC_SHARED_STORAGE_PATH"
    mkdir -p "$EC_SHARED_STORAGE_PATH"

    local PREFILL_PORT=$PREFILL_PORT
    local DECODE_PORT=$DECODE_PORT
    local PROXY_PORT=$ENDPOINT_PORT

    declare -a PIDS=()

    # Start prefill instance
    echo "Starting prefill instance on GPU $GPU_P, port $PREFILL_PORT"
    CUDA_VISIBLE_DEVICES="$GPU_P" \
    VLLM_NIXL_SIDE_CHANNEL_PORT=5559 \
    vllm serve "$MODEL" \
        --port $PREFILL_PORT \
        --enforce-eager \
        --gpu-memory-utilization 0.7 \
        --enable-request-id-headers \
        --max-num-seqs 128 \
        --allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
        --kv-transfer-config '{
            "kv_connector": "NixlConnector",
            "kv_role": "kv_producer"
        }' \
        > $LOG_PATH/1p1d_prefill.log 2>&1 &
    PIDS+=($!)

    # Start decode instance
    echo "Starting decode instance on GPU $GPU_D, port $DECODE_PORT"
    CUDA_VISIBLE_DEVICES="$GPU_D" \
    VLLM_NIXL_SIDE_CHANNEL_PORT=6000 \
    vllm serve "$MODEL" \
        --port $DECODE_PORT \
        --enforce-eager \
        --gpu-memory-utilization 0.7 \
        --enable-request-id-headers \
        --max-num-seqs 128 \
        --allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
        --kv-transfer-config '{
            "kv_connector": "NixlConnector",
            "kv_role": "kv_consumer"
        }' \
        > $LOG_PATH/1p1d_decode.log 2>&1 &
    PIDS+=($!)

    # Wait for instances to start
    echo "Waiting for prefill instance..."
    wait_for_server $PREFILL_PORT
    echo "Waiting for decode instance..."
    wait_for_server $DECODE_PORT

    # Start proxy
    echo "Starting PD proxy on port $PROXY_PORT"
    python "${GIT_ROOT}/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py" \
        --host "0.0.0.0" \
        --port $PROXY_PORT \
        --prefiller-ports $PREFILL_PORT \
        --decoder-ports $DECODE_PORT \
        > $LOG_PATH/1p1d_proxy.log 2>&1 &
    PIDS+=($!)

    # Wait for proxy
    echo "Waiting for proxy..."
    wait_for_server $PROXY_PORT

    curl http://127.0.0.1:$PROXY_PORT/healthcheck
    echo ""

    echo "All PD (1P+1D) services are up!"

    # Run the test in baseline mode
    echo "Running PD disagg baseline..."

    python "${GIT_ROOT}/tests/v1/ec_connector/integration/test_epd_correctness.py" \
        --service_url "http://localhost:$PROXY_PORT" \
        --model_name "$MODEL" \
        --mode baseline_pd \
        --baseline_file "$BASELINE_PD_FILE" \
        $MM_FLAG

    # Cleanup
    echo "Stopping PD (1P+1D) instances..."
    for pid in "${PIDS[@]}"; do
        kill $pid 2>/dev/null || true
    done
    sleep 2
    cleanup_instances
}

# Run EPD with 1E + 1P + 1D
run_epd_1e_1p_1d() {
    echo "================================"
    echo "Running EPD (1E + 1P + 1D)"
    echo "================================"

    cleanup_instances
    rm -rf "$EC_SHARED_STORAGE_PATH"
    mkdir -p "$EC_SHARED_STORAGE_PATH"

    local ENCODE_PORT=$ENCODE_PORT
    local PREFILL_PORT=$PREFILL_PORT
    local DECODE_PORT=$DECODE_PORT
    local PROXY_PORT=$ENDPOINT_PORT

    declare -a PIDS=()

    # Start encoder instance
    echo "Starting encoder instance on GPU $GPU_E, port $ENCODE_PORT"
    CUDA_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
        --port $ENCODE_PORT \
        --enforce-eager \
        --gpu-memory-utilization 0.01 \
        --enable-request-id-headers \
        --no-enable-prefix-caching \
        --max-num-batched-tokens 114688 \
        --max-num-seqs 128 \
        --allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
        --ec-transfer-config '{
            "ec_connector": "ECExampleConnector",
            "ec_role": "ec_producer",
            "ec_connector_extra_config": {
                "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
            }
        }' \
        > $LOG_PATH/1e1p1d_encoder.log 2>&1 &
    PIDS+=($!)

    # Start prefill instance
    echo "Starting prefill instance on GPU $GPU_P, port $PREFILL_PORT"
    CUDA_VISIBLE_DEVICES="$GPU_P" \
    VLLM_NIXL_SIDE_CHANNEL_PORT=5559 \
    vllm serve "$MODEL" \
        --port $PREFILL_PORT \
        --enforce-eager \
        --gpu-memory-utilization 0.7 \
        --enable-request-id-headers \
        --max-num-seqs 128 \
        --allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
        --ec-transfer-config '{
            "ec_connector": "ECExampleConnector",
            "ec_role": "ec_consumer",
            "ec_connector_extra_config": {
                "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
            }
        }' \
        --kv-transfer-config '{
            "kv_connector": "NixlConnector",
            "kv_role": "kv_producer"
        }' \
        > $LOG_PATH/1e1p1d_prefill.log 2>&1 &
    PIDS+=($!)

    # Start decode instance
    echo "Starting decode instance on GPU $GPU_D, port $DECODE_PORT"
    CUDA_VISIBLE_DEVICES="$GPU_D" \
    VLLM_NIXL_SIDE_CHANNEL_PORT=6000 \
    vllm serve "$MODEL" \
        --port $DECODE_PORT \
        --enforce-eager \
        --gpu-memory-utilization 0.7 \
        --enable-request-id-headers \
        --max-num-seqs 128 \
        --allowed-local-media-path ${GIT_ROOT}/tests/v1/ec_connector/integration \
        --kv-transfer-config '{
            "kv_connector": "NixlConnector",
            "kv_role": "kv_consumer"
        }' \
        > $LOG_PATH/1e1p1d_decode.log 2>&1 &
    PIDS+=($!)

    # Wait for instances to start
    echo "Waiting for encoder instance..."
    wait_for_server $ENCODE_PORT
    echo "Waiting for prefill instance..."
    wait_for_server $PREFILL_PORT
    echo "Waiting for decode instance..."
    wait_for_server $DECODE_PORT

    # Start proxy
    echo "Starting EPD proxy on port $PROXY_PORT"
    python "${GIT_ROOT}/examples/online_serving/disaggregated_encoder/disagg_epd_proxy.py" \
        --host "0.0.0.0" \
        --port $PROXY_PORT \
        --encode-servers-urls "http://localhost:$ENCODE_PORT" \
        --prefill-servers-urls "http://localhost:$PREFILL_PORT" \
        --decode-servers-urls "http://localhost:$DECODE_PORT" \
        > $LOG_PATH/1e1p1d_proxy.log 2>&1 &
    PIDS+=($!)

    # Wait for proxy
    echo "Waiting for proxy..."
    wait_for_server $PROXY_PORT

    curl http://127.0.0.1:$PROXY_PORT/v1/models
    curl http://127.0.0.1:$PROXY_PORT/health
    echo ""

    echo "All EPD (1E+1P+1D) services are up!"

    # Run the test in disagg mode
    echo "Running EPD (1E+1P+1D) correctness test..."

    python "${GIT_ROOT}/tests/v1/ec_connector/integration/test_epd_correctness.py" \
        --service_url "http://localhost:$PROXY_PORT" \
        --model_name "$MODEL" \
        --mode disagg \
        --baseline_file "$BASELINE_PD_FILE" \
        $MM_FLAG

    # Cleanup
    echo "✓✓ 1E+1P+1D Correctness Test finished"
    echo "Stopping EPD (1E+1P+1D) instances..."
    for pid in "${PIDS[@]}"; do
        kill $pid 2>/dev/null || true
    done
    sleep 2
    cleanup_instances
}

# Main execution
echo "================================"
echo "EPD Correctness Test Suite"
echo "Model: $MODEL"
echo "================================"

# Step 1: Run the baseline
run_baseline

# Step 2: Test 1E + 1PD
run_epd_1e_1pd

# Step 3: Run the 1P + 1D baseline
run_baseline_1p_1d

# Step 4: Test 1E + 1P + 1D
run_epd_1e_1p_1d

# Clean up output files
rm -f "$BASELINE_FILE"
rm -f "$BASELINE_PD_FILE"

echo "================================"
echo "✓✓ All EPD correctness tests finished!"
echo "================================"
304
tests/v1/ec_connector/integration/test_epd_correctness.py
Normal file
@@ -0,0 +1,304 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
EPD Correctness Test

Tests that EPD (Encoder-Prefill-Decode) disaggregation produces the same
outputs as a baseline single instance.

Usage:
    # Baseline mode (saves outputs):
    python test_epd_correctness.py \
        --service_url http://localhost:8000 \
        --model_name Qwen/Qwen2.5-VL-3B-Instruct \
        --mode baseline \
        --baseline_file .vllm_epd_baseline.txt

    # Disagg mode (compares outputs):
    python test_epd_correctness.py \
        --service_url http://localhost:8000 \
        --model_name Qwen/Qwen2.5-VL-3B-Instruct \
        --mode disagg \
        --baseline_file .vllm_epd_baseline.txt
"""

import argparse
import json
import os
import time

import openai
import requests

from vllm.assets.image import ImageAsset
from vllm.multimodal.utils import encode_image_base64

MAX_OUTPUT_LEN = 256

# Sample prompts with multimodal content
image_1 = ImageAsset("stop_sign").pil_image.resize((1280, 720))
image_2 = ImageAsset("cherry_blossom").pil_image.resize((1280, 720))

image_local_path = f"{os.path.dirname(os.path.abspath(__file__))}/hato.jpg"

SAMPLE_PROMPTS_MM: list[dict] = [
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image;base64,{encode_image_base64(image_1)}"
                        },
                    },
                    {"type": "text", "text": "What's in this image?"},
                ],
            }
        ],
        "description": "Single image query",
    },
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image;base64,{encode_image_base64(image_2)}"
                        },
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"file://{image_local_path}"},
                    },
                    {"type": "text", "text": "Describe these 2 images in detail."},
                ],
            }
        ],
        "description": "2 images with detailed query",
    },
]

# Text-only prompts for mixed testing
SAMPLE_PROMPTS_TEXT: list[dict] = [
    {
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "description": "Simple text-only query",
    },
    {
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "description": "Text-only explanation request",
    },
]


def check_vllm_server(url: str, timeout=5, retries=10) -> bool:
    """Check if the vLLM server is ready.

    Args:
        url: The URL to check (usually the /health or /healthcheck endpoint)
        timeout: Timeout in seconds for each request
        retries: Number of retries if the server is not ready

    Returns:
        True if the server is ready, False otherwise
    """
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                print(f"Server is ready at {url}")
                return True
            else:
                print(
                    f"Attempt {attempt + 1}/{retries}: Server returned "
                    f"status code {response.status_code}"
                )
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1}/{retries}: Error connecting: {e}")
        time.sleep(2)  # Wait before retrying
    return False


def run_chat_completion(
    base_url: str,
    model_name: str,
    messages: list,
    max_tokens: int = MAX_OUTPUT_LEN,
) -> str:
    """Run a chat completion request.

    Args:
        base_url: Base URL of the vLLM server
        model_name: Name of the model
        messages: Messages for chat completion
        max_tokens: Maximum tokens to generate

    Returns:
        Generated text content
    """
    client = openai.OpenAI(api_key="EMPTY", base_url=base_url)

    completion = client.chat.completions.create(
        model=model_name,
        messages=messages,
        max_tokens=max_tokens,
        temperature=0.0,
        seed=42,
    )

    return completion.choices[0].message.content


def main():
    """Main test function."""
    parser = argparse.ArgumentParser(
        description="EPD correctness test - compare disagg vs baseline"
    )

    parser.add_argument(
        "--service_url",
        type=str,
        required=True,
        help="The vLLM service URL (e.g., http://localhost:8000)",
    )

    parser.add_argument(
        "--model_name",
        type=str,
        required=True,
        help="Model name",
    )

    parser.add_argument(
        "--mode",
        type=str,
        default="baseline",
        choices=["baseline", "baseline_pd", "disagg"],
        help="Mode: baseline/baseline_pd (saves outputs) or disagg (compares outputs)",
    )

    parser.add_argument(
        "--baseline_file",
        type=str,
        default=".vllm_epd_baseline.txt",
        help="File to save/load baseline outputs",
    )

    parser.add_argument(
        "--use_mm_prompts",
        action="store_true",
        help="Use multimodal prompts (default: use text-only for quick testing)",
    )

    args = parser.parse_args()

    print(f"Service URL: {args.service_url}")
    print(f"Model: {args.model_name}")
    print(f"Mode: {args.mode}")
    print(f"Output file: {args.baseline_file}")
    print(f"Use MM prompts: {args.use_mm_prompts}")

    # Determine the health check endpoint
    if args.mode == "baseline":
        health_check_url = f"{args.service_url}/health"
    elif args.mode == "baseline_pd":
        # The NIXL toy proxy uses /healthcheck
        health_check_url = f"{args.service_url}/healthcheck"
    else:
        # The disagg EPD proxy uses /health
        health_check_url = f"{args.service_url}/health"
        if not os.path.exists(args.baseline_file):
            raise ValueError(
                f"In disagg mode, the output file {args.baseline_file} from "
                "baseline does not exist. Run baseline mode first."
            )

    # Check if the server is ready
    if not check_vllm_server(health_check_url):
        raise RuntimeError(f"vLLM server at {args.service_url} is not ready!")

    # Select prompts to use
    if args.use_mm_prompts:
        test_prompts = SAMPLE_PROMPTS_MM
        print("Using multimodal prompts")
    else:
        test_prompts = SAMPLE_PROMPTS_TEXT
        print("Using text-only prompts for quick testing")

    # Run completions
    service_url = f"{args.service_url}/v1"
    output_strs = {}

    for i, prompt_data in enumerate(test_prompts):
        print(
            f"\nRunning prompt {i + 1}/{len(test_prompts)}: "
            f"{prompt_data['description']}"
        )

        output_str = run_chat_completion(
            base_url=service_url,
            model_name=args.model_name,
            messages=prompt_data["messages"],
            max_tokens=MAX_OUTPUT_LEN,
        )

        # Use the description as the key for comparison
        key = prompt_data["description"]
        output_strs[key] = output_str
        print(f"Output: {output_str}")

    if args.mode in ("baseline", "baseline_pd"):
        # Baseline mode: save outputs
        print(f"\nSaving baseline outputs to {args.baseline_file}")
        try:
            with open(args.baseline_file, "w") as json_file:
                json.dump(output_strs, json_file, indent=4)
            print("✅ Baseline outputs saved successfully")
        except OSError as e:
            print(f"Error writing to file: {e}")
            raise
    else:
        # Disagg mode: load and compare outputs
        print(f"\nLoading baseline outputs from {args.baseline_file}")
        baseline_outputs = None
        try:
            with open(args.baseline_file) as json_file:
                baseline_outputs = json.load(json_file)
        except OSError as e:
            print(f"Error reading from file: {e}")
            raise

        # Verify the outputs match
        print("\nComparing disagg outputs with baseline...")
        assert isinstance(baseline_outputs, dict), "Baseline outputs should be a dict"
        assert len(baseline_outputs) == len(output_strs), (
            f"Length mismatch: baseline has {len(baseline_outputs)}, "
            f"disagg has {len(output_strs)}"
        )

        all_match = True
        for key, baseline_output in baseline_outputs.items():
            assert key in output_strs, f"{key} not in disagg outputs"

            disagg_output = output_strs[key]
            if baseline_output == disagg_output:
                print(f"✅ {key}: MATCH")
            else:
                print(f"❌ {key}: MISMATCH")
                print(f"  Baseline: {baseline_output}")
                print(f"  Disagg:   {disagg_output}")
                all_match = False

        assert all_match, "❌❌ Disagg outputs do not match baseline! ❌❌"
        print("\n✅ All outputs match! Test PASSED")


if __name__ == "__main__":
    main()