benchmark: enhance configurable multimodal benchmarking in bench_serving (#9812)
Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
This commit is contained in:
@@ -59,15 +59,16 @@ Select with `--dataset-name`:
|
|||||||
- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len`
|
- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len`
|
||||||
- `random`: random text lengths; sampled from ShareGPT token space
|
- `random`: random text lengths; sampled from ShareGPT token space
|
||||||
- `random-ids`: random token ids (can lead to gibberish)
|
- `random-ids`: random token ids (can lead to gibberish)
|
||||||
- `random-image`: generates random images and wraps them in chat messages; supports custom resolutions via 'heightxwidth' format
|
- `image`: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types
|
||||||
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
|
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
|
||||||
- `mmmu`: samples from MMMU (Math split) and includes images
|
- `mmmu`: samples from MMMU (Math split) and includes images
|
||||||
|
|
||||||
Common dataset flags:
|
Common dataset flags:
|
||||||
|
|
||||||
- `--num-prompts N`: number of requests
|
- `--num-prompts N`: number of requests
|
||||||
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/random-image
|
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/image
|
||||||
- `--random-image-num-images`, `--random-image-resolution`: for random-image dataset (supports presets 1080p/720p/360p or custom 'heightxwidth' format)
|
- `--image-count`: Number of images per request (for `image` dataset).
|
||||||
|
|
||||||
- `--apply-chat-template`: apply tokenizer chat template when constructing prompts
|
- `--apply-chat-template`: apply tokenizer chat template when constructing prompts
|
||||||
- `--dataset-path PATH`: file path for ShareGPT json; if blank and missing, it will be downloaded and cached
|
- `--dataset-path PATH`: file path for ShareGPT json; if blank and missing, it will be downloaded and cached
|
||||||
|
|
||||||
@@ -79,14 +80,16 @@ Generated Shared Prefix flags (for `generated-shared-prefix`):
|
|||||||
- `--gsp-question-len`
|
- `--gsp-question-len`
|
||||||
- `--gsp-output-len`
|
- `--gsp-output-len`
|
||||||
|
|
||||||
Random Image dataset flags (for `random-image`):
|
Image dataset flags (for `image`):
|
||||||
|
|
||||||
- `--random-image-num-images`: Number of images per request
|
- `--image-count`: Number of images per request
|
||||||
- `--random-image-resolution`: Image resolution; supports presets (1080p, 720p, 360p) or custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
|
- `--image-resolution`: Image resolution; supports presets (4k, 1080p, 720p, 360p) or custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
|
||||||
|
- `--image-format`: Image format (jpeg or png)
|
||||||
|
- `--image-content`: Image content type (random or blank)
|
||||||
|
|
||||||
### Examples
|
### Examples
|
||||||
|
|
||||||
1. To benchmark random-image dataset with 3 images per request, 500 prompts, 512 input length, and 512 output length, you can run:
|
1. To benchmark image dataset with 3 images per request, 500 prompts, 512 input length, and 512 output length, you can run:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
|
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
|
||||||
@@ -95,10 +98,10 @@ python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disabl
|
|||||||
```bash
|
```bash
|
||||||
python -m sglang.bench_serving \
|
python -m sglang.bench_serving \
|
||||||
--backend sglang-oai-chat \
|
--backend sglang-oai-chat \
|
||||||
--dataset-name random-image \
|
--dataset-name image \
|
||||||
--num-prompts 500 \
|
--num-prompts 500 \
|
||||||
--random-image-num-images 3 \
|
--image-count 3 \
|
||||||
--random-image-resolution 720p \
|
--image-resolution 720p \
|
||||||
--random-input-len 512 \
|
--random-input-len 512 \
|
||||||
--random-output-len 512
|
--random-output-len 512
|
||||||
```
|
```
|
||||||
@@ -159,9 +162,10 @@ The script will add `Authorization: Bearer $OPENAI_API_KEY` automatically for Op
|
|||||||
Printed after each run:
|
Printed after each run:
|
||||||
|
|
||||||
- Request throughput (req/s)
|
- Request throughput (req/s)
|
||||||
- Input token throughput (tok/s)
|
- Input token throughput (tok/s) - includes both text and vision tokens
|
||||||
- Output token throughput (tok/s)
|
- Output token throughput (tok/s)
|
||||||
- Total token throughput (tok/s)
|
- Total token throughput (tok/s) - includes both text and vision tokens
|
||||||
|
- Total input text tokens and Total input vision tokens - per-modality breakdown
|
||||||
- Concurrency: aggregate time of all requests divided by wall time
|
- Concurrency: aggregate time of all requests divided by wall time
|
||||||
- End-to-End Latency (ms): mean/median/std/p99 per-request total latency
|
- End-to-End Latency (ms): mean/median/std/p99 per-request total latency
|
||||||
- Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode
|
- Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode
|
||||||
@@ -227,31 +231,48 @@ python3 -m sglang.bench_serving \
|
|||||||
--apply-chat-template
|
--apply-chat-template
|
||||||
```
|
```
|
||||||
|
|
||||||
4) Random images (VLM) with chat template:
|
4) Images (VLM) with chat template:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m sglang.bench_serving \
|
python3 -m sglang.bench_serving \
|
||||||
--backend sglang \
|
--backend sglang \
|
||||||
--host 127.0.0.1 --port 30000 \
|
--host 127.0.0.1 --port 30000 \
|
||||||
--model your-vlm-model \
|
--model your-vlm-model \
|
||||||
--dataset-name random-image \
|
--dataset-name image \
|
||||||
--random-image-num-images 2 \
|
--image-count 2 \
|
||||||
--random-image-resolution 720p \
|
--image-resolution 720p \
|
||||||
--random-input-len 128 --random-output-len 256 \
|
--random-input-len 128 --random-output-len 256 \
|
||||||
--num-prompts 200 \
|
--num-prompts 200 \
|
||||||
--apply-chat-template
|
--apply-chat-template
|
||||||
```
|
```
|
||||||
|
|
||||||
4a) Random images with custom resolution:
|
4a) Images with custom resolution:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python3 -m sglang.bench_serving \
|
python3 -m sglang.bench_serving \
|
||||||
--backend sglang \
|
--backend sglang \
|
||||||
--host 127.0.0.1 --port 30000 \
|
--host 127.0.0.1 --port 30000 \
|
||||||
--model your-vlm-model \
|
--model your-vlm-model \
|
||||||
--dataset-name random-image \
|
--dataset-name image \
|
||||||
--random-image-num-images 1 \
|
--image-count 1 \
|
||||||
--random-image-resolution 512x768 \
|
--image-resolution 512x768 \
|
||||||
|
--random-input-len 64 --random-output-len 128 \
|
||||||
|
--num-prompts 100 \
|
||||||
|
--apply-chat-template
|
||||||
|
```
|
||||||
|
|
||||||
|
4b) 1080p images with PNG format and blank content:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python3 -m sglang.bench_serving \
|
||||||
|
--backend sglang \
|
||||||
|
--host 127.0.0.1 --port 30000 \
|
||||||
|
--model your-vlm-model \
|
||||||
|
--dataset-name image \
|
||||||
|
--image-count 1 \
|
||||||
|
--image-resolution 1080p \
|
||||||
|
--image-format png \
|
||||||
|
--image-content blank \
|
||||||
--random-input-len 64 --random-output-len 128 \
|
--random-input-len 64 --random-output-len 128 \
|
||||||
--num-prompts 100 \
|
--num-prompts 100 \
|
||||||
--apply-chat-template
|
--apply-chat-template
|
||||||
@@ -325,7 +346,7 @@ python3 -m sglang.bench_serving \
|
|||||||
- All requests failed: verify `--backend`, server URL/port, `--model`, and authentication. Check warmup errors printed by the script.
|
- All requests failed: verify `--backend`, server URL/port, `--model`, and authentication. Check warmup errors printed by the script.
|
||||||
- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
|
- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
|
||||||
- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
|
- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
|
||||||
- Random-image/MMMU datasets: ensure you installed extra deps (`pillow`, `datasets`, `pybase64`).
|
- Image/MMMU datasets: ensure you installed extra deps (`pillow`, `datasets`, `pybase64`).
|
||||||
- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.
|
- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.
|
||||||
|
|
||||||
### Notes
|
### Notes
|
||||||
|
|||||||
@@ -35,6 +35,7 @@ import numpy as np
|
|||||||
import requests
|
import requests
|
||||||
from tqdm.asyncio import tqdm
|
from tqdm.asyncio import tqdm
|
||||||
from transformers import (
|
from transformers import (
|
||||||
|
AutoProcessor,
|
||||||
AutoTokenizer,
|
AutoTokenizer,
|
||||||
PreTrainedTokenizer,
|
PreTrainedTokenizer,
|
||||||
PreTrainedTokenizerBase,
|
PreTrainedTokenizerBase,
|
||||||
@@ -327,8 +328,9 @@ async def async_request_openai_chat_completions(
|
|||||||
"model": request_func_input.model,
|
"model": request_func_input.model,
|
||||||
"messages": messages,
|
"messages": messages,
|
||||||
"temperature": 0.0,
|
"temperature": 0.0,
|
||||||
"max_tokens": request_func_input.output_len,
|
"max_completion_tokens": request_func_input.output_len,
|
||||||
"stream": not args.disable_stream,
|
"stream": not args.disable_stream,
|
||||||
|
"ignore_eos": not args.disable_ignore_eos,
|
||||||
**request_func_input.extra_request_body,
|
**request_func_input.extra_request_body,
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -659,7 +661,30 @@ def get_tokenizer(
|
|||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
def get_dataset(args, tokenizer):
|
def get_processor(
|
||||||
|
pretrained_model_name_or_path: str,
|
||||||
|
) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
|
||||||
|
assert (
|
||||||
|
pretrained_model_name_or_path is not None
|
||||||
|
and pretrained_model_name_or_path != ""
|
||||||
|
)
|
||||||
|
if pretrained_model_name_or_path.endswith(
|
||||||
|
".json"
|
||||||
|
) or pretrained_model_name_or_path.endswith(".model"):
|
||||||
|
from sglang.srt.hf_transformers_utils import get_processor
|
||||||
|
|
||||||
|
return get_processor(pretrained_model_name_or_path)
|
||||||
|
|
||||||
|
if pretrained_model_name_or_path is not None and not os.path.exists(
|
||||||
|
pretrained_model_name_or_path
|
||||||
|
):
|
||||||
|
pretrained_model_name_or_path = get_model(pretrained_model_name_or_path)
|
||||||
|
return AutoProcessor.from_pretrained(
|
||||||
|
pretrained_model_name_or_path, trust_remote_code=True
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def get_dataset(args, tokenizer, model_id=None):
|
||||||
tokenize_prompt = getattr(args, "tokenize_prompt", False)
|
tokenize_prompt = getattr(args, "tokenize_prompt", False)
|
||||||
if args.dataset_name == "sharegpt":
|
if args.dataset_name == "sharegpt":
|
||||||
assert not tokenize_prompt
|
assert not tokenize_prompt
|
||||||
@@ -672,7 +697,7 @@ def get_dataset(args, tokenizer):
|
|||||||
prompt_suffix=args.prompt_suffix,
|
prompt_suffix=args.prompt_suffix,
|
||||||
apply_chat_template=args.apply_chat_template,
|
apply_chat_template=args.apply_chat_template,
|
||||||
)
|
)
|
||||||
elif args.dataset_name.startswith("random") and args.dataset_name != "random-image":
|
elif args.dataset_name.startswith("random"):
|
||||||
input_requests = sample_random_requests(
|
input_requests = sample_random_requests(
|
||||||
input_len=args.random_input_len,
|
input_len=args.random_input_len,
|
||||||
output_len=args.random_output_len,
|
output_len=args.random_output_len,
|
||||||
@@ -683,17 +708,18 @@ def get_dataset(args, tokenizer):
|
|||||||
random_sample=args.dataset_name == "random",
|
random_sample=args.dataset_name == "random",
|
||||||
return_text=not tokenize_prompt,
|
return_text=not tokenize_prompt,
|
||||||
)
|
)
|
||||||
elif args.dataset_name == "random-image":
|
elif args.dataset_name == "image":
|
||||||
assert not tokenize_prompt, "random-image does not support --tokenize-prompt"
|
processor = get_processor(model_id)
|
||||||
input_requests = sample_random_image_requests(
|
input_requests = sample_image_requests(
|
||||||
num_requests=args.num_prompts,
|
num_requests=args.num_prompts,
|
||||||
num_images=args.random_image_num_images,
|
image_count=args.image_count,
|
||||||
input_len=args.random_input_len,
|
input_len=args.random_input_len,
|
||||||
output_len=args.random_output_len,
|
output_len=args.random_output_len,
|
||||||
range_ratio=args.random_range_ratio,
|
range_ratio=args.random_range_ratio,
|
||||||
tokenizer=tokenizer,
|
processor=processor,
|
||||||
apply_chat_template=args.apply_chat_template,
|
image_content=args.image_content,
|
||||||
image_resolution=args.random_image_resolution,
|
image_format=args.image_format,
|
||||||
|
image_resolution=args.image_resolution,
|
||||||
)
|
)
|
||||||
elif args.dataset_name == "generated-shared-prefix":
|
elif args.dataset_name == "generated-shared-prefix":
|
||||||
assert not tokenize_prompt
|
assert not tokenize_prompt
|
||||||
@@ -707,12 +733,11 @@ def get_dataset(args, tokenizer):
|
|||||||
args=args,
|
args=args,
|
||||||
)
|
)
|
||||||
elif args.dataset_name == "mmmu":
|
elif args.dataset_name == "mmmu":
|
||||||
assert not tokenize_prompt
|
processor = get_processor(model_id)
|
||||||
input_requests = sample_mmmu_requests(
|
input_requests = sample_mmmu_requests(
|
||||||
num_requests=args.num_prompts,
|
num_requests=args.num_prompts,
|
||||||
tokenizer=tokenizer,
|
processor=processor,
|
||||||
fixed_output_len=args.random_output_len,
|
fixed_output_len=args.random_output_len,
|
||||||
apply_chat_template=args.apply_chat_template,
|
|
||||||
random_sample=True,
|
random_sample=True,
|
||||||
)
|
)
|
||||||
elif args.dataset_name == "mooncake":
|
elif args.dataset_name == "mooncake":
|
||||||
@@ -757,6 +782,8 @@ ASYNC_REQUEST_FUNCS = {
|
|||||||
class BenchmarkMetrics:
|
class BenchmarkMetrics:
|
||||||
completed: int
|
completed: int
|
||||||
total_input: int
|
total_input: int
|
||||||
|
total_input_text: int
|
||||||
|
total_input_vision: int
|
||||||
total_output: int
|
total_output: int
|
||||||
total_output_retokenized: int
|
total_output_retokenized: int
|
||||||
request_throughput: float
|
request_throughput: float
|
||||||
@@ -850,9 +877,17 @@ class DatasetRow:
|
|||||||
prompt: str
|
prompt: str
|
||||||
prompt_len: int
|
prompt_len: int
|
||||||
output_len: int
|
output_len: int
|
||||||
|
text_prompt_len: Optional[int] = None
|
||||||
|
vision_prompt_len: Optional[int] = None
|
||||||
image_data: Optional[List[str]] = None
|
image_data: Optional[List[str]] = None
|
||||||
timestamp: Optional[float] = None
|
timestamp: Optional[float] = None
|
||||||
|
|
||||||
|
def __post_init__(self):
|
||||||
|
if self.text_prompt_len is None:
|
||||||
|
self.text_prompt_len = self.prompt_len
|
||||||
|
if self.vision_prompt_len is None:
|
||||||
|
self.vision_prompt_len = 0
|
||||||
|
|
||||||
|
|
||||||
async def get_mooncake_request_over_time(
|
async def get_mooncake_request_over_time(
|
||||||
input_requests: List[Dict],
|
input_requests: List[Dict],
|
||||||
@@ -929,9 +964,8 @@ async def get_mooncake_request_over_time(
|
|||||||
|
|
||||||
def sample_mmmu_requests(
|
def sample_mmmu_requests(
|
||||||
num_requests: int,
|
num_requests: int,
|
||||||
tokenizer: PreTrainedTokenizerBase,
|
processor: AutoProcessor,
|
||||||
fixed_output_len: Optional[int] = None,
|
fixed_output_len: Optional[int] = None,
|
||||||
apply_chat_template: bool = True,
|
|
||||||
random_sample: bool = True,
|
random_sample: bool = True,
|
||||||
) -> List[DatasetRow]:
|
) -> List[DatasetRow]:
|
||||||
"""
|
"""
|
||||||
@@ -1010,54 +1044,12 @@ def sample_mmmu_requests(
|
|||||||
question = example.get("question")
|
question = example.get("question")
|
||||||
|
|
||||||
# Construct the prompt
|
# Construct the prompt
|
||||||
prompt = f"Question: {question}\n\nAnswer: "
|
text_prompt = f"Question: {question}\n\nAnswer: "
|
||||||
if apply_chat_template:
|
|
||||||
try:
|
|
||||||
is_phi4_multimodal = (
|
|
||||||
"phi-4-multimodal" in tokenizer.name_or_path.lower()
|
|
||||||
)
|
|
||||||
if is_phi4_multimodal:
|
|
||||||
# <|endoftext10|> is the image token used in the phi-4-multimodal model.
|
|
||||||
content = prompt.replace("image 1", "<|endoftext10|>")
|
|
||||||
else:
|
|
||||||
content = [
|
|
||||||
{
|
|
||||||
"type": "image_url",
|
|
||||||
"image_url": {"url": image_data},
|
|
||||||
},
|
|
||||||
{"type": "text", "text": prompt},
|
|
||||||
]
|
|
||||||
prompt = tokenizer.apply_chat_template(
|
|
||||||
[
|
|
||||||
{
|
|
||||||
"role": "user",
|
|
||||||
"content": content,
|
|
||||||
}
|
|
||||||
],
|
|
||||||
add_generation_prompt=True,
|
|
||||||
tokenize=False,
|
|
||||||
)
|
|
||||||
except Exception as e:
|
|
||||||
# Note (Xinyuan): This is a workaround for an issue where some tokenizers do not support content as a list. (e.g. InternVL)
|
|
||||||
print(
|
|
||||||
f"Error applying chat template: {e}, fallback to <image> tag"
|
|
||||||
)
|
|
||||||
prompt = f"<image>{prompt}"
|
|
||||||
|
|
||||||
# Calculate token lengths for text only (without image data)
|
|
||||||
prompt_token_ids = tokenizer.encode(prompt)
|
|
||||||
prompt_len = len(prompt_token_ids)
|
|
||||||
|
|
||||||
output_len = fixed_output_len if fixed_output_len is not None else 256
|
output_len = fixed_output_len if fixed_output_len is not None else 256
|
||||||
|
data_row = create_mm_data_row(
|
||||||
filtered_dataset.append(
|
text_prompt, [image], [image_data], output_len, processor
|
||||||
DatasetRow(
|
|
||||||
prompt=prompt,
|
|
||||||
prompt_len=prompt_len,
|
|
||||||
output_len=output_len,
|
|
||||||
image_data=[image_data],
|
|
||||||
)
|
|
||||||
)
|
)
|
||||||
|
filtered_dataset.append(data_row)
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f"Error processing example {i}: {e}")
|
print(f"Error processing example {i}: {e}")
|
||||||
@@ -1145,7 +1137,11 @@ def sample_sharegpt_requests(
|
|||||||
continue
|
continue
|
||||||
|
|
||||||
filtered_dataset.append(
|
filtered_dataset.append(
|
||||||
DatasetRow(prompt=prompt, prompt_len=prompt_len, output_len=output_len)
|
DatasetRow(
|
||||||
|
prompt=prompt,
|
||||||
|
prompt_len=prompt_len,
|
||||||
|
output_len=output_len,
|
||||||
|
)
|
||||||
)
|
)
|
||||||
|
|
||||||
print(f"#Input tokens: {np.sum([x.prompt_len for x in filtered_dataset])}")
|
print(f"#Input tokens: {np.sum([x.prompt_len for x in filtered_dataset])}")
|
||||||
@@ -1256,7 +1252,7 @@ def sample_random_requests(
|
|||||||
return input_requests
|
return input_requests
|
||||||
|
|
||||||
|
|
||||||
def parse_random_image_resolution(image_resolution: str) -> Tuple[int, int]:
|
def parse_image_resolution(image_resolution: str) -> Tuple[int, int]:
|
||||||
"""Parse image resolution into (width, height).
|
"""Parse image resolution into (width, height).
|
||||||
|
|
||||||
Supports presets '1080p', '720p', '360p' and custom 'heightxwidth' format
|
Supports presets '1080p', '720p', '360p' and custom 'heightxwidth' format
|
||||||
@@ -1281,24 +1277,79 @@ def parse_random_image_resolution(image_resolution: str) -> Tuple[int, int]:
|
|||||||
return (width, height)
|
return (width, height)
|
||||||
|
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
f"Unsupported random-image resolution: {image_resolution}. "
|
f"Unsupported image resolution: {image_resolution}. "
|
||||||
"Choose from 4k, 1080p, 720p, 360p, or provide custom 'heightxwidth' (e.g., 1080x1920)."
|
"Choose from 4k, 1080p, 720p, 360p, or provide custom 'heightxwidth' (e.g., 1080x1920)."
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
def sample_random_image_requests(
|
def create_mm_data_row(text_prompt, images, images_base64, output_len, processor):
|
||||||
|
try:
|
||||||
|
content_items = [
|
||||||
|
{"type": "image_url", "image_url": {"url": img_url}}
|
||||||
|
for img_url in images_base64
|
||||||
|
]
|
||||||
|
content_items.append({"type": "text", "text": text_prompt})
|
||||||
|
prompt_str = processor.apply_chat_template(
|
||||||
|
[{"role": "user", "content": content_items}],
|
||||||
|
add_generation_prompt=True,
|
||||||
|
tokenize=False,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
# Some tokenizers do not support list content; fall back to a placeholder in the text
|
||||||
|
prompt_str = f"<image>{text_prompt}"
|
||||||
|
|
||||||
|
# Calculate total tokens (text + vision)
|
||||||
|
prompt_len = processor(
|
||||||
|
text=[prompt_str],
|
||||||
|
images=images,
|
||||||
|
padding=False,
|
||||||
|
return_tensors="pt",
|
||||||
|
)["input_ids"].numel()
|
||||||
|
|
||||||
|
# Calculate text-only tokens
|
||||||
|
try:
|
||||||
|
# Create text-only version of the prompt
|
||||||
|
text_only_prompt = processor.apply_chat_template(
|
||||||
|
[{"role": "user", "content": text_prompt}],
|
||||||
|
add_generation_prompt=True,
|
||||||
|
tokenize=False,
|
||||||
|
)
|
||||||
|
text_prompt_len = processor(
|
||||||
|
text=[text_only_prompt],
|
||||||
|
padding=False,
|
||||||
|
return_tensors="pt",
|
||||||
|
)["input_ids"].numel()
|
||||||
|
except Exception:
|
||||||
|
# Fallback: just tokenize the text prompt directly
|
||||||
|
text_prompt_len = len(processor.tokenizer.encode(text_prompt))
|
||||||
|
|
||||||
|
# Vision tokens = total tokens - text tokens
|
||||||
|
vision_prompt_len = prompt_len - text_prompt_len
|
||||||
|
|
||||||
|
return DatasetRow(
|
||||||
|
prompt=text_prompt,
|
||||||
|
prompt_len=prompt_len,
|
||||||
|
output_len=output_len,
|
||||||
|
text_prompt_len=text_prompt_len,
|
||||||
|
vision_prompt_len=vision_prompt_len,
|
||||||
|
image_data=images_base64,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def sample_image_requests(
|
||||||
num_requests: int,
|
num_requests: int,
|
||||||
num_images: int,
|
image_count: int,
|
||||||
input_len: int,
|
input_len: int,
|
||||||
output_len: int,
|
output_len: int,
|
||||||
range_ratio: float,
|
range_ratio: float,
|
||||||
tokenizer: PreTrainedTokenizerBase,
|
processor: AutoProcessor,
|
||||||
apply_chat_template: bool = True,
|
image_content: str,
|
||||||
image_resolution: str = "1080p",
|
image_format: str,
|
||||||
|
image_resolution: str,
|
||||||
) -> List[DatasetRow]:
|
) -> List[DatasetRow]:
|
||||||
"""Generate requests with random images.
|
"""Generate requests with images.
|
||||||
|
|
||||||
- Each request includes ``num_images`` random images.
|
- Each request includes ``image_count`` images.
|
||||||
- Supported resolutions: 4k (3840x2160), 1080p (1920x1080), 720p (1280x720), 360p (640x360),
|
- Supported resolutions: 4k (3840x2160), 1080p (1920x1080), 720p (1280x720), 360p (640x360),
|
||||||
or custom 'heightxwidth' (e.g., 1080x1920).
|
or custom 'heightxwidth' (e.g., 1080x1920).
|
||||||
- Text lengths follow the 'random' dataset sampling rule. ``prompt_len``
|
- Text lengths follow the 'random' dataset sampling rule. ``prompt_len``
|
||||||
@@ -1313,12 +1364,12 @@ def sample_random_image_requests(
|
|||||||
) from e
|
) from e
|
||||||
|
|
||||||
# Parse resolution (supports presets and 'heightxwidth')
|
# Parse resolution (supports presets and 'heightxwidth')
|
||||||
width, height = parse_random_image_resolution(image_resolution)
|
width, height = parse_image_resolution(image_resolution)
|
||||||
|
|
||||||
# Check for potentially problematic combinations and warn user
|
# Check for potentially problematic combinations and warn user
|
||||||
if width * height >= 1920 * 1080 and num_images * num_requests >= 100:
|
if width * height >= 1920 * 1080 and image_count * num_requests >= 100:
|
||||||
warnings.warn(
|
warnings.warn(
|
||||||
f"High resolution ({width}x{height}) with {num_images * num_requests} total images "
|
f"High resolution ({width}x{height}) with {image_count * num_requests} total images "
|
||||||
f"may take a long time. Consider reducing resolution or image count.",
|
f"may take a long time. Consider reducing resolution or image count.",
|
||||||
UserWarning,
|
UserWarning,
|
||||||
stacklevel=2,
|
stacklevel=2,
|
||||||
@@ -1332,53 +1383,50 @@ def sample_random_image_requests(
|
|||||||
int(output_len * range_ratio), output_len + 1, size=num_requests
|
int(output_len * range_ratio), output_len + 1, size=num_requests
|
||||||
)
|
)
|
||||||
|
|
||||||
def _gen_random_image_data_uri(width: int = width, height: int = height) -> str:
|
def _gen_random_image_data_uri(
|
||||||
arr = (np.random.rand(height, width, 3) * 255).astype(np.uint8)
|
width: int = width, height: int = height
|
||||||
img = Image.fromarray(arr, mode="RGB")
|
) -> (Image, str, int):
|
||||||
|
if image_content == "blank":
|
||||||
|
# Generate blank white image
|
||||||
|
arr = np.full((height, width, 3), 255, dtype=np.uint8)
|
||||||
|
else:
|
||||||
|
# Generate random colored image
|
||||||
|
arr = (np.random.rand(height, width, 3) * 255).astype(np.uint8)
|
||||||
|
img = Image.fromarray(arr)
|
||||||
buf = io.BytesIO()
|
buf = io.BytesIO()
|
||||||
img.save(buf, format="JPEG", quality=85)
|
img.save(buf, format=image_format, quality=85)
|
||||||
encoded = pybase64.b64encode(buf.getvalue()).decode("utf-8")
|
encoded = pybase64.b64encode(buf.getvalue()).decode("utf-8")
|
||||||
return f"data:image/jpeg;base64,{encoded}"
|
image_data = f"data:image/{image_format};base64,{encoded}"
|
||||||
|
image_bytes = len(image_data.encode("utf-8"))
|
||||||
|
return img, image_data, image_bytes
|
||||||
|
|
||||||
dataset: List[DatasetRow] = []
|
dataset: List[DatasetRow] = []
|
||||||
|
total_image_bytes = 0
|
||||||
for i in range(num_requests):
|
for i in range(num_requests):
|
||||||
# Generate text prompt
|
# Generate text prompt
|
||||||
text_prompt = gen_prompt(tokenizer, int(input_lens[i]))
|
text_prompt = gen_prompt(processor.tokenizer, int(input_lens[i]))
|
||||||
|
|
||||||
# Generate image list
|
# Generate image list
|
||||||
images = [_gen_random_image_data_uri() for _ in range(num_images)]
|
images, images_base64, images_bytes = zip(
|
||||||
|
*[_gen_random_image_data_uri() for _ in range(image_count)]
|
||||||
prompt_str = text_prompt
|
|
||||||
if apply_chat_template:
|
|
||||||
try:
|
|
||||||
content_items = [
|
|
||||||
{"type": "image_url", "image_url": {"url": img_url}}
|
|
||||||
for img_url in images
|
|
||||||
]
|
|
||||||
content_items.append({"type": "text", "text": text_prompt})
|
|
||||||
prompt_str = tokenizer.apply_chat_template(
|
|
||||||
[{"role": "user", "content": content_items}],
|
|
||||||
add_generation_prompt=True,
|
|
||||||
tokenize=False,
|
|
||||||
)
|
|
||||||
except Exception:
|
|
||||||
# Some tokenizers do not support list content; fall back to a placeholder in the text
|
|
||||||
prompt_str = f"<image>{text_prompt}"
|
|
||||||
|
|
||||||
prompt_token_ids = tokenizer.encode(prompt_str)
|
|
||||||
prompt_token_len = len(prompt_token_ids)
|
|
||||||
|
|
||||||
dataset.append(
|
|
||||||
DatasetRow(
|
|
||||||
prompt=prompt_str,
|
|
||||||
prompt_len=prompt_token_len,
|
|
||||||
output_len=int(output_lens[i]),
|
|
||||||
image_data=images,
|
|
||||||
)
|
|
||||||
)
|
)
|
||||||
|
total_image_bytes += sum(list(images_bytes))
|
||||||
|
|
||||||
|
data_row = create_mm_data_row(
|
||||||
|
text_prompt,
|
||||||
|
list(images),
|
||||||
|
list(images_base64),
|
||||||
|
int(output_lens[i]),
|
||||||
|
processor,
|
||||||
|
)
|
||||||
|
|
||||||
|
dataset.append(data_row)
|
||||||
|
|
||||||
print(f"#Input tokens: {np.sum([x.prompt_len for x in dataset])}")
|
print(f"#Input tokens: {np.sum([x.prompt_len for x in dataset])}")
|
||||||
print(f"#Output tokens: {np.sum([x.output_len for x in dataset])}")
|
print(f"#Output tokens: {np.sum([x.output_len for x in dataset])}")
|
||||||
|
print(
|
||||||
|
f"\nCreated {len(dataset)} {image_content} {image_format} images with average {total_image_bytes//num_requests} bytes per request"
|
||||||
|
)
|
||||||
return dataset
|
return dataset
|
||||||
|
|
||||||
|
|
||||||
@@ -1450,7 +1498,9 @@ def sample_generated_shared_prefix_requests(
|
|||||||
|
|
||||||
input_requests.append(
|
input_requests.append(
|
||||||
DatasetRow(
|
DatasetRow(
|
||||||
prompt=full_prompt, prompt_len=prompt_len, output_len=output_len
|
prompt=full_prompt,
|
||||||
|
prompt_len=prompt_len,
|
||||||
|
output_len=output_len,
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
total_input_tokens += prompt_len
|
total_input_tokens += prompt_len
|
||||||
@@ -1532,6 +1582,8 @@ def calculate_metrics(
|
|||||||
output_lens: List[int] = []
|
output_lens: List[int] = []
|
||||||
retokenized_output_lens: List[int] = []
|
retokenized_output_lens: List[int] = []
|
||||||
total_input = 0
|
total_input = 0
|
||||||
|
total_input_text = 0
|
||||||
|
total_input_vision = 0
|
||||||
completed = 0
|
completed = 0
|
||||||
itls: List[float] = []
|
itls: List[float] = []
|
||||||
tpots: List[float] = []
|
tpots: List[float] = []
|
||||||
@@ -1545,7 +1597,9 @@ def calculate_metrics(
|
|||||||
tokenizer.encode(outputs[i].generated_text, add_special_tokens=False)
|
tokenizer.encode(outputs[i].generated_text, add_special_tokens=False)
|
||||||
)
|
)
|
||||||
retokenized_output_lens.append(retokenized_output_len)
|
retokenized_output_lens.append(retokenized_output_len)
|
||||||
total_input += outputs[i].prompt_len
|
total_input += input_requests[i].prompt_len
|
||||||
|
total_input_text += input_requests[i].text_prompt_len
|
||||||
|
total_input_vision += input_requests[i].vision_prompt_len
|
||||||
if output_len > 1:
|
if output_len > 1:
|
||||||
tpots.append((outputs[i].latency - outputs[i].ttft) / (output_len - 1))
|
tpots.append((outputs[i].latency - outputs[i].ttft) / (output_len - 1))
|
||||||
itls += outputs[i].itl
|
itls += outputs[i].itl
|
||||||
@@ -1567,6 +1621,8 @@ def calculate_metrics(
|
|||||||
metrics = BenchmarkMetrics(
|
metrics = BenchmarkMetrics(
|
||||||
completed=completed,
|
completed=completed,
|
||||||
total_input=total_input,
|
total_input=total_input,
|
||||||
|
total_input_text=total_input_text,
|
||||||
|
total_input_vision=total_input_vision,
|
||||||
total_output=sum(output_lens),
|
total_output=sum(output_lens),
|
||||||
total_output_retokenized=sum(retokenized_output_lens),
|
total_output_retokenized=sum(retokenized_output_lens),
|
||||||
request_throughput=completed / dur_s,
|
request_throughput=completed / dur_s,
|
||||||
@@ -1815,6 +1871,10 @@ async def benchmark(
|
|||||||
print("{:<40} {:<10}".format("Successful requests:", metrics.completed))
|
print("{:<40} {:<10}".format("Successful requests:", metrics.completed))
|
||||||
print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration))
|
print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration))
|
||||||
print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input))
|
print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input))
|
||||||
|
print("{:<40} {:<10}".format("Total input text tokens:", metrics.total_input_text))
|
||||||
|
print(
|
||||||
|
"{:<40} {:<10}".format("Total input vision tokens:", metrics.total_input_vision)
|
||||||
|
)
|
||||||
print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output))
|
print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output))
|
||||||
print(
|
print(
|
||||||
"{:<40} {:<10}".format(
|
"{:<40} {:<10}".format(
|
||||||
@@ -1884,6 +1944,8 @@ async def benchmark(
|
|||||||
"duration": benchmark_duration,
|
"duration": benchmark_duration,
|
||||||
"completed": metrics.completed,
|
"completed": metrics.completed,
|
||||||
"total_input_tokens": metrics.total_input,
|
"total_input_tokens": metrics.total_input,
|
||||||
|
"total_input_text_tokens": metrics.total_input_text,
|
||||||
|
"total_input_vision_tokens": metrics.total_input_vision,
|
||||||
"total_output_tokens": metrics.total_output,
|
"total_output_tokens": metrics.total_output,
|
||||||
"total_output_tokens_retokenized": metrics.total_output_retokenized,
|
"total_output_tokens_retokenized": metrics.total_output_retokenized,
|
||||||
"request_throughput": metrics.request_throughput,
|
"request_throughput": metrics.request_throughput,
|
||||||
@@ -1918,11 +1980,11 @@ async def benchmark(
|
|||||||
output_file_name = args.output_file
|
output_file_name = args.output_file
|
||||||
else:
|
else:
|
||||||
now = datetime.now().strftime("%m%d")
|
now = datetime.now().strftime("%m%d")
|
||||||
if args.dataset_name == "random-image":
|
if args.dataset_name == "image":
|
||||||
output_file_name = (
|
output_file_name = (
|
||||||
f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_"
|
f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_"
|
||||||
f"{args.random_output_len}_{args.random_image_num_images}imgs_"
|
f"{args.random_output_len}_{args.image_count}imgs_"
|
||||||
f"{args.random_image_resolution}.jsonl"
|
f"{args.image_resolution}.jsonl"
|
||||||
)
|
)
|
||||||
elif args.dataset_name.startswith("random"):
|
elif args.dataset_name.startswith("random"):
|
||||||
output_file_name = f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_{args.random_output_len}.jsonl"
|
output_file_name = f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_{args.random_output_len}.jsonl"
|
||||||
@@ -2098,6 +2160,12 @@ def run_benchmark(args_: argparse.Namespace):
|
|||||||
"Because when the tokenizer counts the output tokens, if there is gibberish, it might count incorrectly.\n"
|
"Because when the tokenizer counts the output tokens, if there is gibberish, it might count incorrectly.\n"
|
||||||
)
|
)
|
||||||
|
|
||||||
|
if args.dataset_name in ["image", "mmmu"]:
|
||||||
|
args.apply_chat_template = True
|
||||||
|
assert (
|
||||||
|
not args.tokenize_prompt
|
||||||
|
), "`--tokenize-prompt` not compatible with image dataset"
|
||||||
|
|
||||||
print(f"{args}\n")
|
print(f"{args}\n")
|
||||||
|
|
||||||
# Read dataset
|
# Read dataset
|
||||||
@@ -2105,7 +2173,7 @@ def run_benchmark(args_: argparse.Namespace):
|
|||||||
model_id = args.model
|
model_id = args.model
|
||||||
tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
|
tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
|
||||||
tokenizer = get_tokenizer(tokenizer_id)
|
tokenizer = get_tokenizer(tokenizer_id)
|
||||||
input_requests = get_dataset(args, tokenizer)
|
input_requests = get_dataset(args, tokenizer, model_id)
|
||||||
|
|
||||||
# compatible with SimpleNamespace
|
# compatible with SimpleNamespace
|
||||||
if not hasattr(args, "flush_cache"):
|
if not hasattr(args, "flush_cache"):
|
||||||
@@ -2186,7 +2254,7 @@ if __name__ == "__main__":
|
|||||||
"random-ids",
|
"random-ids",
|
||||||
"generated-shared-prefix",
|
"generated-shared-prefix",
|
||||||
"mmmu",
|
"mmmu",
|
||||||
"random-image",
|
"image",
|
||||||
"mooncake",
|
"mooncake",
|
||||||
],
|
],
|
||||||
help="Name of the dataset to benchmark on.",
|
help="Name of the dataset to benchmark on.",
|
||||||
@@ -2226,37 +2294,49 @@ if __name__ == "__main__":
|
|||||||
"--random-input-len",
|
"--random-input-len",
|
||||||
type=int,
|
type=int,
|
||||||
default=1024,
|
default=1024,
|
||||||
help="Number of input tokens per request, used only for random dataset.",
|
help="Number of input tokens per request, used only for random and image dataset.",
|
||||||
)
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--random-output-len",
|
"--random-output-len",
|
||||||
default=1024,
|
default=1024,
|
||||||
type=int,
|
type=int,
|
||||||
help="Number of output tokens per request, used only for random dataset.",
|
help="Number of output tokens per request, used only for random and image dataset.",
|
||||||
)
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--random-range-ratio",
|
"--random-range-ratio",
|
||||||
type=float,
|
type=float,
|
||||||
default=0.0,
|
default=0.0,
|
||||||
help="Range of sampled ratio of input/output length, "
|
help="Range of sampled ratio of input/output length, "
|
||||||
"used only for random dataset.",
|
"used only for random and image dataset.",
|
||||||
)
|
)
|
||||||
# random-image dataset args
|
# image dataset args
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--random-image-num-images",
|
"--image-count",
|
||||||
type=int,
|
type=int,
|
||||||
default=1,
|
default=1,
|
||||||
help="Number of images per request (only available with the random-image dataset)",
|
help="Number of images per request (only available with the image dataset)",
|
||||||
)
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--random-image-resolution",
|
"--image-resolution",
|
||||||
type=str,
|
type=str,
|
||||||
default="1080p",
|
default="1080p",
|
||||||
help=(
|
help=(
|
||||||
"Resolution of random images for random-image dataset. "
|
"Resolution of images for image dataset. "
|
||||||
"Supports presets 4k/1080p/720p/360p or custom 'heightxwidth' (e.g., 1080x1920)."
|
"Supports presets 4k/1080p/720p/360p or custom 'heightxwidth' (e.g., 1080x1920)."
|
||||||
),
|
),
|
||||||
)
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--image-format",
|
||||||
|
type=str,
|
||||||
|
default="jpeg",
|
||||||
|
help=("Format of images for image dataset. " "Supports jpeg and png."),
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--image-content",
|
||||||
|
type=str,
|
||||||
|
default="random",
|
||||||
|
help=("Content for images for image dataset. " "Supports random and blank."),
|
||||||
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--request-rate",
|
"--request-rate",
|
||||||
type=float,
|
type=float,
|
||||||
|
|||||||
Reference in New Issue
Block a user