Organize public APIs (#809)
This commit is contained in:
@@ -12,5 +12,5 @@ To port a model from vLLM to SGLang, you can compare these two files [SGLang LLa
|
||||
- Add `EntryClass` at the end.
|
||||
- Test correctness by comparing the final logits and outputs of the two following commands:
|
||||
- `python3 playground/reference_hf.py --model [new model]`
|
||||
- `python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code`
|
||||
- `python3 -m sglang.benchmarks.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code`
|
||||
- Update [Supported Models](https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#supported-models) at [README](../README.md).
|
||||
|
||||
@@ -4,15 +4,15 @@
|
||||
Make sure your changes do not slow down the following benchmarks
|
||||
```
|
||||
# single gpu
|
||||
python -m sglang.bench_latency --model-path meta-llama/Llama-2-7b-chat-hf --mem-fraction-static 0.8 --batch 32 --input-len 512 --output-len 256
|
||||
python -m sglang.bench_latency --model-path meta-llama/Llama-2-7b-chat-hf --mem-fraction-static 0.8 --batch 1 --input-len 512 --output-len 256
|
||||
python -m sglang.benchmarks.bench_latency --model-path meta-llama/Llama-2-7b-chat-hf --mem-fraction-static 0.8 --batch 32 --input-len 512 --output-len 256
|
||||
python -m sglang.benchmarks.bench_latency --model-path meta-llama/Llama-2-7b-chat-hf --mem-fraction-static 0.8 --batch 1 --input-len 512 --output-len 256
|
||||
|
||||
# multiple gpu
|
||||
python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-70B --tp 8 --mem-fraction-static 0.6 --batch 32 --input-len 8192 --output-len 1
|
||||
python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-70B --tp 8 --mem-fraction-static 0.6 --batch 1 --input-len 8100 --output-len 32
|
||||
python -m sglang.benchmarks.bench_latency --model-path meta-llama/Meta-Llama-3-70B --tp 8 --mem-fraction-static 0.6 --batch 32 --input-len 8192 --output-len 1
|
||||
python -m sglang.benchmarks.bench_latency --model-path meta-llama/Meta-Llama-3-70B --tp 8 --mem-fraction-static 0.6 --batch 1 --input-len 8100 --output-len 32
|
||||
|
||||
# moe model
|
||||
python -m sglang.bench_latency --model-path databricks/dbrx-base --tp 8 --mem-fraction-static 0.6 --batch 4 --input-len 1024 --output-len 32
|
||||
python -m sglang.benchmarks.bench_latency --model-path databricks/dbrx-base --tp 8 --mem-fraction-static 0.6 --batch 4 --input-len 1024 --output-len 32
|
||||
```
|
||||
|
||||
### High-level API
|
||||
|
||||
Reference in New Issue
Block a user