diff --git a/README.md b/README.md
index b1e372ae8..a699b9c2a 100644
--- a/README.md
+++ b/README.md
@@ -326,15 +326,17 @@ response = client.chat.completions.create(
 print(response)
 ```
 
-In above example, the server uses the chat template specified in the model tokenizer.
-You can override the chat template if needed when launching the server:
+
+By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
+
+If needed, you can also override the chat template when launching the server:
 
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
 ```
 
 If the chat template you are looking for is missing, you are welcome to contribute it.
-Meanwhile, you can also temporary register your chat template as follows:
+Meanwhile, you can also temporarily register your chat template as follows:
 
 ```json
 {
diff --git a/benchmark/latency_throughput/test_latency.py b/benchmark/latency_throughput/test_latency.py
index bdbf33925..732ddf543 100644
--- a/benchmark/latency_throughput/test_latency.py
+++ b/benchmark/latency_throughput/test_latency.py
@@ -30,7 +30,8 @@ if __name__ == "__main__":
     response = requests.post(
         url + "/generate",
         json={
-            "input_ids": [[1,2,3], [1,2,3]],
+            "text": f"{a}, ",
+            #"input_ids": [[2] * 256] * 196,
             "sampling_params": {
                 "temperature": 0,
                 "max_new_tokens": max_new_tokens,
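The README hunk's JSON example is cut off at the hunk boundary (it ends at the opening `{`). For reference, here is a sketch of what a registered chat-template file might look like; the field names and values below are assumptions about sglang's JSON schema, not text from this patch:

```python
import json

# Hypothetical chat-template file to pass via --chat-template; the keys here
# are assumptions about the expected schema, not part of this diff.
template = {
    "name": "my_model",                    # identifier for the template
    "system": "<|im_start|>system",        # prefix for system messages
    "user": "<|im_start|>user",            # prefix for user messages
    "assistant": "<|im_start|>assistant",  # prefix for assistant messages
    "sep_style": "CHATML",                 # separator convention (assumed)
    "sep": "<|im_end|>",                   # separator token (assumed)
}

with open("my_model_template.json", "w") as f:
    json.dump(template, f, indent=2)

# Then, hypothetically, launch the server pointing at the file:
#   python -m sglang.launch_server --model-path <model> --port 30000 \
#       --chat-template ./my_model_template.json
```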
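The benchmark hunk switches the request body from batched `input_ids` to a plain `text` prompt (`a` is a variable from the surrounding script). A minimal self-contained version of the same request, assuming a server launched on port 30000 as in the README hunk above:

```python
import requests

# Standalone sketch of the benchmark's /generate call; the host, port, prompt,
# and max_new_tokens value are assumptions for illustration.
url = "http://127.0.0.1:30000"

response = requests.post(
    url + "/generate",
    json={
        # The patch sends a text prompt instead of pre-tokenized input_ids.
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,      # greedy decoding, as in the benchmark
            "max_new_tokens": 32,  # cap on generated tokens
        },
    },
)
print(response.json())
```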