docs/accuracy evaluation (#3114)
Co-authored-by: Shi Shuai <126407087+shuaills@users.noreply.github.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
@@ -57,6 +57,7 @@ The core features include:
 references/sampling_params.md
 references/hyperparameter_tuning.md
 references/benchmark_and_profiling.md
+references/accuracy_evaluation.md
 references/custom_chat_template.md
 references/deepseek.md
 references/llama_405B.md
docs/references/accuracy_evaluation.md (new file, 60 lines)
@@ -0,0 +1,60 @@
# Measuring Model Accuracy in SGLang

This guide shows how to evaluate model accuracy using SGLang's [built-in benchmarks](https://github.com/sgl-project/sglang/tree/b045841baeff37a5601fcde23fa98bd09d942c36/benchmark).

## Benchmarking Model Accuracy

This is a reference workflow for the [MMLU benchmark](https://github.com/sgl-project/sglang/tree/main/benchmark/mmlu). For more details or for other benchmarks, refer to the README in each benchmark folder under [sglang/benchmark](https://github.com/sgl-project/sglang/tree/b045841baeff37a5601fcde23fa98bd09d942c36/benchmark).

```bash
# Step 1: Download the dataset
bash download_data.sh

# Step 2: Launch the server
# (model selection, network configuration, and memory optimization)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Math-1.5B-Instruct \
  --port 30000 \
  --mem-fraction-static 0.8

# Step 3: Run the benchmark script
python3 bench_sglang.py --nsub 10  # Test 10 subjects

# Step 4: Extract the accuracy
grep -oP '"accuracy": \K\d+\.\d+' result.jsonl
```
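Note that the `\K` construct in Step 4 requires GNU grep's Perl-compatible regex mode (`-oP`), which some platforms (such as stock macOS) lack. As a portable alternative, the accuracy can be read straight from `result.jsonl` with a short Python snippet; this sketch assumes, as the grep pattern does, that the file holds one JSON object per line with a numeric `accuracy` field:

```python
import json

# Portable alternative to the grep command above: read result.jsonl
# and print every "accuracy" value it contains. Assumes one JSON
# object per line, matching what the grep pattern targets.
with open("result.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "accuracy" in record:
            print(record["accuracy"])
```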
## Customizing Benchmark Scripts

Some benchmark implementations may differ from ours, causing accuracy discrepancies. To match the 76.8% GSM8K accuracy reported by [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math), some customization is required.

```python
# The GSM8K benchmark script includes few-shot examples for evaluation by default.
# Here we exclude them.
for i in range(len(lines[num_shots:num_questions])):
    questions.append(get_one_example(lines, i, False))
    labels.append(get_answer_value(lines[i]["answer"]))
```
The prompt and stop sequences must also match the Qwen2.5-Math evaluation setup:

```python
import sglang as sgl

@sgl.function
def few_shot_gsm8k(s, question):
    # System prompt given in https://github.com/QwenLM/Qwen2.5-Math
    s += sgl.system("Please reason step by step, and put your final answer within \\boxed{}.")
    s += few_shot_examples + question
    # Stop words given in evaluation/math_eval.py of the Qwen2.5-Math repo
    s += sgl.gen(
        "answer",
        max_tokens=2048,
        stop=["Question", "Assistant:", "</s>", "<|im_end|>", "<|endoftext|>"],
    )
```
These adjustments give us the reported accuracy.
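One practical note: with the `\boxed{}` system prompt, the model's final answer appears inside `\boxed{...}` rather than after GSM8K's usual `####` marker, so depending on how your scoring code parses model output you may need a matching extractor. Below is a minimal sketch; the helper name `extract_boxed_answer` is our own illustration, not part of the benchmark script:

```python
import re

def extract_boxed_answer(text):
    """Return the contents of the last \\boxed{...} span, or None.

    Illustrative only: handles one level of nested braces, which is
    enough for typical numeric GSM8K answers.
    """
    matches = re.findall(r"\\boxed\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}", text)
    return matches[-1] if matches else None

# Example usage:
print(extract_boxed_answer(r"The total is \boxed{42}."))  # -> 42
```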
## Extending Evaluation Capabilities

1. **Contribute New Benchmarks**
   * Follow our [contribution guidelines](https://docs.sglang.ai/references/contribution_guide.html) to add new test scripts
2. **Request Implementations**
   * Feel free to open an issue describing your evaluation needs
3. **Use Alternative Tools**
   * [OpenCompass](https://opencompass.org.cn)
   * [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness); a sketch of pointing it at an SGLang server follows below
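For example, because `sglang.launch_server` exposes an OpenAI-compatible API, LM Evaluation Harness can usually be pointed at a running SGLang server via its `local-completions` model type. The flags below follow lm-eval's documented conventions but are not verified here, so treat them as a starting sketch and adjust the model name, URL, and tasks to your setup:

```bash
# Launch SGLang as before, then point lm-eval at its OpenAI-compatible endpoint.
# The model_args follow lm-eval's local-completions conventions; verify them
# against the lm-eval docs for your installed version.
pip install lm-eval
lm_eval --model local-completions \
  --model_args model=Qwen/Qwen2.5-Math-1.5B-Instruct,base_url=http://127.0.0.1:30000/v1/completions \
  --tasks gsm8k
```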