Initialize project; model provided by the ModelHub XC community.
Model: nv-community/Nemotron-Cascade-8B Source: Original Platform
# LLM Evaluation Framework

This directory contains tools for evaluating large language models on various benchmarks.

## Overview

The evaluation framework supports multiple benchmark datasets across different domains:

- **Math**: AIME24, AIME25 (evaluation scripts provided)
- **Coding**: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
- **Multiple Choice**: MMLU, MMLU Pro, GPQA (evaluation scripts provided)
- **Instruction Following**: IFEval, IFBench (refer to official evaluation toolkits)
- **General Helpfulness**: Arena-Hard (refer to official evaluation toolkit)

## Installation

Install required dependencies:

```bash
pip install transformers vllm torch tqdm pandas
```

## Directory Structure

```
evaluation/
├── inference.py                  # Main inference script
├── arguments.py                  # Command-line argument definitions
│
├── data/                         # Benchmark datasets and preprocessing
│   ├── benchmark.py              # Dataset preprocessing functions
│   ├── aime24/, aime25/          # AIME competition problems
│   ├── gpqa/                     # GPQA dataset
│   ├── livecodebench/            # LiveCodeBench v5 and v6
│   ├── mmlu/, mmlu_pro/          # MMLU variants
│   ├── arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
│   ├── ifeval/, IFBench/         # Instruction following benchmarks
│   └── mt_bench/                 # MT-Bench data
│
├── eval/                         # Evaluation scripts
│   ├── get_scores_math.py        # Math benchmarks (AIME24, AIME25)
│   ├── get_scores_mmlu_batch.py  # MMLU, MMLU-Pro evaluation
│   ├── get_scores_gpqa.py        # GPQA evaluation
│   ├── get_scores_code.py        # Code benchmarks (LiveCodeBench)
│   └── tools/                    # Evaluation utilities
│       ├── grader.py             # Math answer grading
│       ├── code_verifier_utils.py  # Code execution and verification
│       └── latex2sympy/          # LaTeX to SymPy conversion
│
├── run.sh                        # Example single benchmark run
├── run_local.sh                  # Local evaluation script
├── run_all.sh                    # Run multiple benchmarks in parallel
└── README.md                     # This file
```

## Usage

### Quick Start

1. Edit `run.sh` to configure your model and data paths
2. Run the evaluation:

```bash
bash run.sh
```

### Advanced Usage

Run inference directly with custom parameters:

```bash
python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset aime24 \
    --temperature 0.6 \
    --topp 0.95 \
    --batch-size 2048
```

We suggest following the configuration from the paper and running each benchmark with k different random seeds, averaging the resulting scores.
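
A multi-seed run can then be sketched as a simple loop. The paths, model names, and seed list below are placeholders, and the `echo` makes this a dry run:

```shell
# Dry-run sketch: print one inference command per seed. Remove `echo`
# (and fill in real paths) to actually launch the runs.
COUNT=0
for SEED in 0 1 2 3; do
    echo "python inference.py --model-folder /path/to/models --model-name your-model" \
         "--benchmark-folder /path/to/benchmarks --eval-dataset aime24" \
         "--temperature 0.6 --topp 0.95 --seed ${SEED}"
    COUNT=$((COUNT + 1))
done
```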

### Key Arguments

#### Model Configuration (Required)
- `--model-folder`: Directory containing model weights
- `--model-name`: Name of the model subdirectory
- `--tokenizer-folder`: Directory containing tokenizer files
- `--tokenizer-name`: Name of the tokenizer subdirectory

#### Dataset Selection (Required for evaluation)
- `--benchmark-folder`: Root directory containing all benchmark datasets
- `--eval-dataset`: Name of the evaluation dataset (see supported datasets above)

#### Inference Parameters (Optional)
- `--temperature`: Sampling temperature (default: 0 for greedy decoding)
- `--topp`: Top-p (nucleus) sampling threshold (default: 1.0)
- `--topk`: Top-k sampling threshold (default: 1)
- `--max-output-len`: Maximum output length in tokens (default: 2048)
- `--batch-size`: Batch size for inference (default: 16)
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: 1)

#### Dataset Subsetting (Optional)
- `--start-idx`: Starting index for dataset subsetting (default: -1, disabled)
- `--end-idx`: Ending index for dataset subsetting (default: -1, disabled)
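
These flags make it straightforward to split a large benchmark across several processes or GPUs. The index arithmetic can be sketched as follows (an illustration, not code from the framework):

```python
def shard_bounds(n_examples, n_shards, shard_id):
    """Return (start_idx, end_idx) for shard `shard_id` of a dataset,
    spreading any remainder over the first shards."""
    base, extra = divmod(n_examples, n_shards)
    start = shard_id * base + min(shard_id, extra)
    end = start + base + (1 if shard_id < extra else 0)
    return start, end
```

Each worker would then pass its `(start, end)` pair via `--start-idx` and `--end-idx`.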

#### Other Options
- `--seed`: Random seed for reproducibility (default: 42)
- `--no-think`: Disable thinking mode (flag; thinking is enabled by default)
- `--yarn-factor`: Scaling factor for YaRN RoPE extension (default: 1)
- `--device-id`: Comma-separated GPU device IDs (optional)
- `--model-output-path`: Path to the first-turn output (required only for mtbench_secondturn)

## Supported Datasets

- `aime24` / `aime25`: AIME competition problems
- `lcb5` / `lcb6`: LiveCodeBench (versions 5 and 6)
- `mmlu`: MMLU 5-shot evaluation
- `mmlu_pro`: MMLU Pro dataset
- `gpqa_diamond`: GPQA Diamond subset
- `ifeval`: IFEval instruction following
- `ifbench`: IFBench instruction following
- `arena_hard`: Arena-Hard v0.1

## Running Evaluation Scripts

After generating model outputs with `inference.py`, you can compute metrics using the evaluation scripts in the `eval/` directory.

We also attach our cached generation files in the corresponding model repo for reproducibility.

### Math Benchmarks (AIME24, AIME25)

Evaluate math problem-solving performance:

```bash
cd eval
python get_scores_math.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates AIME24 and AIME25 benchmarks
- Extracts answers from `\boxed{}` and other formats
- Computes accuracy with mathematical equivalence checking
- Reports mean accuracy and standard deviation across multiple runs
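
For intuition, extracting the answer from the last `\boxed{}` in a response can be sketched as below. This is a simplified illustration, not the actual logic in `grader.py`:

```python
def extract_boxed(text):
    """Return the content of the last \\boxed{...} in a response,
    handling nested braces; None if absent or unbalanced."""
    idx = text.rfind(r"\boxed{")
    if idx == -1:
        return None
    i = idx + len(r"\boxed{")
    start, depth = i, 1
    while i < len(text) and depth:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return text[start:i - 1] if depth == 0 else None
```

The brace counting matters because answers like `\boxed{\frac{1}{2}}` contain nested braces that a naive regex would cut short.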

### Multiple Choice (MMLU, MMLU-Pro, GPQA)

Evaluate MMLU and its variants:

```bash
cd eval
python get_scores_mmlu_batch.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks \
    --verbose  # Optional: print per-category accuracy
```

This script evaluates:
- **MMLU**: Standard MMLU with 4 choices (A-D)
- **MMLU-Pro**: Extended version with up to 16 choices (A-P)

Features:
- Supports boxed answer format (e.g., `\boxed{A}`)
- Extracts letter choices from various formats (parentheses, text, etc.)
- Handles batch-split output files automatically
- Computes accuracy across all MMLU variants
- Optional per-category breakdown with the `--verbose` flag
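
A letter-choice extraction of this kind can be sketched as follows (an illustrative fallback chain, not the script's actual logic):

```python
import re

def extract_letter(text, num_choices=4):
    """Illustrative extraction: prefer \\boxed{X}, then fall back to a
    parenthesized letter like (X). Returns None when neither is found."""
    letters = "ABCDEFGHIJKLMNOP"[:num_choices]
    for pattern in (r"\\boxed\{([%s])\}" % letters, r"\(([%s])\)" % letters):
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return None
```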

Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:

```bash
cd eval
python get_scores_gpqa.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates the GPQA Diamond subset
- Extracts answers from boxed and text formats
- Uses mathematical equivalence checking for complex answers
- Reports accuracy with standard deviation

### Code Generation (LiveCodeBench)

Evaluate code generation performance:

```bash
cd eval
python get_scores_code.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates LiveCodeBench v5 and v6
- Executes generated code against test cases
- Computes pass rate (percentage of problems solved correctly)
- Reports finish rate (percentage of valid code generations)
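
The two metrics can be sketched as below; the `passed` and `valid` field names are hypothetical, chosen only to illustrate the pass rate vs. finish rate distinction:

```python
def summarize_code_results(results):
    """Illustrative aggregation: `passed` marks a problem whose generated
    code passed all test cases, `valid` marks a parseable generation."""
    n = len(results)
    pass_rate = 100.0 * sum(r["passed"] for r in results) / n
    finish_rate = 100.0 * sum(r["valid"] for r in results) / n
    return pass_rate, finish_rate
```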

**Note**: Code execution requires:
```bash
pip install numpy tqdm
```

### Other Benchmarks

For the following benchmarks, please refer to their official evaluation repositories due to licensing restrictions:

- **Arena-Hard**: Use the [official Arena-Hard evaluation toolkit](https://github.com/lmarena/arena-hard-auto)
- **IFEval**: Use the [official IFEval evaluation script](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
- **IFBench**: Use the [official IFBench evaluation toolkit](https://github.com/instruction-following/IFBench)

These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.

## Output Format

Results are saved as JSONL files in:
```
{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
```
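
Read literally, the bracketed part of the template is optional. A small helper (hypothetical, for illustration only) that reproduces the pattern:

```python
import os

def output_path(model_folder, model_name, eval_dataset, topp=None, seed=None):
    # The _topp/_seed suffix corresponds to the optional bracketed part
    # of the template above.
    suffix = f"_topp{topp}_seed{seed}" if topp is not None and seed is not None else ""
    return os.path.join(model_folder, model_name,
                        f"outputs_vllm073{suffix}", f"{eval_dataset}.jsonl")
```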

Each line contains:
- `task_id` or `question_id`: Unique identifier for the question
- `output`: The model's generated response
- `reason`: Whether reasoning was used (boolean)
- `reason_text`: The reasoning/thinking content (if applicable)
- Additional dataset-specific fields
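
As a sketch, a results file in this shape can be loaded with standard JSONL handling (field names as listed above):

```python
import json

def parse_outputs(lines):
    """Parse JSONL output lines into a list of records, skipping blanks."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records
```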

## Adding New Datasets

To add a new dataset:

1. Add a preprocessing function in `data/benchmark.py`:
```python
def preprocess_your_dataset(data_file):
    """Preprocess your dataset.

    Args:
        data_file: Path to dataset file

    Returns:
        tuple: (prompt_list, qid_list) or just prompt_list
    """
    # Your preprocessing logic
    pass
```

2. Add the dataset path argument in `arguments.py`:
```python
group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
```

3. Add the dataset case to the `get_prompt_list()` function in `inference.py`:
```python
elif args.eval_dataset == "your_dataset":
    from data.benchmark import preprocess_your_dataset
    input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
    prompt_list, qid_list = preprocess_your_dataset(input_datapath)
```
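
As a concrete (hypothetical) illustration of step 1, a preprocessing function for a simple JSONL dataset might look like this; the `question` and `id` field names are assumptions, not taken from the repo:

```python
import json

def preprocess_simple_qa(data_file):
    """Example only: read a JSONL file whose records carry
    "question" and "id" fields (field names assumed)."""
    prompt_list, qid_list = [], []
    with open(data_file) as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                prompt_list.append(record["question"])
                qid_list.append(record["id"])
    return prompt_list, qid_list
```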

## Notes

- The framework uses vLLM for efficient inference with batching and tensor parallelism support
- Special handling is provided for models like DeepSeek-R1 that require eager mode
- Thinking mode (`<think>` tags) is supported for models trained with reasoning capabilities
- YaRN RoPE scaling is supported for extended context lengths

## License

See the main repository LICENSE file for licensing information.