Initialize the project; model provided by the ModelHub XC community
Model: zjunlp/DataMind-Analysis-Qwen2.5-7B Source: Original Platform
---
base_model:
- Qwen/Qwen2.5-7B-Instruct
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- data-analysis
- code-generation
- qwen
---

This repository contains the **DataMind-Qwen2.5-7B** model, which was presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).

**Paper Abstract:**

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.

For more details, visit the official [DataMind GitHub repository](https://github.com/zjunlp/DataMind).

<h1 align="center"> ✨ DataMind </h1>

## 🔧 Installation

#### 🔩 Manual Environment Configuration

Conda virtual environments offer a lightweight and flexible setup.

**Prerequisites**

- Anaconda installation
- GPU support (recommended CUDA version: 12.4)

**Configuration Steps**

1. Clone the repository:

```bash
git clone https://github.com/zjunlp/DataMind.git
```

2. Enter the working directory. Run all subsequent commands from this directory.

```bash
cd DataMind/eval
```

3. Create and activate a Conda virtual environment.

```bash
conda create -n DataMind python=3.10
conda activate DataMind
```

4. Install all required Python packages.

```bash
pip install -r requirements.txt
```

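After the steps above, a quick sanity check of the environment can save a failed run later. The sketch below is only an illustration: the package names in `ASSUMED_PACKAGES` are taken from the usage section of this README, not from `requirements.txt` (whose contents are not shown here), and the helper names are hypothetical.

```python
import importlib.util
import sys

# Assumption: these top-level packages are the ones the usage example imports.
# Adjust the list to match requirements.txt in your checkout.
ASSUMED_PACKAGES = ["torch", "transformers"]


def python_matches(major, minor):
    """Return True if the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The conda step above creates the environment with python=3.10.
    print("Python 3.10:", python_matches(3, 10))
    print("Missing packages:", missing_packages(ASSUMED_PACKAGES))
```

Run it inside the activated `DataMind` environment; an empty "Missing packages" list suggests the install step completed.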
## Usage (Text Generation for Data Analysis)

You can use this model with the Hugging Face `transformers` library for text generation, particularly for data analysis and code generation tasks.

First, install `transformers` and `torch`, plus `accelerate`, which `device_map="auto"` in the example below relies on:

```bash
pip install transformers torch accelerate
```

Then load and use the model as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "zjunlp/DataMind-Qwen2.5-7B"  # Or zjunlp/DataMind-Qwen2.5-14B, if available

# Load the model and tokenizer
# Use torch_dtype=torch.bfloat16 for better performance on compatible GPUs
# Use device_map="auto" to automatically distribute the model across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example: Generate Python code for data analysis
messages = [
    {"role": "user", "content": "I have a CSV file named 'sales_data.csv' with columns 'Date', 'Product', 'Quantity', 'Price'. Write Python code using pandas to calculate the total revenue for each product and save it to a new CSV file named 'product_revenue.csv'."}
]

# Apply chat template for Qwen models
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,  # Passes input_ids together with the attention mask
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,  # Ensure generation stops at EOS token
)

# Decode only the newly generated tokens (everything after the prompt)
prompt_length = model_inputs.input_ids.shape[1]
response = tokenizer.decode(generated_ids[0][prompt_length:], skip_special_tokens=True)
print(response)
```

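The prompt in the example asks for Python code, so the response often needs one more step before it can be run: pulling the code out of the surrounding prose. The sketch below assumes the model wraps code in Markdown fences that are either unlabelled or labelled `python`/`py`; that output format is an assumption, not something this README guarantees, and `extract_code_blocks` is a hypothetical helper.

```python
import re

# Build the fence marker programmatically so this snippet can itself live
# inside a Markdown code block.
FENCE = "`" * 3

# Matches an opening fence (optionally tagged python/py), the body, and the closing fence.
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(response):
    """Return the stripped bodies of all fenced code blocks found in `response`."""
    return [body.strip() for body in CODE_BLOCK.findall(response)]


if __name__ == "__main__":
    demo = (
        "Here is the code:\n"
        + FENCE + "python\n"
        + "import pandas as pd\n"
        + "df = pd.read_csv('sales_data.csv')\n"
        + FENCE + "\nLet me know if you need changes."
    )
    for block in extract_code_blocks(demo):
        print(block)
```

Fences with other language tags (for example `bash`) are not handled and may be mis-paired, so review extracted code before executing it.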
## 🧐 Evaluation

> Note:
>
> - **Ensure** that your working directory is set to the **`eval`** folder and that the virtual environment is activated.
> - If you have more questions, feel free to open an issue.
> - To use a local model, deploy it first as described in **(Optional)`local_model.sh`**.

**Step 1: Prepare the parameter configuration**

The evaluation datasets we used are [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects the data at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.

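Before launching an evaluation, it can help to confirm that the datasets sit where the script expects them. The following sketch counts CSV files under the two glob patterns quoted above; `count_csv_files` is a hypothetical helper, and it checks file placement only, not file contents.

```python
from pathlib import Path

# Glob patterns relative to the data root, as stated in this README.
EXPECTED_PATTERNS = {
    "QRData": "QRData/benchmark/data/*.csv",
    "DiscoveryBench": "DiscoveryBench/*.csv",
}


def count_csv_files(data_root):
    """Return {dataset name: number of CSV files found under its expected pattern}."""
    root = Path(data_root)
    if not root.is_dir():
        # A missing data root simply means nothing was found.
        return {name: 0 for name in EXPECTED_PATTERNS}
    return {
        name: len(list(root.glob(pattern)))
        for name, pattern in EXPECTED_PATTERNS.items()
    }


if __name__ == "__main__":
    # Run from the `eval` folder so that `data` resolves as described above.
    for name, count in count_csv_files("data").items():
        status = "ok" if count else "no CSV files found"
        print(f"{name}: {count} ({status})")
```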
You can also download our SFT models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B), [DataMind-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).

Here is an example:

**`config.yaml`**

```yaml
api_key: your_api_key # API key for models served through an API. Not needed for open-source models.
data_root: /path/to/your/project/DataMind/eval/data # Root directory for data (absolute path).
```

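Because `data_root` has to be an absolute path, one way to avoid typos is to derive it from the location of your checkout. The sketch below renders the two config lines from a repository directory; `render_config` is a hypothetical helper, and the `eval/data` layout is taken from the example path above.

```python
from pathlib import Path


def render_config(repo_dir, api_key="your_api_key"):
    """Return config.yaml text with data_root resolved to an absolute path."""
    # resolve() makes the path absolute even if the directory does not exist yet.
    data_root = Path(repo_dir).expanduser().resolve() / "eval" / "data"
    return f"api_key: {api_key}\ndata_root: {data_root}\n"


if __name__ == "__main__":
    # Assumption: the repository was cloned into ./DataMind.
    print(render_config("DataMind"), end="")
```

Redirect the output to `config.yaml` or copy the `data_root` line into your existing file.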
**`run_eval.sh`**

```bash
# --model_name    Model name to use.
# --check_model   Check model to use.
# --output        Output directory path.
# --dataset_name  Dataset name to use, chosen from QRData, DiscoveryBench.
# --max_round     Maximum number of steps.
# --api_port      API port number; required if a local model is used.
# --bidx          Begin index (inclusive); `None` means no restriction.
# --eidx          End index (exclusive); `None` means no restriction.
# --temperature   Temperature for sampling.
# --top_p         Top p for sampling.
# --add_random    Whether to add random files.
python do_generate.py \
    --model_name DataMind-Qwen2.5-7B \
    --check_model gpt-4o-mini \
    --output results \
    --dataset_name QRData \
    --max_round 25 \
    --api_port 8000 \
    --bidx 0 \
    --eidx None \
    --temperature 0.0 \
    --top_p 1 \
    --add_random False
```

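The `--bidx`/`--eidx` pair describes a half-open range: the begin index is included, the end index is excluded, and `None` removes the bound. That matches ordinary Python slicing, as the sketch below illustrates. How `do_generate.py` actually parses these flags is not shown in this README, so `parse_index` and `select_range` are hypothetical helpers that only illustrate the semantics.

```python
def parse_index(value):
    """Convert a command-line string such as "0" or "None" into an int or None."""
    return None if value in (None, "None") else int(value)


def select_range(items, bidx, eidx):
    """Return items[bidx:eidx]; begin is inclusive, end is exclusive, None is unbounded."""
    return items[parse_index(bidx):parse_index(eidx)]


if __name__ == "__main__":
    samples = [f"question_{i}" for i in range(6)]
    print(select_range(samples, "0", "None"))  # every sample
    print(select_range(samples, "2", "4"))     # samples with index 2 and 3
```

Splitting a benchmark into consecutive ranges this way (for example `0` to `100`, then `100` to `None`) covers every sample exactly once.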
**(Optional)`local_model.sh`**

```bash
# --model                 Local model path.
# --served-model-name     The model name specified by you.
# --tensor-parallel-size  Tensor parallel size.
# --port                  API port number; must match `api_port` above.
CUDA_VISIBLE_DEVICES=$i python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --served-model-name "$MODEL_NAME" \
    --tensor-parallel-size $i \
    --port $port
```

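Once the server from `local_model.sh` is running, a single request is a quick way to confirm it responds before starting a full evaluation. vLLM's OpenAI-compatible server exposes a `/v1/chat/completions` endpoint; the sketch below builds such a request with the standard library only. The `localhost` host, the default sampling values, and the helper names are assumptions made for illustration.

```python
import json
import urllib.request


def build_chat_request(port, model_name, prompt, temperature=0.0, top_p=1.0):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    # Assumption: the server runs on the same machine as this script.
    url = f"http://localhost:{port}/v1/chat/completions"
    payload = {
        "model": model_name,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request(8000, "DataMind-Qwen2.5-7B", "Say hello.")
    print(req.full_url)
    # Uncomment once the local server is up:
    # print(send_chat_request(req))
```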
**Step 2: Run the shell script**

**(Optional)** Deploy the local model if needed.

```bash
bash local_model.sh
```

Run the shell script to start the evaluation.

```bash
bash run_eval.sh
```

## ✍️ Citation

If you find our work helpful, please cite it as follows.

```
@article{zhu2025open,
  title={Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study},
  author={Zhu, Yuqi and Zhong, Yi and Zhang, Jintian and Zhang, Ziheng and Qiao, Shuofei and Luo, Yujie and Du, Lun and Zheng, Da and Chen, Huajun and Zhang, Ningyu},
  journal={arXiv preprint arXiv:2506.19794},
  year={2025}
}
```