Initialize project; model provided by the ModelHub XC community
Model: OpenDFM/ChemDFM-v1.5-8B Source: Original Platform
112
README.md
Normal file
@@ -0,0 +1,112 @@
---
license: agpl-3.0
pipeline_tag: text-generation
tags:
- chemistry
language:
- en
- zh
---
# ChemDFM: A Large Language Foundation Model for Chemistry

ChemDFM is a pioneering open-source dialogue foundation model for chemistry and molecular science. The original ChemDFM is built on LLaMA-13B, and ChemDFM-v1.5-8B is built on LLaMA-3-8B. On the typical chemistry tasks evaluated in the paper, ChemDFM outperforms the open-source LLMs it is compared against and reaches performance comparable to or higher than GPT-4. For more details, please refer to [our paper](https://arxiv.org/abs/2401.14818).

## News

* **2025-04-16**: Our paper has been accepted by *Cell Reports Physical Science*. The published version is available [here](https://www.sciencedirect.com/science/article/pii/S2666386425001225).
* **2024-11-09**: [ChemDFM-v1.5-8B](https://huggingface.co/OpenDFM/ChemDFM-v1.5-8B) is released! We applied our domain pre-training and instruction-tuning procedure to a stronger base model, LLaMA-3-8B.
* **2024-06-13**: The results on the comprehensive science benchmark [SciKnowEval](https://huggingface.co/datasets/hicai-zju/SciKnowEval) show that "ChemDFM emerged as one of the top open-source models by continuing pre-training and fine-tuning on a vast corpus of scientific literature".
* **2024-04-17**: The evaluation data (including instructions) used in our paper is released on [GitHub](https://github.com/OpenDFM/ChemDFM).
* **2024-03-12**: The parameters of [ChemDFM-v1.0-13B](https://huggingface.co/OpenDFM/ChemDFM-v1.0-13B) are open-sourced!
* **2024-01-26**: The ChemDFM-13B paper is released on arXiv: [ChemDFM: Dialogue Foundation Model for Chemistry](https://arxiv.org/abs/2401.14818).

## Usage Details

The online demo of ChemDFM will be up soon!

### local inference

To load and run ChemDFM locally, here is an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name_or_id = "OpenDFM/ChemDFM-v1.5-8B"
# The Auto* classes select the tokenizer and model classes recorded in the checkpoint,
# which avoids hard-coding a tokenizer type for the LLaMA-3-based model.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16, device_map="auto")

input_text = "Can you please give detailed descriptions of the molecule below?\nCl.O=C1c2c(O)cccc2-c2nn(CCNCCO)c3ccc(NCCNCCO)c1c23"
# Wrap the query in the dialogue template used during instruction tuning.
input_text = f"[Round 0]\nHuman: {input_text}\nAssistant:"

# Place the inputs on the same device as the model's input layer.
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
generation_config = GenerationConfig(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.9,
    max_new_tokens=1024,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,
)

outputs = model.generate(**inputs, generation_config=generation_config)
# Decode only the newly generated tokens rather than slicing the decoded string.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(generated_text.strip())
```

### input format

To get better responses, we recommend preprocessing your input and history with the dialogue template used during the instruction tuning of ChemDFM. Specifically, for an input query of the form

```python
{'current_query': current_query, 'history': [(query1, answer1), (query2, answer2), ...]}
```

you can use the following code to preprocess the input and history:

```python
def formatting_input(current_query, history):
    input_text = ''
    # Rounds are numbered from 0; enumerate supplies the round index.
    for idx, (query, answer) in enumerate(history):
        input_text += f"[Round {idx}]\nHuman: {query}\nAssistant: {answer}\n"
    # The current query becomes the next round and leaves the answer open.
    input_text += f"[Round {len(history)}]\nHuman: {current_query}\nAssistant:"
    return input_text
```
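
To make the template concrete, here is a minimal sketch that runs the helper on a small conversation and prints the resulting prompt. The helper is repeated (with `enumerate` supplying the round index) so that the snippet runs on its own; the conversation content is hypothetical and only illustrates the layout.

```python
def formatting_input(current_query, history):
    """Build a ChemDFM prompt from past (query, answer) pairs and the new query."""
    input_text = ""
    for idx, (query, answer) in enumerate(history):
        input_text += f"[Round {idx}]\nHuman: {query}\nAssistant: {answer}\n"
    input_text += f"[Round {len(history)}]\nHuman: {current_query}\nAssistant:"
    return input_text


# Hypothetical one-turn history, used only to show the resulting layout.
history = [("What is the SMILES of ethanol?", "CCO")]
prompt = formatting_input("Is it soluble in water?", history)
print(prompt)
# [Round 0]
# Human: What is the SMILES of ethanol?
# Assistant: CCO
# [Round 1]
# Human: Is it soluble in water?
# Assistant:
```

With an empty history the helper yields a single `[Round 0]` block, which matches the prompt built by hand in the local inference example.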

### SMILES preprocessing

When your input contains SMILES notation, we recommend canonicalizing the SMILES with the `rdkit` package first. Here is an example:

```python
from rdkit import Chem

def canonicalize_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None when the SMILES cannot be parsed.
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True, kekuleSmiles=False)
```

or directly:

```python
from rdkit import Chem

def canonicalize_smiles(smiles):
    return Chem.CanonSmiles(smiles, useChiral=True)
```
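
One possible way to combine canonicalization with query construction is sketched below. The helper name `build_molecule_query` and its `canonicalize` parameter are hypothetical and not part of ChemDFM or RDKit. The sketch assumes the question-then-SMILES layout used in the local inference example, and it raises an error for an unparsable SMILES instead of passing `None` on to the model.

```python
def build_molecule_query(question, smiles, canonicalize=None):
    """Append a canonical SMILES string to a question (hypothetical helper)."""
    if canonicalize is None:
        # Default to RDKit; imported lazily so a stub can be injected for testing.
        from rdkit import Chem

        def canonicalize(s):
            mol = Chem.MolFromSmiles(s)
            return None if mol is None else Chem.MolToSmiles(mol)

    canonical = canonicalize(smiles)
    if canonical is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Same layout as the inference example: question, newline, SMILES.
    return f"{question}\n{canonical}"
```

The returned string can then be passed as `current_query` to the dialogue template, for example `build_molecule_query("Can you please give detailed descriptions of the molecule below?", "OCC")` when `rdkit` is installed.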

## Citation

```bibtex
@article{zhao2025developing,
  title={Developing ChemDFM as a large language foundation model for chemistry},
  author={Zhao, Zihan and Ma, Da and Chen, Lu and Sun, Liangtai and Li, Zihao and Xia, Yi and Chen, Bo and Xu, Hongshen and Zhu, Zichen and Zhu, Su and others},
  journal={Cell Reports Physical Science},
  volume={6},
  number={4},
  year={2025},
  publisher={Elsevier}
}
```

## Disclaimer

The current version of ChemDFM may generate incorrect or misleading information. Please use it with caution and verify the results with domain experts before making any decisions based on them.

## Contact

If you have any questions or further requests, please contact [Zihan Zhao](mailto:zhao_mengxin@sjtu.edu.cn) and [Lu Chen](mailto:chenlusz@sjtu.edu.cn).