---
license: agpl-3.0
pipeline_tag: text-generation
tags:
- chemistry
language:
- en
- zh
---
# ChemDFM: A Large Language Foundation Model for Chemistry

|
||
|
|
|
||
|
|
ChemDFM is the pioneering open-sourced dialogue foundation model for Chemistry and molecular science, which is built based on LLaMa-13B. ChemDFM outperforms the open-sourced LLMs in all the typical tasks of chemistry, and even reaches comparable or higher performances to GPT-4. For more details, please refer to [our paper](https://arxiv.org/abs/2401.14818).
## News
* **2025-04-16**: Our paper has been accepted by *Cell Reports Physical Science*. The published version can be accessed [HERE](https://www.sciencedirect.com/science/article/pii/S2666386425001225).
* **2024-11-09**: [ChemDFM-v1.5-8B](https://huggingface.co/OpenDFM/ChemDFM-v1.5-8B) is released! We applied our domain pre-training and instruction-tuning procedure to the stronger base model LLaMA-3-8B.
* **2024-06-13**: Results on the comprehensive science benchmark [SciKnowEval](https://huggingface.co/datasets/hicai-zju/SciKnowEval) show that "ChemDFM emerged as one of the top open-source models by continuing pre-training and fine-tuning on a vast corpus of scientific literature".
* **2024-04-17**: The evaluation data (including instructions) used in our paper is released on [GitHub](https://github.com/OpenDFM/ChemDFM).
* **2024-03-12**: The parameters of [ChemDFM-v1.0-13B](https://huggingface.co/OpenDFM/ChemDFM-v1.0-13B) are open-sourced!
* **2024-01-26**: The paper of ChemDFM-13B is released on arXiv: [ChemDFM: Dialogue Foundation Model for Chemistry](https://arxiv.org/abs/2401.14818).
## Usage Details
The online demo of ChemDFM will be up soon!
### local inference
To load and run ChemDFM locally, here is an example:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name_or_id = "OpenDFM/ChemDFM-v1.5-8B"
# AutoTokenizer/AutoModelForCausalLM resolve the correct classes for both the
# LLaMA-based v1.0 and the LLaMA-3-based v1.5 checkpoints.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16, device_map="auto")

# Wrap the raw query in the dialogue template used during instruction tuning
# (see "input format" below).
input_text = "Can you please give detailed descriptions of the molecule below?\nCl.O=C1c2c(O)cccc2-c2nn(CCNCCO)c3ccc(NCCNCCO)c1c23"
input_text = f"[Round 0]\nHuman: {input_text}\nAssistant:"

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
generation_config = GenerationConfig(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.9,
    max_new_tokens=1024,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id
)

outputs = model.generate(**inputs, generation_config=generation_config)
# Strip the prompt from the decoded sequence, keeping only the model's reply.
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0][len(input_text):]
print(generated_text.strip())
```
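If GPU memory is tight, the same checkpoint can usually be loaded with 4-bit quantization through the `transformers` integration with the optional `bitsandbytes` package. This is a minimal sketch of a memory-saving variant, not part of the original instructions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: the optional `bitsandbytes` package is installed (pip install bitsandbytes).
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "OpenDFM/ChemDFM-v1.5-8B",
    quantization_config=quant_config,
    device_map="auto",
)
```

Quantization trades some generation quality for a much smaller memory footprint; the float16 setup above remains the reference configuration.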
### input format
To get better responses, we recommend preprocessing your input and history with the dialogue template used during the instruction tuning of ChemDFM. Specifically, for an input query of the form

```python
{'current_query': current_query, 'history': [(query1, answer1), (query2, answer2), ...]}
```

you can use the following code to preprocess the input and history:

```python
def formatting_input(current_query, history):
    input_text = ''
    # Replay each earlier (query, answer) round in the tuning template.
    for idx, (query, answer) in enumerate(history):
        input_text += f"[Round {idx}]\nHuman: {query}\nAssistant: {answer}\n"
    # Leave the assistant's turn open so the model completes it.
    input_text += f"[Round {len(history)}]\nHuman: {current_query}\nAssistant:"
    return input_text
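For example, a one-turn history (hypothetical queries, for illustration only) is rendered like this:

```python
history = [("What is the SMILES of aspirin?", "CC(=O)Oc1ccccc1C(=O)O")]
print(formatting_input("What is its molecular weight?", history))
# [Round 0]
# Human: What is the SMILES of aspirin?
# Assistant: CC(=O)Oc1ccccc1C(=O)O
# [Round 1]
# Human: What is its molecular weight?
# Assistant:
```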
### SMILES preprocess
When your input involves SMILES notation, we recommend canonicalizing it with the `rdkit` package before querying the model. Here is an example:
```python
from rdkit import Chem

def canonicalize_smiles(smiles):
    # Returns the canonical SMILES string, or None if the input cannot be parsed.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True, kekuleSmiles=False)
```
or directly:
```python
from rdkit import Chem

def canonicalize_smiles(smiles):
    # Note: unlike the helper above, this raises an exception on unparsable
    # SMILES instead of returning None.
    return Chem.CanonSmiles(smiles, useChiral=True)
```
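Either helper can then be applied before building the prompt. A small usage sketch (the input SMILES and its canonical form are illustrative; the exact canonical output can vary across RDKit versions):

```python
raw_smiles = "C1=CC=CC=C1O"  # phenol, written in a non-canonical form
smiles = canonicalize_smiles(raw_smiles)  # e.g. 'Oc1ccccc1'
query = f"Can you please give detailed descriptions of the molecule below?\n{smiles}"
```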
## Citation
```bibtex
@article{zhao2025developing,
  title={Developing ChemDFM as a large language foundation model for chemistry},
  author={Zhao, Zihan and Ma, Da and Chen, Lu and Sun, Liangtai and Li, Zihao and Xia, Yi and Chen, Bo and Xu, Hongshen and Zhu, Zichen and Zhu, Su and others},
  journal={Cell Reports Physical Science},
  volume={6},
  number={4},
  year={2025},
  publisher={Elsevier}
}
```
## Disclaimer
The current version of ChemDFM may generate incorrect or misleading information. Please use it with caution, and verify the results with domain experts before making any decisions based on them.
## Contact
If you have any questions or further requests, please contact [Zihan Zhao](mailto:zhao_mengxin@sjtu.edu.cn) and [Lu Chen](mailto:chenlusz@sjtu.edu.cn).