---
license: agpl-3.0
pipeline_tag: text-generation
tags:
- chemistry
language:
- en
- zh
---
# ChemDFM: A Large Language Foundation Model for Chemistry

|
||
|
|
|
||
|
|
ChemDFM is the pioneering open-sourced dialogue foundation model for Chemistry and molecular science, which is built based on LLaMa-13B. ChemDFM outperforms the open-sourced LLMs in all the typical tasks of chemistry, and even reaches comparable or higher performances to GPT-4. For more details, please refer to [our paper](https://arxiv.org/abs/2401.14818).
## News
* **2025-04-16**: Our paper has been accepted by *Cell Reports Physical Science*. The published version can be accessed [HERE](https://www.sciencedirect.com/science/article/pii/S2666386425001225).
* **2024-11-09**: [ChemDFM-v1.5-8B](https://huggingface.co/OpenDFM/ChemDFM-v1.5-8B) is released! We applied our domain pre-training and instruction-tuning procedure to the stronger base model LLaMA-3-8B.
* **2024-06-13**: Results on the comprehensive science benchmark [SciKnowEval](https://huggingface.co/datasets/hicai-zju/SciKnowEval) show that "ChemDFM emerged as one of the top open-source models by continuing pre-training and fine-tuning on a vast corpus of scientific literature".
* **2024-04-17**: The evaluation data (including instructions) used in our paper is released on [GitHub](https://github.com/OpenDFM/ChemDFM).
* **2024-03-12**: The parameters of [ChemDFM-v1.0-13B](https://huggingface.co/OpenDFM/ChemDFM-v1.0-13B) are open-sourced!
* **2024-01-26**: The paper of ChemDFM-13B is released on arXiv: [ChemDFM: Dialogue Foundation Model for Chemistry](https://arxiv.org/abs/2401.14818).
## Usage Details
The online demo of ChemDFM will be up soon!
### local inference
To load and run ChemDFM locally, here is an example:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name_or_id = "OpenDFM/ChemDFM-v1.5-8B"
# AutoTokenizer/AutoModelForCausalLM resolve the correct classes for both the
# LLaMA-based v1.0 and the LLaMA-3-based v1.5 checkpoints.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16, device_map="auto")

# Wrap the raw query in the dialogue template used during instruction tuning
# (see "input format" below).
input_text = "Can you please give detailed descriptions of the molecule below?\nCl.O=C1c2c(O)cccc2-c2nn(CCNCCO)c3ccc(NCCNCCO)c1c23"
input_text = f"[Round 0]\nHuman: {input_text}\nAssistant:"

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
generation_config = GenerationConfig(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.9,
    max_new_tokens=1024,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id
)

outputs = model.generate(**inputs, generation_config=generation_config)
# Strip the prompt from the decoded sequence, keeping only the model's reply.
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0][len(input_text):]
print(generated_text.strip())
```
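If GPU memory is tight, the same checkpoint can usually be loaded with 4-bit quantization through the `transformers` integration with the optional `bitsandbytes` package. This is a minimal sketch of a memory-saving variant, not part of the original instructions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: the optional `bitsandbytes` package is installed (pip install bitsandbytes).
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "OpenDFM/ChemDFM-v1.5-8B",
    quantization_config=quant_config,
    device_map="auto",
)
```

Quantization trades some generation quality for a much smaller memory footprint; the float16 setup above remains the reference configuration.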
### input format
To get better responses, we recommend preprocessing your input and history with the dialogue template used during the instruction tuning of ChemDFM. Specifically, for an input query of the form

```python
{'current_query': current_query, 'history': [(query1, answer1), (query2, answer2), ...]}
```

you can use the following code to preprocess the input and history:

```python
def formatting_input(current_query, history):
    input_text = ''
    # Replay each earlier (query, answer) round in the tuning template.
    for idx, (query, answer) in enumerate(history):
        input_text += f"[Round {idx}]\nHuman: {query}\nAssistant: {answer}\n"
    # Leave the assistant's turn open so the model completes it.
    input_text += f"[Round {len(history)}]\nHuman: {current_query}\nAssistant:"
    return input_text
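For example, a one-turn history (hypothetical queries, for illustration only) is rendered like this:

```python
history = [("What is the SMILES of aspirin?", "CC(=O)Oc1ccccc1C(=O)O")]
print(formatting_input("What is its molecular weight?", history))
# [Round 0]
# Human: What is the SMILES of aspirin?
# Assistant: CC(=O)Oc1ccccc1C(=O)O
# [Round 1]
# Human: What is its molecular weight?
# Assistant:
```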
### SMILES preprocess
When your input involves SMILES notation, we recommend canonicalizing it with the `rdkit` package before querying the model. Here is an example:
```python
from rdkit import Chem

def canonicalize_smiles(smiles):
    # Returns the canonical SMILES string, or None if the input cannot be parsed.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True, kekuleSmiles=False)
```
or directly:
```python
from rdkit import Chem

def canonicalize_smiles(smiles):
    # Note: unlike the helper above, this raises an exception on unparsable
    # SMILES instead of returning None.
    return Chem.CanonSmiles(smiles, useChiral=True)
```
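Either helper can then be applied before building the prompt. A small usage sketch (the input SMILES and its canonical form are illustrative; the exact canonical output can vary across RDKit versions):

```python
raw_smiles = "C1=CC=CC=C1O"  # phenol, written in a non-canonical form
smiles = canonicalize_smiles(raw_smiles)  # e.g. 'Oc1ccccc1'
query = f"Can you please give detailed descriptions of the molecule below?\n{smiles}"
```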
## Citation
```bibtex
@article{zhao2025developing,
  title={Developing ChemDFM as a large language foundation model for chemistry},
  author={Zhao, Zihan and Ma, Da and Chen, Lu and Sun, Liangtai and Li, Zihao and Xia, Yi and Chen, Bo and Xu, Hongshen and Zhu, Zichen and Zhu, Su and others},
  journal={Cell Reports Physical Science},
  volume={6},
  number={4},
  year={2025},
  publisher={Elsevier}
}
```
## Disclaimer
The current version of ChemDFM may generate incorrect or misleading information. Please use it with caution, and verify the results with domain experts before making any decisions based on them.
## Contact
If you have any questions or further requests, please contact [Zihan Zhao](mailto:zhao_mengxin@sjtu.edu.cn) and [Lu Chen](mailto:chenlusz@sjtu.edu.cn).