This repository contains information on the creation and evaluation of, as well as benchmark models for, the L+M-24 dataset. L+M-24 will be featured as the shared task at the Language + Molecules Workshop at ACL 2024.
Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. The datasets released to date are either 1) small and scraped from existing databases, 2) large but noisy, constructed by performing entity linking on the scientific literature, or 3) template-based, built on prediction datasets. In this document, we detail the L+M-24 dataset, which was created for the Language + Molecules Workshop shared task at ACL 2024. In particular, L+M-24 is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.
Please see the manuscript for this dataset here.
- The official leaderboard is now available! We will upload the scripts used for the final evaluation and ranking soon.
- Submissions can now be uploaded to Codabench! See the competitions at: Molecule Captioning and Molecule Generation. See the instructions on the website (and the packaging sketch after this list).
- Example MolT5-Small submission files are available as `MolT5-Small_cap2smi_submit.zip` and `MolT5-Small_smi2cap_submit.zip`.
- We have updated the code for `text_property_metrics.py` to produce more intuitive results for missing properties in the validation set. We will update Tables 3 and 4 in the dataset manuscript soon to address this change.
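For reference, below is a minimal, hypothetical sketch of packaging predictions into a zip archive for a Codabench upload. The inner file name and its format are assumptions made for illustration only; follow the submission instructions on the website and mirror the layout of the example MolT5-Small submission archives.

```python
# Hypothetical sketch of packaging a predictions file for Codabench.
# The file name "predictions.txt" and its contents are assumptions for
# illustration only -- mirror the example MolT5-Small_*_submit.zip archives
# and the website instructions for the exact expected layout.
import zipfile

with zipfile.ZipFile("my_cap2smi_submit.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("predictions.txt")  # one prediction per line, in evaluation order (assumed)
```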
Datasets are available for download through HuggingFace Datasets (see the loading sketch after the table below).
| Split | Link | Description |
|-------|------|-------------|
| Train | LPM-24_train | The full training data for the shared task. |
| Train-Extra | LPM-24_train-extra | Extra training data for the shared task, with 5 captions generated for each molecule. |
| Evaluation -- Molecule Generation | LPM-24_eval-molgen | The evaluation data for molecule generation. Only input captions are included. |
| Evaluation -- Caption Generation | LPM-24_eval-caption | The evaluation data for molecule caption generation. |
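As a quick start, the splits can be loaded with the HuggingFace `datasets` library. This is a minimal sketch; the Hub repository IDs below are assumptions based on the split names in the table, so substitute the exact IDs linked above if they differ.

```python
# Minimal sketch: load L+M-24 splits with the HuggingFace `datasets` library.
# The Hub repository IDs below are assumed from the split names in the table
# above (e.g. under the language-plus-molecules organization); use the exact
# IDs linked there if they differ.
from datasets import load_dataset

train = load_dataset("language-plus-molecules/LPM-24_train")
eval_caption = load_dataset("language-plus-molecules/LPM-24_eval-caption")

print(train)              # available splits and sizes
print(train["train"][0])  # one molecule-caption record
```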
Further, the datasets are available in the zipped file `data.zip`. Some files that may be useful for training or necessary for evaluation are contained in `additional_data.zip`.
Evaluation code and instructions can be found in evaluation.
We would like to thank the input databases we used to construct this dataset!
If you found this dataset or code useful, please cite:
@article{edwards2024_LPM24,
    title = "L+M-24: Building a Dataset for Language+Molecules @ ACL 2024",
    author = "Edwards, Carl and
      Wang, Qingyun and
      Zhou, Lawrence and
      Ji, Heng",
    journal = "arXiv preprint arXiv:2403.00791",
    year = "2024"
}
@inproceedings{edwards-etal-2022-translation,
title = "Translation between Molecules and Natural Language",
author = "Edwards, Carl and
Lai, Tuan and
Ros, Kevin and
Honke, Garrett and
Cho, Kyunghyun and
Ji, Heng",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.26",
pages = "375--413",
}
as well as the source datasets.