Implementation of the paper "Neuraldecipher - Reverse-engineering extended-connectivity fingerprints (ECFPs) to their molecular structures" by Tuan Le, Robin Winter, Frank Noé and Djork-Arné Clevert [1].
rdkit==2020.03.2
numpy==1.18.1
tqdm==4.46.1
h5py==2.10.0
jupyter==1.0.0
Create a new environment:
git clone URL
cd neuraldecipher
conda env create -f environment.yml
conda activate neuraldecipher
conda install pytorch==1.4.0 torchvision==0.5.0 -c pytorch # GPU
# conda install pytorch==1.4.0 torchvision cpuonly -c pytorch # CPU
- cddd
To complete the reverse-engineering workflow, the decoder network from Winter et al. (see Workflow) is needed in the final evaluation. Note that it suffices to clone the cddd repository and start from the installation of tensorflow-gpu==1.10.0 without creating the cddd environment. The cddd module must be installed within the neuraldecipher environment for later inference. To use tensorboard with pytorch, remove the tensorboard==1.10.0 that is pulled in as a cddd dependency and install a newer version:
pip uninstall tensorboard
pip install tensorboard==1.14.0
We included this workaround to still be able to use the CDDD inference server and tensorboard to log the training of the Neuraldecipher.
The CDDD server is also needed to compute the CDDD vector representation from the SMILES to train the Neuraldecipher.
We provide a Jupyter Notebook in source/get_cddd.ipynb to compute the CDDD representations from the ChEMBL25 dataset.
The repository consists of several subdirectories:
- data: contains the training and test data.
- logs: contains the tensorboard log files for each training run.
- params: contains the JSON parameter files for each run. See example.
- models: contains the saved models. If the Neuraldecipher was trained on bit-ECFPs, the results are saved in models/bits_results. Otherwise the models are saved in models.
- source: contains all Python scripts needed for execution.
The provided data consists of:
- data/smiles.npy: List of SMILES from the filtered ChEMBL25 database, saved as a numpy array.
- data/smiles_temporal.npy: List of temporal SMILES from the filtered ChEMBL26 database, saved as a numpy array.
- data/cluster.npy: List of cluster assignments for the smiles.npy array. This array is needed to create the train and validation datasets.
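The role of the cluster assignments can be illustrated with a minimal sketch. It assumes that a cluster split means no cluster is shared between the train and validation sets; the function name cluster_split, the val_fraction parameter and the toy data are hypothetical, and the repository's own splitting code may differ.

```python
import random
from collections import defaultdict


def cluster_split(smiles, clusters, val_fraction=0.1, seed=0):
    """Split molecules so that no cluster appears in both train and validation."""
    # Group molecule indices by their cluster id.
    members = defaultdict(list)
    for idx, cluster_id in enumerate(clusters):
        members[cluster_id].append(idx)

    # Shuffle the cluster ids reproducibly, then fill the validation set
    # cluster by cluster until it holds roughly val_fraction of the data.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)
    target = val_fraction * len(smiles)
    train_idx, val_idx = [], []
    for cluster_id in cluster_ids:
        bucket = val_idx if len(val_idx) < target else train_idx
        bucket.extend(members[cluster_id])

    train = [smiles[i] for i in train_idx]
    val = [smiles[i] for i in val_idx]
    return train, val


# Toy stand-ins for the contents of data/smiles.npy and data/cluster.npy.
smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCCC", "c1ccncc1"]
clusters = [0, 0, 1, 2, 3, 1]
train, val = cluster_split(smiles, clusters, val_fraction=0.3)
```

The same function accepts arrays loaded with numpy.load, since it only iterates over and indexes its inputs.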
Computing several extended-connectivity fingerprints (ECFPs) depending on length k and bond diameter d
The python script in source/get_ecfp.py
computes the extended-connectivity fingerprints.
The options for the script are the following:
--all: Boolean flag whether or not all ECFP configurations described in the paper [1] should be computed. Defaults to False. In this case only the ECFPs with bond diameter d=6 and fingerprint size k=1024 are computed, for both the binary and the count representation.
--nworkers: Number of parallel CPU workers (integer) used to compute the ECFP representations. Defaults to 1. Using more workers is recommended to speed up the computation.
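To make the difference between the binary and the count representation concrete, here is a simplified sketch of folding substructure identifiers into vectors of length k. It is an illustration only: the identifiers are made up, fold_identifiers is a hypothetical helper, and the actual fingerprints are presumably produced with RDKit (where bond diameter d=6 corresponds to a Morgan radius of 3), whose hashing details are not reproduced here.

```python
def fold_identifiers(identifiers, k=1024):
    """Fold integer substructure identifiers into bit and count vectors of length k."""
    bits = [0] * k
    counts = [0] * k
    for identifier in identifiers:
        position = identifier % k  # map the identifier onto one of k slots
        bits[position] = 1         # binary ECFP: presence only
        counts[position] += 1      # count ECFP: number of occurrences
    return bits, counts


# Made-up identifiers for one molecule: 7 occurs twice, and 1031 lands
# in the same slot as 7 when k = 1024 (a hash collision).
identifiers = [7, 7, 42, 1031, 98765]
bits, counts = fold_identifiers(identifiers, k=1024)
```

A smaller k increases the chance of such collisions, which is one reason the fingerprint length matters for how much structural information can be recovered.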
Execution:
python source/get_ecfp.py -h # in order to see the information for the arguments
python source/get_ecfp.py --all False --nworkers 10 # only compute one ECFP setting and use 10 cpus for multiprocessing
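The commands above pass the boolean option as an explicit value (--all False). As a hedged sketch of how such a value can be parsed with argparse, the following uses a hypothetical str2bool helper; it is not taken from source/get_ecfp.py, which may handle the option differently. Note that a plain type=bool would not work here, because bool("False") evaluates to True.

```python
import argparse


def str2bool(value):
    """Interpret common textual spellings of a boolean (hypothetical helper)."""
    if value.lower() in ("true", "t", "yes", "1"):
        return True
    if value.lower() in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# Option names and defaults follow the description above; the help texts are illustrative.
parser = argparse.ArgumentParser(description="Sketch of the get_ecfp.py options")
parser.add_argument("--all", type=str2bool, default=False,
                    help="compute all ECFP configurations")
parser.add_argument("--nworkers", type=int, default=1,
                    help="number of parallel CPU workers")

# Parse the same arguments as in the example command.
args = parser.parse_args(["--all", "False", "--nworkers", "10"])
```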
The Jupyter Notebook in source/get_cddd.ipynb
shows how to generate CDDD representations from the data/smiles.npy
array.
The python script in source/main.py
executes the training for the Neuraldecipher.
The options for the script are the following:
--config: Path to the params.json file that contains the Neuraldecipher network architecture and training settings. Defaults to params/1024_config_bit_gpu.json.
--split: String to select whether the cluster or random split should be used (see reference [1] for details). Defaults to cluster.
--workers: Number of parallel CPU workers (integer) for the dataloader. Defaults to 5.
--cosineloss: Boolean flag whether or not the cosine loss should be used during training. Defaults to False. Set this flag to True to add the cosine similarity loss to the difference loss (e.g. L2 or logcosh).
Execution:
python source/main.py -h # in order to see the information for the arguments
python source/main.py --config params/1024_config_bit_gpu.json --split cluster --workers 5 --cosineloss False
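How the --cosineloss option combines the two loss terms can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: it assumes a mean reduction, a cosine loss of the form 1 - cosine similarity, and an unweighted sum of both terms. All function names are hypothetical, and the repository implements its losses in PyTorch.

```python
import math


def l2_loss(pred, target):
    """Mean squared difference between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def logcosh_loss(pred, target):
    """Mean log-cosh of the difference, a smooth alternative to L2."""
    return sum(math.log(math.cosh(p - t)) for p, t in zip(pred, target)) / len(pred)


def cosine_loss(pred, target):
    """1 - cosine similarity (assumed form of the cosine loss)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm


def total_loss(pred, target, difference=l2_loss, use_cosine=False):
    """Difference loss, optionally with the cosine term added on top."""
    loss = difference(pred, target)
    if use_cosine:
        loss += cosine_loss(pred, target)
    return loss


# Two orthogonal toy vectors standing in for predicted and true CDDD vectors.
pred, target = [1.0, 0.0], [0.0, 1.0]
without_cosine = total_loss(pred, target, use_cosine=False)
with_cosine = total_loss(pred, target, use_cosine=True)
```

The difference loss penalises element-wise deviations, while the cosine term penalises a mismatch in direction regardless of magnitude.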
Since tensorflow-gpu==1.10.0 is installed within the neuraldecipher environment, we cannot run tensorboard==1.14.0 within that environment. We merely added tensorboard==1.14.0 to the neuraldecipher environment to log the training of our Neuraldecipher.
To monitor the training, please create a new environment tb
and install tensorflow==1.14.0
(CPU version) which also includes tensorboard==1.14.0
in its installation.
conda create -n tb python=3.6.10 tensorflow==1.14.0
conda activate tb
Run the tensorboard command in a new shell (here served at localhost:8888):
tensorboard --logdir logs/ --port 8888 --host localhost
We provide the weights of the model trained on ECFP6 representations of length 1024 with the cluster split, and show its performance on the cluster validation dataset and the temporal dataset in the notebook source/evaluation.ipynb.
[1] T. Le, R. Winter, F. Noé and D.-A. Clevert, Chem. Sci., 2020, DOI: 10.1039/D0SC03115A