Skip to content

Latest commit

 

History

History
executable file
·
682 lines (592 loc) · 37.4 KB

README.md

File metadata and controls

executable file
·
682 lines (592 loc) · 37.4 KB

Scripts for training and validating neural network models for gnina

  • train.py - Takes a model template and train/test file(s)
  • predict.py - Takes a model, learned weights, and an input file and outputs probabilities of binding
  • clustering.py - Takes an input file, and outputs clustered cross-validation train/test file(s) for train.py. Note: take a long time to compute
  • compute_seqs.py - Takes input file for clustering.py and creates the input for compute_row.py
  • compute_row.py - Computes 1 row of NxN matrix in clustering.py. This allows for more parallelization of clustering.py
  • combine_rows.py - Script to take outputs of compute_row.py & combine them to avoid needing to do the computation in clustering.py
  • simple_grid_visualization.py - Script to evaluate 2-D single ligatom+receptor atom systems to visualize how CNN score changes with distance between atoms.
  • grid_visualization.py - Script to evaluate a box of single atoms around a receptor of interest to visualize how CNN scores ligand contexts
  • generate_unique_lig_poses.py - Script for counter-example generation which computes all of the unique ligand poses in a directory
  • counterexample_generation_jobs.py - Script which generates a file containing all of the gnina commands to generate new counter-examples
  • generate_counterexample_typeslines.py - Script which generates a file containing the lines to add to the types file for a pocket.
  • types_extender.py - Script to generate a new types file containing the lines generated from the counterexamples from an existing types file.

Dependencies

sudo pip install matplotlib scipy sklearn scikit-image protobuf psutil numpy seaborn
export PYTHONPATH=/usr/local/python:$PYTHONPATH

rdkit -- see installation instructions here

openbabel -- see installation instructions here

Training

usage: train.py [-h] -m MODEL -p PREFIX [-d DATA_ROOT] [-n FOLDNUMS] [-a] [-i ITERATIONS] [-s SEED] [-t TEST_INTERVAL] [-o OUTPREFIX] [-g GPU] [-c CONT] [-k] [-r]
                [--percent_reduced PERCENT_REDUCED] [--avg_rotations] [--checkpoint] [--keep_best] [--dynamic] [--cyclic] [--solver SOLVER] [--lr_policy LR_POLICY] [--step_reduce STEP_REDUCE]
                [--step_end STEP_END] [--step_end_cnt STEP_END_CNT] [--step_when STEP_WHEN] [--base_lr BASE_LR] [--momentum MOMENTUM] [--weight_decay WEIGHT_DECAY] [--gamma GAMMA]
                [--power POWER] [--weights WEIGHTS] [-p2 PREFIX2] [-d2 DATA_ROOT2] [--data_ratio DATA_RATIO] [--test_only] [--clip_gradients CLIP_GRADIENTS] [--skip_full]
                [--display_iter DISPLAY_ITER] [--update_ratio UPDATE_RATIO]

Train neural net on .types data.

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Model template. Must use TRAINFILE and TESTFILE
  -p PREFIX, --prefix PREFIX
                        Prefix for training/test files: <prefix>[train|test][num].types
  -d DATA_ROOT, --data_root DATA_ROOT
                        Root folder for relative paths in train/test files
  -n FOLDNUMS, --foldnums FOLDNUMS
                        Fold numbers to run, default is to determine using glob
  -a, --allfolds        Train and test file with all data folds, <prefix>.types
  -i ITERATIONS, --iterations ITERATIONS
                        Number of iterations to run,default 250,000
  -s SEED, --seed SEED  Random seed, default 42
  -t TEST_INTERVAL, --test_interval TEST_INTERVAL
                        How frequently to test (iterations), default 1000
  -o OUTPREFIX, --outprefix OUTPREFIX
                        Prefix for output files, default <model>.<pid>
  -g GPU, --gpu GPU     Specify GPU to run on
  -c CONT, --cont CONT  Continue a previous simulation from the provided iteration (snapshot must exist)
  -k, --keep            Don't delete prototxt files
  -r, --reduced         Use a reduced file for model evaluation if exists(<prefix>[reducedtrain|reducedtest][num].types). Incompatible with --percent_reduced
  --percent_reduced PERCENT_REDUCED
                        Create a reduced set on the fly based on types file, using the given percentage: to use 10 percent pass 10. Range (0,100). Incompatible with --reduced
  --avg_rotations       Use the average of the testfile's 24 rotations in its evaluation results
  --checkpoint          Enable automatic checkpointing
  --keep_best           Store snapshots everytime test AUC improves
  --dynamic             Attempt to adjust the base_lr in response to training progress, default True
  --cyclic              Vary base_lr in range of values: 0.015 to 0.001
  --solver SOLVER       Solver type. Default is SGD
  --lr_policy LR_POLICY
                        Learning policy to use. Default is fixed.
  --step_reduce STEP_REDUCE
                        Reduce the learning rate by this factor with dynamic stepping, default 0.1
  --step_end STEP_END   Terminate training if learning rate gets below this amount
  --step_end_cnt STEP_END_CNT
                        Terminate training after this many lr reductions
  --step_when STEP_WHEN
                        Perform a dynamic step (reduce base_lr) when training has not improved after this many test iterations, default 5
  --base_lr BASE_LR     Initial learning rate, default 0.01
  --momentum MOMENTUM   Momentum parameters, default 0.9
  --weight_decay WEIGHT_DECAY
                        Weight decay, default 0.001
  --gamma GAMMA         Gamma, default 0.001
  --power POWER         Power, default 1
  --weights WEIGHTS     Set of weights to initialize the model with
  -p2 PREFIX2, --prefix2 PREFIX2
                        Second prefix for training/test files for combined training: <prefix>[train|test][num].types
  -d2 DATA_ROOT2, --data_root2 DATA_ROOT2
                        Root folder for relative paths in second train/test files for combined training
  --data_ratio DATA_RATIO
                        Ratio to combine training data from 2 sources
  --test_only           Don't train, just evaluate test nets once
  --clip_gradients CLIP_GRADIENTS
                        Clip gradients threshold (default 10)
  --skip_full           Use reduced testset on final evaluation, requires passing --reduced
  --display_iter DISPLAY_ITER
                        Print out network outputs every so many iterations
  --update_ratio UPDATE_RATIO
                        Improvements during training need to be better than this ratio. IE (best-current)/best > update_ratio. Defaults to 0.001

MODEL is a caffe model file and is required. It should have a MolGridDataLayer for each phase, TRAIN and TEST. The source parameter of these layers should be placeholder values "TRAINFILE" and "TESTFILE" respectively.

PREFIX is the prefix of pre-specified training and test files. For example, if the prefix is "all" then there should be files "alltrainX.types" and "alltestX.types" for each X in FOLDNUMS. FOLDNUMS is a comma-separated list of ints, for example 0,1,2. With the --allfolds flag set, a model is also trained and tested on a single file that hasn't been split into train/test folds, for example "all.types" in the previous example.

The model trained on "alltrain0.types" will be tested on "alltest0.types". Each model is trained for up to ITERATIONS iterations and tested each TEST_INTERVAL iterations.

The train/test files are of the form 1 set1/159/rec.gninatypes set1/159/docked_0.gninatypes where the first value is the label, the second the receptor, and the third the ligand. Additional whitespace delimited fields are ignored. gninatypes files are created using gninatyper. The receptor and label paths in these files can be absolute, or they can be relative to a path provided as the DATA_ROOT argument. To use this option, the root_folder parameter in each MolGridDataLayer of the model file should be the placeholder value "DATA_ROOT". This can also be hard-coded into the model.

The prefix of all generated output files will be OUTPREFIX. If not specified, it will be the model name and the process id. While running, OUTPREFIX.X.out files are generated. A line is written at every TEST_INTERVAL and consists of the test auc, the train auc, the loss, and the learning rate at that iteration.

The entire train and test sets are evaluated every TEST_INTERVAL. If they are very large this may be undesirable. Alternatively, if -r is passed, pre-specified reduced train/test sets can be used for monitoring.

Once each fold is complete OUTPREFIX.X.finaltest is written with the final predicted values.

After the completion of all folds, OUTPREFIX.test and OUTPREFIX.train are written which contain the average AUC and and individual AUCs for each fold at eash test iteration. Also, a total finaltest file of all the predictions. Graphs of the training behavior (OUTPREFIX_train.pdf) and final ROC (OUTPREFIX_roc.pdf) are also created as well as caffe files.

The GPU to use can be specified with -g.

Previous training runs can be continued with -c. The same prefix etc. should be used.

A large number of training hyperparameters are available as options. The defaults should be pretty reasonable.

Generating Clustered-Cross Validation splits of data

There are 2 strategies: 1) Running clustering.py directly; 2) Running compute_seqs.py --> compute_row.py --> combine_rows.py. Strategy 1 works best when there is a small number of training examples (around 4000), but the process is rather slow. Strategy 2 is to upscale computing the clusters (typically for a supercomputing cluster) where each row would correspond to 1 job.

Note that these scripts assume that the input files point to a relative path from the current working directory.

Case 1: Using clustering.py

usage: clustering.py [-h] [--pdbfiles PDBFILES] [--cpickle CPICKLE] [-i INPUT]
                     [-o OUTPUT] [-c CHECK] [-n NUMBER] [-s SIMILARITY]
                     [-s2 SIMILARITY_WITH_SIMILAR_LIGAND]
                     [-l LIGAND_SIMILARITY] [-d DATA_ROOT] [--posedir POSEDIR]
                     [--randomize RANDOMIZE] [-v] [--reduce REDUCE]

create train/test sets for cross-validation separating by sequence similarity
of protein targets and rdkit fingerprint similarity

optional arguments:
  -h, --help            show this help message and exit
  --pdbfiles PDBFILES   file with target names, paths to pbdfiles of targets,
                        paths to ligand smile (separated by space)
  --cpickle CPICKLE     cpickle file for precomputed distance matrix and
                        ligand similarity matrix
  -i INPUT, --input INPUT
                        input .types file to create folds from, it is assumed
                        receptors in pdb named directories
  -o OUTPUT, --output OUTPUT
                        output name for clustered folds
  -c CHECK, --check CHECK
                        input name for folds to check for similarity
  -n NUMBER, --number NUMBER
                        number of folds to create/check. default=3
  -s SIMILARITY, --similarity SIMILARITY
                        what percentage similarity to cluster by. default=0.5
  -s2 SIMILARITY_WITH_SIMILAR_LIGAND, --similarity_with_similar_ligand SIMILARITY_WITH_SIMILAR_LIGAND
                        what percentage similarity to cluster by when ligands
                        are similar default=0.3
  -l LIGAND_SIMILARITY, --ligand_similarity LIGAND_SIMILARITY
                        similarity threshold for ligands, default=0.9
  -d DATA_ROOT, --data_root DATA_ROOT
                        path to target dirs
  --posedir POSEDIR     subdir of target dirs where ligand poses are located
  --randomize RANDOMIZE
                        randomize inputs to get a different split, number is
                        random seed
  -v, --verbose         verbose output
  --reduce REDUCE       Fraction to sample by for reduced files. default=0.05

INPUT is a types file that you want to create clusters for

Either CPICKLE or PDBFILES needs to be input for the script to work.

PDBFILES is a file of target_name, path to pdbfile of target, and path to the ligand smile (separated by space) CPICKLE is either the dump from running clustering.py one time, or the output from Case 2 (below) and allows you to avoid recomputing the costly protein sequence and ligand similarity matrices needed for clustering.

When running with PDBFILES, the script will output PDBFILES.pickle which contains (distanceMatrix, target_names, ligansim), where distanceMatrix is the matrix of cUTDM2 distance between the protein sequences, target_names is the list of targets, and ligandsim is the matrix of ligand similarities.

When running with CPICKLE, only the new .types files will be output.

A typical usage case would be to create 5 different seeds of 5fold cross-validation. First, we create seed0, which also will compute the matrices needed. This depends on having INPUT a types file that we want to generate clusters for PDBFILES (target_name path_to_pdb_file path_to_ligand_smile) for each target in types

clustering.py --pdbfiles my_info --input my_types.types --output my_types_cv_seed0_ --randomize 0 --number 5

Next we run the following four commands to generate the other 4 seeds

clustering.py --cpickle matrix.pickle --input my_types.types --output my_types_cv_seed1_ --randomize 1 --number 5
clustering.py --cpickle matrix.pickle --input my_types.types --output my_types_cv_seed2_ --randomize 2 --number 5
clustering.py --cpickle matrix.pickle --input my_types.types --output my_types_cv_seed3_ --randomize 3 --number 5
clustering.py --cpickle matrix.pickle --input my_types.types --output my_types_cv_seed4_ --randomize 4 --number 5

Case 2: Running compute_*.py pipeline

First, we will use the compute_seqs.py to generate the needed input files

usage: compute_seqs.py [-h] --pdbfiles PDBFILES [--out OUT]

Output the needed input for compute_row. This takes the format of
"<target_name> <ligand smile> <target_sequence>" separated by spaces

optional arguments:
  -h, --help           show this help message and exit
  --pdbfiles PDBFILES  file with target names, paths to pbdfiles of targets,
                       and path to smiles file of ligand (separated by space)
  --out OUT            output file (default stdout)

PDBFILES is the same input that would be given to clustering.py.

For the rest of this pipeline, I will consider the output of compute_seqs.py to be comp_seq_out.

Second, we will run compute_row.py for each line in the output of compute_seqs.py

usage: compute_row.py [-h] --pdbseqs PDBSEQS -r ROW [--out OUT]

Compute a single row of a distance matrix and ligand similarity matrix from a
pdbinfo file.

optional arguments:
  -h, --help         show this help message and exit
  --pdbseqs PDBSEQS  file with target names, ligand smile, and sequences
                     (chains separated by space)
  -r ROW, --row ROW  row to compute
  --out OUT          output file (default stdout)

Here PDBSEQS is the output of compute_seqs.py. For example, to compute row zero and store the output into the file row0:

compute_row.py --pdbseqs comp_seq_out --row 0 --out row0

For the next part, I assume that the output of compute_row.py is row[num] where [num] is the row that was computed.

Third, we will run combine_rows.py to create the cpickle file needed for input into clustering.py

combine_rows.py row*

combine_rows.py accepts any number of input files, and outputs matrix.pickle

Lastly, we run clustering.py as follows

clustering.py --cpickle matrix.pickle --input my_types.types --output my_types_cv_

Adding new counterexamples to types files

There are 4 scripts here which form a pipeline to generate new counter-examples for a data directory.

The pipeline is as follows: 1) generate_unique_lig_poses.py; 2) counterexample_generation_jobs.py; 3) generate_counterexample_typeslines.py; 4) types_extender.py.

Global Assumptions: 1) The data directory structure is: <ROOT>/<POCKET>/<FILES>, 2) Crystal ligand files are named: <PDBid>_<ligname><CRYSTAL SUFFIX>, 3) Receptors are PDB files, 4) output poses are SDF files.

Step 1) Generating the unique poses for a Pocket

In order to avoid extra calculations, we need to find the unique poses. NOTE - This process needs to be done exactly once when generating new counterexamples. After a round of counterexamples are generated, script 3 in the pipeline will generate the updated unique_poses.sdf file.

WARNING -- this script performs an O(n^2) calcualtion for each unique ligand name in the pocket!!

This can cause this to run very slowly if there are many receptors for the ligand to be docked into. It also can cause problems if there are many crystal ligands for 1 ligand name. In this case, we recommend creating a subdirectory for that pocket, and putting the extra crystal ligand files there.

usage: generate_unique_lig_poses.py [-h] -p POCKET -r ROOT [-ds DOCKED_SUFFIX]
                                    [-cs CRYSTAL_SUFFIX] -os OUT_SUFFIX
                                    [--unique_threshold UNIQUE_THRESHOLD]

Create ligname<OUTSUFFIX> files for use with generate_counterexample_typeslines.py.

optional arguments:
  -h, --help            show this help message and exit
  -p POCKET, --pocket POCKET
                        Name of the pocket that you will be generating the
                        file for.
  -r ROOT, --root ROOT  PATH to the ROOT of the pockets.
  -ds DOCKED_SUFFIX, --docked_suffix DOCKED_SUFFIX
                        Expression to glob docked poses. These contain the
                        poses that need to be uniqified. Default is
                        "_tt_docked.sdf"
  -cs CRYSTAL_SUFFIX, --crystal_suffix CRYSTAL_SUFFIX
                        Expression to glob the crystal ligands. Default is
                        "_lig.pdb"
  -os OUT_SUFFIX, --out_suffix OUT_SUFFIX
                        End of the filename for LIGNAME<OUTSUFFIX>. This will
                        be the --old_unique_suffix for
                        generate_counterexample_typeslines.py.
  --unique_threshold UNIQUE_THRESHOLD
                        RMSD threshold for unique poses. IE poses with RMSD >
                        thresh are considered unique. Defaults to 0.25.

The workflow for this script is to first generate a list of the pockets that you wish to analyze. We provide the pockets used for our CrossDocked2020 models in cd2020_pockets.txt.

You can then run the script in a bash for loop:

for d in `cat cd2020_pockets.txt` do python3 generate_unique_lig_poses.py -p $d -r MYROOT -os _initial_unique.sdf; done

WARNING -- this will be VERY SLOW. We HIGHLY RECOMMEND running this in a parallel fashion on a super computing cluster if possible.

The output when completed will be a series of LIGNAME_initial_unique.sdf files in each pocket directory.

Step 2 -- Generating the gnina commands to generate the counterexamples

We need to create the gnina commands to generate the new counterexample poses.

usage: counterexample_generation_jobs.py [-h] -o OUTFILE [-r ROOT]
                                         [-ri REC_ID] [-cs CRYSTAL_SUFFIX]
                                         [-ds DOCKED_SUFFIX] -i ITERATION
                                         [--num_modes NUM_MODES] [--cnn CNN]
                                         [--cnn_model CNN_MODEL]
                                         [--cnn_weights CNN_WEIGHTS]
                                         [--seed SEED] [--dirs DIRS]

Create cnn_minimize jobs for a dataset. Assumes dataset file structure is
<ROOT>/<Identifier>/<FILES>

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        Name for gnina job commands output file.
  -r ROOT, --root ROOT  ROOT for data directory structure. Defaults to current
                        working directory.
  -ri REC_ID, --rec_id REC_ID
                        Regular expression to identify the receptor PDB.
                        Defaults to ...._._rec.pdb
  -cs CRYSTAL_SUFFIX, --crystal_suffix CRYSTAL_SUFFIX
                        Expresssion to glob the crystal ligand PDB. Defaults
                        to _lig.pdb. Assumes filename is
                        PDBid_LignameLIGSUFFIX
  -ds DOCKED_SUFFIX, --docked_suffix DOCKED_SUFFIX
                        Expression to glob docked poses. These contain the
                        poses that need to be minimized. Default is
                        "_tt_docked.sdf"
  -i ITERATION, --iteration ITERATION
                        Sets what iteration number we are doing. Adds
                        _it#_docked.sdf to the output file for the gnina job
                        line.
  --num_modes NUM_MODES
                        Sets the --num_modes argument for the gnina command.
                        Defaults to 20.
  --cnn CNN             Sets the --cnn command for the gnina command. Defaults
                        to dense. Must be dense, general_default2018, or
                        crossdock_default2018.
  --cnn_model CNN_MODEL
                        Override --cnn with a user provided caffe model file.
                        If used, requires the user to pass in a weights file
                        as well.
  --cnn_weights CNN_WEIGHTS
                        The weights file to use with the supplied caffemodel
                        file.
  --seed SEED           Seed for the gnina commands. Defaults to 42
  --dirs DIRS           Supplied file containing a subset of the dataset (one pocket per line).
                        Default behavior is to do every directory.

The default behavior is to generate the output file for each directory in ROOT. For CrossDocked2020, we supply more pockets than we used to analyze, so you can pass the cd2020_pockets.txt file in the DIRS argument. The default values match what we used for CrossDocked2020.

Example -- generate the file it3_to_run.txt for CrossDocked2020 available at my_root for iteration 3, using the built-in dense net in gnina

python3 counter_example_generation_jobs.py -o it3_to_run.txt -r MYROOT -i 3 --cnn dense --dirs cd2020_pockets.txt

Once this has been created, each of these commands will need to be executed. Note, there will be many commands to run, so we recommend running in parallel across many GPUs on a computing cluster if possible.

Step 3 -- Generating the lines to add to the types files.

Running this script will generate OUTNAME in the supplied pocket, which will contain the lines to add to the types files for that pocket.

WARNING -- This script also performs an O(n^2) calculation per unique ligand in the pocket! This can take a very long time, and scales with the number of receptors and crystal ligands with the same ligand name present in the pocket. If this is to much, we recommend a downsampling strategy by moving files into a sub-directory prior to running the pipeline.

WARNING 2 -- As the calculations are O(n^2), we recommend running each pocket as its own job on a computing cluster if possible.

usage: generate_counterexample_typeslines.py [-h] -p POCKET -r ROOT -i INPUT
                                             [-cs CRYSTAL_SUFFIX]
                                             [--old_unique_suffix OLD_UNIQUE_SUFFIX]
                                             [-us UNIQUE_SUFFIX]
                                             [--unique_threshold UNIQUE_THRESHOLD]
                                             [--lower_confusing_threshold LOWER_CONFUSING_THRESHOLD]
                                             [--upper_confusing_threshold UPPER_CONFUSING_THRESHOLD]
					     [--good_pose_thresh GOOD_POSE_THRESHOLD]
					     [--bad_pose_thresh BAD_POSE_THRESHOLD]
                                             -o OUTNAME [-a AFFINITY_LOOKUP]

Create lines to add to types files from counterexample generation. Assumes
data file structure is ROOT/POCKET/FILES.

optional arguments:
  -h, --help            show this help message and exit
  -p POCKET, --pocket POCKET
                        Name of the pocket that you will be generating the
                        lines for.
  -r ROOT, --root ROOT  PATH to the ROOT of the pockets.
  -i INPUT, --input INPUT
                        File that is output from
                        counterexample_generation_jobs.py
  -cs CRYSTAL_SUFFIX, --crystal_suffix CRYSTAL_SUFFIX
                        Expresssion to glob the crystal ligand PDB. Defaults
                        to _lig.pdb. Needs to match what was used with
                        counterexample_generation_jobs.py
  --old_unique_suffix OLD_UNIQUE_SUFFIX
                        Suffix for the unique ligand sdf file from a previous
                        run. If set we will load that in and add to it.
                        Default behavior is to generate it from provided input
                        file.
  -us UNIQUE_SUFFIX, --unique_suffix UNIQUE_SUFFIX
                        Suffix for the unique ligand sdf file for this run.
                        Defaults to _it1___.sdf. One will be created for each
                        ligand in the pocket.
  --unique_threshold UNIQUE_THRESHOLD
                        RMSD threshold for unique poses. IE poses with RMSD >
                        thresh are considered unique. Defaults to 0.25.
  --lower_confusing_threshold LOWER_CONFUSING_THRESHOLD
                        CNNscore threshold for identifying confusing good
                        poses. Score < thresh & under 2RMSD is kept and
                        labelled 1. 0<thresh<1. Default 0.5
  --upper_confusing_threshold UPPER_CONFUSING_THRESHOLD
                        CNNscore threshold for identifying confusing poor
                        poses. If CNNscore > thresh & over 2RMSD pose is kept
                        and labelled 0. lower<thresh<1. Default 0.9

  --good_pose_thresh GOOD_POSE_THRESH
                        RMSD threshold to identify a good pose.
			If ligand RMSD to crystal < this value, the pose
			is labeled bad. Defaults to 2.0.
  --bad_pose_thresh BAD_POSE_THRESH
                        RMSD threshold to identify a bad pose.
			If ligand RMSD to crystal >= this value, the pose
			is labeled bad. Defaults to 2.0.
  -o OUTNAME, --outname OUTNAME
                        Name of the text file to write the new lines in. DO
                        NOT WRITE THE FULL PATH!
  -a AFFINITY_LOOKUP, --affinity_lookup AFFINITY_LOOKUP
                        File mapping the PDBid and ligname of the ligand to
                        its pK value. Assmes space delimited "PDBid ligname
                        pK". Defaults to pdbbind2017_affs.txt

Example -- finishing the pipeline from the previous examples for ZIPA_ECOLI_187_328_0

python3 generate_counterexample_typeslines.py -p ZIPA_ECOLI_187_328_0 -r MYROOT -i it3_to_run.txt -us _it3___.sdf -o it3_typeslines_toadd.txt --old_unique_suffix _initial_unique.sdf

The above command will be need to run for each directory in cd2020_pockets.txt. It will create the it3_typeslines_toadd.txt in the pocket directory.

That text file contains the lines that need to be added to the training/test types files. The default values match what we used for the CrossDocked2020 paper.

Step 4 -- Adding the lines for the counterexamples to the types file

Now that the lines we need to add are generated for each pocket, we can run types_extender.py on each of the types files that we use for training and testing to generate new types files with these added lines.

usage: types_extender.py [-h] -i INPUT -o OUTPUT -n NAME [-r ROOT]

Add lines to types file and create a new one. Assumes data file structure is
ROOT/POCKET/FILES.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Types file you will be extending.
  -o OUTPUT, --output OUTPUT
                        Name of the extended types file.
  -n NAME, --name NAME  Name of the file containing the lines to add for a
                        given pocket. This is the output of
                        generate_counterexample_typeslines.py.
  -r ROOT, --root ROOT  Root of the data directory. Defaults to current
                        working directory.

Continuing our example, after running script 3 there will be an it3_typeslines_toadd.txt file in each pocket. So now we generate a new train types file and new test types file as below:

python3 types_extender.py -i my_initial_train.types -o my_new_train.types -n it3_typeslines_toadd.txt -r MYROOT
python3 types_extender.py -i my_initial_test.types -o my_new_test.types -n it3_typeslines_toadd.txt -r MYROOT

Using visualization script

There are two scripts to help you visualize how the model scores atoms: 1) simple_grid_visualization.py; 2) grid_visualization.py

Script 1 fixes a single receptor atom, & places a lig atom along specified points on the Xaxis & score them. This gives insight into how the model behaves in a simplified 2D coordinate system.

Script 2 creates a cube of single atom points around a specified receptor, which are all scored. This gives insight into how the model behaves in a simplified 3D system. A glycine tripeptide is included in this directory as a starting point.

Note: both of these scripts need to be run twice in order to complete their entire function.

Case 1: Using simple_grid_visualization.py

usage: simple_grid_visualization.py [-h] -r RECATOMS -l LIGATOMS [-o OUTNAME]
                                    [-t TYPESROOT] -m MODEL -w WEIGHTS
                                    [-n NUM_POINTS] [-i INCREMENT]
                                    [-b BOX_SIZE] [--plot] [-d DATAROOT]

Script for generating the jobs that need to be run for simple visualization.
Generates types files & a text file that needs to be run. This results in
separating atoms along the x-axis. Can then graph the results.

optional arguments:
  -h, --help            show this help message and exit
  -r RECATOMS, --recatoms RECATOMS
                        File containing Receptor atom types of your modelfile
                        (1 per line)
  -l LIGATOMS, --ligatoms LIGATOMS
                        File containing Ligand atom types of your modelfile (1
                        per line)
  -o OUTNAME, --outname OUTNAME
                        File containing commands to be evaluated to predict
                        grid points. Note: Requires GNINASCRIPTSDIR to be set
                        environment variable. (default:
                        simplegrid_predicts.txt)
  -t TYPESROOT, --typesroot TYPESROOT
                        Root folder for gninatypes data generated from script.
                        (default: simpletypes/)
  -m MODEL, --model MODEL
                        Model file that predictions will be made with. Must
                        end in .model
  -w WEIGHTS, --weights WEIGHTS
                        Weights for the model file that the predictions will
                        be made with.
  -n NUM_POINTS, --num_points NUM_POINTS
                        Number of points. Defaults are reasonable. (default:
                        200)
  -i INCREMENT, --increment INCREMENT
                        increment (Angstroms) between points. Combines with
                        num_points multiplicitavely. Defaults for both result
                        in 200 points spanning 20 angstroms (default: 0.1)
  -b BOX_SIZE, --box_size BOX_SIZE
                        Size of the box. Used for sanity check that you are
                        not trying to predict outside of box for gnina. MUST
                        MATCH BOX OF MODEL. Defaults are default grid size for
                        gnina (default: 24)
  --plot                Flag to make 1 large plot from the data. Assumes
                        job(s) have completed. Requires Hydrogen to be a vaild
                        receptor. Saves pdf called simple_vis.pdf in the
                        current working directory (default: False)
  -d DATAROOT, --dataroot DATAROOT
                        Root folder of data resulting from output of running
                        the OUTNAME file (default: simpledata/)

The workflow for this script is the following: 1) Generate OUTNAME, 2) run each command present in OUTNAME, 3) Plot

As an example, I will use the default values of the script, RECATOMS=my_recatoms.txt, LIGATOMS=my_ligatoms.txt, MODEL=my_model.model, and WEIGHTS=my_modelweights.caffemodel.

First, we need to generate the commands to run with gnina. (this is OUTNAME, which will be simplegrid_predicts.txt)

python simple_grid_visualization.py -m my_model.model -w my_modelweights.caffemodel -r my_recatoms.txt -l my_ligatoms.txt

This will generate 2 new directories: simpledata (empty) and simpletypes. Simpleteypes should have a directory for every unique rec and lig atom. Additionally there will be a .types file for every rec+lig atom combination. Each types file should be NUM_POINTS lines (200 in this case).

Additionally, in the current working directory, there should be a new file called OUTNAME (simplegrid_predicts.txt). It will have 1 line per rec and lig atom combination.

Now we need to set the GNINASCRIPTSDIR environment variable. This would correspond to where this repo is installed.

export GNINASCRIPTSDIR=~/git/gnina/scripts

Third, we need to evaluate each line in simplegrid_predicts.txt. Note: this could take a fair amount of time on one machine, as this CANNOT be parallelized due to each job requiring the use of the GPU.

sh simplegrid_predicts.txt

After the above command has completed, simpledata should now be populated with reclig_predictscores files. These files can be utilized to make graphs now.

python simple_grid_visualization.py -m my_model.model -w my_modelweights.caffemodel -r my_recatoms.txt -l my_ligatoms.txt --plot

This will make 1 large pdf containing all the plots, simple_vis.pdf.

Case 2: Using grid_visualization.py

usage: grid_visualization.py [-h] -r RECATOMS -l LIGATOMS [-o OUTNAME]
                             [-t TYPESROOT] -m MODEL -w WEIGHTS [-p TEST_PDB]
                             [-c CUBE_LENGTH] [-n NUM_POINTS] [--make_dx]
                             [-d DATAROOT]

Script for generating the jobs that need to be run for visualization.
Generates types files & a text file that needs to be run. Can make a DX file
for visualization

optional arguments:
  -h, --help            show this help message and exit
  -r RECATOMS, --recatoms RECATOMS
                        File containing Receptor atom types of your modelfile (1 per line)
  -l LIGATOMS, --ligatoms LIGATOMS
                        File containing Ligand atom types of your modelfile (1 per line)
  -o OUTNAME, --outname OUTNAME
                        File containing commands to be evaluated to predict
                        grid points. Note: Requires GNINASCRIPTSDIR to be a
                        set environment variable. (default: grid_predicts.txt)
  -t TYPESROOT, --typesroot TYPESROOT
                        Root folder for gninatypes data generated from script.
                        (default: types/)
  -m MODEL, --model MODEL
                        Model file that predictions will be made with. Must
                        end in .model
  -w WEIGHTS, --weights WEIGHTS
                        Weights for the model file that the predictions will
                        be made with.
  -p TEST_PDB, --test_pdb TEST_PDB
                        pdbfile of receptor, centered at the origin for
                        visualization (default: gly_gly_gly.pdb)
  -c CUBE_LENGTH, --cube_length CUBE_LENGTH
                        Width of cube for grid box of points. Defaults are
                        reasonable (default: 24.0)
  -n NUM_POINTS, --num_points NUM_POINTS
                        Number of points per half of the box (ex 20 means
                        there will be 39x39x39 points total). Defaults are
                        reasonable. (default: 20)
  --make_dx             Flag to make dx files from the data. Assumes job(s)
                        have completed. (default: False)
  -d DATAROOT, --dataroot DATAROOT
                        Root folder of data resulting from output (default:
                        data/)

The first run of this script is WITHOUT the --make_dx flag. This will produce a file OUTNAME wherin each line will need to be executed, and GITSCRIPTSDIR is a set environment variable which points to the location where you have cloned this repository. The gly_gly_gly.pdb file is provided in this directory, and is 3 connected glycine residues whose center of mass is at (0,0,0).

As before I will be evaluating the defaults of the script with RECATOMS=my_recatoms.txt, LIGATOMS=my_ligatoms.txt, MODEL=my_model.model, and WEIGHTS=my_modelweights.caffemodel. Note: gly_gly_gly.pdb needs to be in the current working directory, and the gninatyper tool needs to be installed on your machine (it is installed alongside gnina).

First we must make the textfile with all of the commands to be run with gnina.

python grid_visualization.py -m my_model.model -w my_modelweights.caffemodel -r my_recatoms.txt -l my_ligatoms.txt

Next, we need to execute each command in OUTNAME (grid_predicts.txt). NOTE: this cannot be parallelized on one machine as each commands requires the GPU to be utilized.

sh grid_predicts.txt

After all of the lines in OUTNAME have been completed, rerun this script with the --make_dx flag and with the same arguments as before.

python grid_visualization.py -m my_model.model -w my_modelweights.caffemodel -r my_recatoms.txt -l my_ligatoms.txt --make_dx

DATAROOT will now contain the corresponding dx files. In order to visualize load the receptor and a dx file of interest via pymol: pymol TEST_PDB my_dxfile. Visualizations can be performed by adjusting the volume slider & color of the dx object.