RhoFold+: Accurate RNA 3D structure prediction using a language model-based deep learning approach

This is the open source code for RhoFold+.

Citation

@article{shen2022e2efold,
  title={E2Efold-3D: End-to-End Deep Learning Method for accurate de novo RNA 3D Structure Prediction},
  author={Shen, Tao and Hu, Zhihang and Peng, Zhangzhi and Chen, Jiayang and Xiong, Peng and Hong, Liang and Zheng, Liangzhen and Wang, Yixuan and King, Irwin and Wang, Sheng and others},
  journal={arXiv preprint arXiv:2207.01586},
  year={2022}
}

Table of contents

Recent Updates
Online Server
Local Environment Setup
- For Linux Users
- Download Pre-trained Model
Usage
Training Data
Citations
License

Updates

*** Dec 31 / 2023 ***

Integrated inference with clustered, sampled MSAs into RhoFold+.

*** Oct 10 / 2023 ***

Initial commits:

Pretrained model is provided.

Online Server

If you prefer not to set up a local environment, you can access RhoFold+ through its online server: https://proj.cse.cuhk.edu.hk/aihlab/RhoFold/

Local Environment Setup

Create Environment with Conda

First, download the repository and create the environment.

Linux Users

(macOS is currently not supported.)

git clone https://github.com/ml4bio/RhoFold.git
cd ./RhoFold
conda env create -f ./envs/environment_linux.yaml

Then, activate the "RhoFold" environment.

conda activate RhoFold
python setup.py install

Download pre-trained model

cd ./pretrained
wget "https://proj.cse.cuhk.edu.hk/aihlab/RhoFold/api/download?filename=RhoFold_pretrained.pt" -O RhoFold_pretrained.pt
cd ../

Usage

Input Arguments

python inference.py

  --input_fas INPUT_FAS
                        Path to the input FASTA file. Valid nucleotides in the RNA sequence: A, U, G, C. Single-sequence input (without an MSA) is still in testing and is less accurate than a sequence combined with an MSA; use it only to generate a quick reference structure.
  --input_a3m INPUT_A3M
                        Path to the input msa file, default None.
                        If --input_a3m is not given (set to None), MSA will be generated automatically.
  --output_dir OUTPUT_DIR
                        Path to the output dir. 
                        Tertiary Structure prediction is saved in .pdb format (pLDDT score is recorded in the B-factor column). 
                        Distogram prediction is saved in .npz format.
                        Secondary structure prediction is saved in .ct format.
  --device DEVICE       
                        Default cpu. If GPUs are available, you can set --device cuda:<GPU_index> for faster prediction.
  --ckpt CKPT           
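                        Path to the pretrained model. Default ./pretrained/model_20221010_params.pt
                        The download step above saves the checkpoint as ./pretrained/RhoFold_pretrained.pt, so pass that path via --ckpt as in the examples below.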
  --relax_steps RELAX_STEPS
                        Number of steps for structure refinement, default 1000.
  --single_seq_pred 
                        Default False.
                        If --single_seq_pred is set to True, modeling uses the single sequence only (--input_fas), without an MSA.
  --database_dpath      
                        Path to the sequence database for MSA construction. Default ./database
  --binary_dpath
                        Path to the executable. Default ./RhoFold/data/bin
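
As a quick pre-flight check before calling inference.py, the sketch below reads a FASTA file and reports any characters outside A, U, G, C. The helper names (read_fasta, invalid_nucleotides, check_fasta_file) are hypothetical and not part of RhoFold+. Whether RhoFold+ itself accepts lowercase letters or converts T to U is not stated here, so this sketch only uppercases the sequence and flags everything else.

```python
from pathlib import Path

VALID_NUCLEOTIDES = set("AUGC")  # alphabet stated for --input_fas


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_nucleotides(sequence):
    """Return the sorted list of characters that are not A, U, G, or C."""
    return sorted(set(sequence.upper()) - VALID_NUCLEOTIDES)


def check_fasta_file(path):
    """Return {header: [invalid characters]} for records that fail the check."""
    problems = {}
    for header, seq in read_fasta(Path(path).read_text()):
        bad = invalid_nucleotides(seq)
        if bad:
            problems[header] = bad
    return problems
```

For example, check_fasta_file("./example/input/3owzA/3owzA.fasta") returns an empty dict when every record uses only the four listed nucleotides.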

Output Files

The outputs will be saved in the directory provided via the --output_dir flag of inference.py. The outputs include the unrelaxed structures, relaxed structures, prediction metadata, and running log. The --output_dir directory will have the following structure:

<--output_dir>/
    results.npz
    ss.ct
    unrelaxed_model.pdb
    relaxed_{relax_steps}_model.pdb
    log.txt
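
To confirm that a run finished, one option is to check the output directory against the layout above. This is a minimal sketch: expected_outputs and missing_outputs are hypothetical helper names, and the sketch assumes the relaxed file name is formed by substituting the --relax_steps value into relaxed_{relax_steps}_model.pdb exactly as written in the listing.

```python
from pathlib import Path


def expected_outputs(output_dir, relax_steps=1000):
    """List the files documented for --output_dir, given --relax_steps."""
    out = Path(output_dir)
    return [
        out / "results.npz",
        out / "ss.ct",
        out / "unrelaxed_model.pdb",
        # Assumption: the step count appears verbatim in the file name.
        out / f"relaxed_{relax_steps}_model.pdb",
        out / "log.txt",
    ]


def missing_outputs(output_dir, relax_steps=1000):
    """Return the documented output files that are not present on disk."""
    return [p for p in expected_outputs(output_dir, relax_steps) if not p.is_file()]
```

Calling missing_outputs("./example/output/3owzA/") after a run returns an empty list when all five files are present.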

The contents of each output file are as follows:

results.npz – A .npz file containing the distogram prediction of RhoFold+ in NumPy arrays.
ss.ct – A .ct format text file containing the predicted secondary structure.
unrelaxed_model.pdb – A PDB format file containing the predicted structure from deep learning.
relaxed_{relax_steps}_model.pdb – A PDB format file containing the Amber-relaxed structure derived from unrelaxed_model.pdb.
log.txt – A text file containing the run log.
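
Because the pLDDT score is stored in the B-factor column of the .pdb output, it can be read back with a few lines of Python. The sketch below relies only on the standard fixed-column PDB layout (residue number in columns 23-26, temperature factor in columns 61-66). The function name per_residue_plddt is hypothetical, the sketch assumes a single chain, and it makes no assumption about the score's scale or about whether all atoms of a residue carry the same value; it simply averages per residue.

```python
def per_residue_plddt(pdb_text):
    """Map residue number -> mean B-factor over that residue's ATOM records.

    Uses the fixed-column PDB layout: residue number in columns 23-26 and
    temperature factor (here, pLDDT) in columns 61-66. Chain IDs are ignored,
    so this is only meaningful for single-chain files.
    """
    sums = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        res_num = int(line[22:26])
        b_factor = float(line[60:66])
        total, count = sums.get(res_num, (0.0, 0))
        sums[res_num] = (total + b_factor, count + 1)
    return {res: total / count for res, (total, count) in sums.items()}
```

A typical call would be per_residue_plddt(open("unrelaxed_model.pdb").read()), which yields one averaged score per residue in file order.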

Examples

Below are examples of how to use RhoFold+ in different scenarios.

Folding with sequence and given MSA as input

python inference.py --input_fas ./example/input/3owzA/3owzA.fasta --input_a3m ./example/input/3owzA/3owzA.a3m --output_dir ./example/output/3owzA/ --ckpt ./pretrained/RhoFold_pretrained.pt
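
When supplying your own MSA, it can help to confirm that it belongs to the FASTA sequence you are folding. The sketch below assumes the common A3M convention that the first alignment row is the query; this README does not state that RhoFold+ requires it, so treat the check as a sanity test rather than a format rule. Both function names are hypothetical.

```python
def read_a3m_sequences(text):
    """Return the sequences of an A3M/FASTA-style alignment, in file order."""
    sequences, chunks, seen_header = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            # A new header closes the previous row, if any.
            if seen_header:
                sequences.append("".join(chunks))
            seen_header, chunks = True, []
        elif seen_header:
            chunks.append(line)
    if seen_header:
        sequences.append("".join(chunks))
    return sequences


def query_matches_msa(query, a3m_text):
    """Check that the first MSA row equals the query once gaps are removed."""
    sequences = read_a3m_sequences(a3m_text)
    if not sequences:
        return False
    # Assumption: the first row is the query sequence.
    first = sequences[0].replace("-", "").replace(".", "").upper()
    return first == query.upper()
```

If the check fails, the .fasta and .a3m files were probably generated for different targets.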

Folding with sampled, clustered MSA as input

python ./scripts/rhofold_msa_sampler_clust.py -i MSA_PATH -o OUT_DIR -n NUM_CLUST
python inference.py --input_fas ./example/input/3owzA/3owzA.fasta --input_a3m OUT_DIR --output_dir ./example/output/3owzA/ --ckpt ./pretrained/RhoFold_pretrained.pt

Folding with single sequence as input

1. Sequence standalone
This mode is still in testing. It is less accurate than the MSA version and is intended only for generating a quick reference structure.

python inference.py --input_fas ./example/input/3owzA/3owzA.fasta --single_seq_pred True --output_dir ./example/output/3owzA/ --ckpt ./pretrained/RhoFold_pretrained.pt

2. With our constructed MSA (full version of RhoFold+)

To support MSA construction, 3 sequence databases (RNAcentral, Rfam, and nt) totaling about 900GB need to be downloaded.

Warning: make sure you have enough disk space to store the data. Otherwise, you can use our online server directly, or download our off-the-shelf MSAs instead of regenerating them.
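
A free-space check before starting the download can be sketched with the standard library. The 900 GB threshold comes from the approximate figure above; the function names are hypothetical, and the sketch counts 1 GB as 10**9 bytes, which is an assumption about how that figure was measured.

```python
import shutil


def free_space_gb(path="."):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_databases(path=".", required_gb=900):
    """True if `path` has at least `required_gb` GB free."""
    return free_space_gb(path) >= required_gb
```

For example, has_room_for_databases("./database") reports whether the filesystem holding the default database directory has roughly enough room, provided that directory already exists.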

./database/bin/builddb.sh

Then run the following command:

python inference.py --input_fas ./example/input/3owzA/3owzA.fasta --output_dir ./example/output/3owzA/ --ckpt ./pretrained/RhoFold_pretrained.pt
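
To script the scenarios above (given MSA, single sequence, or automatically constructed MSA), the command line can be assembled programmatically. The sketch below uses only flags listed under Input Arguments; build_inference_cmd is a hypothetical helper, not part of RhoFold+, and it assumes the command is run from the repository root where inference.py lives.

```python
import sys


def build_inference_cmd(input_fas, output_dir, ckpt, input_a3m=None,
                        single_seq_pred=False, device=None):
    """Assemble an inference.py command line as a list of arguments."""
    cmd = [sys.executable, "inference.py",
           "--input_fas", str(input_fas),
           "--output_dir", str(output_dir),
           "--ckpt", str(ckpt)]
    if input_a3m is not None:
        # Given MSA; omit to let RhoFold+ build the MSA automatically.
        cmd += ["--input_a3m", str(input_a3m)]
    if single_seq_pred:
        # Mirrors the "--single_seq_pred True" form used in the example.
        cmd += ["--single_seq_pred", "True"]
    if device is not None:
        cmd += ["--device", device]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting issues when paths contain spaces.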

Training Data

You can access the training data (13.86 GB) from the Google Drive link. The archive includes the off-the-shelf MSAs for the training data, which can be fed into RhoFold+ directly.

Citations

@article{shen2022e2efold,
  title={E2Efold-3D: End-to-End Deep Learning Method for accurate de novo RNA 3D Structure Prediction},
  author={Shen, Tao and Hu, Zhihang and Peng, Zhangzhi and Chen, Jiayang and Xiong, Peng and Hong, Liang and Zheng, Liangzhen and Wang, Yixuan and King, Irwin and Wang, Sheng and others},
  journal={arXiv preprint arXiv:2207.01586},
  year={2022}
}

License

This source code is licensed under the Apache license found in the LICENSE file in the root directory of this source tree.
