Tool for fully-automated inference of species trees from raw genome assemblies

Reference-free Orthology-free Annotation-free DIscordance aware Estimation of Species tree (ROADIES)

Introduction

Welcome to the official repository of ROADIES, a novel pipeline for inferring species trees directly from raw genome assemblies. ROADIES is fully automated, easy to use, and scalable: it requires no manual steps and lets you adjust the tradeoff between accuracy and runtime.

For more detailed information on all the features and settings of ROADIES, please refer to our Wiki.


Figure: ROADIES Pipeline Stages

Quick Install

Using ROADIES Bioconda package (recommended)

To run ROADIES using Bioconda package, follow these steps:

Note: You need conda installed on your system, and your system's glibc must be version 2.29 or newer (GLIBC >= 2.29).
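
The glibc requirement above can be checked before installing anything. The following is a minimal sketch, assuming a system Python 3 is available; the helper names parse_version and glibc_is_sufficient are illustrative and not part of ROADIES.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.31' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_sufficient(found, required="2.29"):
    """Return True when the found glibc version is at least the required one."""
    return parse_version(found) >= parse_version(required)


if __name__ == "__main__":
    # platform.libc_ver() reports the C library the Python interpreter is linked against.
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        print("Could not detect glibc on this system.")
    elif glibc_is_sufficient(version):
        print(f"glibc {version} found: meets the >= 2.29 requirement.")
    else:
        print(f"glibc {version} found: older than the required 2.29.")
```

Versions are compared as integer tuples rather than strings, so that a release such as 2.100 is not treated as older than 2.29.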

To install and set up conda on an Ubuntu machine, run the following commands:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

export PATH="$HOME/miniconda3/bin:$PATH"
source ~/.bashrc

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

After this, run conda in your terminal to check that it is properly installed. Once it is installed, follow the steps below:

  1. Create and activate a custom conda environment with Python 3.9:
conda create -n myenv python=3.9
conda activate myenv
  2. Install the ROADIES Bioconda package:
conda install roadies

All ROADIES files, along with their dependencies, will be located in <conda_install_path>/miniconda3/envs/myenv/ROADIES.

Using DockerHub

To run ROADIES using DockerHub, follow these steps:

  1. Pull the ROADIES Docker image from DockerHub:
docker pull ang037/roadies:latest
  2. Run the Docker container:
docker run -it ang037/roadies:latest

Using Docker locally

First, clone the repository (requires git to be installed on the system):

git clone https://github.com/TurakhiaLab/ROADIES.git
cd ROADIES

Then build and run the Docker container:

docker build -t roadies_image .
docker run -it roadies_image

Using installation script (requires sudo access)

First clone the repository:

git clone https://github.com/TurakhiaLab/ROADIES.git
cd ROADIES

Then, execute the installation script:

chmod +x roadies_env.sh
source roadies_env.sh

This will install and build all tools and dependencies. Once the setup is complete, the script prints "Setup complete" in the terminal and activates the roadies_env environment with all Conda packages installed.

Required dependencies

To run this script, ensure the required dependencies are installed. On Ubuntu, you can install them with:

sudo apt-get install -y wget unzip make g++ python3 python3-pip python3-setuptools git default-jre libgomp1 libboost-all-dev cmake

Note: If you encounter issues with the Boost library, add its path to $CPLUS_LIBRARY_PATH and save it in ~/.bashrc.


Quick Start

Once setup is done, you can run the ROADIES pipeline using the provided test dataset. Follow these steps for a 16-core machine:

  1. Go to the ROADIES repository directory if you are not already there:
cd ROADIES
  2. Create a directory for the test data and download the test datasets (using the following one-line command):
mkdir -p test/test_data && cat test/input_genome_links.txt | xargs -I {} sh -c 'wget -O test/test_data/$(basename {}) {}'
  3. Run the pipeline with the following command (from the ROADIES directory):
python run_roadies.py --cores 16

The second command downloads the 11 Drosophila genome assemblies (links listed in test/input_genome_links.txt) and saves them in the test/test_data directory. The third command runs the ROADIES pipeline on those 11 Drosophila genomes and, upon completion, saves the final species tree in Newick format as roadies.nwk in a separate output_files folder.
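
After the test run finishes, a quick sanity check is to confirm that every input genome appears as a leaf in the output tree. The sketch below is not part of ROADIES; it assumes the tree in output_files/roadies.nwk uses plain, unquoted taxon labels, and the helper name leaf_names is illustrative. A full Newick parser (for example from a phylogenetics library) would be more robust.

```python
import re
from pathlib import Path

# A leaf label starts right after "(" or "," and runs until ":", ",", ")" or ";".
# Assumption: labels are unquoted and contain no whitespace or Newick punctuation.
LEAF_PATTERN = re.compile(r"[(,]\s*([^():,;\s]+)")


def leaf_names(newick):
    """Return the leaf (taxon) labels of a Newick string in order of appearance."""
    return LEAF_PATTERN.findall(newick)


if __name__ == "__main__":
    tree_file = Path("output_files") / "roadies.nwk"
    if tree_file.exists():
        names = leaf_names(tree_file.read_text())
        print(f"{len(names)} species in tree: {', '.join(sorted(names))}")
    else:
        print(f"{tree_file} not found; run the pipeline first.")
```

For the test dataset, the reported count should match the 11 downloaded genomes. Internal node labels and support values follow a closing parenthesis, so the pattern skips them.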


Run ROADIES with your own datasets

To run ROADIES with your own datasets, follow these steps:

  1. Specify Input Genomic Dataset: Update the config.yaml file (found in the config folder of the ROADIES directory) to include the path to your input datasets under the GENOMES parameter. Ensure all input genome assemblies are in .fa or .fa.gz format and are named after the species (e.g., Aardvark.fa).

Note: Each file should contain the genome assembly of exactly one species. If a file contains multiple species, split it into individual genome files (faSplit can be used: faSplit byname <input_dir> <output_dir>).

  2. Configure Other Parameters: Adjust other parameters in config.yaml as needed. Detailed information on each parameter is available in the Usage section.

  3. Run the Pipeline: Execute the pipeline with the following command (example for 16 cores):

python run_roadies.py --cores 16

The output species tree in Newick format will be saved as roadies.nwk in the output_files folder.

  4. Modes of operation: ROADIES supports multiple modes of operation (fast, balanced, accurate) that control the accuracy-runtime tradeoff. Use one of the following commands to select a mode (accurate mode is the default):
python run_roadies.py --cores 16 --mode accurate

python run_roadies.py --cores 16 --mode balanced

python run_roadies.py --cores 16 --mode fast
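
Before launching a run on your own data, the input naming rules from step 1 (.fa or .fa.gz files, one per species, named after the species) can be checked with a short script. This is a minimal sketch under those assumptions only; the function names and the command-line usage are illustrative and not part of ROADIES.

```python
import sys
from pathlib import Path

# Longest suffix first, so "Aardvark.fa.gz" is not mistaken for another extension.
ALLOWED_SUFFIXES = (".fa.gz", ".fa")


def species_name(filename):
    """Return the species name implied by a genome file name, or None if the extension is not accepted."""
    for suffix in ALLOWED_SUFFIXES:
        if filename.endswith(suffix) and len(filename) > len(suffix):
            return filename[: -len(suffix)]
    return None


def check_genome_dir(directory):
    """Split the files in a directory into {species: filename} and a list of rejected names."""
    accepted, rejected = {}, []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        name = species_name(path.name)
        if name is None or name in accepted:
            # Wrong extension, or a second file for a species already seen.
            rejected.append(path.name)
        else:
            accepted[name] = path.name
    return accepted, rejected


if __name__ == "__main__":
    if len(sys.argv) == 2:
        ok, bad = check_genome_dir(sys.argv[1])
        print(f"{len(ok)} usable genome files: {', '.join(ok)}")
        if bad:
            print(f"Files to rename, convert, or remove: {', '.join(bad)}")
    else:
        print("usage: python check_genomes.py <genome_dir>")
```

The script name check_genomes.py is hypothetical. The check only looks at file names; it does not verify that each file holds a single species.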

For troubleshooting and contribution details, refer to the Wiki.


Citing ROADIES

If you use ROADIES in your research or publications, please cite the following paper:

Gupta A, Mirarab S, Turakhia Y (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES. bioRxiv. https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1.

Accessing ROADIES output files

The output files containing the gene trees and species trees generated by ROADIES for the manuscript are deposited in Dryad. To access them, please refer to the following:

Gupta, Anshu; Mirarab, Siavash; Turakhia, Yatish (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES [Dataset]. Dryad. https://doi.org/10.5061/dryad.tht76hf73.