A pipeline for predicting and masking transposable elements in multiple genomes.
This pipeline is currently developed mostly for personal use, so some of the default parameter combinations or tools may be specific to the kinds of organisms I work on (plant-pathogenic fungi). I do make an effort to make it usable more generally, though.
PanTE takes a population of genomes and runs several repeat, transposable element, and non-coding RNA prediction tools and merges the results to yield a reasonably comprehensive picture of repeats in your genomes. My intended use case is for multiple genomes from the same species, but I suppose you could do closely related organisms in the same run too.
To run PanTE you'll just need your genomes and optionally a copy of the RepBase repeat masker formatted database.
The pipeline follows these main steps:
- Predict non-coding RNA elements using tRNAScan-SE, Infernal (searching against Rfam), and optionally RNAmmer.
- Predict transposable elements using RepeatModeler, LTRHarvest/LTRDigest, EAHelitron, MiteFinder 2, and MMSeqs2 profile searches against GyDB, selected Pfam models, and a custom set of TE proteins derived from the TransposonPSI and LTR_retriever libraries.
- Combine all TE predictions (except LTRDigest/Harvest) and cluster them to form conservative families using vsearch.
- Filter the families based on minimum abundance within each genome and presence across the population.
- Compute multiple sequence alignments for the families using DECIPHER.
- Classify the families using RepeatClassifier (part of RepeatModeler).
- Search all genomes for more distant matches to the families (and optionally species models from RepBase/DFAM) using RepeatMasker.
- Combine all TE and ncRNA predictions into a final GFF file and soft-mask the genomes using this combined set.
The reason that LTRDigest/LTRHarvest is currently excluded from the family clustering is that the predicted LTRs tend to be quite big or contain nested elements, which tends to group non-related TEs into single clusters. Eventually I might implement a method to split the LTR predictions up to avoid this issue, but RepeatModeler tends to pick up important parts of LTRs anyway.
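The abundance and presence filter in the step list can be sketched in shell. This is only an illustration under assumptions: the tab-separated input (family, genome, copy count) and the function name `filter_families` are hypothetical, and the pipeline's own implementation may well differ. The two thresholds correspond to the --min_intra_frequency and --min_inter_proportion parameters.

```shell
# Hypothetical sketch: keep families that have at least min_copies copies
# in a genome, in at least min_prop of the n genomes.
# Input on stdin (assumed format): family <TAB> genome <TAB> copy count.
filter_families() {
    # $1: number of genomes, $2: min copies per genome, $3: min proportion
    awk -F '\t' -v n="$1" -v min_copies="$2" -v min_prop="$3" '
        # A family counts as "present" in a genome once it has enough copies.
        $3 >= min_copies { present[$1]++ }
        END {
            for (fam in present)
                if (present[fam] / n >= min_prop) print fam
        }
    ' | sort
}

# Toy counts for three genomes (made-up data).
printf 'fam1\tisoA\t6\nfam1\tisoB\t5\nfam2\tisoB\t9\nfam2\tisoA\t2\nfam3\tisoC\t1\n' |
    filter_families 3 4 0.5
```

With these toy counts only fam1 passes, because fam2 reaches four copies in just one of the three genomes.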
There are a couple of pipelines that do repeat annotation, but I haven't seen any that handle multiple genomes particularly well.
Here are some honourable mentions:
- REPET is very comprehensive but famously buggy and difficult to install/configure.
- EDTA looks fairly promising and is probably a good choice for plant genomes.
- PiRATE is quite comprehensive. It is distributed as a virtual machine, and is run via Galaxy within that VM. This is probably convenient for people that only have a few genomes to run and would prefer to avoid the command line.
Other pipelines tend to be focussed on inferring repeats from raw reads (e.g. RepeatExplorer) and don't offer much for genome annotation. There is another category of TE pipeline that focusses on insertion site prediction, including McClintock, TEA, and STEAK. These pipelines are really only useful for organisms with existing, well-curated repeat families and for enabling specific TE-focussed sequencing experiments.
This assumes that you have Singularity and Nextflow installed (see INSTALL).
Say you have a bunch of genome FASTA files in a folder, matched by genomes/*.fasta.
nextflow run darcyabjones/pante -profile singularity -resume --genomes "genomes/*.fasta"
This will run the full pipeline (except for RNAmmer and the RepeatMasker pass using the "species" model). Additional databases like Dfam, Rfam, GyDB, and selected Pfam models will be downloaded as part of the pipeline, but you can also download them beforehand and provide them as arguments.
The results will be written to the results folder.
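Because the basename of each FASTA file is used as the genome name (see the parameter table), two files with the same name in different folders could collide. The sketch below lists the names a set of paths would produce and reports duplicates. It assumes that "basename" means the file name with the directory and final extension removed; the function names are hypothetical.

```shell
# Print the genome name assumed for each FASTA path: the file name without
# its directory and without its final extension (an assumption).
genome_names() {
    for path in "$@"; do
        file="$(basename "$path")"
        printf '%s\n' "${file%.*}"
    done
}

# Print any genome names that occur more than once.
duplicate_genome_names() {
    genome_names "$@" | sort | uniq -d
}

# Example usage (prints nothing if all names are unique):
#   duplicate_genome_names genomes/*.fasta
```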
If you provide the --species parameter, a separate pass of RepeatMasker will be run using the Dfam and/or RepBase databases instead of the custom libraries.
If you would like to include this extra step, I would highly recommend providing the RepBase RepeatMasker database and the corresponding RepeatMasker metadata files if you have access.
If you do have access to RepBase, it's probably worth using it even if you aren't using the --species option, because it might help improve family annotation.
The value given to --species can be any NCBI taxonomy name and is passed to the RepeatMasker option -species.
nextflow run darcyabjones/pante -profile singularity -resume \
--genomes "genomes/*.fasta" \
--repbase "downloads/RepBaseRepeatMaskerEdition-20181026.tar.gz" \
--rm_meta "downloads/RepeatMaskerMetaData-20181026.tar.gz" \
--species "fungi"
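The parameter table asks that the RepeatMasker metadata version matches the RepBase version. A quick pre-flight check might look like the sketch below. It assumes that both tarballs keep an eight-digit date stamp before .tar.gz, as in the file names used above; the function names are hypothetical.

```shell
# Extract the eight-digit date stamp from a name like
# RepBaseRepeatMaskerEdition-20181026.tar.gz (assumed naming scheme).
tarball_version() {
    basename "$1" | sed -n 's/.*-\([0-9]\{8\}\)\.tar\.gz$/\1/p'
}

# Succeed only if both files carry the same, non-empty date stamp.
versions_match() {
    a="$(tarball_version "$1")"
    b="$(tarball_version "$2")"
    [ -n "$a" ] && [ "$a" = "$b" ]
}

if versions_match "downloads/RepBaseRepeatMaskerEdition-20181026.tar.gz" \
                  "downloads/RepeatMaskerMetaData-20181026.tar.gz"; then
    echo "RepBase and RepeatMasker metadata versions match"
else
    echo "Version mismatch or unrecognised file name" >&2
fi
```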
If you would like to include RNAmmer rDNA predictions, you'll need to either install it on all machines that you're running the pipeline on, or build the extra container that does the install for you (see containers/README.md).
Then you can provide the --rnammer flag to enable those steps.
Here I'm assuming that you've installed RNAmmer locally.
nextflow run darcyabjones/pante -profile singularity -resume \
--genomes "genomes/*.fasta" \
--species "fungi" \
--rnammer
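Since the pipeline fails when --rnammer is given but RNAmmer is missing, a local run can guard the flag. This sketch assumes the RNAmmer executable is called rnammer and is found via PATH; it is only meaningful when the tools run on the host rather than inside a container. The function name is hypothetical.

```shell
# Print "--rnammer" only if an executable named rnammer (assumed name)
# can be found on PATH; print nothing otherwise.
rnammer_flag() {
    if command -v rnammer >/dev/null 2>&1; then
        printf '%s\n' "--rnammer"
    fi
}

# Example usage (not run here):
#   nextflow run darcyabjones/pante -resume \
#       --genomes "genomes/*.fasta" $(rnammer_flag)
```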
The pipeline is written in Nextflow, which you will need to install on the executing computer.
The pipeline itself has many dependencies and I have customised RepeatMasker/RepeatModeler a bit, so I HIGHLY recommend that you use the docker or singularity containers.
If you really want to install the software yourself, look in the containers folder and follow the Dockerfiles.
To run the containers, you'll need to install either Singularity (recommended) or Docker.
The pipeline itself will pull the containers for you from Sylabs Cloud or DockerHub.
On an Ubuntu server, the process to install Nextflow and Singularity might look like this.
set -eu
sudo apt-get update
sudo apt-get install -y \
default-jre-headless \
build-essential \
libssl-dev \
uuid-dev \
libgpgme11-dev \
squashfs-tools \
libseccomp-dev \
wget \
pkg-config \
git
VERSION=1.12
OS=linux
ARCH=amd64
wget https://dl.google.com/go/go$VERSION.$OS-$ARCH.tar.gz
sudo tar -C /usr/local -xzvf go$VERSION.$OS-$ARCH.tar.gz
rm go$VERSION.$OS-$ARCH.tar.gz
echo 'export PATH=/usr/local/go/bin:$PATH' >> ~/.bashrc
export PATH=/usr/local/go/bin:$PATH
VERSION=3.4.0
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz
tar -xzf singularity-${VERSION}.tar.gz
cd singularity
./mconfig
make -C builddir
sudo make -C builddir install
cd ..
rm -rf -- singularity
curl -s https://get.nextflow.io | bash
./nextflow run darcyabjones/pante --help
Because RNAmmer has a restricted license, you'll need to download the source files yourself and either install it locally or build a special container that includes it. There are instructions for doing this in containers/README.md.
The strength of pipeline engines like Nextflow is that you can run the same pipeline on different compute systems simply by switching configuration files.
Some preset config files are included in this repo.
You can view these config files in the conf directory.
The configuration to use at runtime is controlled by the -profile parameter.
Multiple profiles can be specified by separating them with a comma, e.g. -profile standard,singularity.
PanTE generally has a separate config file for a compute environment (e.g. cloud, HPC, laptop), and for a software environment (e.g. singularity, docker, local).
It's likely that you'll have to tailor the compute configuration, but you shouldn't need to change the software config so this allows you to mix-and-match.
Available profiles for containerised software environments are:
- singularity - Use a pre-built Singularity image containing all non-proprietary software, available from https://cloud.sylabs.io/library/darcyabjones/default/pante.
- singularity_indiv - Use individual Singularity images for each tool, built locally and stored in containers/singularity.
- singularity_plus - Use an extended version of the pre-built image, which you must build locally and which is stored as containers/singularity/pante-plus.sif. Alternatively you could use the argument -with-singularity path/to/pante-plus.sif.
- docker - Use a pre-built Docker container. Like singularity. Available from https://cloud.docker.com/repository/docker/darcyabjones/pante.
- docker_indiv - Use individual Docker images, which must be built locally. Like singularity_indiv.
- docker_plus - Like singularity_plus but with Docker.
If you don't specify a software environment profile, it is assumed that all dependencies are installed locally and available on your PATH.
NOTE: the docker profiles assume that running docker does not require sudo.
To use them you'll need to configure Docker for "sudo-less" operation (see the Linux post-installation steps in the Docker documentation).
If you don't like this, try singularity :)
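One way to mix a compute profile with whatever software environment is available is sketched below. It is an illustration, not part of the pipeline: the function name is hypothetical, and using `docker info` as a test for sudo-less operation is an assumption.

```shell
# Pick a software profile: prefer singularity, fall back to docker only if
# it works without sudo, otherwise print nothing (tools assumed on PATH).
pick_software_profile() {
    if command -v singularity >/dev/null 2>&1; then
        echo singularity
    elif command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
        echo docker
    fi
}

compute_profile="standard"
software_profile="$(pick_software_profile)"
# Append ",<software>" only when a software profile was found.
profile="${compute_profile}${software_profile:+,$software_profile}"
printf '%s\n' "-profile $profile"
```

The resulting string can be passed straight to nextflow run, for example -profile standard,singularity.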
Available compute profiles are:
- standard - (Default) Appropriate for running on a laptop with 4 CPUs and ~8 GB RAM.
- nimbus - Appropriate for cloud VMs or a local desktop with 16 CPUs and ~32 GB RAM each.
- pawsey_zeus - A config for running on the Pawsey Zeus compute cluster using SLURM. Use this as more of a template for setting up your own profile, as HPC configuration is pretty specific (and in this case contains some hard-coded user options, sorry).
To add your own profile, you can use the files in ./conf as a template, and make sure you add them to nextflow.config under the profiles block.
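As a rough sketch of what adding a profile could involve, the snippet below writes a made-up compute config and the matching profiles entry to a scratch directory so you can inspect them. Everything named here is hypothetical (my_cluster, the queue name, the resource values), and the repo's actual conf files may be organised differently; only the Nextflow keywords (process, executor, queue, cpus, memory, includeConfig) are standard.

```shell
# Write a hypothetical compute profile to a scratch directory for inspection.
workdir="$(mktemp -d)"
mkdir -p "$workdir/conf"

# Made-up SLURM settings; adjust the queue and resources for your own cluster.
cat > "$workdir/conf/my_cluster.config" <<'EOF'
process {
    executor = 'slurm'
    queue = 'workq'
    cpus = 4
    memory = '8 GB'
}
EOF

# The entry you would add inside the profiles block of nextflow.config.
cat > "$workdir/profiles_entry.txt" <<'EOF'
my_cluster {
    includeConfig 'conf/my_cluster.config'
}
EOF

cat "$workdir/conf/my_cluster.config" "$workdir/profiles_entry.txt"
```

Once the entry is in place, the profile would be selected with something like -profile my_cluster,singularity.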
For more info on configuration see the nextflow documentation. You can also raise an issue on the github repository and I'll try to help.
parameter | default | description |
---|---|---|
--genomes | Required | A glob of the FASTA genomes to search for repeats in. The basename of the file is used as the genome name. |
--outdir | results | The directory to store the results in. |
--repbase | Optional | The RepBase RepeatMasker edition tarball to use to construct the RepeatMasker database. Download from https://www.girinst.org/server/RepBase/index.php. |
--rm_meta | Optional | The RepeatMasker meta tarball to use to construct the RepeatMasker database. Download from http://www.repeatmasker.org/libraries/. Make sure the version matches the version of RepBase if you're using RepBase. |
--dfam_hmm | Optional | Pre-downloaded Dfam HMMs to use. Will download the latest if this isn't provided. |
--dfam_hmm_url | URL to Dfam.hmm.gz | The URL to download the Dfam HMMs from if --dfam_hmm isn't provided. |
--dfam_embl | Optional | Pre-downloaded Dfam consensus sequences to use. Will download the latest if this isn't provided. |
--dfam_embl_url | URL to Dfam.embl.gz | The URL to download the Dfam consensus sequences from if --dfam_embl isn't provided. |
--rm_repeatpeps | Optional | Repeat proteins to use for RepeatMasker. By default this is taken from the RepeatMasker Library/RepeatPeps.lib and assumes that you're using the containers. |
--rm_species | Optional | An NCBI taxonomy name to use to predict transposable elements from RepBase with. Something like fungi usually works fine. |
--mitefinder_profiles | Optional | A text file for MiteFinderII containing profiles to search for. Corresponds to [https://github.com/screamer/miteFinder/blob/master/profile/pattern_scoring.txt]. By default will use a file pointed to by the MITEFINDER_PROFILE environment variable, which is set in the provided containers. |
--noinfernal | false | Don't run Infernal cmscan against Rfam. This can save some time. |
--rfam | Optional | Pre-downloaded Rfam CM models (un-gzipped) to use. |
--rfam_url | URL to Rfam.cm.gz | The URL to download Rfam CM models from if --rfam isn't provided. Will not download if --noinfernal is set. |
--rfam_clanin | Optional | Pre-downloaded Rfam clan information to use. |
--rfam_clanin_url | URL to Rfam.clanin | The URL to download Rfam clan info from if --rfam_clanin isn't provided. |
--rfam_gomapping | Optional | Pre-downloaded Rfam GO term mappings to use. |
--rfam_gomapping_url | URL to rfam2go | The URL to download Rfam GO term mappings from if --rfam_gomapping isn't provided. |
--rnammer | false | Run RNAmmer analyses on the genomes. Assumes that you are using the containers with RNAmmer installed or have otherwise set up RNAmmer. Will fail if it isn't installed. |
--pfam | Optional | A glob of Pfam Stockholm-formatted alignments (not gzipped) to use to search against the genomes. |
--pfam_ids | data/pfam_ids.txt | A file containing a list of Pfam accessions to download and use if --pfam isn't provided. |
--gypsydb | Optional | A glob of Stockholm-formatted alignments from GyDB to search against the genomes. |
--gypsydb_url | URL to GyDB_collection.zip | The URL to download GyDB from if --gypsydb is not provided. |
--protein_families | data/proteins/families.stk | A Stockholm-formatted file of custom aligned protein families to search against the genomes. |
--infernal_max_evalue | 0.00001 | The maximum e-value to use to consider cmscan matches significant. |
--mmseqs_max_evalue | 0.001 | The maximum e-value to use to consider MMSeqs profile matches against the genomes significant. |
--min_intra_frequency | 4 | The minimum number of copies a clustered repeat family must have within a genome for it to be considered "present". |
--min_inter_proportion | 0.2 | The minimum proportion of genomes that the clustered repeat family must be present in (after --min_intra_frequency) to be considered a genuine family. |
--repeatmodeler_min_len | 10000 | The minimum scaffold length to allow for predicting repeats in RepeatModeler. Scaffolds smaller than this will be removed to avoid sampling bias. |
--eahelitron_three_prime_fuzzy_level | 3 | Passed on to the EAHelitron parameter -r. |
--eahelitron_upstream_length | 3000 | Passed on to the EAHelitron parameter -u. |
--eahelitron_downstream_length | 50 | Passed on to the EAHelitron parameter -d. |
--ltrharvest_similar | 85 | Passed on to the LTRHarvest parameter -similar. |
--ltrharvest_vic | 10 | Passed on to the LTRHarvest parameter -vic. |
--ltrharvest_seed | 20 | Passed on to the LTRHarvest parameter -seed. |
--ltrharvest_minlenltr | 100 | Passed on to the LTRHarvest parameter -minlenltr. |
--ltrharvest_maxlenltr | 7000 | Passed on to the LTRHarvest parameter -maxlenltr. |
--ltrharvest_mintsd | 4 | Passed on to the LTRHarvest parameter -mintsd. |
--ltrharvest_maxtsd | 6 | Passed on to the LTRHarvest parameter -maxtsd. |
--mitefinder_threshold | 0.5 | Passed on to the MiteFinderII parameter -threshold. |
--trans_table | 1 | The NCBI translation table number to use for MMSeqs searches. |
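Because --repeatmodeler_min_len drops short scaffolds before RepeatModeler runs, it can be worth checking how many scaffolds in a fragmented assembly would survive. The sketch below counts FASTA records at or above a length threshold. The function name is hypothetical, and this only approximates whatever filtering the pipeline itself performs.

```shell
# Count FASTA records whose sequence length is >= the given threshold.
# Reads FASTA from stdin; $1 is the minimum length.
count_long_scaffolds() {
    awk -v min_len="$1" '
        /^>/ {
            # A new header closes the previous record, if there was one.
            if (seen && len >= min_len) kept++
            seen = 1; len = 0; next
        }
        { len += length($0) }
        END {
            if (seen && len >= min_len) kept++
            print kept + 0
        }
    '
}

# Example usage with the default threshold:
#   count_long_scaffolds 10000 < genomes/isolate1.fasta
```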
A test dataset and example command are provided in the test folder.
On a laptop this takes about an hour to run.
I'm hoping to add some better error handling in the future to provide more useful, Nextflow-agnostic tips to users. In the meantime, only input parameter validation is handled gracefully, using the following exit codes:
- 0: All ok.
- 1: Incomplete parameter inputs.
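In a wrapper script, the documented exit codes could be turned into messages along these lines. The function name is hypothetical, and other failures inside Nextflow may also return non-zero codes, so treat the messages as hints rather than a diagnosis.

```shell
# Map a PanTE exit code to a short hint, using only the codes documented above.
explain_exit_code() {
    case "$1" in
        0) echo "All ok." ;;
        1) echo "Incomplete parameter inputs: check required options such as --genomes." ;;
        *) echo "Unexpected exit code $1: check .nextflow.log for details." ;;
    esac
}

# Example usage (not run here):
#   nextflow run darcyabjones/pante -profile singularity --genomes "genomes/*.fasta"
#   explain_exit_code "$?"
```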