Darcy Jones edited this page Aug 24, 2022 · 25 revisions

Predector


Predector is a pipeline to run numerous secretome and fungal effector prediction tools, and to combine them in usable and informative ways.

The pipeline currently includes: SignalP (3, 4, 5, 6), TargetP (v2), DeepLoc, TMHMM, Phobius, DeepSig, CAZyme finding (with dbCAN), Pfamscan, searches against PHI-base, Pepstats, ApoplastP, LOCALIZER, Deepredeff and EffectorP 1, 2, and 3. These results are summarised as a table that includes most information that would typically be used for secretome analysis. Effector candidates are ranked using a learning-to-rank machine learning method, which balances the tradeoff between secretion prediction and effector property prediction, with higher sensitivity, comparable specificity, and better ordering than naive combinations of these features. We recommend that users combine these ranked effector scores with experimental evidence or homology matches to prioritise more expensive efforts (e.g. cloning or structural modelling).

We hope that Predector can become a platform enabling multiple secretome analyses, with a special focus on eukaryotic (currently only fungal) effector discovery. We also seek to establish data-informed best practices for secretome analysis tasks, where previously there was only a loose consensus, and to make those practices easy to follow.

Predector is designed to be run on complete predicted proteomes, as you would get after gene prediction or from databases like UniProt. Although the pipeline will happily run with processed mature proteins or peptide fragments, the analyses run as part of the pipeline are not intended for such input, and any results from it should be treated with extreme caution.

Quick install

Requirements

  • 4 CPUs
  • 8 GB RAM
  • About 20-30 GB of free disk space (~15 GB for all of the software; the rest depends on what you're running).
  • A bash terminal in a Unix-type environment; we primarily test on the current Ubuntu LTS.

Only the last two are really hard requirements. You might get away with running smaller genomes on a smaller computer, but I would say most modern laptops meet these criteria.
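If you're unsure whether your machine meets these requirements, a few standard Linux utilities (not part of Predector) will report them:

```shell
# Number of available CPU cores
nproc
# Total and available RAM
free -h
# Free disk space on the current filesystem
df -h .
```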

Optional - Remove previous software environments for old versions of the pipeline

The software environments that we provide are quite specific to different versions of the pipeline. If you are updating to use a new version of the pipeline, you should also re-build the software environment for the new version.

To remove old versions of the software environment:

# conda
conda env remove -n predector
# or if you installed to a directory
conda env remove -p /path/to/conda/env

# docker
OLD_VERSION=1.1.1
docker rmi -f predector/predector-base:${OLD_VERSION}
docker rmi -f predector/predector:${OLD_VERSION}

Singularity container files can simply be deleted if you like. Docker commands may require sudo depending on how your computer is set up.

1. Install Conda, Docker, or Singularity

We provide automated ways of installing dependencies using conda environments (Linux only), or docker or singularity containers.

Please follow the instructions at one of the following links to install:

We cannot support conda environments on Mac or Windows. This is because some older software included in SignalP 3 and 4 is not compiled for these operating systems, and being closed source we cannot re-compile it. Even Windows WSL2 does not seem to play well with SignalP 4.

Please use a full Linux virtual machine (e.g. a cloud server or locally in VirtualBox) or one of the containerised options.

If you are running conda, we can also build the environment using Mamba. Functionally there is no difference between mamba and conda environments; mamba is just faster at building them. Install mamba into your base conda environment (or install Mambaforge instead of Miniconda) and select the mamba option later.
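For reference, one common way to add mamba to an existing conda installation is the following (this assumes your base environment is writable; see the mamba documentation if your setup differs):

```shell
# Install mamba into the base conda environment from conda-forge
conda install -n base -c conda-forge mamba
```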

Please note. We strongly recommend the containerised environments (docker or singularity) if the option is available to you. The vast majority of issues we've helped people with have been related to poor isolation from the host environment or broken conda dependencies (which we have little control over other than vigilance). Containers sidestep both of these issues. We're more than happy to support conda and help you with any issues, but if you want to avoid issues I suggest using docker or singularity.

2. Download the proprietary software dependencies

Predector runs several tools that we cannot download for you automatically. Please register for and download each of the following tools, and place them all somewhere that you can access from your terminal. Where you have a choice between versions for different operating systems, you should always take the Linux version (even if using Mac or Windows).

Note that DTU (SignalP etc.) don't keep older patch and minor versions available. If the specified version isn't available to download, another version with the same major number should be fine. But please also let us know that the change has happened, so that we can update the documentation and make sure our installers handle it correctly.

I suggest storing these all in a folder and just copying the whole lot around. If you use Predector often, you'll likely re-build the environment fairly often.

We've been having some teething problems with SignalP 6. Until this is resolved I've made installing and running it optional. You don't have to install SignalP 6, though I recommend you try to.

3. Build the conda environment or container

We provide an install script that should install the dependencies for the majority of users.

In the following command, substitute the assigned value of ENVIRONMENT for conda, mamba, docker, or singularity as suitable. Make sure you're in the same directory as the proprietary source archives. If the names below don't match the filenames you have exactly, adjust the command accordingly. For singularity and docker container building you may be prompted for your root password (via sudo).

ENVIRONMENT=docker

curl -s "https://raw.githubusercontent.com/ccdmb/predector/1.2.7/install.sh" \
| bash -s "${ENVIRONMENT}" \
    -3 signalp-3.0.Linux.tar.Z \
    -4 signalp-4.1g.Linux.tar.gz \
    -5 signalp-5.0b.Linux.tar.gz \
    -6 signalp-6.0g.fast.tar.gz \
    -t targetp-2.0.Linux.tar.gz \
    -d deeploc-1.0.All.tar.gz \
    -m tmhmm-2.0c.Linux.tar.gz \
    -p phobius101_linux.tar.gz

This will create the conda environment (named predector), or the docker (tagged predector/predector:1.2.7) or singularity (file ./predector.sif) containers.

Take note of the message given upon completion, which will tell you how to use the container or environment with Predector.

If you don't want to install SignalP 6 you can exclude the -6 filename.tar.gz argument.

The install script has some minor options that allow you to customise how and where things are built. You can also save the install script locally and run install.sh --help to find more information.

-n|--name         -- For conda, sets the environment name (default: 'predector').
                     For docker, sets the image tag (default: 'predector/predector:1.2.7').
                     For singularity, sets the output image filename (default: './predector.sif').
-c|--conda-prefix -- If set, use this as the location to store the built conda
                     environment instead of setting a name and using the default
                     prefix. This is useful if the location that conda is installed in is restricted somehow,
                     or if it isn't available as a shared drive on a cluster.
--conda-template  -- Use this conda environment.yml file instead of downloading it from github.
                     Only affects conda installs.
-v|--version      -- The version of the pipeline that you want to
                     setup dependencies for. Note that this may not work in
                     general, and you're recommended to use the install.sh
                     script for the targeted version.

Note that the -s in bash -s is a bash flag rather than for our install script. It is only necessary if the script is coming into bash from stdin (as in the curl -s URL | bash command), and it tells bash that any further arguments should go to the script rather than the bash runtime itself. If you are running like bash ./install.sh ... you don't need the -s.

4. Install Nextflow

Nextflow requires a bash-compatible terminal and Java version 8+. We require Nextflow version 21 or above. Extended install instructions are available at: https://www.nextflow.io/.

curl -s https://get.nextflow.io | bash

Or using conda (note the quotes, which stop the shell treating `>=` as a redirection):

conda install -c bioconda "nextflow>=21"
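Either way, you can verify the installation afterwards; `nextflow -version` should print version and build information:

```shell
# Check that nextflow is on your PATH and which version you have
nextflow -version
```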

5. Test the pipeline

Use one of the commands below, using the information given upon completion of the dependency install script. Make sure you use the environment that you specified in Step 3.

Using conda:

nextflow run -profile test -with-conda /home/username/path/to/environment -resume -r 1.2.7 ccdmb/predector

Using docker:

nextflow run -profile test,docker -resume -r 1.2.7 ccdmb/predector

# if your docker configuration requires sudo use this profile instead
nextflow run -profile test,docker_sudo -resume -r 1.2.7 ccdmb/predector

Using singularity:

nextflow run -profile test -with-singularity path/to/predector.sif -resume -r 1.2.7 ccdmb/predector

# or if you've built the container using docker and it's in your local docker registry.
nextflow run -profile test,singularity -resume -r 1.2.7 ccdmb/predector

Extended dependency install guide

If the quick install method doesn't work for you, you might need to run the environment build steps manually. It would be great if you could also contact us to report the issue, so that we can get the quick install instructions working for more people.

The following guides assume that you have successfully followed the steps 1, 2, and 4, and aim to replace step 3.

Building the conda environment the long way

We provide a conda environment file that can be downloaded and installed. This environment contains several "placeholder" packages to deal with the proprietary software. Essentially, these placeholder packages contain scripts to take the source files of the proprietary software, and install them into the conda environment for you.

It is necessary to run both of the code blocks below to properly install the environment.

First we create the conda environment, which includes the non-proprietary dependencies and the "placeholder" packages.

# Download the environment config file.
curl -o environment.yml https://raw.githubusercontent.com/ccdmb/predector/1.2.7/environment.yml

# Create the environment
conda env create -f environment.yml
conda activate predector

To complete the installation we need to run the *-register scripts, which install the proprietary source archives you downloaded yourself. You can copy-paste the entire command below directly into your terminal. Modify the source tar archive filenames in the commands if necessary.

signalp3-register signalp-3.0.Linux.tar.Z \
&& signalp4-register signalp-4.1g.Linux.tar.gz \
&& signalp5-register signalp-5.0b.Linux.tar.gz \
&& signalp6-register signalp-6.0g.fast.tar.gz \
&& targetp2-register targetp-2.0.Linux.tar.gz \
&& deeploc-register deeploc-1.0.All.tar.gz \
&& phobius-register phobius101_linux.tar.gz \
&& tmhmm2-register tmhmm-2.0c.Linux.tar.gz

If any of the *-register scripts fail, please contact the authors or raise an issue on github (we'll try to have an FAQ setup soon).

Building the Docker container the long way

For docker and anything that supports docker images, we have a prebuilt container on DockerHub containing all of the open-source components. To install the proprietary software, we use this image as a base to build on with a new Dockerfile. To build the new image with the proprietary dependencies, run the command below, which can be copy-pasted directly into your terminal. Modify the source .tar archive filenames in the command if necessary. Depending on how you installed docker, you may need to use sudo docker in place of docker.

curl -s https://raw.githubusercontent.com/ccdmb/predector/1.2.7/Dockerfile \
| docker build \
  --build-arg SIGNALP3=signalp-3.0.Linux.tar.Z \
  --build-arg SIGNALP4=signalp-4.1g.Linux.tar.gz \
  --build-arg SIGNALP5=signalp-5.0b.Linux.tar.gz \
  --build-arg SIGNALP6=signalp-6.0g.fast.tar.gz \
  --build-arg TARGETP2=targetp-2.0.Linux.tar.gz \
  --build-arg PHOBIUS=phobius101_linux.tar.gz \
  --build-arg TMHMM=tmhmm-2.0c.Linux.tar.gz \
  --build-arg DEEPLOC=deeploc-1.0.All.tar.gz \
  -t predector/predector:1.2.7 \
  -f - \
  .

Your container should now be available as predector/predector:1.2.7 in your docker registry docker images.
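If you want to confirm the build succeeded, you can list the image (this assumes the default tag used throughout this documentation):

```shell
# Prints a row for the image if the build succeeded
docker images predector/predector:1.2.7
```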

Building the Singularity container the long way

There are a few ways to build the singularity image with the proprietary software installed (the filename predector.sif in the sections below).

If you only have singularity installed, you can build the container directly by downloading the .def file and setting some environment variables with the paths to the proprietary source archives. The following commands will build this image for you, and can be copy-pasted directly into your terminal. Modify the source tar archive filenames if necessary.

# This is used to emulate the --build-args functionality of docker.
# Singularity lacks this feature. You can unset the variables after you're done.
export SIGNALP3=signalp-3.0.Linux.tar.Z
export SIGNALP4=signalp-4.1g.Linux.tar.gz
export SIGNALP5=signalp-5.0b.Linux.tar.gz
export SIGNALP6=signalp-6.0g.fast.tar.gz
export TARGETP2=targetp-2.0.Linux.tar.gz
export PHOBIUS=phobius101_linux.tar.gz
export TMHMM=tmhmm-2.0c.Linux.tar.gz
export DEEPLOC=deeploc-1.0.All.tar.gz

# Download the .def file
curl -o ./singularity.def https://raw.githubusercontent.com/ccdmb/predector/1.2.7/singularity.def

# Build the .sif singularity image.
# Note that `sudo -E` is important, it tells sudo to keep the environment variables
# that we just set.
sudo -E singularity build \
  predector.sif \
  ./singularity.def

If you've already built the container using docker, you can convert it to singularity format. You don't need to use sudo even if your docker installation usually requires it.

singularity build predector.sif docker-daemon://predector/predector:1.2.7

Because the container images are quite large, singularity build will sometimes fail if your /tmp partition isn't big enough. In that case, set the following environment variables and remove the cache directory (rm -rf -- "${PWD}/cache") when singularity build is finished.

export SINGULARITY_CACHEDIR="${PWD}/cache"
export SINGULARITY_TMPDIR="${PWD}/cache"
export SINGULARITY_LOCALCACHEDIR="${PWD}/cache"

Copying environments to places where you don't have root user permission

We can't really just put the final container images up on DockerHub or Singularity Hub, since that would violate the proprietary license agreements. So if you don't have root user permission on the computer you're going to run the analysis on (e.g. a supercomputing cluster), you can either use the conda environments or build a container on a different computer and copy the image up.

If the option is available to you, I would recommend using the singularity containers for HPC. Singularity container .sif files can simply be copied to wherever you're running the analysis.

Some supercomputing centres will have shifter installed, which allows you to run jobs with docker containers. Note that there are two versions of shifter, and Nextflow only supports one of them (the NERSC one). Docker containers can be saved as a tarball and copied wherever you like.

# You could pipe this through gzip if you wanted.
docker save predector/predector:1.2.7 > predector.tar

And on the other end

docker load -i predector.tar

Conda environments can be built anywhere, since they don't require root user permission. Just follow the instructions described earlier, and make sure that you install the environment on a shared filesystem (i.e. one that all nodes in your cluster can access).

There are also options for "packing" a conda environment into something that you can copy around (e.g. conda-pack), though we haven't tried this yet.
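We haven't tested this ourselves, but as a sketch, a typical conda-pack workflow (following the conda-pack documentation) looks something like this:

```shell
# On the machine where the environment was built:
conda install -c conda-forge conda-pack          # install conda-pack itself
conda pack -n predector -o predector.tar.gz      # pack the environment

# On the cluster, unpack into a directory and activate:
mkdir -p predector_env
tar -xzf predector.tar.gz -C predector_env
source predector_env/bin/activate
conda-unpack                                     # fix hard-coded path prefixes
```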

Hopefully, one of these options will work for you.

Running the pipeline

To run Predector you just need your input proteomes as uncompressed fasta files.

Assuming that you've installed the dependencies, and know which dependency system you're using (conda, docker, or singularity), you can run like so:

Conda:

nextflow run \
  -resume \
  -r 1.2.7 \
  -with-conda /path/to/conda/env \
  ccdmb/predector \
  --proteome "my_proteomes/*.faa"

Docker:

nextflow run \
  -resume \
  -r 1.2.7 \
  -profile docker \
  ccdmb/predector \
  --proteome "my_proteomes/*.faa"

Singularity:

nextflow run \
  -resume \
  -r 1.2.7 \
  -with-singularity ./path/to/singularity.sif \
  ccdmb/predector \
  --proteome "my_proteomes/*.faa"

Note that a peculiarity of nextflow is that any globbing patterns (e.g. *) need to be in quotes (single or double is fine), and you can't directly provide multiple filenames as you might expect. See below for some ways you can typically provide files to the --proteome parameter.

| Use case | Correct | Incorrect |
| --- | --- | --- |
| Single protein file | --proteome my.fasta | |
| All fasta files in a folder | --proteome "folder/*.fasta" | --proteome folder/*.fasta |
| Directly specify two files | --proteome "{folder/file1.fasta,other/file2.fasta}" (ensure no spaces at the separating comma) | --proteome "folder/file1.fasta other/file2.fasta" |

You can find more info on the Globbing operations that are supported by Nextflow in the Java documentation.

Predector is designed to run with typical proteomes, e.g. with an average of ~15000 proteins. Internally we de-duplicate sequences and split the fasta files into smaller chunks to reduce redundant computation, enhance parallelism, and control peak memory usage. You do not need to concatenate your proteomes together, instead you should keep them separate and use the globbing patterns above. Inputting a single very large fasta file will potentially cause the pipeline to fail in the final steps producing the final ranking and analysis tables, as the "re-duplicated" results can be extremely large. If you are running a task that doesn't naturally separate (e.g. a multi-species dataset downloaded from a UniProtKB query), it's best to chunk the fasta into sets of roughly 20000 (e.g. using seqkit) and use the globbing pattern on those split fastas.
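For example, using seqkit (assuming it is installed; the exact flags may differ between seqkit versions, see seqkit split2 --help):

```shell
# Split a large multi-species fasta into chunks of 20000 sequences each
seqkit split2 --by-size 20000 --out-dir split_fastas uniprot_query.fasta

# Then run the pipeline on the resulting chunks
nextflow run -resume -r 1.2.7 -profile docker ccdmb/predector \
  --proteome "split_fastas/*.fasta"
```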

Command line parameters

To get a list of all available parameters, use the --help argument.

nextflow run ccdmb/predector --help

Important parameters are:

--proteome <path or glob>
  Path to the fasta formatted protein sequences.
  Multiple files can be specified using globbing patterns in quotes.

--phibase <path>
  Path to the PHI-base fasta dataset.

--pfam_hmm <path>
  Path to already downloaded gzipped pfam HMM database
  default: download the hmms

--pfam_dat <path>
  Path to already downloaded gzipped pfam DAT database
  default: download the DAT file

--dbcan <path>
  Path to already downloaded gzipped dbCAN HMM database
  default: download the hmms

--effectordb <path>
  Path to already downloaded gzipped HMMs of effectors.
  default: download from <https://doi.org/10.6084/m9.figshare.16973665>

--precomputed_ldjson <path>
  Path to an ldjson formatted file from previous Predector runs.
  These records will be skipped when re-running the pipeline
  where the sequence is identical and the versions of software
  and databases (where applicable) are the same.
  default: don't use any precomputed results.

--chunk_size <int>
  The number of proteins to run as a single chunk in the pipeline.
  The input fasta files are split into chunks for checkpointing
  and parallelism. Reduce this if you are running into RAM errors,
  but note that nextflow can create a lot of files so this may slow
  your filesystem down. Increase this to produce fewer files,
  but note that the runtime of each task will be longer so increase
  resources accordingly. For a typical fungal proteome (~15k proteins), setting
  to 1000 is suitable. If running >100k proteins, increasing
  chunk size to ~10000 may be appropriate.
  default: 5000

--signalp6_bsize <int>
  This sets the batch size used by the SignalP6 neural network.
  SP6 can use a lot of RAM, and reducing the batch size reduces the memory use
  at the cost of slower speeds. For computers with lots of RAM (e.g. >16GB),
  increasing this to 64 or higher will speed things up.
  For smaller computers try reducing to 10.
  default: 32

--no_localizer
  Don't run LOCALIZER, which can take a long time and isn't strictly needed
  for prediction of effectors (it's more useful for evaluation).

--no_signalp6
  Don't run SignalP v6. We've had several issues running SignalP6.
  This option is primarily here to let users experiencing issues
  finish the pipeline without it.
  If you didn't install SignalP6 in the Predector environment,
  the pipeline will automatically detect this and skip running SignalP6.
  In that case this flag isn't strictly necessary, but potentially useful
  for documenting what was run.
  THIS OPTION WILL BE REMOVED IN A FUTURE RELEASE.

--no_pfam
  Don't download and/or run Pfam and Pfamscan. Downloading Pfam is quite slow,
  even though it isn't particularly big. Sometimes the servers are down too.
  You might also run your proteomes through something like InterProScan, in which
  case you might not need these results. This option lets you keep going without Pfam.

--no_dbcan
  Don't download and/or run searches against the dbCAN CAZyme dataset.
  If you're doing this analysis elsewhere, the dbCAN2 servers are down,
  or you just don't need it, this lets you go without it.

--no_phibase
  Don't download and/or run searches against PHI-base.

--no_effectordb
  Don't download and/or run searches against Effector HMMs.

-r <version>
  Use a specific version of the pipeline. This version must match one of the
  tags on github <https://github.com/ccdmb/predector/tags>.
  In general it is best to specify a version, and all example commands
  in the documentation include this flag.

-latest
  Pull the pipeline again from github. If you have previously run
  Predector, and are specifying a new version to -r, you will need to use
  this parameter.
  See <https://github.com/ccdmb/predector/wiki#running-different-pipeline-versions>

-params-file <path>
  Load command line parameters from this JSON or YAML file rather than
  specifying them on the command line.

-profile <string>
  Specify a pre-set configuration profile to use.
  Multiple profiles can be specified by separating them with a comma.
  Common choices: test, docker, docker_sudo

-c | -config <path>
  Provide a custom configuration file.
  If you want to customise things like how many CPUs different tasks
  can use, whether to use the SLURM scheduler etc, this is the way
  to do it. See the Predector or Nextflow documentation for details
  on how to write these.

-with-conda <path>
  The path to a conda environment to use for dependencies.

-with-singularity <path>
  Path to the singularity container file to use for dependencies.

--outdir <path>
  Base directory to store the pipeline results
  default: 'results'

--tracedir <path>
  Directory to store pipeline runtime information
  default: 'results/pipeline_info'

--nostrip
  Don't strip the proteome filename extension when creating the output filenames
  default: false

--symlink
  Create symlinks to the pipeline results files in the 'results' folder (instead of copying them there).
  Note that this behaviour will save disk space, but the results must be copied (following the symlinks)
  to a different location before cleaning the working directory with 'nextflow clean', see the
  'accessing-and-copying-the-results' section in the documentation for more details.
  default: false

-ansi-log=<true|false>
  The default Nextflow feedback prints and deletes the screen so that it appears as an updating block of text.
  Predector runs a lot of steps so this view usually takes up more than the full screen.
  Additionally this default mode doesn't play well if you re-direct the output to a file (e.g. using nohup or on a slurm cluster).
  Nextflow is supposed to switch when in a non-interactive shell, but I find that it often doesn't.
  If you would like to explicitly disable this Nextflow single screen colourful output, please specify `-ansi-log=false`.

Note: the distinction between parameters starting with - and -- is deliberate, and they shouldn't be mixed up. Those starting with a single hyphen - are Nextflow runtime parameters, which are described at https://www.nextflow.io/docs/latest/cli.html#run. Those starting with two hyphens -- are Predector-defined parameters.
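For example, a command mixing both kinds of parameter might look like this (the local database filenames here are hypothetical; substitute your own downloads):

```shell
# Single-hyphen flags go to Nextflow, double-hyphen flags go to Predector
nextflow run -resume -r 1.2.7 -profile docker ccdmb/predector \
  --proteome "my_proteomes/*.faa" \
  --pfam_hmm ./Pfam-A.hmm.gz \
  --pfam_dat ./Pfam-A.hmm.dat.gz \
  --no_localizer
```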

Manual ranking scores

In the pipeline ranking output tables we also provide manual (i.e. not machine learning) ranking scores for both effectors (manual_effector_score) and secretion (manual_secretion_score). These are provided so that you can customise the ranking if the ML ranker isn't what you want.

NOTE: If you decide not to run specific analyses (e.g. SignalP 6 or Pfam), this may affect comparability between different runs of the pipeline.

These scores are computed by a relatively simple linear function weighting features in the ranking table. You can customise the weights applied to the features from the command line.

In the following table, the overall score is computed as the sum of all feature * weight pairs. The feature names match those in the *-ranked.tsv file. The effector score includes all of the secretion features; it is built on top of the secretion score with additional effector-relevant features.

Note that for some classifiers we subtract 0.5 from the probability and multiply by 2, so that the value lies between -1 and 1 and can both penalise and boost the score.

I've added a special column here, "has_effector_match", which is not in the ranking table. It is composed of four other columns like this:

has_effector_match = has_phibase_effector_match
                  or (effector_matches != '.')
                  or has_dbcan_virulence_match
                  or has_pfam_virulence_match
| score | feature | default weight | command line option |
| --- | --- | --- | --- |
| secretion | is_secreted | 2.0 | --secreted_weight |
| secretion | signalp3_hmm | 0.0001 | --sigpep_ok_weight |
| secretion | signalp3_nn | 0.0001 | --sigpep_ok_weight |
| secretion | phobius | 0.0001 | --sigpep_ok_weight |
| secretion | deepsig | 0.0001 | --sigpep_ok_weight |
| secretion | signalp4 | 0.003 | --sigpep_good_weight |
| secretion | signalp5 | 0.003 | --sigpep_good_weight |
| secretion | signalp6 | 0.003 | --sigpep_good_weight |
| secretion | targetp_secreted | 0.003 | --sigpep_good_weight |
| secretion | multiple_transmembrane | -1 | --multiple_transmembrane_weight |
| secretion | single_transmembrane | -0.7 | --single_transmembrane_weight |
| secretion | deeploc_extracellular | 1.3 | --deeploc_extracellular_weight |
| secretion | deeploc_nucleus | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_cytoplasm | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_mitochondrion | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_cell_membrane | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_endoplasmic_reticulum | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_plastid | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_golgi | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_lysosome | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_peroxisome | -1.3 | --deeploc_intracellular_weight |
| secretion | deeploc_membrane | -1.3 | --deeploc_membrane_weight |
| secretion | targetp_mitochondrial_prob | -0.5 | --targetp_mitochondrial_weight |
| effector | 2 * (effectorp1 - 0.5) | 0.5 | --effectorp1_weight |
| effector | 2 * (effectorp2 - 0.5) | 2.5 | --effectorp2_weight |
| effector | effectorp3_apoplastic | 0.5 | --effectorp3_apoplastic_weight |
| effector | effectorp3_cytoplasmic | 0.5 | --effectorp3_cytoplasmic_weight |
| effector | effectorp3_noneffector | -2.5 | --effectorp3_noneffector_weight |
| effector | 2 * (deepredeff_fungi - 0.5) | 0.1 | --deepredeff_fungi_weight |
| effector | 2 * (deepredeff_oomycete - 0.5) | 0.0 | --deepredeff_oomycete_weight |
| effector | has_effector_match | 2.0 | --effector_homology_weight |
| effector | (!has_effector_match) and has_phibase_virulence_match | 0.5 | --virulence_homology_weight |
| effector | has_phibase_lethal_match | -2 | --lethal_homology_weight |

Note that all DeepLoc probability values except deeploc_membrane will sum to 1, because they come from a single multi-class classifier (see the softmax function for details on how this happens). So the total penalty for DeepLoc "intracellular" localisation can only ever reach --deeploc_intracellular_weight, which requires that deeploc_extracellular is 0. Likewise, the boost from extracellular localisation can only ever reach --deeploc_extracellular_weight, which happens when deeploc_extracellular is 1 (so all the others must be 0).

The high weight assigned to is_secreted and the relatively low weights assigned to individual classifiers are intended to give a general bump to proteins with signal peptides and no TM domains etc., plus a slight boost for proteins that are positively predicted by multiple tools.

Profiles and configuration

Nextflow uses configuration files to specify how many CPUs or RAM a task can use, or whether to use a SLURM scheduler on a supercomputing cluster etc. You can also use these config files to provide parameters.

To select different configurations, you can either use one of the preset "profiles", or you can provide your own Nextflow config files to the -config parameter https://www.nextflow.io/docs/latest/config.html. This enables you to tune the number of CPUs used per task etc to your own computing system.

Profiles

We have several available profiles that configure where to find software, CPU, memory etc.

| type | profile | description |
| --- | --- | --- |
| software | docker | Run the processes in a docker container. |
| software | docker_sudo | Run the processes in a docker container, using sudo docker. |
| software | podman | Run the processes in a container using podman. |
| software | singularity | Run the processes using singularity (by pulling it from the local docker registry). To use a singularity image file, use the -with-singularity image.sif parameter instead. |
| cpu | c4 | Use up to 4 CPUs/cores per computer/node. |
| cpu | c8 | Use up to 8 CPUs/cores ... |
| cpu | c16 | Use up to 16 CPUs/cores ... |
| memory | r4 | Use up to 4 GB RAM per computer/node. |
| memory | r6 | Use up to 6 GB RAM per computer/node. |
| memory | r8 | Use up to 8 GB RAM per computer/node. |
| memory | r16 | Use up to 16 GB RAM ... |
| memory | r32 | Use up to 32 GB RAM ... |
| memory | r64 | Use up to 64 GB RAM ... |
| time | t1 | Limits process time to 1 hr, 5 hr, and 12 hr for short, medium, and long tasks. |
| time | t2 | Limits process time to 2 hr, 10 hr, and 24 hr for short, medium, and long tasks. |
| time | t3 | Limits process time to 3 hr, 15 hr, and 24 hr for short, medium, and long tasks. |
| time | t4 | Limits process time to 4 hr, 20 hr, and 48 hr for short, medium, and long tasks. |
| compute | pawsey_zeus | A combined profile for the Pawsey supercomputing centre's Zeus cluster. This sets cpu, memory, and time parameters appropriate for that cluster. |

You can mix and match these profiles using the -profile parameter. By default, the pipeline behaves as if you ran it with -profile c4,r8 (4 CPUs and 8 GB memory), which should be compatible with most modern laptop computers and smaller cloud instances. You can increase the number of CPUs available, e.g. -profile c16 makes 16 cores available (still with the default 8 GB of memory). To make more memory available, specify one of the r* profiles, e.g. -profile c16,r32.

In general for best performance I suggest specifying the profiles with the largest number of CPUs that you have available on the computer you're running on. For example if you are running on a computer with 8 CPUs and 32 GB of RAM specify -profile c8,r32. This will allow the pipeline to make the best use of your available resources.

The time profiles (t*) are useful for limiting the running times of tasks. By default the times are not limited, but limits are useful if you are running on a supercomputing cluster (specifying times can get you through the queue faster) or on commercial cloud computing services (so you don't rack up an unexpected bill if something stalls somehow).

To combine all of these; e.g. to use docker containers with extra RAM and CPUs, you can provide the profile -profile c16,r32,t2,docker.

Custom configuration

If the preset profiles don't meet your needs you can provide a custom config file. Extended documentation can be found here: https://www.nextflow.io/docs/latest/config.html.

I'll detail some pipeline specific configuration below but I suggest you start by copying the file https://github.com/ccdmb/predector/tree/master/conf/template_single_node.config and modify as necessary.

If you have questions about this, or want to suggest a configuration for us to officially distribute with the pipeline please file an issue or start a discussion.

Each Nextflow task is labelled with its software name, CPU, RAM, and time requirements. In the config files, you can select tasks by these labels.

kind label description
cpu cpu_low Used for single threaded tasks. Generally doesn't need to be touched.
cpu cpu_medium Used for parallelised tasks that are IO bound, e.g. SignalP 3 & 4, DeepLoc etc.
cpu cpu_high Used for parallelised tasks that use lots of CPUs efficiently. Usually this should be all available CPUs.
memory ram_low Used for processes with low RAM requirements, e.g. downloads.
memory ram_medium Used for tasks with moderate RAM requirements, and many of the parallelised tasks (e.g. with cpu_medium).
memory ram_high Used for tasks with high RAM requirements. Usually this should be all available RAM.
time time_short Used with tasks that should be super quick like sed or splitting files etc (1 or 2 hours at the very most).
time time_medium Used for more expensive tasks; most parallelised tasks should be able to complete within this time (e.g. 5-10 hours).
time time_long Used for potentially long-running tasks, or tasks with times that depend on external factors, e.g. downloads.
software download Software environment for downloading things (i.e. contains wget).
software posix Software environment for general POSIX/GNU tools.
software predectorutils Software environment for tasks that use the Predector-utils scripts.
software signalp3
software signalp4
software signalp5
software signalp6
software deepsig
software phobius
software tmhmm
software deeploc
software apoplastp
software localizer
software effectorp1
software effectorp2
software effectorp3
software deepredeff
software emboss
software hmmer3
software pfamscan
software mmseqs
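To sketch what selecting tasks by label looks like in practice, here is a hypothetical custom config fragment. The withLabel selector is standard Nextflow configuration syntax, but the resource values below are illustrative examples only, not recommendations:

```groovy
// Hypothetical custom.config -- values are examples, adjust for your machine.
// Use with: nextflow run -c custom.config ccdmb/predector ...
process {
  withLabel: cpu_high {
    cpus = 16          // tasks that parallelise well get all cores
  }
  withLabel: ram_high {
    memory = 32.GB     // tasks with high RAM requirements
  }
  withLabel: time_long {
    time = '24h'       // cap potentially long-running tasks
  }
  // You can also target a single program by its software label:
  withLabel: signalp6 {
    memory = 16.GB
  }
}
```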

Running different pipeline versions

We pin the version of the pipeline to run in all of our example commands with the -r 1.2.7 parameter. This flag is optional, but recommended so that you know which version you ran. Different versions of the pipeline may output different scores, use different parameters, have different output formats etc. It also reinforces the link between the pipeline version and the docker container tags.

If you specify the pipeline to run as ccdmb/predector, Nextflow will pull the git repository from GitHub and put it in a local cache. Unfortunately, if you change the version number provided to -r and that version is not in the local copy of the repository, you will get an error (see Common issues). If you have previously run Predector and want to update it to use a new version, you can do one of the following:

  1. Provide the new version to the -r parameter, and add the -latest flag to tell Nextflow to pull new changes from the GitHub repository. Likewise, you can run old versions of the pipeline by simply changing -r.
nextflow run -r 1.2.7 -latest ccdmb/predector --proteome "my_proteins.fasta"
  2. You can ask Nextflow to pull new changes without running the pipeline using nextflow pull ccdmb/predector.

  3. You can ask Nextflow to delete the local copy of the repository entirely using nextflow drop ccdmb/predector. Nextflow will then pull a fresh copy the next time you run the pipeline.

If you get an error about missing git tags when running either of the first two options, try the third option (drop). This can happen if we delete old development tags to clean up the repository.

Note that the software environments (conda, docker, singularity) often will not be entirely compatible between versions. You should generally rebuild the container or conda environment from scratch when changing versions. I suggest keeping copies of the proprietary dependencies handy in a folder or archive, and just building and removing the container/environment as you need it.

Providing pre-downloaded Pfam, PHI-base, and dbCAN datasets

Sometimes the Pfam or dbCAN servers can be slow for downloads, and they are occasionally unavailable, which will cause the pipeline to fail. You may want to keep the downloaded databases to reuse them (or pre-download them).

If you've already run the pipeline once, the databases will be in the results folder (or wherever you pointed --outdir), so you can do:

cp -rL results/downloads ./downloads
nextflow run \
  -profile test \
  -resume ccdmb/predector \
  --phibase phi-base_current.fas \
  --pfam_hmm downloads/Pfam-A.hmm.gz \
  --pfam_dat downloads/Pfam-A.hmm.dat.gz \
  --dbcan downloads/dbCAN.txt \
  --effectordb downloads/effectordb.hmm.gz

This will skip the download step at the beginning and just use those files, which saves a few minutes.

You can also download the files from:

Providing pre-computed results to skip already processed proteins

Predector can now take results of previous Predector runs to skip re-running individual analyses of identical proteins. This is decided based on a checksum of the processed sequence, the version of the software, and the version of the database (when applicable). If all three match, we will skip that analysis for that protein.

In the deduplicated folder is a file called new_results.ldjson. This contains all of the results from the current run of Predector. Hold on to this file, and provide it to the --precomputed_ldjson argument the next time you run the pipeline. You can concatenate multiple of these files together (e.g. cat dedup1.ldjson dedup2.ldjson > my_precomputed.ldjson) to maintain a set of precomputed results in the long term.

Note that the new_results.ldjson file will not contain any of the results that you provided to the --precomputed_ldjson argument. This is to avoid adding too many duplicate entries when you concatenate the files. Duplicate entries aren't a problem (we deal with them internally), but they do slow things down and make the files bigger.

Here's a basic workflow using precomputed results.

nextflow run -profile docker -resume -r 1.2.7 ccdmb/predector \
  --proteome my_old_proteome.fasta

cp -L results/deduplicated/new_results.ldjson ./precomputed.ldjson

nextflow run -profile docker -resume -r 1.2.7 ccdmb/predector \
  --proteome my_new_proteome.fasta --precomputed_ldjson ./precomputed.ldjson

cat results/deduplicated/new_results.ldjson >> ./precomputed.ldjson

Any proteins in the first proteome will be skipped when you run the new one. I imagine this should speed up running new proteomes or re-running a newer version of the pipeline, as the actual versions of the software behind it don't change often.

Note that database searches are only assigned versions if the pipeline downloads the actual files. If you provide pre-downloaded copies, the pipeline won't skip these searches. This is just because we can't figure out what version a database is from the filename, and it ensures consistency. The database searches are not a particularly time-consuming part of the pipeline anyway, so I don't expect this to be a big issue. Please let us know if you feel otherwise.

Future versions may be able to download precomputed results from a server. It's something we're working on.

Accessing and copying results

Nextflow will dump a bunch of things in the directory that you run it in, and if you've run a lot of datasets this might take up a lot of space, or the many files might slow down your filesystem. By default Nextflow avoids this by symbolically linking files in the results directory to the work directory. This means, however, that if you delete the work directory, you lose the results. As this is not obvious to users unfamiliar with Nextflow, as of Predector version 1.2.7 we copy the results instead of symlinking.

If you need to conserve disk space, you can recover the original sym-linking behaviour by adding the --symlink flag at runtime. Note that when copying files from the results directory, make sure you use the -L flag to cp to ensure that the contents of the file are copied rather than just another symbolic link to the file in work.

e.g.

cp -rL results/ copied_results/

Cleaning up

Once you've got what you need from the results folder:

rm -rf -- work results .nextflow*

This will clean up what's in your working directory. Alternatively, you can use the nextflow clean command for more control over what is removed from work etc.

Pipeline output

Predector outputs several files for each input file that you provide, and some additional ones that can be useful for debugging results.

Results will always be placed under the directory specified by the parameter --outdir (./results by default).

Downloaded databases (i.e. Pfam and dbCAN) are stored in the downloads subdirectory. Predector internally removes duplicate sequences at the start to avoid redundant computation and reduplicates them at the end. The deduplicated folder contains the deduplicated sequences, results, and a mapping file of the old ids to new ones.

Other directories will be named after the input filenames and each contains several tables. An example set of these results is available in the test directory on GitHub.

deduplicated/

The deduplicated folder contains the deduplicated sequences and a tab-separated values file mapping the deduplicated sequence ids to their input filenames and original ids. Deduplicated sequences may not be identical to the input sequences, as we do some "cleaning" before running the pipeline to avoid common issues that cause software crashes. Briefly: sequences are uppercased, * characters are removed from the ends, - characters are removed, and any remaining *JBZUO characters are replaced with X.
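As a rough illustration of that cleaning (our own sketch, not the pipeline's actual code), the rules can be expressed in a few lines of Python:

```python
import re

def clean_sequence(seq: str) -> str:
    """Approximate Predector's input cleaning: uppercase, strip * from
    the ends, drop gap characters (-), and replace any remaining
    ambiguous *JBZUO characters with X."""
    seq = seq.upper()
    seq = seq.strip("*")        # remove * from the ends
    seq = seq.replace("-", "")  # remove gap characters
    return re.sub(r"[*JBZUO]", "X", seq)

print(clean_sequence("mktl-jazb*"))  # -> MKTLXAXX
```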

The deduplicated.tsv file has four columns:

Column Type Description
deduplicated_id str The ID of the unique sequence in the deduplicated results
input_file str The input filename that the sequence came from
original_id str The ID of the protein in the input file
checksum str A hashed version of the input amino-acid sequence that we use to detect duplicate sequences. The checksums are created with the seguid function in BioPython
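For reference, SEGUID is simply the Base64-encoded SHA-1 digest of the uppercased sequence with the trailing padding removed. BioPython's Bio.SeqUtils.CheckSum.seguid is the canonical implementation; this stdlib-only sketch shows the idea:

```python
import base64
import hashlib

def seguid(seq: str) -> str:
    """SEGUID checksum: Base64-encoded SHA-1 of the uppercased
    sequence, with the trailing '=' padding stripped."""
    digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Because a SHA-1 digest is 20 bytes, the checksum is always 27 characters long, and case differences in the input do not change it.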

This folder also contains two .ldjson files. deduplicated.ldjson contains all results of analyses on the deduplicated sequences. new_results.ldjson contains a subset of the results in deduplicated.ldjson suitable for input as pre-computed input to the pipeline.

analysis_software_versions.tsv

This is a table containing the software and database (where relevant) versions of the analyses that Predector has run.

It has a simple three-column structure: analysis, software_version, database_version. If the analysis doesn't use a database, or we cannot determine which version of the database you're using (because you provided it yourself rather than letting the pipeline download it), then the database_version column will be an empty string.

*-ranked.tsv

This is the main output table that includes the scores and most of the parameters that are important for effector or secretion prediction. There are a lot of columns, though generally you'll only be interested in a few of them.

Column Data type Description Notes
seqid String The protein name in the fasta input you provided
effector_score Float The Predector machine learning effector score for this protein
manual_effector_score Float The manually created effector score, which is the sum of the products of several values in this spreadsheet See manual ranking scores for details
manual_secretion_score Float The manually created secretion score, which is the sum of the products of several values in this spreadsheet
effector_matches String A comma separated list of the significant matches to the curated set of fungal effector HMMs If you are interested in knowing more about matches, see https://doi.org/10.6084/m9.figshare.16973665 under effectordb.tsv for details and links to papers describing functions. Matches are sorted by evalue, so the first hit is the best.
phibase_genes String A comma separated list of PHI-base gene names that were significant hits to this sequence Matches are sorted by evalue, so the first hit is the best.
phibase_phenotypes String A comma separated list of the distinct PHI-base phenotypes in the significant hits to PHI-base Phenotypes are sorted by minimum evalue for matches with that phenotype.
phibase_ids String A comma separated list of the PHI-base entries that were significant hits You can find out more details about PHI-base matches here http://www.phi-base.org/, which will include links to literature describing experimental results. If you do publish relevant experiments on virulence factors or effectors or know of entries not in PHI-base, please do consider helping them curate https://canto.phi-base.org/. Matches are sorted by evalue, so the first hit is the best.
has_phibase_effector_match Boolean [0, 1] Indicates whether the protein had a significant hit to one of the phibase phenotypes: Effector, Hypervirulence, or loss of pathogenicity
has_phibase_virulence_match Boolean [0, 1] Indicates whether the protein had a significant hit with the phenotype "reduced virulence"
has_phibase_lethal_match Boolean [0, 1] Indicates whether the protein had a significant hit with the phenotype "lethal"
pfam_ids List A comma separated list of all Pfam HMM ids matched You can find details on Pfam match entries at http://pfam.xfam.org (use the "Jump to" search boxes with this ID). Matches are sorted by evalue, so the first hit is the best.
pfam_names List A comma separated list of all Pfam HMM names matched Matches are sorted by evalue, so the first hit is the best.
has_pfam_virulence_match Boolean [0, 1] Indicating whether the protein had a significant hit to one of the selected Pfam HMMs associated with virulence function A list of virulence associated Pfam entries is here: https://github.com/ccdmb/predector/blob/master/data/pfam_targets.txt
dbcan_matches List A comma separated list of all dbCAN matches You can find details on CAZYme families at http://www.cazy.org/. For more on dbCAN specifically see here https://bcb.unl.edu/dbCAN2/. Matches are sorted by evalue, so the first hit is the best.
has_dbcan_virulence_match Boolean [0, 1] Indicating whether the protein had a significant hit to one of the dbCAN domains associated with virulence function A list of virulence associated dbCAN entries is here: https://github.com/ccdmb/predector/blob/master/data/dbcan_targets.txt
effectorp1 Float The raw EffectorP v1 prediction pseudo-probability Values above 0.5 are considered to be effector predictions
effectorp2 Float The raw EffectorP v2 prediction pseudo-probability Values above 0.5 are considered to be effector predictions, Values below 0.6 are annotated in the raw EffectorP output as "unlikely effectors"
effectorp3_cytoplasmic Float or None '.' The EffectorP v3 prediction pseudo-probability for cytoplasmic effectors EffectorP only reports probabilities for classifiers over 0.5, '.' indicates where the value is not reported by EffectorP v3
effectorp3_apoplastic Float or None '.' As for effectorp3_cytoplasmic but for apoplastic effector probability
effectorp3_noneffector Float or None '.' As for effectorp3_cytoplasmic but for non-effector probability
deepredeff_fungi Float The deepredeff fungal effector classifier pseudo probability Values above 0.5 are considered to be effector predictions
deepredeff_oomycete Float The deepredeff oomycete effector classifier pseudo probability Values above 0.5 are considered to be effector predictions
apoplastp Float The raw ApoplastP "apoplast" localised prediction pseudo probability Values above 0.5 are considered to be apoplastically localised
is_secreted Boolean [0, 1] Indicates whether the protein had a signal peptide predicted by any method, and does not have $\ge$ 2 transmembrane domains predicted by either TMHMM or Phobius
any_signal_peptide Boolean [0, 1] Indicates whether any of the signal peptide prediction methods predict the protein to have a signal peptide
single_transmembrane Boolean [0, 1] Indicates whether the protein is predicted to have 1 transmembrane domain by TMHMM or Phobius (and not >1 for either), and in the case of TMHMM the predicted number of TM AAs in the first 60 residues is less than 10
multiple_transmembrane Boolean [0, 1] Indicates whether a protein is predicted to have more than 1 transmembrane domain by either Phobius or TMHMM
molecular_weight Float The predicted molecular weight (Daltons) of the protein
residue_number Integer The length of the protein or number of residues/AAs
charge Float The overall predicted charge of the protein
isoelectric_point Float The predicted isoelectric point of the protein
aa_c_number Integer The number of Cysteine residues in the protein
aa_tiny_number Integer The number of tiny residues (A, C, G, S, or T) in the protein
aa_small_number Integer The number of small residues (A, B, C, D, G, N, P, S, T, or V) in the protein
aa_aliphatic_number Integer The number of aliphatic residues (A, I, L, or V) in the protein
aa_aromatic_number Integer The number of aromatic residues (F, H, W, or Y) in the protein
aa_nonpolar_number Integer The number of non-polar residues (A, C, F, G, I, L, M, P, V, W, or Y) in the protein
aa_charged_number Integer The number of charged residues (B, D, E, H, K, R, or Z) in the protein
aa_basic_number Integer The number of basic residues (H, K, or R) in the protein
aa_acidic_number Integer The number of acidic residues (B, D, E or Z) in the protein
fykin_gap Float The number of FYKIN residues + 1 divided by the number of GAP residues + 1 Testa et al. 2016 describe RIP affected regions as being enriched for FYKIN residues, and depleted in GAP residues
kex2_cutsites List A comma separated list of potential matches to Kex2 motifs These each take the form of <match>:<pattern>[&<pattern>[...]]:<start>-<end>, where match is the actual motif in your protein, and pattern is one of [LIJVAP]X[KRTPEI]R, LXXR, [KR]R. If multiple patterns match at the same position, they will be listed separated by &, e.g. the motif LAKR might output LAKR:[LIJVAP]X[KRTPEI]R&LXXR:10-13 since both patterns match that motif at that position. Positions are start and end inclusive (like GFF3). See Outram et al. 2021 for a recent brief review. Note that these are simple regular expression matches and there has been no processing. Use with some caution
rxlr_like_motifs List A comma separated list of potential RxLR-like motifs [RKH][A-Z][LMIFYW][A-Z] as described by Kale et al. 2010 Each take the form of <match>:<start>-<end>. Positions are start and end inclusive (like GFF3). Note that this is a simple regular expression match, it tends to be quite non-specific, and the function of these motifs remains controversial. Use with some caution
localizer_nucleus Boolean [0, 1] Indicates whether LOCALIZER predicted an internal nuclear localisation peptide. These predictions are run on all proteins with the first 20 AAs trimmed from the start to remove any potential signal peptides
localizer_chloro Boolean [0, 1] Indicates whether LOCALIZER predicted an internal chloroplast localisation peptide
localizer_mito Boolean [0, 1] Indicates whether LOCALIZER predicted an internal mitochondrial localisation peptide
signal_peptide_cutsites List A comma separated list of predicted signal-peptide cleavage sites Each will take the form <program_name>:<last_base_in_sp>. So the mature peptide is expected to begin after the number
signalp3_nn Boolean [0, 1] Indicates whether the protein is predicted to have a signal peptide by the neural network model in SignalP 3
signalp3_hmm Boolean [0, 1] Indicates whether the protein is predicted to have a signal peptide by the HMM model in SignalP 3
signalp4 Boolean [0, 1] Indicates whether the protein is predicted to have a signal peptide by SignalP 4
signalp5 Boolean [0, 1] Indicates whether the protein is predicted to have a signal peptide by SignalP 5
signalp6 Boolean [0, 1] Indicates whether the protein is predicted to have a signal peptide by SignalP 6
deepsig Boolean [0, 1] Indicates whether the protein is predicted to have a signal peptide by DeepSig
phobius_sp Boolean [0, 1] Indicates whether the protein is predicted to have a signal peptide by Phobius
phobius_tmcount Integer The number of transmembrane domains predicted by Phobius
phobius_tm_domains List A comma separated list of transmembrane domain predictions from Phobius. Each will have the format <start>-<end> Positions are start and end inclusive (like GFF3). We also add the prefix tm: to this column. This is to prevent Excel from interpreting these entries as dates.
tmhmm_tmcount Integer The number of transmembrane domains predicted by TMHMM
tmhmm_first_60 Float The predicted number of transmembrane AAs in the first 60 residues of the protein by TMHMM
tmhmm_exp_aa Float The predicted number of transmembrane AAs in the protein by TMHMM
tmhmm_first_tm_sp_coverage Float The proportion of the first predicted TM domain that overlaps with the median predicted signal-peptide cut site Where no signal peptide or no TM domains are predicted, this will always be 0
tmhmm_domains List A comma separated list of transmembrane domains predicted by TMHMM. Each will have the format <start>-<end> Positions are start and end inclusive (like GFF3). We also add the prefix tm: to this column. This is to prevent Excel from interpreting these entries as dates.
targetp_secreted Boolean [0, 1] Indicates whether TargetP 2 predicts the protein to be secreted
targetp_secreted_prob Float The TargetP pseudo-probability of secretion
targetp_mitochondrial_prob Float The TargetP pseudo-probability of mitochondrial localisation
deeploc_membrane Float DeepLoc pseudo-probability of membrane association
deeploc_nucleus Float DeepLoc pseudo-probability of nuclear localisation Note that all DeepLoc values other than "membrane" are from the same classifier, so the sum of all of the pseudo-probabilities will be 1
deeploc_cytoplasm Float DeepLoc pseudo-probability of cytoplasmic localisation
deeploc_extracellular Float DeepLoc pseudo-probability of extracellular localisation
deeploc_mitochondrion Float DeepLoc pseudo-probability of mitochondrial localisation
deeploc_cell_membrane Float DeepLoc pseudo-probability of cell membrane localisation
deeploc_endoplasmic_reticulum Float DeepLoc pseudo-probability of ER localisation
deeploc_plastid Float DeepLoc pseudo-probability of plastid localisation
deeploc_golgi Float DeepLoc pseudo-probability of golgi apparatus localisation
deeploc_lysosome Float DeepLoc pseudo-probability of lysosomal localisation
deeploc_peroxisome Float DeepLoc pseudo-probability of peroxisomal localisation
signalp3_nn_d Float The raw D-score for the SignalP 3 neural network
signalp3_hmm_s Float The raw S-score for the SignalP 3 HMM predictor
signalp4_d Float The raw D-score for SignalP 4 See discussion of choosing multiple thresholds in the SignalP FAQs
signalp5_prob Float The SignalP 5 signal peptide pseudo-probability
signalp6_prob Float The SignalP 6 signal peptide pseudo-probability
deepsig_signal_prob Float or None . The DeepSig signal peptide pseudo-probability. Note that DeepSig only outputs the probability of the main prediction, so any proteins with a Transmembrane or Other prediction will be None (.) here.
deepsig_transmembrane_prob Float or None . The DeepSig transmembrane pseudo-probability.
deepsig_other_prob Float or None . The DeepSig "other" (i.e. not signal peptide or transmembrane) pseudo-probability.

*.gff3

This file contains gff3 versions of results from analyses that have some positional information (e.g. signal/target peptides or alignments). The columns are:

Column Type Description
seqid str The protein seqid in your input fasta file.
source str The analysis that gave this result. Note that for database matches, both the software and database are listed, separated by a colon (:).
type str The closest Sequence Ontology term that could be used to describe the region.
start int The start of the region being described (1-based).
end int The end of the region being described (1-based inclusive).
score float The score of the match if available. For MMSeqs2 and HMMER matches, this is the e-value. For SignalP 3-nn and 4 this will be the D-score, for SignalP 3-hmm this is the S-probability, and for SignalP5, DeepSig, TargetP and LOCALIZER mitochondrial or chloroplast predictions this will be the probability score.
strand +, -, or . This will always be unstranded (.), since proteins don't have direction in the same way nucleotides do.
phase 0, 1, 2, or . This will always be . because it is only valid for CDS features.
attributes A semi-colon delimited list of key=value pairs The remaining raw results and scores are stored here. Of particular interest are the Gap and Target attributes, which define what database match an alignment found, the bounds in the matched sequence, and match/mismatch positions. Some punctuation characters will be escaped using URL escape rules. For example, commas , will be escaped as %2C.

You can map these GFF3 results onto your genomes!
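Because punctuation in the attributes column is URL-escaped, remember to unescape values when parsing. A minimal Python sketch (our own illustration, using only the standard library):

```python
from urllib.parse import unquote

def parse_gff3_attributes(field: str) -> dict:
    """Split a GFF3 attributes column into a dict, URL-unescaping
    each value (e.g. '%2C' back to ',')."""
    attrs = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = unquote(value)
    return attrs

print(parse_gff3_attributes("ID=prot1;Note=matches%2Cstuff"))
# -> {'ID': 'prot1', 'Note': 'matches,stuff'}
```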

Individual results tables

There are a bunch of tables that are just TSV versions of the original outputs. Most of the tools' outputs are not well described and are not in convenient formats for parsing, so we don't keep them around. We've done our best to retain all of the information from the original formats in the TSV versions.

The original formats are described in:

DeepLoc doesn't have any output format documentation that I can find, but hopefully it's pretty self explanatory for you. Note that all DeepLoc values other than "membrane" are from the same classifier, so the sum of all of the pseudo-probabilities will be 1.

*.ldjson

LDJSON (aka JSONL or NDJSON) is the common file format that we use to store results. It is a plain text file, where each line is a valid JSON document.

The basic structure of each line is as follows (indentation and newlines added for clarity).

{
    "analysis": str,
    "checksum": str,
    "software": str,
    "software_version": str,
    "database": Optional[str],
    "database_version": Optional[str],
    "pipeline_version": str,
    "data": analysis specific object,
}

The data field contains the actual results from the analysis, which will be specific to each analysis type. The fields in data will represent parsed elements from the original software output.

Line delimited JSON can be parsed in most programming languages fairly easily. E.g. in python3

import json
results = []
with open("results.ldjson", "r") as handle:
    for line in handle:
        sline = line.strip()
        if sline == "":
            continue
        result = json.loads(sline)
        results.append(result)

pipeline_info/

Contains details of how the pipeline ran. Each file shows run-times, memory and CPU usage etc.

Linking vs copying results

By default in Predector the results of the pipeline are copied from the work folder to the results folder. Note that this is not the default behaviour for Nextflow pipelines, which instead symbolically link result files from the work directory to the specified output directory. If you want to recover the default Nextflow behaviour (i.e. symlinking rather than copying results), you can use the --symlink parameter.

Symlinking saves some space and time, but requires a bit of extra care when copying and deleting files. If you delete the work folder you will also be deleting the actual contents of the results, and you'll be left with a pointer to a non-existent file. Make sure you copy any files that you want to keep before deleting anything.

If you use the linux cp command to copy results, please make sure to use the -L flag. This ensures that you copy the contents of the file rather than just copying another link to the file. rsync also requires using an -L flag to copy the contents rather than a link. scp will always follow links to copy the contents, so no special care is necessary.

If you use a different tool, please make sure that it copies the contents.

Mapping results to genome coordinates for genome browsers

The GFFs and score results from Predector can be projected onto your genomes for easy visualisation with genome browsers. We use the GFF that you used to extract the proteins to map protein coordinates back onto CDS features in your genome GFF.

Both utilities are provided as part of the predector-utils package (https://github.com/ccdmb/predector-utils).

predutils map_to_genome maps the predector .gff3 results to your genome, producing a genomic GFF. predutils score_to_genome maps scores from the -ranked.tsv table onto your genome, producing bedgraph files (https://bedtools.readthedocs.io/en/latest/content/tools/unionbedg.html)

Since predector-utils is installed as part of the Predector environment, you can use the same environment.

# conda
conda activate predector
predutils map_to_genome -o test_set-genomic.gff3 test_set.gff3 results/test_set/test_set.gff3
predutils score_to_genome -o test_set-scores.bedgraph test_set.gff3 results/test_set/test_set-ranked.tsv


# docker
docker run --rm -it \
  -v "${PWD}":/data:rw \
  -w /data ccdmb/predector:1.2.7 \
  "predutils map_to_genome -o test_set-genomic.gff3 test_set.gff3 results/test_set/test_set.gff3"

docker run --rm -it \
  -v "${PWD}":/data:rw \
  -w /data ccdmb/predector:1.2.7 \
  "predutils score_to_genome -o test_set-scores.bedgraph test_set.gff3 results/test_set/test_set-ranked.tsv"

# Singularity
# singularity automatically mounts the current working directory
singularity exec ./predector.sif predutils map_to_genome -o test_set-genomic.gff3 test_set.gff3 results/test_set/test_set.gff3
singularity exec ./predector.sif predutils score_to_genome -o test_set-scores.bedgraph test_set.gff3 results/test_set/test_set-ranked.tsv

For more details and options, see the predector utils documentation for map_to_genome and score_to_genome. In particular, you may need to change the --id parameter which tells us how the protein names correspond to entries in your genome GFF.

Common issues

Cannot find revision `X.X.X` -- Make sure that it exists in the remote repository `https://github.com/ccdmb/predector`

All of our code examples specify the pipeline version number. This ensures that the correct dependencies are used and that it's always clear what is actually being run.

Unfortunately this can cause a few issues if you have previously run the pipeline using the same computer. You can read about this in more detail in the section "Running different pipeline versions".

Try the following steps to resolve the issue.

  1. Double check that the specified version number is actually a tagged version of the pipeline (https://github.com/ccdmb/predector/tags).
  2. Try re-running the pipeline with the -latest flag included, e.g. nextflow run -r X.X.X -latest ccdmb/predector --proteome "proteomes/*".
  3. Try pulling updates down from GitHub directly with nextflow: nextflow pull ccdmb/predector. Then try re-running the pipeline.
  4. Try asking nextflow to delete its local copy of Predector from its cache: nextflow drop ccdmb/predector. Then try re-running the pipeline.

If you're still having problems after this, please either email us or raise an issue on GitHub.

Running with docker Unable to find image 'predector/predector:X.X.X' locally

This usually means that you haven't built the docker image locally. Remember that we cannot distribute some of the dependencies, so you need to build the container image and move it to where you'll be running.

Please check that you have the docker container in your local registry:

docker images

It's also possible that you built a different environment (e.g. conda or singularity). Check conda info -e, or look for any .sif files in the directory where your source archives are.

Another possibility is that you are trying to run the pipeline using a container built for a different version of the pipeline. Please check that the version tag in docker images is the same as the pipeline that you're trying to run. Update the pipeline if necessary using nextflow pull ccdmb/predector.

Running with singularity ERROR : Failed to set loop flags on loop device: Resource temporarily unavailable.

This is caused by nextflow trying to launch lots of tasks with the same singularity image at the same time. Updating singularity to version >= 3.5 should resolve the issue.

Connecting to XXX... failed: Connection timed out.

We automatically download Pfam, dbCAN, and PHI-base by default while running the pipeline. Sometimes these sources will be unavailable (e.g. for maintenance or they've just crashed), and sometimes the URLs to these data will change.

It is possible for you to download these data separately, and provide the files to the pipeline as described here. If you find yourself running the pipeline often it might be a good thing to keep a downloaded copy handy.

In the case that the servers are down, unfortunately we can't do much. But if the URL appears to have changed, we would appreciate it if you could let us know so that we can resolve the issue.
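With local copies in hand, a run might look something like the following sketch. The parameter names (--pfam_hmm, --pfam_dat, --dbcan, --phibase) and the filenames are assumptions based on the pre-downloaded data documentation, so check them against the docs for the version you're running, and replace X.X.X with that version.

```shell
# Sketch only: point the pipeline at locally downloaded database copies.
# Verify parameter names and filenames against the current documentation.
nextflow run -r X.X.X ccdmb/predector \
  --proteome "proteomes/*" \
  --pfam_hmm downloads/Pfam-A.hmm.gz \
  --pfam_dat downloads/Pfam-A.hmm.dat.gz \
  --dbcan downloads/dbCAN.txt \
  --phibase downloads/phi-base_current.fas
```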

ERROR ~ unable to resolve class XXX

You may encounter this error if you are using an old version of Nextflow (pre v21). We use the updated DSL2 syntax, which will cause older versions of Nextflow to raise errors looking like this...

ERROR ~ unable to resolve class download_phibase
@ line 7, column 5.
       download as download_phibase;
       ^

_nf_script_9f2a833e: 8: unable to resolve class download_pfam_hmm
@ line 8, column 5.
       download as download_pfam_hmm;
       ^

_nf_script_9f2a833e: 9: unable to resolve class download_pfam_dat
@ line 9, column 5.
       download as download_pfam_dat;
       ^

Please update Nextflow to version 21 or later to resolve this issue.

Running/setting up conda environment: loadable library and perl binaries are mismatched (got handshake key 0xdb80080, needed 0xde00080)

This will usually happen if the operating system you're running on has some perl libraries in the search path for a different version of perl. Unfortunately because conda environments are not isolated from the host operating system like containers are, there isn't much we can do to avoid this. The good news is that it's usually an easy fix.

Search your bash environment for PERL5LIB or PERLLIB. e.g.

env | grep -i "perl"

If either of these are set, it's likely that this is how perl is finding the incompatible libraries. Try unsetting these variables and running again. e.g.

unset PERL5LIB
unset PERLLIB

If you are on a computer that uses modules to import software (e.g. many HPC clusters), check for any loaded perl modules as these will usually set the above variables.

# Check for any loaded perl modules
module list

# unload any relevant perl modules e.g.
module unload perl

Note that you will need to unset these environment variables any time you start a new terminal session and want to use the conda environment. Alternatively, you can remove the default imports and environment variable settings (e.g. in your ~/.bashrc file) to make the change permanent.

If unsetting the variables/unloading modules doesn't work, please let us know and we'll try to resolve the issue. We'll need info about any set environment variables before and after loading the conda environment, and the specific conda packages installed.

e.g.

env > env_before_load.txt
conda activate predector
env > env_after_load.txt

conda list > conda_environment.txt

Error while running signalp 6 ValueError: zero-size array to reduction operation maximum which has no identity

This is a known issue with some sequences and certain versions of SignalP 6. Unfortunately we can't do much about this other than report the troublesome sequence(s) to the developers.

If you contact us or raise an issue we can do that for you. Please include the sequences that are causing the issue and the exact version of SignalP 6 that you downloaded so that we can be most helpful. Otherwise, if you use GitHub you can raise an issue yourself in their repository (note though that the code that's up there isn't actually what is distributed). They also list contact emails in their installation instructions.

As a temporary fix you can re-run the pipeline with the --no_signalp6 parameter, which skips SignalP 6 entirely. Alternatively, you can manually mark the failing chunk as completed (internally we split the input into sets of --chunk_size unique sequences, 5000 by default), which skips SignalP 6 for that individual chunk only.

  1. Find the working directory of the task from the error message. It will look like this:
Work dir:
  /home/ubuntu/predector_analysis/work/7e/954be70138c4c29467945fade280ab
  2. Set the exit code to 0 and create an empty output file:
DIR_CONTAINING_ERROR=/home/ubuntu/predector_analysis/work/7e/954be70138c4c29467945fade280ab

echo "0" > "${DIR_CONTAINING_ERROR}/.exitcode"
touch "${DIR_CONTAINING_ERROR}/out.ldjson"
  3. Re-run the pipeline as you did before with the -resume option.

This should restart the pipeline and continue as if SignalP 6 hadn't failed (though it may still fail on a different chunk). Note however that if you skip the analysis for one chunk, the manual ranking scores (and probably the learned ranking scores in the near future) won't be reliable (because the other chunks will have more information).

Error while running a process with Command exit status: 137

This error code usually means that a task has run out of memory. At the time of writing, this seems to happen when running SignalP 6 on relatively small computers (e.g. with <6GB RAM available).

General strategies for reducing memory usage are to reduce the --chunk_size to below 1000 (say 500). Specifically for SignalP 6 you can also try reducing the --signalp6_bsize to 10. You can read more about these parameters in the Command line parameters section.
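For instance, a lower-memory run might look like the following sketch (replace X.X.X with the version you're running; the specific values are just starting points):

```shell
# Sketch: re-run with smaller work chunks and a smaller SignalP 6 batch
# size to reduce peak memory use. X.X.X is a placeholder version.
nextflow run -r X.X.X ccdmb/predector \
  --proteome "proteomes/*" \
  --chunk_size 500 \
  --signalp6_bsize 10
```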

If you encounter this issue in the final steps when producing the output and ranking tables, it may be that one of your input fasta files is very large. As noted in the running the pipeline section, Predector was designed to handle typical proteomes. The number of proteomes doesn't really matter, because internally we deduplicate and divide the inputs into chunks, but if a single input fasta has many proteins (say >10^5), this can cause an issue if you don't have lots of RAM (about 30GB is needed for a few million proteins). If your proteins aren't split into proteomes (e.g. you're running on a set downloaded from UniProt), it's best to split them yourself into batches of about 20000, and then concatenate the final tables yourself. We can guide you through dealing with this to make use of what you have already computed, so please get in touch.
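For the UniProt-style case, splitting can be done with a short awk sketch (or a dedicated tool such as seqkit). This is illustrative only: a toy FASTA is generated so the snippet is self-contained, and the batch size is shrunk so you can see it work; use something like 20000 for real data.

```shell
# Toy FASTA standing in for a single huge input (e.g. a UniProt download).
printf '>p1\nMA\n>p2\nMB\n>p3\nMC\n>p4\nMD\n>p5\nME\n' > big_proteins.fasta

# Split into batches of BATCH_SIZE sequences each.
BATCH_SIZE=2
awk -v size="${BATCH_SIZE}" '
  /^>/ {
    if (n % size == 0) {
      if (file) close(file)   # close finished batches to avoid open-file limits
      file = sprintf("batch_%04d.fasta", n / size)
    }
    n++
  }
  { print > file }
' big_proteins.fasta

ls batch_*.fasta
```

Each batch_NNNN.fasta can then be run through the pipeline separately, and the output tables concatenated afterwards.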

If you encounter this issue with other processes please let us know. We've done our best to keep peak memory use low for most tasks, but there may be cases that we hadn't considered.

Installing with Mamba, Problem: nothing provides __glibc >=2.17,<3.0.a0 needed by...

This appears to happen with very old versions of Mamba, and was reported to us here. It appears that simply updating mamba will fix the problem.

conda update -n base -c conda-forge mamba

If this does not resolve the problem, please raise another issue or contact us.

FAQ

We'll update these as we find new issues and get feedback. Please raise an issue on GitHub or email us if you have an issue not covered here.

What do Predector "effector scores" actually mean?

It's best to think of the learning-to-rank scores (and the manually designed ranking scores) as arbitrary numbers that attempt to make effectors appear near the top of a sorted list. The scores will not be consistent between different versions of the model, so please be careful if you're trying to compare scores. Similarly, as with EffectorP, the scores should not be treated as a 'likelihood'. Although you can generally say that proteins with higher scores will be more like known effectors, the difference in "effector-ness" between scores of 0 and 1 is not necessarily the same as between 1 and 2 (and so on).

In the paper for version 1 we present some comparisons with EffectorP classification using a score threshold of 0, but this is not how we suggest you use these scores and the threshold may not be applicable in the future if we change how the model is trained. In general, it's best to look at some additional evidence (e.g. homologues, expression, or presence-absence) and manually evaluate candidates in descending order of score (i.e. using Predector as a decision support system) until you have enough to work with.

In the first version of the model, the predictions between 0 and 1 can contain some odd effector predictions. This is because the model has tried to accommodate some unusual effectors, but the decision tree rules (with discontinuous boundaries) can let some things through that obviously aren't effectors. If you delve into the proteins with lower scores, we recommend that you manually evaluate the protein properties in the ranking sheet yourself to select candidates.

With Predector we really wanted to encourage you to look at your data. Ranking separates the bulk of good proteins from bad ones, so it's easier to decide when to stop manually evaluating candidates and settle on a list. Think of it like searching for papers on the web: the first page usually contains something relevant to what you're interested in, but sometimes there are gems on the 2nd and 3rd pages.

How should I cite Predector?

The Predector pipeline and ranking method are published in Scientific Reports:

Darcy A. B. Jones, Lina Rozano, Johannes W. Debler, Ricardo L. Mancera, Paula M. Moolhuijzen, James K. Hane (2021). An automated and combinative method for the predictive ranking of candidate effector proteins of fungal plant-pathogens. Scientific Reports. 11, 19731, DOI: 10.1038/s41598-021-99363-0

Please also cite the dependencies that we use whenever possible. I understand that citation limits can be an issue, but the continued maintenance and development of these tools relies on citations. If you absolutely must prioritise, I'd suggest keeping EffectorP, ApoplastP, Deepredeff, TargetP, TMHMM, and one of the SignalP papers, as these do most of the heavy lifting in the pipeline. There is a BibTeX-formatted file with citations in the main GitHub repository, which can be imported into most citation managers. The dependency citations are also listed below.

  • Almagro Armenteros, J. J., Sønderby, C. K., Sønderby, S. K., Nielsen, H., & Winther, O. (2017). DeepLoc: Prediction of protein subcellular localization using deep learning. Bioinformatics, 33(21), 3387–3395. https://doi.org/10.1093/bioinformatics/btx431
  • Armenteros, Jose Juan Almagro, Salvatore, M., Emanuelsson, O., Winther, O., Heijne, G. von, Elofsson, A., & Nielsen, H. (2019). Detecting sequence signals in targeting peptides using deep learning. Life Science Alliance, 2(5). https://doi.org/10.26508/lsa.201900429
  • Armenteros, José Juan Almagro, Tsirigos, K. D., Sønderby, C. K., Petersen, T. N., Winther, O., Brunak, S., Heijne, G. von, & Nielsen, H. (2019). SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature Biotechnology, 37(4), 420–423. https://doi.org/10.1038/s41587-019-0036-z
  • Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. https://doi.org/10.1038/nbt.3820
  • Dyrløv Bendtsen, J., Nielsen, H., von Heijne, G., & Brunak, S. (2004). Improved Prediction of Signal Peptides: SignalP 3.0. Journal of Molecular Biology, 340(4), 783–795. https://doi.org/10.1016/j.jmb.2004.05.028
  • Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLOS Computational Biology, 7(10), e1002195. https://doi.org/10.1371/journal.pcbi.1002195
  • Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E. L. L., Tate, J., & Punta, M. (2014). Pfam: The protein families database. Nucleic Acids Research, 42(Database issue), D222–D230. https://doi.org/10.1093/nar/gkt1223
  • Teufel, F., Armenteros, J. A. A., Johansen, A. R., Gíslason, M. H., Pihl, S. I., Tsirigos, K. D., Winther, O., Brunak, S., von Heijne, G., & Nielsen, H. (2021). SignalP 6.0 achieves signal peptide prediction across all types using protein language models. bioRxiv. https://doi.org/10.1101/2021.06.09.447770
  • Käll, L., Krogh, A., & Sonnhammer, E. L. L. (2004). A Combined Transmembrane Topology and Signal Peptide Prediction Method. Journal of Molecular Biology, 338(5), 1027–1036. https://doi.org/10.1016/j.jmb.2004.03.016
  • Kristianingsih, R., MacLean, D. (2021). Accurate plant pathogen effector protein classification ab initio with deepredeff: an ensemble of convolutional neural networks. BMC Bioinformatics 22, 372. https://doi.org/10.1186/s12859-021-04293-3
  • Krogh, A., Larsson, B., von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. Journal of Molecular Biology, 305(3), 567–580. https://doi.org/10.1006/jmbi.2000.4315
  • Petersen, T. N., Brunak, S., Heijne, G. von, & Nielsen, H. (2011). SignalP 4.0: Discriminating signal peptides from transmembrane regions. Nature Methods, 8(10), 785–786. https://doi.org/10.1038/nmeth.1701
  • Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16(6), 276–277. https://doi.org/10.1016/S0168-9525(00)02024-2
  • Savojardo, C., Martelli, P. L., Fariselli, P., & Casadio, R. (2018). DeepSig: Deep learning improves signal peptide detection in proteins. Bioinformatics, 34(10), 1690–1696. https://doi.org/10.1093/bioinformatics/btx818
  • Sperschneider, J., Catanzariti, A.-M., DeBoer, K., Petre, B., Gardiner, D. M., Singh, K. B., Dodds, P. N., & Taylor, J. M. (2017). LOCALIZER: Subcellular localization prediction of both plant and effector proteins in the plant cell. Scientific Reports, 7(1), 1–14. https://doi.org/10.1038/srep44598
  • Sperschneider, J., Dodds, P. N., Gardiner, D. M., Singh, K. B., & Taylor, J. M. (2018). Improved prediction of fungal effector proteins from secretomes with EffectorP 2.0. Molecular Plant Pathology, 19(9), 2094–2110. https://doi.org/10.1111/mpp.12682
  • Sperschneider, J., Dodds, P. N., Singh, K. B., & Taylor, J. M. (2018). ApoplastP: Prediction of effectors and plant proteins in the apoplast using machine learning. New Phytologist, 217(4), 1764–1778. https://doi.org/10.1111/nph.14946
  • Sperschneider, J., Gardiner, D. M., Dodds, P. N., Tini, F., Covarelli, L., Singh, K. B., Manners, J. M., & Taylor, J. M. (2016). EffectorP: Predicting fungal effector proteins from secretomes using machine learning. New Phytologist, 210(2), 743–761. https://doi.org/10.1111/nph.13794
  • Sperschneider, J., & Dodds, P. N. (2021). EffectorP 3.0: prediction of apoplastic and cytoplasmic effectors in fungi and oomycetes. bioRxiv. https://doi.org/10.1101/2021.07.28.454080
  • Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026–1028. https://doi.org/10.1038/nbt.3988
  • Tange, O. (2020). GNU Parallel 20200522 ('Kraftwerk'). Zenodo. https://doi.org/10.5281/zenodo.3841377
  • Urban, M., Cuzick, A., Seager, J., Wood, V., Rutherford, K., Venkatesh, S. Y., De Silva, N., Martinez, M. C., Pedro, H., Yates, A. D., Hassani-Pak, K., & Hammond-Kosack, K. E. (2020). PHI-base: The pathogen–host interactions database. Nucleic Acids Research, 48(D1), D613–D620. https://doi.org/10.1093/nar/gkz904
  • Zhang, H., Yohe, T., Huang, L., Entwistle, S., Wu, P., Yang, Z., Busk, P. K., Xu, Y., & Yin, Y. (2018). dbCAN2: A meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Research, 46(W1), W95–W101. https://doi.org/10.1093/nar/gky418