A Python tool that screens MAFFT parameters, performs and evaluates alignments, and identifies the optimal configuration for a given sequence dataset.

NicoFrL/MAFFT_ScoreNGo

MAFFT_ScoreNGo

A Python tool that screens MAFFT parameters, performs and evaluates alignments, and identifies the optimal alignment strategy for your amino acid sequence dataset.

Created by Nicolas-Frédéric Lipp, PhD.

Features

MAFFT_ScoreNGo helps you determine the theoretically best MAFFT** alignment method for your specific protein sequence dataset. All you need is an input .fasta file of unaligned sequences. Running it on a small dataset is recommended: if yours is too large, you can use the provided script, "Rand_NSamp_MyFasta.py" (available soon), to extract 200 sequences or fewer.

  • Automated screening of multiple MAFFT parameters
  • Three screening levels: Light, Standard, and Aggressive
  • Evaluation of alignment quality using custom scoring metrics
  • Identification of the optimal alignment strategy for a given dataset
  • Generation of comprehensive result summaries

**MAFFT stands for Multiple sequence Alignment using Fast Fourier Transform. More documentation can be found at mafft.cbrc.jp, Katoh et al. (Nucleic Acids Res., 2002), and Katoh et al. (Brief. Bioinform., 2017).
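
Until the sampling script is published, the subsampling step recommended above can be approximated with a few lines of standard-library Python. This is only a sketch and not the author's "Rand_NSamp_MyFasta.py"; the function names (read_fasta, subsample, write_fasta) and the seed argument are hypothetical and chosen for illustration.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample(records, max_n=200, seed=None):
    """Return at most max_n records picked at random, keeping input order."""
    if len(records) <= max_n:
        return list(records)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(records)), max_n))
    return [records[i] for i in keep]


def write_fasta(records, width=60):
    """Serialize records to FASTA text, wrapping sequences at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Reading a file, calling subsample(records, max_n=200), and writing the result back gives a reduced .fasta file suitable as input. Passing a fixed seed makes the selection reproducible.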

Parameters Overview

MAFFT_ScoreNGo tests various combinations of the following MAFFT parameters:

  • Alignment strategies (--genafpair, --localpair, --globalpair)
  • Substitution matrices (BLOSUM62, BLOSUM80)
  • Gap opening penalties
  • Gap extension penalties
  • Large gap penalties

For a detailed explanation of the parameters tested and scoring algorithms used, please refer to the PARAMETERS.md file.
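
To give a feel for how such a screen grows, the sketch below enumerates combinations of the parameters listed above with itertools.product. It is an illustration only: the numeric values are assumptions rather than the ones MAFFT_ScoreNGo actually tests (those are documented in PARAMETERS.md), the large gap penalties are left out, and build_grid is a hypothetical helper name. The flags --bl, --op, and --ep are MAFFT's options for the BLOSUM matrix, gap opening penalty, and offset (gap extension-like) value.

```python
from itertools import product

STRATEGIES = ["--genafpair", "--localpair", "--globalpair"]
MATRICES = [62, 80]         # passed to MAFFT as --bl 62 / --bl 80
GAP_OPEN = [1.53, 2.0]      # illustrative --op values, not the tool's own
GAP_EXTEND = [0.0, 0.123]   # illustrative --ep values, not the tool's own


def build_grid(strategies=STRATEGIES, matrices=MATRICES,
               gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Return one MAFFT argument list per parameter combination."""
    grid = []
    for strategy, bl, op, ep in product(strategies, matrices,
                                        gap_open, gap_extend):
        grid.append([strategy, "--bl", str(bl),
                     "--op", str(op), "--ep", str(ep)])
    return grid


if __name__ == "__main__":
    grid = build_grid()
    print(len(grid), "combinations")
    print(" ".join(grid[0]))
```

The number of alignments is the product of the list lengths, which is why adding one more value to any parameter can noticeably lengthen a screening run.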

Installation

  1. Clone this repository:
    git clone https://github.com/NicoFrL/MAFFT_ScoreNGo.git
    cd MAFFT_ScoreNGo

  2. Install the required Python packages:
    pip3 install -r requirements.txt

  3. Ensure MAFFT is installed and accessible from your command line (see Dependencies section for more details).

Usage

Quick Start

  1. Run the script with:
    python3 MAFFT_ScoreNGo.py

  2. Follow the prompts to select your input FASTA file.

  3. Choose the screening level:

    • Light (type 1): Quick screening with fewer parameter combinations.
    • Standard (type 2): Balanced screening with a moderate number of combinations.
    • Aggressive (type 3): Thorough screening with many parameter combinations.
  4. (Optional) Enter your own personalized parameters if desired.

  5. Confirm that you want to run the computation on the given number of combinations.

Customization

You can add custom MAFFT parameters to be tested alongside the predefined combinations. When prompted, enter your parameters in the MAFFT command-line format. For example: --maxiterate 1000 --globalpair --thread 4
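
As a rough illustration of how a parameter string in that format can become a MAFFT invocation, the sketch below splits it with shlex and appends the input file. The helper names build_command and run_alignment are hypothetical, and the script's own handling of custom parameters may differ. The sketch relies on the fact that MAFFT writes the alignment to standard output.

```python
import shlex
import subprocess


def build_command(custom_params, input_fasta, mafft="mafft"):
    """Turn a command-line style parameter string into an argument list."""
    return [mafft, *shlex.split(custom_params), input_fasta]


def run_alignment(command, output_path):
    """Run MAFFT and save the alignment it prints to standard output."""
    with open(output_path, "w") as out:
        # check=True raises CalledProcessError if MAFFT exits with an error.
        subprocess.run(command, stdout=out, stderr=subprocess.PIPE, check=True)


if __name__ == "__main__":
    cmd = build_command("--maxiterate 1000 --globalpair --thread 4",
                        "input.fasta")
    print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file paths contain spaces.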

Interpreting Results

The script will output:

  1. A ranking of the top 13 alignments based on the final score.
  2. Detailed information for each alignment, including parameters used, execution time, and various scores.
  3. The best combination of parameters for your dataset.

Results are saved in the mafft_results directory, including:

  • Individual alignment files
  • A summary file (mafft_results_summary.txt)
  • A list of MAFFT commands used (mafft_commands.txt)
  • Debug logs (debug_logs.txt)
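
The ranking step can be pictured as a sort on the final score followed by a cut at 13 entries. The sketch below assumes that a higher final score is better and uses made-up record fields (params, final_score, seconds); neither the field names nor the example values come from the script or its summary file.

```python
def top_alignments(results, n=13):
    """Return the n best results, highest final score first (assumed order)."""
    ranked = sorted(results, key=lambda r: r["final_score"], reverse=True)
    return ranked[:n]


def format_ranking(results):
    """Render ranked results as numbered text lines."""
    lines = []
    for rank, r in enumerate(results, start=1):
        lines.append(f"{rank:2d}. score={r['final_score']:.3f} "
                     f"time={r['seconds']:.1f}s params={r['params']}")
    return lines


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    demo = [
        {"params": "--localpair --bl 62", "final_score": 0.71, "seconds": 2.4},
        {"params": "--globalpair --bl 80", "final_score": 0.83, "seconds": 1.9},
        {"params": "--genafpair --bl 62", "final_score": 0.78, "seconds": 3.1},
    ]
    for line in format_ranking(top_alignments(demo)):
        print(line)
```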

Troubleshooting

  • If MAFFT is not found, ensure it's correctly installed and added to your system PATH.
  • For memory issues with large datasets, try using a smaller subset of sequences.
  • If you encounter Python-related errors, verify that all dependencies are correctly installed.

Dependencies

  • Python 3.x (3.7 or later recommended)
  • Biopython 1.81
  • Tkinter
  • MAFFT (must be installed separately and available in your system PATH)

To check if MAFFT is properly installed and available, run the following command in your terminal:
mafft --version

This should display the installed version of MAFFT. If you see an error message instead, please refer to the MAFFT Installation section below.
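
The same check can be made from Python before launching any alignment. The sketch below uses shutil.which from the standard library to look for the executable on the PATH; the helper names find_executable and require_executable are hypothetical and are not taken from MAFFT_ScoreNGo.py.

```python
import shutil


def find_executable(name="mafft"):
    """Return the full path of `name` if it is on the PATH, else None."""
    return shutil.which(name)


def require_executable(name="mafft"):
    """Return the path of `name` or raise a readable error if it is missing."""
    path = find_executable(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' was not found on your PATH. "
            "Install it and make sure its directory is listed in PATH."
        )
    return path


if __name__ == "__main__":
    location = find_executable()
    print("MAFFT found at:" if location else "MAFFT not found", location or "")
```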

Python and Biopython

Ensure you have Python 3.7 or later installed. You can install Biopython and other Python dependencies using:
pip3 install -r requirements.txt

Tkinter Installation

Tkinter is usually included with Python, but on some systems, it may need to be installed separately:

  • On macOS, if you've installed Python via Homebrew:
    brew install python-tk

  • On Ubuntu/Debian Linux:
    sudo apt-get install python3-tk

  • On other systems, please refer to your system's package manager or Python distribution instructions.

MAFFT Installation

MAFFT must be installed separately and be available in your system PATH. Installation instructions vary by operating system; see the MAFFT documentation at mafft.cbrc.jp for the steps that apply to your platform.

After installation, verify MAFFT is accessible by running:
mafft --version

Performance

MAFFT_ScoreNGo.py was tested with sample_input, which contains 69 sequences with an average length of 438 amino acids, on macOS Sonoma with an ARM M1 Max (10-core CPU, 32-core GPU) and 32 GB of RAM. With this configuration, running the Light screening level (13 alignments) and the scoring assessment took 25 seconds. For longer runs, keep your machine from going to sleep: "caffeinate" your Mac, or use systemd-inhibit on Linux.

Language

Feel free to use MAFFT_ScoreNGo_FR.py, which is the French version of MAFFT_ScoreNGo.py.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

If you encounter any problems or have any questions, please open an issue on the GitHub repository.

Author

Nicolas-Frédéric Lipp, PhD
https://github.com/NicoFrL

Acknowledgements

This project was developed with the assistance of AI language models, which provided guidance on code structure, best practices, and documentation. The core algorithm and scientific approach were designed and implemented by the author.
