A Python tool that screens MAFFT parameters, performs and evaluates alignments, and identifies the optimal alignment strategy given your amino acid sequence dataset.
Created by Nicolas-Frédéric Lipp, PhD.
MAFFT_ScoreNGo helps determine the theoretically best MAFFT** alignment method for your specific protein sequence dataset. All you need is an input .fasta file of unaligned sequences. Running it on a small dataset is recommended: if your dataset is too large, you can use the script I provided, "Rand_NSamp_MyFasta.py" (available soon), to extract 200 sequences or fewer.
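While "Rand_NSamp_MyFasta.py" is not yet available, the subsampling step could be approximated along the following lines. This is only a minimal sketch, not the author's script: the function names `read_fasta` and `subsample_fasta` and the `seed` argument are assumptions made for illustration.

```python
import random


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample_fasta(in_path, out_path, n=200, seed=None):
    """Write at most n randomly chosen records from in_path to out_path."""
    records = read_fasta(in_path)
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    chosen = records if len(records) <= n else rng.sample(records, n)
    with open(out_path, "w") as handle:
        for header, seq in chosen:
            handle.write(f">{header}\n{seq}\n")
    return len(chosen)
```

For example, `subsample_fasta("my_dataset.fasta", "my_subset.fasta", n=200, seed=42)` would write a reproducible subset of at most 200 sequences (both file names are placeholders).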
- Automated screening of multiple MAFFT parameters
- Three screening levels: Light, Standard, and Aggressive
- Evaluation of alignment quality using custom scoring metrics
- Identification of optimal alignment strategy for given dataset
- Generation of comprehensive result summaries
**MAFFT stands for Multiple Alignment using Fast Fourier Transform. More documentation can be found at mafft.cbrc.jp, Katoh et al. (Nucleic Acids Res., 2002), and Katoh et al. (Brief. Bioinform., 2017).
MAFFT_ScoreNGo tests various combinations of the following MAFFT parameters:
- Alignment strategies (--genafpair, --localpair, --globalpair)
- Substitution matrices (BLOSUM62, BLOSUM80)
- Gap opening penalties
- Gap extension penalties
- Large gap penalties
For a detailed explanation of the parameters tested and scoring algorithms used, please refer to the PARAMETERS.md file.
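To give a rough idea of what screening "combinations" means, the sketch below enumerates a small grid of MAFFT command lines with `itertools.product`. The numeric values are hypothetical placeholders, not the grids used by MAFFT_ScoreNGo (those are described in PARAMETERS.md), and the large gap penalties are left out for brevity. The options `--bl`, `--op`, and `--ep` are standard MAFFT options; `build_commands` is a name invented for this example.

```python
from itertools import product

# Hypothetical values for illustration only; see PARAMETERS.md for the real grids.
STRATEGIES = ["--genafpair", "--localpair", "--globalpair"]
MATRICES = [62, 80]        # passed to MAFFT as --bl 62 / --bl 80
GAP_OPEN = [1.53, 2.0]     # --op: gap opening penalty
GAP_EXTEND = [0.0, 0.123]  # --ep: offset, acts like a gap extension penalty


def build_commands(fasta_path):
    """Return one MAFFT argument list per parameter combination."""
    commands = []
    for strategy, bl, op, ep in product(STRATEGIES, MATRICES, GAP_OPEN, GAP_EXTEND):
        commands.append([
            "mafft", strategy,
            "--bl", str(bl),
            "--op", str(op),
            "--ep", str(ep),
            fasta_path,
        ])
    return commands
```

With these placeholder lists the grid contains 3 x 2 x 2 x 2 = 24 commands, which shows how quickly the number of alignments grows as more values are screened.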
- Clone this repository:
  git clone https://github.com/yourusername/MAFFT_ScoreNGo.git
  cd MAFFT_ScoreNGo
- Install the required Python packages:
  pip3 install -r requirements.txt
- Ensure MAFFT is installed and accessible from your command line (see Dependencies section for more details).
- Run the script with:
  python3 MAFFT_ScoreNGo.py
- Follow the prompts to select your input FASTA file.
- Choose the screening level:
  - Light (type 1): Quick screening with fewer parameter combinations.
  - Standard (type 2): Balanced screening with a moderate number of combinations.
  - Aggressive (type 3): Thorough screening with many parameter combinations.
- (Optional) Enter your own personalized parameters if desired.
- Confirm that you want to run the computation on the given number of combinations.
You can add custom MAFFT parameters to be tested alongside the predefined combinations. When prompted, enter your parameters in the MAFFT command-line format. For example: --maxiterate 1000 --globalpair --thread 4
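A string in command-line format, such as the example above, can be split into separate arguments with Python's standard `shlex` module. The sketch below is an assumption about how such input might be handled, not a description of the script's internals; `parse_custom_parameters` and `build_custom_command` are hypothetical names.

```python
import shlex


def parse_custom_parameters(text):
    """Split a user-supplied option string into a list of arguments."""
    tokens = shlex.split(text)
    if not tokens:
        return []
    # Basic sanity check: MAFFT options of the kind shown above start with "--".
    if not tokens[0].startswith("--"):
        raise ValueError("expected parameters to start with a '--' option")
    return tokens


def build_custom_command(text, fasta_path):
    """Combine the custom options with the input file into a MAFFT command."""
    return ["mafft", *parse_custom_parameters(text), fasta_path]
```

Keeping the command as a list of arguments rather than a single string avoids passing user input through a shell.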
The script will output:
- A ranking of the top 13 alignments based on the final score.
- Detailed information for each alignment, including parameters used, execution time, and various scores.
- The best combination of parameters for your dataset.
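The ranking step can be pictured with a toy metric. The sketch below scores each alignment by its mean column identity and sorts the results, best first. This metric is a stand-in chosen for simplicity and is not the scoring used by MAFFT_ScoreNGo (see PARAMETERS.md for that); `mean_column_identity` and `rank_alignments` are hypothetical names.

```python
from collections import Counter


def mean_column_identity(rows):
    """Average, over columns, of the most common residue's frequency.

    Gaps ("-") count against a column, and an all-gap column scores 0.
    """
    if not rows or not rows[0] or len({len(r) for r in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal, non-zero length")
    total = 0.0
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        if residues:
            total += Counter(residues).most_common(1)[0][1] / len(column)
    return total / len(rows[0])


def rank_alignments(alignments, top_n=13):
    """alignments maps a name to its aligned rows; return [(name, score)], best first."""
    scored = [(name, mean_column_identity(rows)) for name, rows in alignments.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]
```

A real final score would typically combine several such measures, which is why the output reports various scores for each alignment.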
Results are saved in the mafft_results directory, including:
- Individual alignment files
- A summary file (mafft_results_summary.txt)
- A list of MAFFT commands used (mafft_commands.txt)
- Debug logs (debug_logs.txt)
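As an illustration of how an individual alignment file and its execution time could be produced, the sketch below runs one command, redirects its standard output (where MAFFT writes the alignment) to a file, and measures the elapsed time. It is a simplified assumption rather than the script's actual implementation; `run_and_time` is a hypothetical helper.

```python
import subprocess
import time


def run_and_time(command, out_path):
    """Run a command, save its standard output to out_path, return elapsed seconds."""
    start = time.perf_counter()
    with open(out_path, "w") as out:
        # MAFFT prints progress messages to standard error, so that stream is
        # captured separately and kept out of the alignment file.
        subprocess.run(command, stdout=out, stderr=subprocess.PIPE, text=True, check=True)
    return time.perf_counter() - start
```

With `check=True`, a failed run raises `subprocess.CalledProcessError`, which a caller could catch and record in a debug log.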
- If MAFFT is not found, ensure it's correctly installed and added to your system PATH.
- For memory issues with large datasets, try using a smaller subset of sequences.
- If you encounter Python-related errors, verify that all dependencies are correctly installed.
- Python 3.x (3.7 or later recommended)
- Biopython 1.81
- Tkinter
- MAFFT (must be installed separately and available in your system PATH)
To check if MAFFT is properly installed and available, run the following command in your terminal:
mafft --version
This should display the installed version of MAFFT. If you see an error message instead, please refer to the MAFFT Installation section below.
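The same check can be done from Python, which is convenient before launching a long screening run. This is a minimal sketch using only the standard library; `find_mafft` and `mafft_version` are names invented for the example, and because the stream on which the version is printed is not assumed here, both are inspected.

```python
import shutil
import subprocess


def find_mafft(executable="mafft"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def mafft_version(executable="mafft"):
    """Return the text printed by `mafft --version`, or None if MAFFT is not found."""
    path = find_mafft(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Check both streams rather than assuming where the version is printed.
    return (result.stdout or result.stderr).strip()
```

If `mafft_version()` returns `None`, MAFFT is not on your PATH and needs to be installed or added to it.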
Ensure you have Python 3.7 or later installed. You can install Biopython and other Python dependencies using:
pip3 install -r requirements.txt
Tkinter is usually included with Python, but on some systems, it may need to be installed separately:
- On macOS, if you've installed Python via Homebrew:
  brew install python-tk
- On Ubuntu/Debian Linux:
  sudo apt-get install python3-tk
- On other systems, please refer to your system's package manager or Python distribution instructions.
MAFFT must be installed separately and be available in your system PATH. Installation instructions vary by operating system:
- On macOS with Homebrew:
  brew install mafft
- On Ubuntu/Debian Linux:
  sudo apt-get install mafft
- For other systems or for manual installation, please refer to the MAFFT official website: https://mafft.cbrc.jp/alignment/software/
After installation, verify MAFFT is accessible by running:
mafft --version
MAFFT_ScoreNGo.py was tested with sample_input, containing 69 sequences with an average length of 438 amino acids, using macOS Sonoma on an ARM M1 Max (10-core CPU, 32-core GPU) with 32 GB of RAM. With this configuration, it took 25 seconds to run the Light screening level (13 alignments) and perform the scoring assessment.
Don't forget to "caffeinate" your Mac (or use systemd-inhibit on your Linux machine) so that the system does not go to sleep during long runs.
Feel free to use MAFFT_ScoreNGo_FR.py, which is the French version of MAFFT_ScoreNGo.py.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
If you encounter any problems or have any questions, please open an issue on the GitHub repository.
Nicolas-Frédéric Lipp, PhD
https://github.com/NicoFrL
This project was developed with the assistance of AI language models, which provided guidance on code structure, best practices, and documentation. The core algorithm and scientific approach were designed and implemented by the author.