ProtDomRetriever is a simple Python tool for retrieving protein domain information from the InterPro database based on UniProtKB accessions and specified InterPro entries. The script queries the InterPro application programming interface (API), extracts the position of every domain for each entry, and keeps the longest domain when domains from several entries overlap. Optionally, the program also returns a FASTA file of the domain sequences, trimmed from the full-length sequences retrieved from UniProt. When a protein contains several domains in tandem, each one is retrieved and assigned a domain number that is attached to the UniProt accession.
Created by Nicolas-Frédéric Lipp, PhD.
- Retrieve domain information for multiple UniProtKB accessions
- Filter domains based on specified InterPro entries
- Generate TSV output with domain ranges
- Create FASTA files for the retrieved protein domains
- User-friendly GUI for file selection
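
The overlap rule mentioned in the introduction (keep the longest domain when several hits overlap, then number tandem domains) can be illustrated with a minimal sketch. This is not the code from ProtDomRetriever.py: the function names, the greedy longest-first strategy, and the `ACCESSION_N` label format are assumptions made for illustration.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end), 1-based and inclusive


def select_longest_non_overlapping(ranges: List[Range]) -> List[Range]:
    """Greedily keep the longest ranges, dropping any that overlap a kept one."""
    kept: List[Range] = []
    # Visit longer ranges first so they win over shorter overlapping ones.
    for start, end in sorted(ranges, key=lambda r: (-(r[1] - r[0]), r[0])):
        overlaps = any(start <= k_end and k_start <= end for k_start, k_end in kept)
        if not overlaps:
            kept.append((start, end))
    return sorted(kept)


def number_domains(accession: str, ranges: List[Range]) -> List[Tuple[str, int, int]]:
    """Label domains in sequence order (hypothetical label format)."""
    return [
        (f"{accession}_{i}", start, end)
        for i, (start, end) in enumerate(sorted(ranges), start=1)
    ]


# Hypothetical hits for one protein: two overlapping pairs.
hits = [(10, 120), (15, 100), (200, 310), (205, 330)]
domains = select_longest_non_overlapping(hits)
print(number_domains("P12345", domains))
# [('P12345_1', 10, 120), ('P12345_2', 205, 330)]
```

Sorting by length before checking overlaps keeps the logic short; the actual script may resolve ties or partial overlaps differently.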
- Python 3.6+
- Required Python packages:
  - tkinter
  - requests
- Clone this repository: git clone https://github.com/NicoFrL/ProtDomRetriever.git
- Navigate to the project directory: cd ProtDomRetriever
- Install required packages: pip install -r requirements.txt
Run the script using Python:
python3 ProtDomRetriever.py
Follow the on-screen prompts to:
- Select an input file containing UniProtKB accessions
- Enter InterPro entries for domain filtering
- Choose whether to fetch FASTA files for the protein domains
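
To give an idea of what filtering by InterPro entry involves, here is a hedged sketch that extracts domain positions from an InterPro API response. The endpoint pattern and the JSON field names (`results`, `metadata`, `proteins`, `entry_protein_locations`, `fragments`) are assumptions based on the public InterPro API and should be checked against its current documentation; `extract_ranges` is a hypothetical helper, and the sample payload is hand-written rather than real data.

```python
from typing import Dict, List, Set, Tuple

# Assumed endpoint pattern; verify against the InterPro API documentation.
API_URL = "https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/{accession}"


def extract_ranges(payload: dict, wanted_entries: Set[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Collect (start, end) fragments for the requested InterPro entries."""
    found: Dict[str, List[Tuple[int, int]]] = {}
    for result in payload.get("results", []):
        entry = result.get("metadata", {}).get("accession")
        if entry not in wanted_entries:
            continue  # skip entries the user did not ask for
        for protein in result.get("proteins", []):
            for location in protein.get("entry_protein_locations", []):
                for fragment in location.get("fragments", []):
                    found.setdefault(entry, []).append(
                        (fragment["start"], fragment["end"])
                    )
    return found


# Hand-written sample in the assumed response shape (placeholder identifiers).
sample = {
    "results": [
        {
            "metadata": {"accession": "IPR000001"},
            "proteins": [
                {"entry_protein_locations": [{"fragments": [{"start": 12, "end": 98}]}]}
            ],
        },
        {
            "metadata": {"accession": "IPR000002"},
            "proteins": [
                {"entry_protein_locations": [{"fragments": [{"start": 150, "end": 260}]}]}
            ],
        },
    ]
}
print(extract_ranges(sample, {"IPR000001"}))
# {'IPR000001': [(12, 98)]}
```

In a live run, the payload would come from an HTTP GET on `API_URL.format(accession=...)`, for example with the `requests` package listed in the requirements.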
The script generates three main output files in a new directory:
- *_result_table.tsv: A tab-separated file containing protein accessions, InterPro entries, and domain ranges
- *_domain_ranges.txt: A text file listing the domain ranges for each protein
- *_output_domains.fasta: A FASTA file containing the sequences of the retrieved protein domains (if FASTA retrieval is selected)
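
As an illustration of how a domain FASTA record can be derived from a full-length sequence and a domain range, the following sketch trims a sequence to 1-based inclusive coordinates. The function name `trim_to_fasta`, the header layout, and the 60-character line width are assumptions; the headers written by ProtDomRetriever may differ.

```python
def trim_to_fasta(accession: str, sequence: str, start: int, end: int,
                  index: int = 1, width: int = 60) -> str:
    """Return one FASTA record for the domain spanning start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"range {start}-{end} is outside the sequence (length {len(sequence)})")
    # Convert 1-based inclusive coordinates to a Python slice.
    fragment = sequence[start - 1:end]
    # Hypothetical header format: accession, domain number, and range.
    header = f">{accession}_{index}/{start}-{end}"
    lines = [fragment[i:i + width] for i in range(0, len(fragment), width)]
    return "\n".join([header, *lines]) + "\n"


# Toy sequence, not a real protein.
print(trim_to_fasta("P12345", "ACDEFGHIKLMNPQRSTVWY", 3, 8), end="")
# >P12345_1/3-8
# DEFGHI
```

The explicit bounds check guards against ranges that do not match the downloaded sequence, which could happen if the sequence version differs from the one InterPro annotated.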
Two example datasets are provided in the examples directory:
- ORP dataset (example1)
- Spectrin dataset (example2)
Each example includes input files, suggested InterPro entries, and sample output files.
The content of the output file *_domain_ranges.txt can also be pasted into the UniProt ID mapping tool at https://www.uniprot.org/id-mapping (mapping UniProtKB AC/ID to UniProtKB) to retrieve the sequences manually, for instance as a comprehensive Excel file.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
If you encounter any problems or have any questions, please open an issue on the GitHub repository.
Nicolas-Frédéric Lipp, PhD
https://github.com/NicoFrL
This project was developed with the assistance of AI language models, which provided guidance on code structure, best practices, and documentation. The core algorithm and scientific approach were designed and implemented by the author on the basis of InterPro documentation.