The analysis of T-cell receptor sequences of an individual allows us to determine, which pathogens and viruses his immune system can detect. The main challenge here is the uncertain matching between receptors and antigens. The TCR sequences of different lengths and amino acid content can be complementary to the same antigen.
The goal of this project is to develop an efficient pipeline to compare the existing methods of the TCR prediction methods.
We also hypothesize, that usage of custom substitution matrices assembled from TCR sequences, instead of BLOSUM, can improve performance of antigen prediction.
For now we support the following algorithms:
Pipelines for these algorithms are provided as Jupyter notebooks and are placed in the folders corresponding to their names.
Except for the BLOSUM62 substitution matrix, we also provide substitution matrices, which are based on the TCRs receptors sequences (both for alpha-and beta-chains, and single antigens). The main pipeline of building matrices locates here in Jupyter format. The precalculated matrices are placed here.
We test all of the methods on VDJ-database (a curated database of T-cell receptor sequences of known antigen specificity).
We use Python with Jupyter notebook.
To recreate the results of the study you need to pull master branch. Please take a look at requirements files as they contain versions of packages required to run the code.
The code can be run on any OS. As modules that perform computations are well-commented you can add required functionality with little effort. This is also helpful if you want to look what is under the hood of result calculation.
Please note as some algorithms use neural networks, speed of computation depends on whether you have a Cuda device.
Then open .ipynb
file in a folder with a name of the algorithm you are interested in. The input to all pipelines is a set of epitopes for which prediction is tested and training data.
To compare algorithm performance to other ones it is convenient to use AUC calculated for a set of epitopes with ROC for visualization.
Results of algorithm comparison consist of ROC plot for each epitope and calculated AUC on test set. Plots and data can be found in the end of .ipynb
file for each algorithm.
For TCRdist we also used our own substitution matrices:
- the first row represents the results with BLOSUM
- the second row - with our substitution matrices
- Alexandra Ovsyannikova, National Research Nuclear University MEPHI
- Olga Nedilchenko, Bauman Moscow State Technical University
We want to express gratitude to:
- Mikhail Shugay for constant support and for providing us with data for the research.
- Bioinformatics Institute for organizational assistance