unexpected poor performance of e2efold #5
Comments
Hi, thank you very much for your interest! For your information, unlike the previous methods, our method is a learning method. We assume that the distribution of the testing dataset is similar to that of the training dataset. We have also noticed that our method does not perform well when the trained model is applied to a new RNA type that does not appear in the training dataset. The reason is that the distribution and threshold learned from the training sequences differ from those in the testing sequences. The out-of-distribution learning problem is a fascinating and open research direction, which we are currently exploring.
I am afraid this may not be the only reason. For example, SPOT-RNA is also a deep learning algorithm similar to e2efold, except that it does not have unrolled algorithms and uses a much simpler loss function. However, its performance is actually very good on this dataset. Also, could you explain what you mean by "a new RNA type"?
Thank you very much for raising that.
Thank you, Chengxin, for testing our trained model on your dataset. It is neither unexpected nor surprising to me that the performance is not so good on your dataset. It is very likely that the distribution of your test dataset is very different from that of our training dataset. The performance of a deep learning model is mainly determined by the following 3 factors: E2Efold focuses on improving (1) and also has contributions in (2). To fairly compare E2Efold with SPOT-RNA, we suggest you use the SAME training data for both models and then compare their performance on your test dataset. As Yu mentioned earlier, SPOT-RNA is trained on bpRNA. If their training dataset (after any preprocessing or filtering) is available, you can try to re-train our model on it. Our code for training the model is available in this repo. If you encounter any difficulty, we are very happy to help you out.
This brings me to my second question. When e2efold generated the training/validation/test sets, did it check for sequence redundancy? For example, cd-hit-est clustering at an 80% sequence identity cutoff (which is already a very permissive cutoff) shows that only 1231 of the 3966 archiveII RNA sequences can be considered unique. The large redundancy is partly due to the archiveII dataset including both full-length and individual domain sequences. For example
are just sub-sequences of |
There is a preprocessing step where we remove redundant structures. The case you mentioned, where the same sequence appears in both the test and training datasets in different forms, also does not occur. Our experiment has the following data split after removing the redundancy: [RNAStralign train, RNAStralign vali, RNAStralign test]. The model is trained on [RNAStralign train], validated on [RNAStralign vali], and tested on both [RNAStralign test] and [ArchiveII(filtered)], where [ArchiveII(filtered)] is also filtered and does not overlap with RNAStralign. Still, I guess the BEST way to resolve your concerns about both our methodology and the dataset is to train E2Efold on the training set of SPOT-RNA. Give it a try?
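The overlap concern discussed here (a test sequence that is identical to, or a sub-sequence of, a training sequence) can be checked with a short script. The following is only a minimal sketch under stated assumptions: the function name `find_overlaps` and the toy sequences are hypothetical, the check is exact substring containment only, and it is not the filtering procedure used by E2Efold.

```python
def find_overlaps(train_seqs, test_seqs):
    """Return (test_id, train_id) pairs where one sequence contains the other.

    Both arguments are dicts mapping a sequence id to an RNA string.
    Exact duplicates are reported too, since a string contains itself.
    This is a brute-force O(n * m) comparison, fine for a quick sanity check.
    """
    overlaps = []
    for test_id, test_seq in test_seqs.items():
        for train_id, train_seq in train_seqs.items():
            if test_seq in train_seq or train_seq in test_seq:
                overlaps.append((test_id, train_id))
    return overlaps


# Toy data (hypothetical): "domain" is a sub-sequence of "full_length".
train = {"full_length": "GGGAAACUUCGGUUUCCC", "other": "AUGCAUGCAUGC"}
test = {"domain": "AAACUUCGGUUU", "novel": "CCCCGGGGAAAA"}

overlaps = find_overlaps(train, test)
print(overlaps)
```

Exact containment only catches the full-length versus domain case; near-duplicates with mismatches still need an identity-based clustering tool.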
For the RNAStralign dataset, it was mentioned that "After removing redundant sequences and structures, 30451 structures remain." Could you explain a little more about what the original text means by "removing redundant sequences and structures"?
@kad-ecoli Running cd-hit-est with -c 0.8 showed that only around 3000 sequences in RNAStrAlign can be considered unique.
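For anyone who wants to reproduce this kind of count, the number of unique sequences at a given identity cutoff is the number of clusters in the `.clstr` file that cd-hit-est writes next to its output. The sketch below assumes the usual `.clstr` layout, where every cluster starts with a `>Cluster N` line; the function name `count_clusters`, the file names in the comment, and the example records are hypothetical.

```python
# A cd-hit-est run at 80% identity might look like this (file names are
# placeholders; -n 5 is the word size usually paired with -c 0.80 to 0.85):
#
#   cd-hit-est -i sequences.fasta -o sequences_nr80 -c 0.8 -n 5
#
# cd-hit-est then writes the clusters to sequences_nr80.clstr.


def count_clusters(clstr_text):
    """Count clusters in the text of a CD-HIT .clstr file.

    Each cluster begins with a header line starting with '>Cluster';
    member lines start with an index, so they are not counted.
    """
    return sum(1 for line in clstr_text.splitlines() if line.startswith(">Cluster"))


# Made-up .clstr content with two clusters and three sequences in total.
example = "\n".join([
    ">Cluster 0",
    "0\t120nt, >seqA... *",
    "1\t118nt, >seqB... at +/95.00%",
    ">Cluster 1",
    "0\t77nt, >seqC... *",
])

n_unique = count_clusters(example)
print(n_unique)
```

To apply it to a real run, read the `.clstr` file into a string and pass it to `count_clusters`.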
I have tested e2efold on a set of 361 PDB chains. Secondary structures for RNAs shorter than 600 nucleotides were predicted by e2efold_productive/e2efold_productive_short.py, while those longer than 600 nucleotides were predicted by e2efold_productive/e2efold_productive_long.py. To my great surprise, when evaluated against the DSSR-assigned canonical base pairs of this dataset, the *.ct files predicted by e2efold have very low average F1 and MCC of 0.2400 and 0.2401, respectively, which are significantly worse than the SOTA methods listed in Table 2 of the e2efold paper (https://openreview.net/pdf?id=S1eALyrYDH). The following is my benchmark result, ranked in ascending order of F1 score.
I have attached the predicted ct files below. I also included the 4 sequences listed under e2efold_productive/*_seqs/*seq and confirmed that my run generates ct files identical to those in the GitHub repository.
e2e.zip
Could you check whether I ran the e2efold program incorrectly, which might explain such low performance? In particular, could you check why e2efold predicts on average only 18.2133 base pairs per RNA chain, while the average number of canonical base pairs in the native structures is as high as 28.6648? Thank you.
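To make the F1 and MCC figures above easier to reproduce, here is a minimal sketch of a per-chain evaluation. It rests on several assumptions that the post does not confirm: base pairs are read from column 5 of a standard CT file, a predicted pair counts as correct only on an exact match (no shifted pairs allowed), and true negatives are taken as all remaining position pairs, L(L-1)/2 minus TP, FP and FN. The function names `read_ct_pairs` and `pair_metrics` and the toy data are hypothetical.

```python
import math


def read_ct_pairs(ct_text):
    """Parse CT-format text; return (length, set of (i, j) pairs with i < j)."""
    lines = [ln for ln in ct_text.splitlines() if ln.strip()]
    length = int(lines[0].split()[0])  # header line: "<length> <title>"
    pairs = set()
    for ln in lines[1:length + 1]:
        fields = ln.split()
        i, j = int(fields[0]), int(fields[4])  # column 5 holds the partner index
        if j > i:  # j == 0 means unpaired; j < i was recorded from the other side
            pairs.add((i, j))
    return length, pairs


def pair_metrics(predicted, native, length):
    """Exact-match precision, recall, F1 and MCC for two sets of base pairs."""
    tp = len(predicted & native)
    fp = len(predicted - native)
    fn = len(native - predicted)
    tn = length * (length - 1) // 2 - tp - fp - fn  # assumed definition of TN
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy CT record: a 4-nt chain where position 1 pairs with position 4.
ct_example = "\n".join([
    "4 example",
    "1 G 0 2 4 1",
    "2 A 1 3 0 2",
    "3 A 2 4 0 3",
    "4 C 3 0 1 4",
])
ct_length, ct_pairs = read_ct_pairs(ct_example)

# Toy comparison on a 10-nt chain: two of three predicted pairs are correct.
native = {(1, 10), (2, 9), (3, 8)}
predicted = {(1, 10), (2, 9), (4, 7)}
scores = pair_metrics(predicted, native, length=10)
print(ct_length, ct_pairs, scores)
```

Averaging `f1` and `mcc` over all chains gives dataset-level numbers comparable in spirit to those quoted above, though the exact values depend on the matching rule and TN definition chosen.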