inconsistency between transcript_tpm.tsv and transcript_model_tpm.tsv #248
Comments
Dear @Mangosteen24, thank you for the feedback! Your understanding is correct, and this is indeed a little odd. To understand where this inconsistency stems from, one has to go deeper into the algorithms and data. Which version do you use? Some of the inconsistencies were fixed at some point, but I cannot guarantee that all of them have been eliminated. Best
Hi Andrey, I use the latest version of IsoQuant, v3.6.1. First, I checked which reads were assigned to ENST00000275493 and their assignment_type in
Then I checked those 10032 unique read_id in
It seems that most of the unique reads were assigned to * instead of ENST00000275493. Does this mean it is not a known transcript, i.e. NIC or NNIC? I also noticed that the assignment events of most unique reads were 'mono_exonic'; maybe that is why it cannot tell which isoform they come from?
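The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names `read_id`, `isoform_id`, and `assignment_type` are assumed, the table is assumed to start with a plain header row (real files may carry comment lines that need skipping first), and the toy rows, including the second transcript ID, are made up.

```python
import csv
import io
from collections import Counter


def summarize_assignments(tsv_text, transcript_id,
                          id_col="isoform_id",
                          type_col="assignment_type",
                          read_col="read_id"):
    """Count assignment types for reads assigned to one transcript.

    Column names are assumptions; adjust them to the actual header.
    Returns (Counter of assignment types, set of read IDs).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    reads = set()
    for row in reader:
        if row[id_col] == transcript_id:
            counts[row[type_col]] += 1
            reads.add(row[read_col])
    return counts, reads


# Toy input (hypothetical rows, not real data).
EXAMPLE = (
    "read_id\tisoform_id\tassignment_type\n"
    "r1\tENST00000275493\tunique\n"
    "r2\tENST00000275493\tunique\n"
    "r3\tENST00000275493\tambiguous\n"
    "r4\tENST00000000001\tunique\n"
)

counts, reads = summarize_assignments(EXAMPLE, "ENST00000275493")
print(dict(counts))
print(sorted(reads))
```

The returned set of read IDs can then be looked up in a second table to see which transcript model, if any, each read ended up supporting.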
Thank you for providing the insights! I think that's what happens here. These reads map to some unique parts of the ENST00000275493 isoform (i.e. some exons that are only present in this transcript), and thus are uniquely assigned. I can take a further look if you have a chance to send me a BAM file with all the reads from this region or the reads assigned to ENST00000275493 (all types). Best
Thank you @andrewprzh. I have a follow-up question. Since I have a huge dataset, around 40 BAM files, I want to run them simultaneously to obtain a single annotation file. This way, each sample will have the same IDs for NIC/NNIC transcripts for comparison. However, the process seems to have been stuck after running for 24 hours. In the log file, the last line is from a few days ago: 2024-10-26 - INFO - Finished processing chromosome chr10. Here is my command. Could you kindly suggest a solution?
Could you send me the entire log file? Best
I tried again with 27 samples; however the server shut down unexpectedly. At the time, most chromosomes seemed to have finished processing except for four: chr1, chr7, chr11, and chrM. Details are shown in the attached log1.
I then resumed the run and monitored the memory usage closely. I noticed that memory consumption increased to more than 600GB while processing chromosome 1 (see log2), so I manually stopped the program to prevent further issues. At that time, chr1, chr7, chr11, and chrM all appeared as ‘Loading read assignments’ but didn’t finish.
After that, I resumed again with only 2 threads, hoping to reduce memory usage. At the start, chr1 and chr7 were being processed. Update 30 hours after resume2: chr1 had finished, while chr7 and chr11 were being processed. Around 6 hours later, memory usage started to increase dramatically, so I had to stop the run again (see log3). I am unsure whether I should resume again or try a different approach. Could you please advise on the next move?
Furthermore, would it be helpful if
Sorry for the delay. I think the main issue here is
Removing chrM might work as well; it is known to be very slow when too many reads map onto it. Best
Hi, thank you for developing this useful tool!
I would like to inquire about the differences between transcript_tpm.tsv and transcript_model_tpm.tsv.
The file transcript_model_tpm.tsv contains the expression of discovered transcript models in TPM (and corresponds to transcript_models.gtf). It should include all expressed transcripts, both novel and known, correct? However, when I search for a specific transcript, such as the canonical transcript ENST00000275493, I find it only exists in transcript_tpm.tsv with a high TPM value, while it is absent from transcript_model_tpm.tsv. I understand that those are two different algorithms: reference-based and discovery, but it is a bit weird that the ENST00000275493 is completely absent from transcript_model_tpm.tsv. Any reason for this inconsistency?
Similarly, ENST00000275493 is absent from transcript_models.gtf. So, would you recommend using transcript_models.gtf or extended_annotation.gtf for downstream analyses, such as SQANTI3?
Thank you!
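For anyone who wants to find such cases systematically, here is a minimal Python sketch that lists transcripts expressed in one TPM table but absent from the other. It assumes both files are two-column, tab-separated tables (ID, TPM) with a single header row; the header text, the TPM values, and all IDs other than ENST00000275493 are toy placeholders, not real output.

```python
import csv
import io


def load_tpm(tsv_text):
    """Parse a two-column TSV (ID, TPM) into a dict, skipping one header row."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(rows)  # skip the header line (assumed to be present)
    return {row[0]: float(row[1]) for row in rows if row}


def missing_from_models(ref_tpm, model_tpm, min_tpm=0.0):
    """IDs with TPM above min_tpm in ref_tpm that do not appear in model_tpm."""
    return sorted(t for t, v in ref_tpm.items()
                  if v > min_tpm and t not in model_tpm)


# Toy stand-ins for transcript_tpm.tsv and transcript_model_tpm.tsv.
REF = (
    "#feature_id\tTPM\n"
    "ENST00000275493\t100.0\n"
    "ENST_TOY_A\t5.0\n"
    "ENST_TOY_B\t0.0\n"
)
MODELS = (
    "#feature_id\tTPM\n"
    "ENST_TOY_A\t4.5\n"
    "novel_toy_1\t12.0\n"
)

ref = load_tpm(REF)
models = load_tpm(MODELS)
print(missing_from_models(ref, models))
```

Raising `min_tpm` restricts the report to transcripts that are clearly expressed in the reference-based table, which is the case described in this issue.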