Replies: 12 comments
-
Hi there,
Sorry for the late response. Bambu shouldn't be able to create new Ensembl transcript IDs. All novel transcripts are assigned a transcript ID of the form tx.XXX, where XXX is an arbitrary number. They may be assigned an Ensembl gene ID if they overlap an annotated gene, but those gene IDs should already be present in the annotation GTF file.
If this doesn't answer your question, you could send me the files, or a screenshot of an example, so that I can look a bit deeper.
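If it helps, one way to separate the novel entries from the annotated ones is to filter the extended GTF by transcript ID. This is just a sketch with a placeholder filename; the `^tx\.` pattern assumes the tx.XXX naming described above:

```r
# Sketch: count novel transcripts in Bambu's extended annotation GTF.
# "extended_annotations.gtf" is a placeholder path; the tx.XXX ID format
# is assumed per the description above.
library(rtracklayer)

gtf <- import("extended_annotations.gtf")
tx  <- gtf[gtf$type == "transcript"]

novel <- tx[grepl("^tx\\.", tx$transcript_id)]
length(novel)              # how many novel transcripts were reported
head(novel$transcript_id)
```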
-
Hey Andre,
No worries, I got that part figured out. But I do have another question: I've been trying out different cutoffs for de novo isoform detection, and no matter how low the stringency, Bambu doesn't seem to detect more than ~20 or so unannotated transcripts in a given sample. Do you have recommendations for how to tune the discovery-mode settings to best detect unannotated transcripts?
Best,
Alex
-
Hi Alex,
There are many things that could influence this. Which parameters are you using for your bambu run, and is this a standard long-read run or a custom library preparation (i.e., one that used primers to target specific genes)?
I would recommend just using the NDR (novel discovery rate) parameter, and increasing the NDR if you want a more sensitive output. In a human sample we generally use between 0.1 and 0.2; however, if you are working with a different species with less comprehensive annotations, you might want to increase this value. The maximum this value can be set to is 1, which means every novel transcript with at least 2 reads will be annotated.
In the upcoming patch (see the BambuRevisionManuscript branch) we are also introducing a recommended-NDR feature, which will suggest an NDR value for your sample in the event many annotations are missing.
-
Hey Andre,
Below is an example of how I'm running Bambu. For "x", I tried 0.01, 0.05, 0.10 (default), 0.25, 0.50, 0.75, and 1.00. The "bam_file" is from a standard long-read run. The results were actually exactly the same across all NDR levels: only annotated transcripts were detected and quantified. When I instead tried adjusting minGeneFrac and minReadCount, it did detect some unannotated transcripts, but not many. Would you recommend a specific combination of arguments to tune for a more sensitive novel discovery rate?
txdb <- makeTxDbFromGFF(gtf_file)
se_bambu <- bambu(reads = bam_file, annotations = txdb,
                  genome = genome_fasta_file, discovery = TRUE,
                  opt.discovery = list(NDR = x))
Thank you,
Alex
-
Hi Alex,
This is very strange; if the NDR is 1, you should be getting loads of novel transcripts, as that is maximum sensitivity.
Could you let me know how you changed minGeneFrac and minReadCount to get some unannotated transcripts? This might give me a clue as to what is occurring.
Could you also please try a run where you replace makeTxDbFromGFF(gtf_file) with prepareAnnotations(gtf_file) as a sanity check for me?
In the likely event this doesn't make a difference, could you add rcOutDir = "/some/output/directory/here" to your arguments? This should output an .rds file which contains all the read classes for your sample (essentially all possible candidates before any filtering). If you could send that to me, I could have a look at what might be going on.
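A minimal sketch of the run being suggested, with placeholder file paths (prepareAnnotations and rcOutDir are the bambu arguments mentioned above):

```r
# Sketch of the suggested sanity-check run; all file paths are placeholders.
library(bambu)

# Use bambu's own annotation loader rather than makeTxDbFromGFF:
annotations <- prepareAnnotations("annotations.gtf")

# rcOutDir writes an .rds of all candidate read classes before filtering:
se <- bambu(reads = "sample.bam",
            annotations = annotations,
            genome = "genome.fa",
            rcOutDir = "/some/output/directory/here")
```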
-
Hey Andre,
I was getting novel transcripts with min.readCount <= 5, and with min.readFractionByGene <= 0.05 I got just a few novel transcripts. I am now trying out min.readFractionByGene = 0.05 with different NDRs (0.25, 0.50, 0.75, 1.00) to see how that does. I've also added the rcOutDir argument, so I'll send you one of the outputs when it's finished. The reason I'm using makeTxDbFromGFF instead of prepareAnnotations is that we're using a custom annotation file (human) and we found that prepareAnnotations was importing the standard annotation instead.
Thanks,
Alex
-
Hi Alex,
The default threshold for min.readCount is 2, meaning only transcripts with 1 full-length read are filtered out by this threshold. Your threshold of 5 is therefore less sensitive than the default. (Note that it is >=, not <=.) The default min.readFractionByGene is also 0.05.
Let me know once you can upload the rcOutDir output; it gets produced before the quantification step, so it may be in the directory even before the run finishes.
I also wanted to ask about what happened with your custom annotation file when using prepareAnnotations, as it is not expected behavior for it to import anything other than the .gtf provided to it. How did you know it loaded a standard annotation and not your custom annotation?
Could you please check the annotations object you created with makeTxDbFromGFF using mcols(annotations), and double-check that TXNAME and GENEID are not the same, and that there are multiple rows with different TXNAMEs that share a GENEID? We have had reports that makeTxDbFromGFF sometimes causes all transcript and gene names to be the same, resulting in no multi-isoform genes. This is a big problem for bambu and might be the crux of the issue.
Thanks, I hope we will be able to get this working for you soon!
Andre
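The annotation check being requested could look something like this. A sketch with a placeholder path; the assumption (not stated above) is that prepareAnnotations also accepts a TxDb object, so the annotations built with makeTxDbFromGFF can be inspected with mcols():

```r
# Sketch of the mcols() sanity check described above; path is a placeholder.
# Passing a TxDb to prepareAnnotations() is an assumption here.
library(bambu)
library(GenomicFeatures)

txdb        <- makeTxDbFromGFF("annotations.gtf")
annotations <- prepareAnnotations(txdb)
m           <- mcols(annotations)

all(m$TXNAME == m$GENEID)   # should be FALSE: transcript and gene IDs differ
any(duplicated(m$GENEID))   # should be TRUE: multi-isoform genes exist
```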
-
Hey Andre,
Right, I tried several different min.readCount and min.readFractionByGene cutoffs, from more to less stringent, and like I said I was able to detect just a few novel transcripts from those cutoffs downward. It is strange that I didn't get any when increasing the NDR, compared to even the default cutoffs.
With the custom annotation, when I used prepareAnnotations I noticed the resulting txdb object contained transcript IDs that were not in my custom annotation, which is why I switched to using makeTxDbFromGFF, which works just fine for me (no problems with duplicate transcript/gene IDs, and the resulting txdb object contained multi-isoform genes).
Here's the link to one of the rcOutDir outputs with NDR = 1.0: <https://drive.google.com/file/d/1KPyQG2K5XXXii03kzkZouEVNiFQRqPoE/view?usp=sharing>
Thank you,
Alex
-
Hi Alex,
Thanks for sending the rcFile. I haven't finished investigating this, but I will update you with what I have seen so far.
I had a look at the rcFile, and Bambu reports 523982 read classes (with 2 or more reads), with 14283 being already annotated and the remaining 509699 being novel candidates. When I look at the transcript probability score of these read classes, nearly all the novel candidates are scored very low, with only 2725 read classes scoring > 0.5.
After extending the annotations (where additional filtering is performed, including the NDR and min.readFractionByGene), I am left with 874 novel genes with the default thresholds and NDR = 1 in the final output.
This was with my own human genome and annotation file (which might explain the different results); however, while this sounds higher than what you are reporting, it is still lower than I would expect, so I am going to investigate further and get back to you.
Kind Regards,
Andre Sim
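For reference, numbers like those above can be pulled from the read-class .rds file along these lines. This is a rough sketch; the filename is a placeholder, and the exact object structure and the column name for the transcript probability score ("txScore" below) are assumptions that may differ between bambu versions:

```r
# Sketch: inspecting a read-class file written to rcOutDir.
# "sample_readClasses.rds" is a placeholder; "txScore" is an assumed
# column name for the transcript probability score.
library(SummarizedExperiment)

rc <- readRDS("sample_readClasses.rds")

rc                  # typically a RangedSummarizedExperiment
head(rowData(rc))   # per-read-class metadata

# Read classes with a high transcript probability score:
sum(rowData(rc)$txScore > 0.5, na.rm = TRUE)
```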
-
Do you think it's possible that many/all of the novel transcripts just aren't being included in the output from writeBambuOutput()? I'm primarily looking at the transcript counts and the GTF file produced by that command. Also, is there a way to manually filter the rcFile myself?
Best,
Alex
-
Hi Alex,
Sorry again for the belated response; I was busy last week and didn't get a chance to investigate this further until today.
I found that the majority of candidate novel transcripts in this sample appear to be subsets of known transcripts, that is, transcripts which share a continuous run of exons with a known transcript and could also be produced by degradation of those transcripts. By default, bambu does not output subset transcripts, due to the high chance that they are false positives. However, when I allowed subset transcripts to be output, more novel transcripts were retained. Therefore, if you are looking to boost the number of novel transcripts, you could consider allowing the subset transcripts to be output as well. (I would recommend an NDR of 0.1-0.2, and not 1.)
To answer your question: it is technically possible to filter the rcFile yourself, but it is not very user friendly, and the quantification statistics are not yet included at that step.
I believe it is unlikely that writeBambuOutput() is the cause. To check this, you can view the annotations before they are output by looking at the rowRanges of the se object returned by bambu, i.e. rowRanges(se). This should have the same number of annotations as the output GTF file.
Let me know if this doesn't solve your issue and I can reopen this.
Kind Regards,
Andre
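The rowRanges check described above could be sketched like this. All paths are placeholders, and the output GTF filename written by writeBambuOutput() is an assumption:

```r
# Sketch: confirm writeBambuOutput() writes the same annotations that the
# bambu output object holds. Paths and the GTF filename are placeholders.
library(bambu)
library(rtracklayer)

writeBambuOutput(se, path = "bambu_out")   # se is the object bambu() returned
n_in_object <- length(rowRanges(se))

gtf      <- import("bambu_out/extended_annotations.gtf")
n_in_gtf <- length(unique(gtf$transcript_id))

n_in_object == n_in_gtf   # TRUE would mean no annotations were dropped
```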
-
Awesome, thanks for checking all that, this is very helpful!
-
Hi there,
I'm trying to look at novel/unannotated isoforms detected and quantified by Bambu, and I'd like some clarification and/or advice. I get an extended annotation GTF file, and upon inspection it contains all the Ensembl transcript IDs that were in my reference GTF, plus several hundred Ensembl transcript IDs that were not in my reference GTF. Where is Bambu getting these Ensembl transcript IDs from? I was hoping unannotated transcripts would be given some kind of separate ID instead of being associated with existing annotations.