Replies: 12 comments
-
Hi there,
Sorry for the late response. Bambu shouldn't be able to create new Ensembl transcript IDs. All novel transcripts are assigned a transcript ID of the form tx.XXX, where XXX is an arbitrary number. They may be assigned an Ensembl gene ID if they overlap an annotated gene, but those gene IDs should already be present in the annotation GTF file.
If this doesn't answer your question, you could send me the files, or a screenshot of an example, so that I can look a bit deeper.
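If it helps, one way to separate the novel entries from the annotated ones is to filter the extended GTF by transcript ID. This is just a sketch with a placeholder filename; the `^tx\.` pattern assumes the tx.XXX naming described above:

```r
# Sketch: count novel transcripts in Bambu's extended annotation GTF.
# "extended_annotations.gtf" is a placeholder path; the tx.XXX ID format
# is assumed per the description above.
library(rtracklayer)

gtf <- import("extended_annotations.gtf")
tx  <- gtf[gtf$type == "transcript"]

novel <- tx[grepl("^tx\\.", tx$transcript_id)]
length(novel)              # how many novel transcripts were reported
head(novel$transcript_id)
```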
-
Hey Andre,
No worries, I got that part figured out. But I do have another question: I've been trying out different cutoffs for de novo isoform detection, and no matter how low the stringency, Bambu doesn't seem to detect more than ~20 or so unannotated transcripts in a given sample. Do you have recommendations for how to tune the discovery-mode settings to best detect unannotated transcripts?
Best,
Alex
-
Hi Alex,
There are many things that could influence this. Which parameters are you using for your bambu run, and is this a standard long-read run or a custom library preparation (i.e., one that used primers to target specific genes)?
I would recommend just using the NDR (novel discovery rate) parameter, and increasing the NDR if you want a more sensitive output. In a human sample we generally use between 0.1 and 0.2; however, if you are working with a different species with less comprehensive annotations, you might want to increase this value. The maximum this value can be set to is 1, which means every novel transcript with at least 2 reads will be annotated.
In the upcoming patch (see the BambuRevisionManuscript branch) we are also introducing a recommended-NDR feature, which will suggest an NDR value for your sample in the event many annotations are missing.
-
Hey Andre,
Below is an example of how I'm running Bambu. For "x", I tried 0.01, 0.05, 0.10 (default), 0.25, 0.50, 0.75, and 1.00. The "bam_file" is from a standard long-read run. The results were actually exactly the same across all NDR levels: only annotated transcripts were detected and quantified. When I instead tried adjusting minGeneFrac and minReadCount, it did detect some unannotated transcripts, but not many. Would you recommend a specific combination of arguments to tune for a more sensitive novel discovery rate?
txdb <- makeTxDbFromGFF(gtf_file)
se_bambu <- bambu(reads = bam_file, annotations = txdb,
                  genome = genome_fasta_file, discovery = TRUE,
                  opt.discovery = list(NDR = x))
Thank you,
Alex
-
Hi Alex,
This is very strange; if the NDR is 1, you should be getting loads of novel transcripts, as that is maximum sensitivity.
Could you let me know how you changed minGeneFrac and minReadCount to get some unannotated transcripts? This might give me a clue as to what is occurring.
Could you also please try a run where you replace makeTxDbFromGFF(gtf_file) with prepareAnnotations(gtf_file) as a sanity check for me?
In the likely event this doesn't make a difference, could you add rcOutDir = "/some/output/directory/here" to your arguments? This should output an .rds file which contains all the read classes for your sample (essentially all possible candidates before any filtering). If you could send that to me, I could have a look at what might be going on.
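A minimal sketch of the run being suggested, with placeholder file paths (prepareAnnotations and rcOutDir are the bambu arguments mentioned above):

```r
# Sketch of the suggested sanity-check run; all file paths are placeholders.
library(bambu)

# Use bambu's own annotation loader rather than makeTxDbFromGFF:
annotations <- prepareAnnotations("annotations.gtf")

# rcOutDir writes an .rds of all candidate read classes before filtering:
se <- bambu(reads = "sample.bam",
            annotations = annotations,
            genome = "genome.fa",
            rcOutDir = "/some/output/directory/here")
```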
-
Hey Andre,
I was getting novel transcripts with min.readCount <= 5, and with min.readFractionByGene <= 0.05 I got just a few novel transcripts. I am now trying out min.readFractionByGene = 0.05 with different NDRs (0.25, 0.50, 0.75, 1.00) to see how that does. I've also added the rcOutDir argument, so I'll send you one of the outputs when it's finished. The reason I'm using makeTxDbFromGFF instead of prepareAnnotations is that we're using a custom annotation file (human) and we found that prepareAnnotations was importing the standard annotation instead.
Thanks,
Alex
-
Hi Alex,
The default threshold for min.readCount is 2, meaning only transcripts with 1 full-length read are filtered out by this threshold. Your threshold of 5 is therefore less sensitive than the default. (Note that it is >=, not <=.) The default min.readFractionByGene is also 0.05.
Let me know once you can upload the rcOutDir output; it gets produced before the quantification step, so it may be in the directory even before the run finishes.
I also wanted to ask about what happened with your custom annotation file when using prepareAnnotations, as it is not expected behavior for it to import anything other than the .gtf provided to it. How did you know it loaded a standard annotation and not your custom annotation?
Could you please check the annotations object you created with makeTxDbFromGFF using mcols(annotations), and double-check that TXNAME and GENEID are not the same, and that there are multiple rows with different TXNAMEs that share a GENEID? We have had reports that makeTxDbFromGFF sometimes causes all transcript and gene names to be the same, resulting in no multi-isoform genes. This is a big problem for bambu and might be the crux of the issue.
Thanks, I hope we will be able to get this working for you soon!
Andre
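The annotation check being requested could look something like this. A sketch with a placeholder path; the assumption (not stated above) is that prepareAnnotations also accepts a TxDb object, so the annotations built with makeTxDbFromGFF can be inspected with mcols():

```r
# Sketch of the mcols() sanity check described above; path is a placeholder.
# Passing a TxDb to prepareAnnotations() is an assumption here.
library(bambu)
library(GenomicFeatures)

txdb        <- makeTxDbFromGFF("annotations.gtf")
annotations <- prepareAnnotations(txdb)
m           <- mcols(annotations)

all(m$TXNAME == m$GENEID)   # should be FALSE: transcript and gene IDs differ
any(duplicated(m$GENEID))   # should be TRUE: multi-isoform genes exist
```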
-
Hey Andre,
Right, I tried several different min.readCount and min.readFractionByGene cutoffs, from more to less stringent, and like I said I was able to detect just a few novel transcripts from those cutoffs downward. It is strange that I didn't get any when increasing the NDR, compared to even the default cutoffs.
With the custom annotation, when I used prepareAnnotations I noticed the resulting txdb object contained transcript IDs that were not in my custom annotation, which is why I switched to using makeTxDbFromGFF, which works just fine for me (no problems with duplicate transcript/gene IDs, and the resulting txdb object contained multi-isoform genes).
Here's the link to one of the rcOutDir outputs with NDR = 1.0: <https://drive.google.com/file/d/1KPyQG2K5XXXii03kzkZouEVNiFQRqPoE/view?usp=sharing>
Thank you,
Alex
-
Hi Alex,
Thanks for sending the rcFile. I haven't finished investigating this, but I will update you with what I have seen so far.
I had a look at the rcFile, and Bambu reports 523982 read classes (with 2 or more reads), with 14283 being already annotated and the remaining 509699 being novel candidates. When I look at the transcript probability score of these read classes, nearly all the novel candidates are scored very low, with only 2725 read classes scoring > 0.5.
After extending the annotations (where additional filtering is performed, including the NDR and min.readFractionByGene), I am left with 874 novel genes with the default thresholds and NDR = 1 in the final output.
This was with my own human genome and annotation file (which might explain the different results); however, while this sounds higher than what you are reporting, it is still lower than I would expect, so I am going to investigate further and get back to you.
Kind Regards,
Andre Sim
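For reference, numbers like those above can be pulled from the read-class .rds file along these lines. This is a rough sketch; the filename is a placeholder, and the exact object structure and the column name for the transcript probability score ("txScore" below) are assumptions that may differ between bambu versions:

```r
# Sketch: inspecting a read-class file written to rcOutDir.
# "sample_readClasses.rds" is a placeholder; "txScore" is an assumed
# column name for the transcript probability score.
library(SummarizedExperiment)

rc <- readRDS("sample_readClasses.rds")

rc                  # typically a RangedSummarizedExperiment
head(rowData(rc))   # per-read-class metadata

# Read classes with a high transcript probability score:
sum(rowData(rc)$txScore > 0.5, na.rm = TRUE)
```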
-
Do you think it's possible that many/all of the novel transcripts just aren't being included in the output from writeBambuOutput()? I'm primarily looking at the transcript counts and the GTF file produced by that command. Also, is there a way to manually filter the rcFile myself?
Best,
Alex
-
Hi Alex,
Sorry again for the belated response; I was busy last week and didn't get a chance to investigate this further until today.
I found that the majority of candidate novel transcripts in this sample appear to be subsets of known transcripts, that is, transcripts which share a continuous run of exons with a known transcript and could also be produced by degradation of those transcripts. By default, bambu does not output subset transcripts, due to the high chance that they are false positives. However, when I allowed subset transcripts to be output, more novel transcripts were retained. Therefore, if you are looking to boost the number of novel transcripts, you could consider allowing the subset transcripts to be output as well. (I would recommend an NDR of 0.1-0.2, and not 1.)
To answer your question: it is technically possible to filter the rcFile yourself, but it is not very user friendly, and the quantification statistics are not yet included at that step.
I believe it is unlikely that writeBambuOutput() is the cause. To check this, you can view the annotations before they are output by looking at the rowRanges of the se object returned by bambu, i.e. rowRanges(se). This should have the same number of annotations as the output GTF file.
Let me know if this doesn't solve your issue and I can reopen this.
Kind Regards,
Andre
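The rowRanges check described above could be sketched like this. All paths are placeholders, and the output GTF filename written by writeBambuOutput() is an assumption:

```r
# Sketch: confirm writeBambuOutput() writes the same annotations that the
# bambu output object holds. Paths and the GTF filename are placeholders.
library(bambu)
library(rtracklayer)

writeBambuOutput(se, path = "bambu_out")   # se is the object bambu() returned
n_in_object <- length(rowRanges(se))

gtf      <- import("bambu_out/extended_annotations.gtf")
n_in_gtf <- length(unique(gtf$transcript_id))

n_in_object == n_in_gtf   # TRUE would mean no annotations were dropped
```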
-
Awesome, thanks for checking all that, this is very helpful!
-
Hi there,
I'm trying to look at novel/unannotated isoforms detected and quantified by Bambu, and I'd like some clarification and/or advice. I get an extended annotation GTF file, and upon inspection it contains all the Ensembl transcript IDs that were in my reference GTF, plus several hundred Ensembl transcript IDs that were not in my reference GTF. Where is Bambu getting these Ensembl transcript IDs from? I was hoping unannotated transcripts would be given some kind of separate ID instead of being associated with existing annotations.