Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent read assignment with bam file #250

Open
airichli opened this issue Oct 9, 2024 · 2 comments
Open

Inconsistent read assignment with bam file #250

airichli opened this issue Oct 9, 2024 · 2 comments

Comments

@airichli
Copy link

airichli commented Oct 9, 2024

Hi,

Thanks for developing this nice tool. I have met a problem, the read strand is different between .read_assignments.tsv and bam file, this leads to novel gene identification (antisense), but it seems to be from the known gene rather than novel gene, could you please give me some advice?

My command line is (the data is generate by ONT cDNA sequencing, isoquant version is version 3.5.0):
isoquant.py -t 1 --reference mm10.chr19.fa --genedb gencode.vM23.annotation.chr19.gtf --fastq test.fastq.gz --model_construction_strategy default_ont --report_novel_unspliced true --data_type nanopore -o test_out --prefix test_out --complete_genedb

In the below example, one is "+", the other is "16", they are inconsistent. Thanks.

In the test_out.read_assignments.tsv file:
ff4b5d98-6398-43ee-8b9b-6416e9083ee7 chr19 + ENSMUST00000113533.2 ENSMUSG00000024790.8 inconsistent intron_alternation_novel:6116267-6118343,tss_match_precise:2,correct_polya_site_left:6115998,antisense 6115998-6116266,6118344-6118561 gene_assignment=inconsistent; PolyA=True; Classification=novel_not_in_catalog;

However, in the bam file in aux folder
ff4b5d98-6398-43ee-8b9b-6416e9083ee7 16 chr19 6115998 60 106S88M4I26M1I2M1D52M1D38M1D60M2077N160M1D57M135S * 0 0 TCACAAGGTTTGAATCTTTGTCCCTCTCACTTGCCTGTCGCTCTATCTTCAGAGGAGAGTCCGCCGCCCGCAAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTGGAGATCATAATGAGTTCCTTTTATTGTTTGGGAATTATGAAAAAGAAGCCAGGTGCTCTGACCCAGGCCTTGTGGGAGACGAGCAAGACCCTCTCAAGCTGCAGAGCCAGGCTCGTGAATGTCACCTTCCTCAGCCATGACCACATCTTCCAGGGTGTGTCCCTGCACTTATTCCCCACTAATATGTGGTAGGCACCAGGGGGTGGAGTCCTTCCTCAGAGTATTGACCCCTCAGGAATACAACTCTGTCTTTATCCAAGGTCAGTCCATGGGTAGTTTGCAGCCGGACATGGTAGGGGTGTGGGTAGAGGACGCACACTCAGCATCGCCGGGCCAGCGGAAGCGCGGCACTGCATCCTGCCGTGGGCTCGAGGGACAGACTCGAGCCCCAGTCTGGCCAAACGCAGCCTGCCGCGGCTTTCTCATTTGCCGCTCCCGCGGCCTTTAGAAACGCCTCCCATAATCCATCGCGCCTGGCTTCCCCAAATGTGAACTCGAATGTTAATCGTAAAGCAATATCAGCACCAACAGAAAGAGAGGACAAAGGTTTCAACGCTTGTCACGGTTACGCCTTCGCCAATTGCTGGATAAAGACCGGTCTCAAGGGTTACATAGC $$&%&&'$#$$%%)&%%&&'&&%%&&%$%%$####%''&''&%&&''5799:9>=<=<=>E988+(((1*');:666:CEOSSDCDECEECCBEHHJJSLHSNFDEFIDBFIFGA>>IGGGEHGEIGLKS<<;4435***=>CBDDBDBCIHLHKSGCCBDDCDFBFEDFC==>H@C;3323237899::@9854490--.:>FJIEDECDGDMLFGGHGGCDAHK+++-<786AA??@@/.013>-,,,,CC>:78<=::;>MQSSSISGKSCDIEEC==<55;<@=>@=:<866899DHSSB:::;FHPJJGC>=>CIRSHGEDFHMGLOKNOFB@ABHK?812<>=GLIKFDCC>HDMNOPCBCCHDBAAF>B<856766;<>FHG@?@A?644335=EFCGDFJKGGIKSSHCAC??8766111142229@B?>))),*('(%%&$$$224BOGIM??>?>7311,/-,,++$%&$%%)4113-@CFSGQLSFHHFBC@?AAFCCB==JSHEE?IE9645=>=>CDB=EHJIKGLSKIJEIJSLGQLMHIGKMKIJSISGI>FH@AFEIHIHSSSSSSSSSSSSSSSGGFKSSSE;>@96--.2<==<<9788ACC?CCA@=DCD9333>?CGECB?3222333846541-,,,+)(%)+,.011-,%$ NM:i:10 ms:i:459 AS:i:423 nn:i:0 ts:A:- tp:A:P cm:i:125 s1:i:457 s2:i:0 de:f:0.0143 MD:Z:116^T52^A38^G61C158^G57 rl:i:39

@andrewprzh
Copy link
Collaborator

Dear @airichli

The strand indicated in read assignments and BAM files are two different strands and might not be equal in general.

The BAM file indicates mapping strand, i.e. whether a read was reverse-complemented during the alignment. Since ONT cDNA data is non-strand-specific (correct me if you use some specific protocol), strands in BAM files are distributed arbitrary ~50/50. Basically half of the reads are sequenced from cDNA with the same strand as the reference genome, half - from the reverse one.

Read assignments simple indicate the strand of the matching gene / isoform.

However, in your particular case read assignments also tell "antisense". This means the read was assigned to a + stand gene, but other properties, such as polyA or splice junctions tell that the strand is negative. I'll take a look at this case, thanks for the report.
Could you also send me the resulting novel antisense gene/transcript?

Best
Andrey

@airichli
Copy link
Author

airichli commented Oct 14, 2024

Thanks Andrey for your kind assistance.
testdata.zip

I attach the testdata.zip, including

  1. the resulting novel antisense gene/transcript(mouse_allsamples_chr19.transcript_models.gtf.ENSMUST00000113533.2)
  2. read assigned to the transcript from the novel gene (mouse_allsamples_chr19.transcript_model_reads.tsv.transcript19671.chr19.nnic), including the test read ff4b5d98-6398-43ee-8b9b-6416e9083ee7
  3. 1 test read fastq and the isoquant result runing with just 1 read (ff4b5d98-6398-43ee-8b9b-6416e9083ee7)

I checked the read ff4b5d98-6398-43ee-8b9b-6416e9083ee7 in IGV, and it should be derived from the known gene rather than antisense novel gene, but isoquant just say it is antisense.

image

Thanks again and looking forward to hearing from you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants