UMItools extract with nanopore long reads #661
Comments
UMI-tools uses regex through the regex module using As for why it's not matching..... Secondly you've got this string of
Thanks @IanSudbery! That is really helpful. I see that adding the "*" after the "." at the start of the regex will make a big difference. I spoke with a bioinformatics specialist at ONT about the pattern we're seeing, and they believe it essentially comes down to higher sequencing error rates at the start of the read, caused by strand slippage through the pore and by there being fewer k-mers at the start of the reads. So, I agree with your comment about the string of I didn't include any mismatches in the UMI pattern query because I was worried about being too lenient with the UMI matching and ultimately grouping unrelated UMIs. If I allow an edit distance of one between the UMI pattern and the sequence being matched, but also want to account for possible sequencing errors in the N (random) bases of the UMI, do you suggest any specific way of running UMI-tools group so that the UMIs are clustered appropriately? Thanks again for your help on this.
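To make the trade-off in the question above concrete, here is a minimal sketch of how a mismatch threshold decides which UMIs are candidates for merging. It is not how UMI-tools group is implemented; the helper names `hamming` and `pairs_within` are hypothetical, and the sketch counts substitutions only, so insertions or deletions in the UMI would not be handled.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length UMIs differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairs_within(umis, threshold):
    """Return the pairs of distinct UMIs that differ at `threshold` positions or fewer."""
    unique = sorted(set(umis))
    return [(a, b) for a, b in combinations(unique, 2) if hamming(a, b) <= threshold]


# Toy 4-base UMIs, for illustration only.
observed = ["AAAA", "AAAT", "TTTT", "AAAA"]
print(pairs_within(observed, threshold=1))  # [('AAAA', 'AAAT')]
```

Raising the threshold links more pairs, which is exactly the risk of merging unrelated UMIs described above, so it may be worth inspecting the pair counts at a few thresholds before choosing one.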
Hello!
I would love your opinion or advice on our method, and I have a few questions about how UMI-tools extract works with a regex.
I have nanopore long reads with dual 18 bp UMIs, one on each end of sequence reads that vary in length between 2000 and 2800 bp. I would like to extract both UMIs, concatenate them, and append them to the sequence read name.
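The intended end result can be sketched as a small helper that joins the two UMIs and attaches them to the read identifier. This is an illustration only, not UMI-tools code: the function name `tag_read_name` is hypothetical, and the "_" separator and the choice to tag only the part of the name before the first space are assumptions.

```python
def tag_read_name(read_name, umi_1, umi_2, separator="_"):
    """Append the concatenated UMIs to the read identifier.

    Only the token before the first space is tagged, so any description
    text after the identifier is kept unchanged.
    """
    identifier, _, description = read_name.partition(" ")
    tagged = identifier + separator + umi_1 + umi_2
    if description:
        tagged += " " + description
    return tagged


# Toy example with shortened UMIs.
print(tag_read_name("read1 runid=abc", "AAACG", "TTTTA"))  # read1_AAACGTTTTA runid=abc
```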
The anatomy of my sequence read is:
Barcode - enrichment primer - 18bp UMI - target primer - genomic region of interest - reverse target primer - 18bp UMI - enrichment primer - Barcode
The regex that I have been using looks like this:
"^.(?P<umi_1>.{3}[CT][GA].{3}[CT][GA].{3}[CT][GA].{3})(?P<discard_1>CAGTGGCTCC){e<=1}.{1000,}(?P<discard_2>GCACATGCAG){e<=1}(?P<umi_2>.{3}[CT][GA].{3}[CT][GA].{3}[CT][GA].{3})."
Here I ignore any number of bases before the first instance of my UMI pattern, then search for the flanking primer sequence with fuzzy matching (edit distance of at most 1), followed by at least a thousand bases of the genomic region of interest, the second flanking primer sequence (again with an edit distance of at most 1), the second UMI pattern, and any number of bases after it. I am also running the reverse complement of this regex, as you suggest at the end of #610.
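One way to check whether a pattern does what this description says is to run it against a synthetic read whose layout is known. The sketch below assumes the pattern is used exactly as displayed above, starting with "^." and ending with ".", and compares it with a variant that uses ".*" in both places, as suggested in the reply in this thread. To stay within the standard library it uses Python's `re` module and drops the fuzzy `{e<=1}` terms, which `re` does not support; the anchoring behaviour being tested does not depend on them. The read, the flank lengths, and the helper `build_pattern` are made up for illustration.

```python
import re

UMI = r".{3}[CT][GA].{3}[CT][GA].{3}[CT][GA].{3}"
LEFT_PRIMER = "CAGTGGCTCC"
RIGHT_PRIMER = "GCACATGCAG"


def build_pattern(prefix, suffix, min_insert=1000):
    """Assemble the dual-UMI pattern with a chosen prefix and suffix (no fuzzy terms)."""
    return re.compile(
        "^" + prefix
        + "(?P<umi_1>" + UMI + ")"
        + "(?P<discard_1>" + LEFT_PRIMER + ")"
        + ".{%d,}" % min_insert
        + "(?P<discard_2>" + RIGHT_PRIMER + ")"
        + "(?P<umi_2>" + UMI + ")"
        + suffix
    )


as_displayed = build_pattern(".", ".")   # "." consumes exactly one base
with_star = build_pattern(".*", ".*")    # ".*" consumes any number of bases

# Synthetic read: 40-base flank, UMI, primer, 1200-base insert, primer, UMI, 40-base flank.
umi_1 = "AAACGAAACGAAACGAAA"
umi_2 = "TTTTATTTTATTTTATTT"
read = "A" * 40 + umi_1 + LEFT_PRIMER + "A" * 1200 + RIGHT_PRIMER + umi_2 + "A" * 40

print(as_displayed.search(read))         # None: the UMI is not at the second base
match = with_star.search(read)
print(match.group("umi_1"), match.group("umi_2"))
```

With "^." the first UMI has to begin at the second base of the read, so any read that still carries its barcode and enrichment primer cannot match, whatever the quality of the UMI itself.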
Using this regex and the reverse complement, I am only finding matches in a cumulative 45% of my reads. I understand there may be some molecular issues I have yet to rule out, but can you think of a reason that the regex may not be performing as I expect it to? Or is there some characteristic of the nanopore data that I may be overlooking? Or part of the program I am misunderstanding?
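As an alternative way to think about the two-orientation search mentioned above, the read can be reverse-complemented instead of the regex, so that a single pattern is tried against both strands. This is a sketch under that assumption, not something UMI-tools does for you; `reverse_complement` and `match_either_strand` are hypothetical helper names, and the short pattern in the example is a stand-in for the real one.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def match_either_strand(pattern, seq):
    """Try the pattern on the read as given, then on its reverse complement.

    Returns (strand, match), or None if neither orientation matches.
    """
    for strand, candidate in (("+", seq), ("-", reverse_complement(seq))):
        found = pattern.search(candidate)
        if found:
            return strand, found
    return None


# Stand-in pattern; the read only matches after reverse-complementing.
toy_pattern = re.compile("AACG")
strand, found = match_either_strand(toy_pattern, "CGTT")
print(strand, found.group(0))  # - AACG
```

Tallying how many reads match on "+", on "-", and on neither can help separate an orientation problem from a pattern problem when the overall match rate is low.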
How does the tool use the regex: is the first match of the regex pattern in the read reported? What does it do when the pattern can match more than once? I believe our current regex is specific enough to avoid this issue, but I started with a regex consisting of just the two UMI patterns flanking at least 100 bp. That matched in more than 95% of my reads, but I then realized that the UMI pattern was occurring more than the expected 2 times per read, so I couldn't trust that the UMIs pulled out by extract were true UMIs.
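A rough calculation shows why the bare UMI pattern turns up so often. Only six of its 18 positions are constrained, each to two of the four bases, so under the simplifying assumption of uniform, independent bases a random 18-base window matches with probability (1/2)^6 = 1/64. The sketch below checks that figure by brute force and estimates how many windows of a read would match by chance; the 2400-base read length is an illustrative value within the range given above, and the helper names are hypothetical.

```python
import re
from fractions import Fraction
from itertools import product

UMI_PATTERN = ".{3}[CT][GA].{3}[CT][GA].{3}[CT][GA].{3}"
UMI_LENGTH = 18

# Only the six [CT]/[GA] positions restrict the match; enumerate all 4**6 fillings.
CONSTRAINED = "[CT][GA]" * 3
hits = sum(
    1 for bases in product("ACGT", repeat=6)
    if re.fullmatch(CONSTRAINED, "".join(bases))
)
match_probability = Fraction(hits, 4 ** 6)


def expected_windows(read_length):
    """Expected number of chance matches among all 18-base windows of a read."""
    windows = read_length - UMI_LENGTH + 1
    return windows * match_probability


def count_umi_like_windows(seq):
    """Count every (possibly overlapping) position where the UMI pattern matches."""
    return sum(1 for _ in re.finditer("(?=" + UMI_PATTERN + ")", seq))


print(match_probability)               # 1/64
print(float(expected_windows(2400)))   # about 37 chance matches per read
```

Running `count_umi_like_windows` on real reads gives the observed figure to compare against this estimate, which is why the flanking primer sequences are needed to pin down the true UMIs.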
Thank you for your help!