-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sumstatsBP_overlap.py failed to handle paired-end reads with unmatched length in the overlapped region. #86
Comments
thanks for digging into this. maybe the best way is to take the min of the lengths of R1 and R2 in settings where they are not equal rather than fully discarding? you wouldn't get an index error that way I don't think and would keep more data. |
Hi Caleb, Thanks for the speedy reply! That sounds good to me and I will try to do it :). Will keep you posted. Best, |
Hi Caleb, I modified the script further and rewrote the part to deal with the overlapped region between the forward and reverse strands. Instead of using the position of the base on the read, I used the ref_pos to get the bases from both fwd and rev. To do that:
The code looks like this: base comparison between fwd and rev
After that, instead of using qpos, it used ref_pos when constructing the fwd_use_idx and rev_use_idx. Also, it takes care of those reads that fwd is covered by rev, or vice versa. alignment quality is checked as previously implemented. fwd and rev partitioning
Then, I used ref pos instead of qpos for counting and base quality calculation by simply replace pair[0] to pair[1] in your script: counting
I uploaded the updated script here sumstatsBP_overlap.zip also. I tested on my end and it worked, but the coverage has somehow dropped a little bit, which is opposite to my expectation. The logic of the code makes sense to me but maybe I missed some obvious thing. What do you think about this modification? Best, |
Describe the bug
Hi again! Continue with #85 where I'm running a atac-seq dataset with unequal R1 and R2 read lengths as I trimmed the low-quality area of R1 after sequencing. It throws out a similar error message that the overlapped regions between R1 and R2 have different lengths after correcting the splitting issue.
After checking the result, the sequence length (len(fwd_seq)) is unequal to the distance between the ref start and ref end, probably due to the unmatched indels between R1 and R2 caused by sequencing error.
Additional context
I modified the sumstatsBP_overlap.py by adding codes to compare the overlapped length between R1 and R2. I discard the entire read pairs if they don't match, as it's caused by sequencing error and only a few of the reads have this issue.
(the file is the same as the one in #85 but try to upload it again: sumstatsBP_overlap.zip )
Same as before, I'm sure this is not the best way but rather a stringent filter to those reads, and super happy if it's possible to discuss more about it.
Best,
Qirong
The text was updated successfully, but these errors were encountered: