fix: infer library type relationship #157

balajtimate · 2024-01-09T10:02:29Z

Description

Add a check to assign first_mate and second_mate when they can be correctly inferred from the seq_ids (when the seq_ids are not empty)
Add a check to align only when the library source is either inferred or given by the --tax-id argument (for both library type and read orientation inference)
Add --outSAMorder PairedKeepInputOrder to mapping.py to keep the input order of reads when running STAR with multiple threads
Update tests
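
As a rough illustration of the first item, the sketch below assigns first_mate and second_mate from a pair of seq_ids. The regular expression is only an approximation of the Casava 1.8+ identifier layout, and the function name `assign_mates` is hypothetical; neither is taken from the HTSinfer code.

```python
import re
from typing import Optional, Tuple

# Approximate Casava >= 1.8 identifier (an assumption, not HTSinfer's pattern):
# instrument:run:flowcell:lane:tile:x:y<space>read:filtered:control:index
CASAVA_ID = re.compile(
    r"(?P<prefix>[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+) "
    r"(?P<mate>[12]):[YN]:\d+:\S*"
)


def assign_mates(seq_id_1: str, seq_id_2: str) -> Optional[Tuple[str, str]]:
    """Return mate labels for two seq_ids, or None if they are inconclusive."""
    # Empty identifiers cannot be used for inference.
    if not seq_id_1 or not seq_id_2:
        return None
    match_1 = CASAVA_ID.fullmatch(seq_id_1)
    match_2 = CASAVA_ID.fullmatch(seq_id_2)
    if match_1 is None or match_2 is None:
        return None
    # Mates must share everything before the read number.
    if match_1["prefix"] != match_2["prefix"]:
        return None
    mates = (match_1["mate"], match_2["mate"])
    if mates == ("1", "2"):
        return ("first_mate", "second_mate")
    if mates == ("2", "1"):
        return ("second_mate", "first_mate")
    return None
```

Identifiers that do not fit the pattern, as is typical for SRA samples, return None here, which is the case where the alignment-based fallback would apply.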

In the case of paired-end libraries, the logic should be:

For the library type inference, check the seq_ids of the first and second file of the pair to decide whether they fit the Casava format, in which case the library type and relationship can be determined
If not, as in most SRA samples, align them separately to the reference transcripts, but only if library_source is known
Compare the alignments and decide the relationship
For the orientation inference, based on the library type relationship determined:
1. If not_mates or not_available, infer the orientation from the separately aligned results
2. If split_mates, align the samples in paired-end mode, but only if library_source is known
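
The logic above can be sketched as two small decision functions. This is a minimal sketch, not the actual implementation: the function names and the returned step labels are hypothetical, and skipping the alignment when library_source is unknown is an assumption drawn from the "only if library_source is known" conditions.

```python
from enum import Enum


class Relationship(Enum):
    """Relationship values named in the description above."""
    SPLIT_MATES = "split_mates"
    NOT_MATES = "not_mates"
    NOT_AVAILABLE = "not_available"


def library_type_step(seq_ids_conclusive: bool, source_known: bool) -> str:
    """Pick the next step for the library type relationship inference."""
    if seq_ids_conclusive:
        # seq_ids fit the Casava format: no alignment needed.
        return "use_seq_ids"
    if source_known:
        # Typical SRA case: align each file separately, then compare.
        return "align_separately_and_compare"
    # Assumption: without a known library source, nothing is aligned.
    return "skip_alignment"


def orientation_step(relationship: Relationship, source_known: bool) -> str:
    """Pick the next step for the read orientation inference."""
    if relationship in (Relationship.NOT_MATES, Relationship.NOT_AVAILABLE):
        return "use_separate_alignments"
    if source_known:
        return "align_paired_end"
    # Assumption: split_mates with unknown library source is not aligned.
    return "skip_alignment"
```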

Fixes #153

Type of change

New feature (non-breaking change which adds functionality)

Checklist

Please carefully read these items and tick them off if the statements are true
or do not apply.

I have performed a self-review of my own code
My code follows the existing coding style, lints and generates no new
warnings
I have added type annotations to all function/method signatures, and I
have added type annotations for any local variables that are non-trivial,
potentially ambiguous or might otherwise benefit from explicit typing.
I have commented my code in hard-to-understand areas
I have added "Google-style docstrings" to all new modules, classes,
methods/functions or updated previously existing ones
I have added tests that prove my fix is effective or that my feature
works
New and existing unit tests pass locally with my changes and I have not
reduced the code coverage relative to the previous state
I have updated any sections of the app's documentation that are affected
by the proposed changes

If for some reason you are unable to tick off all boxes, please leave a
comment explaining the issue you are facing so that we can work on it
together.

codecov · 2024-01-09T10:04:44Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (930b741) 100.00% compared to head (8fe0cbe) 100.00%.

Additional details and impacted files

@@            Coverage Diff            @@
##               dev      #157   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           13        13           
  Lines         1109      1131   +22     
=========================================
+ Hits          1109      1131   +22

uniqueg

This looks good to me (apart from a few minor comments), but to be honest, the code in the library type model has become a bit hard to follow since the introduction of the mappings. Given the extensive tests, I am pretty confident, but I propose that you also play this through with some small real-world examples to check that it really behaves as desired from end to end.

htsinfer/get_library_type.py

htsinfer/get_read_orientation.py

balajtimate · 2024-01-17T11:26:14Z

To summarize: I refactored _align_mates in get_library_type.py to correctly determine the number of aligned reads as well as the number of concordant reads.

When iterating through the reads in _compare_alignments, the problem was that the order of the reference names was scrambled. It turns out this was caused by STAR: it outputs the aligned reads in the same order as the input reads, except when run on multiple threads, which I have been doing since day 1. When I reran the samples with the default single thread, the alignment output matched between the two samples, and the concordant reads were counted correctly. One solution was to sort the reads by reference name using pysam, which writes an extra sorted BAM file, but updating the STAR command with --outSAMorder PairedKeepInputOrder seems to solve this, so no extra sorting is needed and the output is in the same order as the input. (It is also possible to sort the aligned reads with STAR, but only by coordinate, and I think sorting by reference name would be better. Btw, is there any advantage in this case to using SAMs instead of BAMs? I think mapping.py could be updated to output BAMs, maybe in a separate PR.)
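
For reference, a minimal sketch of how such a STAR call could be assembled. The helper name `build_star_command` and its arguments are hypothetical, and the actual command in mapping.py may pass further options; only the STAR flags themselves are real.

```python
from typing import List, Sequence


def build_star_command(
    index_dir: str,
    read_files: Sequence[str],
    out_prefix: str,
    threads: int = 1,
) -> List[str]:
    """Assemble a STAR call that keeps the output in input read order."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", index_dir,
        "--readFilesIn", *read_files,
        "--outFileNamePrefix", out_prefix,
        # Without this option, multi-threaded runs may reorder the output.
        "--outSAMorder", "PairedKeepInputOrder",
    ]
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell quoting issues with file paths.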

For checking the ratio between the concordant and the aligned reads (for _update_relationship), I chose the lower of the two aligned read counts. I'll also run it on more samples and check the ratio, but I think we could lower the threshold a bit, e.g. to 90% (instead of the current default of 95%), for the ratio of concordant to aligned reads at which a library is considered paired-end.
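
A small sketch of the ratio check described here, assuming both alignments are in input order so that reads can be compared position by position. The function names are hypothetical, unaligned reads are represented as None, and whether the threshold comparison is inclusive is an assumption.

```python
from typing import Optional, Sequence


def count_concordant(
    refs_1: Sequence[Optional[str]],
    refs_2: Sequence[Optional[str]],
) -> int:
    """Count positions where both reads align to the same reference."""
    return sum(
        1 for ref_1, ref_2 in zip(refs_1, refs_2)
        if ref_1 is not None and ref_1 == ref_2
    )


def is_paired(
    concordant: int,
    aligned_1: int,
    aligned_2: int,
    min_ratio: float = 0.95,
) -> bool:
    """Compare the concordant count against the lower aligned read count."""
    aligned = min(aligned_1, aligned_2)
    if aligned == 0:
        # Nothing aligned, so no relationship can be supported.
        return False
    return concordant / aligned >= min_ratio
```

With 94 concordant reads out of 100 aligned, the library would not pass the 95% default but would pass at the proposed 90%.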

uniqueg

Approving, but please take care of the two comments first 🙏

htsinfer/mapping.py

htsinfer/get_library_type.py

uniqueg · 2024-01-18T01:17:31Z

Oh, and please update the PR description to reflect all of the latest changes. As far as I understood, you also fixed some things that never really worked.

balajtimate added 11 commits December 13, 2023 15:22

refactor: compare alignments between mapped reads only

04fcab2

fix: update get lib type test

455e983

refactor: helper function for getlibtype

806420d

feat: only map when lib source is inferred

782971a

update tests

28284de

fix orientation tests

dae137d

fix orientation tests

bf7bea1

refactor scripts

5f8f62d

update get lib type and get read orient

1e992e1

refactor get lib type

3fbaceb

refactor: update get lib type and tests

70f4252

balajtimate requested a review from uniqueg January 9, 2024 10:04

uniqueg requested changes Jan 11, 2024

balajtimate added 2 commits January 15, 2024 14:00

refactor: update lib type and orientation

b5b80ca

refactor: update concordant read counting, mapping

31af8ec

update debug messages

b908930

uniqueg previously approved these changes Jan 18, 2024

htsinfer/mapping.py (resolved)

htsinfer/get_library_type.py (resolved)

uniqueg changed the title ~~feat: infer library type relationship from aligned reads only~~ fix: infer library type relationship Jan 18, 2024

update comments in mapping

8fe0cbe

balajtimate dismissed uniqueg’s stale review via 8fe0cbe January 18, 2024 10:10

balajtimate mentioned this pull request Jan 19, 2024

feat: improve lib type relationship inference for SRA samples #160

Closed

balajtimate requested a review from uniqueg January 19, 2024 10:59

uniqueg approved these changes Jan 19, 2024

balajtimate merged commit 7b65c43 into dev Jan 21, 2024
19 checks passed

balajtimate deleted the fix_sra_mates branch January 21, 2024 20:10
