-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cudamapper] Filtering for self mappings, identical overlaps, and highly-similar overlaps #563
base: dev
Are you sure you want to change the base?
Conversation
Adds a post-processing filter for full-length self mappings. This is needed as self mappings cause miniasm to produce empty assemblies.
…erlaps in postprocessing. Adds a CLI option and functions for filtering overlaps that either have identical query starts/ends or which have a percent reciprocal identity greater than a user-defined threshold.
{"preserve-self-mappings", no_argument, 0, 'S'}, | ||
{"max-reciprocal", required_argument, 0, 'Z'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These parameters should also be added to ApplicationParameters::help()
if (application_parameters.all_to_all && application_parameters.filter_self_mappings) | ||
{ | ||
GW_NVTX_RANGE(profiler, "main::postprocess_and_write_thread::remove_self_mappings"); | ||
::claraparabricks::genomeworks::cudamapper::details::overlapper::filter_self_mappings(overlaps, *application_parameters.query_parser, *application_parameters.target_parser, 0.9); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to qualify the function this day? Also, is function really a detail if it is used in a completely different (top level) file
@@ -98,7 +110,8 @@ class Overlapper | |||
/// \brief Identified overlaps which can be combined into a larger overlap and add them to the input vector | |||
/// \param overlaps reference to vector of Overlaps. New overlaps (result of fusing) are added to this vector | |||
/// \param drop_fused_overlaps If true, remove overlaps that are fused into larger overlaps in output. | |||
static void post_process_overlaps(std::vector<Overlap>& overlaps, bool drop_fused_overlaps = false); | |||
/// \param max_reciprocal Maximum reciprocal identity between two overlaps before one is filtered. Default (< 0.0) means no filtering; 0.0 removes only identical adjacent overlaps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find 0.0
a bit confusing. If I understand it correctly passing 0.75 would mean that any pair that has more than 75% overlap will get filtered out (i.e. one of those two overlaps). Wouldn't it than make sense to say that specifying 1.0 means only identical (= 100% overlapped) overlaps are filtered out and >1.0 that nothing gets filtered out as two overlaps cannot overlap more than 100%?
@@ -98,7 +110,8 @@ class Overlapper | |||
/// \brief Identified overlaps which can be combined into a larger overlap and add them to the input vector | |||
/// \param overlaps reference to vector of Overlaps. New overlaps (result of fusing) are added to this vector | |||
/// \param drop_fused_overlaps If true, remove overlaps that are fused into larger overlaps in output. | |||
static void post_process_overlaps(std::vector<Overlap>& overlaps, bool drop_fused_overlaps = false); | |||
/// \param max_reciprocal Maximum reciprocal identity between two overlaps before one is filtered. Default (< 0.0) means no filtering; 0.0 removes only identical adjacent overlaps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how complicated would it be to specify which overlap gets filtered out (e.g. the one with smaller starting position, longer/shorter one or something)
@@ -53,8 +53,10 @@ class ApplicationParameters | |||
int32_t min_overlap_len = 250; // l, recommended range: 100 - 1000 | |||
int32_t min_bases_per_residue = 1000; // b | |||
float min_overlap_fraction = 0.8; // z | |||
float max_reciprocal = -1.0; // Z, < 0 : no filtering. 0.0: only identical filtering. > 1.0: filter by percent reciprocity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For -F
we use 0.0 - 1.0
range, not 1.0 - 100.0
(which should actually probably be 0.0 - 100.0)
{ | ||
return 0.0; | ||
} | ||
int32_t query_overlap = std::min(a.query_end_position_in_read_, b.query_end_position_in_read_) - std::max(a.query_start_position_in_read_, b.query_start_position_in_read_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and similar should be position_in_read_t
or even better number_of_basepairs_t
This PR brings in filtering strategies that mimic those of minimap2.
-S
).