-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add indel effects #243
base: master
Are you sure you want to change the base?
Add indel effects #243
Conversation
…an be defined for df loaders
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jburos not sure if you saw my review comment from 5 days ago (maybe it didn't go through?), but I'm trying to figure out which of these functions do/should return the same results. I think any that do, and we're only using to confirm that, can maybe be placed in a different file? cohorts.functions.dups
?
cohorts/functions.py
Outdated
isinstance(filterable_effect.effect, FrameShift) and | ||
isinstance(filterable_effect.effect, Exonic))) | ||
|
||
exonic_frameshift_snv_count = count_effects_function_builder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What type of SNV causes a frameshift?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a question Alex asked as well. I can say, we saw several in our recent analyses.
Theoretically, snvs can cause a new start-site (stop-gain), modify splice sites (not sure if this would be counted by varcode), cause a stop-loss (which apparently leads to no transcription), or can be stop-gain. In practice I've seen a variety of varcode classes, but am working on functions which would make these easier to inspect using cohorts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like in varcode only FrameshiftTruncation
effects inherit from Frameshift
. AlternateStartCodon, PrematureStop & Startloss are all considered versions where we can either not predict the coding sequence (ie startloss) or they are considered in-frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see - i didn't filter to keep snvs. so this part is an error. I will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still pending a fix, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm (selfishly!) stalling until I no longer have code referencing this function, so I can delete it entirely. Right now I've fixed it in my local branch, but don't really want to commit that since it doesn't make too much sense to have a function that returns 0 by definition.
cohorts/functions.py
Outdated
@@ -258,6 +303,36 @@ def expressed_neoantigen_count(row, cohort, filter_fn, normalized_per_mb, **kwar | |||
only_expressed=True, | |||
**kwargs) | |||
|
|||
@use_defaults |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just:
expressed_exonic_indel_count = expressed_of(exonic_indel_count)
- writing
expressed_of
to remove all the copied code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this too, but didn't really have time to rework these. For now just need to get the analysis done. Up to you if you want to leave this PR hanging until then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just file an issue vs. leave the PR hanging
Thanks @tavinathanson! Not sure your earlier review went through. Either way I can't find it. In general I think the exonic filter is the only one that might be redundant - if so, it is only redundant within certain variant types but not all. In my analysis since I'm comparing rates of types of mutations, it's important that I am 100% confident all counts are within exonic regions. Moving them to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after all known errors are fixed
@tavinathanson did a fair amount of refactoring to the functions, along the lines of the
This wouldn't yet work for neoantigens, but could if we refactored the Also, not sure yet about the naming convention proposed here, but as an approach I think this could give users (us) the flexibility sometimes required without having to create every combination of effects we might want in Finally, to make this work relatively seamlessly, I had to modify Note that I put in here an auto-naming feature, so that one could compute these on-the-fly. IE Curious to know your thoughts on the general approach before I go too far down this road. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this! A few comments
cohorts/functions.py
Outdated
@@ -84,7 +84,8 @@ def count_filter_fn(filterable_variant, **kwargs): | |||
return ((filterable_variant_function(filterable_variant) if filterable_variant_function is not None else True) and | |||
filter_fn(filterable_variant, **kwargs)) | |||
patient_id = row["patient_id"] | |||
return cohort.load_variants( | |||
return cohort.load_effects( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change vs. the existing count_effects_function_builder
which does this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is the plan to get rid of both count_effects_function_builder
and count_variants_function_builder
and use the new composition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you wrote:
Finally, to make this work relatively seamlessly, I had to modify count_variants_function_builder to instead load effects -- this way any future filtering function wouldn't have to know if it's working on a variant or an effect. Not sure what downstream implications this might have for validity/performance.
But I'm not clear what count_effects_function_builder
doesn't provide, and still am confused about why we're removing the variant
counter?
You probably already know this, but the same filter_fn
can be passed into load_effects
and load_variants
, since a FilterableEffect
is a subclass of a FilterableVariant
(and similar stuff for FilterableNeoantigen
etc.)
All that to say: I'm just a bit confused :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, i did suspect that although it helps to have your confirmation :). Except I did run into some errors when using an effect
filter on a filterable_variant. The first reason appeared to be that the parameter names are different (trivial to fix), whereas the second was an issue that the filterable_variant.effect
object could not be found so a number of the later filters failed.
At any rate I'm now:
- validating that these are, indeed equivalent
- testing that the filtered-cache-names are correct (ie equivalent when they should be, different for different filters).
- Unifying the above (ie removing count_variants_function_builder, using count_effects_function_builder everywhere instead)
I'm not convinced that using an only_snv
filter on the neoantigen_count will give us the response we want. But I need to do a little testing on this / review (maybe it's simpler than I think). only_expressed
on that neoantigen_count will definitely need to be reworked, since the expressed_neoantigen_count is computed almost completely differently from the non-expressed count (the rationale for this is something I'd like to understand better before I mess with it).
One question/concern I am running into is the reason why "nonsynonymous" effects are handled differently from other effect-type filters. it appears that, within varcode, we're filtering out silent_and_noncoding
effects before picking the highest priority effect per variant. I would think the silent/noncoding effects would be the lowest-priority for a variant, of all possible effects, so the order of this filtering wouldn't matter. In which case I'll make an only_nonsynonymous
filter analogous to the rest. But curious to know your thoughts on this.
Secondly, and this is a related question that I've been thinking about now that I'm digging into this a bit more, any reason why we might want to filter effects prior to selecting the highest priority per variant? Ie is there a case where a variant could have one effect type that is not a subclass of its highest-priority effect? (ie where including all effects per variant & then filtering to include only X type & to remove dups per variant would yield a different count from the current scenario where we keep only the highest priority & then filter to those where effect.is_snv
is True
?) [maybe this discussion should move into an issue, but it's relevant here .. so]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: getting rid of both function builders, I will likely only get rid of one -- the function builder is still useful for some edge cases not addressed by the composition -- which always composes functions using an and. E.g.: missense_snv_and_nonsynonymous_indel_count
where the logic contains an or (is_indel or (is_snv & is_substitution). It would be nice to generalize the filter-fn logic further to enable a phrase like that above, but i'll leave that for another day ;)
…elated code to utils; error if variant_file exists but cannot be loaded
…a-functions names
@tavinathanson just merged in your latest changes from master - would you mind reviewing again for a quick sanity check? I updated the description above to reflect all the various things that ended up in here. |
A few changes ended up in this PR:
only_nonsynonymous(snv_count)
returns a function equivalent tononsynonymous_snv_count
. Similarly,only_exonic(only_nonsynonymous(snv_count))
returns the count of exonic, nonsynonymous snvs.utils.py
cohorts.cohort.caching
logger, to aid cache-debugging (vs other debugging)Also deleted a random
quick-start.Rmd
file that accidentally found its way into our master branch.