Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add indel effects #243

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open

Add indel effects #243

wants to merge 30 commits into from

Conversation

jburos
Copy link
Member

@jburos jburos commented Aug 15, 2017

A few changes ended up in this PR:

  • Refactored summary functions, so they are now composable. IE only_nonsynonymous(snv_count) returns a function equivalent to nonsynonymous_snv_count. Similarly, only_exonic(only_nonsynonymous(snv_count)) returns the count of exonic, nonsynonymous snvs.
  • Added a few new effect-types to the set of functions, since we are using them in one of our analyses. Most of these are variants of indel / insertion / deletion effects, with/without frameshift filters, and again filtering to exonic variants.
  • Now caching filtered-effects, along with effects
  • Refactored some of the caching code
    • moved all caching-related functions to utils.py
    • made a separate cohorts.cohort.caching logger, to aid cache-debugging (vs other debugging)

Also deleted a random quick-start.Rmd file that accidentally found its way into our master branch.

@coveralls
Copy link

coveralls commented Aug 15, 2017

Coverage Status

Coverage increased (+0.2%) to 52.86% when pulling 95da13e on add-indel-effects into c5844cd on master.

@coveralls
Copy link

coveralls commented Aug 18, 2017

Coverage Status

Coverage decreased (-0.3%) to 52.409% when pulling fd3b283 on add-indel-effects into c5844cd on master.

Copy link
Member

@tavinathanson tavinathanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jburos not sure if you saw my review comment from 5 days ago (maybe it didn't go through?), but I'm trying to figure out which of these functions do/should return the same results. I think any that do, and we're only using to confirm that, can maybe be placed in a different file? cohorts.functions.dups?

isinstance(filterable_effect.effect, FrameShift) and
isinstance(filterable_effect.effect, Exonic)))

exonic_frameshift_snv_count = count_effects_function_builder(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What type of SNV causes a frameshift?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a question Alex asked as well. I can say, we saw several in our recent analyses.

Theoretically, snvs can cause a new start-site (stop-gain), modify splice sites (not sure if this would be counted by varcode), cause a stop-loss (which apparently leads to no transcription), or can be stop-gain. In practice I've seen a variety of varcode classes, but am working on functions which would make these easier to inspect using cohorts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like in varcode only FrameshiftTruncation effects inherit from Frameshift. AlternateStartCodon, PrematureStop & Startloss are all considered versions where we can either not predict the coding sequence (ie startloss) or they are considered in-frame.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see - i didn't filter to keep snvs. so this part is an error. I will fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still pending a fix, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm (selfishly!) stalling until I no longer have code referencing this function, so I can delete it entirely. Right now I've fixed it in my local branch, but don't really want to commit that since it doesn't make too much sense to have a function that returns 0 by definition.

@@ -258,6 +303,36 @@ def expressed_neoantigen_count(row, cohort, filter_fn, normalized_per_mb, **kwar
only_expressed=True,
**kwargs)

@use_defaults
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just:

expressed_exonic_indel_count = expressed_of(exonic_indel_count)
  • writing expressed_of to remove all the copied code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this too, but didn't really have time to rework these. For now just need to get the analysis done. Up to you if you want to leave this PR hanging until then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just file an issue vs. leave the PR hanging

@jburos
Copy link
Member Author

jburos commented Aug 21, 2017

Thanks @tavinathanson! Not sure your earlier review went through. Either way I can't find it.

In general I think the exonic filter is the only one that might be redundant - if so, it is only redundant within certain variant types but not all. In my analysis since I'm comparing rates of types of mutations, it's important that I am 100% confident all counts are within exonic regions. Moving them to dups would be undesirable since a user would then have to know if it was a 'dup' or not (and which version is considered the dup & which the primary) in order to import it.

Copy link
Member

@tavinathanson tavinathanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after all known errors are fixed

@jburos
Copy link
Member Author

jburos commented Aug 23, 2017

@tavinathanson did a fair amount of refactoring to the functions, along the lines of the expressed_of filter you suggested but taking it a bit further. The idea is to allow most of the variant/effects functions to be composable -- so a user could do the following:

missense_snv_count = only_missense(snv_count)
expressed_exonic_frameshift_indel_count = only_expressed(only_exonic(only_frameshift(indel_count)))

This wouldn't yet work for neoantigens, but could if we refactored the load_neoantigens & related code a bit (essentially to limit variants/effects first & then compute neoantigens from that set).

Also, not sure yet about the naming convention proposed here, but as an approach I think this could give users (us) the flexibility sometimes required without having to create every combination of effects we might want in cohorts.functions.py.

Finally, to make this work relatively seamlessly, I had to modify count_variants_function_builder to instead load effects -- this way any future filtering function wouldn't have to know if it's working on a variant or an effect. Not sure what downstream implications this might have for validity/performance.

Note that I put in here an auto-naming feature, so that one could compute these on-the-fly.

IE cohort.plot_benefit(on=[only_exonic(only_frameshift(indel_count))]) and the result would be named as "exonic_frameshift_indel_count" by default. One could specify a custom "name" parameter, but may not want to.

Curious to know your thoughts on the general approach before I go too far down this road.

Copy link
Member

@tavinathanson tavinathanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this! A few comments

@@ -84,7 +84,8 @@ def count_filter_fn(filterable_variant, **kwargs):
return ((filterable_variant_function(filterable_variant) if filterable_variant_function is not None else True) and
filter_fn(filterable_variant, **kwargs))
patient_id = row["patient_id"]
return cohort.load_variants(
return cohort.load_effects(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change vs. the existing count_effects_function_builder which does this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is the plan to get rid of both count_effects_function_builder and count_variants_function_builder and use the new composition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you wrote:

Finally, to make this work relatively seamlessly, I had to modify count_variants_function_builder to instead load effects -- this way any future filtering function wouldn't have to know if it's working on a variant or an effect. Not sure what downstream implications this might have for validity/performance.

But I'm not clear what count_effects_function_builder doesn't provide, and still am confused about why we're removing the variant counter?

You probably already know this, but the same filter_fn can be passed into load_effects and load_variants, since a FilterableEffect is a subclass of a FilterableVariant (and similar stuff for FilterableNeoantigen etc.)

All that to say: I'm just a bit confused :)

Copy link
Member Author

@jburos jburos Aug 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, i did suspect that although it helps to have your confirmation :). Except I did run into some errors when using an effect filter on a filterable_variant. The first reason appeared to be that the parameter names are different (trivial to fix), whereas the second was an issue that the filterable_variant.effect object could not be found so a number of the later filters failed.
At any rate I'm now:

  1. validating that these are, indeed equivalent
  2. testing that the filtered-cache-names are correct (ie equivalent when they should be, different for different filters).
  3. Unifying the above (ie removing count_variants_function_builder, using count_effects_function_builder everywhere instead)

I'm not convinced that using an only_snv filter on the neoantigen_count will give us the response we want. But I need to do a little testing on this / review (maybe it's simpler than I think). only_expressed on that neoantigen_count will definitely need to be reworked, since the expressed_neoantigen_count is computed almost completely differently from the non-expressed count (the rationale for this is something I'd like to understand better before I mess with it).

One question/concern I am running into is the reason why "nonsynonymous" effects are handled differently from other effect-type filters. it appears that, within varcode, we're filtering out silent_and_noncoding effects before picking the highest priority effect per variant. I would think the silent/noncoding effects would be the lowest-priority for a variant, of all possible effects, so the order of this filtering wouldn't matter. In which case I'll make an only_nonsynonymous filter analogous to the rest. But curious to know your thoughts on this.

Secondly, and this is a related question that I've been thinking about now that I'm digging into this a bit more, any reason why we might want to filter effects prior to selecting the highest priority per variant? Ie is there a case where a variant could have one effect type that is not a subclass of its highest-priority effect? (ie where including all effects per variant & then filtering to include only X type & to remove dups per variant would yield a different count from the current scenario where we keep only the highest priority & then filter to those where effect.is_snv is True?) [maybe this discussion should move into an issue, but it's relevant here .. so]

Copy link
Member Author

@jburos jburos Aug 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: getting rid of both function builders, I will likely only get rid of one -- the function builder is still useful for some edge cases not addressed by the composition -- which always composes functions using an and. E.g.: missense_snv_and_nonsynonymous_indel_count where the logic contains an or (is_indel or (is_snv & is_substitution). It would be nice to generalize the filter-fn logic further to enable a phrase like that above, but i'll leave that for another day ;)

@coveralls
Copy link

coveralls commented Aug 25, 2017

Coverage Status

Coverage increased (+1.5%) to 54.238% when pulling d242c44 on add-indel-effects into c5844cd on master.

@jburos
Copy link
Member Author

jburos commented Aug 30, 2017

@tavinathanson just merged in your latest changes from master - would you mind reviewing again for a quick sanity check? I updated the description above to reflect all the various things that ended up in here.

@coveralls
Copy link

coveralls commented Aug 30, 2017

Coverage Status

Coverage increased (+1.7%) to 54.42% when pulling 1c0429f on add-indel-effects into f9f4af2 on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants