Add indel effects #243

jburos · 2017-08-15T17:38:39Z

A few changes ended up in this PR:

- Refactored the summary functions so they are now composable. I.e., only_nonsynonymous(snv_count) returns a function equivalent to nonsynonymous_snv_count. Similarly, only_exonic(only_nonsynonymous(snv_count)) returns the count of exonic, nonsynonymous SNVs.
- Added a few new effect types to the set of functions, since we are using them in one of our analyses. Most of these are variants of indel / insertion / deletion effects, with/without frameshift filters, and again filtering to exonic variants.
- Now caching filtered effects, along with effects.
- Refactored some of the caching code:
  - moved all caching-related functions to utils.py
  - made a separate cohorts.cohort.caching logger, to aid cache debugging (vs. other debugging)

Also deleted a random quick-start.Rmd file that accidentally found its way into our master branch.
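The cache naming for filtered effects mentioned above can be illustrated with a small sketch. This is not the code in utils.py: `filter_cache_name` and `_hash_payload` are hypothetical helpers, and keying on the function name plus keyword arguments is an assumption made only for illustration.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=None)
def _hash_payload(name, kwargs_items):
    """Hash a (name, kwargs) pair; memoized so repeated lookups are cheap."""
    payload = repr((name, kwargs_items)).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:12]


def filter_cache_name(filter_fn, **kwargs):
    # Hypothetical helper: keys on the function's __name__ and its keyword
    # arguments (which must be hashable). Two different functions sharing a
    # name would collide, and every unnamed lambda is called "<lambda>".
    return _hash_payload(filter_fn.__name__, tuple(sorted(kwargs.items())))


def exonic(effect, **kwargs):
    return effect["is_exonic"]


def frameshift(effect, **kwargs):
    return effect["is_frameshift"]


same_a = filter_cache_name(exonic, min_depth=10)
same_b = filter_cache_name(exonic, min_depth=10)
other_fn = filter_cache_name(frameshift, min_depth=10)
other_kw = filter_cache_name(exonic, min_depth=20)
```

Under this scheme the same filter with the same arguments maps to the same cache name, while a different filter or different arguments maps to a different one. Anonymous lambdas would all share one name, so a name-based key only works if each filter function is named.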

coveralls · 2017-08-15T17:50:37Z

Coverage increased (+0.2%) to 52.86% when pulling 95da13e on add-indel-effects into c5844cd on master.

coveralls · 2017-08-18T22:01:58Z

Coverage decreased (-0.3%) to 52.409% when pulling fd3b283 on add-indel-effects into c5844cd on master.

tavinathanson

@jburos not sure if you saw my review comment from 5 days ago (maybe it didn't go through?), but I'm trying to figure out which of these functions do/should return the same results. I think any that do, and that we're only keeping to confirm that, could maybe be placed in a different file, e.g. cohorts.functions.dups?

tavinathanson · 2017-08-15T18:39:21Z

cohorts/functions.py

+        isinstance(filterable_effect.effect, FrameShift) and
+        isinstance(filterable_effect.effect, Exonic)))
+
+exonic_frameshift_snv_count = count_effects_function_builder(


What type of SNV causes a frameshift?

This was a question Alex asked as well. I can say, we saw several in our recent analyses.

Theoretically, SNVs can create a new start site, modify splice sites (not sure if this would be counted by varcode), cause a stop-loss (which apparently leads to no transcription), or cause a stop-gain. In practice I've seen a variety of varcode classes, but am working on functions which would make these easier to inspect using cohorts.

It looks like in varcode only FrameShiftTruncation effects inherit from FrameShift. AlternateStartCodon, PrematureStop & StartLoss are all considered cases where we either cannot predict the coding sequence (i.e. StartLoss) or they are considered in-frame.

Ah, I see - I didn't filter to keep SNVs, so this part is an error. I will fix.

This is still pending a fix, right?

I'm (selfishly!) stalling until I no longer have code referencing this function, so I can delete it entirely. Right now I've fixed it in my local branch, but don't really want to commit that since it doesn't make too much sense to have a function that returns 0 by definition.

tavinathanson · 2017-08-20T20:49:37Z

cohorts/functions.py

@@ -258,6 +303,36 @@ def expressed_neoantigen_count(row, cohort, filter_fn, normalized_per_mb, **kwar
                            only_expressed=True,
                            **kwargs)

+@use_defaults


Maybe just:

expressed_exonic_indel_count = expressed_of(exonic_indel_count)

writing expressed_of to remove all the copied code?

I thought about this too, but didn't really have time to rework these. For now just need to get the analysis done. Up to you if you want to leave this PR hanging until then.

We can just file an issue vs. leave the PR hanging
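A minimal sketch of what an `expressed_of` wrapper along the lines suggested above might look like. The function signatures, the dict-based effects, and the `min_expression` threshold are all assumptions for illustration; the real count functions in cohorts/functions.py take a row and a cohort.

```python
def exonic_indel_count(effects, filter_fn=None, **kwargs):
    """Toy count function: exonic indels that also pass an optional filter_fn."""
    return sum(
        1 for e in effects
        if e["is_indel"] and e["is_exonic"]
        and (filter_fn is None or filter_fn(e, **kwargs))
    )


def expressed_of(count_fn, min_expression=1.0):
    """Wrap count_fn so that only expressed effects are counted."""
    def expressed_count(effects, filter_fn=None, **kwargs):
        def combined(effect, **inner_kwargs):
            # AND the expression check with any caller-supplied filter.
            return (effect["expression"] >= min_expression
                    and (filter_fn is None or filter_fn(effect, **inner_kwargs)))
        return count_fn(effects, filter_fn=combined, **kwargs)
    expressed_count.__name__ = "expressed_" + count_fn.__name__
    return expressed_count


effects = [
    {"is_indel": True, "is_exonic": True, "expression": 5.0},
    {"is_indel": True, "is_exonic": True, "expression": 0.0},
    {"is_indel": True, "is_exonic": False, "expression": 9.0},
    {"is_indel": False, "is_exonic": True, "expression": 9.0},
]

expressed_exonic_indel_count = expressed_of(exonic_indel_count)
```

The wrapper only adds a condition to the filter function, so the copied body of each expressed_* function would collapse to a one-line definition.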

jburos · 2017-08-21T11:12:57Z

Thanks @tavinathanson! Not sure your earlier review went through. Either way I can't find it.

In general I think the exonic filter is the only one that might be redundant - if so, it is only redundant within certain variant types but not all. In my analysis since I'm comparing rates of types of mutations, it's important that I am 100% confident all counts are within exonic regions. Moving them to dups would be undesirable since a user would then have to know if it was a 'dup' or not (and which version is considered the dup & which the primary) in order to import it.

tavinathanson

LGTM after all known errors are fixed

jburos · 2017-08-23T21:30:05Z

@tavinathanson I did a fair amount of refactoring to the functions, along the lines of the expressed_of filter you suggested but taking it a bit further. The idea is to allow most of the variant/effects functions to be composable -- so a user could do the following:

missense_snv_count = only_missense(snv_count)
expressed_exonic_frameshift_indel_count = only_expressed(only_exonic(only_frameshift(indel_count)))

This wouldn't yet work for neoantigens, but could if we refactored the load_neoantigens & related code a bit (essentially to limit variants/effects first & then compute neoantigens from that set).

Also, not sure yet about the naming convention proposed here, but as an approach I think this could give users (us) the flexibility sometimes required without having to create every combination of effects we might want in cohorts/functions.py.

Finally, to make this work relatively seamlessly, I had to modify count_variants_function_builder to instead load effects -- this way any future filtering function wouldn't have to know if it's working on a variant or an effect. Not sure what downstream implications this might have for validity/performance.

Note that I put an auto-naming feature in here, so that one could compute these on the fly.

E.g. cohort.plot_benefit(on=[only_exonic(only_frameshift(indel_count))]) would name the result "exonic_frameshift_indel_count" by default. One could specify a custom "name" parameter, but may not want to.

Curious to know your thoughts on the general approach before I go too far down this road.
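The composition and auto-naming idea described above can be sketched in a self-contained way. This is not the cohorts implementation: `CountFunction`, `make_filter`, and the `Effect` tuple are hypothetical stand-ins, and the real functions operate on a row and a cohort rather than a plain list.

```python
from collections import namedtuple

# Hypothetical stand-in for a filterable effect; not the cohorts class.
Effect = namedtuple("Effect", ["is_snv", "is_indel", "is_exonic", "is_frameshift"])


class CountFunction:
    """A named counter that keeps effects passing every predicate."""

    def __init__(self, name, predicates):
        self.__name__ = name
        self.predicates = predicates

    def __call__(self, effects):
        return sum(1 for e in effects if all(p(e) for p in self.predicates))


def make_filter(prefix, predicate):
    """Build an only_<prefix> wrapper that adds a predicate and extends the name."""
    def only(count_fn):
        return CountFunction(prefix + "_" + count_fn.__name__,
                             count_fn.predicates + [predicate])
    only.__name__ = "only_" + prefix
    return only


snv_count = CountFunction("snv_count", [lambda e: e.is_snv])
indel_count = CountFunction("indel_count", [lambda e: e.is_indel])
only_exonic = make_filter("exonic", lambda e: e.is_exonic)
only_frameshift = make_filter("frameshift", lambda e: e.is_frameshift)

effects = [
    Effect(is_snv=True, is_indel=False, is_exonic=True, is_frameshift=False),
    Effect(is_snv=False, is_indel=True, is_exonic=True, is_frameshift=True),
    Effect(is_snv=False, is_indel=True, is_exonic=False, is_frameshift=True),
    Effect(is_snv=False, is_indel=True, is_exonic=True, is_frameshift=False),
]

composed = only_exonic(only_frameshift(indel_count))
print(composed.__name__, composed(effects))
```

Each wrapper prepends its prefix to the wrapped function's name, so the default name follows the order of composition. All predicates are combined with a logical and.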

tavinathanson

Love this! A few comments

tavinathanson · 2017-08-23T22:15:13Z

cohorts/functions.py

@@ -84,7 +84,8 @@ def count_filter_fn(filterable_variant, **kwargs):
            return ((filterable_variant_function(filterable_variant) if filterable_variant_function is not None else True) and
                    filter_fn(filterable_variant, **kwargs))
        patient_id = row["patient_id"]
-        return cohort.load_variants(
+        return cohort.load_effects(


Why this change vs. the existing count_effects_function_builder which does this?

Also, is the plan to get rid of both count_effects_function_builder and count_variants_function_builder and use the new composition?

I see that you wrote:

Finally, to make this work relatively seamlessly, I had to modify count_variants_function_builder to instead load effects -- this way any future filtering function wouldn't have to know if it's working on a variant or an effect. Not sure what downstream implications this might have for validity/performance.

But I'm not clear what count_effects_function_builder doesn't provide, and still am confused about why we're removing the variant counter?

You probably already know this, but the same filter_fn can be passed into load_effects and load_variants, since a FilterableEffect is a subclass of a FilterableVariant (and similar stuff for FilterableNeoantigen etc.)

All that to say: I'm just a bit confused :)

Yep, I did suspect that, although it helps to have your confirmation :). However, I did run into some errors when using an effect filter on a filterable_variant. The first appeared to be that the parameter names are different (trivial to fix), whereas the second was that the filterable_variant.effect object could not be found, so a number of the later filters failed.
At any rate I'm now:

- validating that these are indeed equivalent
- testing that the filtered-cache-names are correct (i.e. equivalent when they should be, different for different filters)
- unifying the above (i.e. removing count_variants_function_builder, using count_effects_function_builder everywhere instead)
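The subclass relationship and the failure mode discussed above can be sketched as follows. The two classes here are simplified stand-ins that only borrow the names FilterableVariant and FilterableEffect; their constructors and the dict/string payloads are assumptions, not the cohorts definitions.

```python
class FilterableVariant:
    """Simplified stand-in: wraps a variant for use in filter functions."""

    def __init__(self, variant, patient_id):
        self.variant = variant
        self.patient_id = patient_id


class FilterableEffect(FilterableVariant):
    """Simplified stand-in: a filterable variant that also carries an effect."""

    def __init__(self, effect, variant, patient_id):
        super().__init__(variant, patient_id)
        self.effect = effect


def is_snv(filterable_variant, **kwargs):
    # Touches only .variant, so it works on either class.
    v = filterable_variant.variant
    return len(v["ref"]) == 1 and len(v["alt"]) == 1


def is_frameshift(filterable_effect, **kwargs):
    # Touches .effect, so it needs a FilterableEffect.
    return filterable_effect.effect == "FrameShift"


variant = {"ref": "A", "alt": "T"}
fv = FilterableVariant(variant, "p1")
fe = FilterableEffect("Substitution", variant, "p1")

try:
    is_frameshift(fv)
    effect_filter_on_variant = "ok"
except AttributeError:
    effect_filter_on_variant = "AttributeError"
```

A variant-level filter can be applied to effects, but not the reverse, which is one way to read the argument for always loading effects when filters are meant to be chained.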

I'm not convinced that using an only_snv filter on the neoantigen_count will give us the response we want. But I need to do a little testing on this / review (maybe it's simpler than I think). only_expressed on that neoantigen_count will definitely need to be reworked, since the expressed_neoantigen_count is computed almost completely differently from the non-expressed count (the rationale for this is something I'd like to understand better before I mess with it).

One question/concern I am running into is why "nonsynonymous" effects are handled differently from other effect-type filters. It appears that, within varcode, we're filtering out silent_and_noncoding effects before picking the highest-priority effect per variant. I would think the silent/noncoding effects would be the lowest priority for a variant, of all possible effects, so the order of this filtering wouldn't matter. In which case I'll make an only_nonsynonymous filter analogous to the rest. But curious to know your thoughts on this.

Secondly, and this is a related question that I've been thinking about now that I'm digging into this a bit more: is there any reason why we might want to filter effects prior to selecting the highest priority per variant? I.e., is there a case where a variant could have one effect type that is not a subclass of its highest-priority effect? (I.e., where including all effects per variant, then filtering to include only X type and removing dups per variant, would yield a different count from the current scenario where we keep only the highest priority and then filter to those where effect.is_snv is True?) [Maybe this discussion should move into an issue, but it's relevant here.]
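The two ordering questions above can be made concrete with a toy example. The effect names are borrowed from the discussion, but the `PRIORITY` ranking and the per-variant effect lists are made up for illustration and are not varcode's actual ordering.

```python
# Hypothetical priority ranking (higher wins); not varcode's real ordering.
PRIORITY = {"Silent": 1, "Missense": 2, "FrameShift": 3}

effects_by_variant = {
    "v1": ["Silent", "Missense"],
    "v2": ["Missense", "FrameShift"],
    "v3": ["Silent"],
}


def top(effects):
    """Highest-priority effect among a variant's effects."""
    return max(effects, key=PRIORITY.get)


def pick_then_filter(ebv, wanted):
    # Keep the top effect per variant, then count variants whose top is `wanted`.
    return sum(1 for effs in ebv.values() if top(effs) == wanted)


def filter_then_dedupe(ebv, wanted):
    # Count variants having any effect of type `wanted` (one count per variant).
    return sum(1 for effs in ebv.values() if wanted in effs)


def drop_silent_then_pick(ebv):
    kept = ([e for e in effs if e != "Silent"] for effs in ebv.values())
    return sorted(top(effs) for effs in kept if effs)


def pick_then_drop_silent(ebv):
    return sorted(t for t in map(top, ebv.values()) if t != "Silent")
```

In this toy data, dropping silent effects before or after picking the top effect gives the same result, because Silent is assumed to rank lowest. Counting Missense gives 1 with pick-then-filter but 2 with filter-then-dedupe, since v2 has a Missense effect that is not its top effect.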

Re: getting rid of both function builders, I will likely only get rid of one -- the function builder is still useful for some edge cases not addressed by the composition, which always combines functions using an and. E.g.: missense_snv_and_nonsynonymous_indel_count, where the logic contains an or: (is_indel or (is_snv & is_substitution)). It would be nice to generalize the filter-fn logic further to enable a phrase like that above, but I'll leave that for another day ;)
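One possible way to generalize the filter logic to cover an or is a pair of small predicate combinators. This is only a sketch of the idea: `all_of`, `any_of`, and the dict-based effects are hypothetical and not part of cohorts.

```python
def all_of(*predicates):
    """Predicate that is true when every given predicate is true."""
    return lambda effect: all(p(effect) for p in predicates)


def any_of(*predicates):
    """Predicate that is true when at least one given predicate is true."""
    return lambda effect: any(p(effect) for p in predicates)


def is_indel(effect):
    return effect["kind"] == "indel"


def is_snv(effect):
    return effect["kind"] == "snv"


def is_substitution(effect):
    return effect["effect"] == "Substitution"


# (is_indel or (is_snv and is_substitution)), as in the example above.
wanted = any_of(is_indel, all_of(is_snv, is_substitution))

effects = [
    {"kind": "indel", "effect": "FrameShift"},
    {"kind": "snv", "effect": "Substitution"},
    {"kind": "snv", "effect": "Silent"},
]

count = sum(1 for e in effects if wanted(e))
```

Because both combinators return ordinary predicates, they nest freely, so the chained only_* filters would become a special case of `all_of`.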

coveralls · 2017-08-25T16:42:16Z

Coverage increased (+1.5%) to 54.238% when pulling d242c44 on add-indel-effects into c5844cd on master.

jburos · 2017-08-30T10:43:44Z

@tavinathanson just merged in your latest changes from master - would you mind reviewing again for a quick sanity check? I updated the description above to reflect all the various things that ended up in here.

coveralls · 2017-08-30T10:46:35Z

Coverage increased (+1.7%) to 54.42% when pulling 1c0429f on add-indel-effects into f9f4af2 on master.

jburos and others added 20 commits July 18, 2017 18:37

allow strata parameter to plot two survival curves

11057a1

handle case of 1 group when plotting

1f48b6b

label results as well as graphs

815fbe7

spacing

9e37d59

move strata function within plot_kmf

5de468a

remove unused packages

cea5492

refactor kmf code

8277530

adding parameter for join_on_left, so that both left & right fields c…

79a0a0d

…an be defined for df loaders

remove leftover stuff from plot-surv-by-strata branch

4b41ebf

add deprecation warning

780e03c

remove error when both join_on & join_on_right given

cfc2387

Merge branch 'join-on-left' into develop

60b1970

Merge branch 'plot-surv-by-strata' into develop

4b1fd28

add get-blob method to gcio

5b15dcb

Merge branch 'add-get-blob' into develop-rcc

00ab4b9

add functions for frameshift-indels of various types

e2b1bea

add exonic_frameshift_snv_count

142b535

add exonic_frameshift_variant_count

cc59d02

Merge branch 'master' into add-indel-effects

76f06c4

clean up minor edits; remove blob-related changes

95da13e

jburos requested a review from tavinathanson August 15, 2017 17:39

add expressed exonic indel-related functions

fd3b283

tavinathanson reviewed Aug 20, 2017

tavinathanson approved these changes Aug 22, 2017

jburos added 3 commits August 23, 2017 16:15

remove extraneous frameshift functions

6f39c6b

refactor function code

75da993

functions now always return effects, so that all filters can be chained

a9fdcd2

tavinathanson reviewed Aug 23, 2017

jburos added 5 commits August 24, 2017 16:28

make separate cohorts.cohort.caching logger; move _hash_filter_fn & r…

1a20d64

…elated code to utils; error if variant_file exists but cannot be loaded

make sure kwargs get passed through to filter_fn; give bunch of lambd…

31fcbd6

…a-functions names

add logger

adaaf5e

memoize hash function, to save on repeated calls

13080bb

fix frameshift

d242c44

Merge branch 'master' into add-indel-effects

1c0429f
