Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report on but do not error when patient has no RNA data #245

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

jburos
Copy link
Member

@jburos jburos commented Aug 18, 2017

Here I'm trying to achieve a goal of allowing an analysis on a cohort where some but not all records have RNA sequencing data. Currently, if one is using a summary function like cohorts.functions.expressed_missense_snv_count, you will see an exception if some but not all records have RNA expression data.

Instead, we want to print a warning for these patients and populate the resulting df with np.NaN for any patient that does not contain RNA sequencing data where the filter function depends on it. This behavior would then be consistent with (for example) that of other summary functions like missense_snv_count for patients that do not have VCF files.

Aside: The correct way to handle this in most situations is to require that the cohort be filtered to those Patients having RNA data prior to analysis. This would make the exception desirable. However there are scenarios (like when using as_dataframe) where one might desire a different behavior. This however, runs the risk of a user ignoring the INFO messages & proceeding with analysis in error. This might be something we want to allow the user to configure, ie whether to allow NaN values in the cohort or not.

At any rate, for now, this PR does a few things:

  1. Create new error classes for missing BAM files of various types
  2. Catch those errors if they are used to filter variants or effects (not sure if this should apply to polyphen?)
  3. Report those errors as info using the logger.

Also, for the sake of consistency, this PR:

  1. Modifies earlier print statements when VCFs are not found for a patient to also use these error classes & report as INFO using the logger (previously these were print statements).

Note that I refactored some parts of the Cohort._load_single_patient_variants and Cohort._load_single_patient_effects in order to throw & report on these scenarios.

@coveralls
Copy link

coveralls commented Aug 18, 2017

Coverage Status

Coverage decreased (-0.2%) to 52.347% when pulling 7114c6e on fix-expression-errors into ea31881 on master.

@jburos
Copy link
Member Author

jburos commented Aug 18, 2017

Just noticed that these are all reported using logger.info() ; I wonder if they should instead be warnings? Edited the text above to reflect this.

@jburos jburos changed the title Warn but do not error when patient has no RNA data Report on but do not error when patient has no RNA data Aug 18, 2017
@coveralls
Copy link

coveralls commented Aug 18, 2017

Coverage Status

Coverage decreased (-0.2%) to 52.347% when pulling cd9e2ff on fix-expression-errors into ea31881 on master.

@tavinathanson
Copy link
Member

@jburos I generally agree with your aside. In lung, I just have load_cohort(only_rna=True) and proceed from there. I'm not sure what the downsides are of that?

Copy link
Member

@tavinathanson tavinathanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

@jburos
Copy link
Member Author

jburos commented Aug 21, 2017

@tavinathanson Thx. First this is a question of consistency, since this PR extends the logic currently in place for many functions to another class of functions.

But to elaborate on my aside, above, there are two primary/immediate reasons why this is undesirable for my use case - mainly when using as_dataframe to export data for a cohort.

  1. In some cases, I want to compare attributes (e.g. survival) for patients with & without missing data as part of an exploratory analysis. Without the ability to export data for the full cohort, including NaN values for missing data, I am unable to do this type of analysis.
  2. Many of my analysis make use of data where values are missing, either to impute data for these values or include the records to (e.g.) stabilize survival estimates.

In general, when for example working in R, I often want to export my data once & then move forward with analysis using a data frame in which I can see the full cohort's data. Re-exporting data for each analysis is not only inefficient, but also runs the risk of inconsistencies in computed values across exported instances.

Separately, I can imagine a use case where you might want to initialize a cohort once (which can take a non-trivial amt of time, even with caching) & then loop through various analyses irrespective of the fact that some data are missing for some patients in some analyses. This is where an unsuspecting user might ignore or forget that the sample sizes are different for each analysis. Maybe we want to go through each of our analyses & report on the sample size for each?

This is not always a problem by the way - for example, in our bladder-cohort analyses we didn't restrict all analyses to the same subset of patients as has all data available.

At any rate in general I propose we move this discussion to an issue (or set of issues), since they are somewhat beyond the scope of this PR. For now, I see two open questions:

  1. At what level to report these messages? As INFO or WARNING (warning actually seems more appropriate to me)?
  2. Do we want consistency across the various functions in how these are handled? If so, is this the right approach to achieve this?

@coveralls
Copy link

coveralls commented Aug 22, 2017

Coverage Status

Coverage decreased (-0.2%) to 52.314% when pulling 1eb7199 on fix-expression-errors into ea31881 on master.

@tavinathanson
Copy link
Member

@jburos fair points. Since we have different use cases, and I want to make sure people don't inadvertently have missing data because they misspelled a BAM filename, how about we gate this on a new Cohort.fail_on_missing_bams that defaults to True?

@jburos
Copy link
Member Author

jburos commented Aug 22, 2017

@tavinathanson then, why not do the same for VCFs, etc? what about an allow_nan which defaults to False? If True, then NaN values in any of these fields would be "OK"

@tavinathanson
Copy link
Member

@jburos I was just worried about confusion with clinical data, since we do allow that to be NaN.

@jburos
Copy link
Member Author

jburos commented Aug 22, 2017

@tavinathanson First, we should probably separate the issue of allowing NaN for variants but not expressed variants, vs the name of the parameter ;) my fault for confounding this.

I care less about the parameter name & more about consistency across summary-functions. Even if the flag were fail_on_missing_bams, shouldn't we also fail on missing vcfs? Or would this be a separate flag?

Regardless, was thinking when I wrote the aside that this would be nice to customize at the level of the analysis (ie within plot_benefit) as well as a cohort-wide default.

I was also thinking (one could argue) that this should apply also to clinical data -- ie fine for age to be missing for some patients until we are using it in an analysis.

Finally, I should probably separate the issues of a misspelled bamfile name from that of a patient not having a bamfile (i should rename the exception to reflect this). I'm pretty sure the misspelled-bam-file case will result in a different error at the moment.

@jburos
Copy link
Member Author

jburos commented Aug 23, 2017

@tavinathanson also FYI (no rush) I'm done making changes to this one, for whenever you are available to take a look. We should really add some test cases to this -- going to create an issue for that.

@coveralls
Copy link

coveralls commented Aug 23, 2017

Coverage Status

Coverage increased (+0.2%) to 52.704% when pulling de9301d on fix-expression-errors into ea31881 on master.

@coveralls
Copy link

coveralls commented Aug 23, 2017

Coverage Status

Coverage increased (+0.2%) to 52.704% when pulling b7c8139 on fix-expression-errors into ea31881 on master.

no_variants = True

# Note that this is the number of variant collections and not the number of
# variants. 0 variants will lead to 0 neoantigens, for example, but 0 variant
# collections will lead to NaN variants and neoantigens.
if no_variants:
print("Variants did not exist for patient %s" % patient.id)
merged_variants = None
raise MissingVariantFile(patient_id=patient.id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some repeated logic here; so many checks. Maybe file an issue to clean up this function? Don't want to delay the PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - this function does need to be cleaned up. But, yep that would belong in a different PR

@coveralls
Copy link

coveralls commented Aug 30, 2017

Coverage Status

Coverage increased (+0.1%) to 52.856% when pulling 3c5e86a on fix-expression-errors into f9f4af2 on master.

@coveralls
Copy link

coveralls commented Sep 7, 2017

Coverage Status

Coverage increased (+0.1%) to 53.019% when pulling 4a26b14 on fix-expression-errors into 5ed28a4 on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants