-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report on but do not error when patient has no RNA data #245
base: master
Are you sure you want to change the base?
Conversation
This reverts commit 164c106.
Just noticed that these are all reported using |
@jburos I generally agree with your aside. In lung, I just have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above
@tavinathanson Thx. First this is a question of consistency, since this PR extends the logic currently in place for many functions to another class of functions. But to elaborate on my aside, above, there are two primary/immediate reasons why this is undesirable for my use case - mainly when using
In general, when for example working in R, I often want to export my data once & then move forward with analysis using a data frame in which I can see the full cohort's data. Re-exporting data for each analysis is not only inefficient, but also runs the risk of inconsistencies in computed values across exported instances. Separately, I can imagine a use case where you might want to initialize a cohort once (which can take a non-trivial amt of time, even with caching) & then loop through various analyses irrespective of the fact that some data are missing for some patients in some analyses. This is where an unsuspecting user might ignore or forget that the sample sizes are different for each analysis. Maybe we want to go through each of our analyses & report on the sample size for each? This is not always a problem by the way - for example, in our bladder-cohort analyses we didn't restrict all analyses to the same subset of patients as has all data available. At any rate in general I propose we move this discussion to an issue (or set of issues), since they are somewhat beyond the scope of this PR. For now, I see two open questions:
|
@jburos fair points. Since we have different use cases, and I want to make sure people don't inadvertently have missing data because they misspelled a BAM filename, how about we gate this on a new |
@tavinathanson then, why not do the same for VCFs, etc? what about an |
@jburos I was just worried about confusion with clinical data, since we do allow that to be |
@tavinathanson First, we should probably separate the issue of allowing NaN for variants but not expressed variants, vs the name of the parameter ;) my fault for confounding this. I care less about the parameter name & more about consistency across summary-functions. Even if the flag were Regardless, was thinking when I wrote the aside that this would be nice to customize at the level of the analysis (ie within I was also thinking (one could argue) that this should apply also to clinical data -- ie fine for Finally, I should probably separate the issues of a misspelled bamfile name from that of a patient not having a bamfile (i should rename the exception to reflect this). I'm pretty sure the misspelled-bam-file case will result in a different error at the moment. |
@tavinathanson also FYI (no rush) I'm done making changes to this one, for whenever you are available to take a look. We should really add some test cases to this -- going to create an issue for that. |
no_variants = True | ||
|
||
# Note that this is the number of variant collections and not the number of | ||
# variants. 0 variants will lead to 0 neoantigens, for example, but 0 variant | ||
# collections will lead to NaN variants and neoantigens. | ||
if no_variants: | ||
print("Variants did not exist for patient %s" % patient.id) | ||
merged_variants = None | ||
raise MissingVariantFile(patient_id=patient.id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be some repeated logic here; so many checks. Maybe file an issue to clean up this function? Don't want to delay the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - this function does need to be cleaned up. But, yep that would belong in a different PR
Here I'm trying to achieve a goal of allowing an analysis on a cohort where some but not all records have RNA sequencing data. Currently, if one is using a summary function like
cohorts.functions.expressed_missense_snv_count
, you will see an exception if some but not all records have RNA expression data.Instead, we want to print a warning for these patients and populate the resulting df with
np.NaN
for any patient that does not contain RNA sequencing data where the filter function depends on it. This behavior would then be consistent with (for example) that of other summary functions likemissense_snv_count
for patients that do not have VCF files.Aside: The correct way to handle this in most situations is to require that the cohort be filtered to those
Patient
s having RNA data prior to analysis. This would make the exception desirable. However there are scenarios (like when usingas_dataframe
) where one might desire a different behavior. This however, runs the risk of a user ignoring the INFO messages & proceeding with analysis in error. This might be something we want to allow the user to configure, ie whether to allow NaN values in the cohort or not.At any rate, for now, this PR does a few things:
info
using the logger.Also, for the sake of consistency, this PR:
Note that I refactored some parts of the
Cohort._load_single_patient_variants
andCohort._load_single_patient_effects
in order to throw & report on these scenarios.