How to Report Uncertainty Chapter #320
Comments
Dear Ken, Thanks for putting this together - a lot of work has clearly gone into it. I have a number of comments and questions which I'd like to think about some more before posting, but I'd like to highlight at this time that the proposal as it stands would require a number of changes to the CF data model. The features that would need data model changes are, as far as I can tell:
Extending the data model is not a problem provided that it has been established that we can't meet the requirements with the current data model. I don't have any answers as yet, but will carry on thinking about it. I'll post again with some more detailed thoughts on the text ... All the best, |
Overall, I think that the general approach of using ancillary variables and cell methods is a good one. There was considerable discussion around the topic of "standard name modifiers or cell methods?" in 2011, 2012, and 2013 (e.g. http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/006106.html, https://cf-trac.llnl.gov/trac/ticket/74), which is well worth revisiting if you have the time. Here are my initial thoughts on the detailed proposal:
Standard names
I don't think that these standard names will work, for two reasons:
From reading the [GUM] reference you very helpfully provided in the bibliography (https://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf), I'm a bit confused about the definition of "total_uncertainty". Your reference to "the square root of the sum of squares" might suggest that it is the [GUM]'s "standard uncertainty" - is that right? I felt the prefix
Cell methods
It would be very useful to include the parent data variable in the examples that have ancillary data cell methods. Without that reference I can't fully understand what the I'm not sure that In the Type B case, there are perhaps further complications, if one considers the stored values to not be representative of sub-grid variation. There is already a I'm confused how the confidence interval in example 10.4 is stored as a scalar. Is it in fact half of an interval that is symmetric about the measurement? If so, the
Ancillary variables
Interpreting cell methods on ancillary variables could be allowed, provided that the cell method names could also be plausibly applied to the data variable - see comments above on the interpretation of the cell method names. Ancillary methods containing ancillary methods and having trailing dimensions - these need more thought, and I'll post again when I've had time to do so. All the best, |
David, Thanks for the comments. I initially proposed a standard name modifier and it was not well accepted by the audiences I presented to. I think a standard name modifier is a more difficult and confusing way to use the standard names, and would require updating the CF document every time we want to add a new name. That is significantly more difficult than adding a standard name (which is not a trivial process). If we want to use some other attribute that's fine. I was trying to keep the number of new attributes to a minimum. I also don't want to develop a new system to track names. We could use cf_role, but I get worried about overloading attributes with too many different concepts. There is no standard uncertainty. This is the crux of the problem. There are literally thousands of ways to derive uncertainty so we can't be too specific with the definitions. That's why using standard_error standard name modifier will not be enough. "the square root of the sum of squares" is just the component sum method to add values. It only works if the values are independent, which may or may not be the case. Whether we use specific_ or component_ doesn't matter to me. Someone will not like whichever one is chosen. That's just how it goes. I got the term from someone else. I think the confidence interval needs to be listed as the full range for a confidence value to be calculated. It's a scalar because the same interval is used for all values. The value listed in the variable would be half the range centered on the middle of the range. I don't think the details matter that much, just that the variable indicates the values are from a 95% confidence range. I see the cell_methods as just a high level indication of what is going on. There are many other decisions made in the calculation that are not indicated in cell_methods (e.g. how was missing data treated, was QC applied). I don't follow the concern with ancillary variables. 
I assume the cell methods attribute describes the process to create the values in the variable. So of course the method listed in cell_methods would need to work on the data variable. Thanks for the comments, Ken |
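For readers following the "square root of the sum of squares" point above, here is a minimal sketch of that combination. It is only an illustration of the arithmetic, not part of the proposal: the function name `combine_in_quadrature` and the sample values are hypothetical, and the result is only meaningful when the components are independent, as noted in the comment above.

```python
import math


def combine_in_quadrature(components):
    """Root-sum-of-squares combination of uncertainty components.

    Assumes the components are independent and share the same units;
    correlated components would need covariance terms not shown here.
    """
    return math.sqrt(sum(c * c for c in components))


# Hypothetical component uncertainties in the units of the data variable.
random_component = 3.0
systematic_component = 4.0
total = combine_in_quadrature([random_component, systematic_component])
print(total)
```

If the components are correlated, this simple sum does not apply, which is one reason the thread avoids tying "total uncertainty" to a single formula.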
Hi Ken, I'm trying to better understand the issues around cell methods, and would find it very useful to have the parent data variable that goes with this example ancillary variable in the new chapter 10:
Would it be possible to update the example? Many thanks, |
David, Sure, no problem. I've updated the example in Chapter 10 to include a data variable for both confidence interval examples. Ken |
This review appears to have stalled. What can we do to get this going again? |
Hi Ken, I'm sorry that this has stalled, but I don't have as much time as I would like to devote to lengthy and involved proposals. Perhaps when some other CF issues I've been involved with for some time have concluded I will have more time here. There are still many outstanding questions for me on a few areas, such as: the use of standard names, role identification, interpretation of cell methods on ancillary variables, ancillary variables referencing other ancillary variables[*]. These will need careful thought, especially the last two which, as proposed, break the CF data model. [*] in my original post I meant Ancillary variables containing ancillary variables when I wrote I think the first points we need to resolve are the use of standard names and role_identification (I agree that re-using the "cf_role" name is probably not the best choice here, but another name could work). Do you have any more thoughts on that? All the best, |
Dear @kenkehoe Thanks for making this proposal. I realise that three months have passed. I regret that I have not yet had time to study and review it, although it's been on my agenda all this time. Best wishes Jonathan |
@davidhassell I am trying to keep this proposal as simple as possible. I've gone through a few iterations on options and I think the current proposal is the least drastic change to the convention. I see no issue with ancillary variables containing ancillary variables as that attribute is just a linkage. A previous discussion was to use standard name modifiers but that became cumbersome and would require a change to the CF document each time we add new uncertainty. I also thought about creating a new attribute but that would require either changing the CF document appendix where they are listed each time a new one is added or a new external look up. Since we have the standard name table already I am suggesting we use that. If we wanted to use cf_role set to "uncertainty" to indicate the variable contains uncertainty values that is fine. But we don't require that for other variables (like state variables, quality control, data, platform information) so it would be a strange one off. If we wanted to create a new attribute to signify the values are uncertainty that is fine. But I'd prefer to keep the description of the variable contents using standard name table so we don't need to create a new table. Honestly I'm not a huge fan of cell methods. If that is causing strife I'm OK with dropping it and just using a comment attribute, standard name table, or reference attribute to point to a document with the description of how it was computed. |
What can I do to help move this along? |
Hi Ken, Sorry to have abandoned this again. Your proposal is clearly workable in practice, but there are some issues with the implementation, of varying degrees of seriousness, that mean that it is not yet suitable for inclusion into CF. I agree that it seems like a minimally intrusive set of changes, but some of the aspects you propose are a bit like the tip of the iceberg in complexity, like the ancillary variables containing ancillary variables issue. The issue there is that the This could well be worse than it sounds! It is possible that minor changes could resolve these problems and make your proposal fit in with the CF view of data. I think that the key to progression might be to demonstrate clearly when any raised concerns are not in fact valid, or else be proactive in suggesting concrete alternative ideas, with examples, if none have been suggested. I think I can devote some time to thinking about this in detail again towards the end of August, when I am back from leave, so I look forward to carrying on the discussion then. All the best, |
Dear Ken I have had time at last to study and think a bit about your detailed proposal. Thank you for preparing and presenting it. I appreciate it's frustrating for you that this issue is going slowly. Speaking for myself and from David's comments too, I believe this is because it is a large and complicated proposal; when you're busy (as we all are), it's hard to create a large enough chunk of time to address something requiring lengthy thought. Things might go faster if we dealt with it a piece at a time. I formed my opinions before reading David's, and I find (without surprise) that many of them are the same. Like David, I'm grateful for your link to the GUM. I too agree with your approach of using ancillary variables to contain measures of uncertainty. The CF standard (section 3.4) doesn't say what dimensions ancillary variables should have. Since they're intended to provide metadata about individual values of a data variable, they would normally have all the same dimensions. However, I don't think it would be problematic to allow dimensions to be dropped over which the uncertainty doesn't vary. You could drop all the dimensions to provide a scalar uncertainty, as in your examples. I don't think that standard names are the right way to describe the uncertainties, because the standard name should still identify the geophysical quantity for which it is an uncertainty e.g. David mentioned that your proposal requires ancillary variables themselves to have ancillary variables. I didn't notice an instance of that in the examples - is there one? The earlier long and detailed discussion of 2013, which David referenced, is certainly very relevant to your proposal, regarding the distinction between Since ancillary variables are like data variables, I think we could allow them to have If the uncertainty comes from repeated measurement of a quantity with the same spatiotemporal coordinates, you might really add a dimension which runs over the individual measurements. 
This is exactly like an ensemble of model runs e.g. Most of your examples of uncertainty are mathematically described as standard deviations. I think they are actually standard errors in the statistical sense: "The standard error (SE) of a statistic is the standard deviation of its sampling distribution or an estimate of that standard deviation" (wikipedia). I note that the GUM doesn't use that term, and probably "experimental standard deviation" is the same concept, isn't it? I think it's confusing to call it a standard deviation, however, because it is not the SD of the sample; it's divided by sqrt(N). I would prefer All the above leads me to suggest a syntax such as You would also like to be able to provide intervals when not symmetrical. That could be done by adding a size-one dimension for probability or percentile, with bounds to specify the interval e.g. So far this is all about describing the mathematical nature of the uncertainty. You also want to describe what it represents. You do this with the standard name, which David and I both think wouldn't work. Could you do this with standardised comments in the cell methods? For instance, you could add I think that's enough for now! I wonder what you think. Best wishes Jonathan |
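The distinction drawn above between the standard deviation of a sample and the standard error (the standard deviation divided by sqrt(N)) can be checked numerically. The following is a minimal sketch, assuming N independent repeated measurements of the same quantity; the function name `standard_error` and the sample values are hypothetical.

```python
import math
import statistics


def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(N).

    Assumes independent repeated measurements; needs at least two samples.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    return statistics.stdev(samples) / math.sqrt(n)


# Hypothetical repeated measurements at the same spatiotemporal coordinates.
measurements = [1.0, 2.0, 3.0, 4.0]
sd = statistics.stdev(measurements)   # spread of the individual measurements
se = standard_error(measurements)     # uncertainty of their mean
print(sd, se)
```

The standard error shrinks as more measurements are added, while the sample standard deviation does not, which is why the two should not share a name.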
Hi Jonathan, Thanks for the reply. You have proposed some good ideas to ponder. I like your idea to use statistical and subjective for the Type A and Type B terms. I'll incorporate that suggestion. I understand the argument against using standard_name because of the canonical units issue. If that is a deal breaker we can find another method. I will need to think about the cell_methods suggestions some more. But I can say I'm not excited about that option. I've been pushing the use of cell_methods with my institution and it's not being accepted well. It is often quite difficult to encapsulate a description of the process into the cell_methods attribute, and most of my colleagues don't want to add that attribute. I spend a lot of time ensuring other critical attributes make it into the datasets, so I've been pushing less hard lately. I also see this as a slippery slope. If cell methods becomes required (in this case it is THE metadata to indicate uncertainty) then we should require it for other metadata variables. I'd prefer to require a less complicated way to indicate a type of data value. I agree we should not add new attributes if an existing attribute already exists. My main concern is to find a method that is simple to write and simple to understand for our data users. Most of our data products have different types of methods of uncertainty and I need to find a solution that works for all of them. Most data users will not care how the uncertainty value was derived, only that the institution which created the data file is providing its best guess uncertainty estimation. They will then use that uncertainty estimation in their work. Often the uncertainty value provided will not be of the form most desirable for their research, but if that is all the researcher is provided they will find a way to use it. Therefore, I'm trying to not require the uncertainty to be defined by the details of the method used to derive the uncertainty. Thanks, Ken |
Correction to what I wrote yesterday:
Since the uncertainty variable is like a data variable, it doesn't have bounds. I was getting confused. Here, |
Dear @kenkehoe I would encourage you to read http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/006106.html if you haven't, because there's a lot of a discussion there about the advantages and disadvantages of The The Best wishes Jonathan |
I am trying to follow this issue, but don't expect to have much to add. However, the pointer to this email thread (thanks @JonathanGregory) particularly caught my eye because I am looking at And measures of spread have a lot in common with uncertainty even though uncertainty is a much more advanced and theoretically sophisticated concept. So, with regard to this issue I think it is relevant to revisit the list of I do not in any way want to hijack this thread, only point out that there are aspects of standard names, standard name modifiers, and cell methods that are relevant for this thread but might be better separated into a conversation over in that issue. |
Jonathan, Thanks for the link to the standard_name vs. cell_methods discussion. I found some useful discussions in there. It was nice to see others have the same confusion about standard_error being a standard_name modifier (only) while standard_deviation is part of cell_methods. I see Jonathan indicates the standard error is not in cell_methods because it does not relate to a particular dimension. If this is true then I don't see how we can rely on cell_methods to define uncertainty when standard error is the most common method to estimate uncertainty. This could result in the described method being in two different locations depending on the method (if standard error, look in standard_name; if something else, look in cell_methods). I've had a really hard time trying to understand why CF recommends that a statistical process that changes the essence of the values from a mean to a standard deviation would use the same standard_name. Jim points out this is quite confusing, and assumes a person or the software would always need to analyze multiple pieces of metadata just to determine that a value is a standard deviation rather than the mean or instantaneous value. I think we should keep with Ken's idea that the standard_name (with modifiers) defines the data and the cell_method is a second order description of what is really in the cells. This is how I have always interpreted the difference. As I go through this email thread I think I should propose my original proposal, which follows along Jim's idea, of using standard name modifiers, and point to cell_methods for information on what sort of operation was performed. This follows David's interpretation of a standard name modifier as something that further describes a data variable beyond the initial description by the standard name. We could limit the standard name modifier to total_uncertainty, random_uncertainty, systematic_uncertainty and then expect the users to look in cell_methods for the other details. 
I'm still concerned with the need Steve pointed out to balance the technically correct description of the values with the ease of use. Since this method will be used by software to discover data and most users will still use an uncertainty estimation when it does not align with their understanding of uncertainty when presented with no other option, we should make the process to discover a measurement's uncertainty estimation as simple as possible. The other issue I see is that cell_methods (as far as I can tell from the CF document) is used to describe how the values were derived using the other data in the file. So a variable's standard deviation can be simply explained in cell_methods by indicating the dimension the operation was performed over and what statistical process was performed on the data. But this will typically not be how the uncertainty estimate was derived. Most uncertainty estimations will include or entirely contain data that is not part of the provided data file. Or the estimation of uncertainty is provided with an equation from an instrument manufacturer that is not just a simple standard error calculation. Trying to put that description of operations in the cell_methods is currently not possible. My program attempted to get an uncertainty estimate for all our primary measurements and present them in a single location. This was a huge task that took many resources to produce. We ended up with a simplified PDF document. As you can see in the appendix there are many different methods used to derive an uncertainty estimation. Many of the estimates are from vendors with proprietary methods they are not willing to share, or are too complicated to put into cell_methods. A majority of the uncertainty estimates are single values. My goal is to just provide a simple method to provide data users with the current best estimate of uncertainty and not bog them down with too much detail. Thanks for the links and discussion, Ken |
Dear Ken Yes, the My suggestion above is to define I think some of the comments in the previous email discussion, as well as yours and @larsbarring's, may be partly addressed by keeping in mind that the I agree that what I sketched above is insufficient to deal with your more complex description of uncertainty computations. I didn't write any more because I'd already written quite a lot! I agree that some further attributes may be needed to provide information about how the uncertainty is derived. Best wishes Jonathan |
Dear Jonathan,
I have a hard time understanding how mean, median or mode can be used to characterise variation. For each of these statistics one can (in principle) use one data value to argue that it is the best available estimate of the corresponding true unobserved value "out there". However, for standard deviation, variance and standard error this is clearly not possible. So, I would say that there is a fundamental difference. In the light of this issue, mean, median or mode do not (as far as I understand) give any information that can be related to uncertainty, which is contrary to standard deviation, variance or standard error. Kind regards, |
Sorry to be unclear. Let me try again. The default cell methods ( |
Jonathan, While I understand the intention of
I am now leaning towards using a standard name modifier of Thanks, Ken |
Dear @kenkehoe You write
and I agree with that. I didn't suggest such a requirement myself. In #320 (comment) I wrote,
The detailed description which you give in these two examples is probably too cumbersome for
I don't know what quantity this is so I can't suggest the standard name! But it doesn't need a modifier in this suggestion.
The proposal is that to determine whether a variable contains an uncertainty you would search the The next word (the comment in Best wishes Jonathan |
Jonathan, Thanks for the suggestion. I can see how using cell_method = "uncertainty" would work, but the main issue is that we don't know the method. The method is not standard_error. So the information provided in cell_methods in your example is incorrect. We do not know the method. I would be on board with moving the statistical or subjective into the method location as that is not listing a specific mathematical process. float random(time); float systematic(time); I think there is confusion between the need to describe the process (statistical vs. subjective) and need to describe the type (total, random, systematic, specific random, specific systematic). According to GUM both classifiers are needed for describing uncertainty. Following the idea of absence of a qualifier defaulting to something, the default would be "(total)". With this method I would suggest the data user searching cell_methods for a string starting with "uncertainty: " as the indicator a variable is an uncertainty variable. This would be equivalent to searching the standard_name ending in "uncertainty". Then the keywords random or systematic in parentheses would indicate the type. Absence of these words would mean total. This would require the addition of subjective and statistical to the cell_methods method list. I feel we are getting close to a solution. Thanks, Ken |
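The search rule described in the comment above (a `cell_methods` string starting with "uncertainty: ", with `random` or `systematic` in parentheses and a default of total) could be sketched as follows. This is a hedged illustration only: the function name `uncertainty_type` and the example strings are hypothetical, the exact `cell_methods` syntax was still under discussion in this thread, and nothing here is part of the CF standard.

```python
import re


def uncertainty_type(cell_methods):
    """Classify a cell_methods string under the rule sketched in the thread.

    Returns None when the string does not start with "uncertainty: ",
    otherwise "random", "systematic", or the default "total".
    The string layout is an assumption, not agreed CF syntax.
    """
    if not cell_methods.startswith("uncertainty: "):
        return None
    # Gather the words found inside every parenthesised comment.
    comments = " ".join(re.findall(r"\(([^)]*)\)", cell_methods))
    words = set(re.findall(r"\w+", comments))
    for kind in ("random", "systematic"):
        if kind in words:
            return kind
    return "total"


# Hypothetical attribute values, for illustration only.
for value in ("uncertainty: statistical (random)",
              "uncertainty: subjective",
              "time: mean"):
    print(value, "->", uncertainty_type(value))
```

A prefix test like this keeps discovery simple for software, which is the stated goal, while leaving the details of the derivation to other metadata.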
Dear @kenkehoe I suggested the keyword I think that The standard error is not the only statistic you might use for an uncertainty distribution. Confidence limits are an alternative, for example (the "expanded uncertainty" of GUM 6.2), and that would be a different cell method again. I had assumed that random errors are evaluated statistically (Type A in GUM 2.3.2) and systematic errors are evaluated subjectively (Type B in GUM 2.3.3). However, if any combination is allowed, as you say, I agree that they should be separately indicated. I think they should all be put in the comment in I'm glad you think we are making progress. I agree. Best wishes Jonathan |
PS and if it contains neither |
Dear @kenkehoe A couple of years ago we made some progress with this issue that you raised about the description of uncertainty in CF. Do you have any time to continue with this, or is someone else able to pursue it? Best wishes Jonathan |
I propose to close this issue, labelled |
I just got back from the AGU Ocean Science Meeting, where uncertainty and how to report it (though not from a technical standpoint) received a lot of attention in a few sessions, both modeling and observational. I would be reluctant to close the issue, even if it is dormant. |
Dear Andrew @DocOtak I agree it's a potentially useful enhancement to make, and we have made some progress in this issue. Closing it doesn't mean deleting it, of course. It could always be reopened if there is a new contribution to make. The motivation for closing dormant issues is to clarify our view of the truly active issues, which helps with managing them. We could put a separate link on the discussion page to produce a list of dormant issues, if that would be useful. Best wishes Jonathan |
Three weeks have passed with no new contribution to the discussion, so I am closing it as |
Before submitting an issue be sure you have read and understand the github contributing guidelines: https://github.com/cf-convention/cf-conventions/blob/master/CONTRIBUTING.md and the rules for CF changes: http://cfconventions.org/rules.html
If the modification is straightforward and non-controversial, feel free to open a pull request simultaneously with the proposed changes.
Change proposals should include the following information as applicable.
Title
Add a new chapter to explain how to report uncertainty values that correlate with data in the file
Moderator
@user
Moderator Status Review [last updated: YYYY-MM-DD]
Brief comment on current status, update periodically
Requirement Summary
Proposing a new chapter to the CF convention to report uncertainty values in a netCDF file that correspond to a linked data variable(s). Since there is no one clear definition of an uncertainty, the proposal is flexible to accommodate many different types and shapes.
Technical Proposal Summary
Brief proposal overview
Benefits
Any data users who would like to include uncertainty values in a netCDF file with (or external to) the data file.
Status Quo
Discussion of the current state of CF and other standards.
Associated pull request
#321
Detailed Proposal
I have been working on a proposal for adding uncertainties to CF for a number of years. I've presented these proposals to the CF meetings and taken into account many suggestions. In addition to the proposals to the CF community I have engaged other communities to see their needs and how to accommodate as many use cases as possible. This has culminated in a working Google Doc (https://docs.google.com/document/d/1UR0flhrEE3yw_3dKW8NpCrGymLt9idwFXJBhZ5ngX3Y/edit#) with the core proposal and examples. Most of the details of the proposal are best summed up in the Google Doc which also has permissions set to comments for anyone to add suggestions and comments.
The basic summary is to use ancillary variables to contain the uncertainty values with flexibility in how to represent the uncertainty values from scalars, to vectors, to external files, to formula to allow users to calculate uncertainty values.