cell_methods for covariance #269

Open
achho opened this issue Jan 2, 2024 · 7 comments
Labels
question Further information is requested or discussion invited

Comments

@achho

achho commented Jan 2, 2024

I have a covariance variable (e.g. the covariance between temperature and a wind component). I would like to use time_bounds and the cell_methods attribute to clarify how the data was processed. I think I should use cell_methods like "time: covariance (interval: 0.1 s)". However, "covariance" is not one of the supported methods in Appendix E: Cell Methods. How should I deal with that issue? Thanks!
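
For readers following along, here is a minimal sketch of the metadata being asked about, written as plain Python dictionaries rather than with any particular netCDF library. The 10 s cell length, the reference date, the units and all variable names (`time_bnds`, `flux_attrs`) are assumptions made for illustration. The `covariance` method is the proposal under discussion and is not in Appendix E.

```python
def make_time_bounds(start, cell_length, n_cells):
    """Return contiguous (lower, upper) bounds for n_cells cells, in seconds."""
    return [
        (start + i * cell_length, start + (i + 1) * cell_length)
        for i in range(n_cells)
    ]

# Hypothetical layout: three 10 s cells, each built from 0.1 s samples.
time_bnds = make_time_bounds(0.0, 10.0, 3)
time = [(lower + upper) / 2 for lower, upper in time_bnds]  # cell midpoints

# The CF "bounds" attribute on the coordinate names the bounds variable.
time_attrs = {
    "units": "seconds since 2024-01-01 00:00:00",  # assumed reference date
    "bounds": "time_bnds",
}

# Attributes of the hypothetical covariance variable. "covariance" is the
# method proposed in this issue; it is NOT a supported cell method today.
flux_attrs = {
    "long_name": "covariance of air temperature and upward wind",  # assumed
    "units": "K m s-1",
    "cell_methods": "time: covariance (interval: 0.1 s)",
}
```

The interval in parentheses records the spacing of the original samples, while the bounds record the cell over which each covariance value was calculated.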

@achho achho added the question Further information is requested or discussion invited label Jan 2, 2024
@taylor13

taylor13 commented Jan 2, 2024

We might have considered this issue some time ago, but I can't find the discussion. In any case, perhaps we should also think about how to handle "correlations". Also, do we need a new standard_name? Note that currently there is one standard name that includes the word covariance: covariance_over_longitude_of_northward_wind_and_air_temperature.

@JonathanGregory
Contributor

I'm sure we must have discussed this before in the last >20 years! Cell methods is designed to record statistical operations that reduce the number of dimensions of the data (e.g. a zonal mean, which removes longitude) or its spatiotemporal resolution (e.g. daily maxima calculated from hourly data). These operations involve only one quantity.

Correlation and covariance involve combining two quantities. In that they are statistical operations, they seem similar to cell methods, but cell methods isn't a convention for describing arbitrary combination of quantities. Therefore I think the covariance of two quantities should be given a standard name. Indeed, both covariance and correlation are foreseen in the guidelines for designing new standard names. Other quantities are also mathematical combinations of quantities, such as product and difference, and they likewise have new standard names.

However, I agree that you should be able to record the original interval of the data in time (before the covariance was calculated) using cell methods. Once it has been calculated, it's a property of the entire cell in time, for instance if the cells are intervals of 10 s and the original data was at 0.1 s intervals. Because it represents the whole cell, and not just an instant within the cell, it can't be point, but sum isn't right either, because it's not extensive. The covariance won't generally get larger if the cells cover longer intervals. Maybe we could define a new cell method of cell to indicate a quantity which is an intensive property of the whole cell. Would that make sense for a covariance?
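
The intensive/extensive distinction above can be checked numerically. The following is a sketch under stated assumptions: the synthetic signals, the 0.1 s sampling, and the helper names (`covariance`, `summed_product`, `by_cell`) are all made up for illustration, and the summed product is included only as a contrasting extensive quantity, not as something proposed in the thread.

```python
import math

SAMPLE_INTERVAL = 0.1  # seconds between original samples (assumed)

def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

def summed_product(x, y):
    """Sum of sample products: grows with the number of samples (extensive)."""
    return sum(a * b for a, b in zip(x, y))

def by_cell(x, y, cell_length, func):
    """Apply func to each consecutive, non-overlapping cell of cell_length seconds."""
    step = round(cell_length / SAMPLE_INTERVAL)
    return [
        func(x[i:i + step], y[i:i + step])
        for i in range(0, len(x) - step + 1, step)
    ]

# 20 s of synthetic data: a 1 s oscillation sampled every 0.1 s.
n_samples = 200
wind = [math.sin(2 * math.pi * i / 10) for i in range(n_samples)]
temperature = [0.5 * w for w in wind]

cov_10s = by_cell(wind, temperature, 10.0, covariance)      # two cells
cov_20s = by_cell(wind, temperature, 20.0, covariance)      # one cell
sum_10s = by_cell(wind, temperature, 10.0, summed_product)
sum_20s = by_cell(wind, temperature, 20.0, summed_product)
```

With this stationary signal, the covariance comes out at about 0.25 for both cell lengths, while the summed product doubles when the cell length doubles. Real data would not give identical covariances for different cell lengths, only values that do not scale with the cell size.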

@ChrisBarker-NOAA

> Maybe we could define a new cell method of cell to indicate a quantity which is an intensive property of the whole cell

That really doesn't exist? I would think that we would have found a need for that before now :-)

Anyway, maybe we could repurpose mean -- it's mathematically the same: if a value is constant over a cell then the mean over the cell would be the same value.

But does that make sense for covariance? Maybe I'm blinded by my intuitive sense of what a cell is, but I don't think this has much to do with cells.

For what I think is a simpler, but analogous example, how would one encode a moving average of a 1-D variable in time?

You may have hourly data, and the moving average is still hourly -- would you define cell bounds to define the window of the moving average? Then mean as the cell method? If so, then perhaps we would need a covariance cell method.

@JonathanGregory
Contributor

Dear @ChrisBarker-NOAA et al.

By "cell" I mean a 1D interval between a pair of coordinate bounds. That's the meaning of "cell" in cell_methods. A cell method can be applied to several dimensions at once, in which case "cell" means the n-dimensional space defined by the coordinate bounds in all the relevant dimensions at once.

In my last posting I was thinking it would be useful to have a cell method which did not imply anything about how the value for the cell is obtained, just that it represents the whole cell, rather than a subset of it or a number of points within it. mean is a particular way of obtaining a value that is representative of a cell i.e. integrating over the whole cell, and dividing by the extent of the cell (the difference between the bounds). You are right that if the value is constant within the cell the mean gives you that value, but equally so do maximum, minimum, median and various other cell methods apart from mean, so I don't think we should pick on mean as special.

Now I realise that we do want to imply that the quantity is intensive as well as applying to the whole cell, whatever the precise method. A covariance is intensive, meaning it wouldn't necessarily get bigger if evaluated in larger cells. Some day we might want a cell method that indicates an extensive quantity applying to the whole cell. That would suggest the new method should be intensive, rather than cell.

But maybe this is too opaque and general and we should define covariance as a cell method instead, as @achho suggested, but not as the entire description of the quantity. We would still need the standard name saying what it is a covariance of, since we can't record that in the cell_methods. As a cell method, covariance would be defined as something like "covariance within the cell of two other quantities that are intensive with respect to the specified dimension".

A moving average is a mean with overlapping cells.

Best wishes

Jonathan
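
Two points from the comment above can be sketched in a few lines of Python: a moving average expressed as a mean over overlapping cells, and the observation that for a constant value the mean, median, maximum and minimum all coincide. The hourly times, the sample values, the one-hour half-width and the helper names (`window_bounds`, `cell_means`) are illustrative assumptions, not part of the CF conventions.

```python
from statistics import mean, median

def window_bounds(times, half_width):
    """One cell per output time, centred on it; neighbouring cells overlap."""
    return [(t - half_width, t + half_width) for t in times]

def cell_means(times, values, bounds):
    """Mean of the samples that fall inside each (closed) cell."""
    return [
        mean([v for t, v in zip(times, values) if lower <= t <= upper])
        for lower, upper in bounds
    ]

# Hourly data; the moving average is also hourly, with 2-hour-wide cells.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]       # hours
values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
bounds = window_bounds(times, half_width=1.0)
smoothed = cell_means(times, values, bounds)

# Cells overlap: the first cell ends after the second one begins.
cells_overlap = bounds[0][1] > bounds[1][0]

# For a constant value within a cell, several methods give the same answer,
# so mean is not special in that respect.
constant = [7.0, 7.0, 7.0, 7.0]
results = {mean(constant), median(constant), max(constant), min(constant)}
```

Here the coordinate values stay hourly and only the bounds widen, which is how the window of the moving average would be carried by the cell bounds.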

@achho
Author

achho commented Jan 3, 2024

Thanks for considering this! I agree, there seems to be a fundamental difference between existing cell methods which are aggregations of the variable itself and covariance and correlation which are calculations of different variables. So I am not entirely sure whether cell_methods is actually the right place to put the information I would like to specify (sampling interval of the original variables). The term cell_method seems to imply that the method is performed on a cell in the field of the variable itself. But the way I understand the CF standard, I have to put something in the cell_methods attribute when I use time_bounds?

As for the variables for which standard names might be needed: I am working with variables from Eddy-Covariance measurements, where the covariance between any pair of wind components u, v, w and scalars like acoustic temperature, H2O, CO2 and other gas concentrations is calculated. The measurements were part of the urban climate project "Urban Climate Under Change", in which we put some work into standardizing variable names across the institutions involved in the project. The resulting table can be found here (csv) and here (pdf). We stuck to CF standard names where they existed. The covariance variables are called kinematic fluxes (e.g., eastward kinematic sensible heat flux in air). If you deem it beneficial, I am happy to contribute to a discussion on standard names to be added. Maybe under a separate issue?

@ChrisBarker-NOAA

ChrisBarker-NOAA commented Jan 3, 2024

> By "cell" I mean a 1D interval between a pair of coordinate bounds. That's the meaning of "cell" in cell_methods.

Got it (I think), but I don't get how a computed covariance would be a cell ...

> In my last posting I was thinking it would be useful to have a cell method which did not imply anything about how the value for the cell is obtained, just that it represents the whole cell, rather than a subset of it or a number of points within it.

I think that is still a good idea, even if not for this use case -- but would simply not supplying a cell method accomplish the same thing?

> Now I realise that we do want to imply that the quantity is intensive as well as applying to the whole cell, whatever the precise method. A covariance is intensive, meaning it wouldn't necessarily get bigger if evaluated in larger cells --

But it would change -- is that any different?

> A moving average is a mean with overlapping cells.

So how does that get represented? I guess I was hoping that it could give us a hint as to what to do with covariance. Or better yet, a weighted moving average, which can no longer be defined as a mean over a cell.

I guess where I'm heading is that there are any number of ways that a derived quantity can be computed from a range (cell?) of other quantities -- trying to capture them all as a cell method seems like a tricky idea ...

@JonathanGregory
Contributor

> I agree, there seems to be a fundamental difference between existing cell methods which are aggregations of the variable itself and covariance and correlation which are calculations of different variables. So I am not entirely sure whether cell_methods is actually the right place to put the information I would like to specify (sampling interval of the original variables). The term cell_method seems to imply that the method is performed on a cell in the field of the variable itself. But the way I understand the CF standard, I have to put something in the cell_methods attribute when I use time_bounds?

You don't need cell_methods for time bounds, but you do need to choose a method if you want to put an entry in cell_methods like you first suggested: "time: METHOD (interval: 0.1 s)". METHOD can't be omitted in this syntax.

We agree that it's not clear whether it's appropriate to use cell_methods for your purpose. That's what we have to decide. It wouldn't break anything, and it could be informative, but does it make sense? At the moment, I think it would be OK, but I'm not sure.

For example, it would be fine to put "time: sum (interval: 0.1 s)" for a quantity which was originally accumulated over 0.1 s intervals and then added up over longer intervals. If in fact it was a rate that was measured at 0.1 s intervals, and an integral had been calculated over longer intervals with some interpolation between the measurements, we'd still call it "time: sum (interval: 0.1 s)". This cell method indicates that the quantity should be interpreted as a sum over the cells, and was derived from data at 0.1 s intervals. To me it doesn't seem much of a stretch to put "time: covariance (interval: 0.1 s)" for a covariance calculated from other quantities which were measured at 0.1 s intervals. What do you and others think?
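
The `"time: METHOD (interval: 0.1 s)"` syntax discussed above can be illustrated with a small parser. This is a deliberately simplified sketch: it handles a single `dim: method` entry with an optional interval, whereas the real cell_methods grammar allows more (several dimensions, further qualifiers). The function name `parse_entry` and the set `METHODS_NAMED_IN_THREAD` are hypothetical, and the set holds only the methods mentioned in this thread, not the full Appendix E list.

```python
import re

# Only the methods named in this thread; NOT the full Appendix E list.
METHODS_NAMED_IN_THREAD = {"point", "sum", "mean", "maximum", "minimum", "median"}

# Simplified pattern: "dim: method" with an optional "(interval: value unit)".
_ENTRY = re.compile(
    r"^(?P<dim>\w+): (?P<method>\w+)"
    r"(?: \(interval: (?P<value>[0-9.]+) (?P<unit>\S+)\))?$"
)

def parse_entry(entry):
    """Parse one simplified cell_methods entry into a dictionary."""
    match = _ENTRY.match(entry)
    if match is None:
        # Raised, for example, when METHOD is omitted.
        raise ValueError(f"not a 'dim: METHOD (interval: value unit)' entry: {entry!r}")
    value = match.group("value")
    method = match.group("method")
    return {
        "dim": match.group("dim"),
        "method": method,
        "interval": None if value is None else float(value),
        "unit": match.group("unit"),
        "known": method in METHODS_NAMED_IN_THREAD,
    }

accumulated = parse_entry("time: sum (interval: 0.1 s)")
proposed = parse_entry("time: covariance (interval: 0.1 s)")
```

Under these assumptions, `accumulated["known"]` is true and `proposed["known"]` is false, and an entry such as `"time: (interval: 0.1 s)"` raises `ValueError` because the method is missing.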

> As for the variables for which standard names might be needed: I am working with variables from Eddy-Covariance measurements, where the covariance between any pair of wind components u, v, w and scalars like acoustic temperature, H2O, CO2 and other gas concentrations are calculated. ... The covariance variables are called kinematic fluxes (e.g., eastward kinematic sensible heat flux in air). If you deem it beneficial, I am happy to contribute to a discussion on standard names to be added. Maybe under a separate issue?

Yes, please start another issue about that. Thanks!


4 participants