Searching the catalog for a dataset #203

Open
anton-seaice opened this issue Sep 30, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@anton-seaice
Collaborator

Is your feature request related to a problem? Please describe.

Currently, retrieving a dataset takes two search operations: one to find an intake-esm datastore in the catalog, and a second to search that datastore for the variable of interest. This is confusing because the catalog can be searched by variable, yet the resulting datastore contains all variables rather than only those searched for:

i.e.

image

Describe the feature you'd like

I would like to be able to get from a catalog search directly to a dataset

e.g.:

cat.search(name='025deg_jra55_ryf9091_gadi', variable='aice_m').to_dask()

Returns an xarray dataset

I would like for catalog searches to return a datastore search if possible:

e.g.

cat.search(name='025deg_jra55_ryf9091_gadi', variable='aice_m').to_source()

Returns the same result as:

cat['025deg_jra55_ryf9091_gadi'].search(variable='aice_m').to_dask()
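
Until something like this exists, the two-step pattern could be wrapped in a small user-side helper. This is only a sketch: `open_dataset_from_catalog` is a hypothetical name, not part of any catalog API, and it assumes only what the two-step example above already relies on, namely that `cat[name]` returns a datastore exposing `.search(...)` and `.to_dask()`.

```python
def open_dataset_from_catalog(cat, name, **query):
    """Hypothetical helper: go from a catalog entry straight to a dataset.

    Assumes `cat[name]` returns a datastore exposing `.search(**query)`
    and `.to_dask()`, as in the two-step example above.
    """
    datastore = cat[name]               # step 1: pick the datastore by name
    subset = datastore.search(**query)  # step 2: filter the datastore itself
    return subset.to_dask()             # open the filtered result


# Usage (assumes `cat` is the catalog object used elsewhere in this issue):
# ds = open_dataset_from_catalog(cat, "025deg_jra55_ryf9091_gadi", variable="aice_m")
```

This hides the second search from the user but does not change what a catalog search returns, so it is a workaround rather than the feature requested here.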

Describe alternatives you've considered

No change - train users in the current implementation

Additional context

I haven't thought about how this might apply to the CMIP6 datastores - which are formatted / handled a bit differently

@anton-seaice anton-seaice added the enhancement New feature or request label Sep 30, 2024
@rbeucher
Member

I agree with you, this is confusing. We should discuss that in our next meeting.

@marc-white
Collaborator

We'll need to check the current state of the catalog vis-a-vis completeness of the variable column - I think there are some experiments that won't have that defined.

@anton-seaice
Collaborator Author

I think if it's not defined, it's somewhat OK. The issue here is that you search for a variable and get back a datastore, but the datastore isn't filtered by the variable you searched for. If you can't refine by variable when searching for a datastore, it doesn't matter.

The same logic applies to the other columns too, btw (principally frequency, although also short_name / long_name).
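
To make the mismatch concrete, here is a toy model in plain Python. It is not the real catalog code: `CATALOG`, `DATASTORES`, `search_catalog`, and every name in the data other than `aice_m` are made up for illustration. The catalog search picks the datastores whose rows match, but each datastore comes back whole:

```python
# Toy catalog: one row per (datastore name, variable, frequency).
CATALOG = [
    {"name": "exp_a", "variable": "aice_m", "frequency": "1mon"},
    {"name": "exp_a", "variable": "hi_m", "frequency": "1mon"},
    {"name": "exp_b", "variable": "hi_m", "frequency": "1mon"},
]

# Toy datastores: the full list of variables each one holds.
DATASTORES = {
    "exp_a": ["aice_m", "hi_m"],
    "exp_b": ["hi_m"],
}


def search_catalog(**query):
    """Return the names of datastores with at least one row matching the query."""
    return sorted({
        row["name"]
        for row in CATALOG
        if all(row.get(key) == value for key, value in query.items())
    })


matches = search_catalog(variable="aice_m")
print(matches)                 # ['exp_a']: the search narrowed the catalog
print(DATASTORES[matches[0]])  # ['aice_m', 'hi_m']: the datastore is unfiltered
```

The same thing would happen in this toy model for a `frequency` query: the match decides which datastores are returned, not which rows inside them.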

@dougiesquire
Collaborator

I'm not sure whether you're talking about the issue described in the documentation I linked above, or the fact that Intake-ESM only reduces returned datasets by queries on the variable_column_name column (i.e. variable for most of our datastores)

@anton-seaice
Collaborator Author

Ah the first one, although I would like the second one to change too :-)

Can/could/should pass_query=True be the default?

@dougiesquire
Collaborator

dougiesquire commented Oct 1, 2024

> Can/could/should pass_query=True be the default?

It's pretty clunky because it only works nicely when the columns in the catalog and the datastore are the same, which isn't guaranteed. Otherwise it throws warnings.
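
One possible way to soften the column mismatch, sketched under assumptions: forward only the parts of the catalog query that the datastore can interpret. `split_query` is a hypothetical helper, not an existing function, and `datastore_columns` stands in for however the datastore's column names are obtained.

```python
def split_query(query, datastore_columns):
    """Split a catalog query into keys the datastore knows and keys it does not.

    Hypothetical helper: `datastore_columns` is any iterable of column names.
    """
    columns = set(datastore_columns)
    forwarded = {k: v for k, v in query.items() if k in columns}
    dropped = {k: v for k, v in query.items() if k not in columns}
    return forwarded, dropped


forwarded, dropped = split_query(
    {"name": "025deg_jra55_ryf9091_gadi", "variable": "aice_m"},
    datastore_columns=["variable", "frequency"],  # made-up column list
)
# `forwarded` keeps the "variable" key; `dropped` holds the "name" key,
# which only makes sense at the catalog level.
```

Whether silently dropping keys is better than a warning is a design question: a key that exists in both places but means something different in each would still be forwarded.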

@anton-seaice
Collaborator Author

We might be able to tidy that up for "our" datasets, this is useful:

Screenshot 2024-10-01 at 11 23 02 AM

It's probably harder for other datastores; you can't pass the model name to the NCI CMIP6 datastore, for example ...

Screenshot 2024-10-01 at 11 23 59 AM

@charles-turner-1
Collaborator

I think this relates to the search functionality in intake-dataframe-catalog.

I'm working on trying to better understand the search functionality there - just commenting so I can come back and find this more easily.

Projects
Status: Backlog
Development

No branches or pull requests

5 participants