-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
68 Polars based search #69
base: main
Are you sure you want to change the base?
Conversation
… understand the numpy & pandas version
…search_coluns_with_iterables[query2-True-expected2] is incorrect
…search_columns_with_iterables[query2-True-expected2] is incorrect. Failing on test_search, only working on test_search_columns_with_iterables
- Restored 'is_pattern' & related tests - Updated search to recurse on pattern searches with collections of search terms
…to make sense & are failing
tests/test_core.py
Outdated
@@ -258,14 +258,12 @@ def test_catalog_keys(catalog_path): | |||
[ | |||
({"realm": "ocean"}, False, 1), | |||
({"realm": ["atmos", "ocnBgchem"]}, False, 3), | |||
({"realm": ["atmos", "ocnBgchem"]}, True, 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realm is not defined to be a column with iterables here, and so the require_all=True
argument doesn't really make sense.
Additionally, in the test dataset provided, only one realm is provided per dataset. If we were to interpret the require_all=True
as meaning that we needed to match both 'atmos' and 'ocnBgchem', we wouldn't expect any results
tests/test_core.py
Outdated
({"realm": "atmos"}, False, 3), | ||
({"realm": "atmos", "variable": "tas"}, False, 1), | ||
({"realm": "atmos", "variable": ["tas"]}, False, 1), | ||
({"variable": ["NO2", "tas", "fgco2"]}, False, 3), | ||
({"variable": ["NO2", "tas", "fgco2"]}, True, 0), | ||
({"name": ["cesm", "cmip5"]}, False, 2), | ||
({"name": ["cesm", "cmip5"]}, True, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above comment regarding realm
@@ -176,7 +177,7 @@ def test_is_pattern(value, expected): | |||
], | |||
), | |||
( | |||
{"A": [re.compile("^a.*a$", flags=re.IGNORECASE)]}, | |||
{"A": ["(?i)^a.*a$"]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Polars string matching has first class support for regexes. Setting case insensitive search in polars is as straightforward as adding a (?i)
group.
There doesn't seem to be any documentation indicating that we expose compiled regex's to the user as a way to interact with the catalog, so I think it's safe to drop the use of re.compile
here.
@@ -333,7 +333,7 @@ def test_search(query, expected): | |||
}, | |||
{ | |||
"A": "cat1", | |||
"B": ["a", "c"], | |||
"B": ["c", "a"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When searching over list columns in polars, the ordering of the initial list column is preserved.
Whether we would prefer to preserve the initial list order or sort in order of the query as the previous implementation did is an open question - I can see arguments for both.
I've altered the test as I don't think it's important & it reduces the amount of data processing necessary.
@@ -353,7 +353,7 @@ def test_search(query, expected): | |||
}, | |||
{ | |||
"A": "cat1", | |||
"B": ["a", "c"], | |||
"B": ["c", "a"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above regarding list ordering
tests/test_search.py
Outdated
@@ -258,7 +259,6 @@ def test_search(query, expected): | |||
[ | |||
{"A": "cat0", "B": ["a", "b"], "C": ("cx", "cy"), "D": {0}, "E": "xxx"}, | |||
{"A": "cat1", "B": ["a", "b"], "C": ("cx", "cz"), "D": {0}, "E": "xxx"}, | |||
{"A": "cat1", "B": ["a"], "C": ("cz", "cy"), "D": {0}, "E": "yyy"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this test was erroneous: see #68
I've run a few performance tests - it looks like the polars based implementation is similarly performant on simple(ish) searches & quite a bit faster on more complicated searches: see below import intake
cat = intake.cat.access_nri
%%timeit
cat.search(variable='(?i).*temp.*')
# Using polars based implementation
# 30.4 ms ± 613 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Using pandas based implementation
# 37 ms ± 934 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*')
# Using polars based implementation
# 87.3 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Using pandas based implementation
# 77 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*', variable='(?i).*global.*')
# Using polars based implementation
# 19.6 ms ± 163 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Using pandas based implementation
# 108 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*', variable='(?i).*global.*', realm='(?i)seaice')
# Using polars based implementation
# 12.3 ms ± 199 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Using pandas based implementation
# 112 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
Closes #68.
This PR is a bit of a monster & addresses the difficulty I had understanding the
search._search
function. It's hard to say for certain, but I think the complexity of this function lead to some erroneous/inconsistent tests.I have:
The change from performing catalog searches using polars over pandas should improve performance pretty substantially. I haven't tested this explicitly yet (the tests only run on small catalogues), but polars is in general much more performant, in particular when using lazy evaluation as I've done here.
NB: Although catalog search has been changed to use polars in this PR, the user will never see a polars dataframe, only a pandas dataframe: conversion back and forth happens internally within the search functionality.