Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

68 Polars based search #69

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open

68 Polars based search #69

wants to merge 18 commits into from

Conversation

charles-turner-1
Copy link
Collaborator

Closes #68.

This PR is a bit of a monster & addresses the difficulty I had understanding the search._search function. It's hard to say for certain, but I think the complexity of this function lead to some erroneous/inconsistent tests.

I have:

  • Refactored the search function to use polars instead of pandas & numpy in order to (hopefully) simplify the logic.
  • Removed & altered some tests which appear to be inconsistent with other tests & documentation.
  • Added type hints everywhere, fixed inconsistent typing (mypy reports no errors).

The change from performing catalog searches using polars over pandas should improve performance pretty substantially. I haven't tested this explicitly yet (the tests only run on small catalogues), but polars is in general much more performant, in particular when using lazy evaluation as I've done here.

NB: Although catalog search has been changed to use polars in this PR, the user will never see a polars dataframe, only a pandas dataframe: conversion back and forth happens internally within the search functionality.

@@ -258,14 +258,12 @@ def test_catalog_keys(catalog_path):
[
({"realm": "ocean"}, False, 1),
({"realm": ["atmos", "ocnBgchem"]}, False, 3),
({"realm": ["atmos", "ocnBgchem"]}, True, 1),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realm is not defined to be a column with iterables here, and so the require_all=True argument doesn't really make sense.

Additionally, in the test dataset provided, only one realm is provided per dataset. If we were to interpret the require_all=True as meaning that we needed to match both 'atmos' and 'ocnBgchem', we wouldn't expect any results

({"realm": "atmos"}, False, 3),
({"realm": "atmos", "variable": "tas"}, False, 1),
({"realm": "atmos", "variable": ["tas"]}, False, 1),
({"variable": ["NO2", "tas", "fgco2"]}, False, 3),
({"variable": ["NO2", "tas", "fgco2"]}, True, 0),
({"name": ["cesm", "cmip5"]}, False, 2),
({"name": ["cesm", "cmip5"]}, True, 0),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above comment regarding realm

@@ -176,7 +177,7 @@ def test_is_pattern(value, expected):
],
),
(
{"A": [re.compile("^a.*a$", flags=re.IGNORECASE)]},
{"A": ["(?i)^a.*a$"]},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Polars string matching has first class support for regexes. Setting case insensitive search in polars is as straightforward as adding a (?i) group.

There doesn't seem to be any documentation indicating that we expose compiled regex's to the user as a way to interact with the catalog, so I think it's safe to drop the use of re.compile here.

@@ -333,7 +333,7 @@ def test_search(query, expected):
},
{
"A": "cat1",
"B": ["a", "c"],
"B": ["c", "a"],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When searching over list columns in polars, the ordering of the initial list column is preserved.

Whether we would prefer to preserve the initial list order or sort in order of the query as the previous implementation did is an open question - I can see arguments for both.

I've altered the test as I don't think it's important & it reduces the amount of data processing necessary.

@@ -353,7 +353,7 @@ def test_search(query, expected):
},
{
"A": "cat1",
"B": ["a", "c"],
"B": ["c", "a"],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above regarding list ordering

@@ -258,7 +259,6 @@ def test_search(query, expected):
[
{"A": "cat0", "B": ["a", "b"], "C": ("cx", "cy"), "D": {0}, "E": "xxx"},
{"A": "cat1", "B": ["a", "b"], "C": ("cx", "cz"), "D": {0}, "E": "xxx"},
{"A": "cat1", "B": ["a"], "C": ("cz", "cy"), "D": {0}, "E": "yyy"},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this test was erroneous: see #68

@charles-turner-1 charles-turner-1 marked this pull request as draft November 6, 2024 05:32
@charles-turner-1 charles-turner-1 marked this pull request as ready for review November 11, 2024 07:56
@charles-turner-1
Copy link
Collaborator Author

charles-turner-1 commented Nov 13, 2024

I've run a few performance tests - it looks like the polars based implementation is similarly performant on simple(ish) searches & quite a bit faster on more complicated searches: see below

import intake
cat = intake.cat.access_nri

%%timeit
cat.search(variable='(?i).*temp.*')

# Using polars based implementation
# 30.4 ms ± 613 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Using pandas based implementation
# 37 ms ± 934 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*')

# Using polars based implementation
# 87.3 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Using pandas based implementation
# 77 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*', variable='(?i).*global.*')

# Using polars based implementation
# 19.6 ms ± 163 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Using pandas based implementation
# 108 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*', variable='(?i).*global.*', realm='(?i)seaice')

# Using polars based implementation
# 12.3 ms ± 199 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Using pandas based implementation
# 112 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential bug in tests/test_search.py/test_search_columns_with_iterables
1 participant