68 Polars based search #69

charles-turner-1 · 2024-11-06T04:42:04Z

Closes #68.

This PR is a bit of a monster & addresses the difficulty I had understanding the search._search function. It's hard to say for certain, but I think the complexity of this function led to some erroneous/inconsistent tests.

I have:

Refactored the search function to use polars instead of pandas & numpy in order to (hopefully) simplify the logic.
Removed & altered some tests which appear to be inconsistent with other tests & documentation.
Added type hints everywhere, fixed inconsistent typing (mypy reports no errors).

Switching catalog searches from pandas to polars should improve performance pretty substantially. I haven't tested this explicitly yet (the tests only run on small catalogues), but polars is in general much more performant, particularly when using lazy evaluation as I've done here.

NB: Although catalog search has been changed to use polars in this PR, the user will never see a polars dataframe, only a pandas dataframe: conversion back and forth happens internally within the search functionality.

charles-turner-1 · 2024-11-06T04:48:48Z

tests/test_core.py

@@ -258,14 +258,12 @@ def test_catalog_keys(catalog_path):
    [
        ({"realm": "ocean"}, False, 1),
        ({"realm": ["atmos", "ocnBgchem"]}, False, 3),
-        ({"realm": ["atmos", "ocnBgchem"]}, True, 1),


Realm is not defined to be a column with iterables here, and so the require_all=True argument doesn't really make sense.

Additionally, in the test dataset provided, only one realm is provided per dataset. If we were to interpret require_all=True as meaning that we needed to match both 'atmos' and 'ocnBgchem', we wouldn't expect any results.
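To make the second point concrete, here is a minimal sketch in plain Python of the "match every term" reading of require_all. It is not the catalog's search code: the helper name `row_matches` and the sample rows are hypothetical, chosen only to show that a column holding one value per row can never satisfy two distinct terms at once.

```python
def row_matches(value, terms, require_all):
    """Return True if a single cell satisfies the query terms.

    `value` may be a scalar string (one realm per dataset) or an iterable
    of strings. Hypothetical helper, for illustration only.
    """
    # Normalise scalars to a one-element set so both cases share one path.
    cell = {value} if isinstance(value, str) else set(value)
    wanted = set(terms)
    if require_all:
        # Every query term must be present in the cell.
        return wanted <= cell
    # Otherwise one overlapping term is enough.
    return bool(wanted & cell)


# Hypothetical scalar column: one realm per row.
rows = ["atmos", "ocnBgchem", "ocean"]
query = ["atmos", "ocnBgchem"]

any_hits = [r for r in rows if row_matches(r, query, require_all=False)]
all_hits = [r for r in rows if row_matches(r, query, require_all=True)]
```

Under this reading `all_hits` is empty for a scalar column, so an expected count of 1 could only come from some other interpretation of require_all.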

charles-turner-1 · 2024-11-06T04:49:25Z

tests/test_core.py

        ({"realm": "atmos"}, False, 3),
        ({"realm": "atmos", "variable": "tas"}, False, 1),
        ({"realm": "atmos", "variable": ["tas"]}, False, 1),
        ({"variable": ["NO2", "tas", "fgco2"]}, False, 3),
        ({"variable": ["NO2", "tas", "fgco2"]}, True, 0),
        ({"name": ["cesm", "cmip5"]}, False, 2),
-        ({"name": ["cesm", "cmip5"]}, True, 0),


As per the comment above regarding realm.

charles-turner-1 · 2024-11-06T04:52:08Z

tests/test_search.py

@@ -176,7 +177,7 @@ def test_is_pattern(value, expected):
            ],
        ),
        (
-            {"A": [re.compile("^a.*a$", flags=re.IGNORECASE)]},
+            {"A": ["(?i)^a.*a$"]},


Polars string matching has first-class support for regexes. Setting case-insensitive search in polars is as straightforward as adding a (?i) group.

There doesn't seem to be any documentation indicating that we expose compiled regexes to the user as a way to interact with the catalog, so I think it's safe to drop the use of re.compile here.
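As a quick sanity check on the test change, the sketch below compares the old compiled pattern with the new inline-flag string using only Python's re module, which also accepts a leading (?i). The sample strings are made up for illustration. This only shows that the two spellings agree in Python; it is an assumption, not something verified here, that polars treats every other part of the pattern identically.

```python
import re

# Old test parameter: a compiled pattern with an explicit flag.
old_style = re.compile("^a.*a$", flags=re.IGNORECASE)

# New test parameter: a plain string with an inline case-insensitive flag.
new_style = re.compile("(?i)^a.*a$")

# Hypothetical sample values, not taken from the test suite.
samples = ["aaa", "Abba", "ALPHA", "banana", "ab"]

old_hits = [s for s in samples if old_style.search(s)]
new_hits = [s for s in samples if new_style.search(s)]
```

Both lists contain the same values, so for this pattern the string form carries the same information as the compiled one.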

charles-turner-1 · 2024-11-06T04:55:30Z

tests/test_search.py

@@ -333,7 +333,7 @@ def test_search(query, expected):
                },
                {
                    "A": "cat1",
-                    "B": ["a", "c"],
+                    "B": ["c", "a"],


When searching over list columns in polars, the ordering of the initial list column is preserved.

Whether we would prefer to preserve the initial list order or sort in order of the query as the previous implementation did is an open question - I can see arguments for both.

I've altered the test as I don't think it's important & it reduces the amount of data processing necessary.
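The two ordering options can be sketched in plain Python as follows. This is not the polars code from the PR: the function names and the sample cell are hypothetical, and the point is only to show why the expected value in the test flips from ["a", "c"] to ["c", "a"].

```python
def keep_row_order(cell, terms):
    """Filter a list cell, keeping the order stored in the row."""
    wanted = set(terms)
    return [v for v in cell if v in wanted]


def keep_query_order(cell, terms):
    """Filter a list cell, returning matches in the order of the query."""
    present = set(cell)
    return [t for t in terms if t in present]


# Hypothetical list-column cell and query.
cell = ["c", "a", "b"]
query = ["a", "c"]

row_ordered = keep_row_order(cell, query)      # order comes from the row
query_ordered = keep_query_order(cell, query)  # order comes from the query
```

Keeping the row order needs a single pass over the cell, while returning matches in query order is an extra reordering step, which is the additional data processing mentioned above.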

charles-turner-1 · 2024-11-06T04:55:56Z

tests/test_search.py

@@ -353,7 +353,7 @@ def test_search(query, expected):
                },
                {
                    "A": "cat1",
-                    "B": ["a", "c"],
+                    "B": ["c", "a"],


As above regarding list ordering

charles-turner-1 · 2024-11-06T04:56:34Z

tests/test_search.py

@@ -258,7 +259,6 @@ def test_search(query, expected):
            [
                {"A": "cat0", "B": ["a", "b"], "C": ("cx", "cy"), "D": {0}, "E": "xxx"},
                {"A": "cat1", "B": ["a", "b"], "C": ("cx", "cz"), "D": {0}, "E": "xxx"},
-                {"A": "cat1", "B": ["a"], "C": ("cz", "cy"), "D": {0}, "E": "yyy"},


I believe this test was erroneous: see #68

charles-turner-1 · 2024-11-13T00:50:11Z

I've run a few performance tests. It looks like the polars-based implementation is similarly performant on simple(ish) searches (slightly faster on one, slightly slower on another) & quite a bit faster on more complicated searches: see below

import intake
cat = intake.cat.access_nri

%%timeit
cat.search(variable='(?i).*temp.*')

# Using polars based implementation
# 30.4 ms ± 613 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Using pandas based implementation
# 37 ms ± 934 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*')

# Using polars based implementation
# 87.3 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Using pandas based implementation
# 77 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*', variable='(?i).*global.*')

# Using polars based implementation
# 19.6 ms ± 163 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Using pandas based implementation
# 108 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
cat.search(description='(?i).*access-om2.*|.*jra.*', variable='(?i).*global.*', realm='(?i)seaice')

# Using polars based implementation
# 12.3 ms ± 199 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Using pandas based implementation
# 112 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
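For a rough sense of scale, the mean timings reported above can be turned into pandas/polars ratios. The snippet only does arithmetic on the numbers already quoted; the dictionary keys are shorthand labels for the four queries, not names from the code base.

```python
# (polars_ms, pandas_ms) mean timings copied from the runs above.
timings_ms = {
    "variable": (30.4, 37.0),
    "description": (87.3, 77.0),
    "description + variable": (19.6, 108.0),
    "description + variable + realm": (12.3, 112.0),
}

# Ratio > 1 means the polars implementation was faster for that query.
speedup = {
    name: round(pandas_ms / polars_ms, 2)
    for name, (polars_ms, pandas_ms) in timings_ms.items()
}
```

That works out to roughly 1.2x and 0.9x for the two single-column searches, and about 5.5x and 9.1x for the two multi-column searches.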

charles-turner-1 added 15 commits October 5, 2024 11:36

Added typing & replacing search function with polars version (I don't…

97bcf0f

… understand the numpy & pandas version

Cleaned up polars search - most nesting gone. Still need to get regex…

4cce3e5

… working

Cleaned up polars search & nearly got regex working

ffa8483

Polars search works if we don't have regex cols (havent figured those…

b94877c

… out yet)

Polars search works so long as search terms are all ordered now

8816b7a

Polars search working (I think) - I think tests/test_search.py::test_…

081608d

…search_coluns_with_iterables[query2-True-expected2] is incorrect

Polars search working (I think) - I think tests/test_search.py::test_…

4caf556

…search_columns_with_iterables[query2-True-expected2] is incorrect. Failing on test_search, only working on test_search_columns_with_iterables

All tests but test2 on test_search_columns_with_iterables working

59737d1

Updaetd test to reflect likely error

90d53c7

Added polars dependency

2c62fb2

g This is a combination of 2 commits.

8b322cc

- Restored 'is_pattern' & related tests - Updated search to recurse on pattern searches with collections of search terms

Removed a couple of test_catalog_search instances which don't appear …

ee0ca57

…to make sense & are failing

Fixed type hint in _is_pattern

2553a02

Lazily evaluate query in polars

444e13f

Fixed python>=3.10 type hint syntax

fb9c0e1

charles-turner-1 linked an issue Nov 6, 2024 that may be closed by this pull request

Potential bug in tests/test_search.py/test_search_columns_with_iterables #68

Open

charles-turner-1 marked this pull request as draft November 6, 2024 05:32

Working for all tests except require_all on columns which don't conta…

7564c71

…in iterables

charles-turner-1 mentioned this pull request Nov 7, 2024

Potential bug in tests/test_search.py/test_search_columns_with_iterables #68

Open

charles-turner-1 added 2 commits November 11, 2024 13:44

All tests should be passing now - search function tractable, needs re…

00d1551

…factoring

_search.search significantly cleaner

e70b473

charles-turner-1 marked this pull request as ready for review November 11, 2024 07:56
