
feat: make Table.cache() a no-op for tables that are already concrete in a backend #6195

Open
jcrist opened this issue May 11, 2023 · 4 comments · May be fixed by #9976
Labels: ddl (Issues related to creating or altering data definitions), feature (Features or general enhancements)

jcrist commented May 11, 2023

Currently Table.cache() will result in a new copy of the data being stored in the backend, even if the data is already a "concrete" table in the backend.

Ideally if a table is already concrete (backed by a physical table, not a view, in the corresponding backend) then table.cache() would be a no-op. This would better enable writing generic functions that make use of .cache without unnecessarily duplicating data.

t = con.table("my_table")  # a concrete table

_ = t.cache()  # a no-op

_ = t.mutate(foo=t.bar + 1).cache()  # not a no-op, since the derived expression isn't backed by a physical table

t = con.table("some-view")  # a view

_ = t.cache()  # not a no-op, since `t` is a view
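The rule sketched above can be modeled in a few lines. This is a hypothetical illustration, not the actual ibis internals: the class names (`PhysicalTable`, `View`, `Mutate`) stand in for ibis operation nodes, and `cache_is_noop` encodes the proposed decision.

```python
# Hypothetical sketch of the proposed rule: .cache() is a no-op only when
# the expression is already backed by a physical table in the backend.
from dataclasses import dataclass


@dataclass
class Op:
    """Stand-in for an ibis operation node."""


@dataclass
class PhysicalTable(Op):
    name: str


@dataclass
class View(Op):
    name: str


@dataclass
class Mutate(Op):
    parent: Op


def cache_is_noop(op: Op) -> bool:
    # Only expressions whose root op is a physical table skip caching;
    # views and derived expressions still get materialized.
    return isinstance(op, PhysicalTable)


assert cache_is_noop(PhysicalTable("my_table"))
assert not cache_is_noop(View("some-view"))
assert not cache_is_noop(Mutate(PhysicalTable("my_table")))
```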
@jcrist jcrist added feature Features or general enhancements ddl Issues related to creating or altering data definitions labels May 11, 2023
cpcloud commented Jul 24, 2024

@jcrist I can't remember, are we not already doing this?

jcrist commented Jul 24, 2024

We are not. t.cache(); t.cache() won't cache a table twice, but the first call is not a no-op even when the table is already a physical table.

jcrist commented Aug 26, 2024

This is easy to do if we assume that PhysicalTable ops don't need to be cached (we might also expand this to simple column subselections on these tables, like t.select("a", "b", "c"), which should be cheap), while all other expressions get cached.

However, this assumption isn't true, since some backends (e.g. duckdb) return views for read_csv/read_parquet, which won't be as cheap to access for repeat queries as a temporary table produced by .cache(). Since these views are also mapped to DatabaseTable, it's not easy to determine whether a PhysicalTable is really "physical" or not. In #9931 I note that I (and some users' code I've seen, like @lostmygithubaccount's) have sometimes used read_csv(...).cache() to create a temp table in duckdb, since by default read_csv(...) just creates a view. It'd be nice to keep that workflow working while reducing overhead for cases where it's not needed.

A few options:

  • Add a new IR node for views (possibly a subclass of DatabaseTable) and use it for those APIs instead. This would make it easier to determine whether an op is worth caching.
  • Inside _load_into_cache, query more info about the backing table to determine whether it's cheap to access, and if so take a fast path that avoids creating the temp table.

Neither seems terribly onerous (mostly just plumbing), but the latter is much less invasive.
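The second option amounts to asking the backend's catalog whether the backing object is a real table or a view. A minimal sketch of that check, using stdlib sqlite3 as a stand-in backend (the actual backends and the `is_physical_table` helper name are assumptions for illustration):

```python
# Sketch of option 2: query the backend's catalog to decide whether
# .cache() can take the fast path. sqlite3 stands in for a real backend.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (a INTEGER)")
con.execute("CREATE VIEW some_view AS SELECT a FROM my_table")


def is_physical_table(con: sqlite3.Connection, name: str) -> bool:
    # Look up the object's type in the catalog; only real tables
    # qualify for the cache() fast path.
    row = con.execute(
        "SELECT type FROM sqlite_master WHERE name = ?", (name,)
    ).fetchone()
    return row is not None and row[0] == "table"


assert is_physical_table(con, "my_table")       # fast path: skip the temp table
assert not is_physical_table(con, "some_view")  # a view: still worth caching
```

In duckdb the equivalent lookup would go through its own catalog (e.g. the information_schema), but the shape of the check is the same.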

jcrist commented Aug 26, 2024

I'm going to try implementing the latter (probably for a small subset of backends to start) and see how it looks. I think this should be doable without being too invasive.
