
feat: make Table.cache() a no-op for tables that are already concrete in a backend #6195

Open
jcrist opened this issue May 11, 2023 · 4 comments · May be fixed by #9976
Labels: ddl (Issues related to creating or altering data definitions), feature (Features or general enhancements)

jcrist commented May 11, 2023

Currently Table.cache() will result in a new copy of the data being stored in the backend, even if the data is already a "concrete" table in the backend.

Ideally if a table is already concrete (backed by a physical table, not a view, in the corresponding backend) then table.cache() would be a no-op. This would better enable writing generic functions that make use of .cache without unnecessarily duplicating data.

t = con.table("my_table")  # a concrete table

_ = t.cache()  # a no-op

_ = t.mutate(foo=t.bar + 1).cache()  # not a no-op, since the derived expression isn't backed by a physical table

t = con.table("some-view")  # a view

_ = t.cache()  # not a no-op, since `t` is a view
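The rule sketched above can be modeled in a few lines. This is a hypothetical illustration, not the actual ibis internals: the class names (`PhysicalTable`, `View`, `Mutate`) stand in for ibis operation nodes, and `cache_is_noop` encodes the proposed decision.

```python
# Hypothetical sketch of the proposed rule: .cache() is a no-op only when
# the expression is already backed by a physical table in the backend.
from dataclasses import dataclass


@dataclass
class Op:
    """Stand-in for an ibis operation node."""


@dataclass
class PhysicalTable(Op):
    name: str


@dataclass
class View(Op):
    name: str


@dataclass
class Mutate(Op):
    parent: Op


def cache_is_noop(op: Op) -> bool:
    # Only expressions whose root op is a physical table skip caching;
    # views and derived expressions still get materialized.
    return isinstance(op, PhysicalTable)


assert cache_is_noop(PhysicalTable("my_table"))
assert not cache_is_noop(View("some-view"))
assert not cache_is_noop(Mutate(PhysicalTable("my_table")))
```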
@jcrist jcrist added feature Features or general enhancements ddl Issues related to creating or altering data definitions labels May 11, 2023
cpcloud commented Jul 24, 2024

@jcrist I can't remember, are we not already doing this?

jcrist commented Jul 24, 2024

We are not. t.cache(); t.cache() won't cache a table twice, but the first call is not a no-op even when the table is already a physical table.

jcrist commented Aug 26, 2024

This is easy to do if we assume that PhysicalTable ops don't need to be cached (we might also expand this to simple column subselections on these tables, like t.select("a", "b", "c"), which should be cheap), while all other expressions get cached.

However, this assumption isn't true, since some backends (e.g. duckdb) return views for read_csv/read_parquet, which won't be as cheap to access for repeat queries as a temporary table produced by .cache(). Since these views are also mapped to DatabaseTable, it's not easy to determine whether a PhysicalTable is really "physical" or not. In #9931 I note that I (and some users' code I've seen, like @lostmygithubaccount's) have sometimes used read_csv(...).cache() to create a temp table in duckdb, since by default read_csv(...) just creates a view. It'd be nice to keep that workflow working while reducing overhead for cases where it's not needed.

A few options:

  • Add a new IR node for views (possibly a subclass of DatabaseTable) and use it for those APIs instead. This would make it easier to determine whether an op is worth caching.
  • Inside _load_into_cache, query more info about the backing table to determine whether it's cheap to access, and if so take a fast path that avoids creating the temp table.

Neither seems terribly onerous (mostly just plumbing), but the latter is much less invasive.
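The second option amounts to asking the backend's catalog whether the backing object is a real table or a view. A minimal sketch of that check, using stdlib sqlite3 as a stand-in backend (the actual backends and the `is_physical_table` helper name are assumptions for illustration):

```python
# Sketch of option 2: query the backend's catalog to decide whether
# .cache() can take the fast path. sqlite3 stands in for a real backend.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (a INTEGER)")
con.execute("CREATE VIEW some_view AS SELECT a FROM my_table")


def is_physical_table(con: sqlite3.Connection, name: str) -> bool:
    # Look up the object's type in the catalog; only real tables
    # qualify for the cache() fast path.
    row = con.execute(
        "SELECT type FROM sqlite_master WHERE name = ?", (name,)
    ).fetchone()
    return row is not None and row[0] == "table"


assert is_physical_table(con, "my_table")       # fast path: skip the temp table
assert not is_physical_table(con, "some_view")  # a view: still worth caching
```

In duckdb the equivalent lookup would go through its own catalog (e.g. the information_schema), but the shape of the check is the same.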

jcrist commented Aug 26, 2024

I'm going to try implementing the latter (probably for a small subset of backends to start) and see how it looks. I think this should be doable without being too invasive.
