Speed up augur filter by replacing Pandas #1574

Open
2 tasks
victorlin opened this issue Aug 9, 2024 · 1 comment

Labels
enhancement New feature or request

Comments

@victorlin
Member

victorlin commented Aug 9, 2024

Context

See parent issue for context on how Pandas is used in augur filter and why it is slow.

The alternative to loading a large dataset fully into memory is to keep it on disk. There is a spectrum of options, which can be divided into two categories:

  1. A Pandas-like alternative such as Dask. I'm unsure how portable the existing Pandas logic is to Dask, but ideally this would be closer to a library swap, with less code change than a full rewrite.
  2. A database file approach such as SQLite. This would require more of a rewrite and would need extensive testing. Note that at least some form of Pandas may still be necessary to continue supporting the --query option (which allows Pandas-based queries and is widely used).
    • I had explored DuckDB (code) but decided against it due to issues with certain characters in metadata. This seems to be resolved now so it may be worth revisiting.
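
To make the second category more concrete, here is a minimal sketch of the database file approach using only Python's standard library. It is not how augur filter works today and not a proposed design. The function names `load_metadata` and `filter_strains`, the table name, and the `strain`, `country`, and `date` columns are all assumptions made for illustration. The idea is to stream the metadata TSV into an on-disk SQLite table and let SQL do the filtering, so the whole file never has to sit in memory.

```python
import csv
import sqlite3


def load_metadata(tsv_path, db_path, table="metadata"):
    """Stream a TSV file into an on-disk SQLite table (hypothetical helper).

    Assumes the first line is a header and every row has the same number
    of fields; a ragged row would make the INSERT fail.
    """
    con = sqlite3.connect(db_path)
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters in metadata values as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        columns = next(reader)
        # Quote identifiers so unusual column names are still valid SQL.
        quoted = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        con.execute(f'DROP TABLE IF EXISTS "{table}"')
        con.execute(f'CREATE TABLE "{table}" ({quoted})')
        # executemany consumes the reader lazily, one row at a time.
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    con.commit()
    return con


def filter_strains(con, country, min_date, table="metadata"):
    """Return strain names for one country on or after min_date.

    Assumes 'strain', 'country', and 'date' columns, with dates stored as
    complete ISO 8601 strings so that text comparison matches date order.
    """
    query = (
        f'SELECT strain FROM "{table}" '
        "WHERE country = ? AND date >= ? ORDER BY strain"
    )
    return [row[0] for row in con.execute(query, (country, min_date))]
```

A call such as `filter_strains(load_metadata("metadata.tsv", "metadata.sqlite3"), "USA", "2020-01-01")` would then run the filter against the file on disk. Note that this sketch covers only simple equality and range filters; it does nothing for --query, which is why some form of Pandas may still be needed.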

Progress

@victorlin victorlin added the enhancement New feature or request label Aug 9, 2024
@j23414
Contributor

j23414 commented Aug 10, 2024

question
Are we exploring Pandas alternatives like Polars? I guess Polars would be part of the first category.
