Speed up augur filter by replacing Pandas #1574

victorlin · 2024-08-09T00:26:27Z

Context

See parent issue for context on how Pandas is used in augur filter and why it is slow.

The alternative to holding a large dataset in memory is to load and keep it on disk. There is a spectrum of options, which fall into two categories:

- A Pandas-like alternative such as Dask. It is unclear how portable the existing Pandas logic is to Dask, but ideally this would be closer to a library swap, with less code change than a full rewrite.
- A database file approach such as SQLite. This would require more of a rewrite and extensive testing. Note that some form of Pandas may still be necessary to keep supporting the --query option (which allows Pandas-based queries and is widely used).
  - I had explored DuckDB (code) but decided against it due to issues with certain characters in metadata. This seems to be resolved now, so it may be worth revisiting.
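
As a rough illustration of the database file approach, the sketch below streams a metadata TSV into an SQLite table in batches and filters with SQL, so the full table never has to sit in memory as a DataFrame. It uses only the Python standard library. The function names, the column names (`strain`, `country`, `date`), and the sample rows are hypothetical and chosen for illustration; this is not how augur filter is implemented.

```python
import csv
import io
import itertools
import os
import sqlite3
import tempfile


def load_metadata(tsv_file, db_path, table="metadata", batch_size=10_000):
    """Stream a TSV file object into an SQLite table in fixed-size batches."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    # Quote identifiers so unusual column names do not break the SQL.
    columns = ", ".join('"{}" TEXT'.format(c.replace('"', '""')) for c in header)
    placeholders = ", ".join("?" for _ in header)
    con = sqlite3.connect(db_path)
    with con:  # commit once all batches are inserted
        con.execute(f'CREATE TABLE "{table}" ({columns})')
        while True:
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', batch)
    return con


def filter_strains(con, country, min_date, table="metadata"):
    """Return strain names for one country on or after an ISO 8601 date."""
    rows = con.execute(
        f'SELECT strain FROM "{table}" '
        "WHERE country = ? AND date >= ? ORDER BY strain",
        (country, min_date),
    )
    return [row[0] for row in rows]


# Hypothetical sample metadata, for illustration only.
SAMPLE_TSV = (
    "strain\tcountry\tdate\n"
    "A/1\tUSA\t2020-01-05\n"
    "B/2\tCanada\t2020-03-10\n"
    "C/3\tUSA\t2020-04-20\n"
    "D/4\tUSA\t2019-12-31\n"
)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        con = load_metadata(io.StringIO(SAMPLE_TSV), os.path.join(tmp, "metadata.sqlite3"))
        print(filter_strains(con, "USA", "2020-01-01"))
        con.close()
```

The date comparison here is a plain string comparison, which only works for complete ISO 8601 dates. Ambiguous dates and Pandas-style `--query` expressions would need separate handling.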

Progress


j23414 · 2024-08-10T14:44:29Z

Are we exploring Pandas alternatives like Polars? I guess Polars would fall into the first category.

victorlin added the enhancement New feature or request label Aug 9, 2024

This was referenced Aug 9, 2024

Speed up augur filter #1575

Open

Speed up augur filter without replacing Pandas #1573

Open
