See parent issue for context on how Pandas is used in augur filter and why it is slow.
The alternative way of working with large datasets is to load and keep them on disk. There is a spectrum of options, which can be divided into two categories:
Pandas-like alternative such as Dask. Unsure how portable the existing Pandas logic is to Dask, but ideally this would be closer to a library swap with less code change than a full rewrite.
Database file approach such as SQLite. This would require more of a rewrite and needs extensive testing. Note that at least some form of Pandas may still be necessary to continue supporting the --query option (which allows Pandas-based queries and is widely used).
I explored DuckDB (code) but decided against it because of issues with certain characters in metadata. This seems to be resolved now, so it may be worth revisiting.
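To make the database file approach more concrete, here is a minimal sketch using only Python's standard library (`csv` and `sqlite3`). It is not how augur filter works today: the function names `load_metadata` and `filter_strains`, the table name `metadata`, and the `strain` column are all assumptions for illustration. Every column is stored as TEXT and rows are streamed from the TSV, so the whole file never has to sit in memory.

```python
import csv
import sqlite3


def load_metadata(tsv_path, db_path, table="metadata"):
    """Stream a TSV file into an on-disk SQLite table (hypothetical helper).

    Assumes the first row is a header and every row has the same number
    of fields as the header. All columns are stored as TEXT.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters in metadata values as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Quote identifiers so column names with spaces or punctuation are safe.
        columns = ", ".join(
            '"{}" TEXT'.format(name.replace('"', '""')) for name in header
        )
        placeholders = ", ".join("?" for _ in header)
        connection = sqlite3.connect(db_path)
        with connection:  # commit once after all rows are inserted
            connection.execute(f'DROP TABLE IF EXISTS "{table}"')
            connection.execute(f'CREATE TABLE "{table}" ({columns})')
            connection.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader
            )
    return connection


def filter_strains(connection, where_clause, parameters=(), table="metadata"):
    """Return values of the (assumed) "strain" column for rows matching a SQL condition."""
    query = f'SELECT "strain" FROM "{table}" WHERE {where_clause}'
    return [row[0] for row in connection.execute(query, parameters)]
```

A call such as `filter_strains(connection, "country = ?", ("USA",))` would then run the filter inside SQLite rather than in a DataFrame. Note that this does nothing for `--query`: those expressions use Pandas syntax, so they would still need Pandas or a translation layer.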