filter: Rewrite using SQLite3 #1242
base: master
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1242      +/-   ##
==========================================
+ Coverage   67.67%   70.26%   +2.59%
==========================================
  Files          69       71       +2
  Lines        7518     7440      -78
  Branches     1844     1775      -69
==========================================
+ Hits         5088     5228     +140
+ Misses       2158     1895     -263
- Partials      272      317      +45
```

View full report in Codecov by Sentry.
augur/filter/_run.py (Outdated)

```python
if args.output_strains:
    output_strains.close()
import_metadata(args.metadata, args.metadata_id_columns, args.metadata_delimiters)
```
One significant change in this rewrite is that the metadata is no longer processed in chunks. This has implementation benefits, but comes with the downside of substantially increased disk usage.

Previously, the issue was `augur filter` using too much memory, which prompted the current chunked implementation. Now I'm worried about `augur filter` using too much disk space, especially for large datasets such as ncov/open/metadata.tsv.zst, which produced a 12 GB database file in a trial run just now. This should be addressed before merging.

Two ideas to address this:

1. Minimize the size of the database file. The easy win here is to read only a subset of the metadata columns, since most usages of `augur filter` use just a few columns for computation, while the remainder is only needed for final output. I started on this in bc2a030. The difficult part will be determining which columns are used by `--query-pandas`/`--query-sqlite`.
2. Warn users if they don't have enough disk space. This requires estimating how large the database file will be, which isn't trivial. I started on this in b411c58.

I think (1) is worthwhile, and might remove the need for (2) if disk usage is reduced significantly.
Generally concur with you here. Some ideas for implementing the disk usage reduction mitigations of (1)…

For determining which columns are used by `--query-sqlite`, two possibilities come to mind:

1. Parse the SQL query into tokens/an AST and inspect that. Two examples of where I did this previously using the `sqlparse` Python package: checking usage of transactions and checking usage of `SET search_path` statements. While both split a SQL script into a list of statements, they take two tacks from there: the first walks the tokens in each statement, the second matches a regex against each statement. We'd probably want the first tack here.
2. Use SQLite's introspection capabilities to ask what the query will use/do (e.g. with `EXPLAIN` or other methods). I've done this manually before in other database systems (with structured `EXPLAIN` output), but not programmatically in SQLite. It may not be possible and/or feasible, but seems worth looking into, especially if (1) isn't workable for some reason.
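As a rough sketch of the second possibility: SQLite's `EXPLAIN` prints the compiled VDBE program, and each `Column` opcode's `p2` operand is the index of the column being read. The opcode format is an internal detail of SQLite rather than a stable API, so this is illustrative only, and it assumes a single-table query with no indexes.

```python
import sqlite3

def columns_used_by_query(con, table, where_expr, selected="*"):
    """Guess which columns of `table` a query touches by inspecting the
    VDBE program that EXPLAIN returns.

    Sketch only: assumes a single-table query, and relies on EXPLAIN
    output, which is an implementation detail of SQLite.
    """
    names = [row[1] for row in con.execute(f"PRAGMA table_info({table})")]
    sql = f"SELECT {selected} FROM {table} WHERE {where_expr}"
    used = set()
    # EXPLAIN rows are (addr, opcode, p1, p2, p3, p4, p5, comment).
    for addr, opcode, p1, p2, *rest in con.execute("EXPLAIN " + sql):
        if opcode == "Column":
            used.add(names[p2])
    return used
```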
For determining what's used by `--query-pandas`, I'd think to either:

1. Leverage Pandas' internals here (if reasonably possible), as I believe you've prototyped elsewhere…
2. Implement a minimal parser for the Pandas query grammar, e.g. using pyparsing. This doesn't have to support the whole grammar, just enough to extract column names. In a similar vein, I took this approach for Markdown modification in Nextstrain CLI.
3. Rig up a `pandas.DataFrame` with monkeypatching, subclassing, or a proxy object that monitors and inspects column access, and then apply the query to it.
@tsibley thanks for the suggestions! As you mentioned elsewhere, Python's `ast` library seems to work well for pandas queries. I've successfully implemented (1) loading a subset of columns in #1294. If that gets merged, I'll rework this PR around those changes.
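The `ast`-based idea can be sketched in a few lines: parse the query as a Python expression and collect every bare name, treating each as a candidate column. This handles only the subset of pandas query syntax that is valid Python; backtick-quoted column names and `@variable` references would need pre-processing, and the function name here is hypothetical.

```python
import ast

def query_columns(pandas_query):
    """Extract candidate column names from a pandas-style query string.

    Sketch only: covers the Python-compatible subset of the pandas
    query grammar (no backtick-quoted names or `@variable` references).
    """
    tree = ast.parse(pandas_query, mode="eval")
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
    }
```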
This will be used in future commits by a new implementation of augur filter.
These will be used in future commits by a new implementation of augur filter.
To be used in future commits.
This serves multiple purposes:

1. Detect duplicates.
2. Speed up strain-based queries.
3. Provide a marker for a single primary index column (akin to a pandas DataFrame index column).

Also adds DuplicateError, a new exception class to expose duplicates for custom error messages. This will be used in a future commit by a new implementation of augur filter.
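The duplicate-detection idea can be sketched with a UNIQUE index: SQLite itself rejects the index if the column already contains duplicates, which can then be surfaced as a custom error. The class and function names below are illustrative stand-ins, not augur's actual code.

```python
import sqlite3

class DuplicateError(Exception):
    """Stand-in for the DuplicateError described in the commit message."""

def index_on_strain(con, table="metadata", column="strain"):
    """Create a unique index on the primary metadata column.

    A UNIQUE index both speeds up strain-based lookups and makes SQLite
    detect duplicates for us: CREATE UNIQUE INDEX fails with an
    IntegrityError if the column already contains duplicate values.
    """
    try:
        con.execute(
            f'CREATE UNIQUE INDEX {table}_{column}_index ON {table} ("{column}")'
        )
    except sqlite3.IntegrityError as error:
        raise DuplicateError(f"Duplicate values in {table}.{column}") from error
```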
This is unused but can come in handy when debugging a query.
…alue() This will be used in a future commit by a new implementation of augur filter.
… constants.py These variables are temporary to bridge the transition from using shared variables to a shared database. They will be removed in a future commit.
WARNING: This change won't work as-is. Broken out as a separate commit for easier review. Remove metadata input and examples in include_exclude_rules. These are not applicable in the SQLite-based implementation.
WARNING: This change won't work as-is. Broken out as a separate commit for easier review. Remove sequence index parameter in include_exclude_rules. These are not applicable in the SQLite-based implementation. It also removes the need to check for DataFrame instances in kwargs.
This is the final change that replaces the Pandas-based implementation of augur filter with a SQLite3-based implementation.

Breaking changes in behavior (see changes under tests/):

1. `--subsample-seed` is still deterministic but differs from the previous implementation.
2. `--include*`: Previously, both exclusion and force-inclusion would be shown in console output and `--output-log`. Now, only the force-inclusion is reported.

Implementation differences with no functional changes:

1. Tabular files are loaded into tables in a temporary SQLite3 database file on disk rather than Pandas DataFrames. This generally means less memory usage and more disk usage. Tables are indexed on strain.
2. Chunked loading of metadata was introduced to avoid high memory usage¹; that is no longer necessary, and all operations now run on the entire metadata (except for `--query`/`--query-pandas`).
3. For large datasets, the SQLite3 implementation is much faster than the Pandas implementation.
4. Instead of relying on continually updated variables (e.g. `valid_strains`), new tables in the database are created at various stages in the process. The "filter reason table" is used as a source of truth for all outputs (and is conveniently a ~direct representation of `--output-log`). This also allows the function `augur.filter._run.run()` to be further broken into smaller parts.
5. Exclusion/inclusion is done using WHERE expressions.
6. For subsampling, priority queues are no longer necessary, as the highest-priority strains can be determined using an ORDER BY across all strains.
7. Date parsing has improved with caching and a min/max approach to resolving date ranges.

Note that sequence I/O remains unchanged.

¹ 87ca73c
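Point 6 above is the kind of query where SQL makes the priority-queue machinery unnecessary: the highest-priority strains fall out of a single ORDER BY. A minimal sketch with hypothetical table and column names:

```python
import sqlite3

# Hypothetical schema; augur's actual tables and columns may differ.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE metadata (strain TEXT, priority REAL);
    INSERT INTO metadata VALUES
        ('A', 0.1), ('B', 0.9), ('C', 0.5), ('D', 0.7);
""")

# Take the n highest-priority strains directly: no heap required,
# the database sorts and truncates in one pass.
n = 2
top = [
    strain
    for (strain,) in con.execute(
        "SELECT strain FROM metadata ORDER BY priority DESC LIMIT ?", (n,)
    )
]
print(top)  # ['B', 'D']
```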
The re-implementation of augur filter accounts for this directly in filter_by_ambiguous_date(). Remove old tests and add equivalent tests. Note that some date strings have been changed since the previous values would fail date format validation.
Now that disk space usage is uncapped (compared to previous pandas implementation using in-memory chunks), it can be useful to know how large the database file was. However, I'll have to think more about how useful this is once database files are passed in by the user. Ideas: - Mark this as an experimental feature for `augur filter`, to be changed or removed with any version. - Add it to the `augur db` interface, e.g. output of `augur db inspect`. However, it can still be useful to know the sizes of "intermediate" tables. It'd also be useful to add runtime information here. Ideas: - Print run times of each "major" function in real-time. This can probably be achieved by some sort of decorator function.
This adds a new flag to query the SQLite database natively. `--query`/`--query-pandas` will still behave as expected. All Pandas-based query functions are renamed to be Pandas-specific. To avoid breaking changes, alias `--query` to `--query-pandas`.
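The aliasing described here can be done directly in argparse, since an argument may have several option strings sharing one destination. This is a hypothetical sketch of the wiring; augur sets up its parser differently.

```python
import argparse

# Hypothetical parser setup: --query and --query-pandas are the same
# option, so existing workflows using --query keep working unchanged.
parser = argparse.ArgumentParser()
parser.add_argument("--query", "--query-pandas", dest="query_pandas",
                    help="pandas-style filter query")
parser.add_argument("--query-sqlite",
                    help="SQLite WHERE expression for native filtering")

args = parser.parse_args(["--query", "region == 'Europe'"])
print(args.query_pandas)  # region == 'Europe'
```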
Description of proposed changes

A work in progress - see commit messages.

Related issue(s)

Related to #866

Testing

- `nextstrain/base` image from this branch: nextstrain/docker-base@5d4fad9
- `--query` still works for existing workflows.
- `--query-sqlite` works with SQLite query syntax.

Checklist

- Add a `--debug` option to show database file size breakdown