filter: Rewrite using SQLite3 #1242
base: master
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1242      +/-   ##
==========================================
+ Coverage   67.67%   70.26%   +2.59%
==========================================
  Files          69       71       +2
  Lines        7518     7440      -78
  Branches     1844     1775      -69
==========================================
+ Hits         5088     5228     +140
+ Misses       2158     1895     -263
- Partials      272      317      +45
```

View full report in Codecov by Sentry.
augur/filter/_run.py (Outdated)

```python
if args.output_strains:
    output_strains.close()
import_metadata(args.metadata, args.metadata_id_columns, args.metadata_delimiters)
```
One significant change in this rewrite is that the metadata is no longer processed in chunks. This has implementation benefits, but comes with the downside of substantially increased disk usage.

Previously, the issue was `augur filter` using too much memory, which prompted the current chunked implementation. Now I'm worried about `augur filter` using too much disk space, especially for large datasets such as ncov/open/metadata.tsv.zst, which produced a 12 GB database file in a trial run just now. This should be addressed before merging.

Two ideas to address this:

1. Minimize the size of the database file. The easy win here is to read only a subset of the metadata columns, since most usages of `augur filter` use just a few columns for computation, while the remainder is only needed for final output. I started on this in bc2a030. The difficult part will be determining which columns are used by `--query-pandas`/`--query-sqlite`.
2. Warn users if they don't have enough disk space. This requires estimating how large the database file will be, which isn't trivial. I started on this in b411c58.

I think (1) is worthwhile, and might remove the need for (2) if disk usage is reduced significantly.
Generally concur with you here. Some ideas for implementing the disk usage reduction mitigations of (1)…

For determining which columns are used by `--query-sqlite`, two possibilities come to mind:

1. Parse the SQL query into tokens/an AST and inspect that. Two examples of where I did this previously using the `sqlparse` Python package: checking usage of transactions and checking usage of `SET search_path` statements. While both split a SQL script into a list of statements, they take two tacks from there: the first walks the tokens in each statement, the second matches a regex against each statement. We'd probably want the first tack here.
2. Use SQLite's introspection capabilities to ask what the query will use/do (e.g. with `EXPLAIN` or other methods). I've done this manually before in other database systems (with structured `EXPLAIN` output), but not programmatically in SQLite. It may not be possible and/or feasible, but seems worth looking into, especially if (1) isn't workable for some reason.
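As a rough sketch of the second possibility: SQLite's `EXPLAIN` prints the compiled VDBE program, and each `Column` opcode's `p2` operand is the index of the column being read. The opcode format is an internal detail of SQLite rather than a stable API, so this is illustrative only, and it assumes a single-table query with no indexes.

```python
import sqlite3

def columns_used_by_query(con, table, where_expr, selected="*"):
    """Guess which columns of `table` a query touches by inspecting the
    VDBE program that EXPLAIN returns.

    Sketch only: assumes a single-table query, and relies on EXPLAIN
    output, which is an implementation detail of SQLite.
    """
    names = [row[1] for row in con.execute(f"PRAGMA table_info({table})")]
    sql = f"SELECT {selected} FROM {table} WHERE {where_expr}"
    used = set()
    # EXPLAIN rows are (addr, opcode, p1, p2, p3, p4, p5, comment).
    for addr, opcode, p1, p2, *rest in con.execute("EXPLAIN " + sql):
        if opcode == "Column":
            used.add(names[p2])
    return used
```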
For determining what's used by `--query-pandas`, I'd think to either:

1. Leverage Pandas' internals here (if reasonably possible), as I believe you've prototyped elsewhere…
2. Implement a minimal parser for the Pandas query grammar, e.g. using pyparsing. This doesn't have to support the whole grammar, just enough to extract column names. In a similar vein, I took this approach for Markdown modification in Nextstrain CLI.
3. Rig up a `pandas.DataFrame` with monkeypatching, subclassing, or a proxy object that monitors and inspects column access, and then apply the query to it.
@tsibley thanks for the suggestions! As you mentioned elsewhere, Python's `ast` library seems to work well for pandas queries. I've successfully implemented (1) loading a subset of columns in #1294. If that gets merged, I'll rework this PR around those changes.
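The `ast`-based idea can be sketched in a few lines: parse the query as a Python expression and collect every bare name, treating each as a candidate column. This handles only the subset of pandas query syntax that is valid Python; backtick-quoted column names and `@variable` references would need pre-processing, and the function name here is hypothetical.

```python
import ast

def query_columns(pandas_query):
    """Extract candidate column names from a pandas-style query string.

    Sketch only: covers the Python-compatible subset of the pandas
    query grammar (no backtick-quoted names or `@variable` references).
    """
    tree = ast.parse(pandas_query, mode="eval")
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
    }
```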
This will be used in future commits by a new implementation of augur filter.
These will be used in future commits by a new implementation of augur filter.
To be used in future commits.
This serves multiple purposes:

1. Detect duplicates.
2. Speed up strain-based queries.
3. Provide a marker for a single primary index column (akin to a pandas DataFrame index column).

Also adds DuplicateError, a new exception class to expose duplicates for custom error messages. This will be used in a future commit by a new implementation of augur filter.
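The duplicate-detection idea can be sketched with a UNIQUE index: SQLite itself rejects the index if the column already contains duplicates, which can then be surfaced as a custom error. The class and function names below are illustrative stand-ins, not augur's actual code.

```python
import sqlite3

class DuplicateError(Exception):
    """Stand-in for the DuplicateError described in the commit message."""

def index_on_strain(con, table="metadata", column="strain"):
    """Create a unique index on the primary metadata column.

    A UNIQUE index both speeds up strain-based lookups and makes SQLite
    detect duplicates for us: CREATE UNIQUE INDEX fails with an
    IntegrityError if the column already contains duplicate values.
    """
    try:
        con.execute(
            f'CREATE UNIQUE INDEX {table}_{column}_index ON {table} ("{column}")'
        )
    except sqlite3.IntegrityError as error:
        raise DuplicateError(f"Duplicate values in {table}.{column}") from error
```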
This is unused but can come in handy when debugging a query.
…alue() This will be used in a future commit by a new implementation of augur filter.
… constants.py These variables are temporary to bridge the transition from using shared variables to a shared database. They will be removed in a future commit.
WARNING: This change won't work as-is. Broken out as a separate commit for easier review. Remove metadata input and examples in include_exclude_rules. These are not applicable in the SQLite-based implementation.
WARNING: This change won't work as-is. Broken out as a separate commit for easier review. Remove sequence index parameter in include_exclude_rules. These are not applicable in the SQLite-based implementation. It also removes the need to check for DataFrame instances in kwargs.
This is the final change that replaces the Pandas-based implementation of augur filter with a SQLite3-based implementation.

Breaking changes in behavior (see changes under tests/):

1. `--subsample-seed` is still deterministic but differs from the previous implementation.
2. `--include*`: Previously, both exclusion and force-inclusion would be shown in console output and `--output-log`. Now, only the force-inclusion is reported.

Implementation differences with no functional changes:

1. Tabular files are loaded into tables in a temporary SQLite3 database file on disk rather than Pandas DataFrames. This generally means less memory usage and more disk usage. Tables are indexed on strain.
2. Chunked loading of metadata was introduced to avoid high memory usage¹; that is no longer necessary, and all operations now run on the entire metadata (except for `--query`/`--query-pandas`).
3. For large datasets, the SQLite3 implementation is much faster than the Pandas implementation.
4. Instead of relying on continually updated variables (e.g. `valid_strains`), new tables in the database are created at various stages in the process. The "filter reason table" is used as a source of truth for all outputs (and is conveniently a ~direct representation of `--output-log`). This also allows the function `augur.filter._run.run()` to be further broken into smaller parts.
5. Exclusion/inclusion is done using WHERE expressions.
6. For subsampling, priority queues are no longer necessary, as the highest-priority strains can be determined using an ORDER BY across all strains.
7. Date parsing has improved with caching and a min/max approach to resolving date ranges.

Note that sequence I/O remains unchanged.

¹ 87ca73c
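Point 6 above is the kind of query where SQL makes the priority-queue machinery unnecessary: the highest-priority strains fall out of a single ORDER BY. A minimal sketch with hypothetical table and column names:

```python
import sqlite3

# Hypothetical schema; augur's actual tables and columns may differ.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE metadata (strain TEXT, priority REAL);
    INSERT INTO metadata VALUES
        ('A', 0.1), ('B', 0.9), ('C', 0.5), ('D', 0.7);
""")

# Take the n highest-priority strains directly: no heap required,
# the database sorts and truncates in one pass.
n = 2
top = [
    strain
    for (strain,) in con.execute(
        "SELECT strain FROM metadata ORDER BY priority DESC LIMIT ?", (n,)
    )
]
print(top)  # ['B', 'D']
```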
The re-implementation of augur filter accounts for this directly in filter_by_ambiguous_date(). Remove old tests and add equivalent tests. Note that some date strings have been changed since the previous values would fail date format validation.
Now that disk space usage is uncapped (compared to previous pandas implementation using in-memory chunks), it can be useful to know how large the database file was. However, I'll have to think more about how useful this is once database files are passed in by the user. Ideas: - Mark this as an experimental feature for `augur filter`, to be changed or removed with any version. - Add it to the `augur db` interface, e.g. output of `augur db inspect`. However, it can still be useful to know the sizes of "intermediate" tables. It'd also be useful to add runtime information here. Ideas: - Print run times of each "major" function in real-time. This can probably be achieved by some sort of decorator function.
This adds a new flag to query the SQLite database natively. `--query`/`--query-pandas` will still behave as expected. All Pandas-based query functions are renamed to be Pandas-specific. To avoid breaking changes, alias `--query` to `--query-pandas`.
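The aliasing described here can be done directly in argparse, since an argument may have several option strings sharing one destination. This is a hypothetical sketch of the wiring; augur sets up its parser differently.

```python
import argparse

# Hypothetical parser setup: --query and --query-pandas are the same
# option, so existing workflows using --query keep working unchanged.
parser = argparse.ArgumentParser()
parser.add_argument("--query", "--query-pandas", dest="query_pandas",
                    help="pandas-style filter query")
parser.add_argument("--query-sqlite",
                    help="SQLite WHERE expression for native filtering")

args = parser.parse_args(["--query", "region == 'Europe'"])
print(args.query_pandas)  # region == 'Europe'
```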
Description of proposed changes

A work in progress - see commit messages.

Related issue(s)

Related to #866

Testing

- `nextstrain/base` image from this branch: nextstrain/docker-base@5d4fad9
- `--query` still works for existing workflows.
- `--query-sqlite` works with SQLite query syntax.

Checklist

- Add a `--debug` option to show database file size breakdown