
filter: re-implement with SQLite and add option to use with --engine #854

Closed · wants to merge 19 commits

Conversation

@victorlin (Member) commented Feb 17, 2022

Description of proposed changes

Related issue(s)

Testing

  • Rewrite tests for new implementation
  • Modify functional tests to check --non-nucleotide and incomplete date
  • Add tests to improve coverage and verify functionality.

TODO

for later:

  • parse dates only if needed

codecov bot commented Feb 17, 2022

Codecov Report

Merging #854 (8acea55) into master (2e4aa88) will decrease coverage by 20.89%.
The diff coverage is 85.84%.

❗ Current head 8acea55 differs from pull request most recent head 0385409. Consider uploading reports for the commit 0385409 to get more accurate results

@@             Coverage Diff             @@
##           master     #854       +/-   ##
===========================================
- Coverage   59.64%   38.75%   -20.90%     
===========================================
  Files          53       49        -4     
  Lines        6317     6379       +62     
  Branches     1586     1608       +22     
===========================================
- Hits         3768     2472     -1296     
- Misses       2286     3802     +1516     
+ Partials      263      105      -158     
| Impacted Files | Coverage Δ |
|---|---|
| augur/frequencies.py | 10.25% <33.33%> (-41.42%) ⬇️ |
| augur/filter_support/db/base.py | 67.94% <67.94%> (ø) |
| augur/utils.py | 42.54% <83.33%> (-21.72%) ⬇️ |
| augur/filter_support/output.py | 84.61% <84.61%> (ø) |
| augur/io_support/db/sqlite.py | 89.13% <89.13%> (ø) |
| augur/io_support/db/__init__.py | 92.15% <92.15%> (ø) |
| augur/filter_support/db/sqlite.py | 95.43% <95.43%> (ø) |
| augur/dates.py | 97.61% <97.34%> (+7.76%) ⬆️ |
| augur/filter_support/subsample.py | 97.43% <97.43%> (ø) |
| augur/filter.py | 100.00% <100.00%> (+4.17%) ⬆️ |
| ... and 52 more | |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e4aa88...0385409.

@tsibley (Member) commented Feb 17, 2022

A few quick notes I had during our call just now:

  1. Much of the templated SQL seems to interpolate values directly without proper quoting. If interpolation is used, quoting is essential, but using placeholders would be ideal.

  2. Similarly, much of the templated SQL seems to interpolate identifiers (column names, table names) without quoting. At the very least it's essential to quote any identifiers that come from the user. It's ok to skip quoting things that we know don't need it, like our constants, although I think it's often easier to blanket quote it all and then not have to worry about it in the future.

  3. It's worth checking how often load_tsv() commits the inserts. Especially during bulk inserts it can be beneficial to commit every X rows (where X is large but not too large); see the sketch after this list. A quick check suggests to me that load_tsv() doesn't commit at all?

  4. Why register date handling functions from Python into SQLite instead of using SQLite's built-in date handling functions? (Or string handling functions, I guess, since that's all the Python is doing.)
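On point 3, a minimal sketch of batched inserts with periodic commits; the table and column names here are placeholders for illustration, not the actual load_tsv() code:

```python
import itertools
import sqlite3

def insert_in_batches(connection: sqlite3.Connection, rows, batch_size=10_000):
    """Insert rows in batches, committing after each batch so no single
    transaction grows unbounded. `rows` can be any iterable of tuples."""
    rows = iter(rows)
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            break
        connection.executemany(
            "INSERT INTO metadata (strain, date) VALUES (?, ?)",  # illustrative table
            batch,
        )
        connection.commit()
```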

@victorlin (Member, Author)

@tsibley

> 1. Much of the templated SQL seems to interpolate values directly without proper quoting. If interpolation is used, quoting is essential, but using placeholders would be ideal.

Could you clarify what you mean by using interpolation vs. placeholders?

> 2. Similarly, much of the templated SQL seems to interpolate identifiers (column names, table names) without quoting. At the very least it's essential to quote any identifiers that come from the user. It's ok to skip quoting things that we know don't need it, like our constants, although I think it's often easier to blanket quote it all and then not have to worry about it in the future.

Good point! I did this for the ID and date columns here 54d6ddf...a83a17e, but this should at least be expanded to user input such as in --exclude-where:

SELECT "{self.metadata_id_column}"
FROM {METADATA_TABLE_NAME}
WHERE {column} {op} '{value}'

And agreed on blanket quote.
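For concreteness, a hedged sketch of that query with the value bound as a placeholder instead of interpolated. The function and argument names are illustrative, not existing augur code, and op is assumed to come from a small trusted set (e.g. = or !=):

```python
import sqlite3

def strains_matching(connection: sqlite3.Connection, id_column, metadata_table, column, op, value):
    # Identifiers are double-quoted (escaping discussed below); the value is
    # bound with "?" so it is never interpolated into the SQL text.
    query = f'SELECT "{id_column}" FROM "{metadata_table}" WHERE "{column}" {op} ?'
    return connection.execute(query, (value,)).fetchall()
```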

> 3. It's worth checking how often load_tsv() commits the inserts. Especially during bulk inserts it can be beneficial to commit every X rows (where X is large but not too large). A quick check suggests to me that load_tsv() doesn't commit at all?

Yeah, I don't know where I got the false notion that commit isn't necessary with executemany(). I'll do some experimenting with different data loading options as we discussed.

> 4. Why register date handling functions from Python into SQLite instead of using SQLite's built-in date handling functions? (Or string handling functions, I guess, since that's all the Python is doing.)

Mostly because I imagined the complexity of writing get_date_max() in SQL and thought it would be good to stick to existing logic implemented in Python.

However, date parsing now takes up a decent amount of time (2nd to loading metadata), so it might be worth trying out a SQL implementation.

@tsibley (Member) commented Feb 18, 2022

> Could you clarify what you mean by using interpolation vs. placeholders?

cur.execute(f"select * from x where y = '{z}'")       # interpolation (problematic)
cur.execute("select * from x where y = ?", (z,))      # placeholder
cur.execute("select * from x where y = :z", {"z": z}) # placeholder

Placeholders are ubiquitous in SQL interfaces, with ? being the most common syntax. SQLite supports several forms.

> Good point! I did this for the ID and date columns here 54d6ddf...a83a17e

Ah, surrounding identifiers in double quotes is a start, but any double quotes inside the identifier also need to be escaped by doubling them up, e.g.

identifier.replace('"', '""')

before interpolation.
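As a sketch, that could live in a small helper (the name quote_identifier is an assumption, not an existing augur utility):

```python
def quote_identifier(name: str) -> str:
    """Quote an SQL identifier for SQLite, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

# e.g. quote_identifier('region name') -> '"region name"'
#      quote_identifier('odd"name')    -> '"odd""name"'
```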

> Yeah, I don't know where I got the false notion that commit isn't necessary with executemany(). I'll do some experimenting with different data loading options as we discussed.

👍

The default transaction handling of Python's sqlite3 module is a bit unusual: it's not auto-commit (which is good), but it is auto-begin for DML, which is a weird combo. IME, it is easier to reason about transactions by disabling that default with isolation_level = None and using explicit transaction handling in our code. It's then clear when you need to issue a commit: whenever you've issued a begin earlier. This can be wrapped up into a context manager too (extending the default context manager for connections just a bit).
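A sketch of what that could look like (assuming a connection opened with isolation_level=None; names are illustrative):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def transaction(connection: sqlite3.Connection):
    """Explicit BEGIN/COMMIT, with ROLLBACK on error. Assumes the connection
    was opened with isolation_level=None so sqlite3 never auto-begins."""
    connection.execute("BEGIN")
    try:
        yield connection
        connection.execute("COMMIT")
    except BaseException:
        connection.execute("ROLLBACK")
        raise

connection = sqlite3.connect(":memory:", isolation_level=None)
connection.execute("CREATE TABLE t (x)")
with transaction(connection):
    connection.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
```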

> Mostly because I imagined the complexity of writing get_date_max() in SQL and thought it would be good to stick to existing logic implemented in Python.
>
> However, date parsing now takes up a decent amount of time (2nd to loading metadata), so it might be worth trying out a SQL implementation.

Yeah, the time the date table creation takes makes me think a lot of it is spent shuttling between SQLite (C) and Python and back for those functions. I agree get_date_max() is not enticing to write in SQL, but (especially initially) it doesn't have to be all or nothing: we can register get_date_max() from Python but use built-in funcs for the others.
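For instance, a sketch of that mixed approach; get_date_max here is a crude stand-in for the real augur helper, purely for illustration:

```python
import sqlite3

def get_date_max(date_string):
    # Crude stand-in: resolve an ambiguous date like "2018-03-XX" to its
    # latest possible value. The real logic lives in augur's Python code.
    return date_string.replace("XX", "31")

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE metadata (strain TEXT, date TEXT)")
connection.execute("INSERT INTO metadata VALUES ('A/1', '2018-03-XX')")

# Only the awkward-to-express function is registered from Python;
# unambiguous parts can use SQLite built-ins such as strftime().
connection.create_function("get_date_max", 1, get_date_max)
print(connection.execute(
    "SELECT get_date_max(date), strftime('%Y', '2018-03-25') FROM metadata"
).fetchall())
```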

@victorlin (Member, Author)

Force-pushing date parsing improvements (pre-rebase d3e466d...903d92e)

@victorlin (Member, Author)

Force-pushing date parsing improvements and force-include reporting (pre-rebase 903d92e...911e137)

@victorlin (Member, Author)

@tsibley

After some more experimenting and optimization of the current code, I've decided not to pursue native SQLite for date logic:

  1. Caching the Python user-defined functions brings the total run time of date table creation for current GISAID data down from 58s without caching to 12s with caching on my machine (see the sketch after this list). This should be sufficient.

  2. Implementing this in SQLite is not very straightforward given its limited built-in functionality. To get year/month/day from an ISO 8601 date string:

    with t1 as (
        select
            substr("2018-03-25", 0, instr("2018-03-25", '-')) as year,
            substr("2018-03-25", instr("2018-03-25", '-') + 1) as no_year
    )
    select
        t1.year,
        substr(no_year, 0, instr(no_year, '-')) as month,
        substr(no_year, instr(no_year, '-') + 1) as day
    from t1;

    and this is without error handling (returning NULL if not parseable).

  3. Python functions have more potential to be re-used in other areas.
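On point 1, a minimal sketch of the caching idea; functools.lru_cache is one way to do it, get_date_min is a hypothetical counterpart to get_date_max, and the body below is placeholder logic rather than the real parser:

```python
import sqlite3
from functools import lru_cache

@lru_cache(maxsize=None)
def get_date_min(date_string):
    # Placeholder logic: each distinct date string is parsed only once,
    # which pays off because metadata repeats the same dates many times.
    return date_string.replace("XX", "01")

connection = sqlite3.connect(":memory:")
# SQLite calls straight into the cached Python function.
connection.create_function("get_date_min", 1, get_date_min)
```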

@tsibley (Member) commented Feb 25, 2022

@victorlin Ok, sounds good. These are great candidates for caching, indeed.

That example SQL is a little more tortuous than what I was expecting, which was to use a date/time function or a string handling function to extract the parts from an ISO date, e.g.:

sqlite> select cast(strftime('%m', '2018-03-25') as int) as month
3

Though I realize now it would be a little more complicated than that with date/time funcs thanks to our need to handle forms like 2018-03-XX and 2018 too. String handling (like the Python versions do) would sidestep those issues, but I see now that SQLite doesn't have a split() or equivalent (which is a bit boggling).

@victorlin (Member, Author)

Force-pushing updated tests, SQL identifier sanitizing, connection handling improvements (pre-rebase 911e137...3eb8c98)

@victorlin (Member, Author)

I investigated some data loading options today:

  1. executemany with a generator (I was doing this wrong until 6e14640 😬)
  2. isolation_level=None with explicit begin/commit
  3. PRAGMA journal_mode = WAL
  4. PRAGMA synchronous = OFF

Testing on my local machine, executemany without any non-default pragmas yields the fastest run time (the difference with synchronous = OFF is negligible, and I was surprised that journal_mode = WAL was slightly but noticeably slower).

Due to the single-threaded and sequential nature of augur filter (load data -> compute intermediate tables -> write outputs), there isn't a need for concurrent readers, which seems to be the main reason people use PRAGMA journal_mode = WAL. I don't see any logical way to restructure things so that concurrency would make sense, since there is no need to read until everything is loaded. The only benefit would be concurrent writers for a potential speed-up, which isn't a big win given the single write lock.

For proper auto-committing, I changed to use the default connection context manager rather than sharing one connection across calls.
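For reference, a sketch of option 1 combined with that context manager usage; the file name, table, and two-column layout are assumptions for illustration:

```python
import csv
import sqlite3

def tsv_rows(path):
    """Yield (strain, date) pairs lazily so the file never sits in memory whole."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            yield row[0], row[1]  # assumes strain and date are the first two columns

connection = sqlite3.connect("metadata.db")
with connection:  # commits on success, rolls back if an exception escapes
    connection.execute("CREATE TABLE IF NOT EXISTS metadata (strain TEXT, date TEXT)")
    connection.executemany(
        "INSERT INTO metadata VALUES (?, ?)",
        tsv_rows("metadata.tsv"),  # generator is consumed row by row
    )
connection.close()
```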


There is potential to re-use the same metadata for cases like subsampling schemes in the ncov workflow, which prompts "concurrency", but for that I think it's easier to just create copies of the database file rather than sharing a single one:

  1. Current SQLite logic relies on creating intermediate tables with predefined names.
  2. Sharing a database file would require concurrent writes to the file, which is tricky.

The alternative process would require a new pre-compute augur command similar to augur index, and may look something like:

augur load metadata.tsv db1.sqlite3
cp db1.sqlite3 db2.sqlite3
# the following can be run in parallel
augur filter --metadata-db db1.sqlite3 --query 'country = "USA"' --output-strains usa.txt
augur filter --metadata-db db2.sqlite3 --query 'country != "USA"' --output-strains non-usa.txt

This would come in a later PR; the point being that there should be no need to worry about supporting concurrency in this PR.

@victorlin (Member, Author)

Force-pushing additional tests, styling fixes, improved type hints and docstrings (pre-rebase 3eb8c98...a73aec3)

@victorlin (Member, Author)

Force-pushing rebase onto master (equivalent merge commit 4237047)

@tsibley (Member) commented May 4, 2022

> …but for that I think it's easier to just create copies of the database file rather than sharing a single one:
>
> 1. Current SQLite logic relies on creating intermediate tables with predefined names.
> 2. Sharing a database file would require concurrent writes to the file, which is tricky.

Instead of copying the database to avoid the two problems above, augur filter could ATTACH to the input db from within the "main" temporary/in-memory db it uses to do its work. This treats the input db as something conceptually separate from the transient working db (which is, roughly speaking, an implementation detail) and more like an input TSV, rather than treating the input db as a pre-loaded snapshot of the transient working db. The input db would want to be in WAL mode to support concurrent reads.
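A rough sketch of that approach, assuming a pre-built metadata.sqlite3 containing a metadata table (names are illustrative):

```python
import sqlite3

# The working database is transient and in-memory; the pre-loaded metadata
# database is attached as read-mostly input, much like a TSV would be.
connection = sqlite3.connect(":memory:")
connection.execute("ATTACH DATABASE 'metadata.sqlite3' AS input")

# Intermediate tables are created in the in-memory "main" schema, so parallel
# augur filter runs reading the same input file don't step on each other.
connection.execute("""
    CREATE TABLE passed_filters AS
    SELECT strain FROM input.metadata WHERE country = 'USA'
""")
```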

This also simplifies the implementation of validate_arguments() to raise AugurErrors directly instead of returning a boolean to be translated to a proper error message by the caller.
These can be used by any engine.
This commit contains files supporting an independent implementation of augur filter using SQLite. It is a full refactor and splits the logic into several smaller files (a rough sketch of the split follows this list):

- New under filter_support/:
    - base.py - abstract class containing database-independent filtering/subsampling logic
    - sqlite.py - contains the working class, an implementation of base.py
- Add new io functions in io/db/.
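Purely for illustration, a sketch of that split; the class and method names below are assumptions, not the actual augur code:

```python
from abc import ABC, abstractmethod

class FilterBase(ABC):
    """Database-independent filtering/subsampling flow (illustrative only)."""

    def run(self):
        self.load_metadata()
        self.apply_filters()
        self.write_outputs()

    @abstractmethod
    def load_metadata(self): ...

    @abstractmethod
    def apply_filters(self): ...

    @abstractmethod
    def write_outputs(self): ...

class FilterSQLite(FilterBase):
    """Concrete engine implementing each step against a SQLite database."""

    def load_metadata(self): ...
    def apply_filters(self): ...
    def write_outputs(self): ...
```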

TODO: address the following differences in other commits

Breaking changes:

- Error when `--include-where`, `--exclude-where`, or `--query` column doesn't exist
    - partially fixes 754

Other changes:

- (internal) less memory usage, more disk usage
- `--min-date`/`--max-date`: add support for ambiguity (e.g. 2020-XX-XX)
    - `--max-date` with year only will work properly now
These tests are specific to the newly coined pandas engine, so the file
deserves a new name.
- Translate original filter pytests to work with an in-memory database.
- Split filter tests into separate classes:
    - file loading
    - filtering
    - filter groupby

TODO: split this into a separate PR

- min/max date: check that silently passing invalid values (e.g. "?") get filtered out
- test_load_invalid_id_column
- test_load_custom_id_column
- test_load_custom_id_column_with_spaces
- test_load_priority_scores_extra_column
- test_load_priority_scores_missing_column
- test_load_sequences_subset_strains
- test_generate_sequence_index
Comment on lines +47 to 48
input_group.add_argument('--engine', default="pandas", choices=["pandas", "sqlite"], help="Internal engine, determines syntax used for --query")

@victorlin (Member, Author)

Running some workflows using the current implementation, will add results to this thread.

@victorlin (Member, Author)

ncov-tutorial seems to run fine.

@victorlin (Member, Author) commented Aug 8, 2022

Running ncov's rebuild-gisaid.yml:

  1. Created nextstrain/docker-base@ad3f5fb
  2. Created nextstrain/ncov@e3490aa
  3. Ran rebuild-gisaid.yml with:
    • branch: victorlin/test-augur-filter-sqlite-engine
    • trial build name: victorlin-test-sqlite
    • image: nextstrain/base:branch-victorlin-test-augur-filter-sqlite-engine
  4. View results:

Noting that for the global/6m dataset, n=2780, which contrasts a bit with previous dated builds:

Not sure if this is just due to random sampling differences. Would need to dig into Batch logs to investigate augur filter output, since I don't see the --output-log files under s3://nextstrain-ncov-private/trial/victorlin-test-sqlite/.

@victorlin (Member, Author)

Update: this stalled as I shifted gears to work on other projects. The current state is that it works, but opening it for formal review is pending:

  1. Incorporating newer changes (e.g. weekly grouping).
  2. Refactoring either the existing code or proposed changes to have a consistent "internal API" between the two engines.
  3. Extensive testing to make sure results are comparable to --engine pandas.
  4. A good story to maintain multiple engines so that new features are added to both simultaneously (at least until one is deprecated) and it is not cumbersome to do so. This applies to both implementation and automated tests.

@victorlin (Member, Author)

Update: I have a new version of this on the branch victorlin/rewrite-filter-using-sqlite. I'm still cleaning up the commits, but will open a new PR (or more) for those changes.

@victorlin closed this Mar 28, 2023
@victorlin deleted the filter/sqlite branch March 28, 2023 20:35