
filter: re-implement with SQLite and add option to use with --engine #854

Closed · wants to merge 19 commits

Conversation

@victorlin (Member) commented Feb 17, 2022

Description of proposed changes

Related issue(s)

Testing

  • Rewrite tests for new implementation
  • Modify functional tests to check --non-nucleotide and incomplete date
  • Add tests to improve coverage and verify functionality.

TODO

for later:

  • parse dates only if needed

codecov bot commented Feb 17, 2022

Codecov Report

Merging #854 (8acea55) into master (2e4aa88) will decrease coverage by 20.89%.
The diff coverage is 85.84%.

❗ Current head 8acea55 differs from pull request most recent head 0385409. Consider uploading reports for the commit 0385409 to get more accurate results

@@             Coverage Diff             @@
##           master     #854       +/-   ##
===========================================
- Coverage   59.64%   38.75%   -20.90%     
===========================================
  Files          53       49        -4     
  Lines        6317     6379       +62     
  Branches     1586     1608       +22     
===========================================
- Hits         3768     2472     -1296     
- Misses       2286     3802     +1516     
+ Partials      263      105      -158     
| Impacted Files | Coverage Δ |
|---|---|
| augur/frequencies.py | 10.25% <33.33%> (-41.42%) ⬇️ |
| augur/filter_support/db/base.py | 67.94% <67.94%> (ø) |
| augur/utils.py | 42.54% <83.33%> (-21.72%) ⬇️ |
| augur/filter_support/output.py | 84.61% <84.61%> (ø) |
| augur/io_support/db/sqlite.py | 89.13% <89.13%> (ø) |
| augur/io_support/db/__init__.py | 92.15% <92.15%> (ø) |
| augur/filter_support/db/sqlite.py | 95.43% <95.43%> (ø) |
| augur/dates.py | 97.61% <97.34%> (+7.76%) ⬆️ |
| augur/filter_support/subsample.py | 97.43% <97.43%> (ø) |
| augur/filter.py | 100.00% <100.00%> (+4.17%) ⬆️ |
| ... and 52 more | |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e4aa88...0385409.

@tsibley (Member) commented Feb 17, 2022

A few quick notes I had during our call just now:

  1. Much of the templated SQL seems to interpolate values directly without proper quoting. If interpolation is used, quoting is essential, but using placeholders would be ideal.

  2. Similarly, much of the templated SQL seems to interpolate identifiers (column names, table names) without quoting. At the very least it's essential to quote any identifiers that come from the user. It's ok to skip quoting things that we know don't need it, like our constants, although I think it's often easier to blanket quote it all and then not have to worry about it in the future.

  3. It's worth checking how often load_tsv() commits the inserts. Especially during bulk inserts it can be beneficial to commit every X rows (where X is large but not too large); see the sketch after this list. A quick check suggests to me that load_tsv() doesn't commit at all?

  4. Why register date handling functions from Python into SQLite instead of using SQLite's built-in date handling functions? (Or string handling functions, I guess, since that's all the Python is doing.)
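On point 3, a minimal sketch of batched inserts with periodic commits; the table and column names here are placeholders for illustration, not the actual load_tsv() code:

```python
import itertools
import sqlite3

def insert_in_batches(connection: sqlite3.Connection, rows, batch_size=10_000):
    """Insert rows in batches, committing after each batch so no single
    transaction grows unbounded. `rows` can be any iterable of tuples."""
    rows = iter(rows)
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            break
        connection.executemany(
            "INSERT INTO metadata (strain, date) VALUES (?, ?)",  # illustrative table
            batch,
        )
        connection.commit()
```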

@victorlin (Member, Author)

@tsibley

> 1. Much of the templated SQL seems to interpolate values directly without proper quoting. If interpolation is used, quoting is essential, but using placeholders would be ideal.

Could you clarify what you mean by using interpolation vs. placeholders?

> 2. Similarly, much of the templated SQL seems to interpolate identifiers (column names, table names) without quoting. At the very least it's essential to quote any identifiers that come from the user. It's ok to skip quoting things that we know don't need it, like our constants, although I think it's often easier to blanket quote it all and then not have to worry about it in the future.

Good point! I did this for the ID and date columns here 54d6ddf...a83a17e, but this should at least be expanded to user input such as in --exclude-where:

SELECT "{self.metadata_id_column}"
FROM {METADATA_TABLE_NAME}
WHERE {column} {op} '{value}'

And agreed on blanket quote.
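For concreteness, a hedged sketch of that query with the value bound as a placeholder instead of interpolated. The function and argument names are illustrative, not existing augur code, and op is assumed to come from a small trusted set (e.g. = or !=):

```python
import sqlite3

def strains_matching(connection: sqlite3.Connection, id_column, metadata_table, column, op, value):
    # Identifiers are double-quoted (escaping discussed below); the value is
    # bound with "?" so it is never interpolated into the SQL text.
    query = f'SELECT "{id_column}" FROM "{metadata_table}" WHERE "{column}" {op} ?'
    return connection.execute(query, (value,)).fetchall()
```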

> 3. It's worth checking how often load_tsv() commits the inserts. Especially during bulk inserts it can be beneficial to commit every X rows (where X is large but not too large). A quick check suggests to me that load_tsv() doesn't commit at all?

Yeah, I don't know where I got the false notion that commit isn't necessary with executemany(). I'll do some experimenting with different data loading options as we discussed.

> 4. Why register date handling functions from Python into SQLite instead of using SQLite's built-in date handling functions? (Or string handling functions, I guess, since that's all the Python is doing.)

Mostly because I imagined the complexity of writing get_date_max() in SQL and thought it would be good to stick to existing logic implemented in Python.

However, date parsing now takes up a decent amount of time (2nd to loading metadata), so it might be worth trying out a SQL implementation.

@tsibley (Member) commented Feb 18, 2022

> Could you clarify what you mean by using interpolation vs. placeholders?

cur.execute(f"select * from x where y = '{z}'")       # interpolation (problematic)
cur.execute("select * from x where y = ?", (z,))      # placeholder
cur.execute("select * from x where y = :z", {"z": z}) # placeholder

Placeholders are ubiquitous in SQL interfaces, with ? being the most common syntax. SQLite supports several forms.

> Good point! I did this for the ID and date columns here 54d6ddf...a83a17e

Ah, surrounding identifiers in double quotes is a start, but any double quotes inside the identifier also need to be escaped by doubling them up, e.g.

identifier.replace('"', '""')

before interpolation.
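As a sketch, that could live in a small helper (the name quote_identifier is an assumption, not an existing augur utility):

```python
def quote_identifier(name: str) -> str:
    """Quote an SQL identifier for SQLite, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

# e.g. quote_identifier('region name') -> '"region name"'
#      quote_identifier('odd"name')    -> '"odd""name"'
```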

> Yeah, I don't know where I got the false notion that commit isn't necessary with executemany(). I'll do some experimenting with different data loading options as we discussed.

👍

The default transaction handling of Python's sqlite3 module is a bit unusual: it's not auto-commit (which is good), but it is auto-begin for DML, which is a weird combo. IME, it is easier to reason about transactions by disabling that default with isolation_level = None and using explicit transaction handling in our code. It's then clear when you need to issue a commit: whenever you've issued a begin earlier. This can be wrapped up into a context manager too (extending the default context manager for connections just a bit).
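A sketch of what that could look like (assuming a connection opened with isolation_level=None; names are illustrative):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def transaction(connection: sqlite3.Connection):
    """Explicit BEGIN/COMMIT, with ROLLBACK on error. Assumes the connection
    was opened with isolation_level=None so sqlite3 never auto-begins."""
    connection.execute("BEGIN")
    try:
        yield connection
        connection.execute("COMMIT")
    except BaseException:
        connection.execute("ROLLBACK")
        raise

connection = sqlite3.connect(":memory:", isolation_level=None)
connection.execute("CREATE TABLE t (x)")
with transaction(connection):
    connection.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
```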

> Mostly because I imagined the complexity of writing get_date_max() in SQL and thought it would be good to stick to existing logic implemented in Python.
>
> However, date parsing now takes up a decent amount of time (2nd to loading metadata), so it might be worth trying out a SQL implementation.

Yeah, the time the date table creation takes makes me think a lot of it is spent shuttling between SQLite (C) and Python and back for those functions. I agree get_date_max() is not enticing to write in SQL, but (especially initially) it doesn't have to be all or nothing: we can register get_date_max() from Python but use built-in funcs for the others.
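For instance, a sketch of that mixed approach; get_date_max here is a crude stand-in for the real augur helper, purely for illustration:

```python
import sqlite3

def get_date_max(date_string):
    # Crude stand-in: resolve an ambiguous date like "2018-03-XX" to its
    # latest possible value. The real logic lives in augur's Python code.
    return date_string.replace("XX", "31")

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE metadata (strain TEXT, date TEXT)")
connection.execute("INSERT INTO metadata VALUES ('A/1', '2018-03-XX')")

# Only the awkward-to-express function is registered from Python;
# unambiguous parts can use SQLite built-ins such as strftime().
connection.create_function("get_date_max", 1, get_date_max)
print(connection.execute(
    "SELECT get_date_max(date), strftime('%Y', '2018-03-25') FROM metadata"
).fetchall())
```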

@victorlin (Member, Author)

Force-pushing date parsing improvements (pre-rebase d3e466d...903d92e)

@victorlin (Member, Author)

Force-pushing date parsing improvements and force-include reporting (pre-rebase 903d92e...911e137)

@victorlin (Member, Author)

@tsibley

After some more experimenting and optimization of the current code, I've decided not to pursue native SQLite for date logic:

  1. Caching the Python user-defined functions brings the total run time of date table creation for current GISAID data down from 58s without caching to 12s with caching on my machine (see the sketch after this list). This should be sufficient.

  2. Implementing this in SQLite is not very straightforward given its limited built-in functionality. To get year/month/day from an ISO 8601 date string:

    with t1 as (
        select
            substr("2018-03-25", 0, instr("2018-03-25", '-')) as year,
            substr("2018-03-25", instr("2018-03-25", '-') + 1) as no_year
    )
    select
        t1.year,
        substr(no_year, 0, instr(no_year, '-')) as month,
        substr(no_year, instr(no_year, '-') + 1) as day
    from t1;

    and this is without error handling (returning NULL if not parseable).

  3. Python functions have more potential to be re-used in other areas.
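On point 1, a minimal sketch of the caching idea; functools.lru_cache is one way to do it, get_date_min is a hypothetical counterpart to get_date_max, and the body below is placeholder logic rather than the real parser:

```python
import sqlite3
from functools import lru_cache

@lru_cache(maxsize=None)
def get_date_min(date_string):
    # Placeholder logic: each distinct date string is parsed only once,
    # which pays off because metadata repeats the same dates many times.
    return date_string.replace("XX", "01")

connection = sqlite3.connect(":memory:")
# SQLite calls straight into the cached Python function.
connection.create_function("get_date_min", 1, get_date_min)
```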

@tsibley (Member) commented Feb 25, 2022

@victorlin Ok, sounds good. These are great candidates for caching, indeed.

That example SQL is a little more tortuous than what I was expecting, which was to use a date/time function or a string handling function to extract the parts from an ISO date, e.g.:

sqlite> select cast(strftime('%m', '2018-03-25') as int) as month
3

Though I realize now it would be a little more complicated than that with date/time funcs thanks to our need to handle forms like 2018-03-XX and 2018 too. String handling (like the Python versions do) would sidestep those issues, but I see now that SQLite doesn't have a split() or equivalent (which is a bit boggling).

@victorlin (Member, Author)

Force-pushing updated tests, SQL identifier sanitizing, connection handling improvements (pre-rebase 911e137...3eb8c98)

@victorlin (Member, Author)

I investigated some data loading options today:

  1. executemany with a generator (I was doing this wrong until 6e14640 😬)
  2. isolation_level=None with explicit begin/commit
  3. PRAGMA journal_mode = WAL
  4. PRAGMA synchronous = OFF

Testing on my local machine, executemany without any non-default pragmas yields the fastest run time (the difference with synchronous = OFF is negligible, and I was surprised that journal_mode = WAL was slightly but noticeably slower).

Due to the single-threaded and sequential nature of augur filter (load data -> compute intermediate tables -> write outputs), there isn't a need for concurrent readers, which seems to be the main reason people use PRAGMA journal_mode = WAL. I don't see any logical way to restructure things so that concurrency would make sense, since there is no need to read until everything is loaded. The only benefit would be concurrent writers for a potential speed-up, which isn't a big win given the single write lock.

For proper auto-committing, I changed to use the default connection context manager rather than sharing one connection across calls.
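For reference, a sketch of option 1 combined with that context manager usage; the file name, table, and two-column layout are assumptions for illustration:

```python
import csv
import sqlite3

def tsv_rows(path):
    """Yield (strain, date) pairs lazily so the file never sits in memory whole."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            yield row[0], row[1]  # assumes strain and date are the first two columns

connection = sqlite3.connect("metadata.db")
with connection:  # commits on success, rolls back if an exception escapes
    connection.execute("CREATE TABLE IF NOT EXISTS metadata (strain TEXT, date TEXT)")
    connection.executemany(
        "INSERT INTO metadata VALUES (?, ?)",
        tsv_rows("metadata.tsv"),  # generator is consumed row by row
    )
connection.close()
```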


There is potential to re-use the same metadata for cases like subsampling schemes in the ncov workflow, which prompts "concurrency", but for that I think it's easier to just create copies of the database file rather than sharing a single one:

  1. Current SQLite logic relies on creating intermediate tables with predefined names.
  2. Sharing a database file would require concurrent writes to the file, which is tricky.

The alternative process would require a new pre-compute augur command similar to augur index, and may look something like:

augur load metadata.tsv db1.sqlite3
cp db1.sqlite3 db2.sqlite3
# the following can be run in parallel
augur filter --metadata-db db1.sqlite3 --query 'country = "USA"' --output-strains usa.txt
augur filter --metadata-db db2.sqlite3 --query 'country != "USA"' --output-strains non-usa.txt

This would come in a later PR; the point being that there should be no need to worry about supporting concurrency in this PR.

@victorlin (Member, Author)

Force-pushing additional tests, styling fixes, improved type hints and docstrings (pre-rebase 3eb8c98...a73aec3)

@victorlin (Member, Author)

Force-pushing rebase onto master (equivalent merge commit 4237047)

@tsibley (Member) commented May 4, 2022

> …but for that I think it's easier to just create copies of the database file rather than sharing a single one:
>
> 1. Current SQLite logic relies on creating intermediate tables with predefined names.
> 2. Sharing a database file would require concurrent writes to the file, which is tricky.

Instead of copying the database to avoid the two problems above, augur filter could ATTACH to the input db from within the "main" temporary/in-memory db it uses to do its work. This treats the input db as something conceptually separate from the transient working db (which is, roughly speaking, an implementation detail) and more like an input TSV, rather than treating the input db as a pre-loaded snapshot of the transient working db. The input db would want to be in WAL mode to support concurrent reads.
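A rough sketch of that approach, assuming a pre-built metadata.sqlite3 containing a metadata table (names are illustrative):

```python
import sqlite3

# The working database is transient and in-memory; the pre-loaded metadata
# database is attached as read-mostly input, much like a TSV would be.
connection = sqlite3.connect(":memory:")
connection.execute("ATTACH DATABASE 'metadata.sqlite3' AS input")

# Intermediate tables are created in the in-memory "main" schema, so parallel
# augur filter runs reading the same input file don't step on each other.
connection.execute("""
    CREATE TABLE passed_filters AS
    SELECT strain FROM input.metadata WHERE country = 'USA'
""")
```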

This also simplifies the implementation of validate_arguments() to raise AugurErrors directly instead of returning a boolean to be translated to a proper error message by the caller.
These can be used by any engine.
This commit contains files supporting an independent implementation of augur filter using SQLite. It is a full refactor and splits the logic into several smaller files (a rough sketch of the split follows this list):

- New under filter_support/:
    - base.py - abstract class containing database-independent filtering/subsampling logic
    - sqlite.py - contains the working class, an implementation of base.py
- Add new io functions in io/db/.
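Purely for illustration, a sketch of that split; the class and method names below are assumptions, not the actual augur code:

```python
from abc import ABC, abstractmethod

class FilterBase(ABC):
    """Database-independent filtering/subsampling flow (illustrative only)."""

    def run(self):
        self.load_metadata()
        self.apply_filters()
        self.write_outputs()

    @abstractmethod
    def load_metadata(self): ...

    @abstractmethod
    def apply_filters(self): ...

    @abstractmethod
    def write_outputs(self): ...

class FilterSQLite(FilterBase):
    """Concrete engine implementing each step against a SQLite database."""

    def load_metadata(self): ...
    def apply_filters(self): ...
    def write_outputs(self): ...
```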

TODO: address the following differences in other commits

Breaking changes:

- Error when `--include-where`, `--exclude-where`, or `--query` column doesn't exist
    - partially fixes 754

Other changes:

- (internal) less memory usage, more disk usage
- `--min-date`/`--max-date`: add support for ambiguity (e.g. 2020-XX-XX)
    - `--max-date` with year only will work properly now
These tests are specific to the newly coined pandas engine, so the file
deserves a new name.
- Translate original filter pytests to work with an in-memory database.
- Split filter tests into separate classes:
    - file loading
    - filtering
    - filter groupby

TODO: split this into a separate PR

- min/max date: check that silently passing invalid values (e.g. "?") get filtered out
- test_load_invalid_id_column
- test_load_custom_id_column
- test_load_custom_id_column_with_spaces
- test_load_priority_scores_extra_column
- test_load_priority_scores_missing_column
- test_load_sequences_subset_strains
- test_generate_sequence_index
Comment on lines +47 to 48
input_group.add_argument('--engine', default="pandas", choices=["pandas", "sqlite"], help="Internal engine, determines syntax used for --query")

@victorlin (Member, Author)

Running some workflows using the current implementation, will add results to this thread.

@victorlin (Member, Author)

ncov-tutorial seems to run fine.

@victorlin (Member, Author) commented Aug 8, 2022

Running ncov's rebuild-gisaid.yml:

  1. Created nextstrain/docker-base@ad3f5fb
  2. Created nextstrain/ncov@e3490aa
  3. Ran rebuild-gisaid.yml with:
    • branch: victorlin/test-augur-filter-sqlite-engine
    • trial build name: victorlin-test-sqlite
    • image: nextstrain/base:branch-victorlin-test-augur-filter-sqlite-engine
  4. View results:

Noting that for the global/6m dataset, n=2780, which contrasts a bit with previous dated builds:

Not sure if this is just due to random sampling differences. Would need to dig into Batch logs to investigate augur filter output, since I don't see the --output-log files under s3://nextstrain-ncov-private/trial/victorlin-test-sqlite/.

@victorlin (Member, Author)

Update: this stalled as I shifted gears to work on other projects. The current state is that it works, but opening it for formal review is pending:

  1. Incorporating newer changes (e.g. weekly grouping).
  2. Refactoring either the existing code or proposed changes to have a consistent "internal API" between the two engines.
  3. Extensive testing to make sure results are comparable to --engine pandas.
  4. A good story to maintain multiple engines so that new features are added to both simultaneously (at least until one is deprecated) and it is not cumbersome to do so. This applies to both implementation and automated tests.

@victorlin (Member, Author)

Update: I have a new version of this on the branch victorlin/rewrite-filter-using-sqlite. I'm still cleaning up the commits, but will open a new PR (or more) for those changes.

@victorlin closed this Mar 28, 2023
@victorlin deleted the filter/sqlite branch March 28, 2023 20:35