Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter: Migrating --query syntax from pandas to SQLite #920

Closed
victorlin opened this issue May 10, 2022 · 4 comments
Closed

filter: Migrating --query syntax from pandas to SQLite #920

victorlin opened this issue May 10, 2022 · 4 comments
Assignees

Comments

@victorlin
Copy link
Member

victorlin commented May 10, 2022

[note] This issue reflects a specific concern in #854. Discussion here will be used to refine that PR.

Context

The introduction of --query in version 8.0.0 allowed for complex queries against metadata that aren't handled by built-in filtering options. Here's a simple GitHub search showing some current usage of the feature.

In #854, the internals of augur filter are rewritten using SQLite. This makes it difficult to maintain compatibility for current usage of --query, which uses pandas under the hood to evaluate the expression. We should aim to transition users from pandas to SQLite expression syntax.

Background - difference between pandas and SQLite expression syntax

The transition may be easier if pandas expression syntax was distinct from SQLite expression syntax. However, this is not the case and many operators overlap with slight differences. So when current queries are run with the new SQLite implementation, they don't fail but instead give inaccurate results, which is worse.

Example from running the genomic surveillance tutorial

  • using pandas augur filter (expected behavior):

    augur filter \
    --query '(`reference_data` == '"'"'yes'"'"' & _length >= 27000) | (`custom_data` == '"'"'yes'"'"' & _length >= 27000) | (`background_data` == '"'"'yes'"'"' & _length >= 27000)'
    # ...
    # 28 strains were dropped during filtering
    # 	28 of these were dropped because they were in results/idaho/excluded_by_diagnostics.txt
    # 	2 strains were added back because they were in defaults/include.txt
    # 32 strains passed all filters
  • using SQLite augur filter (inaccurate behavior):

    augur filter \
    --query '(`reference_data` == '"'"'yes'"'"' & _length >= 27000) | (`custom_data` == '"'"'yes'"'"' & _length >= 27000) | (`background_data` == '"'"'yes'"'"' & _length >= 27000)'
    # ...
    # 61 strains were dropped during filtering
    # 	61 of these were filtered out by the query: "(`reference_data` == 'yes' & _length >= 27000) | (`custom_data` == 'yes' & _length >= 27000) | (`background_data` == 'yes' & _length >= 27000)"
    # 	2 strains were force-included because they were in defaults/include.txt
    # 2 strains passed all filters

With SQLite, the pandas expression resolves to FALSE and the query is effectively --exclude-all. This is due to differences in operator meaning and order of operations:

  • In pandas, &/| are logical operators whereas in SQLite, &/| are bitwise AND/OR respectively. The difference isn't a problem for this example since in SQL, TRUE & TRUE == TRUE AND TRUE.
  • In SQLite, bitwise operators have higher precedence than comparison operators (e.g. ==, >=). This is the main reason for inaccurate results in this example.
Click to expand detailed examples from a SQLite prompt.
-- This is the expression we are breaking down.
-- One might naively expect the output to be TRUE (1).
SELECT ('foo' == 'foo' & 2 >= 1);
-- 0

-- This is TRUE.
SELECT ('foo' == 'foo');
-- 1

-- This is TRUE.
SELECT (2 >= 1);
-- 1

-- This is TRUE.
SELECT (1 & 1);
-- 1

-- Put parentheses around bitwise operator to make implicit order explicit.
SELECT ('foo' == ('foo' & 2) >= 1);
-- 0

-- Examples of SQLite expressions that serve the same intents and purposes:

-- a. Put parentheses around comparisons to force proper order of operations.
SELECT (('foo' == 'foo') & (2 >= 1));
-- 1

-- b. Use logical operator for AND which has lower precedence than comparisons, as expected.
SELECT ('foo' == 'foo' AND 2 >= 1);
-- 1

There are more examples, but the point here is: we should draw a clear line between pandas and SQLite query, and ensure users do not mix the two.

For reference:

Tasks

  1. Ensure that current workflows do not break during transition. This could be done by having users opt-in to use SQLite expression syntax.
  2. Provide a guide for users to learn how to convert existing expressions and write new ones.

Possible solutions

Note that in the long-term, we may also support other database systems with varying expression syntax.

  1. Opt into the SQLite implementation using a separate parameter (e.g. --engine). Keep the current pandas implementation as the default --engine pandas, but in "maintenance mode" with a DeprecationWarning. Only add new features to --engine sqlite.
  2. Add a parameter --query-engine with default as pandas, with the option to opt into the more efficient --query-engine sqlite. This would:
    1. Force users to use the internal SQLite implementation.
    2. Require implementing a pandas query evaluator, which will be inefficient since it needs to translate all metadata from a SQL database to pandas DataFrame and back.
    3. Come with less code duplication.
  3. Write a converter from pandas to SQLite expression. This approach would make things work more "out of the box", but I don't prefer it because:
    1. From a quick scan, pandas.eval() does not have great documentation or a interface that can be used to parse tokens from expressions for easy conversion.
    2. pandas.eval() relies on either numexpr or native Python expression compilation. The support is wide, so the converter will also need wide support.
    3. Conversion seems overkill for transition purposes.
@victorlin victorlin self-assigned this May 10, 2022
@victorlin
Copy link
Member Author

Thinking some more on this, I'm developing a preference towards option 1 with the --engine parameter:

  • It introduces the SQLite implementation in the least invasive way, as more of an "experimental" feature which we could merge earlier and do some testing with.
  • We can add a DeprecationWarning for --engine pandas and swap the default to --engine sqlite in a major change whenever appropriate. This may still cause inaccurate results when using a pandas query expression on --engine sqlite, but we may be able to catch that in another fashion, and recommend --engine pandas as an easy fallback.
  • It does not limit the internal "engine" to just --query. Instead, --query should reflect the internal engine. So, option (2) is weird by mixing a pandas query engine with everything else being SQLite.
  • This paves the way for different [database] implementations as "engines" where we can have values such as pandas, sqlite, mysql, postgresql, ...

I will start updating #854 with this unless anyone has other opinions here.

@huddlej
Copy link
Contributor

huddlej commented May 17, 2022

@victorlin I like this idea of the --engine argument, as a general solution to how we add or remove support for different query syntax. Along the lines of @rneher's comment about how augur filter commands could eventually become a HTTP query, you could imagine some backend that supports a GET string as its query input using an "http" or "rest" engine.

@tsibley
Copy link
Member

tsibley commented Jun 15, 2022

Late to the party, but I concur with the analysis and comments above, particularly @victorlin's rationale in #920 (comment).

While I don't think it's necessarily worth our time, I think an "even better" option would be introducing an engine-neutral query syntax that can be implemented by all engines easily and used by most augur filter invocations for --query by default. (And there could still be a way to opt into engine-specific query syntax for advanced use.) This would solve the general issue of query portability that we're likely going to keep facing.

@victorlin
Copy link
Member Author

We met a few months ago and the decision was to maintain a single implementation of augur filter with multiple query options (e.g. --query-pandas, --query-sqlite). So far this is working well in a development branch victorlin/rewrite-filter-using-sqlite.

For compatibility, --query will alias to --query-pandas in the SQLite implementation, which will load data into a temporary pandas DataFrame and apply the query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

No branches or pull requests

3 participants