filter: Migrating `--query` syntax from pandas to SQLite #920

victorlin · 2022-05-10T21:30:14Z

[note] This issue reflects a specific concern in #854. Discussion here will be used to refine that PR.

Context

The introduction of --query in version 8.0.0 allowed for complex queries against metadata that aren't handled by built-in filtering options. Here's a simple GitHub search showing some current usage of the feature.

In #854, the internals of augur filter are rewritten using SQLite. This makes it difficult to maintain compatibility for current usage of --query, which uses pandas under the hood to evaluate the expression. We should aim to transition users from pandas to SQLite expression syntax.

Background - difference between pandas and SQLite expression syntax

The transition may be easier if pandas expression syntax was distinct from SQLite expression syntax. However, this is not the case and many operators overlap with slight differences. So when current queries are run with the new SQLite implementation, they don't fail but instead give inaccurate results, which is worse.

Example from running the genomic surveillance tutorial

using pandas augur filter (expected behavior):

augur filter \
--query '(`reference_data` == '"'"'yes'"'"' & _length >= 27000) | (`custom_data` == '"'"'yes'"'"' & _length >= 27000) | (`background_data` == '"'"'yes'"'"' & _length >= 27000)'
# ...
# 28 strains were dropped during filtering
# 	28 of these were dropped because they were in results/idaho/excluded_by_diagnostics.txt
# 	2 strains were added back because they were in defaults/include.txt
# 32 strains passed all filters

using SQLite augur filter (inaccurate behavior):

augur filter \
--query '(`reference_data` == '"'"'yes'"'"' & _length >= 27000) | (`custom_data` == '"'"'yes'"'"' & _length >= 27000) | (`background_data` == '"'"'yes'"'"' & _length >= 27000)'
# ...
# 61 strains were dropped during filtering
# 	61 of these were filtered out by the query: "(`reference_data` == 'yes' & _length >= 27000) | (`custom_data` == 'yes' & _length >= 27000) | (`background_data` == 'yes' & _length >= 27000)"
# 	2 strains were force-included because they were in defaults/include.txt
# 2 strains passed all filters

With SQLite, the pandas expression resolves to FALSE and the query is effectively --exclude-all. This is due to differences in operator meaning and order of operations:

In pandas, &/| are logical operators whereas in SQLite, &/| are bitwise AND/OR respectively. The difference isn't a problem for this example since in SQL, TRUE & TRUE == TRUE AND TRUE.
In SQLite, bitwise operators have higher precedence than comparison operators (e.g. ==, >=). This is the main reason for inaccurate results in this example.

Click to expand detailed examples from a SQLite prompt.

-- This is the expression we are breaking down.
-- One might naively expect the output to be TRUE (1).
SELECT ('foo' == 'foo' & 2 >= 1);
-- 0

-- This is TRUE.
SELECT ('foo' == 'foo');
-- 1

-- This is TRUE.
SELECT (2 >= 1);
-- 1

-- This is TRUE.
SELECT (1 & 1);
-- 1

-- Put parentheses around bitwise operator to make implicit order explicit.
SELECT ('foo' == ('foo' & 2) >= 1);
-- 0

-- Examples of SQLite expressions that serve the same intents and purposes:

-- a. Put parentheses around comparisons to force proper order of operations.
SELECT (('foo' == 'foo') & (2 >= 1));
-- 1

-- b. Use logical operator for AND which has lower precedence than comparisons, as expected.
SELECT ('foo' == 'foo' AND 2 >= 1);
-- 1

There are more examples, but the point here is: we should draw a clear line between pandas and SQLite query, and ensure users do not mix the two.

For reference:

pandas expression syntax
- This varies with engine param, default is engine='numexpr'. See numexpr operators doc
SQLite expression syntax
- Here's a nice breakdown of operator types.

Tasks

Ensure that current workflows do not break during transition. This could be done by having users opt-in to use SQLite expression syntax.
Provide a guide for users to learn how to convert existing expressions and write new ones.

Possible solutions

Note that in the long-term, we may also support other database systems with varying expression syntax.

Opt into the SQLite implementation using a separate parameter (e.g. --engine). Keep the current pandas implementation as the default --engine pandas, but in "maintenance mode" with a DeprecationWarning. Only add new features to --engine sqlite.
Add a parameter --query-engine with default as pandas, with the option to opt into the more efficient --query-engine sqlite. This would:
1. Force users to use the internal SQLite implementation.
2. Require implementing a pandas query evaluator, which will be inefficient since it needs to translate all metadata from a SQL database to pandas DataFrame and back.
3. Come with less code duplication.
Write a converter from pandas to SQLite expression. This approach would make things work more "out of the box", but I don't prefer it because:
1. From a quick scan, pandas.eval() does not have great documentation or a interface that can be used to parse tokens from expressions for easy conversion.
2. pandas.eval() relies on either numexpr or native Python expression compilation. The support is wide, so the converter will also need wide support.
3. Conversion seems overkill for transition purposes.

The text was updated successfully, but these errors were encountered:

victorlin · 2022-05-12T20:49:01Z

Thinking some more on this, I'm developing a preference towards option 1 with the --engine parameter:

It introduces the SQLite implementation in the least invasive way, as more of an "experimental" feature which we could merge earlier and do some testing with.
We can add a DeprecationWarning for --engine pandas and swap the default to --engine sqlite in a major change whenever appropriate. This may still cause inaccurate results when using a pandas query expression on --engine sqlite, but we may be able to catch that in another fashion, and recommend --engine pandas as an easy fallback.
It does not limit the internal "engine" to just --query. Instead, --query should reflect the internal engine. So, option (2) is weird by mixing a pandas query engine with everything else being SQLite.
This paves the way for different [database] implementations as "engines" where we can have values such as pandas, sqlite, mysql, postgresql, ...

I will start updating #854 with this unless anyone has other opinions here.

huddlej · 2022-05-17T17:43:33Z

@victorlin I like this idea of the --engine argument, as a general solution to how we add or remove support for different query syntax. Along the lines of @rneher's comment about how augur filter commands could eventually become a HTTP query, you could imagine some backend that supports a GET string as its query input using an "http" or "rest" engine.

tsibley · 2022-06-15T19:23:37Z

Late to the party, but I concur with the analysis and comments above, particularly @victorlin's rationale in #920 (comment).

While I don't think it's necessarily worth our time, I think an "even better" option would be introducing an engine-neutral query syntax that can be implemented by all engines easily and used by most augur filter invocations for --query by default. (And there could still be a way to opt into engine-specific query syntax for advanced use.) This would solve the general issue of query portability that we're likely going to keep facing.

victorlin · 2023-03-28T20:41:30Z

We met a few months ago and the decision was to maintain a single implementation of augur filter with multiple query options (e.g. --query-pandas, --query-sqlite). So far this is working well in a development branch victorlin/rewrite-filter-using-sqlite.

For compatibility, --query will alias to --query-pandas in the SQLite implementation, which will load data into a temporary pandas DataFrame and apply the query.

victorlin self-assigned this May 10, 2022

victorlin mentioned this issue May 12, 2022

filter: re-implement with SQLite and add option to use with --engine #854

Closed

21 tasks

victorlin mentioned this issue May 18, 2022

filter: Reduce file size with filter_support/ #937

Closed

victorlin closed this as completed Mar 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filter: Migrating `--query` syntax from pandas to SQLite #920

filter: Migrating `--query` syntax from pandas to SQLite #920

victorlin commented May 10, 2022 •

edited

Loading

victorlin commented May 12, 2022

huddlej commented May 17, 2022

tsibley commented Jun 15, 2022

victorlin commented Mar 28, 2023

filter: Migrating --query syntax from pandas to SQLite #920

filter: Migrating --query syntax from pandas to SQLite #920

Comments

victorlin commented May 10, 2022 • edited Loading

Context

Background - difference between pandas and SQLite expression syntax

Tasks

Possible solutions

victorlin commented May 12, 2022

huddlej commented May 17, 2022

tsibley commented Jun 15, 2022

victorlin commented Mar 28, 2023

filter: Migrating `--query` syntax from pandas to SQLite #920

filter: Migrating `--query` syntax from pandas to SQLite #920

victorlin commented May 10, 2022 •

edited

Loading