Order by columns not present in the result? #8896

ivirshup · 2024-04-05T15:40:00Z

ivirshup
Apr 5, 2024

Hi,

I would like to make a query where the result is ordered by some columns which won't actually end up in the result. E.g.:

(
    table
    .order_by([col1, col2])
    .drop([col1, col2])
    .execute()
)

However, at least with the duckdb backend, this doesn't seem to maintain the requested ordering. Is there any way to enforce this?

I've included a reproducible example below, apologies for not being able to cut it down further, but I think the query has to already be a bit complicated to see this behaviour. I also can't seem to grab the demo data from the docs (update: that was #8874)

Example

import ibis
from pathlib import Path

DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")

if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"

conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)

genes = conn.table("gene")
tx = conn.table("tx")

query = (
    genes
    .join(tx, predicates=["gene_id"])
    .order_by(["seq_name", "tx_seq_start", "tx_seq_end"])
    .select(["gene_id", "tx_id"])
)

def check_consistent_order(table):
    return table.execute().equals(table.execute())

assert not all(check_consistent_order(query) for _ in range(5))

Thanks for any help!

Answered by gforsyth

Apr 12, 2024

Ok, this was a bit of a journey and thanks to @cpcloud for providing some much-needed clarity on it.

The selected columns gene_tx and tx_id are pairwise unique, so you would expect the output to be consistently ordered. However, the ordering columns seq_name, tx_seq_start, tx_seq_end are not pairwise unique, there are a few overlaps, so the ordering in those overlaps can vary because there's not sufficient information for a deterministic ordering.

To get around that, we can add row numbers before the initial ordering, and then use the row number as the final ordering column to provide a tie-break when the other three columns aren't unique. With that added (below) this query is fully deter…

View full answer

gforsyth · 2024-04-10T21:13:00Z

gforsyth
Apr 10, 2024
Maintainer

Thanks for raising this @ivirshup !

So, yes, this should absolutely be possible -- I'm still poking at this but I think it's a DuckDB issue, and it might be specific to using their arrow format for returning results (which we always do, even if out eventual target is a pandas dataframe).

See this duckdb-only reproducer:

[ins] In [1]: import duckdb

[ins] In [2]: con = duckdb.connect()

[ins] In [3]: con.sql("ATTACH 'EnsDb.Hsapiens.v108.sqlite' (TYPE SQLITE)")

[ins] In [4]: con.sql("USE EnsDb")

[ins] In [5]: query = """SELECT
         ...:   "t4"."gene_id",
         ...:   "t4"."tx_id"
         ...: FROM (
         ...:   SELECT
         ...:     "t2"."gene_id",
         ...:     "t2"."gene_name",
         ...:     "t2"."gene_biotype",
         ...:     "t2"."gene_seq_start",
         ...:     "t2"."gene_seq_end",
         ...:     "t2"."seq_name",
         ...:     "t2"."seq_strand",
         ...:     "t2"."seq_coord_system",
         ...:     "t2"."description",
         ...:     "t2"."gene_id_version",
         ...:     "t2"."canonical_transcript",
         ...:     "t3"."tx_id",
         ...:     "t3"."tx_biotype",
         ...:     "t3"."tx_seq_start",
         ...:     "t3"."tx_seq_end",
         ...:     "t3"."tx_cds_seq_start",
         ...:     "t3"."tx_cds_seq_end",
         ...:     "t3"."tx_support_level",
         ...:     "t3"."tx_id_version",
         ...:     "t3"."gc_content",
         ...:     "t3"."tx_external_name",
         ...:     "t3"."tx_is_canonical"
         ...:   FROM "gene" AS "t2"
         ...:   INNER JOIN "tx" AS "t3"
         ...:     ON "t2"."gene_id" = "t3"."gene_id"
         ...: ) AS "t4"
         ...: ORDER BY
         ...:   "t4"."seq_name" ASC,
         ...:   "t4"."tx_seq_start" ASC,
         ...:   "t4"."tx_seq_end" ASC"""

[ins] In [6]: results= [con.sql(query).arrow() for i in range(5)]

[ins] In [7]: for i in range(1, 5):
         ...:     print(results[0].equals(results[i]))
True
False
True
False

I'm going to try to see if I can reproduce this using DuckDB tables, then we can report it upstream

6 replies

gforsyth Apr 11, 2024
Maintainer

Dammit, now I just triggered the issue by bumping the number of attempts up to 30. Ok, I'm reporting this upstream.

gforsyth Apr 11, 2024
Maintainer

Ok, I've reproduced this without SQLite in the mix.

gforsyth Apr 11, 2024
Maintainer

Ok! Latest update!

I'm pretty sure this is an Ibis bug, or a bit of missing Ibis surface area because the arrow() method in duckdb accepts a batch_size argument that we weren't passing, so getting back slightly nondeterministic batch sizes was leading to the mismatches. PR incoming

gforsyth Apr 11, 2024
Maintainer

Sweet! Here's the original reproducer, now with the expected AssertionError:

[ins] In [1]: import ibis
         ...: from pathlib import Path
         ...: 
         ...: DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")
         ...: 
         ...: if not DB_PTH.exists():
         ...:     !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub
         ...: /AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"
         ...: 
         ...: conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
         ...: conn.attach_sqlite(DB_PTH)
         ...: 
         ...: genes = conn.table("gene")
         ...: tx = conn.table("tx")
         ...: 
         ...: query = (
         ...:     genes
         ...:     .join(tx, predicates=["gene_id"])
         ...:     .order_by(["seq_name", "tx_seq_start", "tx_seq_end"])
         ...:     .select(["gene_id", "tx_id"])
         ...: )
         ...: 
         ...: def check_consistent_order(table):
         ...:     return table.execute().equals(table.execute())
         ...: 
         ...: assert not all(check_consistent_order(query) for _ in range(5))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 25
     22 def check_consistent_order(table):
     23     return table.execute().equals(table.execute())
---> 25 assert not all(check_consistent_order(query) for _ in range(5))

AssertionError:

gforsyth Apr 11, 2024
Maintainer

#8942

gforsyth · 2024-04-12T15:32:00Z

gforsyth
Apr 12, 2024
Maintainer

Ok, this was a bit of a journey and thanks to @cpcloud for providing some much-needed clarity on it.

The selected columns gene_tx and tx_id are pairwise unique, so you would expect the output to be consistently ordered. However, the ordering columns seq_name, tx_seq_start, tx_seq_end are not pairwise unique, there are a few overlaps, so the ordering in those overlaps can vary because there's not sufficient information for a deterministic ordering.

To get around that, we can add row numbers before the initial ordering, and then use the row number as the final ordering column to provide a tie-break when the other three columns aren't unique. With that added (below) this query is fully deterministic.

import ibis
from pathlib import Path

DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")

if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"

conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)

genes = conn.table("gene")
tx = conn.table("tx")

query = (
    genes
    .join(tx, predicates=["gene_id"]).mutate(_id=ibis.row_number())
    .order_by(["seq_name", "tx_seq_start", "tx_seq_end", "_id"])
    .select(["gene_id", "tx_id"])
)

def check_consistent_order(table):
    return table.execute().equals(table.execute())

assert not all(check_consistent_order(query) for _ in range(5))

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Order by columns not present in the result? #8896

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Order by columns not present in the result? #8896

ivirshup Apr 5, 2024

Replies: 2 comments · 6 replies

gforsyth Apr 10, 2024 Maintainer

gforsyth Apr 11, 2024 Maintainer

gforsyth Apr 11, 2024 Maintainer

gforsyth Apr 11, 2024 Maintainer

gforsyth Apr 11, 2024 Maintainer

gforsyth Apr 11, 2024 Maintainer

gforsyth Apr 12, 2024 Maintainer

ivirshup
Apr 5, 2024

Replies: 2 comments 6 replies

gforsyth
Apr 10, 2024
Maintainer

gforsyth Apr 11, 2024
Maintainer

gforsyth Apr 11, 2024
Maintainer

gforsyth Apr 11, 2024
Maintainer

gforsyth Apr 11, 2024
Maintainer

gforsyth Apr 11, 2024
Maintainer

gforsyth
Apr 12, 2024
Maintainer