-
Hi, I would like to make a query where the result is ordered by some columns which won't actually end up in the result. E.g.: (
table
.order_by([col1, col2])
.drop([col1, col2])
.execute()
) However, at least with the duckdb backend, this doesn't seem to maintain the requested ordering. Is there any way to enforce this? I've included a reproducible example below, apologies for not being able to cut it down further, but I think the query has to already be a bit complicated to see this behaviour. I also can't seem to grab the demo data from the docs (update: that was #8874) Exampleimport ibis
from pathlib import Path
DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")
if not DB_PTH.exists():
!wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"
conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)
genes = conn.table("gene")
tx = conn.table("tx")
query = (
genes
.join(tx, predicates=["gene_id"])
.order_by(["seq_name", "tx_seq_start", "tx_seq_end"])
.select(["gene_id", "tx_id"])
)
def check_consistent_order(table):
return table.execute().equals(table.execute())
assert not all(check_consistent_order(query) for _ in range(5)) Thanks for any help! |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 6 replies
-
Thanks for raising this @ivirshup ! So, yes, this should absolutely be possible -- I'm still poking at this but I think it's a DuckDB issue, and it might be specific to using their See this duckdb-only reproducer: [ins] In [1]: import duckdb
[ins] In [2]: con = duckdb.connect()
[ins] In [3]: con.sql("ATTACH 'EnsDb.Hsapiens.v108.sqlite' (TYPE SQLITE)")
[ins] In [4]: con.sql("USE EnsDb")
[ins] In [5]: query = """SELECT
...: "t4"."gene_id",
...: "t4"."tx_id"
...: FROM (
...: SELECT
...: "t2"."gene_id",
...: "t2"."gene_name",
...: "t2"."gene_biotype",
...: "t2"."gene_seq_start",
...: "t2"."gene_seq_end",
...: "t2"."seq_name",
...: "t2"."seq_strand",
...: "t2"."seq_coord_system",
...: "t2"."description",
...: "t2"."gene_id_version",
...: "t2"."canonical_transcript",
...: "t3"."tx_id",
...: "t3"."tx_biotype",
...: "t3"."tx_seq_start",
...: "t3"."tx_seq_end",
...: "t3"."tx_cds_seq_start",
...: "t3"."tx_cds_seq_end",
...: "t3"."tx_support_level",
...: "t3"."tx_id_version",
...: "t3"."gc_content",
...: "t3"."tx_external_name",
...: "t3"."tx_is_canonical"
...: FROM "gene" AS "t2"
...: INNER JOIN "tx" AS "t3"
...: ON "t2"."gene_id" = "t3"."gene_id"
...: ) AS "t4"
...: ORDER BY
...: "t4"."seq_name" ASC,
...: "t4"."tx_seq_start" ASC,
...: "t4"."tx_seq_end" ASC"""
[ins] In [6]: results= [con.sql(query).arrow() for i in range(5)]
[ins] In [7]: for i in range(1, 5):
...: print(results[0].equals(results[i]))
True
False
True
False I'm going to try to see if I can reproduce this using DuckDB tables, then we can report it upstream |
Beta Was this translation helpful? Give feedback.
-
Ok, this was a bit of a journey and thanks to @cpcloud for providing some much-needed clarity on it. The selected columns To get around that, we can add row numbers before the initial ordering, and then use the row number as the final ordering column to provide a tie-break when the other three columns aren't unique. With that added (below) this query is fully deterministic. import ibis
from pathlib import Path
DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")
if not DB_PTH.exists():
!wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"
conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)
genes = conn.table("gene")
tx = conn.table("tx")
query = (
genes
.join(tx, predicates=["gene_id"]).mutate(_id=ibis.row_number())
.order_by(["seq_name", "tx_seq_start", "tx_seq_end", "_id"])
.select(["gene_id", "tx_id"])
)
def check_consistent_order(table):
return table.execute().equals(table.execute())
assert not all(check_consistent_order(query) for _ in range(5)) |
Beta Was this translation helpful? Give feedback.
Ok, this was a bit of a journey and thanks to @cpcloud for providing some much-needed clarity on it.
The selected columns
gene_tx
andtx_id
are pairwise unique, so you would expect the output to be consistently ordered. However, the ordering columnsseq_name
,tx_seq_start
,tx_seq_end
are not pairwise unique, there are a few overlaps, so the ordering in those overlaps can vary because there's not sufficient information for a deterministic ordering.To get around that, we can add row numbers before the initial ordering, and then use the row number as the final ordering column to provide a tie-break when the other three columns aren't unique. With that added (below) this query is fully deter…