-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong or inconsitent behaviour with take
#2058
Comments
Hi, it would be easier to understand the problem better if you write what of output you are expecting. |
The behaviour should not really depend on the integer number of things for take. I have provided many examples to help as clues but perhaps they are distracting. I think the take 2 (example B) is correct. And take 1 should behave the same way (obviously with "1" replacing "2"). |
It does look like there are some issues here. (edited to fix a mistake of mine) This returns reasonable results: from invoices
select [billing_country, billing_city]
group [billing_city] (
take 1
)
sort billing_city WITH table_1 AS (
SELECT
DISTINCT billing_country,
billing_city
FROM
invoices
)
SELECT
billing_country,
billing_city
FROM
table_1
ORDER BY
billing_city ...since taking the distinct of So this works fine when the distinct over selected columns is the same as the distinct over the group. But commenting out the select: from invoices
# select [billing_country, billing_city]
group [billing_city] (
take 1
)
sort billing_city ...means we get WITH table_1 AS (
SELECT
DISTINCT *
FROM
invoices
)
SELECT
*
FROM
table_1
ORDER BY
billing_city I haven't gone through all of these. It's on my list to attempt an issue template (#2004) so it's easy to fill in the PRQL, incorrect SQL, and fixed SQL. If there are examples that this report doesn't cover, please do add them here in that format. Possibly the rule here should be: Only use Thanks for reporting @cottrell , it's appreciated.
SQL has a |
@max-sixty "Possibly the rule here should be: Only use DISTINCT when the group contains all columns ?" That sounds about right. I tried to grok the parser code to see where the "take" rule is because it sounds like a simple logic errors of some fork that happens only in the case of "no cols selected" and "take = 1". Interesting that https://github.com/PRQL/prql/blob/main/prql-compiler/src/sql/preprocess.rs#L127 ... seems like it should handle this. update ... oh there is a TODO: from 2022 to handle this. Will have a look @aljazerzen might remember but it's a year since that note I think. A minimal thought fix is to simply comment that "optimisation" out (I've tested this) |
Oh, yes I do remember :D It's a thing that I was putting off for a long time. At the time it was not convenient to do, but now it may not be so anymore. |
We could comment it out. It's a shame to lose the nice SQL we get with distinct, but it's better to be correct. We would need to update the docs too, which refer to @cottrell if you want to do that, that'd be good. We can leave a TODO and a reference to this issue. If we can figure out how how to keep |
@max-sixty I had a quick look but didn't immediately understand how to access the "by" or the "columns" at that point in the code to do the TODO check. Likely there is some global understanding necessary ... would be good if anyone has ideas about how to do this or ideas about the blockers to sketch them out to have a slightly more indicative TODO. |
I had a quick look, and can look more tomorrow. This is code I'm hoping to get into myself more too (@aljazerzen wrote almost all of it). FWIW the FYI I added #2080 so I could add some |
FWIW just did the dbg! print. ~/dev/prql$ cat example_1.prql
from invoices
group [billing_country] (
# sort [-invoice_date]
take 1
)
~/dev/prql$ cat ./example_1.prql | ./prqlc compile
[prql-compiler/src/sql/preprocess.rs:128] &ctx = Context {
dialect: GenericDialect,
anchor: AnchorContext {
column_decls: {
column-0: RelationColumn(
TIId(
0,
),
column-0,
Single(
Some(
"billing_country",
),
),
),
column-1: RelationColumn(
TIId(
0,
),
column-1,
Wildcard,
),
},
column_names: {},
table_decls: {
table-0: SqlTableDecl {
id: table-0,
name: Some(
"invoices",
),
relation: None,
},
},
table_instances: {
TIId(
0,
): TableRef {
source: table-0,
columns: [
(
Single(
Some(
"billing_country",
),
),
column-0,
),
(
Wildcard,
column-1,
),
],
name: Some(
"invoices",
),
},
},
col_name: NameGenerator {
prefix: "_expr_",
id: IdGenerator {
next_id: 0,
phantom: PhantomData,
},
},
table_name: NameGenerator {
prefix: "table_",
id: IdGenerator {
next_id: 0,
phantom: PhantomData,
},
},
cid: IdGenerator {
next_id: 2,
phantom: PhantomData,
},
tid: IdGenerator {
next_id: 1,
phantom: PhantomData,
},
tiid: IdGenerator {
next_id: 1,
phantom: PhantomData,
},
},
query: QueryOpts {
omit_ident_prefix: false,
pre_projection: false,
allow_ctes: true,
allow_stars: true,
},
query_stack: [],
ctes: [],
}
SELECT
DISTINCT *
FROM
invoices
-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)
~/dev/prql$ cat example_2.prql
from invoices
group [billing_country] (
# sort [-invoice_date]
take 2
)
~/dev/prql$ cat ./example_2.prql | ./prqlc compile
[prql-compiler/src/sql/preprocess.rs:128] &ctx = Context {
dialect: GenericDialect,
anchor: AnchorContext {
column_decls: {
column-1: RelationColumn(
TIId(
0,
),
column-1,
Wildcard,
),
column-0: RelationColumn(
TIId(
0,
),
column-0,
Single(
Some(
"billing_country",
),
),
),
},
column_names: {},
table_decls: {
table-0: SqlTableDecl {
id: table-0,
name: Some(
"invoices",
),
relation: None,
},
},
table_instances: {
TIId(
0,
): TableRef {
source: table-0,
columns: [
(
Single(
Some(
"billing_country",
),
),
column-0,
),
(
Wildcard,
column-1,
),
],
name: Some(
"invoices",
),
},
},
col_name: NameGenerator {
prefix: "_expr_",
id: IdGenerator {
next_id: 0,
phantom: PhantomData,
},
},
table_name: NameGenerator {
prefix: "table_",
id: IdGenerator {
next_id: 0,
phantom: PhantomData,
},
},
cid: IdGenerator {
next_id: 2,
phantom: PhantomData,
},
tid: IdGenerator {
next_id: 1,
phantom: PhantomData,
},
tiid: IdGenerator {
next_id: 1,
phantom: PhantomData,
},
},
query: QueryOpts {
omit_ident_prefix: false,
pre_projection: false,
allow_ctes: true,
allow_stars: true,
},
query_stack: [],
ctes: [],
}
WITH table_1 AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY billing_country) AS _expr_0
FROM
invoices
)
SELECT
*
FROM
table_1 AS table_0
WHERE
_expr_0 <= 2
-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org) |
In the To get the columns, you can use Then is just the matter of comparing two lists of cids. |
I don't think we should be looking into (note that |
Closed by #2109. Thanks for the guidance @aljazerzen . I'm also going to reflect on this as a case of #1840. The actual code was fairly easy. But:
So maybe we try and frame issues & tests as "we need to change this RQ into this RQ". Does that resonate at all?? |
Unfortunately, I don't see a way around this. There is a lot of tools in the toolbox, it's the part of the job to know which tool to use. The best thing we can do here is lots of guidance :D
It's nice to hear that my major refactor that added RQ was worth it. It turns out that looking at architectures of other compilers is a good practice, hehe. Maybe we should add even more IRs that are even closer to SQL? Just a thought, idk where I'd split it.
That works for some issues, not all of them. But in general yes, that would probably be helpful for someone to get started. |
I think with your guidance we're 90% of the way there — this issue was a good case of that. As you know, one of my priorities is to give you better leverage on your time, which we'd get from contributions requiring less guidance. Letting folks be independent would also be a better contribution experience. It's possible that we've done everything we can, and I don't have specific examples for where we haven't yet. But in general, I've found there's often lots of potential to make things more local, requiring less context & guidance. RQ is a good example of that I reckon, even if we haven't fully exploited it yet. I'm going to try and pivot back to the compiler by fixing some of our bugs (like this) — I think we're probably under-weighing how important these are for having the language feel reliable. While I'm doing that, I'll keep an eye open for opportunities on this dimension. |
I suspect some minor problem in the parser in the case of take 1
A) (wrong)
B) (correct?)
C) (very wrong)
D) (correct?)
The text was updated successfully, but these errors were encountered: