Wrong or inconsitent behaviour with `take` #2058

cottrell · 2023-03-08T11:01:28Z

I suspect some minor problem in the parser in the case of take 1

A) (wrong)

from invoices
group [billing_city] (
  take 1
)

SELECT
  DISTINCT *
FROM
  invoices

-- Generated by PRQL compiler version:0.5.2 (https://prql-lang.org)

B) (correct?)

from invoices
group [billing_city] (
  take 2
)

WITH table_1 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY billing_city) AS _expr_0
  FROM
    invoices
)
SELECT
  *
FROM
  table_1
WHERE
  _expr_0 <= 2

-- Generated by PRQL compiler version:0.5.2 (https://prql-lang.org)

C) (very wrong)

from invoices
group [billing_city] (
  take 1
)
select [a, b, c]

SELECT
  DISTINCT a,
  b,
  c
FROM
  invoices

-- Generated by PRQL compiler version:0.5.2 (https://prql-lang.org)

D) (correct?)

from invoices
group [billing_city] (
  sort [-invoice_date]
  take 1
)

WITH table_1 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY billing_city
      ORDER BY
        invoice_date DESC
    ) AS _expr_0
  FROM
    invoices
)
SELECT
  *
FROM
  table_1
WHERE
  _expr_0 <= 1

-- Generated by PRQL compiler version:0.5.2 (https://prql-lang.org)

The text was updated successfully, but these errors were encountered:

eitsupi · 2023-03-08T13:52:59Z

Hi, it would be easier to understand the problem better if you write what of output you are expecting.

cottrell · 2023-03-08T14:22:34Z

The behaviour should not really depend on the integer number of things for take.

I have provided many examples to help as clues but perhaps they are distracting.

I think the take 2 (example B) is correct. And take 1 should behave the same way (obviously with "1" replacing "2").

cottrell · 2023-03-08T14:26:49Z

Here is yet another example where "sort" being inserted takes us from error to correct result. The "sort" should have no impact on the result

I suspect it is something obvious in the "group" command where when everything but "take" op is omitted is has wrong behaviour.

cottrell · 2023-03-08T14:32:51Z

Another example, notice there are MANY entries per group in the first one and only 2 per group in the second one.

max-sixty · 2023-03-09T04:55:52Z

It does look like there are some issues here.

(edited to fix a mistake of mine)

This returns reasonable results:

from invoices
select [billing_country, billing_city]
group [billing_city] (
  take 1
)
sort billing_city

WITH table_1 AS (
  SELECT
    DISTINCT billing_country,
    billing_city
  FROM
    invoices
)
SELECT
  billing_country,
  billing_city
FROM
  table_1
ORDER BY
  billing_city

...since taking the distinct of [billing_country, billing_city] returns the same rows as taking the distinct of billing_city.

So this works fine when the distinct over selected columns is the same as the distinct over the group.

But commenting out the select:

from invoices
# select [billing_country, billing_city]
group [billing_city] (
  take 1
)
sort billing_city

...means we get DISTINCT *, which is a very different number of rows — just anything that's not a duplicate.

WITH table_1 AS (
  SELECT
    DISTINCT *
  FROM
    invoices
)
SELECT
  *
FROM
  table_1
ORDER BY
  billing_city

I haven't gone through all of these. It's on my list to attempt an issue template (#2004) so it's easy to fill in the PRQL, incorrect SQL, and fixed SQL. If there are examples that this report doesn't cover, please do add them here in that format.

Possibly the rule here should be: Only use DISTINCT when the group contains all columns ?

Thanks for reporting @cottrell , it's appreciated.

The behaviour should not really depend on the integer number of things for take.

SQL has a DISTINCT keyword which makes the resulting queries much simpler in the case of take 1, so we do try and use that, under the principle that readable SQL is a good goal

cottrell · 2023-03-09T09:22:44Z

@max-sixty "Possibly the rule here should be: Only use DISTINCT when the group contains all columns ?" That sounds about right. I tried to grok the parser code to see where the "take" rule is because it sounds like a simple logic errors of some fork that happens only in the case of "no cols selected" and "take = 1".

Interesting that https://github.com/PRQL/prql/blob/main/prql-compiler/src/sql/preprocess.rs#L127 ... seems like it should handle this.

update ... oh there is a TODO: from 2022 to handle this. Will have a look

@aljazerzen might remember but it's a year since that note I think.

A minimal thought fix is to simply comment that "optimisation" out (I've tested this)

aljazerzen · 2023-03-09T12:03:29Z

Oh, yes I do remember :D It's a thing that I was putting off for a long time.

#944

At the time it was not convenient to do, but now it may not be so anymore.

max-sixty · 2023-03-10T00:29:40Z

We could comment it out. It's a shame to lose the nice SQL we get with distinct, but it's better to be correct. We would need to update the docs too, which refer to DISTINCT (these could also be commented out as a stop-gap).

@cottrell if you want to do that, that'd be good. We can leave a TODO and a reference to this issue.

If we can figure out how how to keep DISTINCT by fixing that TODO, then even better. Though my guess is that it's not trivial, since it's still a TODO!

cottrell · 2023-03-10T08:56:15Z

@max-sixty I had a quick look but didn't immediately understand how to access the "by" or the "columns" at that point in the code to do the TODO check. Likely there is some global understanding necessary ... would be good if anyone has ideas about how to do this or ideas about the blockers to sketch them out to have a slightly more indicative TODO.

max-sixty · 2023-03-10T09:34:01Z

I had a quick look, and can look more tomorrow. This is code I'm hoping to get into myself more too (@aljazerzen wrote almost all of it).

FWIW the by is a remnant from us using by to specify the columns in a group — it just means the columns in the group.

FYI I added #2080 so I could add some dbg! statements and understand where the information was on the ctx variable.

cottrell · 2023-03-10T09:58:12Z

FWIW just did the dbg! print.

~/dev/prql$ cat example_1.prql
from invoices
group [billing_country] (
  # sort [-invoice_date]
  take 1
)

~/dev/prql$ cat ./example_1.prql | ./prqlc compile
[prql-compiler/src/sql/preprocess.rs:128] &ctx = Context {
    dialect: GenericDialect,
    anchor: AnchorContext {
        column_decls: {
            column-0: RelationColumn(
                TIId(
                    0,
                ),
                column-0,
                Single(
                    Some(
                        "billing_country",
                    ),
                ),
            ),
            column-1: RelationColumn(
                TIId(
                    0,
                ),
                column-1,
                Wildcard,
            ),
        },
        column_names: {},
        table_decls: {
            table-0: SqlTableDecl {
                id: table-0,
                name: Some(
                    "invoices",
                ),
                relation: None,
            },
        },
        table_instances: {
            TIId(
                0,
            ): TableRef {
                source: table-0,
                columns: [
                    (
                        Single(
                            Some(
                                "billing_country",
                            ),
                        ),
                        column-0,
                    ),
                    (
                        Wildcard,
                        column-1,
                    ),
                ],
                name: Some(
                    "invoices",
                ),
            },
        },
        col_name: NameGenerator {
            prefix: "_expr_",
            id: IdGenerator {
                next_id: 0,
                phantom: PhantomData,
            },
        },
        table_name: NameGenerator {
            prefix: "table_",
            id: IdGenerator {
                next_id: 0,
                phantom: PhantomData,
            },
        },
        cid: IdGenerator {
            next_id: 2,
            phantom: PhantomData,
        },
        tid: IdGenerator {
            next_id: 1,
            phantom: PhantomData,
        },
        tiid: IdGenerator {
            next_id: 1,
            phantom: PhantomData,
        },
    },
    query: QueryOpts {
        omit_ident_prefix: false,
        pre_projection: false,
        allow_ctes: true,
        allow_stars: true,
    },
    query_stack: [],
    ctes: [],
}
SELECT
  DISTINCT *
FROM
  invoices

-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)
~/dev/prql$ cat example_2.prql
from invoices
group [billing_country] (
  # sort [-invoice_date]
  take 2
)
~/dev/prql$ cat ./example_2.prql | ./prqlc compile
[prql-compiler/src/sql/preprocess.rs:128] &ctx = Context {
    dialect: GenericDialect,
    anchor: AnchorContext {
        column_decls: {
            column-1: RelationColumn(
                TIId(
                    0,
                ),
                column-1,
                Wildcard,
            ),
            column-0: RelationColumn(
                TIId(
                    0,
                ),
                column-0,
                Single(
                    Some(
                        "billing_country",
                    ),
                ),
            ),
        },
        column_names: {},
        table_decls: {
            table-0: SqlTableDecl {
                id: table-0,
                name: Some(
                    "invoices",
                ),
                relation: None,
            },
        },
        table_instances: {
            TIId(
                0,
            ): TableRef {
                source: table-0,
                columns: [
                    (
                        Single(
                            Some(
                                "billing_country",
                            ),
                        ),
                        column-0,
                    ),
                    (
                        Wildcard,
                        column-1,
                    ),
                ],
                name: Some(
                    "invoices",
                ),
            },
        },
        col_name: NameGenerator {
            prefix: "_expr_",
            id: IdGenerator {
                next_id: 0,
                phantom: PhantomData,
            },
        },
        table_name: NameGenerator {
            prefix: "table_",
            id: IdGenerator {
                next_id: 0,
                phantom: PhantomData,
            },
        },
        cid: IdGenerator {
            next_id: 2,
            phantom: PhantomData,
        },
        tid: IdGenerator {
            next_id: 1,
            phantom: PhantomData,
        },
        tiid: IdGenerator {
            next_id: 1,
            phantom: PhantomData,
        },
    },
    query: QueryOpts {
        omit_ident_prefix: false,
        pre_projection: false,
        allow_ctes: true,
        allow_stars: true,
    },
    query_stack: [],
    ctes: [],
}
WITH table_1 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY billing_country) AS _expr_0
  FROM
    invoices
)
SELECT
  *
FROM
  table_1 AS table_0
WHERE
  _expr_0 <= 2

-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)

aljazerzen · 2023-03-10T09:58:29Z

In the preprocess, all the columns are known and identifiable by CIDs.

To get the columns, you can use determine_select_columns on the current pipeline.

Then is just the matter of comparing two lists of cids.

cottrell · 2023-03-10T11:16:00Z

Yeah, it looks like need to do some inspection of the ctx.anchor.colum_decls values, filter out Wildcard and compare to the keys of determine_select_columns (output in screenshot).

The first two are cases to exclude and the third would be the case to allow distinct op

aljazerzen · 2023-03-10T12:02:44Z

I don't think we should be looking into colum_decls decls at all - it does not matter what the columns are, if they are all included in the take's partition, that's a DISTINCT.

(note that Wildcard does not mean all columns, but all unspecified columns)

max-sixty · 2023-03-11T21:42:44Z

Closed by #2109.

Thanks for the guidance @aljazerzen .

I'm also going to reflect on this as a case of #1840. The actual code was fairly easy. But:

knowing about determine_select_columns would have been hard without your guidance.
It's also v helpful to look at the RQ — possible we try and make it so that's the first place people look — lots of this is just RQ transformations, a much easier problem than understanding the full compiler.

So maybe we try and frame issues & tests as "we need to change this RQ into this RQ". Does that resonate at all??

aljazerzen · 2023-03-12T11:39:05Z

knowing about determine_select_columns would have been hard without your guidance

Unfortunately, I don't see a way around this. There is a lot of tools in the toolbox, it's the part of the job to know which tool to use. The best thing we can do here is lots of guidance :D

It's also v helpful to look at the RQ

It's nice to hear that my major refactor that added RQ was worth it. It turns out that looking at architectures of other compilers is a good practice, hehe.

Maybe we should add even more IRs that are even closer to SQL? Just a thought, idk where I'd split it.

So maybe we try and frame issues & tests as "we need to change this RQ into this RQ"

That works for some issues, not all of them. But in general yes, that would probably be helpful for someone to get started.

max-sixty · 2023-03-13T22:28:17Z

Unfortunately, I don't see a way around this. There is a lot of tools in the toolbox, it's the part of the job to know which tool to use. The best thing we can do here is lots of guidance :D

I think with your guidance we're 90% of the way there — this issue was a good case of that.

As you know, one of my priorities is to give you better leverage on your time, which we'd get from contributions requiring less guidance. Letting folks be independent would also be a better contribution experience.

It's possible that we've done everything we can, and I don't have specific examples for where we haven't yet. But in general, I've found there's often lots of potential to make things more local, requiring less context & guidance. RQ is a good example of that I reckon, even if we haven't fully exploited it yet.

I'm going to try and pivot back to the compiler by fixing some of our bugs (like this) — I think we're probably under-weighing how important these are for having the language feel reliable. While I'm doing that, I'll keep an eye open for opportunities on this dimension.

max-sixty added the bug Invalid compiler output or panic label Mar 9, 2023

max-sixty closed this as completed Mar 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong or inconsitent behaviour with `take` #2058

Wrong or inconsitent behaviour with `take` #2058

cottrell commented Mar 8, 2023 •

edited

Loading

eitsupi commented Mar 8, 2023

cottrell commented Mar 8, 2023

cottrell commented Mar 8, 2023

cottrell commented Mar 8, 2023

max-sixty commented Mar 9, 2023 •

edited

Loading

cottrell commented Mar 9, 2023 •

edited

Loading

aljazerzen commented Mar 9, 2023

max-sixty commented Mar 10, 2023

cottrell commented Mar 10, 2023

max-sixty commented Mar 10, 2023

cottrell commented Mar 10, 2023

aljazerzen commented Mar 10, 2023

cottrell commented Mar 10, 2023

aljazerzen commented Mar 10, 2023

max-sixty commented Mar 11, 2023

aljazerzen commented Mar 12, 2023

max-sixty commented Mar 13, 2023

Wrong or inconsitent behaviour with take #2058

Wrong or inconsitent behaviour with take #2058

Comments

cottrell commented Mar 8, 2023 • edited Loading

eitsupi commented Mar 8, 2023

cottrell commented Mar 8, 2023

cottrell commented Mar 8, 2023

cottrell commented Mar 8, 2023

max-sixty commented Mar 9, 2023 • edited Loading

cottrell commented Mar 9, 2023 • edited Loading

aljazerzen commented Mar 9, 2023

max-sixty commented Mar 10, 2023

cottrell commented Mar 10, 2023

max-sixty commented Mar 10, 2023

cottrell commented Mar 10, 2023

aljazerzen commented Mar 10, 2023

cottrell commented Mar 10, 2023

aljazerzen commented Mar 10, 2023

max-sixty commented Mar 11, 2023

aljazerzen commented Mar 12, 2023

max-sixty commented Mar 13, 2023

Wrong or inconsitent behaviour with `take` #2058

Wrong or inconsitent behaviour with `take` #2058

cottrell commented Mar 8, 2023 •

edited

Loading

max-sixty commented Mar 9, 2023 •

edited

Loading

cottrell commented Mar 9, 2023 •

edited

Loading