Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong or inconsitent behaviour with take #2058

Closed
cottrell opened this issue Mar 8, 2023 · 17 comments
Closed

Wrong or inconsitent behaviour with take #2058

cottrell opened this issue Mar 8, 2023 · 17 comments
Labels
bug Invalid compiler output or panic

Comments

@cottrell
Copy link

cottrell commented Mar 8, 2023

I suspect some minor problem in the parser in the case of take 1

A) (wrong)

from invoices
group [billing_city] (
  take 1
)
SELECT
  DISTINCT *
FROM
  invoices

-- Generated by PRQL compiler version:0.5.2 (https://prql-lang.org)

B) (correct?)

from invoices
group [billing_city] (
  take 2
)
WITH table_1 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY billing_city) AS _expr_0
  FROM
    invoices
)
SELECT
  *
FROM
  table_1
WHERE
  _expr_0 <= 2

-- Generated by PRQL compiler version:0.5.2 (https://prql-lang.org)

C) (very wrong)

from invoices
group [billing_city] (
  take 1
)
select [a, b, c]
SELECT
  DISTINCT a,
  b,
  c
FROM
  invoices

-- Generated by PRQL compiler version:0.5.2 (https://prql-lang.org)

D) (correct?)

from invoices
group [billing_city] (
  sort [-invoice_date]
  take 1
)
WITH table_1 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY billing_city
      ORDER BY
        invoice_date DESC
    ) AS _expr_0
  FROM
    invoices
)
SELECT
  *
FROM
  table_1
WHERE
  _expr_0 <= 1

-- Generated by PRQL compiler version:0.5.2 (https://prql-lang.org)
@eitsupi
Copy link
Member

eitsupi commented Mar 8, 2023

Hi, it would be easier to understand the problem better if you write what of output you are expecting.

@cottrell
Copy link
Author

cottrell commented Mar 8, 2023

The behaviour should not really depend on the integer number of things for take.

I have provided many examples to help as clues but perhaps they are distracting.

I think the take 2 (example B) is correct. And take 1 should behave the same way (obviously with "1" replacing "2").

@cottrell
Copy link
Author

cottrell commented Mar 8, 2023

Here is yet another example where "sort" being inserted takes us from error to correct result. The "sort" should have no impact on the result

image

image

I suspect it is something obvious in the "group" command where when everything but "take" op is omitted is has wrong behaviour.

@cottrell
Copy link
Author

cottrell commented Mar 8, 2023

Another example, notice there are MANY entries per group in the first one and only 2 per group in the second one.

image
image

@max-sixty
Copy link
Member

max-sixty commented Mar 9, 2023

It does look like there are some issues here.

(edited to fix a mistake of mine)

This returns reasonable results:

from invoices
select [billing_country, billing_city]
group [billing_city] (
  take 1
)
sort billing_city
WITH table_1 AS (
  SELECT
    DISTINCT billing_country,
    billing_city
  FROM
    invoices
)
SELECT
  billing_country,
  billing_city
FROM
  table_1
ORDER BY
  billing_city

...since taking the distinct of [billing_country, billing_city] returns the same rows as taking the distinct of billing_city.

So this works fine when the distinct over selected columns is the same as the distinct over the group.

But commenting out the select:

from invoices
# select [billing_country, billing_city]
group [billing_city] (
  take 1
)
sort billing_city

...means we get DISTINCT *, which is a very different number of rows — just anything that's not a duplicate.

WITH table_1 AS (
  SELECT
    DISTINCT *
  FROM
    invoices
)
SELECT
  *
FROM
  table_1
ORDER BY
  billing_city

I haven't gone through all of these. It's on my list to attempt an issue template (#2004) so it's easy to fill in the PRQL, incorrect SQL, and fixed SQL. If there are examples that this report doesn't cover, please do add them here in that format.

Possibly the rule here should be: Only use DISTINCT when the group contains all columns ?

Thanks for reporting @cottrell , it's appreciated.


The behaviour should not really depend on the integer number of things for take.

SQL has a DISTINCT keyword which makes the resulting queries much simpler in the case of take 1, so we do try and use that, under the principle that readable SQL is a good goal

@max-sixty max-sixty added the bug Invalid compiler output or panic label Mar 9, 2023
@cottrell
Copy link
Author

cottrell commented Mar 9, 2023

@max-sixty "Possibly the rule here should be: Only use DISTINCT when the group contains all columns ?" That sounds about right. I tried to grok the parser code to see where the "take" rule is because it sounds like a simple logic errors of some fork that happens only in the case of "no cols selected" and "take = 1".

Interesting that https://github.com/PRQL/prql/blob/main/prql-compiler/src/sql/preprocess.rs#L127 ... seems like it should handle this.

update ... oh there is a TODO: from 2022 to handle this. Will have a look
image

@aljazerzen might remember but it's a year since that note I think.

A minimal thought fix is to simply comment that "optimisation" out (I've tested this)

image

@aljazerzen
Copy link
Member

Oh, yes I do remember :D It's a thing that I was putting off for a long time.

#944

At the time it was not convenient to do, but now it may not be so anymore.

@max-sixty
Copy link
Member

We could comment it out. It's a shame to lose the nice SQL we get with distinct, but it's better to be correct. We would need to update the docs too, which refer to DISTINCT (these could also be commented out as a stop-gap).

@cottrell if you want to do that, that'd be good. We can leave a TODO and a reference to this issue.

If we can figure out how how to keep DISTINCT by fixing that TODO, then even better. Though my guess is that it's not trivial, since it's still a TODO!

@cottrell
Copy link
Author

@max-sixty I had a quick look but didn't immediately understand how to access the "by" or the "columns" at that point in the code to do the TODO check. Likely there is some global understanding necessary ... would be good if anyone has ideas about how to do this or ideas about the blockers to sketch them out to have a slightly more indicative TODO.

@max-sixty
Copy link
Member

I had a quick look, and can look more tomorrow. This is code I'm hoping to get into myself more too (@aljazerzen wrote almost all of it).

FWIW the by is a remnant from us using by to specify the columns in a group — it just means the columns in the group.

FYI I added #2080 so I could add some dbg! statements and understand where the information was on the ctx variable.

@cottrell
Copy link
Author

FWIW just did the dbg! print.

~/dev/prql$ cat example_1.prql
from invoices
group [billing_country] (
  # sort [-invoice_date]
  take 1
)

~/dev/prql$ cat ./example_1.prql | ./prqlc compile
[prql-compiler/src/sql/preprocess.rs:128] &ctx = Context {
    dialect: GenericDialect,
    anchor: AnchorContext {
        column_decls: {
            column-0: RelationColumn(
                TIId(
                    0,
                ),
                column-0,
                Single(
                    Some(
                        "billing_country",
                    ),
                ),
            ),
            column-1: RelationColumn(
                TIId(
                    0,
                ),
                column-1,
                Wildcard,
            ),
        },
        column_names: {},
        table_decls: {
            table-0: SqlTableDecl {
                id: table-0,
                name: Some(
                    "invoices",
                ),
                relation: None,
            },
        },
        table_instances: {
            TIId(
                0,
            ): TableRef {
                source: table-0,
                columns: [
                    (
                        Single(
                            Some(
                                "billing_country",
                            ),
                        ),
                        column-0,
                    ),
                    (
                        Wildcard,
                        column-1,
                    ),
                ],
                name: Some(
                    "invoices",
                ),
            },
        },
        col_name: NameGenerator {
            prefix: "_expr_",
            id: IdGenerator {
                next_id: 0,
                phantom: PhantomData,
            },
        },
        table_name: NameGenerator {
            prefix: "table_",
            id: IdGenerator {
                next_id: 0,
                phantom: PhantomData,
            },
        },
        cid: IdGenerator {
            next_id: 2,
            phantom: PhantomData,
        },
        tid: IdGenerator {
            next_id: 1,
            phantom: PhantomData,
        },
        tiid: IdGenerator {
            next_id: 1,
            phantom: PhantomData,
        },
    },
    query: QueryOpts {
        omit_ident_prefix: false,
        pre_projection: false,
        allow_ctes: true,
        allow_stars: true,
    },
    query_stack: [],
    ctes: [],
}
SELECT
  DISTINCT *
FROM
  invoices

-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)
~/dev/prql$ cat example_2.prql
from invoices
group [billing_country] (
  # sort [-invoice_date]
  take 2
)
~/dev/prql$ cat ./example_2.prql | ./prqlc compile
[prql-compiler/src/sql/preprocess.rs:128] &ctx = Context {
    dialect: GenericDialect,
    anchor: AnchorContext {
        column_decls: {
            column-1: RelationColumn(
                TIId(
                    0,
                ),
                column-1,
                Wildcard,
            ),
            column-0: RelationColumn(
                TIId(
                    0,
                ),
                column-0,
                Single(
                    Some(
                        "billing_country",
                    ),
                ),
            ),
        },
        column_names: {},
        table_decls: {
            table-0: SqlTableDecl {
                id: table-0,
                name: Some(
                    "invoices",
                ),
                relation: None,
            },
        },
        table_instances: {
            TIId(
                0,
            ): TableRef {
                source: table-0,
                columns: [
                    (
                        Single(
                            Some(
                                "billing_country",
                            ),
                        ),
                        column-0,
                    ),
                    (
                        Wildcard,
                        column-1,
                    ),
                ],
                name: Some(
                    "invoices",
                ),
            },
        },
        col_name: NameGenerator {
            prefix: "_expr_",
            id: IdGenerator {
                next_id: 0,
                phantom: PhantomData,
            },
        },
        table_name: NameGenerator {
            prefix: "table_",
            id: IdGenerator {
                next_id: 0,
                phantom: PhantomData,
            },
        },
        cid: IdGenerator {
            next_id: 2,
            phantom: PhantomData,
        },
        tid: IdGenerator {
            next_id: 1,
            phantom: PhantomData,
        },
        tiid: IdGenerator {
            next_id: 1,
            phantom: PhantomData,
        },
    },
    query: QueryOpts {
        omit_ident_prefix: false,
        pre_projection: false,
        allow_ctes: true,
        allow_stars: true,
    },
    query_stack: [],
    ctes: [],
}
WITH table_1 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY billing_country) AS _expr_0
  FROM
    invoices
)
SELECT
  *
FROM
  table_1 AS table_0
WHERE
  _expr_0 <= 2

-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)

@aljazerzen
Copy link
Member

In the preprocess, all the columns are known and identifiable by CIDs.

To get the columns, you can use determine_select_columns on the current pipeline.

Then is just the matter of comparing two lists of cids.

@cottrell
Copy link
Author

Yeah, it looks like need to do some inspection of the ctx.anchor.colum_decls values, filter out Wildcard and compare to the keys of determine_select_columns (output in screenshot).

The first two are cases to exclude and the third would be the case to allow distinct op

image

@aljazerzen
Copy link
Member

I don't think we should be looking into colum_decls decls at all - it does not matter what the columns are, if they are all included in the take's partition, that's a DISTINCT.

(note that Wildcard does not mean all columns, but all unspecified columns)

@max-sixty
Copy link
Member

Closed by #2109.

Thanks for the guidance @aljazerzen .

I'm also going to reflect on this as a case of #1840. The actual code was fairly easy. But:

  • knowing about determine_select_columns would have been hard without your guidance.
  • It's also v helpful to look at the RQ — possible we try and make it so that's the first place people look — lots of this is just RQ transformations, a much easier problem than understanding the full compiler.

So maybe we try and frame issues & tests as "we need to change this RQ into this RQ". Does that resonate at all??

@aljazerzen
Copy link
Member

knowing about determine_select_columns would have been hard without your guidance

Unfortunately, I don't see a way around this. There is a lot of tools in the toolbox, it's the part of the job to know which tool to use. The best thing we can do here is lots of guidance :D

It's also v helpful to look at the RQ

It's nice to hear that my major refactor that added RQ was worth it. It turns out that looking at architectures of other compilers is a good practice, hehe.

Maybe we should add even more IRs that are even closer to SQL? Just a thought, idk where I'd split it.

So maybe we try and frame issues & tests as "we need to change this RQ into this RQ"

That works for some issues, not all of them. But in general yes, that would probably be helpful for someone to get started.

@max-sixty
Copy link
Member

Unfortunately, I don't see a way around this. There is a lot of tools in the toolbox, it's the part of the job to know which tool to use. The best thing we can do here is lots of guidance :D

I think with your guidance we're 90% of the way there — this issue was a good case of that.

As you know, one of my priorities is to give you better leverage on your time, which we'd get from contributions requiring less guidance. Letting folks be independent would also be a better contribution experience.

It's possible that we've done everything we can, and I don't have specific examples for where we haven't yet. But in general, I've found there's often lots of potential to make things more local, requiring less context & guidance. RQ is a good example of that I reckon, even if we haven't fully exploited it yet.

I'm going to try and pivot back to the compiler by fixing some of our bugs (like this) — I think we're probably under-weighing how important these are for having the language feel reliable. While I'm doing that, I'll keep an eye open for opportunities on this dimension.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Invalid compiler output or panic
Projects
None yet
Development

No branches or pull requests

4 participants