Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group by is better than distinct #1319

Merged
merged 8 commits into from
Mar 3, 2021
Merged

Conversation

VincentLanglet
Copy link
Member

@VincentLanglet VincentLanglet commented Feb 25, 2021

Subject

I have a Contract entity and a Client entity.
A client can have multiple contracts. A contract can be associated to multiple clients.

In the configureQuery of my ContractAdmin, I'm using a leftJoin('contract.client').
I got a weird behavior:

  • The total of result was 102.
  • I got 4 pages of 32 results (which mean 128 results)
  • Some results were missing (in database the query was returning 133 results, but there was no page 5 on the admin)

Distinct is not a good thing to use with pagination:
If the results are:

-----------
contract | client
-----------
1 | 1
-----------
2 | 1
-----------
2 | 2
-----------
1 | 2
-----------
3 | 1
-----------

With distinct on col 1 and a limit of 2

  • Page 1 return contract 1 and 2
  • Page 2 return contract 2 and 1
  • There is no page 3

With groupby, it's applied BEFORE the pagination, so it became

-----------
contract | client
-----------
1 | 1
-----------
2 | 1
-----------
3 | 1
-----------

and the result are the one expected.

I tried on my project and it works.

Another thing: setDistinct and isDistinct are not in the ProxyQueryInterface.
I think we should deprecate them, and always adding the groupBy in the fixQueryBuilder method.

WDYT @sonata-project/contributors ?

Changelog

### Added
- `Sonata\DoctrineORMAdminBundle\Datagrid\ProxyQueryInterface::getDoctrineQuery()`

### Deprecated
- class `Sonata\DoctrineORMAdminBundle\Datagrid\OrderByToSelectWalker`
- `Sonata\DoctrineORMAdminBundle\Datagrid\ProxyQuery::getFixedQueryBuilder()`
- `Sonata\DoctrineORMAdminBundle\Datagrid\ProxyQuery::getSingleScalarResult()`

### Fixed
- Do not display multiple times the same row in the admin list and the export list.

@VincentLanglet VincentLanglet requested review from a team, phansys, greg0ire and franmomu February 25, 2021 17:30
@greg0ire
Copy link
Contributor

greg0ire commented Feb 25, 2021

This post suggests that you can have issues when using pagination and no sort, but what you are saying is surprising to say the least. I would expect LIMIT to be performed after DISTINCT. You did not mention LIMIT though, you mentioned pagination. Is this maybe based on subqueries? If the limit happens in a subquery, and the DISTINCT is applied on the result, I could imagine this happening. I upvoted your post because I remember being told to use GROUP BY over DISTINCT, but that was for performance reasons.

Can you maybe share the final SQL that is executed when you observe your bug?

Also, your PR adds the GROUP BY, but does not remove the DISTINCT, why is that?

@VincentLanglet
Copy link
Member Author

VincentLanglet commented Feb 25, 2021

Also, your PR adds the GROUP BY, but does not remove the DISTINCT, why is that?

I didn't know if I could remove it.

Can you maybe share the final SQL that is executed when you observe your bug?

Sure

The query page 4:

SELECT
  DISTINCT a0_.id AS id_0,
  a1_.usagename AS usagename_1,
  a1_.firstname AS firstname_2,
  a1_.id AS id_3,
  a0_.id AS id_4
FROM
  asv_contract a0_
  LEFT JOIN asv_client_role a2_ ON a0_.id = a2_.contract_id
  LEFT JOIN asv_external_client_id a3_ ON a2_.external_client_id_id = a3_.id
  LEFT JOIN asv_client a1_ ON a3_.client_id = a1_.id
WHERE
  (
    a1_.portfolio_id IS NULL
    OR a1_.portfolio_id IN (1, 5, 3, 4, 2)
  )
  AND a1_.usagename LIKE '%DELABIE%'
ORDER BY
  a1_.usagename ASC,
  a1_.firstname ASC,
  a1_.id ASC,
  a0_.id ASC
LIMIT
  32 OFFSET 96;

I get 32 results.

If I tried

SELECT
  DISTINCT a0_.id AS id_0,
  a1_.usagename AS usagename_1,
  a1_.firstname AS firstname_2,
  a1_.id AS id_3,
  a0_.id AS id_4
FROM
  asv_contract a0_
  LEFT JOIN asv_client_role a2_ ON a0_.id = a2_.contract_id
  LEFT JOIN asv_external_client_id a3_ ON a2_.external_client_id_id = a3_.id
  LEFT JOIN asv_client a1_ ON a3_.client_id = a1_.id
WHERE
  (
    a1_.portfolio_id IS NULL
    OR a1_.portfolio_id IN (1, 5, 3, 4, 2)
  )
  AND a1_.usagename LIKE '%DELABIE%'
ORDER BY
  a1_.usagename ASC,
  a1_.firstname ASC,
  a1_.id ASC,
  a0_.id ASC

I get 133 results. (But 102 with the group by)

This is because I have multiple rows with the same id with the Distinct:
image

I know understand the following part

        // for SELECT DISTINCT, ORDER BY expressions must appear in idxSelect list
        /* Consider
            SELECT DISTINCT x FROM tab ORDER BY y;
        For any particular x-value in the table there might be many different y
        values.  Which one will you use to sort that x-value in the output?
        */
        $queryId = $queryBuilderId->getQuery();
        $queryId->setHint(Query::HINT_CUSTOM_TREE_WALKERS, [OrderByToSelectWalker::class]);

The distinct is applied on the id AND on the fields used for the order by.
But I'm not sure it makes sens two displays multiple times the same contract in the admin because we don't know which value we should use to order.

The count is made this way:

SELECT COUNT(*) AS dctrn_count
FROM
  (
    SELECT
      DISTINCT id_0
    FROM
      (
        SELECT
          a0_.id AS id_0,
          a0_.admin_comment AS admin_comment_1,
          a0_.external_contract_id AS external_contract_id_2,
          a0_.effective_date AS effective_date_3,
          a0_.end_date AS end_date_4,
          a0_.updated_date AS updated_date_5,
          a0_.sent_to_partner_date AS sent_to_partner_date_6,
          a0_.created_by_grc_date AS created_by_grc_date_7,
          a0_.completed_by_grc_date AS completed_by_grc_date_8,
          a0_.pledge AS pledge_9,
          a0_.management_mandate AS management_mandate_10,
          a0_.contract_status AS contract_status_11,
          a0_.buy_worth AS buy_worth_12,
          a0_.current_advances AS current_advances_13,
          a0_.is_complete AS is_complete_14,
          a0_.has_been_matched_with_created_contract AS has_been_matched_with_created_contract_15,
          a0_.distribution_in_progress AS distribution_in_progress_16,
          a0_.cash_amount AS cash_amount_17,
          a0_.manual_input AS manual_input_18,
          a0_.bank_account_number AS bank_account_number_19,
          a0_.received_at AS received_at_20,
          a0_.current_management_mode_for_partner AS current_management_mode_for_partner_21,
          a0_.management_mode_change_date AS management_mode_change_date_22,
          a0_.business_identifier_code AS business_identifier_code_23,
          a0_.origin AS origin_24,
          a0_.pending_reasons AS pending_reasons_25
        FROM
          asv_contract a0_
          LEFT JOIN asv_client_role a1_ ON a0_.id = a1_.contract_id
          LEFT JOIN asv_external_client_id a2_ ON a1_.external_client_id_id = a2_.id
          LEFT JOIN asv_client a3_ ON a2_.client_id = a3_.id
        WHERE
          (
            a3_.portfolio_id IS NULL
            OR a3_.portfolio_id IN (1, 5, 3, 4, 2)
          )
          AND a3_.usagename LIKE '%DELABIE%'
        ORDER BY
          a3_.usagename ASC,
          a3_.firstname ASC,
          a3_.id ASC
      ) dctrn_result
  ) dctrn_table;

I get 102.

This is done by the Paginator introduced recently in the Pager, that's why I only have the bug now.
You can notice that the Paginator only apply the distinct to the id.

@VincentLanglet VincentLanglet marked this pull request as draft February 25, 2021 18:51
@greg0ire
Copy link
Contributor

greg0ire commented Feb 25, 2021

I'm not sure what your requirements are: do you mean to display one line per contract, or should there be 2 lines if one contract happens to have 2 clients? Anyway, displaying a many-to-many in a table is not going to be simple. Maybe what you want is a GROUP_CONCAT so that you have 16549 | DELABIE | Maud, Jean-Claude?

@VincentLanglet
Copy link
Member Author

I'm not sure what your requirements are: do you mean to display one line per contract, or should there be 2 lines if one contract happens to have 2 clients? Anyway, displaying a many-to-many in a table is not going to be simple.

I wanted to

  • display only the contract I'm allowed to see = the contract of the client I'm allowed to see. This is why there is a join with client, there is also a where client.portfolio in .....
  • order the list by the client firstname/usagename of the contract.

In the SonataAdmin, the query is only returning an array of object, Contract[] here. Then for each contract, I display the contract id in a cell, the contract name in a second cell and all the client of the contract in a third cell (Sonata handle without any issue the OTM or MTM relation in the ListMapper).

The only issue I get is with the query which is returning multiples times the same contract (which I don't want).
I suppose this can be debatable but since, again, we're returning Contract[] I don't see it would be interesting to returning multiple times the same contract since it lead to the same row in the table. (If the contract is owned by client1 and client2, I don't have a row with the contract and client1 and a row with the contract and client2 but 2 rows with the contract and client1 and client2).

And since the Pager returns a count different of the count of my query results, all the results are not displayed, so there is definitely something to do.

@VincentLanglet
Copy link
Member Author

I remembered about this issue #1225
And It was also solved by changing distinct into groupBy.

@VincentLanglet
Copy link
Member Author

The current test error is

 An exception has been thrown during the rendering of a template ("[Semantical Error] line 0, col 147 near 'IDENTITY(o.command)': Error: Cannot group by undefined identification or result variable."). (500 Internal Server Error) 

@VincentLanglet VincentLanglet force-pushed the groupBy branch 4 times, most recently from 483747a to b7e20cc Compare February 25, 2021 23:33
src/Datagrid/OrderByToSelectWalker.php Outdated Show resolved Hide resolved
With groupby, distinct is not needed.
So do the OrderByToSelectWalker
@VincentLanglet VincentLanglet marked this pull request as ready for review February 25, 2021 23:51
@VincentLanglet
Copy link
Member Author

WDYT about also deprecating ProxyQuery::setDistinct and ProxyQuery::isDistinct ? @sonata-project/contributors

@VincentLanglet VincentLanglet requested review from phansys and a team February 25, 2021 23:53
@greg0ire
Copy link
Contributor

greg0ire commented Feb 26, 2021

With groupby, it's applied BEFORE the pagination, so it became

-----------
contract | client
-----------
1 | 1
-----------
2 | 1
-----------
3 | 1
-----------

Ok so… we're losing a lot of information here, are we not? What's your point?

But I'm not sure it makes sens two displays multiple times the same contract
in the admin because we don't know which value we should use to order.

Well we can always use the id of the joined table, can we not? To me, the
issue with displaying the same contract multiple times is that it makes no
sense in a contract list. Or maybe it should be shown like follows:

contract id Usage Name First name client id
16538 DELABIE Guillaume 15696
16549 DELABIE Jean-Claude 15746
DELABIE Maud 15749

Not printing ids that don't change might help the user spot that this is the
same contract. BTW, please don't use screenshots of text,
it makes it super cumbersome for me to reuse your examples.

There's something peculiar in the SQL request you shared: a0_.id is selected
twice, once as id_0, then as id_4.

I wanted to display only the contract I'm allowed to see = the contract of
the client I'm allowed to see. This is why there is a join with client, there
is also a where client.portfolio in

What does that have to do with the issue at hand? My question was more about
whether you think it's best to have one line per contract, or several if that
contract has many clients.

I display the contract id in a cell, the contract name in a second cell and
all the client of the contract in a third cell (Sonata handle without any
issue the OTM or MTM relation in the ListMapper).

Ok so it's the latter, you want one line, and apparently, that already works
somehow.

The only issue I get is with the query which is returning multiples times the
same contract (which I don't want).

If we are talking about the SQL query, I disagree, it's normal, that's how
fetch joins work
.
If we are talking about the DQL query, I agree, it would be best to have one
Contract object hydrated with all its clients if you want to display one line
per contract.

I suppose this can be debatable but since, again, we're returning Contract[]

You mention Contract here, so you must be talking about the DQL query then?

I don't see it would be interesting to returning multiple times the same
contract since it lead to the same row in the table. (If the contract is
owned by client1 and client2, I don't have a row with the contract and
client1 and a row with the contract and client2 but 2 rows with the contract
and client1 and client2).

Ah and now you are complaining about multiple rows… so you were talking about
SQL in the end? It's unclear to me.

Overall I would instinctively favor GROUP BY over DISTINCT too, but I think you can make your case clearer as for why.

@VincentLanglet
Copy link
Member Author

What does that have to do with the issue at hand? My question was more about
whether you think it's best to have one line per contract, or several if that
contract has many clients.

I think it's best to have one line per contract. That's what I was trying to explain.
The row I was talking about were the row of the Admin table, not the row of the SQL results.

If we are talking about the SQL query, I disagree, it's normal, that's how fetch joins work.
If we are talking about the DQL query, I agree, it would be best to have one
Contract object hydrated with all its clients if you want to display one line
per contract.

I know how a join works in SQL, I was talking about the DQL indeed.

Ah and now you are complaining about multiple rows… so you were talking about
SQL in the end? It's unclear to me.

If the contract2 has two clients, the DQL is returning something like

[$contract1, $contract2, $contract3, $contract2]

Both $contract2 are obviously fully hydrated. So in the Admin, I get 4 row instead of 3, and the row (of the admin list) 2 and the row (of the admin list) 4 display the same information.

This is fixed with this PR, now, the result of the query is

[$contract1, $contract2, $contract3]

@VincentLanglet
Copy link
Member Author

@greg0ire I'd like to find a BC way to introduce this change instead of targeting master.

First, I found this one: creating a getFixedQueryBuilder2 private method and deprecating getFixedQueryBuilder.

But then, I discovered that the DataSource::createIterator had similar issues to ProxyQuery::execute.

  • The same results were displayed multiple times in the export
  • The sort were different in the export than in the list (because of the order of addSortBy calls)

So the way we were fixing the queryBuilder need to be public in order to be called in the DataSource::createIterator method.
I think the best way is to provide a getDoctrineQuery method in the public Api.

WDYT about this updated PR ? :)

Copy link
Contributor

@greg0ire greg0ire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good idea 👍

src/Datagrid/ProxyQueryInterface.php Outdated Show resolved Hide resolved
src/Exporter/DataSource.php Outdated Show resolved Hide resolved
@VincentLanglet VincentLanglet requested a review from a team March 2, 2021 08:34
@greg0ire greg0ire merged commit 5c30b1a into sonata-project:3.x Mar 3, 2021
@greg0ire
Copy link
Contributor

greg0ire commented Mar 3, 2021

Thanks @VincentLanglet !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants