Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server-generated column statistics #4680

Merged
merged 39 commits into from
Nov 2, 2023

Conversation

niloc132
Copy link
Member

Reintroduces the column statistics feature from DHE. This code is copied, with a few changes to how it behaves and how it functions:

  • Results are sent to the client in a static table rather than a simple object payload.
  • Client can control the max number of unique values to display, up to a point (previously the max was technically 19, new default is 20).
  • Non-Comparable objects now have their "popularity" counts as well as Comparable, until a max limit is reached. Only the most common 20 entries will be sent to the client.
  • Deephaven "null" values are used to represent that a standard deviation or average cannot be computed.

Note that until #188 is solved, the JS API and web UI will not display the unique value list.

Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proto changes look structurally sound; but I have a larger question. Are we fundamentally performing computations here that can't be performed with our existing APIs / table operations? If the answer is "yes", would it be better to try and improve our fundamental operations to handle this use case? ColumnStatisticsRequest looks like a heavy hammer, and I'm sad to see the bespoke logic in server/src/main/java/io/deephaven/server/table/stats/. (I understand this is mostly a copy of DHE code, but worth raising Qs IMO.)

Comment on lines +614 to +616
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need implementation?

Copy link
Member Author

@niloc132 niloc132 Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - every generated implementation looks like this. If you were to implement a python server, you would override this method and offer a real implementation.

@niloc132
Copy link
Member Author

@devinrsmith

Are we fundamentally performing computations here that can't be performed with our existing APIs / table operations?

Yes, and deliberately so - we do not have a "Count" operation that only counts non-nulls, and we do not have a "unique" operation that only selects the N most unique. Neither is impossible to add, but @rcaudy specifically prefers that for now we just stick to a specialized implementation like DHE has.

As one alternative, we could implement the two above aggregations, which would give us these same operations, except now ticking - however, the only consumer of this API is the web UI's column stats popup, which doesn't expect a Table instance, and so doesn't support ticking data. So the only win here would be that we would implement two new aggregation types, and then not need the rest of this new code.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a review of everything but the JS* classes.

@rcaudy
Copy link
Member

rcaudy commented Oct 26, 2023

@devinrsmith

Are we fundamentally performing computations here that can't be performed with our existing APIs / table operations?

Yes, and deliberately so - we do not have a "Count" operation that only counts non-nulls, and we do not have a "unique" operation that only selects the N most unique. Neither is impossible to add, but @rcaudy specifically prefers that for now we just stick to a specialized implementation like DHE has.

As one alternative, we could implement the two above aggregations, which would give us these same operations, except now ticking - however, the only consumer of this API is the web UI's column stats popup, which doesn't expect a Table instance, and so doesn't support ticking data. So the only win here would be that we would implement two new aggregation types, and then not need the rest of this new code.

We discussed this at some length before doing the porting. I would prefer to use our aggs, and thus standardize on one set of code for this. What convinced me that we should just port is:

  1. The client cannot consume ticking stats (if it could, I would insist we use aggs)
  2. This code can be "tighter", only scanning each chunk a single time
  3. We can ship faster this way

niloc132 and others added 4 commits October 27, 2023 10:06
…sticsGrpcImpl.java

Co-authored-by: Ryan Caudy <rcaudy@gmail.com>
Co-authored-by: Ryan Caudy <rcaudy@gmail.com>
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor bug, and some code cleanup suggestions.

@niloc132 niloc132 merged commit 7066260 into deephaven:main Nov 2, 2023
10 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants