stats: reduce overhead of distinct estimation #140772

Open
yuzefovich opened this issue Feb 8, 2025 · 0 comments
Labels
A-sql-table-stats Table statistics (and their automatic refresh). C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team

Comments

yuzefovich (Member) commented Feb 8, 2025

We currently use the hyperloglog library to estimate the number of distinct elements. I just collected a 50s CPU profile (about the time it took for ANALYZE to complete) on a cluster that only had ANALYZE tpcc.customer running, and this distinct estimation is the most expensive part of the stats collection (this was on dbb0baa plus a revert of 2831511 and another commit to introduce a cluster setting for using nil or non-nil DatumAlloc in stats):

Image: non-nil datum alloc

We should investigate whether it's possible to reduce this overhead. We recently upgraded the hyperloglog library, so there is no quick fix like that :/
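
For context, here is a minimal sketch of the kind of per-value work a HyperLogLog estimator does. It is not the code of the hyperloglog library we use; the `sketch` type, `newSketch`, `add`, `estimate` and `mix` are hypothetical names, and the hash (FNV-1a plus a 64-bit finalizer) and the register count are assumptions picked for illustration. The point is only that every value is hashed and then folded into one register, so the cost scales with the number of rows sampled.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

// sketch is a toy HyperLogLog with 2^p one-byte registers (hypothetical type).
type sketch struct {
	p    uint8
	regs []uint8
}

func newSketch(p uint8) *sketch {
	return &sketch{p: p, regs: make([]uint8, 1<<p)}
}

// mix is a 64-bit finalizer used to spread the FNV output over all bits.
func mix(x uint64) uint64 {
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// add hashes v, picks a register from the top p bits, and stores the
// largest "position of the first 1 bit" seen in the remaining bits.
func (s *sketch) add(v []byte) {
	h := fnv.New64a()
	h.Write(v)
	x := mix(h.Sum64())
	idx := x >> (64 - s.p)
	rank := uint8(bits.LeadingZeros64(x<<s.p)) + 1
	if limit := 64 - s.p + 1; rank > limit {
		rank = limit
	}
	if rank > s.regs[idx] {
		s.regs[idx] = rank
	}
}

// estimate applies the standard raw estimator, falling back to linear
// counting while many registers are still empty.
func (s *sketch) estimate() float64 {
	m := float64(len(s.regs))
	sum, zeros := 0.0, 0
	for _, r := range s.regs {
		sum += math.Ldexp(1, -int(r))
		if r == 0 {
			zeros++
		}
	}
	alpha := 0.7213 / (1 + 1.079/m) // valid for m >= 128
	e := alpha * m * m / sum
	if e <= 2.5*m && zeros > 0 {
		e = m * math.Log(m/float64(zeros))
	}
	return e
}

func main() {
	s := newSketch(12)
	for i := 0; i < 100000; i++ {
		s.add([]byte(fmt.Sprintf("customer-%d", i)))
	}
	fmt.Printf("estimated distinct count: %.0f\n", s.estimate())
}
```

With 4096 registers the expected relative error is on the order of a couple of percent, so the printed estimate should land near 100000 without matching it exactly.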

There were some ideas floated around that we could avoid this expensive computation altogether for key columns if we were to scan the secondary indexes.
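
One possible reading of that idea, as a sketch only: an index scan returns its key column in sorted order, so the distinct count can be computed exactly by comparing each value with the previous one, with no hashing or sketch at all. `countDistinctSorted` is a hypothetical helper, and the assumption that the input arrives already sorted is mine, not something the issue spells out.

```go
package main

import "fmt"

// countDistinctSorted returns the exact number of distinct values in keys.
// It assumes keys is already sorted (as values read from an index would be),
// so equal values are adjacent and one comparison per element is enough.
func countDistinctSorted(keys []string) int {
	n := 0
	for i, k := range keys {
		if i == 0 || k != keys[i-1] {
			n++
		}
	}
	return n
}

func main() {
	keys := []string{"a", "a", "b", "c", "c", "c"}
	fmt.Println(countDistinctSorted(keys))
}
```

This trades the per-row hashing cost for an extra scan of each secondary index, so whether it is a net win would need to be measured.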

Related to #135988.

Image: nil datum alloc

Jira issue: CRDB-47355

@yuzefovich yuzefovich added A-sql-table-stats Table statistics (and their automatic refresh). C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team labels Feb 8, 2025
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Feb 8, 2025