stats: reduce overhead of distinct estimation #140772

yuzefovich · 2025-02-08T04:22:42Z

We currently use the hyperloglog library to estimate the number of distinct elements. I just collected a 50s CPU profile (about the time it took for ANALYZE to complete) on a cluster that only had ANALYZE tpcc.customer running, and this distinct estimation is the most expensive part of the stats collection (this was on dbb0baa plus a revert of 2831511 and another commit to introduce a cluster setting for using nil or non-nil DatumAlloc in stats):

[image: non-nil datum alloc]

We should investigate whether it's possible to reduce this overhead. We recently upgraded the hyperloglog library, so there is no quick fix like that :/

There were some ideas floated around that we could avoid this expensive computation altogether for key columns if we were to scan the secondary indexes.
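
To illustrate why a sorted index scan could make this cheap: if the values of a key column arrive in index order, an exact distinct count only needs a comparison against the previous value, with no hashing or sketch updates. This is a minimal sketch under that assumption, not the stats code itself; `countDistinctSorted` is a hypothetical helper and the string keys stand in for encoded datums.

```go
package main

import "fmt"

// countDistinctSorted returns the exact number of distinct values in keys.
// Assumption: keys is already sorted, as rows from an ordered index scan
// would be. Unsorted input would be over-counted.
func countDistinctSorted(keys []string) int {
	distinct := 0
	for i, k := range keys {
		// A new distinct value starts whenever the key differs from the
		// previous one (or at the very first key).
		if i == 0 || k != keys[i-1] {
			distinct++
		}
	}
	return distinct
}

func main() {
	keys := []string{"a", "a", "b", "c", "c", "c"}
	fmt.Println(countDistinctSorted(keys)) // prints 3
}
```

The trade-off is an extra scan per secondary index, so whether this beats the hyperloglog path would need to be measured.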

Related to #135988.

[image: nil datum alloc]

Jira issue: CRDB-47355


yuzefovich added labels Feb 8, 2025: A-sql-table-stats (table statistics and their automatic refresh), C-performance (perf of queries or internals; solution not expected to change functional behavior), T-sql-queries (SQL Queries Team)

yuzefovich added this to SQL Queries Feb 8, 2025

github-project-automation bot moved this to Triage in SQL Queries Feb 8, 2025
