stats: reduce overhead of distinct estimation #140772
Labels:
A-sql-table-stats: Table statistics (and their automatic refresh).
C-performance: Perf of queries or internals. Solution not expected to change functional behavior.
T-sql-queries: SQL Queries Team
We currently use the hyperloglog library to estimate the number of distinct elements. I just collected a 50s CPU profile (about the time it took for ANALYZE to complete) on a cluster that only had ANALYZE tpcc.customer running, and this distinct estimation is the most expensive part of the stats collection (this was on dbb0baa plus a revert of 2831511 and another commit to introduce a cluster setting for using nil or non-nil DatumAlloc in stats):
[CPU profile screenshot: non-nil datum alloc]
We should investigate whether it's possible to reduce this overhead. We recently upgraded the hyperloglog library, so there is no quick fix like that :/
Some ideas were floated that we could avoid this expensive computation altogether for key columns if we scanned the secondary indexes instead.
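To illustrate why an ordered index scan could sidestep the sketch entirely: when values arrive sorted, equal values are adjacent, so an exact distinct count only needs a comparison with the previous value. The snippet below is a minimal sketch of that idea, not the stats code itself. The function name `countDistinctSorted` and the use of plain `int64` keys are assumptions made for illustration; real index keys would be datums or encoded key prefixes.

```go
package main

import "fmt"

// countDistinctSorted returns the exact number of distinct values in a slice
// that is already sorted. Equal values are adjacent in sorted input, so
// comparing each element with its predecessor is enough; no hashing or
// cardinality sketch is involved.
// Hypothetical helper: int64 stands in for whatever key type an ordered
// index scan would actually yield.
func countDistinctSorted(sorted []int64) int {
	if len(sorted) == 0 {
		return 0
	}
	distinct := 1
	for i := 1; i < len(sorted); i++ {
		if sorted[i] != sorted[i-1] {
			distinct++
		}
	}
	return distinct
}

func main() {
	// Made-up key values in the order an index scan would return them.
	keys := []int64{1, 1, 2, 5, 5, 5, 9}
	fmt.Println(countDistinctSorted(keys)) // prints 4
}
```

This only works for columns that form a prefix of some index, since the approach relies on the scan returning values in sorted order; other columns would still need an estimator.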
Related to #135988.
[CPU profile screenshot: nil datum alloc]
Jira issue: CRDB-47355