Use stats to guide query parallelization #49
Comments
I'm going to split this up into multiple enhancement issues. It turns out that Jesse is already working on the stats system table (items 1-4), so I'll split those from the others. Tony, how about you start by adding a new PTableStats interface stored off of PTable? You can dummy up the data until Jesse is ready and figure out how to change ParallelIterators to take advantage of the new byte[] guideposts.
Pushed up a branch with the initial implementation of hbase-stat (from an internal project). Right now, it builds on its own and makes its own jar. If you'd like, James, I can work on a couple more commits on top of it to get it into shape for building along with the rest of Phoenix. However, it wouldn't be all that terrible if we wanted to release it separately from Phoenix. Your call.
Hi Jesse, can you send a pull request to James so we can pull the package into the Phoenix project? Thanks.
We're currently not using stats, beyond a table-wide min key/max key cached per client connection, to guide parallelization. If a query targets just a few regions, we don't know how to evenly divide the work among threads, because we don't know the data distribution. A separate issue covers gathering and maintaining the stats, while this issue is focused on using them.
The main changes are:
This should help boost query performance, especially in cases where the data is highly skewed. The lack of stats-guided parallelization is likely the cause of the slowness reported in #47.
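To make the idea concrete, here is a minimal sketch of how guideposts could be used to divide a scan's key range among threads. It is only an illustration under stated assumptions: the shape of `PTableStats` below (a single `getGuideposts()` method returning sorted `byte[]` keys) and the `GuidepostSplitter` class are hypothetical and are not taken from the Phoenix code base. An empty `byte[]` is treated as an unbounded start or stop key, following the usual HBase convention.

```java
import java.util.ArrayList;
import java.util.List;

class GuidepostSplitter {

    /** Hypothetical stand-in for the proposed PTableStats interface. */
    public interface PTableStats {
        /** Guidepost keys, assumed sorted in unsigned lexicographic order. */
        List<byte[]> getGuideposts();
    }

    /** Unsigned lexicographic comparison of two row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /**
     * Splits [start, stop) at every guidepost that falls strictly inside it.
     * Each returned element is a two-element array {lowerInclusive, upperExclusive}.
     * An empty start or stop key means "unbounded" on that side.
     */
    public static List<byte[][]> split(byte[] start, byte[] stop, PTableStats stats) {
        List<byte[][]> ranges = new ArrayList<>();
        byte[] lower = start;
        for (byte[] gp : stats.getGuideposts()) {
            boolean afterStart = start.length == 0 || compare(gp, start) > 0;
            boolean beforeStop = stop.length == 0 || compare(gp, stop) < 0;
            if (afterStart && beforeStop) {
                ranges.add(new byte[][] {lower, gp});
                lower = gp;
            }
        }
        // The last chunk runs from the final guidepost used (or start) to stop.
        ranges.add(new byte[][] {lower, stop});
        return ranges;
    }
}
```

Each returned range could then be handed to its own scan. If the guideposts are collected at roughly equal data intervals, the chunks should carry similar amounts of work even when the key distribution is skewed, which a split based only on the table-wide min key/max key cannot guarantee.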