
Use stats to guide query parallelization #49

Open
jtaylor-sfdc opened this issue Feb 14, 2013 · 4 comments

@jtaylor-sfdc
Contributor

We're currently not using stats (beyond a table-wide min key/max key cached per client connection) to guide parallelization. If a query targets just a few regions, we can't evenly divide the work among threads, because we don't know the data distribution. A separate issue targets gathering and maintaining the stats, while this issue is focused on using them.

The main changes are:

  1. Create a PTableStats interface that encapsulates the stats information (and implements the Writable interface so that it can be serialized back from the server). A rough sketch follows this list.
  2. Add a stats member variable to PTable to hold this.
  3. From MetaDataEndPointImpl, look up the stats row for the table in the stats table. If the stats have changed, return a new PTable with the updated stats information. We may want to cache the stats row and have the stats gatherer invalidate the cached row when it's updated, so we don't always have to do a scan for it. Additionally, it would be ideal if we could use the same split policy on the stats table that we use on the system table, to guarantee co-location of data (for the sake of caching).
  4. Modify the client-side parallelization (ParallelIterators.getSplits()) to use this information to guide how the scans are chunked up at query time.
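
A rough sketch of what that interface might look like; the guidepost representation (sorted byte[] split keys per region, keyed by region start key) is my assumption, not a settled design:

    import java.util.List;
    import java.util.SortedMap;
    import org.apache.hadoop.io.Writable;

    // Sketch only: encapsulates per-table stats and is serialized back
    // from the server via the Writable interface. Names are illustrative.
    public interface PTableStats extends Writable {
        // Guideposts per region: sorted keys that subdivide a region into
        // roughly equal-sized chunks, keyed by region start key (the map
        // should be ordered with Bytes.BYTES_COMPARATOR).
        SortedMap<byte[], List<byte[]>> getGuidePosts();
    }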

This should help boost query performance, especially in cases where the data is highly skewed. It's likely the cause of the slowness reported in #47.

@ghost ghost assigned tonyhuang Feb 14, 2013
@jtaylor-sfdc jtaylor-sfdc mentioned this issue Feb 20, 2013
@jtaylor-sfdc
Contributor Author

I'm going to split this up into multiple enhancement issues. It turns out that Jesse is already working on the stats system table, so I'll split that out from the other items (1-4 above). Tony - how about you start by adding the new PTableStats interface stored off of PTable? You can dummy up the data until Jesse is ready, and figure out how to change ParallelIterators to take advantage of the new byte[] guideposts; a rough sketch of that chunking is below.
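
Not the actual implementation, but here's a hypothetical illustration of how ParallelIterators.getSplits() could carve a region's key range into chunks using guideposts (class and method shape are made up):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: turns a region's [start, end) range plus the
    // sorted guideposts inside it into one (startKey, endKey) pair per chunk.
    final class GuidepostSplitter {
        static List<byte[][]> getSplits(byte[] regionStart, byte[] regionEnd,
                                        List<byte[]> guideposts) {
            List<byte[][]> splits = new ArrayList<byte[][]>();
            byte[] lower = regionStart;
            for (byte[] gp : guideposts) {
                splits.add(new byte[][] { lower, gp }); // scan [lower, gp)
                lower = gp;
            }
            splits.add(new byte[][] { lower, regionEnd }); // final chunk
            return splits;
        }
    }

Each chunk would then become one scan handed to its own thread, so even a query that hits a single skewed region can be parallelized.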

@jyates
Contributor

jyates commented Apr 5, 2013

Pushed up a branch with the initial implementation of hbase-stat (from an internal project). Right now it builds on its own and produces its own jar. If you'd like, James, I can work on a couple more commits on top of it to get it into shape for building along with the rest of Phoenix. However, it wouldn't be all that terrible if we wanted to release it separately from Phoenix. Your call.

@tonyhuang
Contributor

Hi Jesse, can you send a pull request to James so we can pull the package into the Phoenix project?

Thanks
Tony

@mujtabachohan
Contributor

Stats needed to decide when to use Skip Scan

Schema
CREATE TABLE T (HOST VARCHAR NOT NULL, DATE DATE NOT NULL, CF1.A BIGINT, CF1.B BIGINT, CF2.C BIGINT CONSTRAINT PK PRIMARY KEY (HOST, DATE))

Query example
SELECT COUNT(1) FROM T WHERE HOST LIKE '?%' AND DATE > CURRENT_DATE() - ?;
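
For concreteness, a query of this shape might be issued through Phoenix's JDBC driver roughly like this (the connection URL, bind values, and class name are assumptions for the example):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SkipScanQueryExample {
        public static void main(String[] args) throws Exception {
            // "localhost" stands in for the cluster's ZooKeeper quorum.
            Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT COUNT(1) FROM T WHERE HOST LIKE ? AND DATE > CURRENT_DATE() - ?");
            stmt.setString(1, "sf%"); // leading-PK prefix filter
            stmt.setInt(2, 7);        // trailing-PK filter: last 7 days
            ResultSet rs = stmt.executeQuery();
            if (rs.next()) {
                System.out.println("count = " + rs.getLong(1));
            }
            conn.close();
        }
    }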

Factor NOT affecting performance
Percentage of rows filtered by the leading PK: same performance curve as a range scan.

Factors affecting performance

Percentage of rows filtered by the trailing PK: affected by the cardinality of the leading PK. With a low-cardinality leading PK, the skip scan gets faster as the number of rows matching the filter decreases; at worst (all rows match) it is equal to a scan. With a high-cardinality leading PK, it is always slower than a scan and shows the same performance curve as a range scan.

Cardinality of the leading PK: as elaborated above, skip scan performance depends on the cardinality of the leading PK. Performance gets better as the number of rows filtered by the trailing PK gets lower.

Therefore the main factor affecting skip scan performance is the cardinality of the leading PK (see the graph below, which shows query time on the Y axis and the percentage of rows filtered by the trailing PK on the X axis). Our test results show that the skip scan should be used when the cardinality of the leading PK is less than 75% (i.e. fewer than 3 out of every 4 leading PK values are unique). A sketch of that decision rule follows the test note below.

[Graph: query time (Y axis) vs. percentage of rows filtered by trailing PK (X axis)]
Legend: P1.2 - Phoenix 1.2 using Skip Scan; P1.1 - Phoenix 1.1 using Range Scan

Note: this test used 20M rows on a 4-machine cluster running HBase 0.94.6.
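
Expressed as a hypothetical decision rule (the 0.75 threshold is just the crossover point measured above; the helper and its inputs are made up):

    // Hypothetical decision rule based on the measurements above.
    final class SkipScanHeuristic {
        // leadingPkCardinality: number of distinct leading-PK values;
        // rowCount: total rows in the table (both would come from stats).
        static boolean shouldUseSkipScan(long leadingPkCardinality, long rowCount) {
            return rowCount > 0
                && (double) leadingPkCardinality / rowCount < 0.75;
        }
    }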
