
Create a config param and session properties for setting Parquet's max read block row count #15474

Merged: 1 commit merged into trinodb:master on Dec 22, 2022

Conversation

@assaf2 (Member) commented Dec 20, 2022

Description

Recently (#15257) we increased the hard-coded value from 1k to 8k, but in some cases 8k performs much worse than 1k. A larger value may save some CPU time, but it can also lead to more lazy block decoding, an increase in inputDataSize, and significant query performance degradation.
In the future, this param should be adaptive.

The following query took 36.83s on Trino 403 and now takes 59.76s (when the max vector length is 8k):

```sql
SELECT count(c_description60) FROM storesalesflat_mixed WHERE ss_customer_sk > 64992484;
```

The table is based on TPC-DS and contains 10,678,104,288 rows:

```sql
CREATE TABLE hive.perf.storesalesflat_mixed (
    ss_sold_date_sk integer,
    ss_sold_time_sk integer,
    ss_item_sk integer,
    ss_customer_sk integer,
    ss_cdemo_sk integer,
    ss_hdemo_sk integer,
    ss_addr_sk integer,
    ss_store_sk integer,
    ss_promo_sk integer,
    ss_ticket_number bigint,
    ss_quantity integer,
    ss_wholesale_cost double,
    ss_list_price double,
    ss_sales_price double,
    ss_ext_discount_amt double,
    ss_ext_sales_price double,
    ss_ext_wholesale_cost double,
    ss_ext_list_price double,
    ss_ext_tax double,
    ss_coupon_amt double,
    ss_net_paid double,
    ss_net_paid_inc_tax double,
    ss_net_profit double,
    d_date date,
    d_year integer,
    d_quater integer,
    t_am_pm char(2),
    c_description100 varchar(100),
    c_string_num varchar(4),
    c_town varchar(15),
    c_description200 varchar(300),
    c_jibrish30 varchar(30),
    c_description60 varchar(60)
)
```

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Iceberg, Delta
* Allow configuring batch size for reads on parquet files. The configuration property `parquet.max-read-block-row-count` or the catalog session property `parquet_max_read_block_row_count` can be used for this. ({issue}`15474`)
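As an illustration of how the two properties named in the release note would be used (the property names come from the release note itself; the catalog name `hive`, the value `1024`, and the `etc/catalog/hive.properties` path are assumptions based on standard Trino conventions), the batch size could be lowered back to 1k for a single session like this:

```sql
-- Catalog session property, scoped to the current session only
SET SESSION hive.parquet_max_read_block_row_count = 1024;
```

The cluster-wide equivalent would be the line `parquet.max-read-block-row-count=1024` in the catalog's properties file.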

@cla-bot cla-bot bot added the cla-signed label Dec 20, 2022
@assaf2 assaf2 requested a review from raunaqmorarka December 20, 2022 13:16
@assaf2 assaf2 self-assigned this Dec 20, 2022
@skrzypo987 (Member) commented:
What's wrong with using parquet.max-read-block-size to limit the page size if indeed there is a regression?

Can you also share some info about the regressing query? I somehow don't see how this could significantly increase inputDataSize, as it tackles the reader's output.

@raunaqmorarka (Member) replied:

> What's wrong with using parquet.max-read-block-size to limit the page size if indeed there is a regression?
>
> Can you also share some info about the regressing query? I somehow don't see how this could significantly increase inputDataSize, as it tackles the reader's output.

parquet.max-read-block-size is quite hard to tune since it includes the size of the data; row count is a straightforward thing to understand the impact of.
The regressing query has the pattern SELECT expensive_to_decode_column FROM table WHERE (selective predicate on another column).
It's a specific scenario where only a small number of rows survive the filter for every few thousand input rows. When the batch size is small, there are more lazy blocks that don't need to be loaded; when the batch size is big, the chances increase that you need to load a whole lazy block just to get a few rows out of it.
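The effect described above can be sketched with a back-of-the-envelope simulation (a hypothetical Python model with made-up numbers, not Trino code): each surviving row forces the batch-sized lazy block for the expensive column to be decoded, so with a highly selective filter, larger batches decode far more rows per surviving row.

```python
import random

random.seed(0)
TOTAL_ROWS = 1_000_000
SURVIVORS = 500  # ~1 row per 2000 survives the selective predicate

# Positions of rows that pass the filter (uniformly spread, an assumption)
matches = sorted(random.sample(range(TOTAL_ROWS), SURVIVORS))

def batches_loaded(batch_size):
    """Count batches containing at least one surviving row.

    Each such batch forces the lazy block for the expensive column to be
    decoded in full, so more loaded batches means more rows decoded
    (approximating every batch as full, ignoring the partial last batch).
    """
    loaded = {m // batch_size for m in matches}
    rows_decoded = len(loaded) * batch_size
    return len(loaded), rows_decoded

for size in (1024, 8192):
    n, rows = batches_loaded(size)
    print(f"batch={size}: {n} batches loaded, ~{rows} rows decoded")
```

With these toy numbers the 8k batch size decodes roughly twice as many rows of the expensive column as the 1k batch size, matching the direction of the regression reported above.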

@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch 2 times, most recently from 34df627 to ebf1ceb Compare December 20, 2022 14:32
@assaf2 assaf2 requested a review from raunaqmorarka December 20, 2022 14:35
@assaf2 assaf2 changed the title Create a config param for Parquet's max vector length Create a config param and session properties for setting Parquet's max read block row count Dec 20, 2022
@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch 2 times, most recently from 6b1fa2a to b916c6b Compare December 20, 2022 16:24
@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch from b916c6b to b9b4df7 Compare December 21, 2022 08:17
@assaf2 assaf2 requested a review from raunaqmorarka December 21, 2022 08:20
@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch from b9b4df7 to f12e4e5 Compare December 21, 2022 08:47
@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch from f12e4e5 to 34b737d Compare December 21, 2022 13:34
@assaf2 assaf2 requested a review from raunaqmorarka December 21, 2022 13:34
Includes config param and session properties for Hive, Iceberg, and Deltalake.
In some cases, the previously hard-coded 8k value is much worse than, for instance, 1k.
A larger value may spare some CPU time but can also lead to more lazy block decoding,
inputDataSize increase, and significant query performance degradation.
In the future, this param should be adaptive.
@raunaqmorarka raunaqmorarka force-pushed the decrease-MAX_VECTOR_LENGTH branch from 34b737d to 5ee8982 Compare December 21, 2022 20:14
@raunaqmorarka raunaqmorarka merged commit 2752540 into trinodb:master Dec 22, 2022
@github-actions github-actions bot added this to the 404 milestone Dec 22, 2022