
Create a config param and session properties for setting Parquet's max read block row count #15474

Merged: 1 commit merged into trinodb:master on Dec 22, 2022

Conversation

@assaf2 (Member) commented Dec 20, 2022

Description

Recently (#15257) we increased the hard-coded value from 1k to 8k, but in some cases 8k performs much worse than 1k. A larger value may save some CPU time, but it can also lead to more lazy block decoding, an increase in inputDataSize, and significant query performance degradation.
In the future, this param should be adaptive.

The following query took 36.83s on Trino 403 and now takes 59.76s (when the max vector length is 8k):

```sql
SELECT count(c_description60) FROM storesalesflat_mixed WHERE ss_customer_sk > 64992484;
```

The table is based on TPC-DS and contains 10,678,104,288 rows:

```sql
CREATE TABLE hive.perf.storesalesflat_mixed (
    ss_sold_date_sk integer,
    ss_sold_time_sk integer,
    ss_item_sk integer,
    ss_customer_sk integer,
    ss_cdemo_sk integer,
    ss_hdemo_sk integer,
    ss_addr_sk integer,
    ss_store_sk integer,
    ss_promo_sk integer,
    ss_ticket_number bigint,
    ss_quantity integer,
    ss_wholesale_cost double,
    ss_list_price double,
    ss_sales_price double,
    ss_ext_discount_amt double,
    ss_ext_sales_price double,
    ss_ext_wholesale_cost double,
    ss_ext_list_price double,
    ss_ext_tax double,
    ss_coupon_amt double,
    ss_net_paid double,
    ss_net_paid_inc_tax double,
    ss_net_profit double,
    d_date date,
    d_year integer,
    d_quater integer,
    t_am_pm char(2),
    c_description100 varchar(100),
    c_string_num varchar(4),
    c_town varchar(15),
    c_description200 varchar(300),
    c_jibrish30 varchar(30),
    c_description60 varchar(60)
)
```

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Iceberg, Delta
* Allow configuring batch size for reads on parquet files. The configuration property `parquet.max-read-block-row-count` or the catalog session property `parquet_max_read_block_row_count` can be used for this. ({issue}`15474`)
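As an illustration of how the two properties named in the release note would be used (the property names come from the release note itself; the catalog name `hive`, the value `1024`, and the `etc/catalog/hive.properties` path are assumptions based on standard Trino conventions), the batch size could be lowered back to 1k for a single session like this:

```sql
-- Catalog session property, scoped to the current session only
SET SESSION hive.parquet_max_read_block_row_count = 1024;
```

The cluster-wide equivalent would be the line `parquet.max-read-block-row-count=1024` in the catalog's properties file.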

@cla-bot cla-bot bot added the cla-signed label Dec 20, 2022
@assaf2 assaf2 requested a review from raunaqmorarka December 20, 2022 13:16
@assaf2 assaf2 self-assigned this Dec 20, 2022
@skrzypo987 (Member) commented:
What's wrong with using parquet.max-read-block-size to limit the page size if indeed there is a regression?

Can you also share some info about the regressing query? I somehow don't see how this could significantly increase inputDataSize, as it tackles the reader's output.

@raunaqmorarka (Member) replied:

> What's wrong with using parquet.max-read-block-size to limit the page size if indeed there is a regression?
>
> Can you also share some info about the regressing query? I somehow don't see how this could significantly increase inputDataSize, as it tackles the reader's output.

parquet.max-read-block-size is quite hard to tune since it includes the size of the data; row count is a straightforward thing to understand the impact of.
The regressing query has the pattern SELECT expensive_to_decode_column FROM table WHERE (selective predicate on another column).
It's a specific scenario where only a small number of rows survive the filter for every few thousand input rows. When the batch size is small, there are more lazy blocks that don't need to be loaded; when the batch size is big, the chances increase that you need to load a whole lazy block just to get a few rows out of it.
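The effect described above can be sketched with a back-of-the-envelope simulation (a hypothetical Python model with made-up numbers, not Trino code): each surviving row forces the batch-sized lazy block for the expensive column to be decoded, so with a highly selective filter, larger batches decode far more rows per surviving row.

```python
import random

random.seed(0)
TOTAL_ROWS = 1_000_000
SURVIVORS = 500  # ~1 row per 2000 survives the selective predicate

# Positions of rows that pass the filter (uniformly spread, an assumption)
matches = sorted(random.sample(range(TOTAL_ROWS), SURVIVORS))

def batches_loaded(batch_size):
    """Count batches containing at least one surviving row.

    Each such batch forces the lazy block for the expensive column to be
    decoded in full, so more loaded batches means more rows decoded
    (approximating every batch as full, ignoring the partial last batch).
    """
    loaded = {m // batch_size for m in matches}
    rows_decoded = len(loaded) * batch_size
    return len(loaded), rows_decoded

for size in (1024, 8192):
    n, rows = batches_loaded(size)
    print(f"batch={size}: {n} batches loaded, ~{rows} rows decoded")
```

With these toy numbers the 8k batch size decodes roughly twice as many rows of the expensive column as the 1k batch size, matching the direction of the regression reported above.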

@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch 2 times, most recently from 34df627 to ebf1ceb Compare December 20, 2022 14:32
@assaf2 assaf2 requested a review from raunaqmorarka December 20, 2022 14:35
@assaf2 assaf2 changed the title Create a config param for Parquet's max vector length Create a config param and session properties for setting Parquet's max read block row count Dec 20, 2022
@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch 2 times, most recently from 6b1fa2a to b916c6b Compare December 20, 2022 16:24
@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch from b916c6b to b9b4df7 Compare December 21, 2022 08:17
@assaf2 assaf2 requested a review from raunaqmorarka December 21, 2022 08:20
@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch from b9b4df7 to f12e4e5 Compare December 21, 2022 08:47
@assaf2 assaf2 force-pushed the decrease-MAX_VECTOR_LENGTH branch from f12e4e5 to 34b737d Compare December 21, 2022 13:34
@assaf2 assaf2 requested a review from raunaqmorarka December 21, 2022 13:34
Includes config param and session properties for Hive, Iceberg, and Deltalake.
In some cases, the previously hard-coded 8k value is much worse than, for instance, 1k.
A larger value may spare some CPU time but can also lead to more lazy block decoding,
inputDataSize increase, and significant query performance degradation.
In the future, this param should be adaptive.
@raunaqmorarka raunaqmorarka force-pushed the decrease-MAX_VECTOR_LENGTH branch from 34b737d to 5ee8982 Compare December 21, 2022 20:14
@raunaqmorarka raunaqmorarka merged commit 2752540 into trinodb:master Dec 22, 2022
@github-actions github-actions bot added this to the 404 milestone Dec 22, 2022