
Build sst file in more resource friendly way #486

Closed
jiacai2050 opened this issue Dec 16, 2022 · 3 comments · Fixed by #747
Labels
feature New feature or request

Comments

@jiacai2050
Contributor

Describe This Problem

In the current implementation, sst writing involves two loops:

  1. The first loop calculates the bloom filter of each row group.
  2. The second loop writes the row groups to parquet.

Two loops mean more CPU usage; worse, this may consume too much memory, since all record batches have to stay resident between the two passes.
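To make the cost concrete, here is a minimal, hypothetical sketch of the two-loop shape described above; the `RecordBatch` type and the hash-fold "bloom filter" are illustrative stand-ins, not the real sst writer or filter implementation:

```rust
// Stand-in for an arrow RecordBatch: just a column of row keys.
struct RecordBatch {
    rows: Vec<u64>,
}

// Two-pass build: the whole `batches` vector must stay in memory
// between the two loops, which is the memory problem this issue raises.
fn build_sst_two_pass(batches: &[RecordBatch]) -> (Vec<u64>, usize) {
    // Loop 1: scan every row once to compute a per-group "filter"
    // (modelled here as a simple xor-hash fold over the rows).
    let filters: Vec<u64> = batches
        .iter()
        .map(|b| {
            b.rows
                .iter()
                .fold(0u64, |acc, r| acc ^ r.wrapping_mul(0x9E37_79B9_7F4A_7C15))
        })
        .collect();

    // Loop 2: scan every row again to "encode" them to parquet
    // (modelled as counting the encoded rows).
    let encoded_rows: usize = batches.iter().map(|b| b.rows.len()).sum();

    (filters, encoded_rows)
}
```

Every row is visited twice, and nothing can be dropped until the second loop finishes.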

Proposal

Ideally we can reduce the build procedure to one loop; this depends on apache/arrow-rs#3356.

If this is not possible, then we may need to spill RecordBatches to disk in order to reduce memory consumption.

Additional Context

No response

@jiacai2050
Contributor Author

Glad parquet has accepted my proposal; we can just wait for the version releases of both DataFusion and parquet.

@ShiKaiWi
Member

Proposal

Current procedure:

1. Fetch all record batches from the input stream to organize them in row groups;
2. Build the metadata based on the row groups;
3. Encode all the row groups and obtain the encoded bytes;
4. Upload the encoded bytes into OSS;

New procedure:

while true {
 1. Fetch enough rows from the input stream to form a row group, break if the input stream is exhausted;
 2. Collect the necessary information from the row group for building the final metadata;
 3. Encode the row group and upload the encoded bytes;
}
4. Encode and upload the final metadata to OSS.
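The loop above can be sketched in Rust roughly as follows; the stream, row-group, and upload pieces are hypothetical stand-ins, not the real CeresDB or parquet APIs:

```rust
// Stand-in for an arrow RecordBatch.
struct RecordBatch {
    num_rows: usize,
}

// The small per-group information kept for building the final metadata.
struct RowGroupMeta {
    num_rows: usize,
}

/// Streaming build: fetch enough rows to form a row group, encode and
/// upload it immediately, and only retain the lightweight metadata.
fn build_sst(
    mut input: impl Iterator<Item = RecordBatch>,
    rows_per_group: usize,
) -> Vec<RowGroupMeta> {
    let mut metas = Vec::new();
    loop {
        // 1. Fetch enough rows from the input stream to form a row group.
        let mut buffered = 0;
        let mut group = Vec::new();
        while buffered < rows_per_group {
            match input.next() {
                Some(batch) => {
                    buffered += batch.num_rows;
                    group.push(batch);
                }
                None => break, // input stream exhausted
            }
        }
        if group.is_empty() {
            break;
        }
        // 2. Collect the necessary information for the final metadata.
        metas.push(RowGroupMeta { num_rows: buffered });
        // 3. Encode the row group and upload the bytes (elided here);
        //    `group` is dropped at the end of this iteration, so only
        //    one row group is ever resident in memory.
    }
    // 4. The caller encodes `metas` as the footer and uploads it to OSS.
    metas
}
```

The key property is that peak memory is bounded by one row group rather than the whole input.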

@ShiKaiWi
Member

ShiKaiWi commented Mar 15, 2023

However, after reviewing the API of the parquet writer, async writing has not been supported yet; the related issue is apache/arrow-rs#1269.

To solve this problem, I guess there are two ways:

  • Implement this feature in the upstream, which may take more time to move on.
  • Write a local file first and then upload it to OSS.

Personally, I vote for the first way.
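For reference, the second option (spill locally, then upload) is straightforward to sketch with the standard library; `upload_to_oss` is a hypothetical stand-in for the real object-store client, and the encoded bytes here stand in for what the synchronous parquet writer would produce:

```rust
use std::fs;
use std::io::Write;

// Hypothetical stand-in for the real OSS client; returns bytes uploaded.
fn upload_to_oss(_key: &str, bytes: &[u8]) -> usize {
    bytes.len()
}

/// Write the encoded sst to a local temp file with the sync writer,
/// then upload the finished file to OSS and remove the local spill.
fn write_then_upload(tmp_path: &str, encoded: &[u8]) -> std::io::Result<usize> {
    // 1. Write with the synchronous writer (modelled as raw bytes).
    let mut file = fs::File::create(tmp_path)?;
    file.write_all(encoded)?;
    drop(file); // flush and close before reading back

    // 2. Read the finished file back and upload it in one shot.
    let bytes = fs::read(tmp_path)?;
    let uploaded = upload_to_oss(tmp_path, &bytes);

    // 3. Clean up the local spill file.
    fs::remove_file(tmp_path)?;
    Ok(uploaded)
}
```

The trade-off is extra local disk I/O and temp-file management, which is why waiting on upstream async support is preferred above.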
