
Build sst file in more resource friendly way #486

Closed
jiacai2050 opened this issue Dec 16, 2022 · 3 comments · Fixed by #747
Labels
feature New feature or request

Comments

@jiacai2050
Contributor

Describe This Problem

In the current implementation, sst writing involves two loops:

  1. The first loop calculates the bloom filter of each row group.
  2. The second loop writes the row groups to parquet.

Two loops mean more CPU usage; worse, this may consume too much memory, since all record batches have to stay resident between the two passes.
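To make the cost concrete, here is a minimal, hypothetical sketch of the two-loop shape described above; the `RecordBatch` type and the hash-fold "bloom filter" are illustrative stand-ins, not the real sst writer or filter implementation:

```rust
// Stand-in for an arrow RecordBatch: just a column of row keys.
struct RecordBatch {
    rows: Vec<u64>,
}

// Two-pass build: the whole `batches` vector must stay in memory
// between the two loops, which is the memory problem this issue raises.
fn build_sst_two_pass(batches: &[RecordBatch]) -> (Vec<u64>, usize) {
    // Loop 1: scan every row once to compute a per-group "filter"
    // (modelled here as a simple xor-hash fold over the rows).
    let filters: Vec<u64> = batches
        .iter()
        .map(|b| {
            b.rows
                .iter()
                .fold(0u64, |acc, r| acc ^ r.wrapping_mul(0x9E37_79B9_7F4A_7C15))
        })
        .collect();

    // Loop 2: scan every row again to "encode" them to parquet
    // (modelled as counting the encoded rows).
    let encoded_rows: usize = batches.iter().map(|b| b.rows.len()).sum();

    (filters, encoded_rows)
}
```

Every row is visited twice, and nothing can be dropped until the second loop finishes.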

Proposal

Ideally we can reduce the build procedure to one loop; this depends on apache/arrow-rs#3356.

If this is not possible, then we may need to spill RecordBatches to disk in order to reduce memory consumption.

Additional Context

No response

@jiacai2050
Contributor Author

Glad parquet has accepted my proposal; we can just wait for the version releases of both DataFusion and parquet.

@ShiKaiWi
Member

Proposal

Current procedure:

1. Fetch all record batches from the input stream to organize them in row groups;
2. Build the metadata based on the row groups;
3. Encode all the row groups and obtain the encoded bytes;
4. Upload the encoded bytes into OSS;

New procedure:

while true {
 1. Fetch enough rows from the input stream to form a row group, break if the input stream is exhausted;
 2. Collect the necessary information from the row group for building the final metadata;
 3. Encode the row group and upload the encoded bytes;
}
4. Encode and upload the final metadata to OSS.
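The loop above can be sketched in Rust roughly as follows; the stream, row-group, and upload pieces are hypothetical stand-ins, not the real CeresDB or parquet APIs:

```rust
// Stand-in for an arrow RecordBatch.
struct RecordBatch {
    num_rows: usize,
}

// The small per-group information kept for building the final metadata.
struct RowGroupMeta {
    num_rows: usize,
}

/// Streaming build: fetch enough rows to form a row group, encode and
/// upload it immediately, and only retain the lightweight metadata.
fn build_sst(
    mut input: impl Iterator<Item = RecordBatch>,
    rows_per_group: usize,
) -> Vec<RowGroupMeta> {
    let mut metas = Vec::new();
    loop {
        // 1. Fetch enough rows from the input stream to form a row group.
        let mut buffered = 0;
        let mut group = Vec::new();
        while buffered < rows_per_group {
            match input.next() {
                Some(batch) => {
                    buffered += batch.num_rows;
                    group.push(batch);
                }
                None => break, // input stream exhausted
            }
        }
        if group.is_empty() {
            break;
        }
        // 2. Collect the necessary information for the final metadata.
        metas.push(RowGroupMeta { num_rows: buffered });
        // 3. Encode the row group and upload the bytes (elided here);
        //    `group` is dropped at the end of this iteration, so only
        //    one row group is ever resident in memory.
    }
    // 4. The caller encodes `metas` as the footer and uploads it to OSS.
    metas
}
```

The key property is that peak memory is bounded by one row group rather than the whole input.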

@ShiKaiWi
Member

ShiKaiWi commented Mar 15, 2023

However, after reviewing the API of the parquet writer, async writing has not been supported yet; the related issue is apache/arrow-rs#1269.

To solve this problem, I guess there are two ways:

  • Implement this feature in the upstream, which may take more time to move on.
  • Write a local file first and then upload it to OSS.

Personally, I vote for the first way.
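For reference, the second option (spill locally, then upload) is straightforward to sketch with the standard library; `upload_to_oss` is a hypothetical stand-in for the real object-store client, and the encoded bytes here stand in for what the synchronous parquet writer would produce:

```rust
use std::fs;
use std::io::Write;

// Hypothetical stand-in for the real OSS client; returns bytes uploaded.
fn upload_to_oss(_key: &str, bytes: &[u8]) -> usize {
    bytes.len()
}

/// Write the encoded sst to a local temp file with the sync writer,
/// then upload the finished file to OSS and remove the local spill.
fn write_then_upload(tmp_path: &str, encoded: &[u8]) -> std::io::Result<usize> {
    // 1. Write with the synchronous writer (modelled as raw bytes).
    let mut file = fs::File::create(tmp_path)?;
    file.write_all(encoded)?;
    drop(file); // flush and close before reading back

    // 2. Read the finished file back and upload it in one shot.
    let bytes = fs::read(tmp_path)?;
    let uploaded = upload_to_oss(tmp_path, &bytes);

    // 3. Clean up the local spill file.
    fs::remove_file(tmp_path)?;
    Ok(uploaded)
}
```

The trade-off is extra local disk I/O and temp-file management, which is why waiting on upstream async support is preferred above.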
