Move to auto flushing API based on memory usage #809

Open
gingerwizard opened this issue Nov 3, 2022 · 0 comments
gingerwizard commented Nov 3, 2022

Is your feature request related to a problem? Please describe.

Currently, the user is required to control when a batch is flushed or sent via the flush and send methods. This requires them to estimate the memory overhead of a batch and tune the number of rows they append, while also bearing in mind that larger batches offer better insert performance.
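For reference, a minimal sketch of today's manual pattern with the v2 driver (the table, column types and row count are illustrative):

```go
package example

import (
	"context"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func insertExample(ctx context.Context) error {
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"localhost:9000"},
	})
	if err != nil {
		return err
	}
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO example (id, value)")
	if err != nil {
		return err
	}
	for i := 0; i < 1_000_000; i++ {
		// The caller has to guess how many rows fit in memory: the row count,
		// not the encoded byte size, is the only knob available today.
		if err := batch.Append(uint64(i), "some value"); err != nil {
			return err
		}
	}
	return batch.Send()
}
```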

Ideally, the user would simply specify a buffer size in bytes. Once this is exceeded (or possibly a timeout passes), the buffer should be flushed and written to the wire.
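One possible shape for such an API, purely as a sketch - the option names below are hypothetical and do not exist in the driver today:

```go
// Hypothetical options, shown only to illustrate the requested behaviour -
// neither WithAutoFlushBytes nor WithAutoFlushInterval exists in the driver.
batch, err := conn.PrepareBatch(ctx, "INSERT INTO example (id, value)",
	WithAutoFlushBytes(64<<20),           // flush once encoded buffers exceed ~64 MiB
	WithAutoFlushInterval(5*time.Second), // or when the timeout passes, whichever comes first
)
```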

Describe the solution you'd like

This is tricky with the current implementation: today we append rows to a block and only call encode() when flush or send is invoked. Compression is performed after encoding (across all columns, though we will improve this - see #755 - to reduce the memory overhead of compression). We don't know the size of the buffer until encoding is complete, since some types, e.g. strings, are variable length.
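To illustrate why the size is only known after encoding (this is not the driver's actual encoder, just a sketch of the native String layout): a String value is written as a uvarint length prefix followed by the raw bytes, so the encoded size of a column depends on its values, not just the row count.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodedStringSize returns the on-wire size of one native-format String
// value: a uvarint length prefix plus the raw bytes.
func encodedStringSize(s string) int {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(s)))
	return n + len(s)
}

func main() {
	fmt.Println(encodedStringSize("a"))                       // 2
	fmt.Println(encodedStringSize("a much longer value ...")) // 24
}
```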

We also need to be able to encode column by column, since this is how the native format works.

The solution here is probably to encode on append rather than as a separate step. However, to ensure we still encode column by column, we will need a separate buffer per column.

Once the total size of these buffers exceeds the configured limit, we'll flush them in order.
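A rough sketch of that shape, assuming values are serialised at append time; all names here are illustrative and not the driver's internal API:

```go
package sketch

import (
	"bytes"
	"fmt"
)

// columnBuffer holds the already-encoded bytes for a single column.
type columnBuffer struct {
	name string
	buf  bytes.Buffer
}

// autoFlushBatch flushes once the combined size of its column buffers
// crosses maxBytes.
type autoFlushBatch struct {
	columns  []*columnBuffer
	maxBytes int
}

func (b *autoFlushBatch) appendRow(encodedValues [][]byte) error {
	if len(encodedValues) != len(b.columns) {
		return fmt.Errorf("expected %d columns, got %d", len(b.columns), len(encodedValues))
	}
	// Encode-on-append: each value arrives already serialised and is written
	// straight into its column's buffer.
	for i, v := range encodedValues {
		b.columns[i].buf.Write(v)
	}
	total := 0
	for _, c := range b.columns {
		total += c.buf.Len()
	}
	if total >= b.maxBytes {
		return b.flush()
	}
	return nil
}

// flush writes the column buffers out in column order, which is what the
// native format requires, then resets them.
func (b *autoFlushBatch) flush() error {
	for _, c := range b.columns {
		fmt.Printf("writing column %q (%d bytes)\n", c.name, c.buf.Len())
		// ... compress and write c.buf to the connection here ...
		c.buf.Reset()
	}
	return nil
}
```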

Advantages

This will also address #755, since we can compress each column one at a time, further reducing memory overhead.

Memory overhead will also be reduced because we no longer need to hold both a buffer and an object representation of each column.

We are adding a check under #808 to avoid excessive memory usage by compressing column by column. That limit, MaxCompressionBuffer, can still be exceeded on large batches because we only check after encoding each column. The approach above would allow us to adhere to memory limits more strictly. Being able to control memory tightly is important in many use cases.

The user experience should also be better, since users can bound their memory footprint directly without trial-and-error testing.

Cons

We'll need to benchmark and test. Having throughput benchmarks in place is probably a precursor.
