Move to auto flushing API based on memory usage #809

Open
gingerwizard opened this issue Nov 3, 2022 · 0 comments
gingerwizard commented Nov 3, 2022

Is your feature request related to a problem? Please describe.

Currently, the user is required to control when a batch is flushed or sent via the flush and send methods. This requires them to estimate the memory overhead of a batch and tune the number of rows they append, while also bearing in mind that larger batches offer better insert performance.
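For reference, a minimal sketch of today's manual pattern with the v2 driver (the table, column types and row count are illustrative):

```go
package example

import (
	"context"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func insertExample(ctx context.Context) error {
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"localhost:9000"},
	})
	if err != nil {
		return err
	}
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO example (id, value)")
	if err != nil {
		return err
	}
	for i := 0; i < 1_000_000; i++ {
		// The caller has to guess how many rows fit in memory: the row count,
		// not the encoded byte size, is the only knob available today.
		if err := batch.Append(uint64(i), "some value"); err != nil {
			return err
		}
	}
	return batch.Send()
}
```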

Ideally, the user would simply specify a buffer size in bytes. Once this is exceeded (or possibly a timeout passes), the buffer should be flushed and written to the wire.
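One possible shape for such an API, purely as a sketch - the option names below are hypothetical and do not exist in the driver today:

```go
// Hypothetical options, shown only to illustrate the requested behaviour -
// neither WithAutoFlushBytes nor WithAutoFlushInterval exists in the driver.
batch, err := conn.PrepareBatch(ctx, "INSERT INTO example (id, value)",
	WithAutoFlushBytes(64<<20),           // flush once encoded buffers exceed ~64 MiB
	WithAutoFlushInterval(5*time.Second), // or when the timeout passes, whichever comes first
)
```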

Describe the solution you'd like

This is tricky with the current implementation: today we append rows to a block and only call encode() when flush or send is invoked. Compression is performed after encoding (across all columns, though we will improve this - see #755 - to reduce the memory overhead of compression). We don't know the size of the buffer until encoding is complete, since some types, e.g. strings, are variable length.
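To illustrate why the size is only known after encoding (this is not the driver's actual encoder, just a sketch of the native String layout): a String value is written as a uvarint length prefix followed by the raw bytes, so the encoded size of a column depends on its values, not just the row count.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodedStringSize returns the on-wire size of one native-format String
// value: a uvarint length prefix plus the raw bytes.
func encodedStringSize(s string) int {
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(lenBuf[:], uint64(len(s)))
	return n + len(s)
}

func main() {
	fmt.Println(encodedStringSize("a"))                       // 2
	fmt.Println(encodedStringSize("a much longer value ...")) // 24
}
```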

We also need to be able to encode column by column, since this is how the native format works.

The solution here is probably to encode on append rather than as a separate step. However, to ensure we still encode column by column, we will need a separate buffer per column.

Once the total size of these buffers exceeds the configured limit, we'll flush them in order.
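A rough sketch of that shape, assuming values are serialised at append time; all names here are illustrative and not the driver's internal API:

```go
package sketch

import (
	"bytes"
	"fmt"
)

// columnBuffer holds the already-encoded bytes for a single column.
type columnBuffer struct {
	name string
	buf  bytes.Buffer
}

// autoFlushBatch flushes once the combined size of its column buffers
// crosses maxBytes.
type autoFlushBatch struct {
	columns  []*columnBuffer
	maxBytes int
}

func (b *autoFlushBatch) appendRow(encodedValues [][]byte) error {
	if len(encodedValues) != len(b.columns) {
		return fmt.Errorf("expected %d columns, got %d", len(b.columns), len(encodedValues))
	}
	// Encode-on-append: each value arrives already serialised and is written
	// straight into its column's buffer.
	for i, v := range encodedValues {
		b.columns[i].buf.Write(v)
	}
	total := 0
	for _, c := range b.columns {
		total += c.buf.Len()
	}
	if total >= b.maxBytes {
		return b.flush()
	}
	return nil
}

// flush writes the column buffers out in column order, which is what the
// native format requires, then resets them.
func (b *autoFlushBatch) flush() error {
	for _, c := range b.columns {
		fmt.Printf("writing column %q (%d bytes)\n", c.name, c.buf.Len())
		// ... compress and write c.buf to the connection here ...
		c.buf.Reset()
	}
	return nil
}
```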

Advantages

This will also address #755, since we can compress each column one at a time, further reducing memory overhead.

Memory overhead will also be reduced because we no longer need to hold both a buffer and an object representation of each column.

We are adding a check under #808 to avoid excessive memory usage by compressing column by column. That limit, MaxCompressionBuffer, can still be exceeded on large batches because we only check after encoding each column. The approach above would allow us to adhere to memory limits more strictly. Being able to control memory tightly is important in many use cases.

The user experience should also be better, since users can bound their memory footprint directly without trial-and-error testing.

Cons

We'll need to benchmark and test. Having throughput benchmarks in place is probably a precursor.
