Is your feature request related to a problem? Please describe.
Currently, the user must control when a batch is flushed or sent via the flush and send methods. This requires them to estimate the memory overhead of a batch and tune the number of rows they add, while also bearing in mind that larger batches offer better insert performance.
Ideally, the user would simply specify a buffer size in bytes. Once this size is exceeded (or possibly once a timeout elapses), the buffer would be flushed and written to the wire.
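To make the desired behaviour concrete, here is a minimal sketch of a size-and-age-triggered flush. The `sizedBatch` type, its fields, and the `send` callback are all hypothetical names invented for illustration; they are not part of the current client API. The age check only runs on append, so a real implementation would likely also need a timer.

```go
package main

import "time"

// sizedBatch is a hypothetical wrapper for illustration only. It accumulates
// already-encoded bytes and flushes once a byte limit is exceeded or the
// oldest pending data has waited too long.
type sizedBatch struct {
	maxBytes int
	maxAge   time.Duration // <= 0 disables the age check
	pending  []byte
	firstAt  time.Time
	send     func([]byte) error // must not retain the slice after returning
}

func newSizedBatch(maxBytes int, maxAge time.Duration, send func([]byte) error) *sizedBatch {
	return &sizedBatch{maxBytes: maxBytes, maxAge: maxAge, send: send}
}

// Append adds encoded bytes and flushes if the size or age limit is hit.
func (b *sizedBatch) Append(encoded []byte) error {
	if len(b.pending) == 0 {
		b.firstAt = time.Now() // start the age clock at the first pending byte
	}
	b.pending = append(b.pending, encoded...)
	tooBig := len(b.pending) > b.maxBytes
	tooOld := b.maxAge > 0 && time.Since(b.firstAt) >= b.maxAge
	if tooBig || tooOld {
		return b.Flush()
	}
	return nil
}

// Flush hands the pending bytes to send and reuses the backing array.
func (b *sizedBatch) Flush() error {
	if len(b.pending) == 0 {
		return nil
	}
	err := b.send(b.pending)
	b.pending = b.pending[:0]
	return err
}
```

With this shape the user picks `maxBytes` once and never counts rows; the trade-off is that the bytes must already be encoded at append time, which is what the proposed solution addresses.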
Describe the solution you'd like
This is tricky with the current implementation, because we append rows to a block and only call encode() when flush or send is invoked. Compression is performed after encoding, across all the columns at once (we will improve this to reduce the memory overhead of compression; see #755). We don't know the size of the buffer until encoding is complete, since some types, e.g. strings, are variable length.
We also need to be able to encode column by column, since this is how the native format works.
The solution here is probably to encode on append rather than as a separate step. However, to ensure we still encode column by column, we will need a separate buffer per column.
Once the total size of the buffers exceeds the configured limit, we'll flush them in order.
Advantages
This will also address #755, since we can compress one column at a time, which also reduces memory overhead.
Memory overhead will also be reduced because we no longer store both a buffer and an object representation of each column.
We are adding a check under #808 to avoid excessive memory usage by compressing column by column. The limit, MaxCompressionBuffer, can still be exceeded on large batches, as we only check it after encoding each column. The above approach would allow us to adhere to memory limits more strictly. Being able to control memory tightly is important in many use cases.
The user experience should also improve, as users can tune their memory footprint without needing to test.
Cons
We'll need to benchmark and test. Having throughput benchmarks in place is probably a prerequisite.