Add pre_transform_spec support for inlining transformed data in arrow format #365
Overview
This PR updates the Python pre_transform_spec function to support including transformed inline datasets in the resulting specification dictionary in arrow format as:
The pre_transform_spec function has two new arguments. From the new docstring
Applications
The motivation for these changes is to improve the efficiency of specs that include large inline datasets in certain environments:
apache-arrow JavaScript library. One motivating example is the AltairChartWidget that's in development, as the Jupyter Widget protocol makes it fairly easy to serialize dictionaries that include inline binary buffers.
Serialization performance comparison
Here's an example of an unaggregated spec with a 1.2 million row dataset.
With the default settings, pre_transform_spec takes 1.8s-2.0s to convert the result to a dictionary, plus an additional 800ms-1s to convert the dictionary to a string (with the default Python json module). The final serialized string length is 82,802,805 characters.
With arrow-ipc-base64, pre_transform_spec takes 300ms-400ms, and the JSON serialization takes 100ms-200ms. The final serialized string length is 38,274,026 characters.
Combined, this is ~5x faster and ~2.2x smaller. While not benchmarked here, the JavaScript deserialization time (from base64-encoded string to arrow table) should also be substantially faster than the default path.
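As a quick sanity check on the combined figures, the reported ranges can be plugged into a few lines of Python. This is only arithmetic on the numbers quoted above; the variable names are illustrative and nothing here re-runs the benchmark.

```python
# Timings (seconds) and string lengths (characters) as reported above.
# Each tuple is (fastest, slowest) for dict conversion + JSON serialization.
default_total = (1.8 + 0.8, 2.0 + 1.0)
arrow_total = (0.3 + 0.1, 0.4 + 0.2)

# Compare the midpoints of the two ranges.
default_mid = sum(default_total) / 2
arrow_mid = sum(arrow_total) / 2
speedup = default_mid / arrow_mid

# Ratio of the serialized string lengths.
size_ratio = 82_802_805 / 38_274_026

print(f"speedup ~{speedup:.1f}x, size ratio ~{size_ratio:.2f}x")
```

Depending on which ends of the ranges are paired, the speedup falls between roughly 4x and 7x, so the ~5x figure is a reasonable summary.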
It looks like the arrow-ipc and pyarrow serialization formats are ~100ms faster than arrow-ipc-base64, so the expected performance improvement for the Jupyter Widget case (which supports direct binary serialization over websockets) is even greater.
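Part of the gap between the binary and base64 formats is inherent to base64 itself, which emits 4 output bytes for every 3 input bytes. A minimal sketch using only the standard library, with random bytes standing in for an Arrow IPC payload (the payload is a hypothetical placeholder, not real Arrow data):

```python
import base64
import os

# Random bytes as a stand-in for a binary Arrow IPC payload (hypothetical data).
raw = os.urandom(3_000_000)

# base64 maps every 3 input bytes to 4 ASCII characters.
encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw)

print(f"raw: {len(raw)} bytes, base64: {len(encoded)} bytes, overhead: {overhead:.2f}x")
```

A transport that can carry the raw buffer directly avoids both this size increase and the encode/decode step.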