Add pre_transform_spec support for inlining transformed data in arrow format #365

Merged · 11 commits into main · Jul 24, 2023

Conversation

jonmmease (Collaborator) commented Jul 23, 2023

Overview

This PR updates the Python pre_transform_spec function to support including transformed inline datasets in the resulting specification dictionary in Arrow format, as any of:

  • pyarrow Table
  • arrow-ipc bytes
  • Base64 encoded arrow-ipc format

The pre_transform_spec function has two new arguments. From the new docstring:

        :param data_encoding_threshold: threshold for encoding datasets
            When length of pre-transformed datasets exceeds data_encoding_threshold, datasets
            are encoded into an alternative format (as determined by the data_encoding_format
            argument). When None (the default), pre-transformed datasets are never encoded and
            are always included as JSON compatible lists of dictionaries.
        :param data_encoding_format: format of encoded datasets
            Format to use to encode datasets with length exceeding the data_encoding_threshold
            argument.
                - "pyarrow": Encode datasets as pyarrow Tables. Not JSON compatible.
                - "arrow-ipc": Encode datasets as bytes in Arrow IPC format. Not JSON compatible.
                - "arrow-ipc-base64": Encode datasets as strings in base64 encoded Arrow IPC format.
                    JSON compatible.
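
For illustration, here's a minimal sketch of the two new arguments used together. It reuses the spec and cars_df defined in the benchmark below, and it assumes the encoded datasets land in each data entry's "values" field (an assumption about the output layout, not something stated in this PR):

import pyarrow as pa
import vegafusion as vf

# Encode any pre-transformed dataset longer than 10,000 rows as a pyarrow Table;
# shorter datasets remain JSON-compatible lists of dictionaries.
(tx_spec, warnings) = vf.runtime.pre_transform_spec(
    spec,
    data_encoding_threshold=10_000,
    data_encoding_format="pyarrow",
    inline_datasets={"cars": cars_df},
)

for data in tx_spec.get("data", []):
    values = data.get("values")
    if isinstance(values, pa.Table):  # assumed location of encoded datasets
        print(data["name"], "->", values.num_rows, "rows as a pyarrow Table")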

Applications

The motivation for these changes is to improve the efficiency of specs that include large inline datasets in certain environments:

  1. For environments that support binary serialization, the "arrow-ipc" format can be used from Python. Then on the JavaScript side the buffers can be deserialized as arrow tables using the apache-arrow JavaScript library. One motivating example is the Altair ChartWidget that's in development, as the Jupyter Widget protocol makes it fairly easy to serialize dictionaries that include inline binary buffers.
  2. For environments that support only JSON serialization, the "arrow-ipc-base64" format can be used to replace the usual list-of-object representation of tables with a base64-encoded string. The JavaScript side can then base64-decode the string and deserialize the table using the apache-arrow JavaScript library (a Python sketch of the equivalent decoding step follows this list). This could be used by the Altair HTML renderer, as well as in dashboarding environments that support only JSON serialization (e.g. Dash).
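
To make the round trip concrete, here's a minimal Python sketch of the decoding step that mirrors what the JavaScript side would do with the apache-arrow library. The helper name is hypothetical, and it assumes the bytes are in the Arrow IPC stream format (pa.ipc.open_file would be the equivalent for the file format):

import base64
import pyarrow as pa

def decode_inline_dataset(encoded: str) -> pa.Table:
    # Decode the base64 string back to Arrow IPC bytes, then read the table
    ipc_bytes = base64.b64decode(encoded)
    return pa.ipc.open_stream(ipc_bytes).read_all()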

Serialization performance comparison

Here's an example of an unaggregated spec with a 1.2 million row dataset.

import vegafusion as vf
import json
import pandas as pd
import pyarrow as pa

# Build a ~1.2 million row dataset by repeating the 406-row cars dataset 3000 times
cars_df = pd.read_json("https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/cars.json")
cars_df = pd.concat([cars_df] * 3000).reset_index()
len(cars_df)

spec = json.loads(r"""
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "background": "white",
  "padding": 5,
  "width": 300,
  "height": 300,
  "style": "cell",
  "data": [
    {
      "name": "source_0",
      "url": "table://cars",
      "format": {"type": "json"},
      "transform": [
        {
          "type": "filter",
          "expr": "isValid(datum[\"Horsepower\"]) && isFinite(+datum[\"Horsepower\"]) && isValid(datum[\"Miles_per_Gallon\"]) && isFinite(+datum[\"Miles_per_Gallon\"]) && isValid(datum[\"Acceleration\"]) && isFinite(+datum[\"Acceleration\"])"
        }
      ]
    }
  ],
  "marks": [
    {
      "name": "marks",
      "type": "symbol",
      "style": ["point"],
      "from": {"data": "source_0"},
      "encode": {
        "update": {
          "opacity": {"value": 0.7},
          "fill": {"value": "transparent"},
          "stroke": {"value": "#4c78a8"},
          "ariaRoleDescription": {"value": "point"},
          "description": {
            "signal": "\"Horsepower: \" + (format(datum[\"Horsepower\"], \"\")) + \"; Miles_per_Gallon: \" + (format(datum[\"Miles_per_Gallon\"], \"\")) + \"; Acceleration: \" + (format(datum[\"Acceleration\"], \"\"))"
          },
          "x": {"scale": "x", "field": "Horsepower"},
          "y": {"scale": "y", "field": "Miles_per_Gallon"},
          "size": {"scale": "size", "field": "Acceleration"}
        }
      }
    }
  ],
  "scales": [
    {
      "name": "x",
      "type": "linear",
      "domain": {"data": "source_0", "field": "Horsepower"},
      "range": [0, {"signal": "width"}],
      "nice": true,
      "zero": true
    },
    {
      "name": "y",
      "type": "linear",
      "domain": {"data": "source_0", "field": "Miles_per_Gallon"},
      "range": [{"signal": "height"}, 0],
      "nice": true,
      "zero": true
    },
    {
      "name": "size",
      "type": "linear",
      "domain": {"data": "source_0", "field": "Acceleration"},
      "range": [0, 361],
      "zero": true
    }
  ],
  "axes": [
    {
      "scale": "x",
      "orient": "bottom",
      "gridScale": "y",
      "grid": true,
      "tickCount": {"signal": "ceil(width/40)"},
      "domain": false,
      "labels": false,
      "aria": false,
      "maxExtent": 0,
      "minExtent": 0,
      "ticks": false,
      "zindex": 0
    },
    {
      "scale": "y",
      "orient": "left",
      "gridScale": "x",
      "grid": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "domain": false,
      "labels": false,
      "aria": false,
      "maxExtent": 0,
      "minExtent": 0,
      "ticks": false,
      "zindex": 0
    },
    {
      "scale": "x",
      "orient": "bottom",
      "grid": false,
      "title": "Horsepower",
      "labelFlush": true,
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(width/40)"},
      "zindex": 0
    },
    {
      "scale": "y",
      "orient": "left",
      "grid": false,
      "title": "Miles_per_Gallon",
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "zindex": 0
    }
  ],
  "legends": [
    {
      "size": "size",
      "symbolType": "circle",
      "title": "Acceleration",
      "encode": {
        "symbols": {
          "update": {
            "fill": {"value": "transparent"},
            "stroke": {"value": "#4c78a8"},
            "opacity": {"value": 0.7}
          }
        }
      }
    }
  ]
}
""")

With the default settings, pre_transform_spec takes 1.8s-2.0s to produce the result dictionary, and converting that dictionary to a string (with the standard-library json module) takes an additional 800ms-1s. The final serialized string is 82,802,805 characters long.

%%time
(tx_spec, warnings) = vf.runtime.pre_transform_spec(
    spec, 
    inline_datasets={"cars": cars_df}
)
CPU times: user 1.54 s, sys: 383 ms, total: 1.92 s
Wall time: 1.93 s
%%time
s = json.dumps(tx_spec)
CPU times: user 832 ms, sys: 32.3 ms, total: 865 ms
Wall time: 869 ms

With arrow-ipc-base64, pre_transform_spec takes 300ms-400ms and the JSON serialization takes 100ms-200ms. The final serialized string is 38,274,026 characters long.

%%time
(tx_spec, warnings) = vf.runtime.pre_transform_spec(
    spec, 
    data_encoding_format="arrow-ipc-base64", 
    data_encoding_threshold=0,
    inline_datasets={"cars": cars_df}
)
CPU times: user 303 ms, sys: 96.6 ms, total: 400 ms
Wall time: 393 ms
%%time
s = json.dumps(tx_spec)
CPU times: user 117 ms, sys: 7.77 ms, total: 125 ms
Wall time: 138 ms

Combined, this is ~5x faster and ~2.2x smaller. While not benchmarked here, the JavaScript deserialization time (from base64 encoded string to arrow table) should also be substantially faster than the default path.

It looks like the arrow-ipc and pyarrow serialization formats are ~100ms faster than arrow-ipc-base64, so the expected performance improvement for the Jupyter Widget case (which supports direct binary serialization over websockets) is even greater.
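
For completeness, the binary path is the same call with a different format. This sketch again reuses spec and cars_df from above; the comment describes the intended use rather than benchmarked behavior:

(tx_spec, warnings) = vf.runtime.pre_transform_spec(
    spec,
    data_encoding_format="arrow-ipc",  # raw Arrow IPC bytes; not JSON compatible
    data_encoding_threshold=0,
    inline_datasets={"cars": cars_df},
)
# The resulting bytes can be shipped as-is over channels that support binary
# buffers (e.g. the Jupyter Widget protocol), skipping the base64 step entirely.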

@jonmmease merged commit 79ddf3f into main on Jul 24, 2023
30 checks passed