Add pre_transform_spec support for inlining transformed data in arrow format #365

Merged · 11 commits into main · Jul 24, 2023

Conversation

jonmmease (Collaborator) commented Jul 23, 2023

Overview

This PR updates the Python pre_transform_spec function to support including transformed inline datasets in the resulting specification dictionary in Arrow format, as any of:

  • pyarrow Table
  • arrow-ipc bytes
  • Base64 encoded arrow-ipc format

The pre_transform_spec function has two new arguments. From the new docstring:

        :param data_encoding_threshold: threshold for encoding datasets
            When length of pre-transformed datasets exceeds data_encoding_threshold, datasets
            are encoded into an alternative format (as determined by the data_encoding_format
            argument). When None (the default), pre-transformed datasets are never encoded and
            are always included as JSON compatible lists of dictionaries.
        :param data_encoding_format: format of encoded datasets
            Format to use to encode datasets with length exceeding the data_encoding_threshold
            argument.
                - "pyarrow": Encode datasets as pyarrow Tables. Not JSON compatible.
                - "arrow-ipc": Encode datasets as bytes in Arrow IPC format. Not JSON compatible.
                - "arrow-ipc-base64": Encode datasets as strings in base64 encoded Arrow IPC format.
                    JSON compatible.
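
For illustration, here's a minimal sketch of the two new arguments used together. It reuses the spec and cars_df defined in the benchmark below, and it assumes the encoded datasets land in each data entry's "values" field (an assumption about the output layout, not something stated in this PR):

import pyarrow as pa
import vegafusion as vf

# Encode any pre-transformed dataset longer than 10,000 rows as a pyarrow Table;
# shorter datasets remain JSON-compatible lists of dictionaries.
(tx_spec, warnings) = vf.runtime.pre_transform_spec(
    spec,
    data_encoding_threshold=10_000,
    data_encoding_format="pyarrow",
    inline_datasets={"cars": cars_df},
)

for data in tx_spec.get("data", []):
    values = data.get("values")
    if isinstance(values, pa.Table):  # assumed location of encoded datasets
        print(data["name"], "->", values.num_rows, "rows as a pyarrow Table")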

Applications

The motivation for these changes is to improve the efficiency of specs that include large inline datasets in certain environments:

  1. For environments that support binary serialization, the "arrow-ipc" format can be used from Python. Then on the JavaScript side the buffers can be deserialized as arrow tables using the apache-arrow JavaScript library. One motivating example is the Altair ChartWidget that's in development, as the Jupyter Widget protocol makes it fairly easy to serialize dictionaries that include inline binary buffers.
  2. For environments that support only JSON serialization, the "arrow-ipc-base64" format can be used to replace the usual list-of-object representation of tables with a base64-encoded string. The JavaScript side can then base64-decode the string and deserialize the table using the apache-arrow JavaScript library (a Python sketch of the equivalent decoding step follows this list). This could be used by the Altair HTML renderer, as well as in dashboarding environments that support only JSON serialization (e.g. Dash).
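
To make the round trip concrete, here's a minimal Python sketch of the decoding step that mirrors what the JavaScript side would do with the apache-arrow library. The helper name is hypothetical, and it assumes the bytes are in the Arrow IPC stream format (pa.ipc.open_file would be the equivalent for the file format):

import base64
import pyarrow as pa

def decode_inline_dataset(encoded: str) -> pa.Table:
    # Decode the base64 string back to Arrow IPC bytes, then read the table
    ipc_bytes = base64.b64decode(encoded)
    return pa.ipc.open_stream(ipc_bytes).read_all()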

Serialization performance comparison

Here's an example of an unaggregated spec with a 1.2 million row dataset.

import vegafusion as vf
import json
import pandas as pd
import pyarrow as pa

# Build a ~1.2 million row dataset by repeating the 406-row cars dataset 3000 times
cars_df = pd.read_json("https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/cars.json")
cars_df = pd.concat([cars_df] * 3000).reset_index()
len(cars_df)

spec = json.loads(r"""
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "background": "white",
  "padding": 5,
  "width": 300,
  "height": 300,
  "style": "cell",
  "data": [
    {
      "name": "source_0",
      "url": "table://cars",
      "format": {"type": "json"},
      "transform": [
        {
          "type": "filter",
          "expr": "isValid(datum[\"Horsepower\"]) && isFinite(+datum[\"Horsepower\"]) && isValid(datum[\"Miles_per_Gallon\"]) && isFinite(+datum[\"Miles_per_Gallon\"]) && isValid(datum[\"Acceleration\"]) && isFinite(+datum[\"Acceleration\"])"
        }
      ]
    }
  ],
  "marks": [
    {
      "name": "marks",
      "type": "symbol",
      "style": ["point"],
      "from": {"data": "source_0"},
      "encode": {
        "update": {
          "opacity": {"value": 0.7},
          "fill": {"value": "transparent"},
          "stroke": {"value": "#4c78a8"},
          "ariaRoleDescription": {"value": "point"},
          "description": {
            "signal": "\"Horsepower: \" + (format(datum[\"Horsepower\"], \"\")) + \"; Miles_per_Gallon: \" + (format(datum[\"Miles_per_Gallon\"], \"\")) + \"; Acceleration: \" + (format(datum[\"Acceleration\"], \"\"))"
          },
          "x": {"scale": "x", "field": "Horsepower"},
          "y": {"scale": "y", "field": "Miles_per_Gallon"},
          "size": {"scale": "size", "field": "Acceleration"}
        }
      }
    }
  ],
  "scales": [
    {
      "name": "x",
      "type": "linear",
      "domain": {"data": "source_0", "field": "Horsepower"},
      "range": [0, {"signal": "width"}],
      "nice": true,
      "zero": true
    },
    {
      "name": "y",
      "type": "linear",
      "domain": {"data": "source_0", "field": "Miles_per_Gallon"},
      "range": [{"signal": "height"}, 0],
      "nice": true,
      "zero": true
    },
    {
      "name": "size",
      "type": "linear",
      "domain": {"data": "source_0", "field": "Acceleration"},
      "range": [0, 361],
      "zero": true
    }
  ],
  "axes": [
    {
      "scale": "x",
      "orient": "bottom",
      "gridScale": "y",
      "grid": true,
      "tickCount": {"signal": "ceil(width/40)"},
      "domain": false,
      "labels": false,
      "aria": false,
      "maxExtent": 0,
      "minExtent": 0,
      "ticks": false,
      "zindex": 0
    },
    {
      "scale": "y",
      "orient": "left",
      "gridScale": "x",
      "grid": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "domain": false,
      "labels": false,
      "aria": false,
      "maxExtent": 0,
      "minExtent": 0,
      "ticks": false,
      "zindex": 0
    },
    {
      "scale": "x",
      "orient": "bottom",
      "grid": false,
      "title": "Horsepower",
      "labelFlush": true,
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(width/40)"},
      "zindex": 0
    },
    {
      "scale": "y",
      "orient": "left",
      "grid": false,
      "title": "Miles_per_Gallon",
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "zindex": 0
    }
  ],
  "legends": [
    {
      "size": "size",
      "symbolType": "circle",
      "title": "Acceleration",
      "encode": {
        "symbols": {
          "update": {
            "fill": {"value": "transparent"},
            "stroke": {"value": "#4c78a8"},
            "opacity": {"value": 0.7}
          }
        }
      }
    }
  ]
}
""")

With the default settings, pre_transform_spec takes 1.8s-2.0s to produce the result dictionary, and converting that dictionary to a string (with the standard-library json module) takes an additional 800ms-1s. The final serialized string is 82,802,805 characters long.

%%time
(tx_spec, warnings) = vf.runtime.pre_transform_spec(
    spec, 
    inline_datasets={"cars": cars_df}
)
CPU times: user 1.54 s, sys: 383 ms, total: 1.92 s
Wall time: 1.93 s
%%time
s = json.dumps(tx_spec)
CPU times: user 832 ms, sys: 32.3 ms, total: 865 ms
Wall time: 869 ms

With arrow-ipc-base64, pre_transform_spec takes 300ms-400ms and the JSON serialization takes 100ms-200ms. The final serialized string is 38,274,026 characters long.

%%time
(tx_spec, warnings) = vf.runtime.pre_transform_spec(
    spec, 
    data_encoding_format="arrow-ipc-base64", 
    data_encoding_threshold=0,
    inline_datasets={"cars": cars_df}
)
CPU times: user 303 ms, sys: 96.6 ms, total: 400 ms
Wall time: 393 ms
%%time
s = json.dumps(tx_spec)
CPU times: user 117 ms, sys: 7.77 ms, total: 125 ms
Wall time: 138 ms

Combined, this is ~5x faster and ~2.2x smaller. While not benchmarked here, the JavaScript deserialization time (from base64 encoded string to arrow table) should also be substantially faster than the default path.

It looks like the arrow-ipc and pyarrow serialization formats are ~100ms faster than arrow-ipc-base64, so the expected performance improvement for the Jupyter Widget case (which supports direct binary serialization over websockets) is even greater.
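
For completeness, the binary path is the same call with a different format. This sketch again reuses spec and cars_df from above; the comment describes the intended use rather than benchmarked behavior:

(tx_spec, warnings) = vf.runtime.pre_transform_spec(
    spec,
    data_encoding_format="arrow-ipc",  # raw Arrow IPC bytes; not JSON compatible
    data_encoding_threshold=0,
    inline_datasets={"cars": cars_df},
)
# The resulting bytes can be shipped as-is over channels that support binary
# buffers (e.g. the Jupyter Widget protocol), skipping the base64 step entirely.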

@jonmmease merged commit 79ddf3f into main on Jul 24, 2023
30 checks passed