Add pre_transform_datasets functionality #141

jonmmease · 2022-07-22T10:23:33Z

Overview

This PR adds a new high-level method to the Python Runtime called pre_transform_datasets. Docstring:

        Extract the fully evaluated form of the requested datasets from a Vega specification
        as pandas DataFrames.

        :param spec: A Vega specification
        :param datasets: A list with elements that are either:
          - The name of a top-level dataset as a string
          - A two-element tuple where the first element is the name of a dataset as a string
            and the second element is the nested scope of the dataset as a list of integers
        :param local_tz: Name of timezone to be considered local. E.g. 'America/New_York'.
            This can be computed for the local system using the tzlocal package and the
            tzlocal.get_localzone_name() function.
        :param default_input_tz: Name of timezone (e.g. 'America/New_York') that naive datetime
            strings should be interpreted in. Defaults to `local_tz`.
        :param inline_datasets: A dict from dataset names to pandas DataFrames or pyarrow
            Tables. Inline datasets may be referenced by the input specification using
            the following url syntax 'vegafusion+dataset://{dataset_name}'.
        :return:
            Two-element tuple:
                0. List of pandas DataFrames corresponding to the input datasets list
                1. A list of warnings as dictionaries. Each warning dict has a 'type'
                   key indicating the warning type, and a 'message' key containing
                   a description of the warning.

pre_transform_datasets makes it possible to extract transformed datasets in Python.
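
As a rough sketch of how the `datasets` and `inline_datasets` arguments from the docstring fit together, the snippet below builds a small spec that reads from an inline dataset and lists both a top-level and a nested dataset request. The dataset names (`movies`, `source_0`, `facet_cell`), the scope `[0]`, and the helper `inline_dataset_url` are made up for illustration; only the URL syntax and the argument shapes come from the docstring. The runtime call itself is left as a comment so the sketch runs without vegafusion installed.

```python
import json


def inline_dataset_url(dataset_name):
    """Build the URL that refers to an entry of `inline_datasets` (hypothetical helper)."""
    return f"vegafusion+dataset://{dataset_name}"


# Minimal spec whose only dataset reads from the inline dataset named "movies".
spec = {
    "$schema": "https://vega.github.io/schema/vega/v5.json",
    "data": [{"name": "source_0", "url": inline_dataset_url("movies")}],
}

# Each requested dataset is either a name (top-level) or a (name, scope) tuple,
# where scope is a list of integers locating a nested dataset.
requested = [
    "source_0",           # top-level dataset
    ("facet_cell", [0]),  # hypothetical nested dataset; not present in `spec` above
]

spec_json = json.dumps(spec)
print(spec["data"][0]["url"])

# With vegafusion and pandas installed, the call would look roughly like:
#   import pandas as pd, vegafusion as vf
#   movies_df = pd.DataFrame({"IMDB Rating": [6.1, 7.4]})
#   dfs, warnings = vf.runtime.pre_transform_datasets(
#       spec_json, ["source_0"], "UTC", inline_datasets={"movies": movies_df}
#   )
```

Only `"source_0"` is passed in the commented call because, per the Validation section, requesting a dataset that is absent from the spec raises an exception.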

Example

# imports
import json
from altair.vega import vega
import vegafusion as vf
import pandas as pd

local_tz = "UTC"

full_spec = r"""
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "background": "white",
  "padding": 5,
  "width": 200,
  "height": 200,
  "style": "cell",
  "data": [
    {
      "name": "source_0",
      "url": "https://raw.githubusercontent.com/vega/vega-datasets/master/data/movies.json",
      "format": {"type": "json"},
      "transform": [
        {
          "type": "extent",
          "field": "IMDB Rating",
          "signal": "bin_maxbins_10_IMDB_Rating_extent"
        },
        {
          "type": "bin",
          "field": "IMDB Rating",
          "as": [
            "bin_maxbins_10_IMDB Rating",
            "bin_maxbins_10_IMDB Rating_end"
          ],
          "signal": "bin_maxbins_10_IMDB_Rating_bins",
          "extent": {"signal": "bin_maxbins_10_IMDB_Rating_extent"},
          "maxbins": 10
        },
        {
          "type": "aggregate",
          "groupby": [
            "bin_maxbins_10_IMDB Rating",
            "bin_maxbins_10_IMDB Rating_end"
          ],
          "ops": ["count"],
          "fields": [null],
          "as": ["__count"]
        },
        {
          "type": "filter",
          "expr": "isValid(datum[\"bin_maxbins_10_IMDB Rating\"]) && isFinite(+datum[\"bin_maxbins_10_IMDB Rating\"])"
        }
      ]
    }
  ],
  "marks": [
    {
      "name": "marks",
      "type": "rect",
      "style": ["bar"],
      "from": {"data": "source_0"},
      "encode": {
        "update": {
          "fill": {"value": "#4c78a8"},
          "ariaRoleDescription": {"value": "bar"},
          "description": {
            "signal": "\"IMDB Rating (binned): \" + (!isValid(datum[\"bin_maxbins_10_IMDB Rating\"]) || !isFinite(+datum[\"bin_maxbins_10_IMDB Rating\"]) ? \"null\" : format(datum[\"bin_maxbins_10_IMDB Rating\"], \"\") + \" – \" + format(datum[\"bin_maxbins_10_IMDB Rating_end\"], \"\")) + \"; Count of Records: \" + (format(datum[\"__count\"], \"\"))"
          },
          "x2": {
            "scale": "x",
            "field": "bin_maxbins_10_IMDB Rating",
            "offset": 1
          },
          "x": {"scale": "x", "field": "bin_maxbins_10_IMDB Rating_end"},
          "y": {"scale": "y", "field": "__count"},
          "y2": {"scale": "y", "value": 0}
        }
      }
    }
  ],
  "scales": [
    {
      "name": "x",
      "type": "linear",
      "domain": {
        "signal": "[bin_maxbins_10_IMDB_Rating_bins.start, bin_maxbins_10_IMDB_Rating_bins.stop]"
      },
      "range": [0, {"signal": "width"}],
      "bins": {"signal": "bin_maxbins_10_IMDB_Rating_bins"},
      "zero": false
    },
    {
      "name": "y",
      "type": "linear",
      "domain": {"data": "source_0", "field": "__count"},
      "range": [{"signal": "height"}, 0],
      "nice": true,
      "zero": true
    }
  ],
  "axes": [
    {
      "scale": "y",
      "orient": "left",
      "gridScale": "x",
      "grid": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "domain": false,
      "labels": false,
      "maxExtent": 0,
      "minExtent": 0,
      "ticks": false,
      "zindex": 0
    },
    {
      "scale": "x",
      "orient": "bottom",
      "grid": false,
      "title": "IMDB Rating (binned)",
      "labelFlush": true,
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(width/10)"},
      "zindex": 0
    },
    {
      "scale": "y",
      "orient": "left",
      "grid": false,
      "title": "Count of Records",
      "labelOverlap": true,
      "tickCount": {"signal": "ceil(height/40)"},
      "zindex": 0
    }
  ]
}
"""

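# Render the chart from the parsed specification
vega(json.loads(full_spec))

# Extract the fully transformed "source_0" dataset as a pandas DataFrame
transformed_datasets, warnings = vf.runtime.pre_transform_datasets(full_spec, ["source_0"], local_tz)
transformed_datasets[0]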

   bin_maxbins_10_IMDB Rating  bin_maxbins_10_IMDB Rating_end  __count
0                         6.0                             7.0      985
1                         3.0                             4.0      100
2                         7.0                             8.0      741
3                         5.0                             6.0      633
4                         8.0                             9.0      204
5                         2.0                             3.0       43
6                         4.0                             5.0      273
7                         9.0                            10.0        4
8                         1.0                             2.0        5
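
The call above also returns `warnings`, which the docstring describes as a list of dicts with `type` and `message` keys. A minimal sketch of summarizing such a list is shown here; the helper `group_warnings` and the sample warning types are invented for illustration, since the PR does not list the actual warning types.

```python
from collections import defaultdict


def group_warnings(warnings):
    """Group warning messages by their 'type' key."""
    grouped = defaultdict(list)
    for warning in warnings:
        grouped[warning["type"]].append(warning["message"])
    return dict(grouped)


# Made-up sample data; real warning types are defined by VegaFusion.
sample = [
    {"type": "ExampleTypeA", "message": "first message"},
    {"type": "ExampleTypeB", "message": "second message"},
    {"type": "ExampleTypeA", "message": "third message"},
]

for warning_type, messages in group_warnings(sample).items():
    print(f"{warning_type}: {len(messages)} warning(s)")
```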

Validation

An exception is raised if:

- The requested dataset is not present in the specification
- The requested dataset (or a dependency) contains transforms not yet supported by VegaFusion
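
One way a caller might guard against these failures is sketched below. The PR does not state which exception class is raised, so the wrapper catches `Exception` broadly; that choice, the wrapper name `try_pre_transform`, and the stand-in `fake_pre_transform` (used so the sketch runs without vegafusion) are all assumptions.

```python
def try_pre_transform(pre_transform, spec, datasets, local_tz):
    """Call `pre_transform` and return (dataframes, warnings, error).

    `pre_transform` is expected to accept the same positional arguments as
    vf.runtime.pre_transform_datasets. The exception class is not stated in
    the PR, so Exception is caught broadly (an assumption).
    """
    try:
        dataframes, warnings = pre_transform(spec, datasets, local_tz)
    except Exception as err:
        return None, [], err
    return dataframes, warnings, None


# Stand-in for the real runtime method; it only knows the dataset "source_0".
def fake_pre_transform(spec, datasets, local_tz):
    known = {"source_0"}
    missing = [name for name in datasets if name not in known]
    if missing:
        raise ValueError(f"No dataset named {missing[0]}")
    return [f"<DataFrame for {name}>" for name in datasets], []


dfs, warns, err = try_pre_transform(fake_pre_transform, "{}", ["source_1"], "UTC")
print("failed:" if err else "ok:", err or dfs)
```

In real use, `vf.runtime.pre_transform_datasets` would be passed in place of `fake_pre_transform`.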


jonmmease added 13 commits June 17, 2022 12:07

WIP addition of pre_transform_values functionality

9b4bc24

WIP addition of pre_transform_values functionality

9b0aebe

Update comments

6cead9a

Add validation to pre_transform_values that requested value exists

0135cb6

Improve error message for unsupported variable

f2c6d01

update lockfile

58d29ed

Add pre_transform_values Rust tests

e936c6c

Fix warnings

bb98a10

Add Python pre_transform_datasets test

901984e

Add docstring

fc47852

Merge remote-tracking branch 'origin/get_dataset' into get_dataset

779f297

# Conflicts: # python/vegafusion/vegafusion/runtime.py # vegafusion-python-embed/src/lib.rs # vegafusion-rt-datafusion/src/task_graph/runtime.rs

clippy fix

48a86ed

cargo fmt

7c22234

jonmmease merged commit 78bb90b into main Jul 22, 2022
