Add VegaFusion data transformer with mime renderer, save, and to_dict/to_json integration #3094
Conversation
Thanks @jonmmease! Without testing, one quick inline comment.
`altair/utils/_vegafusion_data.py` (outdated):

```python
# Special URL prefix that VegaFusion uses to denote that a
# dataset in a Vega spec corresponds to an entry in the `inline_datasets`
# kwarg of vf.runtime.pre_transform_spec().
VEGAFUSION_PREFIX: Final = "table://"
```
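As an illustration of how such a prefix ties a spec's data URLs to stashed DataFrames, here is a minimal sketch. The names `vegafusion_data_transformer` and `extracted_inline_tables` are hypothetical, not Altair's actual internals:

```python
import uuid
from weakref import WeakValueDictionary

import pandas as pd

# Hypothetical sketch: replace a DataFrame with a "table://{name}" URL
# and stash the frame by name for later retrieval. The WeakValueDictionary
# lets stashed frames be garbage-collected once no chart references them.
VEGAFUSION_PREFIX = "table://"
extracted_inline_tables = WeakValueDictionary()

def vegafusion_data_transformer(data: pd.DataFrame) -> dict:
    """Stash `data` under a fresh name and return a table:// URL reference."""
    table_name = f"table_{uuid.uuid4().hex}"
    extracted_inline_tables[table_name] = data
    return {"url": VEGAFUSION_PREFIX + table_name}

df = pd.DataFrame({"x": [1, 2, 3]})
spec_data = vegafusion_data_transformer(df)
print(spec_data["url"])  # e.g. "table://table_3f9c..."
```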
Maybe rename this to `vf_source://`? Since you also prefix a DataFrame as `table_<x>` here, this will keep space for supporting arrays/tensors/matrices once they are introduced.
This prefix is built into VegaFusion, but there is an alternative supported prefix, `vegafusion+dataset://`, that I can switch to.
renamed in 3a5517b
Thanks @jonmmease! I've read the code-diff a few times and it looks solid to me. I've tested a bit and only have a few observations, maybe interesting for documentation purposes. I don't think these are bugs.
Tests I did were similar to:

```python
from datetime import datetime

import pandas as pd
import pyarrow as pa
import altair as alt

alt.data_transformers.enable("vegafusion")
# alt.data_transformers.enable("default")

data = {
    'date': [datetime(2004, 8, 30, 23, 15), datetime(2004, 9, 1, 12, 10)],
    'value': [102, 129]
}
pd_df = pd.DataFrame(data)
pa_table = pa.table(data)

# pandas dataframe
c = alt.Chart(pd_df).mark_point().encode(x='date:T', y='value:Q')
c.transformed_data()
c.to_dict()
c.to_dict(format='vega')
c.save('c.png')  # success
print(c.to_json())

# pyarrow table
d = alt.Chart(pa_table).mark_point(tooltip=True).encode(x='date:T', y='value:Q')
d.transformed_data()
d.to_dict()
d.save('d.png')  # success
print(d.to_json())
# print(d.to_json(format='vega'))

# large table
alt.data_transformers.enable("vegafusion")
# alt.data_transformers.enable("default")
flights = pd.read_parquet(
    "https://vegafusion-datasets.s3.amazonaws.com/vega/flights_1m.parquet"
)
f = alt.Chart(flights).mark_bar().encode(
    alt.X("delay", bin=alt.Bin(maxbins=30)),
    alt.Y("count()")
)
f.save('f.png')  # success
# print(f.to_json())
# f
f.transformed_data().head()
```
Thanks for taking the time to go through this @mattijn! I'll respond in more detail tomorrow, but wanted to mention one thing: I debated about whether to change this behavior. Also, I want to wait until @binste is back and has a chance to look at this before merging.
Yeah, I'm not sure of the best way to handle this. When you install with
This is expected (and needs to be documented). Vega-Lite and VegaFusion treat everything as UTC internally, so on output we pretty much need to assign some timezone. I chose to have the output timezone match the local timezone that's used for timeunit calculations (this defaults to the kernel's local timezone but can be overridden with
This conversion to the local timezone works with pandas and polars, but I didn't see a way to perform the conversion with pyarrow. There might be a way that I didn't find.
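For pandas, the UTC-to-local conversion described above can be illustrated as follows. This is a minimal sketch of the behavior, and the target timezone here is just an example, not Altair's default (which follows the kernel's local timezone):

```python
import pandas as pd

# Naive timestamps, as they might appear in a chart's source data.
naive = pd.Series(pd.to_datetime(["2004-08-30 23:15", "2004-09-01 12:10"]))

# Vega-Lite/VegaFusion treat datetimes as UTC internally, so on output
# the values are interpreted as UTC and converted to a local timezone.
local = naive.dt.tz_localize("UTC").dt.tz_convert("Europe/Amsterdam")
print(local.iloc[0].isoformat())  # 2004-08-31T01:15:00+02:00
```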
This is expected
See note above. Because VegaFusion works at the Vega level it's not possible to inline data into a Vega-Lite spec. So there are 3 options when it comes to
I wasn't really happy with (2) or (3), but I agree that (1) is also not great. Happy to consider other alternatives as well.
Expected to be Vega. I haven't looked into the ordering of attributes. I'm not certain where this comes from, but there's probably somewhere we could sort the keys.
Right now you can accomplish this with:
But it would probably be nice to have a simpler way to do this.
I really like the general approach of integrating and enabling VegaFusion via a data transformer and the implementation in this PR! 🥳 I added some comments inline.
Regarding the to_dict/to_json discussion, I also think this should produce something usable but I see that it's not straightforward. What do you think about, if the vegafusion transformer is called, to:
- raising an exception if `to_dict`/`to_json` are called without `format` specified (default `vega-lite`) or with `format="vega-lite"`
- mentioning in the error that only `format="vega"` is supported for this transformer

This would make it clear to a user that with the `vegafusion` transformer a `vega-lite` spec is not possible, and they can explicitly opt in to receive a `vega` spec in case they want to send the spec to e.g. a frontend in a web application.
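The proposed check could be sketched roughly like this. The function and parameter names are illustrative, not Altair's actual internals:

```python
# Hypothetical sketch of the proposed format check.
def check_format(fmt: str, using_vegafusion: bool) -> None:
    """Reject Vega-Lite output while the 'vegafusion' transformer is active."""
    if using_vegafusion and fmt == "vega-lite":
        raise ValueError(
            'When the "vegafusion" data transformer is enabled, '
            'to_dict() and to_json() only support format="vega".'
        )

check_format("vega", using_vegafusion=True)         # ok
# check_format("vega-lite", using_vegafusion=True)  # would raise ValueError
```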
I'd be ok if this is implemented in a follow-up PR and we could also continue the discussion there or in the existing discussion thread on the VegaFusion integration.
Your ideas on the documentation sound good to me.
```python
# Check for row limit warning and convert to MaxRowsError
for warning in warnings:
    if warning.get("type") == "RowLimitExceeded":
        raise MaxRowsError(
```
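The quoted snippet is truncated; a self-contained sketch of the same warning-to-error pattern might look like the following, with a stand-in `MaxRowsError` and a hypothetical helper name:

```python
class MaxRowsError(Exception):
    """Stand-in for Altair's MaxRowsError, for illustration only."""

def raise_on_row_limit(warnings: list) -> None:
    # VegaFusion reports pre-transform warnings as dicts with a "type" key;
    # convert a row-limit warning into the familiar MaxRowsError.
    for warning in warnings:
        if warning.get("type") == "RowLimitExceeded":
            raise MaxRowsError(
                "The number of rows exceeds the limit. Note that this "
                "limit is applied after data transformations, not before."
            )

raise_on_row_limit([])  # no warnings: no error
```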
I also never experienced any issues with larger datasets for single charts. However, if you do some exploratory data analysis with many charts in a Jupyter notebook, that notebook can get rather slow after a while if you plot many larger charts. I still like the idea though of increasing it and 100k sounds as good as any as it would be difficult to benchmark and figure out a good compromise. Just wanted to mention this.
This makes sense to me for the public API, but a complication is that I need the [...]. I haven't dug into what the [...]
Raise a ValueError when the "vegafusion" transformer is enabled and format="vega-lite". Use context={"pre_transform": False} to disable pre_transforming when "vegafusion" is enabled, for internal usage.
@binste, I updated [...]
I like the implementation with the `ValueError` and using `context`! I think it needs to be extended on line 903 in `api.py` (can't comment there) as it now raises this error for a layered chart:
```python
import altair as alt
from vega_datasets import data

alt.data_transformers.enable("vegafusion")

source = data.wheat()

bar = alt.Chart(source).mark_bar().encode(
    x='year:O',
    y='wheat:Q'
)
rule = alt.Chart(source).mark_rule(color='red').encode(
    y='mean(wheat):Q'
)
layer_chart = (bar + rule).properties(width=600)
layer_chart.to_json(format="vega")
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 1
----> 1 layer_chart.to_dict(format="vega")

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/vegalite/v5/api.py:903, in TopLevelMixin.to_dict(self, validate, format, ignore, context)
    898 context["top_level"] = False
    900 # TopLevelMixin instance does not necessarily have to_dict defined
    901 # but due to how Altair is set up this should hold.
    902 # Too complex to type hint right now
--> 903 vegalite_spec = super(TopLevelMixin, copy).to_dict(  # type: ignore[misc]
    904     validate=validate, ignore=ignore, context=context
    905 )
    907 # TODO: following entries are added after validation. Should they be validated?
    908 if is_top_level:
    909     # since this is top-level we add $schema if it's missing

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/utils/schemapi.py:807, in SchemaBase.to_dict(self, validate, ignore, context)
    805     if "mark" in kwds and isinstance(kwds["mark"], str):
    806         kwds["mark"] = {"type": kwds["mark"]}
--> 807     result = _todict(
    808         kwds,
    809         context=context,
    810     )
    811 else:
    812     raise ValueError(
    813         "{} instance has both a value and properties : "
    814         "cannot serialize to dict".format(self.__class__)
    815     )

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/utils/schemapi.py:340, in _todict(obj, context)
    338     return [_todict(v, context) for v in obj]
    339 elif isinstance(obj, dict):
--> 340     return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
    341 elif hasattr(obj, "to_dict"):
    342     return obj.to_dict()

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/utils/schemapi.py:340, in <dictcomp>(.0)
    338     return [_todict(v, context) for v in obj]
    339 elif isinstance(obj, dict):
--> 340     return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}
    341 elif hasattr(obj, "to_dict"):
    342     return obj.to_dict()

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/utils/schemapi.py:338, in _todict(obj, context)
    336     return obj.to_dict(validate=False, context=context)
    337 elif isinstance(obj, (list, tuple, np.ndarray)):
--> 338     return [_todict(v, context) for v in obj]
    339 elif isinstance(obj, dict):
    340     return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/utils/schemapi.py:338, in <listcomp>(.0)
    336     return obj.to_dict(validate=False, context=context)
    337 elif isinstance(obj, (list, tuple, np.ndarray)):
--> 338     return [_todict(v, context) for v in obj]
    339 elif isinstance(obj, dict):
    340     return {k: _todict(v, context) for k, v in obj.items() if v is not Undefined}

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/utils/schemapi.py:336, in _todict(obj, context)
    334 """Convert an object to a dict representation."""
    335 if isinstance(obj, SchemaBase):
--> 336     return obj.to_dict(validate=False, context=context)
    337 elif isinstance(obj, (list, tuple, np.ndarray)):
    338     return [_todict(v, context) for v in obj]

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/vegalite/v5/api.py:2677, in Chart.to_dict(self, validate, format, ignore, context)
   2673     copy.data = core.InlineData(values=[{}])
   2674     return super(Chart, copy).to_dict(
   2675         validate=validate, format=format, ignore=ignore, context=context
   2676     )
-> 2677 return super().to_dict(
   2678     validate=validate, format=format, ignore=ignore, context=context
   2679 )

File ~/Library/CloudStorage/Dropbox/Programming/altair/altair/vegalite/v5/api.py:926, in TopLevelMixin.to_dict(self, validate, format, ignore, context)
    924 if context.get("pre_transform", True) and _using_vegafusion():
    925     if format == "vega-lite":
--> 926         raise ValueError(
    927             'When the "vegafusion" data transformer is enabled, the \n'
    928             "to_dict() and to_json() chart methods must be called with "
    929             'format="vega". \n'
    930             "For example: \n"
    931             '    >>> chart.to_dict(format="vega")\n'
    932             '    >>> chart.to_json(format="vega")'
    933         )
    934 else:
    935     return _compile_with_vegafusion(vegalite_spec)

ValueError: When the "vegafusion" data transformer is enabled, the
to_dict() and to_json() chart methods must be called with format="vega".
For example:
    >>> chart.to_dict(format="vega")
    >>> chart.to_json(format="vega")
```
Works great! Looking forward to the next Altair release where we can include this. 🥳 Whereas originally I wasn't sure how to integrate VegaFusion without making it confusing for users, I think this solution now is a big usability improvement and ties in well with the existing concepts of Altair (data_transformers, to_dict, ...).
Overview

Following on from the discussion in #3054, this PR introduces a new data transformer named "vegafusion". When activated, we use VegaFusion to pre-evaluate data transformations before displaying charts with mime renderers, and before saving charts with `Chart.save`.

The Data Transformer
The "vegafusion" data transformer replaces DataFrames in Vega-Lite specs with URLs of the form `table://{uuid}`. These DataFrames are stashed in a local `WeakValueDictionary`, with the UUID as the key. The `table://` URL is a VegaFusion convention used to mean that the associated data will be provided using the `inline_datasets` argument to the `pre_transform_spec` function.

Mime renderer and Save
A function called `compile_with_vegafusion` is defined in `_vegafusion_data.py` that takes a Vega-Lite spec containing these `table://` URLs, compiles it to a Vega spec, then invokes VegaFusion's `pre_transform_spec` function. The stashed DataFrames that correspond to the `table://` URLs are extracted from the `WeakValueDictionary` and passed to `pre_transform_spec` as the `inline_datasets` argument. `compile_with_vegafusion` then returns the pre-transformed Vega spec as a dictionary.

When the "vegafusion" data transformer is enabled, the `default_renderer_base()` function calls `compile_with_vegafusion()` before returning a Vega mime bundle. Similarly, the `spec_to_mimebundle` function has been updated to call `compile_with_vegafusion()` when the "vegafusion" data transformer is enabled. This way, saved charts have their transforms pre-evaluated before saving.

Transformed Data
This PR also updates the `chart.transformed_data` functionality added in #3081 and #3084 to rely on the new "vegafusion" data transformer rather than the "vegafusion-inline" transformer that is provided by the `vegafusion` Python package. ("vegafusion-inline" will be deprecated in `vegafusion` once this is released.)

MaxRowsError
VegaFusion still has a notion of a maximum number of rows. But unlike the regular row limit, VegaFusion's row limit is applied after all supported data transformations have been applied, so you may hit it in the case of a large scatter chart. I decided to reuse the existing infrastructure for setting and disabling max_rows (e.g. `alt.data_transformers.disable_max_rows()`). When this post-transform limit is exceeded, we raise the same `MaxRowsError` exception, but with a message that explains that the limit is applied after data transformations.

This is open for discussion, but I set the "vegafusion" data transformer's row limit to 100k. This is partly justified by the fact that VegaFusion will remove unused data columns. But also, practically speaking, I haven't run into issues at this scale. One case where a high default limit is helpful is creating fine-grained heatmaps (e.g. 300x300 gets you to 90k).
Testing
I added a test for the `spec_to_mimebundle` updates, but still need to find a place to test the renderer path.

Follow-up
After this PR, I want to work on how to document this. We can talk more about it, but I'm thinking of adding a section to the top of the "Large Datasets" section that recommends enabling this data transformer as the first step. And perhaps also updating the default MaxRowsError with instructions on how to enable it.
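As a recap, the handoff from `table://` URLs in a compiled Vega spec to the `inline_datasets` argument of `pre_transform_spec` (described under "Mime renderer and Save" above) might be sketched as follows. The helper name and the plain-dict stash are illustrative assumptions; the real stash is a `WeakValueDictionary` of DataFrames:

```python
VEGAFUSION_PREFIX = "table://"

def collect_inline_datasets(vega_spec: dict, stash: dict) -> dict:
    """Gather stashed tables referenced by table:// URLs in a Vega spec."""
    inline_datasets = {}
    for dataset in vega_spec.get("data", []):
        url = dataset.get("url", "")
        if url.startswith(VEGAFUSION_PREFIX):
            # Strip the prefix to recover the stash key, e.g. "table_abc".
            name = url[len(VEGAFUSION_PREFIX):]
            inline_datasets[name] = stash[name]
    return inline_datasets

stash = {"table_abc": [{"x": 1}, {"x": 2}]}  # stand-in for the stashed DataFrames
spec = {"data": [{"name": "source_0", "url": "table://table_abc"}]}
print(collect_inline_datasets(spec, stash))  # {'table_abc': [{'x': 1}, {'x': 2}]}
```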