1.1.0rc4 hangs after repeated calls to pre_transform_datasets #268

Closed
jonmmease opened this issue Mar 20, 2023 · 1 comment · Fixed by #269
Labels
bug Something isn't working

Comments

@jonmmease
Collaborator

Ran into a regression in pre_transform_datasets in 1.1.0rc4 (introduced after 1.1.0rc3, I believe). Here's a repro:

import vegafusion as vf
import pandas as pd
import json
print(vf.__version__)  # '1.1.0-rc4'
movies = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/main/data/movies.json")
spec = json.loads(r""" 
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "data": [
    {
      "name": "interval_intervalselection__store"
    },
    {
      "name": "legend_pointselection__store"
    },
    {
      "name": "pivot_hover_32f2e9aa_f08a_4fb5_aa8a_ab3f2cc94a1d_store"
    },
    {
      "name": "movies_clean",
      "url": "vegafusion+dataset://movies_clean"
    },
    {
      "name": "data_0",
      "source": "movies_clean",
      "transform": [
        {
          "type": "formula",
          "expr": "toDate(datum[\"Release Date\"])",
          "as": "Release Date"
        }
      ]
    },
    {
      "name": "data_2",
      "source": "data_0",
      "transform": [
        {
          "field": "Release Date",
          "type": "timeunit",
          "units": [
            "year"
          ],
          "as": [
            "year_Release Date",
            "year_Release Date_end"
          ]
        }
      ]
    },
    {
      "name": "data_3",
      "source": "data_2",
      "transform": [
        {
          "type": "filter",
          "expr": "!length(data(\"interval_intervalselection__store\")) || vlSelectionTest(\"interval_intervalselection__store\", datum)"
        },
        {
          "type": "filter",
          "expr": "time('1986-11-09T18:28:05.617') <= time(datum[\"year_Release Date\"]) && time(datum[\"year_Release Date\"]) <= time('2001-09-16T22:23:39.144')"
        },
        {
          "type": "filter",
          "expr": "!length(data(\"legend_pointselection__store\")) || vlSelectionTest(\"legend_pointselection__store\", datum)"
        }
      ]
    }
  ]
}
""")
for i in range(20):
    print(f"pre_transform_datasets: {i}")
    vf.runtime.pre_transform_datasets(spec, ["data_3"], "UTC", inline_datasets=dict(movies_clean=movies))
    print("done")
pre_transform_datasets: 0
done
pre_transform_datasets: 1
done
pre_transform_datasets: 2
done
pre_transform_datasets: 3
done
pre_transform_datasets: 4
done
pre_transform_datasets: 5
done
pre_transform_datasets: 6
done
pre_transform_datasets: 7

The calls to pre_transform_datasets are very quick up until pre_transform_datasets: 7; then the process hangs indefinitely.

With 1.1.0rc3, the loop completes in a couple of seconds without issue.

Of the changes between 1.1.0rc3 and 1.1.0rc4, these two look like the only PRs that touched relevant code:

So I'm going to try reverting each of these locally to narrow down what caused the regression.

jonmmease added the bug label Mar 20, 2023
@jonmmease
Collaborator Author

Hmm, looks like #264 introduced the regression

jonmmease added a commit that referenced this issue Mar 20, 2023
This works around #268 by copying the input PyArrow table through its IPC bytes representation. It also allows us to properly hash the input PyArrow table, which lets the cache work correctly.
jonmmease added a commit that referenced this issue Mar 20, 2023
* work around #268 and fix table fingerprint

This works around #268 by copying the input PyArrow table through its IPC bytes representation. It also allows us to properly hash the input PyArrow table, which lets the cache work correctly.

* Use to_pyarrow instead of ipc bytes for output of pre_transform_datasets