Add SnowparkDataset and date_to_utc_timestamp support across dialects #374

Merged
merged 3 commits into from
Aug 11, 2023

jonmmease
Collaborator

@jonmmease jonmmease commented Aug 11, 2023

This PR adds an initial Python SnowparkDataset implementation that derives from SqlDataset. It uses the Snowflake SQL dialect and evaluates queries using snowflake-snowpark-python.

Here is an example of usage with Altair, using:

  • Altair 5.1.0 dev from git
  • snowflake-connector-python==3.1.0a2 (This version doesn't pin pyarrow)
  • pyarrow 12

import altair as alt
from vegafusion.dataset.snowpark import SnowparkDataset
from snowflake.snowpark import Session, Table

connection_parameters = {
  "account": "<your snowflake account>",
  "user": "<your snowflake user>",
  "password": "<your snowflake password>",
  "role": "<snowflake user role>",
  "warehouse": "<snowflake warehouse>",
  "database": "<snowflake database>",
  "schema": "<snowflake schema>"
}

session = Session.builder.configs(connection_parameters).create()
session

movies = session.table('"DEMO_DATA"."DEMOS"."MOVIES"')

alt.data_transformers.enable("vegafusion")

# Build histogram with SnowparkDataset in verbose mode so that it prints out the queries
chart = alt.Chart(SnowparkDataset(movies, verbose=True, fallback=False)).mark_bar().encode(
    alt.X("IMDB_RATING").bin(),
    alt.Y("count():Q")
)
chart

Snowflake Query:
WITH _DEMO_DATA___DEMOS___MOVIES__0 AS (SELECT "TITLE", "SOURCE", "DIRECTOR", "US_GROSS", "IMDB_VOTES", "DISTRIBUTOR", "IMDB_RATING", "MPAA_RATING", "MAJOR_GENRE", "RELEASE_DATE", "US_DVD_SALES", "CREATIVE_TYPE", "WORLDWIDE_GROSS", "RUNNING_TIME_MIN", "PRODUCTION_BUDGET", "ROTTEN_TOMATOES_RATING" FROM "DEMO_DATA"."DEMOS"."MOVIES"), _DEMO_DATA___DEMOS___MOVIES__1 AS (SELECT seq8() AS "_vf_order", * FROM _DEMO_DATA___DEMOS___MOVIES__0), _DEMO_DATA___DEMOS___MOVIES__2 AS (SELECT "_vf_order" AS "_vf_order", "TITLE" AS "TITLE", "SOURCE" AS "SOURCE", "DIRECTOR" AS "DIRECTOR", "US_GROSS" AS "US_GROSS", "IMDB_VOTES" AS "IMDB_VOTES", "DISTRIBUTOR" AS "DISTRIBUTOR", "IMDB_RATING" AS "IMDB_RATING", "MPAA_RATING" AS "MPAA_RATING", "MAJOR_GENRE" AS "MAJOR_GENRE", "RELEASE_DATE" AS "RELEASE_DATE", "US_DVD_SALES" AS "US_DVD_SALES", "CREATIVE_TYPE" AS "CREATIVE_TYPE", "WORLDWIDE_GROSS" AS "WORLDWIDE_GROSS", "RUNNING_TIME_MIN" AS "RUNNING_TIME_MIN", "PRODUCTION_BUDGET" AS "PRODUCTION_BUDGET", "ROTTEN_TOMATOES_RATING" AS "ROTTEN_TOMATOES_RATING" FROM _DEMO_DATA___DEMOS___MOVIES__1) SELECT min("IMDB_RATING") AS "__min_val", max("IMDB_RATING") AS "__max_val" FROM _DEMO_DATA___DEMOS___MOVIES__2

Snowflake Query:
WITH _DEMO_DATA___DEMOS___MOVIES__0 AS (SELECT "TITLE", "SOURCE", "DIRECTOR", "US_GROSS", "IMDB_VOTES", "DISTRIBUTOR", "IMDB_RATING", "MPAA_RATING", "MAJOR_GENRE", "RELEASE_DATE", "US_DVD_SALES", "CREATIVE_TYPE", "WORLDWIDE_GROSS", "RUNNING_TIME_MIN", "PRODUCTION_BUDGET", "ROTTEN_TOMATOES_RATING" FROM "DEMO_DATA"."DEMOS"."MOVIES"), _DEMO_DATA___DEMOS___MOVIES__1 AS (SELECT seq8() AS "_vf_order", * FROM _DEMO_DATA___DEMOS___MOVIES__0), _DEMO_DATA___DEMOS___MOVIES__2 AS (SELECT "_vf_order" AS "_vf_order", "TITLE" AS "TITLE", "SOURCE" AS "SOURCE", "DIRECTOR" AS "DIRECTOR", "US_GROSS" AS "US_GROSS", "IMDB_VOTES" AS "IMDB_VOTES", "DISTRIBUTOR" AS "DISTRIBUTOR", "IMDB_RATING" AS "IMDB_RATING", "MPAA_RATING" AS "MPAA_RATING", "MAJOR_GENRE" AS "MAJOR_GENRE", "RELEASE_DATE" AS "RELEASE_DATE", "US_DVD_SALES" AS "US_DVD_SALES", "CREATIVE_TYPE" AS "CREATIVE_TYPE", "WORLDWIDE_GROSS" AS "WORLDWIDE_GROSS", "RUNNING_TIME_MIN" AS "RUNNING_TIME_MIN", "PRODUCTION_BUDGET" AS "PRODUCTION_BUDGET", "ROTTEN_TOMATOES_RATING" AS "ROTTEN_TOMATOES_RATING" FROM _DEMO_DATA___DEMOS___MOVIES__1), _DEMO_DATA___DEMOS___MOVIES__3 AS (SELECT *, FLOOR(((("IMDB_RATING" - 1.0) / 1.0) + 0.00000000000001)) AS "__bin_index" FROM _DEMO_DATA___DEMOS___MOVIES__2), _DEMO_DATA___DEMOS___MOVIES__4 AS (SELECT "_vf_order", "TITLE", "SOURCE", "DIRECTOR", "US_GROSS", "IMDB_VOTES", "DISTRIBUTOR", "IMDB_RATING", "MPAA_RATING", "MAJOR_GENRE", "RELEASE_DATE", "US_DVD_SALES", "CREATIVE_TYPE", "WORLDWIDE_GROSS", "RUNNING_TIME_MIN", "PRODUCTION_BUDGET", "ROTTEN_TOMATOES_RATING", "__bin_index", CASE WHEN ("__bin_index" < 0.0) THEN CAST('-inf' AS DOUBLE) WHEN ((abs(("IMDB_RATING" - 10.0)) < 0.00000000000001) AND ("__bin_index" = 9)) THEN ((("__bin_index" - 1) * 1.0) + 1.0) WHEN ("__bin_index" >= 9) THEN CAST('inf' AS DOUBLE) ELSE (("__bin_index" * 1.0) + 1.0) END AS "bin_maxbins_10_IMDB_RATING" FROM _DEMO_DATA___DEMOS___MOVIES__3), _DEMO_DATA___DEMOS___MOVIES__5 AS (SELECT "_vf_order", "TITLE", "SOURCE", "DIRECTOR", 
"US_GROSS", "IMDB_VOTES", "DISTRIBUTOR", "IMDB_RATING", "MPAA_RATING", "MAJOR_GENRE", "RELEASE_DATE", "US_DVD_SALES", "CREATIVE_TYPE", "WORLDWIDE_GROSS", "RUNNING_TIME_MIN", "PRODUCTION_BUDGET", "ROTTEN_TOMATOES_RATING", "bin_maxbins_10_IMDB_RATING", ("bin_maxbins_10_IMDB_RATING" + 1.0) AS "bin_maxbins_10_IMDB_RATING_end" FROM _DEMO_DATA___DEMOS___MOVIES__4), _DEMO_DATA___DEMOS___MOVIES__6 AS (SELECT count(0) AS "__count", min("_vf_order") AS "_vf_order", "bin_maxbins_10_IMDB_RATING", "bin_maxbins_10_IMDB_RATING_end" FROM _DEMO_DATA___DEMOS___MOVIES__5 GROUP BY "bin_maxbins_10_IMDB_RATING", "bin_maxbins_10_IMDB_RATING_end"), _DEMO_DATA___DEMOS___MOVIES__7 AS (SELECT "_vf_order", "bin_maxbins_10_IMDB_RATING", "bin_maxbins_10_IMDB_RATING_end", "__count" FROM _DEMO_DATA___DEMOS___MOVIES__6), _DEMO_DATA___DEMOS___MOVIES__8 AS (SELECT * FROM _DEMO_DATA___DEMOS___MOVIES__7 WHERE coalesce(("bin_maxbins_10_IMDB_RATING" IS NOT NULL AND (NOT "bin_maxbins_10_IMDB_RATING" IN (CAST('-inf' AS DOUBLE), CAST('inf' AS DOUBLE), CAST('NaN' AS DOUBLE)))), false)), _DEMO_DATA___DEMOS___MOVIES__9 AS (SELECT "_vf_order", "__count", "bin_maxbins_10_IMDB_RATING", "bin_maxbins_10_IMDB_RATING_end" FROM _DEMO_DATA___DEMOS___MOVIES__8), _DEMO_DATA___DEMOS___MOVIES__10 AS (SELECT * FROM _DEMO_DATA___DEMOS___MOVIES__9 ORDER BY "_vf_order" ASC NULLS LAST) SELECT "__count", "bin_maxbins_10_IMDB_RATING", "bin_maxbins_10_IMDB_RATING_end" FROM _DEMO_DATA___DEMOS___MOVIES__10

[Image: visualization]
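
For anyone reading the second query above, the `__bin_index` / `bin_maxbins_10_IMDB_RATING` logic can be restated in plain Python. This is only a sketch of what the generated SQL computes for this chart: the function name `bin_start` and its parameters are made up for illustration, and the defaults (start 1.0, step 1.0, stop 10.0, last index 9, epsilon 1e-14) are copied from the constants in the query. NULL handling is left out.

```python
import math


def bin_start(value, start=1.0, step=1.0, stop=10.0, n=9, eps=1e-14):
    """Mirror of the CASE expression in the generated query (illustrative only)."""
    # FLOOR((("IMDB_RATING" - 1.0) / 1.0) + 0.00000000000001)
    index = math.floor((value - start) / step + eps)
    if index < 0:
        # Below the first bin
        return -math.inf
    if abs(value - stop) < eps and index == n:
        # A value exactly at the upper bound is folded into the last bin
        return (index - 1) * step + start
    if index >= n:
        # Above the last bin
        return math.inf
    return index * step + start


print(bin_start(7.3))   # start of the bin that contains 7.3
print(bin_start(10.0))  # upper bound lands in the last bin
```

The later CTEs then drop rows whose bin start is -inf, inf, or NaN and count the remaining rows per bin.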

While testing, I realized that date_to_utc_timestamp wasn't implemented for most of the dialects and had no tests, so this PR implements it for all dialects and adds tests.
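
The description does not spell out the semantics of date_to_utc_timestamp. Assuming it means "treat a calendar date as midnight in a given timezone and express that instant in UTC" (an assumption based on the name, not something stated in this PR), the behavior would look roughly like the following standard-library sketch. The Python function below is a hypothetical stand-in, not the SQL that any dialect emits.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def date_to_utc_timestamp(d: date, tz: tzinfo) -> datetime:
    """Hypothetical Python analogue: local midnight on `d`, converted to UTC."""
    local_midnight = datetime.combine(d, time.min, tzinfo=tz)
    return local_midnight.astimezone(timezone.utc)


# Fixed UTC-4 offset used here so the sketch has no tz database dependency;
# a zoneinfo.ZoneInfo instance could be passed instead.
utc_minus_4 = timezone(timedelta(hours=-4))
print(date_to_utc_timestamp(date(2023, 8, 11), utc_minus_4))
# 2023-08-11 04:00:00+00:00
```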

@jonmmease jonmmease changed the title from "Add SnowparkDataset" to "Add SnowparkDataset and date_to_utc_timestamp support across dialects" Aug 11, 2023
@jonmmease jonmmease merged commit c01850d into main Aug 11, 2023
30 checks passed
@jonmmease jonmmease removed the request for review from FreddieLindsey August 11, 2023 14:56
Comment on lines +10 to +32
SNOWPARK_TO_PYARROW_TYPES: Dict[SnowparkDataType, pa.DataType] = {}


def get_snowpark_to_pyarrow_types():
    if not SNOWPARK_TO_PYARROW_TYPES:
        import snowflake.snowpark.types as sp_types

        SNOWPARK_TO_PYARROW_TYPES.update(
            {
                sp_types.LongType: pa.int64(),
                sp_types.BinaryType: pa.binary(),
                sp_types.BooleanType: pa.bool_(),
                sp_types.ByteType: pa.int8(),
                sp_types.StringType: pa.string(),
                sp_types.DateType: pa.date32(),
                sp_types.DoubleType: pa.float64(),
                sp_types.FloatType: pa.float32(),
                sp_types.IntegerType: pa.int32(),
                sp_types.ShortType: pa.int16(),
                sp_types.TimestampType: pa.timestamp("ms"),
            }
        )
    return SNOWPARK_TO_PYARROW_TYPES

@jonmmease given that we don't mind importing PyArrow or Snowflake here, this could now be initialised statically rather than lazily.
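
Roughly what static initialisation could look like, as a sketch rather than a proposed patch. So that it runs without Snowpark or PyArrow installed, the classes below are hypothetical stand-ins for the `snowflake.snowpark.types` classes and the string values stand in for `pa.int64()`, `pa.string()`, and so on. In the real module the dict literal would use the same keys and values as in the hunk above, with both imports at the top of the file.

```python
from typing import Dict, Type


# Hypothetical stand-ins for snowflake.snowpark.types classes
class DataType: ...
class LongType(DataType): ...
class StringType(DataType): ...
class DateType(DataType): ...


# Built once at import time; no "populate on first call" branch is needed.
# String values are placeholders for the corresponding pyarrow types.
SNOWPARK_TO_PYARROW_TYPES: Dict[Type[DataType], str] = {
    LongType: "int64",
    StringType: "string",
    DateType: "date32",
}


def get_snowpark_to_pyarrow_types() -> Dict[Type[DataType], str]:
    # Kept only so existing callers keep working; it just returns the mapping.
    return SNOWPARK_TO_PYARROW_TYPES
```

With the mapping built at import time, the getter could also be dropped and callers could reference the module-level dict directly.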
