feat: Support executing polars SQL against pandas and pyarrow objects #16746

Merged


alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Jun 5, 2024

@ritchie46 - as suggested 😉

Notably expands the scope of the Polars SQL interface.

  • The pl.sql function can now transparently identify and register additional frame/object types when referenced from a Polars SQL query.
  • SQLContext can also now operate on these additional types.

Compatible objects include:

  • Polars DataFrame, LazyFrame, and (new) Series.
  • Pandas DataFrame and Series.
  • PyArrow Table and RecordBatch.

(💡 If you have more types that you think would be a good fit for this, let me know.)

Objects of these types are transparently converted/registered as LazyFrame iff their variable name appears in the SQL query (this prevents unnecessary conversions of compatible objects that aren't actually referenced).
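The "only if the name appears in the query" rule can be illustrated with a small standard-library sketch. This is not the Polars implementation; the helper name `referenced_objects`, the regular expression, and the use of plain lists as stand-ins for frames are all assumptions made for illustration.

```python
import re


def referenced_objects(query, namespace, compatible_types):
    """Return the namespace entries whose names occur as identifiers in the query.

    Hypothetical helper: a simplified illustration of name-based filtering,
    not the logic Polars actually uses.
    """
    # Collect every identifier-like token in the SQL text.
    identifiers = set(re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", query))
    return {
        name: obj
        for name, obj in namespace.items()
        if name in identifiers and isinstance(obj, compatible_types)
    }


# Plain lists stand in for frame-like objects here.
namespace = {"orders": [1, 2, 3], "customers": [4, 5], "threshold": 10}
selected = referenced_objects(
    "SELECT * FROM orders WHERE id > 1", namespace, (list,)
)
```

Here `customers` is compatible but never mentioned in the query, so it is skipped and would not need to be converted.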

Example

import pandas as pd
import pyarrow as pa
import polars as pl

# polars
pl_frame = pl.LazyFrame({
  "a": [1, 2, 3],
  "b": [6, 7, 8],
  "c": ["z", "y", "x"],
})

# pandas
pd_frame = pd.DataFrame({
  "a": [2, 3, 4],
  "d": [-0.5, 0.0, 0.5],
})

# pyarrow
pa_table = pa.Table.from_arrays(
  [pa.array([1, 2, 3]), pa.array(["x", "y", "z"])],
  names=["a", "e"],
)

Join polars LazyFrame with a pandas DataFrame and a pyarrow Table:

pl.sql("""
  SELECT pl_frame.*, d, e
    FROM pl_frame
    JOIN pd_frame USING(a)
    JOIN pa_table USING(a)
""").collect()

# shape: (2, 5)
# ┌─────┬─────┬─────┬──────┬─────┐
# │ a   ┆ b   ┆ c   ┆ d    ┆ e   │
# │ --- ┆ --- ┆ --- ┆ ---  ┆ --- │
# │ i64 ┆ i64 ┆ str ┆ f64  ┆ str │
# ╞═════╪═════╪═════╪══════╪═════╡
# │ 2   ┆ 7   ┆ y   ┆ -0.5 ┆ y   │
# │ 3   ┆ 8   ┆ x   ┆ 0.0  ┆ z   │
# └─────┴─────┴─────┴──────┴─────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jun 5, 2024
@alexander-beedie alexander-beedie added the A-sql Area: Polars SQL functionality label Jun 5, 2024
@alexander-beedie alexander-beedie force-pushed the sql-extended-query-capability branch 2 times, most recently from 30ed1b7 to 45b1d19 Compare June 5, 2024 14:52
@alexander-beedie alexander-beedie changed the title feat: Support executing the polars SQL engine against pandas and pyarrow objects feat: Support executing the polars SQL interface against pandas and pyarrow objects Jun 5, 2024
@alexander-beedie alexander-beedie force-pushed the sql-extended-query-capability branch from 45b1d19 to 62069d7 Compare June 5, 2024 15:01
@alexander-beedie alexander-beedie changed the title feat: Support executing the polars SQL interface against pandas and pyarrow objects feat: Support executing polars SQL against pandas and pyarrow objects Jun 5, 2024
@alexander-beedie alexander-beedie added the A-interop Area: interoperability with other libraries label Jun 5, 2024
@alexander-beedie alexander-beedie force-pushed the sql-extended-query-capability branch from 62069d7 to 6027482 Compare June 5, 2024 15:38

codecov bot commented Jun 5, 2024

Codecov Report

Attention: Patch coverage is 82.14286% with 10 lines in your changes missing coverage. Please review.

Project coverage is 81.45%. Comparing base (6f3fd8e) to head (6027482).

Files                                Patch %   Lines
py-polars/polars/_utils/various.py   50.00%    3 Missing and 4 partials ⚠️
py-polars/polars/sql/context.py      92.68%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16746      +/-   ##
==========================================
- Coverage   81.45%   81.45%   -0.01%     
==========================================
  Files        1413     1413              
  Lines      186306   186343      +37     
  Branches     2777     2784       +7     
==========================================
+ Hits       151750   151780      +30     
- Misses      34036    34040       +4     
- Partials      520      523       +3     


@lucazanna

Outstanding

Could Spark dataframes be included too?

I never remember the syntax to convert Spark to Polars (see the Stack Overflow answer from Ritchie below)
https://stackoverflow.com/questions/73203318/how-to-transform-spark-dataframe-to-polars-dataframe

It would make it quite easy to use Polars for any Spark dataframe that fits into memory

@alexander-beedie
Collaborator Author

alexander-beedie commented Jun 6, 2024

Could Spark dataframes be included too?

Unfortunately that conversion uses a private method (_collect_as_arrow), which would be a little dicey to rely on in our own API. If PySpark added an "official" Arrow conversion path then we could definitely think about it, though ✌️

@lucazanna
Copy link

Hi @alexander-beedie

It looks like it was just added to Spark
apache/spark#45481

Better if I open a separate request for this?

@ritchie46
Member

Collecting a pyspark dataframe is also very dangerous. We would force it to bring all the data to one machine. This should not be done implicitly.

@ritchie46 ritchie46 merged commit 89edcd2 into pola-rs:main Jun 6, 2024
15 checks passed
@alexander-beedie
Collaborator Author

alexander-beedie commented Jun 6, 2024

Collecting a pyspark dataframe is also very dangerous. We would force it to bring all the data to one machine. This should not be done implicitly.

Good point; if we wanted to go that way, a from_spark would probably make more sense, provided there are some meaningful options there (above and beyond what you'd get from just passing the new pyarrow Table object into from_arrow). Or maybe they'll consider adding a native toPolars now that they expose Arrow properly ;)

@alexander-beedie alexander-beedie deleted the sql-extended-query-capability branch June 6, 2024 06:52