
Parquet predicate pushdown doesn't seem to be working with timestamps #343

Closed
kujon opened this issue Dec 10, 2018 · 7 comments · Fixed by #721
kujon commented Dec 10, 2018

Vanilla Spark:

```scala
val df: DataFrame = ???
val filtered = df.filter(df("value") >= new Timestamp(1512558961000000L))
filtered.explain(true)
```

Both IsNotNull and GreaterThanOrEqual get pushed down nicely.

Frameless:

```scala
case class Foo(value: SQLTimestamp)

val ds: TypedDataset[Foo] = ???
val filtered = ds.filter(ds('value) >= SQLTimestamp(1512558961000000L))
filtered.explain(true)
```

IsNotNull gets pushed down, while GreaterThanOrEqual doesn't.
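An aside on the literal used in both snippets: frameless's `SQLTimestamp` wraps microseconds since the Unix epoch (matching Catalyst's internal `TimestampType`), while `java.sql.Timestamp`'s constructor takes milliseconds. A minimal sketch of that relationship, using a simplified stand-in for `frameless.SQLTimestamp` and a hypothetical `toJavaTimestamp` helper:

```scala
import java.sql.Timestamp

// Simplified stand-in for frameless.SQLTimestamp: microseconds since the Unix epoch.
case class SQLTimestamp(us: Long)

// Hypothetical helper: java.sql.Timestamp's constructor takes milliseconds,
// so divide the microsecond count by 1000.
def toJavaTimestamp(ts: SQLTimestamp): Timestamp = new Timestamp(ts.us / 1000L)

val micros = SQLTimestamp(1512558961000000L)
val millis = toJavaTimestamp(micros) // Timestamp for epoch millis 1512558961000
```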

imarios commented Dec 11, 2018

@kujon thanks for opening the ticket!

@imarios imarios added the bug label Dec 11, 2018
@imarios imarios self-assigned this Jan 22, 2019
imarios commented Jan 22, 2019

Notes on comparing encoders of Timestamps using Frameless and vanilla Spark:

```scala
TypedExpressionEncoder.apply[SQLTimestamp].schema
// org.apache.spark.sql.types.StructType = StructType(StructField(_1,TimestampType,false))
```

whereas

```scala
org.apache.spark.sql.Encoders.TIMESTAMP.schema
// org.apache.spark.sql.types.StructType = StructType(StructField(value,TimestampType,true))
```

One difference is that they generate schemas with different nullable flags.

Changing the encoder to return the same schema didn't change the behavior.

...

Might be a FramelessLit issue. Investigating.

Looking at the logical plan, vanilla Spark has:

```
+- *(1) Filter (isnotnull(value#17) && (value#17 >= 1000))
```

Frameless has:

```
*(1) Filter (isnotnull(value#1) && (value#1 >= FramelessLit(SQLTimestamp(1512558961000000))))
```

Since the literal is encoded as a FramelessLit, Parquet has no idea how to compare it, and hence the predicate is never pushed down.

Verifying that this is a FramelessLit issue by comparing two columns that don't involve literals --> doesn't make sense, since predicate pushdown doesn't show up when two columns are compared. Same for both Frameless and vanilla Spark.

Checked that predicate pushdown works for Long and other constants. This seems to be an issue only with SQLTimestamp for now.

Conclusion (as of 01/21/2019): Parquet predicate pushdown will only work for specific literals. Better if we stick to a java.sql.Timestamp representation for now (?)

imarios commented Jan 22, 2019

I tried to replace the SQLTimestamp with an encoder that operates on java.sql.Timestamp to make it closer to the encoding of native types. The predicate was pushed down as expected in this case.

As long as, in this code here:

```scala
if (ScalaReflection.isNativeType(encoder.jvmRepr) && encoder.catalystRepr == encoder.jvmRepr) {
  val expr = Literal(value, encoder.catalystRepr)
  new TypedColumn(expr)
} else {
  val expr = FramelessLit(value, encoder)
  new TypedColumn(expr)
}
```

we don't go into the FramelessLit path and instead generate a regular Spark Literal, the predicate works. The new encoder didn't work for serde at this point, so we can't really replace what we have with this temporary workaround, but at least it increased my confidence that the predicate pushdown fails when it encounters a FramelessLit, so I will be focusing more on understanding why.
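A toy, self-contained model of the two branches discussed here (not frameless's actual code; `PlainLit`, `WrappedLit`, and the class set are my own names for illustration): values of native JVM types get a plain literal that filter translation can understand, while everything else gets an opaque wrapper.

```scala
// Toy model of the literal dispatch, not Spark/frameless code.
sealed trait Expr
case class PlainLit(value: Any) extends Expr   // analogue of Spark's Literal: pushdown-friendly
case class WrappedLit(value: Any) extends Expr // analogue of FramelessLit: opaque to pushdown

// Stand-in for frameless.SQLTimestamp.
case class SQLTimestamp(us: Long)

// Rough analogue of ScalaReflection.isNativeType, reduced to a class set for this sketch.
val nativeClasses: Set[Class[_]] =
  Set(classOf[java.lang.Long], classOf[java.lang.Integer], classOf[java.lang.Double], classOf[String])

def mkLiteral(value: Any): Expr =
  if (nativeClasses.contains(value.getClass)) PlainLit(value) // Literal path: predicate can be pushed down
  else WrappedLit(value)                                      // FramelessLit path: predicate is not pushed down

val pushable = mkLiteral(42L)                             // PlainLit(42)
val opaque   = mkLiteral(SQLTimestamp(1512558961000000L)) // WrappedLit(SQLTimestamp(...))
```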

@chris-twiner

The reason this doesn't work with FramelessLit is that Spark cannot recognise it as a Literal, so it won't get past this code, which only pushes down Literals.

Possibly the only way to remediate this is to have a plan rule substitute the above code. I may attempt this, but it probably requires a full-blown spark.sql.extensions plugin, so it's not exactly transparent (e.g. it, and all of frameless with its dependencies, has to be on the Spark cluster classpath).

pomadchin commented Jun 4, 2023

Yea, we could add an optimizer rule, it should not be hard.
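A hedged sketch of that idea on a toy expression tree (names like `WrappedLit` and `unpackLiterals` are mine, not the actual rule that ended up in #721): the rule walks the tree and replaces every wrapper literal with a plain literal, after which filter translation can recognise the comparison. In real Spark this would be a `Rule[LogicalPlan]` registered via `extraOptimizations` or a `spark.sql.extensions` plugin.

```scala
// Toy expression tree standing in for a Catalyst plan.
sealed trait Expr
case class PlainLit(value: Any) extends Expr   // analogue of Spark's Literal
case class WrappedLit(value: Any) extends Expr // analogue of FramelessLit
case class Attr(name: String) extends Expr     // column reference
case class GreaterOrEq(left: Expr, right: Expr) extends Expr

// The rewrite an optimizer rule would perform: unpack wrapper literals
// into plain literals so pushdown can see them.
def unpackLiterals(e: Expr): Expr = e match {
  case WrappedLit(v)     => PlainLit(v)
  case GreaterOrEq(l, r) => GreaterOrEq(unpackLiterals(l), unpackLiterals(r))
  case other             => other
}

val before = GreaterOrEq(Attr("value"), WrappedLit(1512558961000000L))
val after  = unpackLiterals(before)
// after == GreaterOrEq(Attr("value"), PlainLit(1512558961000000L))
```

The real rule would also have to evaluate the FramelessLit down to its Catalyst value; this sketch only shows the tree rewrite.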

chris-twiner added a commit to chris-twiner/frameless that referenced this issue Jun 5, 2023
@chris-twiner

Indeed it was not; it's so simple it even works with session.sqlContext.experimental.extraOptimizations.
#721 was raised for it, but it comes with extra baggage to enable Windows local dev without winutils.

@chris-twiner

FYI / NB for searchers: Spark 3.2.4 cannot push down predicates for anything that uses InvokeLike (as SQLTimestamp, structs, etc. do); the fixes from SPARK-40380 and SPARK-39106 aren't present in it. Frameless built against Spark versions greater than 3.3.2 will push down the predicates.
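A small sketch of gating on the Spark version, per the note above (the helper name is mine, and the cutoff assumes "greater than 3.3.2" literally, i.e. 3.3.3 and later):

```scala
// Hypothetical helper: true for Spark versions newer than 3.3.2, which
// (per the comment above) contain SPARK-40380 and SPARK-39106.
def supportsInvokeLikePushdown(sparkVersion: String): Boolean = {
  val parts = sparkVersion.split('.').map(_.toInt)
  val (maj, min, patch) = (parts(0), parts(1), if (parts.length > 2) parts(2) else 0)
  (maj, min, patch) match {
    case (3, 3, p)  => p > 2 // 3.3.3 and later
    case (3, m, _)  => m > 3 // 3.4.x and later
    case (mj, _, _) => mj > 3 // 4.x and later
  }
}

supportsInvokeLikePushdown("3.2.4") // false
supportsInvokeLikePushdown("3.4.0") // true
```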

pomadchin added a commit that referenced this issue Jun 10, 2023
* #343 - unpack to Literals

* #343 - add struct test showing difference between extension and experimental rules

* #343 - toString test to stop the patch complaint

* #343 - sample docs

* #343 - package rename and adding logging that the extension is injected

* Apply suggestions from code review

Co-authored-by: Cédric Chantepie <cchantep@users.noreply.github.com>

* Refactor LitRule and LitRules tests by making them slightly more generic, adjust docs, add negative tests

* #343 - disable the rule, foldable and eval evals

* #343 - cleaned up

* #343 - true with link for 3.2 support

* #343 - bring back code gen with lazy to stop recompiles

* #343 - more compat and a foldable only backport of SPARK-39106 and SPARK-40380

* #343 - option 3 - let 3.2 fail as per oss impl, separate tests

---------

Co-authored-by: Cédric Chantepie <cchantep@users.noreply.github.com>
Co-authored-by: Grigory Pomadchin <grigory.pomadchin@disneystreaming.com>