Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resolve #427- Spark 3 support #433

Merged
merged 18 commits into from
Aug 31, 2020
Merged

Conversation

chris-twiner
Copy link
Contributor

Key changes in the ExpressionEncoder take over much of what the TypedExpressionEncoder was doing but also makes some fundamental changes such as only one GetColumnByOrdinal is allowed in a deserializer. Overall tests from dataset are now 49 failed 305 passed.

There are other locations where it looks like _1 is used such as TypedDataset which should now probably be value and would explain some star expression failures.

Other key changes - scale cannot be negative for decimals, type coercion seems to have changed* and many join tests fail due to spark.sql.analyzer.failAmbiguousSelfJoin see the migration guide for details.

  • I can't seem to find the place that say 1 is now bigint Long by default not Int.

@chris-twiner
Copy link
Contributor Author

RecordEncoder fromCatalyst paths aren't working as before, for nested types like X1 the path is X1 but needs to move past that path, similar problem for Option - the check-in / changes with class names is not intended to be merged but to indicate where the change is observable. So X1(Person) will work as newinstance respects the path, but WrapOption complains as the path is the X1 type and never accepts the "a". That can be forced but then it won't type check - odd.

@codecov-commenter
Copy link

codecov-commenter commented Jul 7, 2020

Codecov Report

Merging #433 into master will decrease coverage by 0.61%.
The diff coverage is 73.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #433      +/-   ##
==========================================
- Coverage   96.83%   96.22%   -0.62%     
==========================================
  Files          60       60              
  Lines        1044     1034      -10     
  Branches        3        4       +1     
==========================================
- Hits         1011      995      -16     
- Misses         33       39       +6     
Impacted Files Coverage Δ
...ataset/src/main/scala/frameless/TypedDataset.scala 100.00% <ø> (ø)
...taset/src/main/scala/frameless/functions/Udf.scala 86.95% <50.00%> (-13.05%) ⬇️
...taset/src/main/scala/frameless/RecordEncoder.scala 100.00% <100.00%> (ø)
...ataset/src/main/scala/frameless/TypedEncoder.scala 100.00% <100.00%> (ø)
.../main/scala/frameless/TypedExpressionEncoder.scala 100.00% <100.00%> (ø)
...scala/frameless/functions/AggregateFunctions.scala 100.00% <100.00%> (ø)
...c/main/scala/frameless/TypedDatasetForwarded.scala 73.52% <0.00%> (-0.76%) ⬇️
...l/src/main/scala/frameless/ml/TypedEstimator.scala 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5eef64d...4713716. Read the comment docs.

@imarios
Copy link
Contributor

imarios commented Jul 10, 2020

This looks great! I have to go over some parts in more details. Let me try to wrap the review up this weekend. Thank you again for all the hard work!

@imarios
Copy link
Contributor

imarios commented Jul 15, 2020

@chris-twiner all looks good! I am going through some minot tests on my side and I think we are almost ready to merge. Thank you again for your time and patience.

@chris-twiner
Copy link
Contributor Author

@chris-twiner all looks good! I am going through some minot tests on my side and I think we are almost ready to merge. Thank you again for your time and patience.

great stuff, welcome for the time and reciprocal thanks for getting this official and frameless in the first instance!

Incidentally and as an fyi there does seem some strangeness on nested types ArrayType[Binary] for compiled encoders vs interpreted (case class has Array[Array[Byte]] but NewInstance is called with Array[Object]). This is a Spark 3 issue and I don't think much can or should be done on it within frameless. I'll try to get a small reproducible case made for it against spark proper and raise an issue separately if it is frameless rather than stop the 3 build out.

@imarios
Copy link
Contributor

imarios commented Jul 26, 2020

@chris-twiner can you quickly resolve the conflicts with master. I think we made some small update changes in the build. It should be easy. Thanks!

README.md Outdated
@@ -74,12 +74,12 @@ detailed comparison of `TypedDataset` with Spark's `Dataset` API.
* [Proof of Concept: TypedDataFrame](http://typelevel.org/frameless/TypedDataFrame.html)

## Quick Start
Frameless is compiled against Scala 2.11.x (and Scala 2.12.x since Frameless 0.8.0)
Frameless is compiled against Scala 2.12.x
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's mention that "Since the 0.9.x release, Frameless is compiled only against Scala 2.12.x"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

@@ -162,23 +161,13 @@ class RecordEncoder[F, G <: HList, H <: HList]

def fromCatalyst(path: Expression): Expression = {
val exprs = fields.value.value.map { field =>
val fieldPath = path match {
case BoundReference(ordinal, dataType, nullable) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-twiner was this part of the code causing issues? I am trying to think if this has any unexpected side effects. I am already seeing some difference with the previous version here:

In frameless 0.8

scala> val x = TypedDataset.create(Array(1,2,3,1))
x: frameless.TypedDataset[Int] = [_1: int]

In new PR

scala> val x = TypedDataset.create(Array(1,2,3,1))
x: frameless.TypedDataset[Int] = [value: int]

At least the field name seems to be printed differently (was _1 before but not it's value).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is it's not set by frameless code, rather by the expression encoder. You can't inject the schema any more.

But you are right its probably breaking - I've updated the readme to mention this.

*/
def targetStructType[A](encoder: TypedEncoder[A]): StructType = {
encoder.catalystRepr match {
case x: StructType =>
if (encoder.nullable) StructType(x.fields.map(_.copy(nullable = true)))
else x
case dt => new StructType().add("_1", dt, nullable = encoder.nullable)
Copy link
Contributor

@imarios imarios Jul 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I should have seen this before my last comment. Was there a reason going back to value? Unfortunately, I don't remember why we change this from value to _1 ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per ExpressionEncoder it's no longer possible to inject this schema, so this reflects the "correct", if breaking, change for any other code that uses it.

@@ -8,19 +8,19 @@ import org.scalacheck.Prop._
import org.scalacheck.{Gen, Prop}

class PivotTest extends TypedDatasetSuite {
def withCustomGenX4: Gen[Vector[X4[String, String, Int, Boolean]]] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there an issue with Int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just retested it and it's working, previously it was not due to interim changes on the ExpressionEncoder integration, reverted.

@imarios
Copy link
Contributor

imarios commented Aug 31, 2020

@chris-twiner I am so sorry for the late review. Looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants