resolve #427- Spark 3 support #433

chris-twiner · 2020-07-06T10:36:13Z

Key changes in the ExpressionEncoder take over much of what the TypedExpressionEncoder was doing, but it also makes some fundamental changes, such as allowing only one GetColumnByOrdinal in a deserializer. Overall, the dataset tests now stand at 49 failed, 305 passed.

There are other locations where it looks like _1 is used, such as TypedDataset, which should now probably be value; that would explain some star expression failures.

Other key changes: scale cannot be negative for decimals, type coercion seems to have changed*, and many join tests fail due to spark.sql.analyzer.failAmbiguousSelfJoin; see the migration guide for details.

* I can't seem to find the place that says 1 is now bigint (Long) by default, not Int.
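A minimal sketch, assuming a local Spark 3.0 session, of one way to restore the Spark 2.4 behaviour for the self-join and decimal-scale changes mentioned above. `Spark24CompatSession` is a hypothetical name used only for illustration; the two configuration keys are the ones described in the Spark 3.0 migration guide, and this is not necessarily how the PR itself handles them.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: builds a local session with Spark 2.4-style settings.
object Spark24CompatSession {
  def create(): SparkSession =
    SparkSession.builder()
      .master("local[1]")
      .appName("spark3-migration-sketch")
      // Spark 3 fails ambiguous self-joins by default; "false" restores the 2.4 behaviour.
      .config("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
      // Spark 3 rejects negative decimal scales unless this legacy flag is enabled.
      .config("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
      .getOrCreate()
}
```

Setting the flags on the builder keeps them in one place for a test suite; whether relaxing them is appropriate depends on how much of the 2.4 semantics the tests rely on.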


chris-twiner · 2020-07-06T20:34:18Z

RecordEncoder fromCatalyst paths aren't working as before. For nested types like X1 the path is X1 but needs to move past that path, and there is a similar problem for Option. The check-in with the class-name changes is not intended to be merged, only to indicate where the change is observable. So X1(Person) will work, as NewInstance respects the path, but WrapOption complains because the path is the X1 type and never accepts the "a". That can be forced, but then it won't type check, which is odd.


codecov-commenter · 2020-07-07T23:15:42Z

Codecov Report

Merging #433 into master will decrease coverage by 0.61%.
The diff coverage is 73.91%.

@@            Coverage Diff             @@
##           master     #433      +/-   ##
==========================================
- Coverage   96.83%   96.22%   -0.62%     
==========================================
  Files          60       60              
  Lines        1044     1034      -10     
  Branches        3        4       +1     
==========================================
- Hits         1011      995      -16     
- Misses         33       39       +6

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...ataset/src/main/scala/frameless/TypedDataset.scala | `100.00% <ø> (ø)` | |
| ...taset/src/main/scala/frameless/functions/Udf.scala | `86.95% <50.00%> (-13.05%)` | ⬇️ |
| ...taset/src/main/scala/frameless/RecordEncoder.scala | `100.00% <100.00%> (ø)` | |
| ...ataset/src/main/scala/frameless/TypedEncoder.scala | `100.00% <100.00%> (ø)` | |
| .../main/scala/frameless/TypedExpressionEncoder.scala | `100.00% <100.00%> (ø)` | |
| ...scala/frameless/functions/AggregateFunctions.scala | `100.00% <100.00%> (ø)` | |
| ...c/main/scala/frameless/TypedDatasetForwarded.scala | `73.52% <0.00%> (-0.76%)` | ⬇️ |
| ...l/src/main/scala/frameless/ml/TypedEstimator.scala | `100.00% <0.00%> (ø)` | |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5eef64d...4713716.

imarios · 2020-07-10T23:39:22Z

This looks great! I have to go over some parts in more details. Let me try to wrap the review up this weekend. Thank you again for all the hard work!

imarios · 2020-07-15T16:52:15Z

@chris-twiner all looks good! I am going through some minor tests on my side and I think we are almost ready to merge. Thank you again for your time and patience.

chris-twiner · 2020-07-16T14:30:06Z

> @chris-twiner all looks good! I am going through some minor tests on my side and I think we are almost ready to merge. Thank you again for your time and patience.

Great stuff, you're welcome for the time, and reciprocal thanks for getting this official and for frameless in the first instance!

Incidentally, and as an FYI, there does seem to be some strangeness with nested types such as ArrayType[Binary] for compiled encoders vs interpreted ones (the case class has Array[Array[Byte]] but NewInstance is called with Array[Object]). This is a Spark 3 issue and I don't think much can or should be done about it within frameless. I'll try to make a small reproducible case for it against Spark proper and, if it turns out to be frameless, raise an issue separately rather than hold up the Spark 3 build.

imarios · 2020-07-26T17:25:54Z

@chris-twiner can you quickly resolve the conflicts with master? I think we made some small updates to the build. It should be easy. Thanks!

imarios · 2020-07-26T17:28:35Z

README.md

@@ -74,12 +74,12 @@ detailed comparison of `TypedDataset` with Spark's `Dataset` API.
 * [Proof of Concept: TypedDataFrame](http://typelevel.org/frameless/TypedDataFrame.html)

 ## Quick Start
-Frameless is compiled against Scala 2.11.x (and Scala 2.12.x since Frameless 0.8.0)
+Frameless is compiled against Scala 2.12.x


Let's mention that "Since the 0.9.x release, Frameless is compiled only against Scala 2.12.x"

imarios · 2020-07-26T18:06:00Z

dataset/src/main/scala/frameless/RecordEncoder.scala

@@ -162,23 +161,13 @@ class RecordEncoder[F, G <: HList, H <: HList]

    def fromCatalyst(path: Expression): Expression = {
      val exprs = fields.value.value.map { field =>
-        val fieldPath = path match {
-          case BoundReference(ordinal, dataType, nullable) =>


@chris-twiner was this part of the code causing issues? I am trying to think if this has any unexpected side effects. I am already seeing some difference with the previous version here:

In frameless 0.8

scala> val x = TypedDataset.create(Array(1,2,3,1))
x: frameless.TypedDataset[Int] = [_1: int]

In new PR

scala> val x = TypedDataset.create(Array(1,2,3,1))
x: frameless.TypedDataset[Int] = [value: int]

At least the field name seems to be printed differently (it was _1 before but now it's value).

chris-twiner · 2020-07-27

The problem is that it's not set by frameless code but by the expression encoder. You can't inject the schema any more.

But you are right, it's probably a breaking change. I've updated the readme to mention this.
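A minimal sketch, using plain Spark rather than frameless, of where the `value` name discussed above comes from: Spark's own encoder for a primitive type names the single column `value`. `ValueColumnSketch` and `primitiveFieldNames` are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ValueColumnSketch {
  // Field names Spark assigns to a Dataset built from plain Ints.
  def primitiveFieldNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    spark.createDataset(Seq(1, 2, 3, 1)).schema.fieldNames.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("value-column-sketch")
      .getOrCreate()
    try println(primitiveFieldNames(spark).mkString(", ")) // prints: value
    finally spark.stop()
  }
}
```

Code that previously selected the column by the name `_1` would need to refer to `value` instead.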

imarios · 2020-07-26T18:08:53Z

dataset/src/main/scala/frameless/TypedExpressionEncoder.scala

    */
  def targetStructType[A](encoder: TypedEncoder[A]): StructType = {
   encoder.catalystRepr match {
      case x: StructType =>
        if (encoder.nullable) StructType(x.fields.map(_.copy(nullable = true)))
        else x
-      case dt => new StructType().add("_1", dt, nullable = encoder.nullable)


Ok, I should have seen this before my last comment. Was there a reason for going back to value? Unfortunately, I don't remember why we changed this from value to _1 ...

chris-twiner · 2020-07-27

Per ExpressionEncoder, it's no longer possible to inject this schema, so this reflects the "correct", if breaking, change for any other code that uses it.

imarios · 2020-07-26T18:11:15Z

dataset/src/test/scala/frameless/ops/PivotTest.scala

@@ -8,19 +8,19 @@ import org.scalacheck.Prop._
 import org.scalacheck.{Gen, Prop}

 class PivotTest extends TypedDatasetSuite {
-  def withCustomGenX4: Gen[Vector[X4[String, String, Int, Boolean]]] = {


Was there an issue with Int?

chris-twiner · 2020-07-27

I've just retested it and it's working; previously it was not, due to interim changes on the ExpressionEncoder integration. Reverted.


imarios · 2020-08-31T04:13:23Z

@chris-twiner I am so sorry for the late review. Looks good!

chris-twiner added 7 commits July 4, 2020 00:03

fix typelevel#422, base for typelevel#427

23d93f9

better handling, schema on the serializer needs repacking for nulls

fe4798c

parity udf code, complex types still not working

0d27d43

tests using minus scales fixed

dd4eb67

tests using minus scales fixed

af34ebf

get rid of debug entry, doesn't work either way

fdccfb8

value is the new _1 and set by ExpressionEncoder, it's not possible t…

23868dc

…o replace schema directly so some nullable checks just won't work

chris-twiner mentioned this pull request Jul 6, 2020

Add support for Spark 3 Preview #427

Closed

chris-twiner added 2 commits July 6, 2020 19:54

nested types can't use ordinals directly in 3, bizarrely

f1e6b72

the paths are wrong for options, the nested types 'solution' doesn't …

a1d006b

…work though

chris-twiner added 6 commits July 7, 2020 17:50

got the basics, wrong starting place for record encoders, will fix up…

729c56d

… to minimal changes

encoders ported, 347 tests passed, 7 failed

a1a7d5b

sum and sumdistinct have the wrong types for the zero literal

7896e8a

add spark 2.4 join behaviour and enable ignoring of nullable

f39f550

version bump and readme change

c5a7078

remove 2.11 travis build bits

918ca8c

typelevel#427 - Remove dead code

9e0de29

chris-twiner mentioned this pull request Jul 9, 2020

UDF Performance Improvement #422

Closed

imarios reviewed Jul 26, 2020


chris-twiner added 2 commits July 27, 2020 11:31

typelevel#427 - Based on feedback on the pull, reverting test changes…

69ae591

… - no longer necessary

Merge branch 'master' into temp/Spark3

4713716

imarios merged commit 614986b into typelevel:master Aug 31, 2020

This was referenced Sep 3, 2020

Scalapb sparksql does not work with Spark 3 (preview) scalapb/sparksql-scalapb#97

Closed

Version 0.9 does not appear to be published #440

Closed

chris-twiner mentioned this pull request Jun 2, 2023

Spark 3.4.0 and DBR 12.2 support #699

Merged
