resolve #427 - Spark 3 support #433
Conversation
…o replace schema directly so some nullable checks just won't work
The RecordEncoder fromCatalyst paths aren't working as before: for nested types like X1 the path is X1 but needs to move past that path, and there is a similar problem for Option. The check-in / the changes with class names are not intended to be merged, only to indicate where the change is observable. So X1(Person) will work, because NewInstance respects the path, but WrapOption complains, as the path is the X1 type and it never accepts the "a". That can be forced, but then it won't type check - odd.
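To make the failing shape concrete, here is a minimal sketch of the kind of nested and optional values being discussed (X1 and Person mirror the frameless test helpers; the SparkSession wiring and encoder derivation are assumed to be in scope, as in the test suite):

```scala
import org.apache.spark.sql.SparkSession
import frameless.TypedDataset

// X1 and Person mirror the frameless test helpers; names are illustrative.
case class X1[A](a: A)
case class Person(name: String, age: Int)

// frameless' create picks up an implicit SparkSession and derives TypedEncoders.
implicit val spark: SparkSession =
  SparkSession.builder().master("local[*]").appName("path-repro").getOrCreate()

// Works: NewInstance respects the nested path when rebuilding Person inside X1.
val nested = TypedDataset.create(Seq(X1(Person("a", 1))))

// The case described above: the deserializer path stays at the X1 level, so
// WrapOption never gets to the inner "a" field it is supposed to wrap.
val optional = TypedDataset.create(Seq(X1(Option("a"))))
```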
… to minimal changes
Codecov Report
@@             Coverage Diff             @@
##            master     #433      +/-   ##
===========================================
- Coverage    96.83%   96.22%   -0.62%
===========================================
  Files           60       60
  Lines         1044     1034      -10
  Branches         3        4       +1
===========================================
- Hits          1011      995      -16
- Misses          33       39       +6
Continue to review full report at Codecov.
This looks great! I have to go over some parts in more detail. Let me try to wrap the review up this weekend. Thank you again for all the hard work!
@chris-twiner all looks good! I am going through some minor tests on my side and I think we are almost ready to merge. Thank you again for your time and patience.
great stuff, you're welcome for the time, and reciprocal thanks for getting this official and for frameless in the first instance! Incidentally, and as an FYI, there does seem to be some strangeness with nested types like ArrayType[Binary] for compiled encoders vs interpreted ones (the case class has Array[Array[Byte]] but NewInstance is called with Array[Object]). This is a Spark 3 issue and I don't think much can or should be done about it within frameless. I'll try to get a small reproducible case made against Spark proper, and raise an issue separately if it turns out to be frameless, rather than hold up the Spark 3 build-out.
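A small sketch of the nested-binary shape being described (hypothetical case class name; the real minimal reproduction would be built against Spark itself, and the implicit SparkSession is assumed in scope as above):

```scala
import frameless.TypedDataset

// Hypothetical shape for illustration only.
case class Blobs(chunks: Array[Array[Byte]])

// Array[Byte] maps to BinaryType, so `chunks` becomes ArrayType(BinaryType).
// The compiled (codegen) encoder rebuilds Array[Array[Byte]] as expected,
// whereas the interpreted path was observed handing NewInstance an Array[Object].
val ds = TypedDataset.create(Seq(Blobs(Array(Array[Byte](1, 2, 3)))))
```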
@chris-twiner can you quickly resolve the conflicts with master? I think we made some small update changes in the build. It should be easy. Thanks!
README.md
Outdated
@@ -74,12 +74,12 @@ detailed comparison of `TypedDataset` with Spark's `Dataset` API.
* [Proof of Concept: TypedDataFrame](http://typelevel.org/frameless/TypedDataFrame.html)

## Quick Start
Frameless is compiled against Scala 2.11.x (and Scala 2.12.x since Frameless 0.8.0)
Frameless is compiled against Scala 2.12.x |
Let's mention that "Since the 0.9.x release, Frameless is compiled only against Scala 2.12.x"
changed.
@@ -162,23 +161,13 @@ class RecordEncoder[F, G <: HList, H <: HList]

  def fromCatalyst(path: Expression): Expression = {
    val exprs = fields.value.value.map { field =>
      val fieldPath = path match {
        case BoundReference(ordinal, dataType, nullable) =>
@chris-twiner was this part of the code causing issues? I am trying to think if this has any unexpected side effects. I am already seeing some difference with the previous version here:
In frameless 0.8
scala> val x = TypedDataset.create(Array(1,2,3,1))
x: frameless.TypedDataset[Int] = [_1: int]
In new PR
scala> val x = TypedDataset.create(Array(1,2,3,1))
x: frameless.TypedDataset[Int] = [value: int]
At least the field name seems to be printed differently (it was `_1` before, but now it's `value`).
The problem is that it's not set by frameless code but by the ExpressionEncoder; you can't inject the schema any more.
But you are right, it's probably breaking - I've updated the readme to mention this.
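For context, a rough, from-memory sketch of the constructor change behind "you can't inject the schema any more" (simplified signatures, not lifted from the PR):

```scala
// Spark 2.x (roughly):
//   ExpressionEncoder(schema, flat, serializer, deserializer, classTag)
//
// Spark 3.x (roughly):
//   ExpressionEncoder(objSerializer, objDeserializer, classTag)
//   // .schema is derived from the serializer expressions, so frameless can
//   // only shape it indirectly through the serializer it builds.
```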
 */
def targetStructType[A](encoder: TypedEncoder[A]): StructType = {
  encoder.catalystRepr match {
    case x: StructType =>
      if (encoder.nullable) StructType(x.fields.map(_.copy(nullable = true)))
      else x
    case dt => new StructType().add("_1", dt, nullable = encoder.nullable)
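A rough illustration of what this helper yields (assumed behaviour read off the snippet above, expressed with plain Spark SQL types rather than TypedEncoder instances):

```scala
import org.apache.spark.sql.types._

// Struct-like representations pass through (nullability widened when the
// encoder itself is nullable)...
val person: StructType =
  new StructType().add("name", StringType).add("age", IntegerType)

// ...while non-struct representations get wrapped in a single-field struct,
// which is where the "_1" versus "value" naming question below comes in.
val wrappedInt: StructType =
  new StructType().add("_1", IntegerType, nullable = false)
```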
Ok, I should have seen this before my last comment. Was there a reason for going back to `value`? Unfortunately, I don't remember why we changed this from `value` to `_1`...
Per ExpressionEncoder, it's no longer possible to inject this schema, so this reflects the "correct", if breaking, change for any other code that uses it.
@@ -8,19 +8,19 @@ import org.scalacheck.Prop._
import org.scalacheck.{Gen, Prop}

class PivotTest extends TypedDatasetSuite {
  def withCustomGenX4: Gen[Vector[X4[String, String, Int, Boolean]]] = {
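For reference, a hedged sketch of a generator matching that signature (X4 mirrors the frameless test helper; the concrete values are illustrative, and the PR's actual generator presumably constrains them for the pivoted columns):

```scala
import org.scalacheck.Gen

// X4 mirrors the frameless test helper case class.
case class X4[A, B, C, D](a: A, b: B, c: C, d: D)

def withCustomGenX4: Gen[Vector[X4[String, String, Int, Boolean]]] =
  Gen.containerOfN[Vector, X4[String, String, Int, Boolean]](4,
    for {
      a <- Gen.oneOf("a", "b")
      b <- Gen.oneOf("x", "y")
      c <- Gen.choose(0, 10)
      d <- Gen.oneOf(true, false)
    } yield X4(a, b, c, d)
  )
```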
Was there an issue with `Int`?
I've just retested it and it's working; previously it was not, due to interim changes in the ExpressionEncoder integration. Reverted.
@chris-twiner I am so sorry for the late review. Looks good!
Key changes: the ExpressionEncoder now takes over much of what the TypedExpressionEncoder was doing, but it also makes some fundamental changes, such as allowing only one GetColumnByOrdinal in a deserializer. Overall, the tests from dataset are now at 49 failed / 305 passed.
There are other locations where `_1` is used, such as TypedDataset, which should now probably be `value`; that would explain some star-expression failures.
Other key changes: scale cannot be negative for decimals, type coercion seems to have changed*, and many join tests fail due to spark.sql.analyzer.failAmbiguousSelfJoin; see the migration guide for details.
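For the ambiguous self-join failures, a hedged sketch of one way to restore the Spark 2.x behaviour while migrating (whether to disable the flag is a per-project decision, not something the PR prescribes):

```scala
import org.apache.spark.sql.SparkSession

// spark.sql.analyzer.failAmbiguousSelfJoin is a Spark 3.0+ analyzer flag that
// defaults to true; turning it off restores the 2.x behaviour while the
// affected join tests are reworked.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("frameless-spark3-tests")
  .config("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
  .getOrCreate()
```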