Alternate column referencing syntax for TypedDataset #39

Closed
OlivierBlanvillain opened this issue Jun 6, 2016 · 10 comments

Comments

@OlivierBlanvillain
Contributor

It would be nice to add an alternate column referencing syntax to the TypedDataset API which is closer to the vanilla syntax, similarly to the way it's done in TypeDataFrame.

Currently it looks like: (source)

val dataset = TypedDataset.create(data)
val A = dataset.col[A]('a)
val B = dataset.col[B]('b)

val dataset2 = dataset.select(A, B).collect().run().toVector

I think it should be possible to change it to something like:

val dataset = TypedDataset.create(data)
val dataset2 = dataset.select('a, 'b).collect().run().toVector

It would also be interesting to investigate an alternate syntax for td.colMany('b, 'b) (equivalent to accessing _.b.b).

@kanterov
Contributor

kanterov commented Jun 7, 2016

I was playing with this by trying an implicit conversion from Symbol to TypedColumn, but I wasn't able to capture the symbol's value at the type level this way. It should be possible with an implicit macro, but we should minimize the amount of macro code we write ourselves and rely on tools from shapeless instead.

@imarios
Contributor

imarios commented Oct 10, 2016

Hey guys, I think this is a great feature and it will make writing expressions much cleaner. I was able to get this working:

def select[A](column: Witness.Lt[Symbol])(
    implicit
    exists: TypedColumn.Exists[T, column.T, A],
    encoder: TypedEncoder[A]): TypedDataset[A] = select(col(column))

It combines what the col method does to get the typed column and then feeds that column to the existing select.

With this you can do select('foo) and it works.

Unfortunately, this causes a strange issue when passing a TypedAggregateAndColumn to select. For example, test("count") in AggregateFunctionsTests.scala stopped compiling.

Obviously, the solution is not perfect (since it's causing an issue), but the direction might be promising? What do you guys think? Any ideas?
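
The idea of resolving a column by name and letting implicit evidence fix its type can be sketched without Spark or shapeless. This is only an illustration: Exists, Dataset and Person below are hypothetical stand-ins rather than the frameless types, and the sketch swaps Witness.Lt[Symbol] for string literal types, so it assumes Scala 2.13 or later.

```scala
// Sketch only: Exists, Dataset and Person are hypothetical stand-ins,
// not the frameless types. Assumes Scala 2.13+ for literal types.
object SelectByNameSketch {
  // Evidence that record type T has a column named K holding values of type A.
  trait Exists[T, K, A] {
    def get(t: T): A
  }

  final case class Person(name: String, age: Int)

  object Person {
    // Hand-written evidence; a real implementation would derive it.
    implicit val hasName: Exists[Person, "name", String] =
      new Exists[Person, "name", String] { def get(p: Person): String = p.name }
    implicit val hasAge: Exists[Person, "age", Int] =
      new Exists[Person, "age", Int] { def get(p: Person): Int = p.age }
  }

  final class Dataset[T](val rows: Vector[T]) {
    // K is inferred as the literal type of the argument; A is fixed by the
    // implicit search, so an unknown column name fails to compile.
    def select[K <: String with Singleton, A](column: K)(
        implicit exists: Exists[T, K, A]): Dataset[A] =
      new Dataset(rows.map(r => exists.get(r)))
  }

  val people = new Dataset(Vector(Person("Ada", 36), Person("Alan", 41)))
  val names: Vector[String] = people.select("name").rows
  val ages: Vector[Int] = people.select("age").rows
}
```

A call such as people.select("height") would be rejected at compile time because no matching evidence exists, which is the property the symbol-based select is after.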

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Oct 10, 2016

The biggest challenge with this syntax (besides IDE support) is supporting the full scope of Spark Column expressions, that is, being able to write things like select('foo + 1).

Early work on the lib made use of shapeless' SingletonProductArgs macro to solve the non-expression part of this problem. If you are interested, here is the select implementation and test from git's history.
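
To make the expression part concrete: once a column is an ordinary typed value, arithmetic such as + 1 can be a plain method on it, so the hard step is getting from a bare name to that typed value. The sketch below shows only the easy half. Expr, Row and this select are hypothetical names for illustration, not frameless or Spark APIs.

```scala
// Sketch only: Expr and Row are hypothetical, not frameless or Spark types.
object ColumnExprSketch {
  final case class Row(foo: Int, bar: Int)

  // A typed expression over rows of type T that evaluates to an A.
  final case class Expr[T, A](eval: T => A) {
    // Arithmetic with a literal becomes an ordinary method on the expression.
    def +(n: A)(implicit num: Numeric[A]): Expr[T, A] =
      Expr[T, A](t => num.plus(eval(t), n))
  }

  // Stand-in for a column that has already been resolved from its name.
  val foo: Expr[Row, Int] = Expr((r: Row) => r.foo)

  def select[T, A](rows: Vector[T])(e: Expr[T, A]): Vector[A] =
    rows.map(e.eval)

  val result: Vector[Int] = select(Vector(Row(1, 2), Row(3, 4)))(foo + 1)
}
```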

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Oct 10, 2016

I think we could make @kanterov's idea of "implicit conversion from Symbol to TypedColumn" work in typelevel-scala:

scala> :paste
// Entering paste mode (ctrl-D to finish)

trait TypedColumn[S <: Singleton]

implicit def lift[S <: Singleton](s: S): TypedColumn[S] = new TypedColumn[S] {}

def select[A <: Singleton](p: TypedColumn[A]) = p

implicit class AddIntToTypedColumn(i: Int) {
  def plus[S <: Singleton](s: TypedColumn[S]) = s
}

// Exiting paste mode, now interpreting.

defined trait TypedColumn
lift: [S <: Singleton](s: S)TypedColumn[S]
select: [A <: Singleton](p: TypedColumn[A])TypedColumn[A]
defined class AddIntToTypedColumn

scala> select("hello")
res0: TypedColumn["hello"] = $anon$1@7e40c3aa

scala> select(1 plus "hello")
res1: TypedColumn["hello"] = $anon$1@5d864a5

@imarios
Contributor

imarios commented Oct 10, 2016

@OlivierBlanvillain yes, supporting expressions with select should definitely be part of any solution.
Btw the above snippet gives this error for me:

// Exiting paste mode, now interpreting.

defined trait TypedColumn
lift: [S <: Singleton](s: S)TypedColumn[S]
select: [A <: Singleton](p: TypedColumn[A])TypedColumn[A]
defined class AddIntToTypedColumn

scala> select("hello")
<console>:19: error: type mismatch;
 found   : String("hello")
 required: TypedColumn[?]
       select("hello")

@OlivierBlanvillain
Contributor Author

On my setup it works with the following:

$ cat build.sbt
scalaVersion := "2.11.8"

scalaOrganization := "org.typelevel"

libraryDependencies ++= Seq(
  "org.typelevel" %% "cats"      % "0.7.2",
  "com.chuusai"   %% "shapeless" % "2.3.2")

scalacOptions := Seq(
  "-deprecation",
  "-encoding", "UTF-8",
  "-feature",
  "-language:implicitConversions",
  "-unchecked",
  "-Xfuture",
  "-Xlint",
  "-Yinline-warnings",
  "-Yno-adapted-args",
  "-Ywarn-dead-code",
  "-Ywarn-numeric-widen",
  "-Ypartial-unification",
  "-Yliteral-types",
  "-Ywarn-value-discard")
$ cat project/build.properties 
sbt.version=0.13.13-RC2
$ sbt console
[...]

@kanterov
Contributor

@OlivierBlanvillain This is awesome. Does it require Typelevel Scala to compile user code? If so, we might still want to investigate whether we can reuse a macro from shapeless somehow. It would be nice to find a cheap solution instead of forcing users to switch Scala compilers :).

@OlivierBlanvillain
Contributor Author

OlivierBlanvillain commented Oct 11, 2016

Yes, we would need this PR to be merged to have singleton types in Lightbend Scala.

@joan38

joan38 commented Jul 18, 2019

Wow, I'm looking forward to having this if one day Spark compiles with 2.13.0.
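
Scala 2.13 matters here because literal singleton types are part of the language from that version on, with no compiler flag needed. The following is a minimal sketch of what that allows, assuming Scala 2.13 or later. NamedColumn, col and nameOf are hypothetical names, and the sketch uses an explicit col call rather than an implicit conversion to stay simple.

```scala
// Sketch only: NamedColumn is a hypothetical stand-in for a typed column.
// Assumes Scala 2.13+ (or Scala 3), where literal types are built in.
object LiteralColumnSketch {
  final class NamedColumn[S <: String with Singleton](val name: S)

  // The Singleton bound lets the compiler infer the literal type "hello"
  // instead of widening the argument to String.
  def col[S <: String with Singleton](s: S): NamedColumn[S] =
    new NamedColumn[S](s)

  // ValueOf recovers the runtime value from the literal type alone.
  def nameOf[S <: String with Singleton](implicit v: ValueOf[S]): S = v.value

  val hello: NamedColumn["hello"] = col("hello")
  val recovered: String = nameOf["hello"]
}
```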

@cchantep
Collaborator

cchantep commented Sep 7, 2021

Hi, closing it for now with #449 merged.

@cchantep cchantep closed this as completed Sep 7, 2021