Update cats ecosystem to CE3 #616

Merged · 2 commits · Mar 15, 2022
2 changes: 2 additions & 0 deletions .gitignore
@@ -32,3 +32,5 @@ node_modules
.history
.metals
.vscode
.bloop
metals.sbt
60 changes: 32 additions & 28 deletions README.md
@@ -6,10 +6,10 @@
[![Maven Badge](https://img.shields.io/maven-central/v/org.typelevel/frameless-core_2.12?color=blue)](https://search.maven.org/search?q=g:org.typelevel%20and%20frameless)
[![Snapshots Badge](https://img.shields.io/nexus/s/https/oss.sonatype.org/org.typelevel/frameless-core_2.12)](https://oss.sonatype.org/content/repositories/snapshots/org/typelevel/)

Frameless is a Scala library for working with [Spark](http://spark.apache.org/) using more expressive types.
It consists of the following modules:

* `frameless-dataset` for a more strongly typed `Dataset`/`DataFrame` API
* `frameless-ml` for a more strongly typed Spark ML API based on `frameless-dataset`
* `frameless-cats` for using Spark's `RDD` API with [cats](https://github.com/typelevel/cats)

@@ -20,11 +20,10 @@ The Frameless project and contributors support the
[Typelevel](http://typelevel.org/) [Code of Conduct](http://typelevel.org/conduct.html) and want all its
associated channels (e.g. GitHub, Discord) to be a safe and friendly environment for contributing and learning.


## Versions and dependencies

The compatible versions of [Spark](http://spark.apache.org/) and
[cats](https://github.com/typelevel/cats) are as follows:

| Frameless | Spark | Cats | Cats-Effect | Scala
| --------- | ----- | -------- | ----------- | ---
@@ -38,10 +37,12 @@ The compatible versions of [Spark](http://spark.apache.org/) and
| 0.10.1 | 3.1.0 | 2.x | 2.x | 2.12
| 0.11.0* | 3.2.0 / 3.1.2 / 3.0.1| 2.x | 2.x | 2.12 / 2.13
| 0.11.1 | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13
| 0.12.0 | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 3.x | 2.12 / 2.13

_\* The 0.11.0 release published broken Spark 3.1.2 and 3.0.1 artifacts._

Starting with 0.11, we introduced Spark cross-published artifacts:

* By default, frameless artifacts depend on the most recent Spark version
* Suffix `-spark{major}{minor}` is added to artifacts that are released for the previous Spark version(s)

@@ -51,35 +52,35 @@ Artifact names examples:
* `frameless-dataset-spark31` (Spark 3.1.x dependency)
* `frameless-dataset-spark30` (Spark 3.0.x dependency)
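
For example, to pin a module to the Spark 3.1.x line in sbt (a sketch; `0.12.0` is the latest version from the table above):

```scala
// Hypothetical build.sbt line: select the Spark 3.1.x cross-build of
// frameless-dataset rather than the default (latest Spark) artifact.
libraryDependencies += "org.typelevel" %% "frameless-dataset-spark31" % "0.12.0"
```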

Versions 0.5.x and 0.6.x have identical features. The first is compatible with Spark 2.2.1 and the second with 2.3.0.

The **only** dependency of the `frameless-dataset` module is [shapeless](https://github.com/milessabin/shapeless) 2.3.2.
Therefore, depending on `frameless-dataset` adds minimal overhead to your Spark application's jar.
Only the `frameless-cats` module depends on cats and cats-effect, so if you prefer to work just with `Datasets` and not with `RDD`s,
you may choose not to depend on `frameless-cats`.

Frameless intentionally **does not** have a compile dependency on Spark.
This essentially allows you to use any version of Frameless with any version of Spark.
The table above simply lists the versions of Spark against which we officially compile
and test Frameless, but other versions will likely work as well.
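
In practice this means your own build supplies Spark. A minimal sketch, assuming Spark 3.2.0 and Frameless 0.12.0:

```scala
// build.sbt sketch: Frameless does not bring Spark in for you, so declare
// the Spark modules you use alongside the frameless artifacts.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"         % "3.2.0" % Provided,
  "org.typelevel"    %% "frameless-dataset" % "0.12.0"
)
```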

### Breaking changes in 0.9

* Spark 3 introduces a new `ExpressionEncoder` approach; the schema for single-value DataFrames is now ["value"](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L270), not "_1".
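
As a sketch of what this means for user code (hypothetical names; `typedDs` stands for any single-value `TypedDataset`):

```scala
// Code that assumed a "_1" column on single-value frames must now use
// "value", e.g. when dropping down to the untyped DataFrame API.
val df = typedDs.dataset.toDF()
df.select("value") // was df.select("_1") on the Spark 2 based releases
```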

## Why?

Frameless introduces a new Spark API, called `TypedDataset`.
The benefits of using `TypedDataset` compared to the standard Spark `Dataset` API are as follows:

* Typesafe column referencing (e.g., no more runtime errors when accessing non-existent columns)
* Customizable, typesafe encoders (e.g., if a type does not have an encoder, it should not compile)
* Enhanced type signatures for built-in functions (e.g., if you apply an arithmetic operation on a non-numeric column, you
get a compilation error)
* Typesafe casting and projections
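
A minimal sketch of the first bullet, assuming an implicit `SparkSession` is in scope:

```scala
import frameless.TypedDataset

case class Person(name: String, age: Int)

val people: TypedDataset[Person] =
  TypedDataset.create(Seq(Person("Ada", 36), Person("Alan", 41)))

val ages = people.select(people('age)) // compiles: `age` is a column of Person
// people.select(people('salary))      // rejected at compile time: no such column
```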

Click [here](http://typelevel.org/frameless/TypedDatasetVsSparkDataset.html) for a
detailed comparison of `TypedDataset` with Spark's `Dataset` API.

## Documentation

@@ -93,6 +94,7 @@ detailed comparison of `TypedDataset` with Spark's `Dataset` API.
* [Proof of Concept: TypedDataFrame](http://typelevel.org/frameless/TypedDataFrame.html)

## Quick Start

Since the 0.9.x release, Frameless is compiled against Scala 2.12.x, and since 0.11 also against Scala 2.13.x.

To use Frameless in your project, add the following to your `build.sbt` file as needed:
@@ -103,17 +105,18 @@ val framelessVersion = "0.9.0" // for Spark 3.0.0
libraryDependencies ++= List(
"org.typelevel" %% "frameless-dataset" % framelessVersion,
"org.typelevel" %% "frameless-ml" % framelessVersion,
"org.typelevel" %% "frameless-cats" % framelessVersion
"org.typelevel" %% "frameless-cats" % framelessVersion
)
```

An easy way to bootstrap a Frameless sbt project:

* if you have [Giter8][g8] installed then simply:

```bash
g8 imarios/frameless.g8
```

* with sbt >= 0.13.13:

```bash
@@ -125,12 +128,12 @@ and all its dependencies loaded (including Spark).

## Need help?

Feel free to message us on our [discord](https://discord.gg/ZDZsxWcBJt)
channel for any issues/questions.

## Development

We require at least _one_ sign-off (thumbs-up, +1, or similar) to merge pull requests. The current maintainers
(people who can merge pull requests) are:

* [adelbertc](https://github.com/adelbertc)
@@ -151,7 +154,8 @@ be set to adjust the size of generated collections in the `TypedDataSet` suite:
| FRAMELESS_GEN_SIZE_RANGE | 20 |

## License

Code is provided under the Apache 2.0 license available at <http://opensource.org/licenses/Apache-2.0>,
as well as in the LICENSE file. This is the same license used by Spark.

[g8]: http://www.foundweekends.org/giter8/
16 changes: 9 additions & 7 deletions build.sbt
@@ -2,18 +2,19 @@ val sparkVersion = "3.2.1"
val spark31Version = "3.1.3"
val spark30Version = "3.0.3"
val catsCoreVersion = "2.7.0"
val catsEffectVersion = "2.4.0"
val catsMtlVersion = "0.7.1"
val catsEffectVersion = "3.3.5"
val catsMtlVersion = "1.2.0"
val scalatest = "3.2.11"
val scalatestplus = "3.1.0.0-RC2"
val shapeless = "2.3.7"
val scalacheck = "1.15.4"
val scalacheckEffect = "1.0.3"
val refinedVersion = "0.9.28"

val Scala212 = "2.12.15"
val Scala213 = "2.13.8"

ThisBuild / tlBaseVersion := "0.11"
ThisBuild / tlBaseVersion := "0.12"

ThisBuild / crossScalaVersions := Seq(Scala213, Scala212)
ThisBuild / scalaVersion := Scala212
@@ -160,10 +161,11 @@ def sparkMlDependencies(sparkVersion: String, scope: Configuration = Provided) =
lazy val catsSettings = framelessSettings ++ Seq(
addCompilerPlugin("org.typelevel" % "kind-projector" % "0.13.2" cross CrossVersion.full),
libraryDependencies ++= Seq(
"org.typelevel" %% "cats-core" % catsCoreVersion,
"org.typelevel" %% "cats-effect" % catsEffectVersion,
"org.typelevel" %% "cats-mtl-core" % catsMtlVersion,
"org.typelevel" %% "alleycats-core" % catsCoreVersion
"org.typelevel" %% "cats-core" % catsCoreVersion,
"org.typelevel" %% "cats-effect" % catsEffectVersion,
"org.typelevel" %% "cats-mtl" % catsMtlVersion,
"org.typelevel" %% "alleycats-core" % catsCoreVersion,
"org.typelevel" %% "scalacheck-effect" % scalacheckEffect % Test
)
)

8 changes: 4 additions & 4 deletions cats/src/main/scala/frameless/cats/FramelessSyntax.scala
@@ -2,12 +2,12 @@ package frameless
package cats

import _root_.cats.effect.Sync
-import _root_.cats.implicits._
-import _root_.cats.mtl.ApplicativeAsk
+import _root_.cats.syntax.all._
+import _root_.cats.mtl.Ask
import org.apache.spark.sql.SparkSession

trait FramelessSyntax extends frameless.FramelessSyntax {
-  implicit class SparkJobOps[F[_], A](fa: F[A])(implicit S: Sync[F], A: ApplicativeAsk[F, SparkSession]) {
+  implicit class SparkJobOps[F[_], A](fa: F[A])(implicit S: Sync[F], A: Ask[F, SparkSession]) {
import S._, A._

def withLocalProperty(key: String, value: String): F[A] =
@@ -19,6 +19,6 @@ trait FramelessSyntax extends frameless.FramelessSyntax {

def withGroupId(groupId: String): F[A] = withLocalProperty("spark.jobGroup.id", groupId)

-    def withDescription(description: String) = withLocalProperty("spark.job.description", description)
+    def withDescription(description: String): F[A] = withLocalProperty("spark.job.description", description)
}
}
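
A hypothetical usage sketch of these ops, with `ReaderT[IO, SparkSession, *]` supplying both the `Sync` and `Ask` instances (the names here are illustrative):

```scala
import cats.data.ReaderT
import cats.effect.IO
import org.apache.spark.sql.SparkSession
import frameless.cats.implicits._

type Action[A] = ReaderT[IO, SparkSession, A]

// Tag a Spark job with a group id and description before it runs.
def countRange(n: Long): Action[Long] =
  ReaderT((session: SparkSession) => IO(session.range(n).count()))
    .withGroupId("nightly-batch")
    .withDescription("count a generated range")
```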
2 changes: 1 addition & 1 deletion cats/src/main/scala/frameless/cats/implicits.scala
@@ -3,7 +3,7 @@ package cats

import _root_.cats._
import _root_.cats.kernel.{CommutativeMonoid, CommutativeSemigroup}
-import _root_.cats.implicits._
+import _root_.cats.syntax.all._
import alleycats.Empty

import scala.reflect.ClassTag
39 changes: 21 additions & 18 deletions cats/src/test/scala/frameless/cats/FramelessSyntaxTests.scala
@@ -3,11 +3,14 @@ package cats

import _root_.cats.data.ReaderT
import _root_.cats.effect.IO
-import frameless.{ TypedDataset, TypedDatasetSuite, TypedEncoder, X2 }
+import _root_.cats.effect.unsafe.implicits.global
 import org.apache.spark.sql.SparkSession
+import org.scalatest.matchers.should.Matchers
+import org.scalacheck.{Test => PTest}
 import org.scalacheck.Prop, Prop._
+import org.scalacheck.effect.PropF, PropF._

-class FramelessSyntaxTests extends TypedDatasetSuite {
+class FramelessSyntaxTests extends TypedDatasetSuite with Matchers {
override val sparkDelay = null

def prop[A, B](data: Vector[X2[A, B]])(
@@ -30,21 +33,21 @@ class FramelessSyntaxTests extends TypedDatasetSuite {

test("properties can be read back") {
import implicits._
-  import _root_.cats.implicits._
-  import _root_.cats.mtl.implicits._
-
-  check {
-    forAll { (k: String, v: String) =>
-      val scopedKey = "frameless.tests." + k
-      1.pure[ReaderT[IO, SparkSession, *]].withLocalProperty(scopedKey, v).run(session).unsafeRunSync()
-      sc.getLocalProperty(scopedKey) ?= v
-
-      1.pure[ReaderT[IO, SparkSession, *]].withGroupId(v).run(session).unsafeRunSync()
-      sc.getLocalProperty("spark.jobGroup.id") ?= v
-
-      1.pure[ReaderT[IO, SparkSession, *]].withDescription(v).run(session).unsafeRunSync()
-      sc.getLocalProperty("spark.job.description") ?= v
-    }
-  }
+  import _root_.cats.syntax.all._
+
+  forAllF { (k: String, v: String) =>
+    val scopedKey = "frameless.tests." + k
+    1
+      .pure[ReaderT[IO, SparkSession, *]]
+      .withLocalProperty(scopedKey, v)
+      .withGroupId(v)
+      .withDescription(v)
+      .run(session)
+      .map { _ =>
+        sc.getLocalProperty(scopedKey) shouldBe v
+        sc.getLocalProperty("spark.jobGroup.id") shouldBe v
+        sc.getLocalProperty("spark.job.description") shouldBe v
+      }.void
+  }.check().unsafeRunSync().status shouldBe PTest.Passed
}
}
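
The reshaped test follows the general scalacheck-effect pattern: the property body returns an effect, `check()` yields the result inside that effect, and the CE3 runtime is invoked exactly once at the edge. A stripped-down sketch with a hypothetical property, assuming the same imports as above:

```scala
// forAllF builds an effectful property from a function returning F[Unit];
// check() runs it inside IO, executed here with the global CE3 runtime.
forAllF { (s: String) =>
  IO.pure(s.reverse.reverse).map(_ shouldBe s).void
}.check().unsafeRunSync().status shouldBe PTest.Passed
```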
2 changes: 1 addition & 1 deletion cats/src/test/scala/frameless/cats/test.scala
@@ -2,7 +2,7 @@ package frameless
package cats

import _root_.cats.Foldable
-import _root_.cats.implicits._
+import _root_.cats.syntax.all._

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
18 changes: 10 additions & 8 deletions docs/Cats.md
@@ -17,7 +17,7 @@ System.setProperty("spark.cleaner.ttl", "300")

import spark.implicits._

-import cats.implicits._
+import cats.syntax.all._
import cats.effect.{IO, Sync}
import cats.data.ReaderT
```
@@ -31,18 +31,18 @@ All the examples below assume you have previously imported `cats.implicits` and
*Note that you should not import `frameless.syntax._` together with `frameless.cats.implicits._`.*

```scala mdoc
-import cats.implicits._
+import cats.syntax.all._
import frameless.cats.implicits._
```

## Effect Suspension in typed datasets

As noted in the section about `Job`, all operations on `TypedDataset` are lazy. The results of
operations that would normally block on plain Spark APIs are wrapped in a type constructor `F[_]`,
for which there exists an instance of `SparkDelay[F]`. This typeclass represents the operation of
delaying a computation and capturing an implicit `SparkSession`.

In the `cats` module, we utilize the typeclasses from `cats-effect` for abstracting over these
effect types - namely, we provide an implicit `SparkDelay` instance for all `F[_]` for which there exists
an instance of `cats.effect.Sync[F]`.

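Concretely, any `F[_]` with a `Sync` instance can suspend a blocking Spark call. A sketch, assuming a `TypedDataset` named `ds` and an implicit `SparkSession` in scope:

```scala
import cats.data.ReaderT
import cats.effect.IO
import frameless.cats.implicits._
import org.apache.spark.sql.SparkSession

type Action[A] = ReaderT[IO, SparkSession, A]

// count() is suspended into Action via the SparkDelay instance derived
// from Sync[Action]; nothing runs until the IO is executed.
val total: Action[Long] = ds.count[Action]()
```
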
@@ -70,6 +70,8 @@ As with `Job`, note that nothing has been run yet. The effect has been properly
run our program, we must first supply the `SparkSession` to the `ReaderT` layer and then
run the `IO` effect:
```scala mdoc
import cats.effect.unsafe.implicits.global

result.run(spark).unsafeRunSync()
```

@@ -132,7 +134,7 @@ println(data.cmax)
println(data.cmaxOption)
println(data.cmin)
println(data.cminOption)
```

The following example aggregates all the elements with a common key.
