Update cats ecosystem to CE3
Daenyth committed Feb 2, 2022
1 parent f205a15 commit 4960366
Showing 8 changed files with 60 additions and 54 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -32,3 +32,5 @@ node_modules
.history
.metals
.vscode
.bloop
metals.sbt
60 changes: 32 additions & 28 deletions README.md
@@ -6,10 +6,10 @@
[![Maven Badge](https://img.shields.io/maven-central/v/org.typelevel/frameless-core_2.12?color=blue)](https://search.maven.org/search?q=g:org.typelevel%20and%20frameless)
[![Snapshots Badge](https://img.shields.io/nexus/s/https/oss.sonatype.org/org.typelevel/frameless-core_2.12)](https://oss.sonatype.org/content/repositories/snapshots/org/typelevel/)

Frameless is a Scala library for working with [Spark](http://spark.apache.org/) using more expressive types.
It consists of the following modules:

* `frameless-dataset` for a more strongly typed `Dataset`/`DataFrame` API
* `frameless-ml` for a more strongly typed Spark ML API based on `frameless-dataset`
* `frameless-cats` for using Spark's `RDD` API with [cats](https://github.com/typelevel/cats)

@@ -20,11 +20,10 @@ The Frameless project and contributors support the
[Typelevel](http://typelevel.org/) [Code of Conduct](http://typelevel.org/conduct.html) and want all its
associated channels (e.g. GitHub, Discord) to be a safe and friendly environment for contributing and learning.


## Versions and dependencies

The compatible versions of [Spark](http://spark.apache.org/) and
[cats](https://github.com/typelevel/cats) are as follows:

| Frameless | Spark | Cats | Cats-Effect | Scala
| --------- | ----- | -------- | ----------- | ---
@@ -38,10 +37,12 @@ The compatible versions of [Spark](http://spark.apache.org/) and
| 0.10.1 | 3.1.0 | 2.x | 2.x | 2.12
| 0.11.0* | 3.2.0 / 3.1.2 / 3.0.1| 2.x | 2.x | 2.12 / 2.13
| 0.11.1 | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13
| 0.12.0 | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 3.x | 2.12 / 2.13

_\* 0.11.0 was published with broken Spark 3.1.2 and 3.0.1 artifacts._

Starting with 0.11, we introduced Spark cross-published artifacts:

* By default, frameless artifacts depend on the most recent Spark version
* Suffix `-spark{major}{minor}` is added to artifacts that are released for the previous Spark version(s)

@@ -51,35 +52,35 @@ Artifact names examples:
* `frameless-dataset-spark31` (Spark 3.1.x dependency)
* `frameless-dataset-spark30` (Spark 3.0.x dependency)
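
For example, to pin the Spark 3.1.x build of `frameless-dataset` in sbt (the version number here is illustrative):

```scala
// build.sbt fragment — hypothetical version, pick the release matching your Spark
libraryDependencies += "org.typelevel" %% "frameless-dataset-spark31" % "0.12.0"
```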

Versions 0.5.x and 0.6.x have identical features. The first is compatible with Spark 2.2.1 and the second with 2.3.0.

The **only** dependency of the `frameless-dataset` module is [shapeless](https://github.com/milessabin/shapeless) 2.3.2.
Therefore, depending on `frameless-dataset` adds minimal overhead to your Spark application's jar.
Only the `frameless-cats` module depends on cats and cats-effect, so if you prefer to work just with `Datasets` and not with `RDD`s,
you may choose not to depend on `frameless-cats`.

Frameless intentionally **does not** have a compile dependency on Spark.
This essentially allows you to use any version of Frameless with any version of Spark.
The table above simply lists the versions of Spark we officially compile
and test Frameless against; other versions will likely work as well.
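
Since Frameless does not pull in Spark at compile time, your build supplies it. A minimal sbt sketch (version numbers are assumptions, adjust to the table above) marks Spark as `Provided`:

```scala
// build.sbt sketch — illustrative versions only
val sparkVersion = "3.2.1"

libraryDependencies ++= Seq(
  // your cluster/runtime supplies Spark, so it is Provided here
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
  "org.typelevel"    %% "frameless-dataset" % "0.12.0"
)
```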

### Breaking changes in 0.9

* Spark 3 introduces a new `ExpressionEncoder` approach; the schema for single-value DataFrames is now ["value"](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L270), not "_1".

## Why?

Frameless introduces a new Spark API, called `TypedDataset`.
The benefits of using `TypedDataset` compared to the standard Spark `Dataset` API are as follows:

* Typesafe column referencing (e.g., no more runtime errors when accessing non-existing columns)
* Customizable, typesafe encoders (e.g., if a type does not have an encoder, it should not compile)
* Enhanced type signature for built-in functions (e.g., if you apply an arithmetic operation on a non-numeric column, you
get a compilation error)
* Typesafe casting and projections
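
As a rough sketch of the first two points (the case class and column names are made up, and an implicit `SparkSession` is assumed to be in scope for encoder/dataset creation):

```scala
import frameless.TypedDataset
import frameless.syntax._

case class Person(name: String, age: Int)

// TypedEncoder[Person] is derived; an implicit SparkSession is assumed in scope
val people: TypedDataset[Person] = TypedDataset.create(Seq(Person("Ada", 36)))

people.select(people('age))        // compiles: 'age is a column of Person
// people.select(people('salary))  // does not compile: Person has no 'salary column
```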

Click [here](http://typelevel.org/frameless/TypedDatasetVsSparkDataset.html) for a
detailed comparison of `TypedDataset` with Spark's `Dataset` API.

## Documentation

@@ -93,6 +94,7 @@ detailed comparison of `TypedDataset` with Spark's `Dataset` API.
* [Proof of Concept: TypedDataFrame](http://typelevel.org/frameless/TypedDataFrame.html)

## Quick Start

Since the 0.9.x release, Frameless is compiled only against Scala 2.12.x.

To use Frameless in your project add the following in your `build.sbt` file as needed:
@@ -103,17 +105,18 @@ val framelessVersion = "0.9.0" // for Spark 3.0.0
libraryDependencies ++= List(
"org.typelevel" %% "frameless-dataset" % framelessVersion,
"org.typelevel" %% "frameless-ml" % framelessVersion,
"org.typelevel" %% "frameless-cats" % framelessVersion
"org.typelevel" %% "frameless-cats" % framelessVersion
)
```

An easy way to bootstrap a Frameless sbt project:

* if you have [Giter8][g8] installed then simply:

```bash
g8 imarios/frameless.g8
```

* with sbt >= 0.13.13:

```bash
@@ -125,12 +128,12 @@ and all its dependencies loaded (including Spark).

## Need help?

Feel free to message us on our [discord](https://discord.gg/ZDZsxWcBJt)
channel for any issues/questions.


## Development

We require at least _one_ sign-off (thumbs-up, +1, or similar) to merge pull requests. The current maintainers
(people who can merge pull requests) are:

* [adelbertc](https://github.com/adelbertc)
@@ -151,7 +154,8 @@ be set to adjust the size of generated collections in the `TypedDataSet` suite:
| FRAMELESS_GEN_SIZE_RANGE | 20 |

## License

Code is provided under the Apache 2.0 license available at <http://opensource.org/licenses/Apache-2.0>,
as well as in the LICENSE file. This is the same license used by Spark.

[g8]: http://www.foundweekends.org/giter8/
10 changes: 5 additions & 5 deletions build.sbt
@@ -1,9 +1,9 @@
val sparkVersion = "3.2.1"
val spark31Version = "3.1.2"
val spark30Version = "3.0.3"
-val catsCoreVersion = "2.6.1"
-val catsEffectVersion = "2.4.0"
-val catsMtlVersion = "0.7.1"
+val catsCoreVersion = "2.7.0"
+val catsEffectVersion = "3.3.5"
+val catsMtlVersion = "1.2.0"
val scalatest = "3.2.11"
val scalatestplus = "3.1.0.0-RC2"
val shapeless = "2.3.7"
@@ -13,7 +13,7 @@ val refinedVersion = "0.9.28"
val Scala212 = "2.12.15"
val Scala213 = "2.13.8"

-ThisBuild / tlBaseVersion := "0.11"
+ThisBuild / tlBaseVersion := "0.12"

ThisBuild / crossScalaVersions := Seq(Scala213, Scala212)
ThisBuild / scalaVersion := Scala212
@@ -160,7 +160,7 @@ lazy val catsSettings = framelessSettings ++ Seq(
libraryDependencies ++= Seq(
"org.typelevel" %% "cats-core" % catsCoreVersion,
"org.typelevel" %% "cats-effect" % catsEffectVersion,
-"org.typelevel" %% "cats-mtl-core" % catsMtlVersion,
+"org.typelevel" %% "cats-mtl" % catsMtlVersion,
"org.typelevel" %% "alleycats-core" % catsCoreVersion
)
)
2 changes: 1 addition & 1 deletion cats/src/main/scala/frameless/cats/FramelessSyntax.scala
@@ -2,7 +2,7 @@ package frameless
package cats

import _root_.cats.effect.Sync
-import _root_.cats.implicits._
+import _root_.cats.syntax.all._
import _root_.cats.mtl.ApplicativeAsk
import org.apache.spark.sql.SparkSession

2 changes: 1 addition & 1 deletion cats/src/main/scala/frameless/cats/implicits.scala
@@ -3,7 +3,7 @@ package cats

import _root_.cats._
import _root_.cats.kernel.{CommutativeMonoid, CommutativeSemigroup}
-import _root_.cats.implicits._
+import _root_.cats.syntax.all._
import alleycats.Empty

import scala.reflect.ClassTag
@@ -30,7 +30,7 @@ class FramelessSyntaxTests extends TypedDatasetSuite {

test("properties can be read back") {
import implicits._
-import _root_.cats.implicits._
+import _root_.cats.syntax.all._
import _root_.cats.mtl.implicits._

check {
12 changes: 6 additions & 6 deletions cats/src/test/scala/frameless/cats/test.scala
@@ -2,7 +2,7 @@ package frameless
package cats

import _root_.cats.Foldable
-import _root_.cats.implicits._
+import _root_.cats.syntax.all._

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
@@ -39,7 +39,7 @@ trait SparkTests {

object Tests {
def innerPairwise(mx: Map[String, Int], my: Map[String, Int], check: (Any, Any) => Assertion)(implicit sc: SC): Assertion = {
-import frameless.cats.implicits._
+import frameless.cats.syntax.all._
import frameless.cats.inner._
val xs = sc.parallelize(mx.toSeq)
val ys = sc.parallelize(my.toSeq)
@@ -79,7 +79,7 @@ class Test extends AnyPropSpec with Matchers with ScalaCheckPropertyChecks with
}

property("rdd simple numeric commutative semigroup") {
-import frameless.cats.implicits._
+import frameless.cats.syntax.all._

forAll { seq: List[Int] =>
val expectedSum = if (seq.isEmpty) None else Some(seq.sum)
@@ -100,15 +100,15 @@ }
}

property("rdd of SortedMap[Int,Int] commutative monoid") {
-import frameless.cats.implicits._
+import frameless.cats.syntax.all._
forAll { seq: List[SortedMap[Int, Int]] =>
val rdd = seq.toRdd
rdd.csum shouldBe Foldable[List].fold(seq)
}
}

property("rdd tuple commutative semigroup example") {
-import frameless.cats.implicits._
+import frameless.cats.syntax.all._
forAll { seq: List[(Int, Int)] =>
val expectedSum = if (seq.isEmpty) None else Some(Foldable[List].fold(seq))
val rdd = seq.toRdd
@@ -119,7 +119,7 @@ class Test extends AnyPropSpec with Matchers with ScalaCheckPropertyChecks with
}

property("pair rdd numeric commutative semigroup example") {
-import frameless.cats.implicits._
+import frameless.cats.syntax.all._
val seq = Seq( ("a",2), ("b",3), ("d",6), ("b",2), ("d",1) )
val rdd = seq.toRdd
rdd.cminByKey.collect().toSeq should contain theSameElementsAs Seq( ("a",2), ("b",2), ("d",1) )
24 changes: 12 additions & 12 deletions docs/Cats.md
@@ -17,7 +17,7 @@ System.setProperty("spark.cleaner.ttl", "300")

import spark.implicits._

-import cats.implicits._
+import cats.syntax.all._
import cats.effect.{IO, Sync}
import cats.data.ReaderT
```
@@ -28,21 +28,21 @@ There are two main parts to the `cats` integration offered by Frameless:

All the examples below assume you have previously imported `cats.implicits` and `frameless.cats.implicits`.

-*Note that you should not import `frameless.syntax._` together with `frameless.cats.implicits._`.*
+*Note that you should not import `frameless.syntax._` together with `frameless.cats.syntax.all._`.*

```scala mdoc
-import cats.implicits._
-import frameless.cats.implicits._
+import cats.syntax.all._
+import frameless.cats.syntax.all._
```

## Effect Suspension in typed datasets

As noted in the section about `Job`, all operations on `TypedDataset` are lazy. The results of
operations that would normally block on plain Spark APIs are wrapped in a type constructor `F[_]`,
for which there exists an instance of `SparkDelay[F]`. This typeclass represents the operation of
delaying a computation and capturing an implicit `SparkSession`.

In the `cats` module, we utilize the typeclasses from `cats-effect` for abstracting over these
effect types - namely, we provide an implicit `SparkDelay` instance for all `F[_]` for which there exists
an instance of `cats.effect.Sync[F]`.
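
For instance, a blocking action like `count` can be suspended in any such `F[_]` - `IO` here. This is a minimal sketch assuming an implicit `SparkSession` and a `TypedDataset` named `people` are already in scope:

```scala
import cats.effect.IO
import frameless.cats.syntax.all._

// SparkDelay[IO] is derived from Sync[IO] and the implicit SparkSession;
// nothing touches the cluster until the IO is actually run
val total: IO[Long] = people.count[IO]()
```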

@@ -75,7 +75,7 @@ result.run(spark).unsafeRunSync()

### Convenience methods for modifying Spark thread-local variables

-The `frameless.cats.implicits._` import also provides some syntax enrichments for any monad
+The `frameless.cats.syntax.all._` import also provides some syntax enrichments for any monad
stack that has the same capabilities as `Action` above. Namely, the ability to provide an
instance of `SparkSession` and the ability to suspend effects.

@@ -110,7 +110,7 @@ leveraging a large collection of Type Classes for ordering and aggregating data.
Cats offers ways to sort and aggregate tuples of arbitrary arity.

```scala mdoc
-import frameless.cats.implicits._
+import frameless.cats.syntax.all._

val data: RDD[(Int, Int, Int)] = sc.makeRDD((1, 2, 3) :: (1, 5, 3) :: (8, 2, 3) :: Nil)

Expand All @@ -132,7 +132,7 @@ println(data.cmax)
println(data.cmaxOption)
println(data.cmin)
println(data.cminOption)
```

The following example aggregates all the elements with a common key.

