diff --git a/.gitignore b/.gitignore index de05a51e1..f4d3cd309 100644 --- a/.gitignore +++ b/.gitignore @@ -32,3 +32,5 @@ node_modules .history .metals .vscode +.bloop +metals.sbt diff --git a/README.md b/README.md index 3bc908dc5..c0556e56b 100644 --- a/README.md +++ b/README.md @@ -6,10 +6,10 @@ [![Maven Badge](https://img.shields.io/maven-central/v/org.typelevel/frameless-core_2.12?color=blue)](https://search.maven.org/search?q=g:org.typelevel%20and%20frameless) [![Snapshots Badge](https://img.shields.io/nexus/s/https/oss.sonatype.org/org.typelevel/frameless-core_2.12)](https://oss.sonatype.org/content/repositories/snapshots/org/typelevel/) -Frameless is a Scala library for working with [Spark](http://spark.apache.org/) using more expressive types. +Frameless is a Scala library for working with [Spark](http://spark.apache.org/) using more expressive types. It consists of the following modules: -* `frameless-dataset` for a more strongly typed `Dataset`/`DataFrame` API +* `frameless-dataset` for a more strongly typed `Dataset`/`DataFrame` API * `frameless-ml` for a more strongly typed Spark ML API based on `frameless-dataset` * `frameless-cats` for using Spark's `RDD` API with [cats](https://github.com/typelevel/cats) @@ -20,11 +20,10 @@ The Frameless project and contributors support the [Typelevel](http://typelevel.org/) [Code of Conduct](http://typelevel.org/conduct.html) and want all its associated channels (e.g. GitHub, Discord) to be a safe and friendly environment for contributing and learning. 
- ## Versions and dependencies -The compatible versions of [Spark](http://spark.apache.org/) and -[cats](https://github.com/typelevel/cats) are as follows: +The compatible versions of [Spark](http://spark.apache.org/) and +[cats](https://github.com/typelevel/cats) are as follows: | Frameless | Spark | Cats | Cats-Effect | Scala | --------- | ----- | -------- | ----------- | --- @@ -38,10 +37,12 @@ The compatible versions of [Spark](http://spark.apache.org/) and | 0.10.1 | 3.1.0 | 2.x | 2.x | 2.12 | 0.11.0* | 3.2.0 / 3.1.2 / 3.0.1| 2.x | 2.x | 2.12 / 2.13 | 0.11.1 | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13 +| 0.12.0 | 3.2.1 / 3.1.2 / 3.0.3 | 2.x | 3.x | 2.12 / 2.13 _\* 0.11.0 has broken Spark 3.1.2 and 3.0.1 artifacts published._ -Starting 0.11 we introduced Spark cross published artifacts: +Starting with 0.11, we introduced Spark cross-published artifacts: + * By default, frameless artifacts depend on the most recent Spark version * Suffix `-spark{major}{minor}` is added to artifacts that are released for the previous Spark version(s) @@ -51,35 +52,35 @@ Artifact names examples: * `frameless-dataset-spark31` (Spark 3.1.x dependency) * `frameless-dataset-spark30` (Spark 3.0.x dependency) -Versions 0.5.x and 0.6.x have identical features. The first is compatible with Spark 2.2.1 and the second with 2.3.0. +Versions 0.5.x and 0.6.x have identical features. The first is compatible with Spark 2.2.1 and the second with 2.3.0. -The **only** dependency of the `frameless-dataset` module is on [shapeless](https://github.com/milessabin/shapeless) 2.3.2. -Therefore, depending on `frameless-dataset`, has a minimal overhead on your Spark's application jar. -Only the `frameless-cats` module depends on cats and cats-effect, so if you prefer to work just with `Datasets` and not with `RDD`s, -you may choose not to depend on `frameless-cats`. +The **only** dependency of the `frameless-dataset` module is on [shapeless](https://github.com/milessabin/shapeless) 2.3.2. 
+Therefore, depending on `frameless-dataset` adds minimal overhead to your Spark application's jar. +Only the `frameless-cats` module depends on cats and cats-effect, so if you prefer to work just with `Datasets` and not with `RDD`s, +you may choose not to depend on `frameless-cats`. -Frameless intentionally **does not** have a compile dependency on Spark. -This essentially allows you to use any version of Frameless with any version of Spark. -The aforementioned table simply provides the versions of Spark we officially compile -and test Frameless with, but other versions may probably work as well. +Frameless intentionally **does not** have a compile dependency on Spark. +This essentially allows you to use any version of Frameless with any version of Spark. +The aforementioned table simply provides the versions of Spark we officially compile +and test Frameless with, but other versions will likely work as well. -### Breaking changes in 0.9 +### Breaking changes in 0.9 -* Spark 3 introduces a new ExpressionEncoder approach, the schema for single value DataFrame's is now ["value"](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L270) not "_1". +* Spark 3 introduces a new ExpressionEncoder approach; the schema for single-value DataFrames is now ["value"](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L270) rather than "_1". ## Why? -Frameless introduces a new Spark API, called `TypedDataset`. +Frameless introduces a new Spark API, called `TypedDataset`. 
The benefits of using `TypedDataset` compared to the standard Spark `Dataset` API are as follows: * Typesafe columns referencing (e.g., no more runtime errors when accessing non-existing columns) -* Customizable, typesafe encoders (e.g., if a type does not have an encoder, it should not compile) -* Enhanced type signature for built-in functions (e.g., if you apply an arithmetic operation on a non-numeric column, you +* Customizable, typesafe encoders (e.g., if a type does not have an encoder, it should not compile) +* Enhanced type signature for built-in functions (e.g., if you apply an arithmetic operation on a non-numeric column, you get a compilation error) * Typesafe casting and projections -Click [here](http://typelevel.org/frameless/TypedDatasetVsSparkDataset.html) for a -detailed comparison of `TypedDataset` with Spark's `Dataset` API. +Click [here](http://typelevel.org/frameless/TypedDatasetVsSparkDataset.html) for a +detailed comparison of `TypedDataset` with Spark's `Dataset` API. ## Documentation @@ -93,6 +94,7 @@ detailed comparison of `TypedDataset` with Spark's `Dataset` API. * [Proof of Concept: TypedDataFrame](http://typelevel.org/frameless/TypedDataFrame.html) ## Quick Start + Since the 0.9.x release, Frameless is compiled only against Scala 2.12.x. 
To use Frameless in your project add the following in your `build.sbt` file as needed: @@ -103,17 +105,18 @@ val framelessVersion = "0.9.0" // for Spark 3.0.0 libraryDependencies ++= List( "org.typelevel" %% "frameless-dataset" % framelessVersion, "org.typelevel" %% "frameless-ml" % framelessVersion, - "org.typelevel" %% "frameless-cats" % framelessVersion + "org.typelevel" %% "frameless-cats" % framelessVersion ) ``` An easy way to bootstrap a Frameless sbt project: -- if you have [Giter8][g8] installed then simply: +* if you have [Giter8][g8] installed then simply: ```bash g8 imarios/frameless.g8 ``` + -- with sbt >= 0.13.13: +* with sbt >= 0.13.13: ```bash @@ -125,12 +128,12 @@ and all its dependencies loaded (including Spark). ## Need help? -Feel free to messages us on our [discord](https://discord.gg/ZDZsxWcBJt) +Feel free to message us on our [discord](https://discord.gg/ZDZsxWcBJt) channel for any issues/questions. - ## Development -We require at least *one* sign-off (thumbs-up, +1, or similar) to merge pull requests. The current maintainers + +We require at least _one_ sign-off (thumbs-up, +1, or similar) to merge pull requests. The current maintainers (people who can merge pull requests) are: * [adelbertc](https://github.com/adelbertc) @@ -151,7 +154,8 @@ be set to adjust the size of generated collections in the `TypedDataSet` suite: | FRAMELESS_GEN_SIZE_RANGE | 20 | ## License - +Code is provided under the Apache 2.0 license available at <http://opensource.org/licenses/Apache-2.0>, as well as in the LICENSE file. This is the same license used as Spark. 
[g8]: http://www.foundweekends.org/giter8/ diff --git a/build.sbt b/build.sbt index ccd52fe33..7fc93e19c 100644 --- a/build.sbt +++ b/build.sbt @@ -1,9 +1,9 @@ val sparkVersion = "3.2.1" val spark31Version = "3.1.2" val spark30Version = "3.0.3" -val catsCoreVersion = "2.6.1" -val catsEffectVersion = "2.4.0" -val catsMtlVersion = "0.7.1" +val catsCoreVersion = "2.7.0" +val catsEffectVersion = "3.3.5" +val catsMtlVersion = "1.2.0" val scalatest = "3.2.11" val scalatestplus = "3.1.0.0-RC2" val shapeless = "2.3.7" @@ -13,7 +13,7 @@ val refinedVersion = "0.9.28" val Scala212 = "2.12.15" val Scala213 = "2.13.8" -ThisBuild / tlBaseVersion := "0.11" +ThisBuild / tlBaseVersion := "0.12" ThisBuild / crossScalaVersions := Seq(Scala213, Scala212) ThisBuild / scalaVersion := Scala212 @@ -160,7 +160,7 @@ lazy val catsSettings = framelessSettings ++ Seq( libraryDependencies ++= Seq( "org.typelevel" %% "cats-core" % catsCoreVersion, "org.typelevel" %% "cats-effect" % catsEffectVersion, - "org.typelevel" %% "cats-mtl-core" % catsMtlVersion, + "org.typelevel" %% "cats-mtl" % catsMtlVersion, "org.typelevel" %% "alleycats-core" % catsCoreVersion ) ) diff --git a/cats/src/main/scala/frameless/cats/FramelessSyntax.scala b/cats/src/main/scala/frameless/cats/FramelessSyntax.scala index 5e616bba3..ea7fcd0ed 100644 --- a/cats/src/main/scala/frameless/cats/FramelessSyntax.scala +++ b/cats/src/main/scala/frameless/cats/FramelessSyntax.scala @@ -2,7 +2,7 @@ package frameless package cats import _root_.cats.effect.Sync -import _root_.cats.implicits._ -import _root_.cats.mtl.ApplicativeAsk +import _root_.cats.syntax.all._ +import _root_.cats.mtl.Ask import org.apache.spark.sql.SparkSession diff --git a/cats/src/main/scala/frameless/cats/implicits.scala b/cats/src/main/scala/frameless/cats/implicits.scala index 90f7ceeca..1fa869a7f 100644 --- a/cats/src/main/scala/frameless/cats/implicits.scala +++ b/cats/src/main/scala/frameless/cats/implicits.scala @@ -3,7 +3,7 @@ package cats import _root_.cats._ import
_root_.cats.kernel.{CommutativeMonoid, CommutativeSemigroup} -import _root_.cats.implicits._ +import _root_.cats.syntax.all._ import alleycats.Empty import scala.reflect.ClassTag diff --git a/cats/src/test/scala/frameless/cats/FramelessSyntaxTests.scala b/cats/src/test/scala/frameless/cats/FramelessSyntaxTests.scala index c549bb31a..74fadce06 100644 --- a/cats/src/test/scala/frameless/cats/FramelessSyntaxTests.scala +++ b/cats/src/test/scala/frameless/cats/FramelessSyntaxTests.scala @@ -30,7 +30,7 @@ class FramelessSyntaxTests extends TypedDatasetSuite { test("properties can be read back") { import implicits._ - import _root_.cats.implicits._ + import _root_.cats.syntax.all._ import _root_.cats.mtl.implicits._ check { diff --git a/cats/src/test/scala/frameless/cats/test.scala b/cats/src/test/scala/frameless/cats/test.scala index 614f1c7e2..205ded68e 100644 --- a/cats/src/test/scala/frameless/cats/test.scala +++ b/cats/src/test/scala/frameless/cats/test.scala @@ -2,7 +2,7 @@ package frameless package cats import _root_.cats.Foldable -import _root_.cats.implicits._ +import _root_.cats.syntax.all._ import org.apache.spark.SparkContext import org.apache.spark.sql.SparkSession @@ -39,7 +39,7 @@ trait SparkTests { object Tests { def innerPairwise(mx: Map[String, Int], my: Map[String, Int], check: (Any, Any) => Assertion)(implicit sc: SC): Assertion = { - import frameless.cats.implicits._ + import frameless.cats.implicits._ import frameless.cats.inner._ val xs = sc.parallelize(mx.toSeq) val ys = sc.parallelize(my.toSeq) @@ -79,7 +79,7 @@ class Test extends AnyPropSpec with Matchers with ScalaCheckPropertyChecks with } property("rdd simple numeric commutative semigroup") { - import frameless.cats.implicits._ + import frameless.cats.implicits._ forAll { seq: List[Int] => val expectedSum = if (seq.isEmpty) None else Some(seq.sum) @@ -100,7 +100,7 @@ class Test extends AnyPropSpec with Matchers with ScalaCheckPropertyChecks with } property("rdd of SortedMap[Int,Int]
commutative monoid") { - import frameless.cats.implicits._ + import frameless.cats.syntax.all._ forAll { seq: List[SortedMap[Int, Int]] => val rdd = seq.toRdd rdd.csum shouldBe Foldable[List].fold(seq) @@ -108,7 +108,7 @@ class Test extends AnyPropSpec with Matchers with ScalaCheckPropertyChecks with } property("rdd tuple commutative semigroup example") { - import frameless.cats.implicits._ + import frameless.cats.syntax.all._ forAll { seq: List[(Int, Int)] => val expectedSum = if (seq.isEmpty) None else Some(Foldable[List].fold(seq)) val rdd = seq.toRdd @@ -119,7 +119,7 @@ class Test extends AnyPropSpec with Matchers with ScalaCheckPropertyChecks with } property("pair rdd numeric commutative semigroup example") { - import frameless.cats.implicits._ + import frameless.cats.syntax.all._ val seq = Seq( ("a",2), ("b",3), ("d",6), ("b",2), ("d",1) ) val rdd = seq.toRdd rdd.cminByKey.collect().toSeq should contain theSameElementsAs Seq( ("a",2), ("b",2), ("d",1) ) diff --git a/docs/Cats.md b/docs/Cats.md index addb21d17..b500cab23 100644 --- a/docs/Cats.md +++ b/docs/Cats.md @@ -17,7 +17,7 @@ System.setProperty("spark.cleaner.ttl", "300") import spark.implicits._ -import cats.implicits._ +import cats.syntax.all._ import cats.effect.{IO, Sync} import cats.data.ReaderT ``` @@ -28,21 +28,21 @@ There are two main parts to the `cats` integration offered by Frameless: All the examples below assume you have previously imported `cats.implicits` and `frameless.cats.implicits`. -*Note that you should not import `frameless.syntax._` together with `frameless.cats.implicits._`.* +*Note that you should not import `frameless.syntax._` together with `frameless.cats.syntax.all._`.* ```scala mdoc -import cats.implicits._ -import frameless.cats.implicits._ +import cats.syntax.all._ +import frameless.cats.syntax.all._ ``` ## Effect Suspension in typed datasets -As noted in the section about `Job`, all operations on `TypedDataset` are lazy. 
The results of -operations that would normally block on plain Spark APIs are wrapped in a type constructor `F[_]`, -for which there exists an instance of `SparkDelay[F]`. This typeclass represents the operation of -delaying a computation and capturing an implicit `SparkSession`. +As noted in the section about `Job`, all operations on `TypedDataset` are lazy. The results of +operations that would normally block on plain Spark APIs are wrapped in a type constructor `F[_]`, +for which there exists an instance of `SparkDelay[F]`. This typeclass represents the operation of +delaying a computation and capturing an implicit `SparkSession`. -In the `cats` module, we utilize the typeclasses from `cats-effect` for abstracting over these +In the `cats` module, we utilize the typeclasses from `cats-effect` for abstracting over these effect types - namely, we provide an implicit `SparkDelay` instance for all `F[_]` for which exists an instance of `cats.effect.Sync[F]`. @@ -75,7 +75,7 @@ result.run(spark).unsafeRunSync() ### Convenience methods for modifying Spark thread-local variables -The `frameless.cats.implicits._` import also provides some syntax enrichments for any monad +The `frameless.cats.implicits._` import also provides some syntax enrichments for any monad stack that has the same capabilities as `Action` above. Namely, the ability to provide an instance of `SparkSession` and the ability to suspend effects. @@ -110,7 +110,7 @@ leveraging a large collection of Type Classes for ordering and aggregating data. Cats offers ways to sort and aggregate tuples of arbitrary arity. ```scala mdoc -import frameless.cats.implicits._ +import frameless.cats.implicits._ val data: RDD[(Int, Int, Int)] = sc.makeRDD((1, 2, 3) :: (1, 5, 3) :: (8, 2, 3) :: Nil) @@ -132,7 +132,7 @@ println(data.cmax) println(data.cmaxOption) println(data.cmin) println(data.cminOption) -``` +``` The following example aggregates all the elements with a common key.
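A note on the cats-effect bump above (2.4.0 to 3.3.5): under CE3, `unsafeRunSync()` (used by the `result.run(spark).unsafeRunSync()` example in docs/Cats.md) requires an implicit `IORuntime` in scope. A minimal sketch of the call-site difference, using a hypothetical `Session` case class instead of a real `SparkSession` so the snippet stands alone:

```scala
import cats.data.ReaderT
import cats.effect.IO
// New in cats-effect 3: unsafeRunSync() needs an implicit IORuntime.
// This import supplies the default global runtime.
import cats.effect.unsafe.implicits.global

object Ce3RunSketch {
  // Hypothetical stand-in for SparkSession, to keep the sketch self-contained.
  final case class Session(appName: String)

  // Same shape as the ReaderT-based `Action` in docs/Cats.md.
  type Action[A] = ReaderT[IO, Session, A]

  val result: Action[Long] =
    ReaderT(_ => IO.pure(42L)) // placeholder computation

  def main(args: Array[String]): Unit =
    // Compiled with no extra imports on CE2; fails to compile on CE3
    // without the cats.effect.unsafe import above.
    println(result.run(Session("demo")).unsafeRunSync())
}
```

Forgetting `cats.effect.unsafe.implicits.global` (or an explicitly constructed `IORuntime`) is the most common compile error when moving existing Frameless-with-cats code from CE2 to CE3.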
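The cats-mtl bump (0.7.1 to 1.2.0) is also source-breaking: cats-mtl 1.x renamed `ApplicativeAsk` to `Ask` (and `ApplicativeLocal` to `Local`), with instances now resolved from implicit scope. A hedged sketch of the 1.x-style constraint, again using a hypothetical `Session` type in place of `SparkSession`:

```scala
import cats.data.ReaderT
import cats.effect.IO
// cats-mtl 1.x: Ask[F, E] replaces the 0.x ApplicativeAsk[F, E].
import cats.mtl.Ask

object MtlAskSketch {
  // Hypothetical SparkSession stand-in.
  final case class Session(appName: String)

  type Action[A] = ReaderT[IO, Session, A]

  // Asks the environment for the session; under cats-mtl 0.x this
  // constraint was written ApplicativeAsk[F, Session].
  def currentSession[F[_]](implicit A: Ask[F, Session]): F[Session] =
    A.ask

  // cats-mtl 1.x provides its Kleisli/ReaderT instances in implicit scope,
  // so no `import cats.mtl.implicits._` is required for this to resolve.
  val prog: Action[Session] = currentSession[Action]
}
```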