###Introduction
Product-collections is a standard Scala immutable collection specialized to hold homogeneous tuples.
Use product-collections as a fully type-safe, immutable dataframe or datatable that supports all the standard Scala collection methods.
Product-collections also includes a neat, fully type-safe CSV reader/parser.
###Contents
- Introduction
- Contents
- What's new
- Philosophy
- Sample Project
- Dependency Info
- Scaladoc
- REPL Session
- Using CollSeq
- I/O
- Statistics
- Examples
- Architecture
- Status
- Future
- Scalability
- Build Dependencies
- Runtime Dependencies
- Pull Requests
- Licence
- Alternatives
- Testimonials
###What's new
#####v1.3.0
- Supports Scala-js
- DateConverter deprecated (no java.text.SimpleDateFormat in scala-js).
- Built in CSV parser (scala-js only) JVM stays with opencsv.
- Testing framework switched to uTest.
- Converters overhauled.
- Option[Long] converter.
- Misc doc improvements.
- Improved error message on missing converter.
#####v1.2.0
- Custom csv rendering.
- Csv output promoted to a non-experimental feature.
#####v1.1.1
- Fixes a csv output memory leak.
#####v1.1
- Use CsvParser to parse any data.
- Experimental csv output feature.
#####v1.0
- Removes CollSeq23 / Tuple23 kludge.
- Uses sbt-boilerplate 0.5.9.
- Publish to maven central.
#####v0.0.4.4-Snapshot
- Add support for Option[T] converters.
###Philosophy
The Scala collection library has a logic revolving around `zip` and `unzip`. Product-collections extends that logic to arities greater than 2 in a logical and consistent way. `unzip3` and similar methods seem a bit kludgy compared to `flatZip` and `_1` ... `_N`, which perform a similar function across all arities.

The addition of `flatZip` and `_1` ... `_N` makes the standard collection library a functional, idiomatic and easy-to-use dataframe alternative. The learning curve is negligible because you simply need to think of the collection either as a tuple of sequences or a sequence of tuples, whichever you require at the time.

Product-collections' design makes type-safe CSV I/O a natural and free by-product. Product-collections extends a standard Scala collection, so all standard collection methods are available.
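As a plain-Scala sketch (using an ordinary `Seq` of tuples rather than a `CollSeq`, so the data here is illustrative only), the standard collection methods already give you the row-wise operations you'd expect of a datatable:

```scala
// A plain-Scala sketch: a Seq of tuples already supports the standard
// collection methods; CollSeq adds arity-aware extras on top of this.
val rows = Seq(("A", 2, 3.1), ("B", 3, 4.0), ("C", 4, 5.2))

val bigRows = rows.filter(_._2 > 2) // standard filter
val labels  = rows.map(_._1)        // standard map
val total   = rows.map(_._2).sum    // standard sum
```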
###Sample Project
See product-collections-example. The example is about 25 lines of code; it loads stock prices from CSV and plots them against the 250-period moving average.
###Dependency Info
product-collections is available from Maven Central.
Using SBT:

```scala
libraryDependencies += "com.github.marklister" %% "product-collections" % "1.3.0"
```

or for Scala.js:

```scala
libraryDependencies += "com.github.marklister" %%% "product-collections" % "1.3.0"
```
Using Maven:
```xml
<dependency>
  <groupId>com.github.marklister</groupId>
  <artifactId>product-collections_2.11</artifactId>
  <version>1.3.0</version>
</dependency>
```
###Scaladoc
View the Scaladoc.
The Scaladoc packages contain examples and REPL sessions. The Scaladoc on GitHub is preferred to a locally generated variant: I've used a hacked version of Scala to generate it. If you want a local copy you can clone the gh-pages branch.
###REPL Session
This document contains fragments of a REPL session which may not be entirely consistent. The full REPL session is available; you can reproduce it by pasting the REPL source in the doc directory.
###Using CollSeq
You already know how to use a product-collection: think of it as a sequence of homogeneous tuples and, at the same time, a tuple of sequences. There is only one novel feature to learn: `flatZip`.
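In plain Scala (no product-collections involved) the two views look like this: `unzip3` goes from a sequence of tuples to a tuple of sequences, and zipping the columns back together recovers the rows.

```scala
// The two views in plain Scala: unzip3 turns a sequence of tuples into a
// tuple of sequences; zipping the columns back recovers the rows.
val rows = List(("A", 2, 3.1), ("B", 3, 4.0))
val (col1, col2, col3) = rows.unzip3

val rebuilt = col1.zip(col2).zip(col3).map { case ((a, b), c) => (a, b, c) }
```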
####Imports
```scala
import com.github.marklister.collections._
import com.github.marklister.collections.io._
```
####Create a CollSeq
Let the compiler infer the appropriate implementation:
```scala
scala> CollSeq(("A",2,3.1),("B",3,4.0),("C",4,5.2))
res1: com.github.marklister.collections.immutable.CollSeq3[String,Int,Double] =
CollSeq((A,2,3.1),
        (B,3,4.0),
        (C,4,5.2))
```
Notice that the correct types are inferred for each column. Consistent tuple length is guaranteed by the compiler. You can't have a CollSeq comprising mixed Product2 and Product3 types for example.
####Extract a column
A CollSeqN is also a ProductN (essentially a tuple). To extract a column:
```scala
scala> CollSeq(("A",2,3.1),("B",3,4.0),("C",4,5.2))
res0: com.github.marklister.collections.immutable.CollSeq3[String,Int,Double] =
CollSeq((A,2,3.1),
        (B,3,4.0),
        (C,4,5.2))

scala> res0._1
res1: Seq[String] = List(A, B, C)
```
Repeatedly extracting the same column will return a cached copy of the same Seq.
####Extract a row
CollSeq is an IndexedSeq so you can extract a row in the normal manner:
```scala
scala> res0(1)
res4: Product3[String,Int,Double] = (B,3,4.0)
```
####Extract a cell
For best performance you should extract a cell by row then column:
```scala
scala> CollSeq(("A",2,3.1),("B",3,4.0),("C",4,5.2))
res1: com.github.marklister.collections.immutable.CollSeq3[String,Int,Double] =
CollSeq((A,2,3.1),
        (B,3,4.0),
        (C,4,5.2))

scala> res1(1)._2
res2: Int = 3
```

An interesting feature is that you can also access the column first:

```scala
scala> res1._2(1)
res3: Int = 3
```
####Add a column
You can use the flatZip method to add a column:
```scala
scala> res1.flatZip(res1._2.map(_ *2)) //double the second column and append the result as a new column.
res14: com.github.marklister.collections.immutable.CollSeq4[String,Int,Double,Int] =
CollSeq((A,2,3.1,4),
        (B,3,4.0,6),
        (C,4,5.2,8))
```
####Access the row 'above'
Using Scala's sliding method you can access the preceding n rows. Here we calculate the difference between the values in the 4th column:
```scala
scala> res14._4.sliding(2).toList.map(z=>z(1)-z(0))
res21: List[Int] = List(2, 2)
```

Append the result:

```scala
scala> res14.flatZip(0::res21)
res22: com.github.marklister.collections.immutable.CollSeq5[String,Int,Double,Int,Int] =
CollSeq((A,2,3.1,4,0),
        (B,3,4.0,6,2),
        (C,4,5.2,8,2))
```
####Splice columns together
This uses the implicit conversions in the collections package object.
```scala
scala> CollSeq((1,2,3),(2,3,4),(3,4,5))
res0: com.github.marklister.collections.immutable.CollSeq3[Int,Int,Int] =
CollSeq((1,2,3),
        (2,3,4),
        (3,4,5))

scala> res0._3 flatZip res0._1 flatZip res0._2
res2: com.github.marklister.collections.immutable.CollSeq3[Int,Int,Int] =
CollSeq((3,1,2),
        (4,2,3),
        (5,3,4))
```
####Map
Map and similar methods (where possible) produce another CollSeq:
```scala
scala> CollSeq((3,1,2),
     |         (4,2,3),
     |         (5,3,4))
res0: com.github.marklister.collections.immutable.CollSeq3[Int,Int,Int] =
CollSeq((3,1,2),
        (4,2,3),
        (5,3,4))

scala> res0.map(t=>(t._1+1,t._2-1,t._3.toDouble))
res1: com.github.marklister.collections.immutable.CollSeq3[Int,Int,Double] =
CollSeq((4,0,2.0),
        (5,1,3.0),
        (6,2,4.0))
```
####Lookup a row
You can lookup values by constructing a Map:
```scala
scala> val data= CollSeq(("Zesa",10,20),
     |                   ("Eskom",5,11),
     |                   ("Sars",16,13))
data: com.github.marklister.collections.immutable.CollSeq3[String,Int,Int] =
CollSeq((Zesa,10,20),
        (Eskom,5,11),
        (Sars,16,13))

scala> val lookupByRow= data._1.zip(data).toMap
lookupByRow: scala.collection.immutable.Map[String,Product3[String,Int,Int]] = Map(Zesa -> (Zesa,10,20), Eskom -> (Eskom,5,11), Sars -> (Sars,16,13))

scala> lookupByRow("Sars")
res4: Product3[String,Int,Int] = (Sars,16,13)
```

You can also look up a column by constructing a Map:

```scala
scala> val lookupColumn= (Seq("Company","Rating","FwdPE") zip data.productIterator.toSeq).toMap
lookupColumn: scala.collection.immutable.Map[String,Seq[Any]] = Map(Company -> List(Zesa, Eskom, Sars), Rating -> List(10, 5, 16), FwdPE -> List(20, 11, 13))

scala> lookupColumn("Company")
res6: Seq[Any] = List(Zesa, Eskom, Sars)

scala> lookupColumn("FwdPE")
res7: Seq[Any] = List(20, 11, 13)
```
Unfortunately the underlying type is a Seq[Any], which is the most specific type productIterator can return.
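One plain-Scala workaround (a sketch with illustrative data, not a library feature) is to recover the static type with `collect`, which is a runtime check rather than a compile-time guarantee:

```scala
// Plain-Scala sketch: a column lookup built this way is Seq[Any]; one
// workaround is to recover the static type with collect (a runtime check).
val lookupColumn: Map[String, Seq[Any]] = Map(
  "Company" -> Seq("Zesa", "Eskom", "Sars"),
  "Rating"  -> Seq(10, 5, 16))

val ratings: Seq[Int] = lookupColumn("Rating").collect { case i: Int => i }
```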
####Sorting
Sorting is a natural and free by-product of being a standard Scala collection.
```scala
scala> val unsorted= CollSeq((3,2,1),
     |                       (2,2,1),
     |                       (1,1,1))
unsorted: com.github.marklister.collections.immutable.CollSeq3[Int,Int,Int] =
CollSeq((3,2,1),
        (2,2,1),
        (1,1,1))

scala> unsorted.sortBy(_._2)
res19: com.github.marklister.collections.immutable.CollSeq3[Int,Int,Int] =
CollSeq((1,1,1),
        (3,2,1),
        (2,2,1))

scala> unsorted.sortBy(x=>(x._2,x._1))
res20: com.github.marklister.collections.immutable.CollSeq3[Int,Int,Int] =
CollSeq((1,1,1),
        (2,2,1),
        (3,2,1))
```
You can use unary minus to sort in descending order.
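A minimal plain-Scala sketch of descending sort (the same idiom works on a CollSeq since it is a standard collection): negate the key, or pass a reversed `Ordering` explicitly.

```scala
val unsorted = Seq((3, 2, 1), (2, 2, 1), (1, 1, 1))

// Descending by the second column: negate the sort key...
val descByCol2 = unsorted.sortBy(x => -x._2)

// ...or, equivalently, reverse the Ordering.
val descAlt = unsorted.sortBy(_._2)(Ordering[Int].reverse)
```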
###I/O
The CsvParser class (and its concrete sub-classes) allow you to easily read Tuples or CollSeqs from the filesystem.
####Construct a Parser
```scala
scala> val parser=CsvParser[String,Int,Int,Int]
parser: com.github.marklister.collections.io.CsvParser4[String,Int,Int,Int] = com.github.marklister.collections.io.CsvParser4@1203c6e
```
####Read and Parse a file
```scala
scala> parser.parseFile("abil.csv",hasHeader=true,delimiter="\t")
res2: com.github.marklister.collections.immutable.CollSeq4[String,Int,Int,Int] =
CollSeq((30-APR-12,3885,3922,3859),
        (02-MAY-12,3880,3915,3857),
        (03-MAY-12,3920,3948,3874),
        (04-MAY-12,3909,3952,3885),
        (07-MAY-12,3853,3900,3825),
        (08-MAY-12,3770,3851,3755),
        (09-MAY-12,3700,3782,3666),
        (10-MAY-12,3732,3745,3658),
        (11-MAY-12,3760,3765,3703),
        (14-MAY-12,3660,3750,3655),
        (15-MAY-12,3650,3685,3627),
        (16-MAY-12,3661,3663,3555),
        (17-MAY-12,3620,3690,3600),
        (18-MAY-12,3545,3595,3542),
        (21-MAY-12,3602,3608,3546),
        (22-MAY-12,3650,3675,3615),
        (23-MAY-12,3566,3655,3566),
        (24-MAY-12,3632,3645,3586),
        (25-MAY-12,3610,3665,3583),
        (28-MAY-12,3591,3647,3582),
        ...
```
####Read and parse a java.io.Reader
```scala
scala> val stringData="""10,20,"hello"
     | |20,30,"world"""".stripMargin
stringData: String =
10,20,"hello"
20,30,"world"

scala> CsvParser[Int,Int,String].parse(new java.io.StringReader(stringData))
res6: com.github.marklister.collections.immutable.CollSeq3[Int,Int,String] =
CollSeq((10,20,hello),
        (20,30,world))
```
####Parsing additional types
To parse additional types (like dates) simply provide a converter as an implicit parameter. See the examples.
####Field parse errors
To avoid an exception, specify your field type as Option[T] where T is Int, Double etc.
####Lazy parsing and non-CollSeqs
Product-collections exposes an Iterator[TupleN] should you prefer not to work with a CollSeq.
```scala
scala> val stringData="""10,20,"hello"
     | |20,30,"world"""".stripMargin
stringData: String =
10,20,"hello"
20,30,"world"

scala> CsvParser[Int,Int,String].iterator(new java.io.StringReader(stringData))
res0: Iterator[(Int, Int, String)] = non-empty iterator

scala> res0.toList
res2: List[(Int, Int, String)] = List((10,20,hello), (20,30,world))
```
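If you just want the flavour of lazy row parsing, here is a minimal, hypothetical plain-Scala sketch built on `scala.io.Source`; unlike the library's `CsvParser` it does no quote handling, type-converter lookup, or error recovery:

```scala
// A minimal sketch of lazy row parsing with a plain Iterator: split each
// line on commas and convert the fields by hand. (CsvParser handles
// quoting and implicit type conversion for you; this sketch does not.)
import scala.io.Source

val raw = "10,20,hello\n20,30,world"
val rows: Iterator[(Int, Int, String)] =
  Source.fromString(raw).getLines().map { line =>
    val f = line.split(',')
    (f(0).toInt, f(1).toInt, f(2))
  }
```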
####Convert tuple to case class
You can convert tuples to case classes like this:
```scala
scala> case class Foo(a:Int,b:Int,s:String)
defined class Foo

scala> res2 map Foo.tupled
res13: List[Foo] = List(Foo(10,20,hello), Foo(20,30,world))
```
####Csv output
An implicit class adds a `writeCsv` and a `csvIterator` method to any `Iterable[Product]`: `writeCsv` takes a `java.io.Writer`. Importing the `io` package brings the conversion into scope, or you can import `io.Utils.CsvOutput` manually.
#####Using writeCsv
```scala
scala> val w= new java.io.StringWriter
w: java.io.StringWriter =

scala> CollSeq((1,2,3.5,"hello"),
     |         (5,6,7.7,"\"dude\"")).writeCsv(w)

scala> w.toString
res7: String =
"1,2,3.5,"hello"
5,6,7.7,"""dude"""
"

scala> CsvParser[Int,Int,Double,String].parse(new java.io.StringReader(res7)) // Parse the csv we just generated.
res8: com.github.marklister.collections.immutable.CollSeq4[Int,Int,Double,String] =
CollSeq((1,2,3.5,hello),
        (5,6,7.7,"dude"))
```
#####Using the Iterator:

```scala
scala> CollSeq((1,2,3.5,"hello"),
     |         (5,6,7.7,"\"dude\"")).csvIterator.toList
res2: List[String] = List(1,2,3.5,"hello", 5,6,7.7,"""dude""")
```
#####Custom rendering
You can control how your data is rendered by supplying a CsvRenderer, which is a PartialFunction[Any,String]. Your renderer is applied first and if there is no match the built-in renderer is applied.
```scala
scala> CollSeq((1,2,3.5,Some("hello")),
     |         (5,6,7.7,None)).csvIterator.foreach(println _)
1,2,3.5,"hello"
5,6,7.7,

scala> val myRenderer:CsvRenderer = {case None=> "N/A"}
myRenderer: com.github.marklister.collections.io.CsvRenderer = <function1>

scala> CollSeq((1,2,3.5,Some("hello")),
     |         (5,6,7.7,None)).csvIterator(renderer=myRenderer).foreach(println _)
1,2,3.5,"hello"
5,6,7.7,N/A
```
There are some prebuilt renderers in the Utils object: naRenderer and singleQuoteRenderer.
###Statistics
The package `com.github.marklister.collections.util` contains some basic statistics routines accessed via implicit conversions on a `Seq[Numeric]` or a `Seq[(Numeric,Numeric)]`. The import is accomplished when importing the collections package object:
```scala
import com.github.marklister.collections.io._
import com.github.marklister.collections._

Welcome to Scala version 2.11.2 (OpenJDK Server VM, Java 1.7.0_65).
Type in expressions to have them evaluated.
Type :help for more information.

scala> CollSeq((1,2,3),
     |         (2,3,4),
     |         (3,4,5))
res0: com.github.marklister.collections.immutable.CollSeq3[Int,Int,Int] =
CollSeq((1,2,3),
        (2,3,4),
        (3,4,5))

scala> res0._2
res1: Seq[Int] = List(2, 3, 4)

scala> res1.mean
res2: Double = 3.0

scala> res1.stdDev
res3: Double = 0.816496580927726

scala> res1.variance
res4: Double = 0.6666666666666666
```
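For reference, the figures above can be reproduced in plain Scala; this sketch assumes the library computes the population variance (divide by n), which matches the REPL output:

```scala
// What the statistics methods compute, sketched in plain Scala: population
// variance (divide by n) and its square root as the standard deviation.
val xs = Seq(2.0, 3.0, 4.0)

val mean     = xs.sum / xs.size
val variance = xs.map(x => math.pow(x - mean, 2)).sum / xs.size
val stdDev   = math.sqrt(variance)
```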
Weighted Statistics work like this:
```scala
scala> val w=res0._2 flatZip res0._3
w: com.github.marklister.collections.immutable.CollSeq2[Int,Int] =
CollSeq((2,3),
        (3,4),
        (4,5))

scala> w.mean
res5: Double = 3.1666666666666665

scala> w.stdDev
res6: Double = 0.7993052538854533

scala> w.variance
res7: Double = 0.6388888888888888

scala> (2*3+3*4+4*5).toDouble/(3+4+5)
res9: Double = 3.1666666666666665
```
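The weighted figures can be checked the same way in plain Scala. Each (value, weight) pair contributes `w * x` to the mean and `w * (x - mean)²` to the variance, both divided by the total weight (a sketch of the arithmetic, not the library's implementation):

```scala
// Weighted statistics sketched in plain Scala: each value is paired with a
// weight, and the sums are weight-scaled before dividing by the total weight.
val pairs = Seq((2.0, 3.0), (3.0, 4.0), (4.0, 5.0)) // (value, weight)

val totalW = pairs.map(_._2).sum
val wMean  = pairs.map { case (x, w) => x * w }.sum / totalW
val wVar   = pairs.map { case (x, w) => w * math.pow(x - wMean, 2) }.sum / totalW
```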
###Examples
####Read stock prices and calculate a moving average
An example REPL session. Let's read some stock prices and calculate the 5-period moving average:
```scala
scala> import java.util.Date
import java.util.Date

scala> implicit val dmy = new DateConverter("dd-MMM-yy") // tell the parser how to read your dates
dmy: com.github.marklister.collections.io.DateConverter = com.github.marklister.collections.io.DateConverter@26d606

scala> val p=CsvParser[Date,Int,Int,Int,Int] //Date, close, High, Low, Volume
p: com.github.marklister.collections.io.CsvParser5[java.util.Date,Int,Int,Int,Int] = com.github.marklister.collections.io.CsvParser5@1584d9

scala> val prices=p.parseFile("abil.csv", hasHeader=true, delimiter="\t")
prices: com.github.marklister.collections.immutable.CollSeq5[java.util.Date,Int,Int,Int,Int] =
(Mon Apr 30 00:00:00 AST 2012,3885,3922,3859,4296459)
(Wed May 02 00:00:00 AST 2012,3880,3915,3857,3127464)
(Thu May 03 00:00:00 AST 2012,3920,3948,3874,3080823)
(Fri May 04 00:00:00 AST 2012,3909,3952,3885,2313354)
(Mon ....

scala> val ma= prices._2.sliding(5).toList.map(_.mean)
ma: List[Double] = List(3889.4, 3866.4, 3830.4, 3792.8, 3763.0, 3724.4, 3700.4, 3692.6, 3670.2, 3627.2, 3615.6, 3615.6, 3596.6, 3599.0, 3612.0, 3609.8, 3605.6, 3611.0, 3611.0, 3606.0, 3614.2, 3612.4, 3629.0, 3634.6, 3659.4, 3661.0, 3657.2, 3645.2, 3628.4, 3616.4, 3632.8, 3668.8, 3702.6, 3745.4, 3781.0, 3779.6, 3755.4, 3727.4, 3689.4, 3650.2, ...

scala> prices._1.drop(5).zip(ma) //moving average zipped with date
res0: Seq[(java.util.Date, Double)] = List((Tue May 08 00:00:00 AST 2012,3889.4), (Wed May 09 00:00:00 AST 2012,3866.4), (Thu May 10 00:00:00 AST 2012,3830.4), (Fri May 11 00:00:00 AST 2012,3792.8), (Mon May 14 00:00:00 AST 2012,3763.0), (Tue May 15 00:00:00 AST 2012,3724.4), (Wed May 16 00:00:00 AST 2012,3700.4), (Thu May 17 00:00:00 AST 2012,3692.6), (Fri May 18 00:00:00 AST 2012,3670.2), (Mon May 21 00:00:00 AST 2012,3627.2), ...
```
#####Parse a file containing badly formed fields
Note: this converter is now provided as standard in the distribution.
```scala
scala> import scala.util.Try
import scala.util.Try

scala> implicit object optionIntConverter extends GeneralConverter[Option[Int]]{
     | def convert(x:String)=Try(x.trim.toInt).toOption
     | }
defined module optionIntConverter

scala> CsvParser[String,Option[Int]].parseFile("badly-formed.csv")
res3: com.github.marklister.collections.immutable.CollSeq2[String,Option[Int]] =
CollSeq((Jan,Some(10)),
        (Feb,None),
        (Mar,Some(25)))
```
#####Calculate an aircraft's moment in in-lb
```scala
scala> val aircraftLoading=CollSeq(("Row1",86,214),("Row4",168,314),("FwdCargo",204,378)) //Flight Station, Mass kg, Arm in
aircraftLoading: com.github.marklister.collections.immutable.CollSeq3[java.lang.String,Int,Int] =
(Row1,86,214)
(Row4,168,314)
(FwdCargo,204,378)

scala> val pounds = aircraftLoading._2.map(_ * 2.2) //convert kg -> lb
pounds: Seq[Double] = List(189.20000000000002, 369.6, 448.8)

scala> val moment = pounds.zip(aircraftLoading._3).map(x=>x._1*x._2)
moment: Seq[Double] = List(40488.8, 116054.40000000001, 169646.4)

scala> moment.sum
res1: Double = 326189.6
```
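The same weight-and-balance arithmetic can be checked in plain Scala (floating-point rounding aside):

```scala
// Convert each mass to pounds, multiply by its arm, and sum the moments.
val loading = Seq(("Row1", 86, 214), ("Row4", 168, 314), ("FwdCargo", 204, 378))

val moments     = loading.map { case (_, kg, arm) => kg * 2.2 * arm }
val totalMoment = moments.sum
```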
###Architecture
#####CollSeq
`CollSeq` is a wrapper around `IndexedSeq[Product]`. `CollSeq` also implements `Product` itself.
#####CollSeqN
`CollSeqN` are concrete implementations of `CollSeq`. They extend `IndexedSeq[ProductN[T1,..,TN]]` and implement `ProductN`. `CollSeqN` has only one novel method: `flatZip(s:Seq[A]): CollSeqN+1[T1,..TN,A]`
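To illustrate the signature, here is a hypothetical hand-rolled `flatZip` for the arity-2 case in plain Scala; the generated `CollSeqN` implementations do the equivalent for every arity:

```scala
// A hand-rolled flatZip for arity 2: zipping a Seq of pairs with a Seq of
// As yields a Seq of triples, i.e. the collection grows one arity wider.
def flatZip2[T1, T2, A](s: Seq[(T1, T2)], that: Seq[A]): Seq[(T1, T2, A)] =
  s.zip(that).map { case ((t1, t2), a) => (t1, t2, a) }

val widened = flatZip2(Seq((1, "a"), (2, "b")), Seq(true, false))
```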
#####CsvParser
`CsvParser` is a simple CSV reader/parser that returns a `CollSeqN`. There are concrete parsers implemented for each arity. The actual gruntwork is done by opencsv.
#####Implicit Conversions
```scala
Seq[Product1[T]] => CollSeq1[T]
Seq[Product2[T1,T2]] => CollSeq2[T1,T2]
Seq[T] => CollSeq1[T]
```

The methods introduced are few: `flatZip` and `_1` ... `_N`.
###Status
Stable.
###Future
In no particular order:
- How to incorporate classes that implement ProductN (future case classes)? This bug was originally milestoned for scala 2.11 but seems to have been pushed back a bit.
- Column access by named method (using macros?)
Non-goals:
- A mutable version
- Exceeding scala arity limits
###Scalability
CollSeq is known to scale to thousands of rows without difficulty. CollSeq is a thin wrapper around a Scala IndexedSeq, so it should scale in exactly the same way. CsvParser's Iterator has been reported to process millions of rows without spiking the JVM's memory.
###Build Dependencies
product-collections relies heavily on sbt-boilerplate, a cleverly designed yet simple code-generating sbt plugin.
###Runtime Dependencies
- Scala 2.11 or 2.10. If you want it for 2.9 you'll need to clone the project and downgrade the Specs version.
- opencsv (Apache 2 licence).
###Pull Requests
Pull requests are welcome. Please keep in mind the KISS character of the project if you extend it. Feel free to discuss your ideas on the issue tracker.
###Licence
###Alternatives
Product-collections is around 400 lines of code (before template expansion). The alternatives are substantially larger and have far more features.

- HLists are similar in concept. Shapeless allows one to abstract over arity.
- Backed by arrays. Heavily specialized. Matrix operations.
- Simple abstractions for working with ordered series data (e.g. time series), as well as heterogeneous data tables (similar to R's data frame). Based on Spire and Shapeless.
- With Framian you specify the data type at retrieval time (weakly typed).
- Simple immutable data structure. Weakly typed. Quite a young project with an emphasis on sorting.
###Testimonials
"The brilliance of [product-collections] is the tight focus on being really good at one or two things, which, in my opinion, includes not just the powerful type-safe column- and row-oriented operations, but the extensible use of implicit string converters...In product-collections you've hit the ultimate sweet-spot from an idiomatic Scala point of view."
Simeon H.K. Fitch, Director of Software Engineering, Elder Research, Inc.