This repository has been archived by the owner on Jan 28, 2023. It is now read-only.

devel.md

Developer Notes for Krangl

NaN vs None in pandas: https://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none

//class Foo {
//    fun bar(predicate: (Int) -> String) {}
//
//    fun bar(predicate: (Int) -> Boolean) {}
//}

Interactive shell

kscript -i - <<"EOF"
//DEPS de.mpicbg.scicomp:krangl:0.9-SNAPSHOT
EOF

Potentially useful libraries

Design

https://stackoverflow.com/questions/45090808/intarray-vs-arrayint-in-kotlin --> bottom line: IntArray stores primitive ints and cannot hold null, whereas the elements of a boxed Array<Int?> can be null
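
The IntArray vs. Array distinction from the linked question can be sketched with the Kotlin standard library alone. The variable names and the `meanIgnoringNulls` helper are made up for illustration and are not part of krangl:

```kotlin
// IntArray is backed by a primitive int[] on the JVM, so it has no slot for null.
val primitives: IntArray = intArrayOf(1, 2, 3)

// Array<Int?> stores boxed values, so missing data can be represented as null.
val boxed: Array<Int?> = arrayOf(1, null, 3)

// Hypothetical helper: average over the non-missing values only.
fun meanIgnoringNulls(values: Array<Int?>): Double = values.filterNotNull().average()

fun main() {
    println(primitives.sum())            // 6
    println(boxed.count { it == null })  // 1
    println(meanIgnoringNulls(boxed))    // 2.0
}
```

The trade-off is presumably memory and speed (primitive arrays) against the ability to encode missing values (boxed arrays).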

Receiver vs parameter functions vs properties

How to write vector utilities?

dataFrame.summarize("mean_salary") { mean(it["salary"]) }    // function parameter
dataFrame.summarize("mean_salary") { it["salary"].mean() }   // extension/member function
dataFrame.summarize("mean_salary") { it["salary"].mean }     // extension property

???
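
To compare the three candidate styles side by side, here is a minimal self-contained sketch. `NumCol` is a hypothetical stand-in for a numeric column and does not reflect krangl's actual column classes:

```kotlin
// Hypothetical numeric column; krangl's real column types are not used here.
class NumCol(val values: List<Double>) {
    // Style 2: member function, called as col.mean()
    fun mean(): Double = values.average()
}

// Style 1: free function that takes the column as a parameter, called as mean(col)
fun mean(col: NumCol): Double = col.values.average()

// Style 3: extension property, called as col.mean (recomputed on every access)
val NumCol.mean: Double
    get() = values.average()

fun main() {
    val salary = NumCol(listOf(1000.0, 2000.0, 3000.0))
    println(mean(salary))   // 2000.0
    println(salary.mean())  // 2000.0
    println(salary.mean)    // 2000.0
}
```

All three compile next to each other, so the choice is mostly about readability and discoverability: the member and property forms show up in IDE completion on the column, while the free function has to be imported or otherwise in scope.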

Don't overload operator Any?.plus --> Confusion

https://kotlinlang.org/docs/reference/operator-overloading.html
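
A small sketch of the kind of confusion a catch-all `plus` can cause. The `Salary` class is hypothetical, and the overload is declared here only to show the effect:

```kotlin
// Deliberately over-broad operator, declared only to illustrate the problem.
operator fun Any?.plus(other: Any?): String = "${this}${other}"

// Hypothetical value type that defines no plus operator of its own.
data class Salary(val amount: Int)

fun main() {
    // This compiles only because of the catch-all overload above, and it
    // concatenates the string forms instead of adding the amounts.
    val total = Salary(10) + Salary(5)
    println(total)  // Salary(amount=10)Salary(amount=5)
}
```

Without the overload, `Salary(10) + Salary(5)` would be rejected at compile time, which is usually the more helpful outcome.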

gradle

Create a fresh Gradle wrapper with:

gradle wrapper --gradle-version 4.2.1

From twosigma/beakerx#5135: Split repos?

It is a bad idea. Many different repos are hard to maintain. And you do not need this. Gradle allows to publish separate artifacts without splitting repository.
you can use gradle :kernel:base:<whatever> instead of cd.


http://stackoverflow.com/questions/29268526/how-to-overcome-same-jvm-signature-error-when-implementing-a-java-interface

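To improve JVM compatibility, use @JvmName so that overloads which differ only in their lambda type (and would otherwise share the same JVM signature) can coexist, allowing for a more strongly typed API: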

@JvmName("mutateString")
fun DataFrame.mutate(name: String, formula: (DataFrame) -> List<String>): DataFrame {
    if(this is SimpleDataFrame){
        return addColumn(StringCol(name, formula(this)))
    }else
        throw UnsupportedOperationException()
}
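
The following self-contained sketch shows two such overloads living side by side. `Table` and the returned strings are hypothetical and stand in for krangl's DataFrame and column types:

```kotlin
// Hypothetical stand-in for a data frame; krangl's own classes are not used here.
class Table(val rowCount: Int)

// Without @JvmName both overloads would erase to the same JVM signature
// (Table, String, Function1) and fail with a platform declaration clash.
@JvmName("mutateString")
fun Table.mutate(name: String, formula: (Table) -> List<String>): String =
    "$name: String column with ${formula(this).size} values"

@JvmName("mutateInt")
fun Table.mutate(name: String, formula: (Table) -> List<Int>): String =
    "$name: Int column with ${formula(this).size} values"

fun main() {
    val table = Table(rowCount = 3)

    // Explicitly typed function values let the compiler pick the overload.
    val labels: (Table) -> List<String> = { t -> List(t.rowCount) { "row$it" } }
    val indices: (Table) -> List<Int> = { t -> List(t.rowCount) { it } }

    println(table.mutate("label", labels))   // label: String column with 3 values
    println(table.mutate("index", indices))  // index: Int column with 3 values
}
```

The formulas are passed as explicitly typed values here because a bare lambda literal may still be reported as ambiguous by the Kotlin compiler, depending on the compiler version and opt-ins in use.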

Comparison to other APIs

And the same in pandas. {PR needed here}

Known differences to dplyr package in R

  • rename() will preserve column positions, whereas dplyr::rename adds renamed columns to the end of the table
  • The mapping order is inverted in rename(). Instead of
    dplyr::rename(data, new_name=old_name)
    
    the krangl syntax is inverted to be more readable
    data.rename("old_name" to "new_name")
    
  • sortedBy() will sort by grouping attributes first, and then per group with the provided sorting attributes.
  • select() does not silently ignore multiple selections of the same column, but throws an error instead
  • select() will throw an error if a grouping column is being removed (see dplyr ticket)
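
The rename and duplicate-selection rules listed above can be sketched on plain lists of column names. `renameColumns` and `selectColumns` are hypothetical helpers written for this note and are not krangl's implementation:

```kotlin
// Hypothetical helpers operating on plain column-name lists, not krangl's API.

// Each pair reads "old" to "new"; renamed columns keep their original position.
fun renameColumns(columns: List<String>, vararg renames: Pair<String, String>): List<String> {
    val mapping = renames.toMap()
    return columns.map { mapping[it] ?: it }
}

// Rejects duplicate selections instead of silently ignoring them.
fun selectColumns(columns: List<String>, vararg selected: String): List<String> {
    require(selected.toSet().size == selected.size) {
        "Columns must not be selected more than once: ${selected.toList()}"
    }
    require(columns.containsAll(selected.toList())) {
        "Unknown columns in selection: ${selected.toList() - columns.toSet()}"
    }
    return selected.toList()
}

fun main() {
    val columns = listOf("id", "old_name", "salary")
    println(renameColumns(columns, "old_name" to "new_name"))  // [id, new_name, salary]
    println(selectColumns(columns, "salary", "id"))            // [salary, id]
    // selectColumns(columns, "id", "id") would throw IllegalArgumentException
}
```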

Spark

From spark release notes:

Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class. Also, this new combined Dataset interface is the abstraction used for Structured Streaming. Since compile-time type-safety in Python and R is not a language feature, the concept of Dataset does not apply to these languages’ APIs. Instead, DataFrame remains the primary programing abstraction, which is analogous to the single-node data frame notion in these languages. Get a peek from a Dataset API notebook.

tablesaw

| Feature | Krangl | TableSaw |
|---|---|---|
| Kotlin API | Yes | Yes |
| Add column | df. | |
| Select columns by type | | |

Select columns by type

  • krangl
df.select( 
  • tablesaw
val df = Dataframe(df.structure().target.selectWhere(column("Column Type").isEqualTo("INTEGER")))

Jupyter Integration

dev scratchpad

export KRANGL_HOME=/d/projects/misc/krangl/
cd $KRANGL_HOME/..

# start kernel
cmd.exe "/K" C:\Users\brandl\Anaconda3\Scripts\activate.bat C:\Users\brandl\Anaconda3

# no longer needed because these are no longer part of the ipynb preamble
#rm -rf ~/.ivy2/cache/com.systema/
#rm -rf ~/.ivy2/cache/org.kalasim/
#rm -rf ~/.ivy2/cache/com.github.holgerbrandl/kravis/

#conda install -c jetbrains kotlin-jupyter-kernel
# interactive use
jupyter notebook --kernel=kotlin 
#jupyter notebook --kernel=kotlin examples/jupyter/letsplot_example.ipynb

References