R interface for MLeap ================ [![Travis build status](https://travis-ci.org/rstudio/mleap.svg?branch=master)](https://travis-ci.org/rstudio/mleap) [![Coverage status](https://codecov.io/gh/rstudio/mleap/branch/master/graph/badge.svg)](https://codecov.io/github/rstudio/mleap?branch=master) [![CRAN status](https://www.r-pkg.org/badges/version/mleap)](https://cran.r-project.org/package=mleap) **mleap** is a [sparklyr](http://spark.rstudio.com/) extension that provides an interface to [MLeap](https://github.com/combust/mleap), which allows us to take Spark pipelines to production. ## Getting started **mleap** can be installed from CRAN via ``` r install.packages("mleap") ``` or, for the latest development version from GitHub, using ``` r devtools::install_github("rstudio/mleap") ``` Once mleap has been installed, we can install the external dependencies using ``` r library(mleap) install_maven() # Alternatively, if you already have Maven installed, you can # set options(maven.home = "path/to/maven") install_mleap() ``` We can now export Spark ML pipelines from sparklyr. ``` r library(sparklyr) sc <- spark_connect(master = "local", version = "2.3.0") mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE) # Create a pipeline and fit it pipeline <- ml_pipeline(sc) %>% ft_binarizer("hp", "big_hp", threshold = 100) %>% ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>% ml_gbt_regressor(label_col = "mpg") pipeline_model <- ml_fit(pipeline, mtcars_tbl) # A transformed data frame with the appropriate schema is required # for exporting the pipeline model transformed_tbl <- ml_transform(pipeline_model, mtcars_tbl) # Export model model_path <- file.path(tempdir(), "mtcars_model.zip") ml_write_bundle(pipeline_model, transformed_tbl, model_path) # Disconnect from Spark spark_disconnect(sc) ``` At this point, we can share `mtcars_model.zip` with our deployment/implementation engineers, and they would be able to embed the model in another application. See the [MLeap docs](http://mleap-docs.combust.ml/) for details. We also provide R functions for testing that the saved models behave as expected. Here we load the previously saved model: ``` r model <- mleap_load_bundle(model_path) model ``` ## MLeap Transformer ## <97ff1e90-5c3e-40fc-99dd-1919276e76be> ## Name: pipeline_1b49362281ef ## Format: json ## MLeap Version: 0.12.0 We can retrieve the schema associated with the model: ``` r mleap_model_schema(model) ``` ## # A tibble: 6 x 4 ## name type nullable dimension ## <chr> <chr> <lgl> <chr> ## 1 qsec double TRUE <NA> ## 2 hp double FALSE <NA> ## 3 wt double TRUE <NA> ## 4 big_hp double FALSE <NA> ## 5 features double TRUE (3) ## 6 prediction double FALSE <NA> Then, we create a new data frame to be scored, and make predictions using our model: ``` r newdata <- tibble::tribble( ~qsec, ~hp, ~wt, 16.2, 101, 2.68, 18.1, 99, 3.08 ) # Transform the data frame transformed_df <- mleap_transform(model, newdata) dplyr::glimpse(transformed_df) ``` ## Observations: 2 ## Variables: 6 ## $ qsec <dbl> 16.2, 18.1 ## $ hp <dbl> 101, 99 ## $ wt <dbl> 2.68, 3.08 ## $ big_hp <dbl> 1, 0 ## $ features <list> [[[1, 2.68, 16.2], [3]], [[0, 3.08, 18.1], [3]]] ## $ prediction <dbl> 21.00084, 20.56445