
Any plan to support rms package linear models? #15

Open
wei-wu-nyc opened this issue Mar 3, 2017 · 21 comments

Comments

@wei-wu-nyc
Hi,
Do you have plans to support the linear models included in the rms package? More specifically, the feature I am interested in is support for the restricted cubic spline transformation, via the rcs() function.

Thanks.

@vruusmann
Member

At first glance, the rms::lrm() model type and the rms::rcs() function type both seem doable.

Could you provide a small R code example of exactly what functionality you need? The example could be based on the Auto-MPG dataset (mpg ~ .).

@wei-wu-nyc
Author

wei-wu-nyc commented Mar 3, 2017 via email

@wei-wu-nyc
Author

wei-wu-nyc commented Mar 3, 2017 via email

@vruusmann
Member

@wei-wu-nyc I can't find your R script.

Perhaps GitHub "ate" it because of a bad file extension? You could rename it to myscript.R.txt and re-attach it via the GitHub web UI.

@wei-wu-nyc
Author

wei-wu-nyc commented Mar 3, 2017

Here is the attachment. Just to be safe, I also copied the lines below.

library(rms)

mpgdata = read.csv('Auto.csv')

model1 = ols(mpg ~ weight, data = mpgdata)
model2 = ols(mpg ~ rcs(weight), data = mpgdata)
model3 = ols(mpg ~ rcs(weight, nk = 5), data = mpgdata)
model4 = ols(mpg ~ rcs(weight, knots = c(2000, 2500, 3000, 3500, 4000, 4500, 5000)), data = mpgdata)

plot(mpgdata$weight, mpgdata$mpg)
lines(mpgdata$weight, predict(model1, mpgdata), col = 'red')

mpgdata = mpgdata[order(mpgdata$weight), ]
lines(mpgdata$weight, predict(model2, mpgdata), col = 'green')
lines(mpgdata$weight, predict(model3, mpgdata), col = 'blue')
lines(mpgdata$weight, predict(model4, mpgdata), col = 'yellow')

I don't know how the strike-through got into the text, but I think you can still read it.

r2pmml_rcs_example.R.txt

@vruusmann
Member

@wei-wu-nyc Thanks for clarifying the "minimum viable product".

I hope to find time to work on this next week. GitHub should keep you notified about my progress as you've been auto-subscribed to this issue.

@wei-wu-nyc
Author

Thanks. Looking forward to testing it out. Currently, I have to save an rms model, which results in a 300+ MB RData file, and load that into R just to do predictions on new data.

One thing to note: when I save an rms model file, all the parameters (explicit or implicit, such as the automatically determined number of knots and the knot locations) are saved with it. I am not quite sure how PMML works exactly, but I suggest that these parameters be stored in the generated PMML file as well.

@vruusmann
Member

vruusmann commented Mar 3, 2017

Currently, I have to save an rms model, which results in a 300+ MB RData file.

R's lm model type and all its subtypes (such as lrm) include the full training dataset. It shouldn't affect the functionality of your model in any way if you simply nullify this attribute before saving the RDS file; a typical lm model object shouldn't exceed 1 MB after that.
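
A minimal sketch of that clean-up, continuing from the Auto example above and assuming the bulk of the object comes from stored copies of the training data (which components are actually present depends on the model type and on the x=/y= fitting arguments, so inspect the object with names() and object.size() first):

fit = ols(mpg ~ rcs(weight), data = mpgdata, x = TRUE, y = TRUE)

# Nullify the copies of the training data before saving (hypothetical clean-up;
# keep everything that predict() still needs, e.g. the Design information)
fit$x = NULL
fit$y = NULL

saveRDS(fit, "fit.rds")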

I am not quite sure how PMML works exactly.

During the conversion to PMML, I need to determine which fields are passed through the rcs() function, then collect the corresponding knot counts and coefficients, and generate DerivedField elements that reproduce R's knot evaluation algorithm. This should be fully compliant with the PMML specification.
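
For reference, a hedged sketch of where this information lives in a fitted rms object (continuing from the Auto example above; the exact component layout may differ between rms versions, so treat the names below as assumptions to verify):

fit = ols(mpg ~ rcs(weight, 5), data = mpgdata)

# Knot locations chosen (explicitly or automatically) for each rcs() term
fit$Design$parms

# Intercept plus one coefficient per spline basis column
coef(fit)

# The spline basis itself is produced by Hmisc::rcspline.eval(); this is the
# calculation that a DerivedField per nonlinear basis column would reproduce
head(Hmisc::rcspline.eval(mpgdata$weight, knots = fit$Design$parms$weight, inclx = TRUE))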

@wei-wu-nyc
Author

Thank you for the tip on trimming the object.

Another rms-specific detail to keep in mind: in rms, the linear regression model is ols(), while lrm() is the classification model ("Logistic Regression Model", hence the name).

vruusmann added a commit to jpmml/jpmml-r that referenced this issue Mar 10, 2017
vruusmann added a commit to jpmml/jpmml-r that referenced this issue Mar 10, 2017
@vruusmann
Member

Spline interpolation is essentially represented by mapping a value range to a function.

PMML 4.3 doesn't have a high-level transformation for "continuous" lookup tables. Something like that could be emulated using "if" functions, but it wouldn't be very human-friendly, as the "if" functions would end up deeply nested.
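
As a rough R analogue of what such an "if" emulation would look like (purely illustrative; the interval boundaries and per-interval expressions below are made up, not taken from any actual model):

piecewise_lookup = function(x) {
  # one nested branch per value range; a real spline would need one branch per knot interval
  if (x < 2500) 0.021 * x
  else if (x < 3500) 0.018 * x + 8
  else 0.012 * x + 29
}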

I've opened a feature request with the PMML working group to discuss good long-term solutions:
http://mantis.dmg.org/view.php?id=176

My suggestion is to "relax" Discretize and MapValues transformations, so that they could compute the return value dynamically. Unfortunately, the PMML working group doesn't support this suggestion, and wants to come up with a completely new transformation.

So, we're going to have to wait a couple of weeks to get a better understanding of which way to go.

@wei-wu-nyc
Author

Thanks.

@vruusmann
Member

@wei-wu-nyc What kind of PMML consumer software would you be using? If you intend to use something that is based on the JPMML-Evaluator library, then we could go on and "relax" Discretize and MapValues transformations for the time being.

In other words, would you be happy with a "proprietary extension" that will be available in early April?

@wei-wu-nyc
Author

wei-wu-nyc commented Mar 15, 2017

@vruusmann To tell you the truth, I haven't explored much of what PMML software we are going to use. For my purposes, I would like to output the PMML file and later load it either into R (presumably, if I use the r2pmml package, I will be using the JPMML-Evaluator library?) or into our production system, which runs on Spark and is therefore mostly written in Scala, so I assume it should be able to load the JPMML-Evaluator Java library. The main goal is to be able to call the prediction function without loading the original fitted R model object, and the prediction results from either platform should be identical, or very close, given the same inputs. If the JPMML-Evaluator library is available on both platforms, that will work for me.

@vruusmann
Member

@wei-wu-nyc I was thinking that if you were going to deploy your models on some "big vendor" platform, then this proprietary extension wouldn't work (at least not within a reasonable time frame).

If you're looking to consume PMML models on Apache Spark ML, then you will be much more productive using the high-level JPMML-Spark library; it's a thin wrapper around the low-level JPMML-Evaluator library, and it takes care of plumbing Apache Spark ML and PMML together in the best way possible. When your workflow runs on the JPMML family of libraries end-to-end, you don't need to worry about the reproducibility of predictions; I've already taken good care of that.

@wei-wu-nyc
Author

@vruusmann I am actually not using Spark ML, for various reasons. I am a quant/data scientist, not a programmer, as you can probably tell from the conversation above. The decision of which machine learning library/package to use is based on the model quality and feature set of the various packages in the model development cycle of this particular project. Not many linear-model packages support spline fitting, or variable interactions. Although it is possible to populate the spline terms manually and use a generic glm model that doesn't support spline fitting, there are probably many edge cases in the spline fitting that would need to be debugged carefully. My decision was to use the rms R package for part of my models.

The truth is that in the modeling stage I don't really use Spark (we are planning to migrate to Spark even for the modeling stage, for training and cross-validation etc.); currently we only use Spark on the production and application side of the process. So I haven't really investigated Spark ML enough. The last time I looked at it, Spark ML was relatively inefficient both in terms of memory usage and speed, compared to H2O.

This request covers only one part of my whole modeling system. My problem involves a pretty large dataset (in the tens of GB to 100 GB range), so I divided my models into different stages. One of the stages (for a smaller dataset) uses the rms package to construct spline-fitting models; if this were done on the big dataset, it would run out of memory. For the models in the other stages, which are trained on the whole dataset, I use H2O's h2o.glm() model. I have portability issues there too. I was going to ask about that on the H2O user forum, so I didn't mention it here, but now that it has come up, I will mention it here as well.

I also have issues with the portability of H2O models for prediction. H2O doesn't currently support PMML output; what it does support is exporting a POJO file for the prediction function. Since our production platform is Spark, I thought that might work for my situation. One problem is that it won't work on the R side without reloading the H2O model in R, as there is no simple way to load the POJO file into R.

This is the first time I have dealt with PMML, so I am pretty new to PMML usage and deployment. Given the overall picture of my current project, do you have any suggestions for me?

Have you heard of any PMML support for H2O models?

Thanks a lot for your help.

@vruusmann
Member

@wei-wu-nyc Thank you for explaining your data science process. This is very interesting and useful information, and it is very difficult for me to obtain otherwise.

It's also possible to "transpile" PMML documents into POJOs (currently, a private project). It brings considerable performance improvements, but is technically more difficult to maintain. With PMML documents, model deployment and undeployment are a matter of uploading and deleting a text file, whereas with POJOs you have all sorts of Java class loading/unloading complexities, long term storage issues (very likely, your H2O POJOs are tightly coupled to a particular H2O API version), etc. And PMML documents are much easier to parse/interpret if you want to understand (as a human) the computation that the model performs.

I think it shouldn't be too difficult to build my own H2O-to-PMML converter. It has been successfully achieved with R and Scikit-Learn, which are backed by non-Java languages/technologies. So, H2O, which is backed by Java (or at least designed to be heavily interoperable with it), should be a walk in the park.

What's the requirement behind (re-)loading external models into R? Ideologically, R and PMML use different concepts for representing transformations and models, so the "conversion event" should be regarded as a one-way street ("easy to go from pig to sausage, hard to go back"). If you simply want to use external models for prediction (e.g. executing a model against a data.frame object, something like a predict.pmml function), then it will be possible to invoke JPMML-Evaluator functionality straight from a running R session.

@wei-wu-nyc
Author

The main reason for me to be able to re-load a PMML model into R is for debugging and reporting purposes. What you described, "invoke JPMML-Evaluator functionality straight from a running R session", is exactly what I need. Basically, I want to be able to do the predictions in R and get the same results as the Java JPMML client/consumer function, so that in case of discrepancies I can easily replicate them and compare against the original models. Also, as a faster-loading, standalone prediction function, loading PMML into R may be an option when I need to do some analysis/statistics/graphs of the models' prediction output.

@wei-wu-nyc
Author

@vruusmann What is the R package to use for loading a PMML model and doing predictions with it in R, i.e. the "invoke JPMML-Evaluator" functionality you described above?

@vruusmann
Member

@wei-wu-nyc Like many other projects/tools, it's not public yet. But it's fairly easy to achieve similar functionality on your own if you deploy a local Openscoring REST web service.

In that case, you could write a small helper R function that does the following (a sketch follows the list):

  1. Save the input data.frame object to a temporary CSV/TSV file.
  2. Send this temporary input CSV file (using the HTTP POST method; you can use R's RCurl or httr packages for that) to Openscoring's http://localhost:8080/openscoring/model/${id}/csv endpoint. Capture its response to another temporary CSV/TSV file.
  3. Read this temporary output CSV file into a results data.frame object, and return it to the user.
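
A hedged sketch of such a helper, using the httr package (the model id, base URL and content type here are assumptions to adjust for your own Openscoring deployment):

library(httr)

predict_openscoring = function(df, model_id, base_url = "http://localhost:8080/openscoring") {
  in_file = tempfile(fileext = ".csv")
  out_file = tempfile(fileext = ".csv")
  # 1. Save the input data.frame to a temporary CSV file
  write.csv(df, in_file, row.names = FALSE)
  # 2. POST it to Openscoring's CSV endpoint and capture the response body
  resp = POST(paste0(base_url, "/model/", model_id, "/csv"),
              body = upload_file(in_file, type = "text/plain"))
  stop_for_status(resp)
  writeLines(content(resp, as = "text", encoding = "UTF-8"), out_file)
  # 3. Read the results back into a data.frame and return it
  read.csv(out_file)
}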

@vruusmann
Member

vruusmann commented Mar 17, 2017

@wei-wu-nyc Even simpler, you don't need to bring the Openscoring REST web service into play at all. Simply invoke the Java command-line application class org.jpmml.evaluator.EvaluationExample, which takes a PMML model file, an input CSV file and a results CSV file as arguments:

write.csv(in_df, "input.csv", row.names = FALSE)
system2("java", c("-cp", "example-1.3-SNAPSHOT.jar", "org.jpmml.evaluator.EvaluationExample", "--model", "/path/to/model.pmml", "--input", "input.csv", "--output", "output.csv"))
out_df = read.csv("output.csv")

You can obtain this example-1.3-SNAPSHOT.jar file by building the JPMML-Evaluator project from a source checkout. The build places it into the pmml-evaluator-example/target directory; further instructions are given in the README.md file.

@guleatoma

Hello,

I'm also interested in support for the rms package, which, to me, is the best package for logistic regression. I wanted to add a couple of things to the discussion.

  • glm supports rms::rcs() as a formula term, i.e. you don't need rms::lrm() if you are only interested in rms::rcs().

  • The easiest workaround for the example provided might be to explicitly supply the spline vectors to a supported function (lm, for example). The idea would be to perform the rcs transformation in a pre-processing step, i.e. in your example, create the 4 vectors normally produced by the spline function as columns in the data (see the sketch below). This is clearly not an ideal solution, but if you only have a couple of rcs() terms in your model it is not much work.
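
A hedged sketch of that workaround for the weight example above, pre-computing the basis columns with Hmisc::rcspline.eval() (the column names and the knot vector are illustrative):

library(Hmisc)

knots = c(2000, 2500, 3000, 3500, 4000, 4500, 5000)
basis = rcspline.eval(mpgdata$weight, knots = knots, inclx = TRUE)
colnames(basis) = paste0("weight_rcs", seq_len(ncol(basis)))

mpgdata2 = cbind(mpgdata, basis)

# Fit the pre-computed spline columns with a natively supported model type; the
# predictions should closely reproduce model4 from the example earlier in this thread
model_lm = lm(reformulate(colnames(basis), response = "mpg"), data = mpgdata2)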
