[MLLIB][SPARK-3278] Monotone (Isotonic) regression using parallel pool adjacent violators algorithm #3519
…luding proposed API
Can one of the admins verify this patch?
add to whitelist
ok to test
test this please
Test build #23979 has started for PR 3519 at commit
Test build #23979 has finished for PR 3519 at commit
Test FAILed.
Test build #23990 has started for PR 3519 at commit
Test build #23990 has finished for PR 3519 at commit
Test PASSed.
@zapletal-martin Some high-level comments:
…eplced by simple boolean
@mengxr, thank you very much for your feedback.
b) IsotonicRegressionModel extends RegressionModel and implements the methods predict(testData: RDD[Vector]) and predict(testData: Vector). Are these still relevant if we implement the changes in 1)? There would never be a Vector, only a Double. We would also need the feature from 1) to be able to predict a label.
c) What do you expect the Java API to look like? Unfortunately, the Java/Scala interop is not very helpful here. When the train method expects a tuple of scala.Double, calling it from Java gives:
[error] IsotonicRegressionModel model = IsotonicRegression.train(testRDD.rdd(), true);
There are solutions to this problem, but most of them are quite ugly. See, for example, http://stackoverflow.com/questions/17071061/scala-java-interoperability-how-to-deal-with-options-containing-int-long-primi or http://www.scala-notes.org/2011/04/specializing-for-primitive-types/. Is there another public Java API that uses a primitive type in a generic that I could use as a reference?
2a) 2b) Isotonic regression is a univariate regression algorithm. It is not necessary to have its model extend RegressionModel. It should have 2c) Try
…) and updated api
…) and updated api
Test build #26416 has finished for PR 3519 at commit
Test PASSed.
Test build #26417 has finished for PR 3519 at commit
Test PASSed.
Test build #26433 has started for PR 3519 at commit
Test build #26434 has started for PR 3519 at commit
Test build #26433 has finished for PR 3519 at commit
Test FAILed.
Test build #26434 has finished for PR 3519 at commit
Test FAILed.
fix java tests
Test build #26458 has started for PR 3519 at commit
Test build #26458 has finished for PR 3519 at commit
Test PASSed.
LGTM. Merged into master. Thanks!!
This PR introduces an API for Isotonic regression and one algorithm implementing it, Pool adjacent violators.
The isotonic regression problem is described in sufficient detail in Floudas, Pardalos, Encyclopedia of Optimization, on Wikipedia, and on Stat Wiki.
Pool adjacent violators was introduced by M. Ayer et al. in 1955. The history and development of isotonic regression algorithms are covered in Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods, and a list of available algorithms, including their complexity, is given in Stout, Fastest Isotonic Regression Algorithms.
An approach to parallelize the computation of PAV was presented in Kearsley, Tapia, Trosset, An Approach to Parallelizing Isotonic Regression.
The implemented Pool adjacent violators algorithm is based on Floudas, Pardalos, Encyclopedia of Optimization (Chapter Isotonic regression problems, p. 86) and Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods. It is also nicely formulated in Tibshirani, Hoefling, Tibshirani, Nearly-Isotonic Regression. The implementation itself is inspired by the R implementations in Klaus, Strimmer, 2008, fdrtool: Estimation of (Local) False Discovery Rates and Higher Criticism and R Development Core Team, stats, 2009. I ran tests with both of these libraries and confirmed that they yield the same results. More R implementations are referenced in the aforementioned Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods.
The implementation is also inspired by and cross-checked against other implementations: Ted Harding, 2007; scikit-learn; Andrew Tulloch, 2014, Julia; Andrew Tulloch, 2014, C++, described in Andrew Tulloch, Speeding up isotonic regression in scikit-learn by 5,000x; Fabian Pedregosa, 2012; Sreangsu Acharyya, libpav; and Gustav Larsson.
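For readers unfamiliar with the algorithm, here is a minimal sketch of weighted pool adjacent violators for a non-decreasing fit, together with a two-pass variant in the spirit of the parallel approach mentioned above. It is not the code in this PR. The names PavSketch, pav, chunkedPav and chunkSize are hypothetical, plain array chunks stand in for RDD partitions, and the sketch assumes the input is already sorted by feature and that all weights are strictly positive.

```scala
object PavSketch {

  /** Weighted PAV for a non-decreasing fit.
    * Assumes labels are ordered by feature and every weight is > 0. */
  def pav(labels: Array[Double], weights: Array[Double]): Array[Double] = {
    require(labels.length == weights.length, "labels and weights must have the same length")
    val n = labels.length
    // Stack of pooled blocks: weighted mean, total weight, number of points.
    val means = new Array[Double](n)
    val totals = new Array[Double](n)
    val sizes = new Array[Int](n)
    var top = -1
    for (i <- 0 until n) {
      // Push the next point as a block of its own.
      top += 1
      means(top) = labels(i)
      totals(top) = weights(i)
      sizes(top) = 1
      // Pool with the previous block while the pair violates monotonicity.
      while (top > 0 && means(top - 1) > means(top)) {
        val w = totals(top - 1) + totals(top)
        means(top - 1) = (means(top - 1) * totals(top - 1) + means(top) * totals(top)) / w
        totals(top - 1) = w
        sizes(top - 1) += sizes(top)
        top -= 1
      }
    }
    // Expand each block back to one fitted value per input point.
    val fitted = new Array[Double](n)
    var pos = 0
    for (b <- 0 to top) {
      java.util.Arrays.fill(fitted, pos, pos + sizes(b), means(b))
      pos += sizes(b)
    }
    fitted
  }

  /** Two-pass variant: fit each chunk independently, then run PAV once more
    * over the concatenated partial fits using the original weights. */
  def chunkedPav(labels: Array[Double], weights: Array[Double], chunkSize: Int): Array[Double] = {
    require(chunkSize > 0, "chunkSize must be positive")
    val partial = labels.grouped(chunkSize).zip(weights.grouped(chunkSize))
      .flatMap { case (ls, ws) => pav(ls, ws).iterator }
      .toArray
    pav(partial, weights)
  }
}
```

For example, PavSketch.pav(Array(1.0, 3.0, 2.0, 4.0), Array(1.0, 1.0, 1.0, 1.0)) returns Array(1.0, 2.5, 2.5, 4.0): the violating pair (3.0, 2.0) is pooled into its weighted mean. The chunked variant should agree with the single-pass result up to floating-point rounding, because pooling within a chunk never changes a block's weighted mean.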