forked from apache/spark
-
Notifications
You must be signed in to change notification settings - Fork 4
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-41008][MLLIB] Dedup isotonic regression duplicate features
### What changes were proposed in this pull request? Adding a pre-processing step to isotonic regression in mllib to handle duplicate features. This is to match `sklearn` implementation. Input points of duplicate feature values are aggregated into a single point using as label the weighted average of the labels of the points with duplicate feature values. All points for a unique feature values are aggregated as: - Aggregated label is the weighted average of all labels - Aggregated feature is the weighted average of all equal features. It is possible that feature values to be equal up to a resolution due to representation errors, since we cannot know which feature value to use in that case, we compute the weighted average of the features. Ideally, all feature values will be equal and the weighted average is just the value at any point. - Aggregated weight is the sum of all weights ### Why are the changes needed? As per discussion on ticket [[SPARK-41008]](https://issues.apache.org/jira/browse/SPARK-41008), it is a bug and results should match `sklearn`. ### Does this PR introduce _any_ user-facing change? There are no changes to the API, documentation or error messages. However, the user should expect results to change. ### How was this patch tested? Existing test cases for duplicate features failed. These tests were adjusted accordingly. Also, new tests are added. Here is a python snippet that can be used to verify the results: ```python from sklearn.isotonic import IsotonicRegression def test(x, y, x_test, isotonic=True): ir = IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y) y_test = ir.predict(x_test) def print_array(label, a): print(f"{label}: [{', '.join([str(i) for i in a])}]") print_array("boundaries", ir.X_thresholds_) print_array("predictions", ir.y_thresholds_) print_array("y_test", y_test) test( x = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20], y = [1, 0, 0, 1, 0, 1, 0, 0, 0], x_test = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20] ) ``` srowen zapletal-martin Closes apache#38966 from ahmed-mahran/ml-isotonic-reg-dups. Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io> Signed-off-by: Sean Owen <srowen@gmail.com>
- Loading branch information
1 parent
5751e8c
commit 00a45c7
Showing
2 changed files
with
262 additions
and
59 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.