[SPARK-41008][MLLIB] Dedup isotonic regression duplicate features #38966
Conversation
Looking good, if tests pass. Is it fair to say this keeps existing tests and adds new ones? My only concern is about introducing new unintended behavior changes, but if it's passing all the previous tests, that's good enough for purposes here.
// Utility object, holds a buffer of all points with unique features so far, and performs
// weighted sum accumulation of points. Hides these details for better readability of the
// main algorithm.
object PointsAccumulator {
Maybe make this a top-level object in this file at least, if not a separate file
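For readers following along, here is a rough sketch of the kind of helper the comment describes; the name and structure are illustrative, not the PR's actual `PointsAccumulator`. It assumes input already sorted by feature, with tuples ordered (label, feature, weight) as elsewhere in MLlib's isotonic regression:

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only (not the PR's actual code): walks points sorted by
// feature and collapses each run of equal features into one point with
// weighted-average label, weighted-average feature, and summed weight.
class PointsAccumulatorSketch {
  private val output = ArrayBuffer.empty[(Double, Double, Double)]
  private var labelSum, featureSum, weightSum = 0.0
  private var currentFeature = Double.NaN

  def append(label: Double, feature: Double, weight: Double): Unit = {
    if (feature != currentFeature) flush() // a new feature value starts a new run
    currentFeature = feature
    labelSum += label * weight
    featureSum += feature * weight
    weightSum += weight
  }

  // Emit the accumulated run as a single aggregated point.
  private def flush(): Unit = {
    if (weightSum > 0) {
      output += ((labelSum / weightSum, featureSum / weightSum, weightSum))
    }
    labelSum = 0.0; featureSum = 0.0; weightSum = 0.0
  }

  def result(): Array[(Double, Double, Double)] = { flush(); output.toArray }
}
```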
  weight > 0
}

if (cleanInput.isEmpty) Array.empty
Expand the if-else so bodies are on new lines
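That is, something along these lines (names are stand-ins, not the actual diff):

```scala
// Illustrative only: each branch body on its own line, as the review asks.
def run(cleanInput: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
  if (cleanInput.isEmpty) {
    Array.empty
  } else {
    cleanInput // stand-in for the actual else-branch computation
  }
}
```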
@@ -194,40 +241,46 @@ class IsotonicRegressionSuite extends SparkFunSuite with MLlibTestSparkContext w
     assert(model.predict(0.5) === 1.5)
     assert(model.predict(0.75) === 1.75)
     assert(model.predict(1) === 2)
-    assert(model.predict(2) === 10d/3)
-    assert(model.predict(9) === 10d/3)
+    assert(model.predict(2) === 10d / 3)
You're welcome to just write 10.0 / 3 here too
Well, the behavior has changed in duplicate-feature cases. Specifically, these two tests were the only ones I had to change to conform to the new behavior.
Added new tests:
Of course, yeah, this is going to change behavior where a bug is fixed. Other than that, do the tests now return different answers from sklearn where they agreed before? I'm only concerned about other regressions here.
I don't remember whether I've checked all tests against sklearn. I'll double-check anyway.
Double-checking is great, though I'm asking whether you meant in your comment that you know a test currently disagrees with sklearn already.
Double-checking on this branch, all tests produce the same results as sklearn. On master, these two (already existing) tests produce different results than sklearn:
OK, if it was already varying from sklearn, and isn't the same issue as you're fixing, leave it for now.
Aha, I see your point! It was varying because of the issue I'm fixing, which is about handling duplicate features.
... So, if we run this new code on the existing (old) master's suite, those two tests will fail.
Ah OK, hm, that sounds like a regression related to this change then and probably needs to be fixed (or decide the old test was wrong).
Regardless of this change, these tests' assumptions contradict what this ticket is about. See the tests' names at least: the tests are about handling duplicate features. My guess is that these were introduced to cover corner cases of lexicographic sorting and repartitioning. However, we know that the original implementation doesn't actually handle consolidation of duplicate features. Hence these tests are, by definition, not valid.
OK, so the test result is different, but maybe it wasn't correct in the first place. But it doesn't match sklearn now, and did before(?). I'm still a little concerned that this changes behavior in unintended ways, but maybe you have better judgment about whether this is still mostly an improvement in the result.
I'll try to trace the history of sklearn; maybe it has changed behavior. All I can say here is that if sklearn should be the reference, then these tests are definitely wrong, and the issue is due to not doing duplicate-feature consolidation.
Merged to master
So, it is true: scikit-learn changed behavior just after Spark introduced isotonic regression. I've tested different scikit-learn versions (namely: 0.13.1, 0.14.1, 0.15.0, 0.15.2), from when isotonic regression was first introduced until the duplicates PR was applied, and interestingly none has matched master's Spark.

```python
from sklearn.isotonic import IsotonicRegression

IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y, sample_weight=weights).predict(x_test)  # >= 0.15
IsotonicRegression(increasing=isotonic).fit(x, y, sample_weight=weights).predict(x_test)  # = 0.14
IsotonicRegression().fit(x, y, weight=weights).predict(x_test)  # = 0.13

# test("isotonic regression prediction with duplicate features")
x, y, weights, x_test, isotonic = [1, 1, 2, 2, 3, 3], [2, 1, 4, 2, 6, 5], [1, 1, 1, 1, 1, 1], [0, 1.5, 2.5, 4], True
# spark  : array([ 1  , 2   , 4.5 , 6   ])
# v0.13.1: error
# v0.14.1: error
# v0.15.0: array([ nan, 2.  , 4.5 , 5.  ])
# v0.15.2: array([ 2. , 2.  , 4.5 , 6.  ])
# v0.16.0: array([ 1.5, 2.25, 4.25, 5.5 ])

# test("antitonic regression prediction with duplicate features")
x, y, weights, x_test, isotonic = [1, 1, 2, 2, 3, 3], [5, 6, 2, 4, 1, 2], [1, 1, 1, 1, 1, 1], [0, 1.5, 2.5, 4], False
# spark  : array([ 6  , 4.5 , 2   , 1   ])
# v0.13.1: error
# v0.14.1: error
# v0.15.0: array([ nan, 4.25, 2.25, 1.5 ])
# v0.15.2: array([ 5.5, 4.25, 2.25, 1.5 ])
# v0.16.0: array([ 5.5, 4.25, 2.25, 1.5 ])
```

This R package suggests that there are three methods for breaking ties: primary, secondary, and tertiary. Spark and scikit-learn originally implemented "primary". Then scikit-learn replaced it with "secondary", which is what this PR implements.
I think we need to follow up with documentation updates: https://spark.apache.org/docs/latest/mllib-isotonic-regression.html#:~:text=If%20the%20prediction,point%20are%20used.
OK, as long as it's logical and consistent, and more logical/consistent than before, I think it's OK. Feel free to update docs as you see fit.
  .mapPartitions(p => Iterator(p.toArray.sortBy(x => (x._2, x._1))))
  // Aggregate points with equal features into a single point.
  .map(makeUnique)
  .flatMap(poolAdjacentViolators)
  .collect()
  // Sort again because collect() doesn't promise ordering.
  .sortBy(x => (x._2, x._1))
I think it is now redundant to sort by labels, since we are already making the features unique.
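Illustratively (my sketch, with tuples as (label, feature, weight)):

```scala
// Once makeUnique has collapsed duplicates, features are distinct, so the
// label component of the sort key can never break a tie and is redundant.
val points: Array[(Double, Double, Double)] =
  Array((5.5, 3.0, 2.0), (1.5, 1.0, 2.0), (3.0, 2.0, 2.0))

val withTieBreaker = points.sortBy(x => (x._2, x._1)) // feature, then label
val featureOnly    = points.sortBy(_._2)              // same order here
```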
You can open up a follow-up PR on the same JIRA if there are minor additional improvements or docs to suggest
I'll follow up with a PR for the doc and removing the redundant sort key.
Can one of the admins verify this patch?
### What changes were proposed in this pull request?
A follow-up on #38966 to update relevant documentation and remove a redundant sort key.

### Why are the changes needed?
For isotonic regression, another method for breaking ties of repeated features was introduced in #38966. It aggregates points having the same feature value by computing the weighted average of the labels.
- This only requires points to be sorted by features instead of by features and labels, so we should remove the label as a secondary sorting key.
- Isotonic regression documentation needs to be updated to reflect the new behavior.

### Does this PR introduce _any_ user-facing change?
Isotonic regression documentation update. The documentation described the behavior of the algorithm when there are points in the input with repeated features. Since this behavior has changed, the documentation needs to describe the new behavior.

### How was this patch tested?
Existing tests passed. No need to add new tests since existing tests are already comprehensive.

Closes #38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.

Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Adding a pre-processing step to isotonic regression in MLlib to handle duplicate features. This is to match the `sklearn` implementation. Input points with duplicate feature values are aggregated into a single point, using as label the weighted average of the labels of the points with duplicate feature values. All points for a unique feature value are aggregated as:
- Aggregated label is the weighted average of all labels.
- Aggregated feature is the weighted average of all equal features. It is possible for feature values to be equal only up to a resolution due to representation errors; since we cannot know which feature value to use in that case, we compute the weighted average of the features. Ideally, all feature values will be equal and the weighted average is just the value at any point.
- Aggregated weight is the sum of all weights.

### Why are the changes needed?
As per discussion on ticket [SPARK-41008](https://issues.apache.org/jira/browse/SPARK-41008), it is a bug and results should match `sklearn`.

### Does this PR introduce _any_ user-facing change?
There are no changes to the API, documentation, or error messages. However, the user should expect results to change.

### How was this patch tested?
Existing test cases for duplicate features failed. These tests were adjusted accordingly. Also, new tests were added.

Here is a Python snippet that can be used to verify the results:

```python
from sklearn.isotonic import IsotonicRegression

def test(x, y, x_test, isotonic=True):
    ir = IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y)
    y_test = ir.predict(x_test)

    def print_array(label, a):
        print(f"{label}: [{', '.join([str(i) for i in a])}]")

    print_array("boundaries", ir.X_thresholds_)
    print_array("predictions", ir.y_thresholds_)
    print_array("y_test", y_test)

test(
    x = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    y = [1, 0, 0, 1, 0, 1, 0, 0, 0],
    x_test = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20]
)
```

Closes apache#38966 from ahmed-mahran/ml-isotonic-reg-dups.

Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>
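For comparison on the Spark side, here is a usage sketch (mine, not part of the PR) running MLlib's `IsotonicRegression` on the same input as the Python snippet; under the new behavior, the aggregated labels per unique feature should come out as 0.25, ~0.333, and 0.5:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.IsotonicRegression

object IsotonicDupFeaturesCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iso-dups").setMaster("local[*]"))

    // Same data as the Python snippet, as (label, feature, weight) triples.
    val input = sc.parallelize(Seq(
      (1.0, 0.6, 1.0), (0.0, 0.6, 1.0),
      (0.0, 0.333, 1.0), (1.0, 0.333, 1.0), (0.0, 0.333, 1.0),
      (1.0, 0.20, 1.0), (0.0, 0.20, 1.0), (0.0, 0.20, 1.0), (0.0, 0.20, 1.0)))

    val model = new IsotonicRegression().setIsotonic(true).run(input)

    // With the dedup pre-processing, duplicate features collapse to the
    // weighted averages of their labels: 0.25 at 0.20, ~0.333 at 0.333, and
    // 0.5 at 0.6 (already non-decreasing, so PAV changes nothing).
    println(model.boundaries.mkString(", "))
    println(model.predictions.mkString(", "))

    sc.stop()
  }
}
```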