[SPARK-41008][MLLIB] Dedup isotonic regression duplicate features #38966
Conversation
Looking good, if tests pass. Is it fair to say this keeps existing tests and adds new ones? My only concern is about introducing new unintended behavior changes, but if it's passing all the previous tests, that's good enough for purposes here.
// Utility object, holds a buffer of all points with unique features so far, and performs
// weighted sum accumulation of points. Hides these details for better readability of the
// main algorithm.
object PointsAccumulator {
Maybe make this a top-level object in this file at least, if not a separate file
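For readers following along, here is a rough sketch of the kind of helper the comment describes; the name and structure are illustrative, not the PR's actual `PointsAccumulator`. It assumes input already sorted by feature, with tuples ordered (label, feature, weight) as elsewhere in MLlib's isotonic regression:

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only (not the PR's actual code): walks points sorted by
// feature and collapses each run of equal features into one point with
// weighted-average label, weighted-average feature, and summed weight.
class PointsAccumulatorSketch {
  private val output = ArrayBuffer.empty[(Double, Double, Double)]
  private var labelSum, featureSum, weightSum = 0.0
  private var currentFeature = Double.NaN

  def append(label: Double, feature: Double, weight: Double): Unit = {
    if (feature != currentFeature) flush() // a new feature value starts a new run
    currentFeature = feature
    labelSum += label * weight
    featureSum += feature * weight
    weightSum += weight
  }

  // Emit the accumulated run as a single aggregated point.
  private def flush(): Unit = {
    if (weightSum > 0) {
      output += ((labelSum / weightSum, featureSum / weightSum, weightSum))
    }
    labelSum = 0.0; featureSum = 0.0; weightSum = 0.0
  }

  def result(): Array[(Double, Double, Double)] = { flush(); output.toArray }
}
```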
  weight > 0
}

if (cleanInput.isEmpty) Array.empty
Expand the if-else so bodies are on new lines
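That is, something along these lines (names are stand-ins, not the actual diff):

```scala
// Illustrative only: each branch body on its own line, as the review asks.
def run(cleanInput: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
  if (cleanInput.isEmpty) {
    Array.empty
  } else {
    cleanInput // stand-in for the actual else-branch computation
  }
}
```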
@@ -194,40 +241,46 @@ class IsotonicRegressionSuite extends SparkFunSuite with MLlibTestSparkContext w
     assert(model.predict(0.5) === 1.5)
     assert(model.predict(0.75) === 1.75)
     assert(model.predict(1) === 2)
-    assert(model.predict(2) === 10d/3)
-    assert(model.predict(9) === 10d/3)
+    assert(model.predict(2) === 10d / 3)
You're welcome to just write 10.0 / 3 here too
Well, the behavior has changed in duplicate-feature cases. Specifically, these two tests were the only ones I had to change to conform to the new behavior.
Added new tests:
Of course, yeah, this is going to change behavior where a bug is fixed. Other than that, do the tests now return different answers from sklearn where they agreed before? I'm only concerned about other regressions here.
I don't remember whether I've checked all tests against sklearn. I'll double-check anyway.
Double-checking is great, though I'm asking whether you meant in your comment that you know a test currently disagrees with sklearn already.
Double-checking on this branch, all tests produce the same results as sklearn. On master, these two (already existing) tests produce different results than sklearn:
OK, if it was already varying from sklearn, and isn't the same issue as you're fixing, leave it for now.
Aha, I see your point! It was varying because of the issue I'm fixing, which is about handling duplicate features.
... So, if we run this new code on the existing (old) master's suite, those two tests will fail.
Ah OK, hm, that sounds like a regression related to this change then and probably needs to be fixed (or decide the old test was wrong).
Regardless of this change, these tests' assumptions contradict what this ticket is about. See the tests' names at least: the tests are about handling duplicate features. My guess is that these were introduced to cover corner cases of lexicographic sorting and repartitioning. However, we know that the original implementation doesn't actually handle consolidation of duplicate features. Hence these tests are, by definition, not valid.
OK, so the test result is different, but maybe it wasn't correct in the first place. But it doesn't match sklearn now, and did before(?). I'm still a little concerned that this changes behavior in unintended ways, but maybe you have better judgment about whether this is still mostly an improvement in the result.
I'll try to trace the history of sklearn; maybe it has changed behavior. All I can say here is that if sklearn should be the reference, then these tests are definitely wrong, and the issue is due to not doing duplicate-feature consolidation.
Merged to master
So, it is true: scikit-learn changed behavior just after Spark introduced isotonic regression. I've tested different scikit-learn versions (namely: 0.13.1, 0.14.1, 0.15.0, 0.15.2), from when isotonic regression was first introduced until the duplicates PR was applied, and interestingly none has matched master's Spark.

```python
from sklearn.isotonic import IsotonicRegression

IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y, sample_weight=weights).predict(x_test)  # >= 0.15
IsotonicRegression(increasing=isotonic).fit(x, y, sample_weight=weights).predict(x_test)  # = 0.14
IsotonicRegression().fit(x, y, weight=weights).predict(x_test)  # = 0.13

# test("isotonic regression prediction with duplicate features")
x, y, weights, x_test, isotonic = [1, 1, 2, 2, 3, 3], [2, 1, 4, 2, 6, 5], [1, 1, 1, 1, 1, 1], [0, 1.5, 2.5, 4], True
# spark  : array([ 1  , 2   , 4.5 , 6   ])
# v0.13.1: error
# v0.14.1: error
# v0.15.0: array([ nan, 2.  , 4.5 , 5.  ])
# v0.15.2: array([ 2. , 2.  , 4.5 , 6.  ])
# v0.16.0: array([ 1.5, 2.25, 4.25, 5.5 ])

# test("antitonic regression prediction with duplicate features")
x, y, weights, x_test, isotonic = [1, 1, 2, 2, 3, 3], [5, 6, 2, 4, 1, 2], [1, 1, 1, 1, 1, 1], [0, 1.5, 2.5, 4], False
# spark  : array([ 6  , 4.5 , 2   , 1   ])
# v0.13.1: error
# v0.14.1: error
# v0.15.0: array([ nan, 4.25, 2.25, 1.5 ])
# v0.15.2: array([ 5.5, 4.25, 2.25, 1.5 ])
# v0.16.0: array([ 5.5, 4.25, 2.25, 1.5 ])
```

This R package suggests that there are three methods for breaking ties: primary, secondary, and tertiary. Spark and scikit-learn originally implemented "primary". Then scikit-learn replaced it with "secondary", which is what this PR implements.
I think we need to follow up with documentation updates: https://spark.apache.org/docs/latest/mllib-isotonic-regression.html#:~:text=If%20the%20prediction,point%20are%20used.
OK, as long as it's logical and consistent, and more logical/consistent than before, I think it's OK. Feel free to update docs as you see fit.
  .mapPartitions(p => Iterator(p.toArray.sortBy(x => (x._2, x._1))))
  // Aggregate points with equal features into a single point.
  .map(makeUnique)
  .flatMap(poolAdjacentViolators)
  .collect()
  // Sort again because collect() doesn't promise ordering.
  .sortBy(x => (x._2, x._1))
I think it is now redundant to sort by labels, since we are already making the features unique.
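Illustratively (my sketch, with tuples as (label, feature, weight)):

```scala
// Once makeUnique has collapsed duplicates, features are distinct, so the
// label component of the sort key can never break a tie and is redundant.
val points: Array[(Double, Double, Double)] =
  Array((5.5, 3.0, 2.0), (1.5, 1.0, 2.0), (3.0, 2.0, 2.0))

val withTieBreaker = points.sortBy(x => (x._2, x._1)) // feature, then label
val featureOnly    = points.sortBy(_._2)              // same order here
```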
You can open up a follow-up PR on the same JIRA if there are minor additional improvements or docs to suggest
I'll follow up with a PR for the doc and removing the redundant sort key.
Can one of the admins verify this patch?
### What changes were proposed in this pull request?
A follow-up on #38966 to update relevant documentation and remove a redundant sort key.

### Why are the changes needed?
For isotonic regression, another method for breaking ties of repeated features was introduced in #38966. It aggregates points having the same feature value by computing the weighted average of the labels.
- This only requires points to be sorted by features instead of by features and labels, so we should remove the label as a secondary sorting key.
- Isotonic regression documentation needs to be updated to reflect the new behavior.

### Does this PR introduce _any_ user-facing change?
Isotonic regression documentation update. The documentation described the behavior of the algorithm when there are points in the input with repeated features. Since this behavior has changed, the documentation needs to describe the new behavior.

### How was this patch tested?
Existing tests passed. No need to add new tests since existing tests are already comprehensive.

Closes #38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.

Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Adding a pre-processing step to isotonic regression in MLlib to handle duplicate features. This is to match the `sklearn` implementation. Input points with duplicate feature values are aggregated into a single point, using as label the weighted average of the labels of the points with duplicate feature values. All points for a unique feature value are aggregated as:
- Aggregated label is the weighted average of all labels.
- Aggregated feature is the weighted average of all equal features. It is possible for feature values to be equal only up to a resolution due to representation errors; since we cannot know which feature value to use in that case, we compute the weighted average of the features. Ideally, all feature values will be equal and the weighted average is just the value at any point.
- Aggregated weight is the sum of all weights.

### Why are the changes needed?
As per discussion on ticket [SPARK-41008](https://issues.apache.org/jira/browse/SPARK-41008), it is a bug and results should match `sklearn`.

### Does this PR introduce _any_ user-facing change?
There are no changes to the API, documentation, or error messages. However, the user should expect results to change.

### How was this patch tested?
Existing test cases for duplicate features failed. These tests were adjusted accordingly. Also, new tests were added.

Here is a Python snippet that can be used to verify the results:

```python
from sklearn.isotonic import IsotonicRegression

def test(x, y, x_test, isotonic=True):
    ir = IsotonicRegression(out_of_bounds='clip', increasing=isotonic).fit(x, y)
    y_test = ir.predict(x_test)

    def print_array(label, a):
        print(f"{label}: [{', '.join([str(i) for i in a])}]")

    print_array("boundaries", ir.X_thresholds_)
    print_array("predictions", ir.y_thresholds_)
    print_array("y_test", y_test)

test(
    x = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    y = [1, 0, 0, 1, 0, 1, 0, 0, 0],
    x_test = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20]
)
```

Closes apache#38966 from ahmed-mahran/ml-isotonic-reg-dups.

Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>
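For comparison on the Spark side, here is a usage sketch (mine, not part of the PR) running MLlib's `IsotonicRegression` on the same input as the Python snippet; under the new behavior, the aggregated labels per unique feature should come out as 0.25, ~0.333, and 0.5:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.IsotonicRegression

object IsotonicDupFeaturesCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iso-dups").setMaster("local[*]"))

    // Same data as the Python snippet, as (label, feature, weight) triples.
    val input = sc.parallelize(Seq(
      (1.0, 0.6, 1.0), (0.0, 0.6, 1.0),
      (0.0, 0.333, 1.0), (1.0, 0.333, 1.0), (0.0, 0.333, 1.0),
      (1.0, 0.20, 1.0), (0.0, 0.20, 1.0), (0.0, 0.20, 1.0), (0.0, 0.20, 1.0)))

    val model = new IsotonicRegression().setIsotonic(true).run(input)

    // With the dedup pre-processing, duplicate features collapse to the
    // weighted averages of their labels: 0.25 at 0.20, ~0.333 at 0.333, and
    // 0.5 at 0.6 (already non-decreasing, so PAV changes nothing).
    println(model.boundaries.mkString(", "))
    println(model.predictions.mkString(", "))

    sc.stop()
  }
}
```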