-
Notifications
You must be signed in to change notification settings - Fork 416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jaccard Dissimilarity Function #129
Conversation
…oded variables along with related unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
I left a few comments for you to address.
kmodes/tests/test_kmodes.py
Outdated
@@ -124,6 +124,101 @@ | |||
# Drop target column | |||
SOYBEAN2 = SOYBEAN2[:, :35] | |||
|
|||
SOYBEAN4 = np.array([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BikashPandey17 , what's the source of this data? This is not the soy bean data (as the name suggests), right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jaccard dissimilarity was not a good distance metric for the soybean data, as a result, the whole data was considered into a single cluster which looked like this [0 0 0 0 0 ... 0 0 0]
. I wanted to put an example where distinct clusters were formed so I used this instead. SOYBEAN4 is actually a misleading name, this is just another categorical data with Label encoding(definitely not related to SOYBEAN).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename it then. I'll merge after that.
kmodes/util/dissim.py
Outdated
intersect_len[i] = len(np.intersect1d(row, b)) | ||
union_len[i] = len(row) + len(b) - intersect_len[i] | ||
i += 1 | ||
return intersect_len / union_len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please handle exceptional case where union_len == 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should never actually happen unless both my arrays come in empty. Still, I'll put up a check, in case this situation arises.
with self.assertRaises(ValueError): | ||
jaccard_dissim_binary(a, b) | ||
|
||
def test_jaccard_dissim_label(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests where dissimilarity is 0 and 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nicodv Just to be sure I should be calculating Jaccard distance instead of Jaccard coefficient? Have a look here https://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, that's correct!
It looks like you're doing the Jaccard coefficient now. You'll need to add 1 - x
and update the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did that.
with self.assertRaises(ValueError): | ||
jaccard_dissim_binary(a, b) | ||
|
||
def test_jaccard_dissim_label(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did that.
with self.assertRaises(ValueError): | ||
jaccard_dissim_binary(a, b) | ||
|
||
# test for dissimilarity = 0 both sets are same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nicodv This test here between same sets clears out things.
Nice work, @BikashPandey17 , thanks! |
Looking forward to more contributions! |
Tried to cover all unit tests for Jaccard dissimilarity function for both label encoded and binary encoded variables. Looking forward to the inputs.