Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jaccard Dissimilarity Function #129

Merged
merged 12 commits into from
Jul 26, 2019
Merged

Jaccard Dissimilarity Function #129

merged 12 commits into from
Jul 26, 2019

Conversation

BikashPandey17
Copy link
Contributor

Tried to cover all unit tests for Jaccard dissimilarity function for both label encoded and binary encoded variables. Looking forward to the inputs.

@coveralls
Copy link

Coverage Status

Coverage increased (+1.4%) to 98.475% when pulling 590a51f on BikashPandey17:master into 2bc7fce on nicodv:master.

@coveralls
Copy link

coveralls commented Jul 21, 2019

Coverage Status

Coverage increased (+0.1%) to 97.204% when pulling 4f2efa0 on BikashPandey17:master into 2bc7fce on nicodv:master.

Copy link
Owner

@nicodv nicodv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

I left a few comments for you to address.

kmodes/util/dissim.py Outdated Show resolved Hide resolved
kmodes/util/dissim.py Outdated Show resolved Hide resolved
kmodes/util/dissim.py Outdated Show resolved Hide resolved
kmodes/util/dissim.py Outdated Show resolved Hide resolved
kmodes/util/dissim.py Outdated Show resolved Hide resolved
kmodes/tests/test_kmodes.py Outdated Show resolved Hide resolved
kmodes/tests/test_kmodes.py Outdated Show resolved Hide resolved
kmodes/tests/test_kmodes.py Outdated Show resolved Hide resolved
kmodes/util/tests/test_dissim.py Show resolved Hide resolved
@@ -124,6 +124,101 @@
# Drop target column
SOYBEAN2 = SOYBEAN2[:, :35]

SOYBEAN4 = np.array([
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BikashPandey17 , what's the source of this data? This is not the soy bean data (as the name suggests), right?

Copy link
Contributor Author

@BikashPandey17 BikashPandey17 Jul 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jaccard dissimilarity was not a good distance metric for the soybean data, as a result, the whole data was considered into a single cluster which looked like this [0 0 0 0 0 ... 0 0 0]. I wanted to put an example where distinct clusters were formed so I used this instead. SOYBEAN4 is actually a misleading name, this is just another categorical data with Label encoding(definitely not related to SOYBEAN).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename it then. I'll merge after that.

intersect_len[i] = len(np.intersect1d(row, b))
union_len[i] = len(row) + len(b) - intersect_len[i]
i += 1
return intersect_len / union_len
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please handle exceptional case where union_len == 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never actually happen unless both my arrays come in empty. Still, I'll put up a check, in case this situation arises.

kmodes/util/tests/test_dissim.py Show resolved Hide resolved
with self.assertRaises(ValueError):
jaccard_dissim_binary(a, b)

def test_jaccard_dissim_label(self):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests where dissimilarity is 0 and 1.

Copy link
Contributor Author

@BikashPandey17 BikashPandey17 Jul 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicodv Just to be sure I should be calculating Jaccard distance instead of Jaccard coefficient? Have a look here https://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html

Copy link
Owner

@nicodv nicodv Jul 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, that's correct!

It looks like you're doing the Jaccard coefficient now. You'll need to add 1 - x and update the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did that.

with self.assertRaises(ValueError):
jaccard_dissim_binary(a, b)

def test_jaccard_dissim_label(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did that.

with self.assertRaises(ValueError):
jaccard_dissim_binary(a, b)

# test for dissimilarity = 0 both sets are same
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicodv This test here between same sets clears out things.

@nicodv nicodv self-requested a review July 26, 2019 20:21
@nicodv nicodv merged commit ea85c94 into nicodv:master Jul 26, 2019
@nicodv
Copy link
Owner

nicodv commented Jul 26, 2019

Nice work, @BikashPandey17 , thanks!

@BikashPandey17
Copy link
Contributor Author

Looking forward to more contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants