[tests][dask] Increase number of partitions in data #4149
Closed
Commits (7):
f484d0b  increase partitions to 20 in _create_data to avoid having workers wit… (jmoralez)
cde4fc0  merge master (jmoralez)
981084f  increase top_rate for goss in test_classifier (jmoralez)
c6f1115  merge master (jmoralez)
0a6cc22  merge master (jmoralez)
d85c54a  merge master (jmoralez)
4df2898  Merge branch 'master' into tests/more-partitions (jmoralez)
it looks like this was added since I last reviewed (981084f). Can you please explain why it's necessary?
test_classifier became flaky in this PR. I assume it's because previously we weren't performing distributed training, or at least not every time, so adding this generated some failures in multiclass classification for data_parallel-dart, voting_parallel-rf (this one is very surprising, given that the atol is 0.8), voting_parallel-gbdt, voting_parallel-dart, and voting_parallel-goss. Most of them are for dataframes with categoricals, but there are a couple with sparse matrices. I have to debug them to see what's actually happening; this is a very simple classification problem and I'd expect to get a perfect score with little effort. I'll ping you here once I'm done, but it could take a bit haha.
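As a rough illustration of why the partition count can decide whether training is actually distributed, the sketch below uses a toy model that is purely an assumption: each partition lands on a uniformly random worker, independently of the others. Dask's real scheduler does not place data this way, and the function name is hypothetical, so treat the numbers only as intuition for "more partitions makes an empty worker less likely".

```python
from math import comb


def prob_every_worker_has_data(n_partitions: int, n_workers: int) -> float:
    """Probability that no worker ends up empty.

    Toy assumption (not Dask's actual placement logic): every partition is
    assigned to a uniformly random worker, independently of the others.
    Computed with inclusion-exclusion over the set of empty workers.
    """
    return sum(
        (-1) ** k * comb(n_workers, k) * ((n_workers - k) / n_workers) ** n_partitions
        for k in range(n_workers + 1)
    )


if __name__ == "__main__":
    # Under this toy model, compare a small and a larger partition count
    # for a two-worker cluster.
    for n_partitions in (2, 20):
        p = prob_every_worker_has_data(n_partitions, n_workers=2)
        print(f"{n_partitions} partitions, 2 workers: P(all workers have data) = {p:.6f}")
```

Under the same assumption, a run in which one worker holds all the data would behave like local training, which might explain tests that only fail some of the time.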
got it, thanks! Let me know if you need any help
Given the small dataset sizes we use in tests, I think it would be useful to set min_data_in_leaf: 0 everywhere. That might improve the predictability of the results.
Sorry this is taking so long, I haven't had much time and I'm really confused by this. The same data point makes the test fail even for data_parallel and gbdt. I'm trying to figure out what exactly is going on here. I have the test in a while loop and it eventually fails because of that data point; I'm not sure what's wrong with it haha.
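A minimal sketch of that kind of rerun loop, for anyone wanting to reproduce the approach: it calls a test body repeatedly and counts which sample indices mismatch. The names `collect_failures` and `fake_run_once` are hypothetical, and `fake_run_once` is only a stand-in for one local-vs-Dask comparison, not the real test.

```python
from collections import Counter
from typing import Callable, Iterable


def collect_failures(run_once: Callable[[int], Iterable[int]], n_runs: int) -> Counter:
    """Run `run_once(seed)` for seeds 0..n_runs-1 and count mismatching sample indices.

    `run_once` is expected to return the indices of the samples whose
    predictions disagreed in that run (an empty iterable if the run passed).
    """
    counts: Counter = Counter()
    for seed in range(n_runs):
        counts.update(run_once(seed))
    return counts


def fake_run_once(seed: int) -> list:
    # Hypothetical stand-in: pretend one fixed sample mismatches on every
    # tenth run. A real version would train both models and compare them.
    return [377] if seed % 10 == 3 else []


if __name__ == "__main__":
    failures = collect_failures(fake_run_once, n_runs=50)
    # If a single index dominates the counts, the flakiness is tied to that
    # data point rather than spread across the dataset.
    print(failures.most_common(3))
```

Counting indices rather than stopping at the first failure makes it easier to tell whether one sample is responsible every time.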
Btw, setting min_data_in_leaf=0 gives this error:
LightGBMError: Check failed: (best_split_info.right_count) > (0) at /hdd/github/LightGBM/src/treelearner/serial_tree_learner.cpp, line 663
Do you think this could be related to #4026? This data is shuffled, but I think forcing few samples in a leaf makes it more likely to get an empty split in one of the workers.
Here's abs(local_probas - dask_probas) per iteration for data_parallel gbdt for just that one sample (index 377):

So from the 7th iteration onwards the probabilities start to increasingly differ. I think there's definitely something strange going on here.
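The comparison described above can be sketched as follows. The helper names and the numbers are made up for illustration; they are not the values from this PR. To obtain real per-iteration probabilities, one option (an assumption about how the values were produced, not something the thread confirms) is to call `predict_proba` on each fitted model with `num_iteration=i` for increasing `i`.

```python
from typing import List, Optional, Sequence


def abs_diff_per_iteration(local: Sequence[float], distributed: Sequence[float]) -> List[float]:
    """Element-wise |local - distributed| for one sample's per-iteration probabilities."""
    if len(local) != len(distributed):
        raise ValueError("both sequences must cover the same number of iterations")
    return [abs(a - b) for a, b in zip(local, distributed)]


def first_divergence(diffs: Sequence[float], tol: float) -> Optional[int]:
    """Return the 1-based iteration at which the difference first exceeds `tol`."""
    for iteration, diff in enumerate(diffs, start=1):
        if diff > tol:
            return iteration
    return None


if __name__ == "__main__":
    # Made-up probabilities for a single sample and class, one value per iteration.
    local_probas = [0.50, 0.60, 0.70, 0.80]
    dask_probas = [0.50, 0.60, 0.75, 0.90]
    diffs = abs_diff_per_iteration(local_probas, dask_probas)
    print(diffs)
    print(first_divergence(diffs, tol=1e-6))
```

Locating the first iteration where the two models disagree narrows the search to the tree built at that iteration.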