[tests][dask] Increase number of partitions in data #4149

Closed · wants to merge 7 commits · Changes from 3 commits
4 changes: 2 additions & 2 deletions tests/python_package_test/test_dask.py
@@ -133,7 +133,7 @@ def _create_ranking_data(n_samples=100, output='array', chunk_size=50, **kwargs):
return X, y, w, g_rle, dX, dy, dw, dg


-def _create_data(objective, n_samples=1_000, output='array', chunk_size=500, **kwargs):
+def _create_data(objective, n_samples=1_000, output='array', chunk_size=50, **kwargs):
if objective.endswith('classification'):
if objective == 'binary-classification':
centers = [[-4, -4], [4, 4]]
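
For context, here's a minimal sketch (not part of the diff) of what the chunk_size change does: with n_samples=1_000, the old chunk_size=500 produces only 2 row partitions, while chunk_size=50 produces 20, so training is actually distributed across workers every time.

```python
# Illustrative sketch only -- shows how chunk_size maps to the number of
# row partitions in the Dask collections built by _create_data.
import dask.array as da
import numpy as np

X = np.random.rand(1_000, 10)

dX_old = da.from_array(X, chunks=(500, 10))  # old chunk_size: 2 row chunks
dX_new = da.from_array(X, chunks=(50, 10))   # new chunk_size: 20 row chunks

print(dX_old.numblocks[0])  # 2
print(dX_new.numblocks[0])  # 20
```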
@@ -255,7 +255,7 @@ def test_classifier(output, task, boosting_type, tree_learner, client):
'bagging_fraction': 0.9,
})
elif boosting_type == 'goss':
-params['top_rate'] = 0.5
+params['top_rate'] = 0.7
Collaborator

it looks like this was added since I last reviewed (981084f). Can you please explain why it's necessary?

Collaborator Author

test_classifier became flaky in this PR. I assume it's because previously we weren't performing distributed training, or at least not every time, so adding this generated some failures in multiclass classification for data_parallel-dart, voting_parallel-rf (this one is very surprising, given that the atol is 0.8), voting_parallel-gbdt, voting_parallel-dart, and voting_parallel-goss. Most of them are for dataframe with categoricals, but there are a couple with sparse matrices. I have to debug them to see what's actually happening; this is a very simple classification problem and I'd expect to get a perfect score with little effort. I'll ping you here once I'm done, but it could take a bit haha.
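
For reference, a hedged sketch of the kind of comparison test_classifier makes (the helper name and tolerance handling here are illustrative, not copied from the test): the Dask model's class probabilities are compared against a local model fit on the same data, and a flaky run is one where this check intermittently fails.

```python
# Illustrative sketch only -- not the actual test code. Assumes a fitted
# local LGBMClassifier and a fitted DaskLGBMClassifier on the same data.
import numpy as np

def check_probas_close(local_clf, dask_clf, X, dX, atol=0.8):
    """Compare local vs. distributed class probabilities within a tolerance."""
    local_probas = local_clf.predict_proba(X)
    dask_probas = dask_clf.predict_proba(dX).compute()  # Dask result is lazy
    np.testing.assert_allclose(local_probas, dask_probas, atol=atol)
```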

Collaborator

got it, thanks! Let me know if you need any help

Collaborator

> I'd expect to get a perfect score with little effort

Given the small dataset sizes we use in tests, I think it would be useful to set min_data_in_leaf: 0 everywhere. That might improve the predictability of the results.
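
A minimal sketch of that suggestion (only min_data_in_leaf comes from this thread; the surrounding parameters are placeholders):

```python
# Illustrative params dict; min_data_in_leaf: 0 is the suggestion above,
# everything else is a placeholder.
params = {
    'objective': 'multiclass',
    'n_estimators': 10,
    'num_leaves': 10,
    'min_data_in_leaf': 0,  # don't block splits just because test partitions are tiny
}
```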

Collaborator Author

Sorry this is taking so long; I haven't had much time and I'm really confused by this. The same data point makes the test fail even for data_parallel and gbdt. I'm trying to figure out what exactly is going on here: I have the test in a while loop and it eventually fails because of that data point, and I'm not sure what's wrong with it haha.
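
A hedged sketch of that kind of reproduction loop (illustrative; the actual script isn't in this thread): keep refitting and comparing until the tolerance is exceeded, then record which sample indices are responsible.

```python
# Illustrative reproduction loop -- not the author's actual code. Assumes
# unfitted local and Dask classifiers plus matching local/Dask data.
import numpy as np

def find_failing_samples(local_clf, dask_clf, X, y, dX, dy, atol=0.8, max_tries=100):
    for _ in range(max_tries):
        local_probas = local_clf.fit(X, y).predict_proba(X)
        dask_probas = dask_clf.fit(dX, dy).predict_proba(dX).compute()
        diff = np.abs(local_probas - dask_probas).max(axis=1)
        bad = np.flatnonzero(diff > atol)
        if bad.size:
            return bad  # indices of samples whose probabilities disagree beyond atol
    return np.array([], dtype=int)
```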

Collaborator Author

Btw, setting min_data_in_leaf=0 gives this error: LightGBMError: Check failed: (best_split_info.right_count) > (0) at /hdd/github/LightGBM/src/treelearner/serial_tree_learner.cpp, line 663. Do you think this could be related to #4026? This data is shuffled, but I think forcing few samples into a leaf makes it more likely to get an empty split in one of the workers.

Collaborator Author

Here's abs(local_probas - dask_probas) per iteration for data_parallel gbdt for just that one sample (index 377):
[plot: abs(local_probas - dask_probas) per boosting iteration for sample 377]
So from the 7th iteration onwards the probabilities increasingly differ. I think there's definitely something strange going on here.
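
A hedged sketch of how a plot like that could be produced (illustrative; the plotting script isn't part of this thread): predict with an increasing num_iteration and track the absolute probability difference for the one sample.

```python
# Illustrative only -- assumes a fitted local LGBMClassifier, a fitted
# DaskLGBMClassifier trained on the same data, and X as a NumPy array.
import numpy as np

def per_iteration_diff(local_clf, dask_clf, X, sample_idx=377):
    # Convert the Dask model to a plain LGBMClassifier so both predict locally.
    dask_as_local = dask_clf.to_local()
    diffs = []
    for it in range(1, local_clf.n_estimators + 1):
        p_local = local_clf.predict_proba(X[[sample_idx]], num_iteration=it)
        p_dask = dask_as_local.predict_proba(X[[sample_idx]], num_iteration=it)
        diffs.append(np.abs(p_local - p_dask).max())
    return diffs  # one value per boosting iteration
```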


dask_classifier = lgb.DaskLGBMClassifier(
client=client,