
[Dask] Expected error randomly not raised in Dask test #4099

Closed

StrikerRUS opened this issue Mar 23, 2021 · 6 comments

@StrikerRUS
Collaborator

2021-03-23T11:32:00.3655481Z         error_msg = "has multiple Dask worker processes running on it"
2021-03-23T11:32:00.3656048Z         with pytest.raises(lgb.basic.LightGBMError, match=error_msg):
2021-03-23T11:32:00.3656573Z >           dask_model3.fit(dX, dy, group=dg)
2021-03-23T11:32:00.3657297Z E           Failed: DID NOT RAISE <class 'lightgbm.basic.LightGBMError'>

error_msg = "has multiple Dask worker processes running on it"
with pytest.raises(lgb.basic.LightGBMError, match=error_msg):
dask_model3.fit(dX, dy, group=dg)

Refer to #4068 (comment) and #4068 (comment) for full logs.

@StrikerRUS StrikerRUS changed the title Expected error randomly not raised in Dask test [Dask] Expected error randomly not raised in Dask test Mar 23, 2021
@jameslamb
Collaborator

hmmm interesting. In the logs mentioned in those comments, it looks like this has a different root cause from what was fixed in #4071. I think what's happening here is that the data is still all ending up on one worker somehow. This is possibly the same underlying problem as #4074, actually.

Error code 104 means "connection reset by peer" (link), which could occur in distributed training if one of the Dask workers dies and is restarted.
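As a quick reference (an illustrative check, not part of the test suite), errno 104 can be inspected directly from Python on Linux:

import errno
import os

# on Linux, errno 104 is ECONNRESET ("Connection reset by peer")
print(errno.errorcode[104])   # 'ECONNRESET'
print(os.strerror(104))       # 'Connection reset by peer'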

Similarly here, if one of the workers died before training started, it's possible that Dask moved the training data back to the other worker, and that dask_model3.fit() then ran on only a single worker.

It's possible that one of the workers died because the two previous .fit() calls before this one left behind objects that caused it to run out of memory or caused other errors on the workers.

There's no reason that this test has to be in the same test case as the other network params tests. I just did that to try to minimize the total runtime of tests (number of times we call _create_data()). I think that moving this to its own standalone test case might reduce the risk of this specific failure.
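For illustration, a standalone version of that check could look roughly like the sketch below. This is not the project's actual test code: the client fixture, data shapes, and the local_listen_port value are assumptions. The point is that the expected error only surfaces when chunks of data actually land on both worker processes of the same machine.

# Hypothetical standalone test, assuming a `client` fixture backed by a
# two-worker LocalCluster (two worker processes on one machine).
import dask.array as da
import pytest

import lightgbm as lgb


def test_error_when_machine_has_multiple_worker_processes(client):
    dX = da.random.random((100, 5), chunks=(50, 5))
    dy = da.random.random((100,), chunks=(50,))

    # NOTE: the error is only raised if chunks end up on both workers.
    # If Dask consolidates all of the data onto a single worker, .fit()
    # succeeds and the test fails with "DID NOT RAISE" (this issue).
    dask_model = lgb.DaskLGBMRegressor(
        n_estimators=5,
        num_leaves=5,
        local_listen_port=12400,
    )
    error_msg = "has multiple Dask worker processes running on it"
    with pytest.raises(lgb.basic.LightGBMError, match=error_msg):
        dask_model.fit(dX, dy)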

@StrikerRUS
Collaborator Author

It's possible that one of the workers died because the two previous .fit() calls before this one left behind objects that caused it to run out of memory or caused other errors on the workers.

client = <Client: 'tcp://127.0.0.1:36289' processes=2 threads=2, memory=16.70 GB>

I believe 16GB should be enough for the toy datasets we use in tests...

I think that moving this to its own standalone test case might reduce the risk of this specific failure.

Yeah, sure. But that doesn't fix the underlying issue, unfortunately.

I remember I asked this question before but didn't get a clear answer. Does Dask have something like a "global option for reproducibility", similar to the deterministic param in LightGBM? We use Docker a lot to make our testing environments deterministic and don't change hardware settings for agents, so I'm quite curious why with all seeds fixed and same machine configs we get very different results and frequent failures with Dask.

@jameslamb
Collaborator

so I'm quite curious why with all seeds fixed and same machine configs we get very different results and frequent failures with Dask

A couple points on this:

  1. "with Dask" is a bit unfair. The Dask tests are the only automated tests that this project has had on distributed LightGBM training and I don't believe that all of the flakiness is just due to code in lightgbm.dask or its tests. Without dedicated tests on distributed training without Dask (Write tests for parallel code #3841), it's hard to know which problems are "Dask problems" and which are "LightGBM problems" (see [dask] Early stopping #3952 (review) for an example of what I think might be a "LightGBM problem").

  2. Each Dask test case involves 4 processes (see the sketch after this list): the main pytest process, a dask-scheduler process, and two dask-worker processes. These processes compete with each other, and with other processes running in the CI environment (like any Azure processes running in the background to report logs and statuses), for CPU time and memory, so it's possible for them to occasionally fail for reasons like "this task took a little too long and triggered a timeout".
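Roughly, each test case spins up something like the following (cluster options here are assumptions; the project's actual test setup may differ):

# Rough sketch of the per-test process layout (cluster options are assumptions):
# 1 pytest process + 1 scheduler process + 2 worker processes = 4 processes.
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
client = Client(cluster)

print(client.scheduler_info()["address"])        # scheduler process
print(list(client.scheduler_info()["workers"]))  # the two worker processes

client.close()
cluster.close()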

Has Dask something like "global option for reproducibility"? Similarly to deterministic param in LightGBM

This would be incredibly difficult for Dask or any distributed system to achieve. If you want to write code of the form "move this exact data to this exact worker and then run this exact task on this exact worker..." you can do it with Dask's low-level APIs, but at that point you're not really getting much benefit from Dask because you are doing all the work that its higher-level APIs are intended to abstract away.
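For illustration only, that kind of explicit pinning with the low-level futures API looks roughly like this (the scheduler address and the task function are made up):

# Illustrative only: pinning data and a task to one specific worker with
# Dask's futures API.
import numpy as np
from distributed import Client

client = Client("tcp://127.0.0.1:8786")  # assumed scheduler address

def count_rows(part):
    # placeholder for whatever per-partition work you want pinned
    return len(part)

workers = list(client.scheduler_info()["workers"])

# move this exact data to this exact worker...
data_future = client.scatter(np.arange(1000), workers=[workers[0]])

# ...and run this exact task on that same worker
result_future = client.submit(count_rows, data_future, workers=[workers[0]])
print(result_future.result())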

Once you get into coordinating processes and not just threads within one process, it becomes much more difficult to predict the exact behavior of the system. LightGBM is able to offer a deterministic=True parameter because it completely controls the running code within a single process.
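For comparison, within a single process those knobs look like this (an illustrative snippet, not from the test suite; data and parameter values are made up):

# Illustrative single-process use of LightGBM's determinism-related parameters.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = rng.random(200)

params = {
    "objective": "regression",
    "deterministic": True,    # reproducible results within one process
    "force_row_wise": True,   # recommended alongside 'deterministic'
    "seed": 42,
    "num_threads": 1,
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=5)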

@StrikerRUS
Collaborator Author

This would be incredibly difficult for Dask or any distributed system to achieve.

Absolutely agree with this for the "real world" case. But I thought that with only two test workers, a deterministic data partitioning algorithm (though it looks like that's not the case for Dask), and the same dataset, there shouldn't be many possible variants.

@jameslamb
Collaborator

I haven't seen this one at all in the last month. I hope that #4132 was the fix for it.

I think this can be closed.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023