Dask tests randomly fail with socket error code 104 #4074
Just got the same error code (104) in
+2 failures. UPD: +3. UPD: +4.
Adding some context here on error code 104: I think there is a root underlying problem behind the flaky tests and behind some of the instability @ffineis saw in #3952. @StrikerRUS I'd like to explore these options. I do think that marking the Dask tests as flaky might cover up some underlying reliability problems, but it would prevent them from blocking PRs across this project, and we could then work on slowly removing the "flaky" markers as the underlying problems are fixed.
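For reference (not from the thread, but easy to verify): on Linux, error code 104 is `ECONNRESET`, "Connection reset by peer", i.e. the other end of a TCP connection went away abruptly. A quick check in Python:

```python
# Not from the thread: on Linux, errno 104 is ECONNRESET
# ("Connection reset by peer"), which is what a process sees when the
# socket on the other side disappears mid-conversation.
import errno
import os

print(errno.errorcode[104])  # 'ECONNRESET' (on Linux)
print(os.strerror(104))      # 'Connection reset by peer'
```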
I support that idea.
Right, that's the tradeoff.
I'm a little confused by the double-negative nature of that option.
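For readers following along, here is a minimal sketch of the kind of change being discussed, assuming the `flaky` pytest plugin; the decorator values are illustrative, not the project's actual settings:

```python
# Minimal sketch (assumption: the `flaky` pytest plugin is installed).
# A test decorated like this is re-run up to max_runs times and counts as
# passing if it succeeds at least min_passes times.
from flaky import flaky


@flaky(max_runs=3, min_passes=1)
def test_dask_training():
    # placeholder for a real Dask training test
    assert True
```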
Hmm, it's quite strange that this error is becoming more frequent over time. Right now almost every PR and commit in this repo is affected.
OK, I'll work on at least adding the flaky-test handling as soon as possible. I have time this weekend to devote to it.
@StrikerRUS I'm really really excited to say that I now have a reliably reproducible example of this problem! Or at least, I have one reproducible example that always produces "Socket recv error code: 104". That's a very general error, so there could be other situations that lead to it, but I think this one mimics what I suspected: that a worker is lost during training, and that results in this situation. I haven't figured out the root cause for LightGBM's tests yet, but I'm going to keep working on it.

Steps to Reproduce

I ran this all on Ubuntu 18.04, using latest
1. In one shell, start a scheduler. Note the address in the logs.

```shell
dask-scheduler
```

2. In two other shells, start up two workers, pointed at that address.

```shell
dask-worker --nthreads 1 tcp://10.0.0.9:8786
dask-worker --nthreads 1 tcp://10.0.0.9:8786
```

3. In a Python session, connect to the scheduler and run a long training job.
```python
import dask.array as da
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_regression
from distributed import Client, LocalCluster, wait
client = Client(address="tcp://10.0.0.9:8786")
# make sure you're starting with clean workers every time
client.restart()
# tightly control sizes of chunks, just for reproducibility purposes
X_np, y_np = make_regression(n_samples=1000, n_features=10)
row_chunks = (100, 100, 100, 100, 100, 100, 100, 100, 100, 100)
X = da.from_array(X_np, chunks=(row_chunks, (10,)))
y = da.from_array(y_np, chunks=(row_chunks))
# persist() + wait() + rebalance() to get an even spread of the data
# across workers
X = X.persist()
y = y.persist()
_ = wait([X, y])
client.rebalance()
# train for a lot of iterations, just so it's easy to manually
# stop a worker during training
model = lgb.DaskLGBMRegressor(num_iterations=10000).fit(X, y)
model
```
This log message tells you which worker is rank 0.
When you kill the rank 0 worker, it always results in the "Socket recv error code: 104" failure.
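Not part of the original steps, but for completeness: a rough sketch of stopping a specific worker from a separate Python session instead of hunting for the right shell. Note that `retire_workers` is a graceful shutdown, so killing the worker process directly (as described above) is closer to the repro.

```python
# Hypothetical alternative (not from the thread): shut one worker down from
# another Python session while .fit() is running elsewhere, to simulate
# losing a worker during training.
from distributed import Client

client = Client(address="tcp://10.0.0.9:8786")  # same scheduler as above

# addresses of the workers the scheduler currently knows about
workers = list(client.scheduler_info()["workers"].keys())
print(workers)

# close the first worker; this is a graceful retire, so a hard kill
# (Ctrl+C / SIGKILL on the worker process) matches the thread's repro better
client.retire_workers(workers=[workers[0]], close_workers=True)
```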
One thing I noticed while doing this, with more logging enabled: it seems like the two workers behave differently when one of them is killed (compare the logs from rank 0 with the logs from rank 1).
I'm looking into that more. If that's true, it might have been the reason for the behavior @ffineis and I saw on #3952, where it seemed to matter which worker exited (decided that early stopping had been triggered) first. From #3952 (review):
Will update this thread if I make more progress!
@jameslamb Great findings!
Hmm, so that might be a reason for the failing tests in which we do consecutive training runs under the same test function, right?
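To make that scenario concrete, here is a self-contained sketch (not the actual test code; the cluster setup and parameters are illustrative) of two consecutive training runs sharing one client, which is the pattern those tests follow:

```python
# Sketch of the "consecutive runs under the same test function" pattern.
# Not the actual LightGBM test suite; names and parameters are illustrative.
import dask.array as da
import lightgbm as lgb
from distributed import Client, LocalCluster
from sklearn.datasets import make_regression

if __name__ == "__main__":
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster, Client(cluster) as client:
        # lightgbm's Dask estimators pick up the active client automatically
        X_np, y_np = make_regression(n_samples=1_000, n_features=10)
        X = da.from_array(X_np, chunks=(100, 10))
        y = da.from_array(y_np, chunks=100)

        # first fit sets up LightGBM's distributed training network on the workers
        model_1 = lgb.DaskLGBMRegressor(n_estimators=10).fit(X, y)

        # second fit under the same client sets that network up again; if anything
        # from the first run lingers (ports, sockets, a dying worker), this is
        # where an error like 104 could surface
        model_2 = lgb.DaskLGBMRegressor(n_estimators=10).fit(X, y)

        print(model_1.predict(X).compute()[:5])
        print(model_2.predict(X).compute()[:5])
```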
For anyone subscribed to this issue, I THINK I may have found the root cause, and it might be possible to fix this without needing to mark some Dask tests as flaky.
@jameslamb |
I think so! I haven't seen this error since that PR was merged. Let's close this for now.
Just saw this error again. Logs:
Full logs:
Link: https://github.com/microsoft/LightGBM/runs/2116868032
cc @jameslamb