Program crashes when running same program twice sequentially #1018
EDIT SET UP:
Thanks @Joachimoe for the details. How does your first run complete: does it terminate cleanly, or do you see errors? This can happen if the cluster didn't shut down cleanly, but on my end it terminates cleanly, which allows me to run a second time without having to delete that directory.
The first run completes cleanly. No errors or casualties. The only output generated is the following:
Could you also paste the contents of
These files are all encoded binaries. I'll try to read the contents of a single file. The folder contains files with encoded names; I've listed their sizes, since they differ:
After trying to add:
Python runtime state: finalizing (tstate=0x557c06a28b00)
It's really strange that you're getting no errors but the files still do not clean up. Could you try adding the following at the end?

```python
cluster.close()
client.shutdown()
```

And if that doesn't work, could you try a nasty hack of sleeping for a while, just to see if that has any effect? E.g.:

```python
import time
...
cluster.close()
client.shutdown()
time.sleep(60)
```
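A small variation on this suggestion, added here only as a hedged sketch and not something proposed in the thread: rely on the context-manager protocol of LocalCUDACluster and Client so that teardown runs even if the computation raises. The shapes below are borrowed from the reproducer later in the thread.

```python
# Sketch (not from the thread): use context managers so cluster and client
# teardown always runs, even if the computation raises an exception.
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait

if __name__ == "__main__":
    with LocalCUDACluster("0", memory_limit="3GiB") as cluster, Client(cluster) as client:
        huge_array = da.ones_like(np.array(()), shape=(512, 512, 3000), chunks=(100, 100, 1000))
        result = da.multiply(huge_array, 17).persist()
        wait(result)  # block until the computation has actually run
    # On exit from the `with` block the client and cluster are closed, which
    # is when the worker's spill files should be removed.
```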
These are the error messages when running the programs above, in the same order, and deleting the storage folder before each run (ignore the time-stamps; I ran them in reverse order locally). WITHOUT time.sleep(60):
2022-10-18 11:54:29,805 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
EDIT: After tinkering around, I tried creating another program that uses dask_ml. These programs run, also sequentially, with no errors at all, so the problem seems specific to the code above. If I do execute the originally pasted code, however, all subsequent runs of anything importing Dask will crash. One such example is the program below:
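The dask_ml program referred to above is not preserved in this transcript. Purely as a hypothetical illustration of the kind of workload being described (every import and parameter choice below is my assumption, not the original code), such a program might look like:

```python
# Hypothetical illustration only: the actual program from the report is not
# preserved. All parameter choices here are assumptions.
from dask.distributed import Client, LocalCluster
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=1)
    client = Client(cluster)

    # Dask-backed synthetic data, chunked so it is processed lazily.
    X, y = make_classification(n_samples=10_000, n_features=20, chunks=1_000)
    model = LogisticRegression()
    model.fit(X, y)

    cluster.close()
    client.shutdown()
```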
I can't recreate the initial problem locally (with admittedly a slightly different environment). Can you post the output of
I am a bit confused by your statements:
and then:
Are you saying that you delete the directory before each run, or that you don't? You will need to delete the files before running with the changes I suggested. This is the order I'm proposing:
I'm really puzzled as to what happens in your case; I've also been running the same code as you, but I do not experience errors even when I run multiple times. What is more confusing is the fact that your cluster apparently completes successfully, which normally indicates that the cleanup of the files should also have occurred.
@pentschev Sorry for the confusion. I deleted the comment which caused the confusion, as it did not contribute to the discussion. I delete the directory before each run of the program. Adding either of your suggestions results in an error the first time the programs are run. When adding:
The error is the following:
When adding:
The error is:
@wence- The output of the export command is the following:
I'm not sure if this is allowed, but here I have uploaded a GIF of the error happening live.
I also cannot run the program multiple times from within the same process, such as:
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:38900 remote=tcp://127.0.0.1:40177>: Stream is closed
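The snippet referred to by "such as:" is likewise not preserved here. As a hedged, hypothetical sketch of what running the workflow twice from within one process might look like (the run_once helper is my invention, not from the report):

```python
# Hypothetical sketch of calling the same workflow twice in one process.
# The run_once helper is illustrative, not from the original report.
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait

def run_once():
    cluster = LocalCUDACluster("0", memory_limit="3GiB")
    client = Client(cluster)
    huge_array = da.ones_like(np.array(()), shape=(512, 512, 3000), chunks=(100, 100, 1000))
    wait(da.multiply(huge_array, 17).persist())
    cluster.close()
    client.shutdown()

if __name__ == "__main__":
    run_once()
    run_once()  # in the report, the second invocation fails with CommClosedError
```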
Thanks, I can reproduce (also with up to date

```python
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait

if __name__ == '__main__':
    cluster = LocalCUDACluster('0', memory_limit="3GiB")
    client = Client(cluster)

    shape = (512, 512, 3000)
    chunks = (100, 100, 1000)
    huge_array = da.ones_like(np.array(()), shape=shape, chunks=chunks)

    array_sum = da.multiply(huge_array, 17).persist()
    # `persist()` only does lazy evaluation, so we must `wait()` for the
    # actual compute to occur.
    wait(array_sum)
```

The problem appears to be that the disk-spilling storage ends up in
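One possible stop-gap, offered here only as a hedged suggestion and assuming LocalCUDACluster forwards local_directory to its spilling backend (which I have not verified for this exact version): give every run its own scratch directory, so stale spill files from a previous run cannot get in the way and the directory is removed regardless of how the cluster shuts down.

```python
# Sketch of a workaround (my suggestion, not from the thread): a fresh,
# self-cleaning scratch directory per run.
import tempfile

import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        # Assumption: local_directory is honoured by the spilling backend.
        cluster = LocalCUDACluster("0", memory_limit="3GiB", local_directory=scratch)
        client = Client(cluster)
        huge_array = da.ones_like(np.array(()), shape=(512, 512, 3000), chunks=(100, 100, 1000))
        wait(da.multiply(huge_array, 17).persist())
        cluster.close()
        client.shutdown()
    # TemporaryDirectory removes `scratch` (and anything spilled into it)
    # when the block exits, whether or not Dask cleaned up after itself.
```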
Hmm, this is a somewhat chicken-and-egg situation. We create a
Working on enabling this via dask/distributed#7151
Thanks a lot for all of your work. Should I close this?
Let's leave it open until we actually have the fixes in, thanks!
For automated cleanup when the cluster exits, the on-disk spilling directory needs to live inside the relevant worker's local_directory. Since we do not have a handle on the worker when constructing the keyword arguments to DeviceHostFile or ProxifyHostFile, instead take advantage of dask/distributed#7153 and request that we are called with the worker_local_directory as an argument. Closes rapidsai#1018.
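Very roughly, the mechanism described above can be sketched as follows. The class and helper names here are stand-ins chosen for illustration; this is not the actual dask/distributed#7153 or dask-cuda implementation.

```python
# Rough sketch of the pattern described above. Names and wiring are
# assumptions, not the actual distributed/dask-cuda code.
import inspect
import os
import tempfile


class SpillingData(dict):
    """Stand-in for DeviceHostFile/ProxifyHostFile: a mapping whose on-disk
    spill directory lives under the worker's local directory."""

    def __init__(self, *, worker_local_directory, memory_limit=None):
        super().__init__()
        # Placing the spill directory inside the worker's local_directory
        # means it disappears when the worker removes that directory on exit.
        self.spill_dir = os.path.join(worker_local_directory, "storage")
        os.makedirs(self.spill_dir, exist_ok=True)
        self.memory_limit = memory_limit


def construct_data(data_spec, worker_local_directory):
    """Instantiate data_spec == (cls, kwargs), passing the worker's local
    directory only when the class declares that it wants one."""
    cls, kwargs = data_spec
    if "worker_local_directory" in inspect.signature(cls).parameters:
        kwargs = {**kwargs, "worker_local_directory": worker_local_directory}
    return cls(**kwargs)


if __name__ == "__main__":
    # The "worker" hands over its local_directory at construction time.
    local_dir = tempfile.mkdtemp(prefix="dask-worker-")
    data = construct_data((SpillingData, {"memory_limit": "3GiB"}), local_dir)
    print(data.spill_dir)  # e.g. /tmp/dask-worker-XXXX/storage
```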
I've been following the work you've been doing, and thanks for making such rapid changes. I also see that the pull request has been merged. Please tell me when to close this issue :-) Thanks to you both for your help.
I am just testing the branch that I hope fixes things; when that is merged (after review), it should close this issue automatically.
For automated cleanup when the cluster exits, the on-disk spilling directory needs to live inside the relevant worker's local_directory. Since we do not have a handle on the worker when constructing the keyword arguments to DeviceHostFile or ProxifyHostFile, instead take advantage of dask/distributed#7153 and request that we are called with the worker_local_directory as an argument. Closes #1018.
Authors:
- Lawrence Mitchell (https://github.com/wence-)
Approvers:
- Peter Andreas Entschev (https://github.com/pentschev)
URL: #1023
When running the below piece of code:
It runs perfectly the first time around. This of course creates a folder called dask-worker-space/storage. If I delete this folder, I can run the program again with no problem. If I do not, however, I get the following error: