PoC for reading cuts in background thread in dynamic bucketing #680
base: master
Conversation
Thanks! I will test it and post the training time with and without this PR.
It throws the following exception
@@ -284,6 +286,9 @@ def __init__(
             deque() for _ in range(len(duration_bins) + 1)
         ]

+        self._cut_reading_thread = ThreadPoolExecutor(1)
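For context, here is a minimal sketch of how a single-worker ThreadPoolExecutor could offload bucket filling from the sampler's main loop; the class, its methods, and the bucket-selection logic below are illustrative simplifications, not the actual lhotse code in this PR.

```python
import bisect
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class DynamicBucketerSketch:
    """Illustrative sketch only, not the implementation in this PR."""

    def __init__(self, cuts_iter, duration_bins):
        self._cuts_iter = cuts_iter
        self.duration_bins = duration_bins
        # One deque per duration bin, plus one for cuts longer than the last bin.
        self.buckets = [deque() for _ in range(len(duration_bins) + 1)]
        # A single worker is enough: the goal is to overlap cut reading (I/O)
        # with the training loop, not to parallelize CPU-bound work.
        self._cut_reading_thread = ThreadPoolExecutor(1)

    def _collect_cuts_in_buckets(self, n_cuts: int) -> None:
        # Runs in the background thread.
        for _ in range(n_cuts):
            try:
                cut = next(self._cuts_iter)
            except StopIteration:
                return
            idx = bisect.bisect_right(self.duration_bins, cut.duration)
            self.buckets[idx].append(cut)

    def prefetch(self, n_cuts: int):
        # Returns a Future; the sampler can keep going while cuts are read.
        return self._cut_reading_thread.submit(self._collect_cuts_in_buckets, n_cuts)
```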
Any reason not to use a process pool? Due to the global interpreter lock, only one thread can be running at any given time in Python, I think.
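As an aside: the GIL mainly serializes CPU-bound Python code, while blocking I/O releases it, so a single background reader thread can still overlap with GPU work. A toy illustration of that overlap (the sleeps stand in for I/O and for the forward/backward pass; nothing here is from the PR):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_cuts():
    # Stand-in for I/O-bound cut reading; real I/O releases the GIL like sleep does.
    time.sleep(0.1)


def train(steps: int = 10) -> float:
    start = time.time()
    with ThreadPoolExecutor(1) as pool:
        future = pool.submit(read_cuts)      # prefetch for the first step
        for _ in range(steps):
            time.sleep(0.1)                  # stand-in for the GPU compute
            future.result()                  # usually already finished by now
            future = pool.submit(read_cuts)  # prefetch for the next step
    return time.time() - start


print(f"overlapped: {train():.2f}s (fully sequential would be ~2.0s)")
```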
Yes, with some setups that use IterableDatasetWrapper you are placing the sampler in a dataloader worker process, and AFAIK you can't spawn a nested process pool there because that process is daemonic.
Anyway, a thread should be sufficient here, as I expect the CPU to be mostly idle while the forward and backward passes run on the GPU... The reason it didn't work for you is likely that the thread could not populate the buckets fast enough and the sampler thought they were depleted (a race condition). This can be solved with a proper synchronization mechanism, but unfortunately I don't have the time to add it right now. I'll return to it sometime.
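A minimal sketch of the kind of synchronization mechanism meant here, using a threading.Condition so the sampler blocks instead of concluding a bucket is depleted while the reader thread is still catching up (names are illustrative, not the PR's code):

```python
import threading
from collections import deque


class SynchronizedBuckets:
    """Sketch: sampler-side reads wait for the reader thread to catch up."""

    def __init__(self, num_buckets: int):
        self.buckets = [deque() for _ in range(num_buckets)]
        self._cond = threading.Condition()
        self._exhausted = False

    # Called from the background reading thread.
    def add(self, bucket_idx: int, cut) -> None:
        with self._cond:
            self.buckets[bucket_idx].append(cut)
            self._cond.notify_all()

    def mark_exhausted(self) -> None:
        with self._cond:
            self._exhausted = True
            self._cond.notify_all()

    # Called from the sampler.
    def pop(self, bucket_idx: int):
        with self._cond:
            # Wait rather than treat a momentarily empty bucket as depleted.
            while not self.buckets[bucket_idx] and not self._exhausted:
                self._cond.wait()
            if self.buckets[bucket_idx]:
                return self.buckets[bucket_idx].popleft()
            return None  # the underlying cut stream is truly finished
```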
So do we know how the num-frames could be zero?
I suppose the sampler yielded an empty cutset, and the dataset somehow didn't crash and collated an empty tensor. It can be fixed with proper synchronization between threads, but to check really quickly whether it works, it could be enough to put
... when I have more time again, I'll take care of it and test it end-to-end.
@danpovey @csukuangfj please check if it is faster now (I checked that it does synchronize correctly with the latest changes). In quick local testing I could not see any difference, but maybe you will notice some in your setup.
I will test it when we have free GPUs.
Hi, has this PR been merged? I may have a similar problem: reading the cuts is not very fast.
No, it hasn't been merged; I didn't find any difference with this implementation in quick testing. Can you describe your environment a bit more? What's your sampler, max_duration, num_workers, data size, are you reading audio or features, etc.? Also I recommend running
@danpovey it may address the issue described in #678, but I haven't tested it beyond running the unit tests successfully. I added a background thread for collect_cuts_in_buckets. Threading should be sufficient, as I expect the main process CPU to be mostly idle during forward passes on the GPU. This implementation should be stable, but I don't think it covers every possible edge case of multithreading hazards. I might transform this into a full-blown thread-safe queue kind of thing if you can confirm this helps with the training speed (or ends up in a deadlock when run in real training...).
EDIT: I'm not even 100% sure that the mutex is needed at all.
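If the "full-blown thread-safe queue kind of thing" route were taken, it could look roughly like the producer/consumer below; queue.Queue does its own locking, which is consistent with the EDIT above about the mutex possibly being unnecessary. This is a sketch under assumed names, not code from this PR.

```python
import queue
import threading


def start_cut_reader(cuts_iter, maxsize: int = 1000) -> "queue.Queue":
    """Producer: a daemon thread pushes cuts into a bounded, thread-safe queue.
    A None sentinel tells the consumer that the stream is finished."""
    q: queue.Queue = queue.Queue(maxsize=maxsize)

    def _reader():
        for cut in cuts_iter:
            q.put(cut)   # blocks when the queue is full, giving backpressure
        q.put(None)      # sentinel: no more cuts

    threading.Thread(target=_reader, daemon=True).start()
    return q


def iterate_cuts(q: "queue.Queue"):
    """Consumer: the sampler drains the queue until it sees the sentinel."""
    while True:
        cut = q.get()    # blocks until the reader has produced something
        if cut is None:
            return
        yield cut
```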