Try messing with buffer size? #666
The following logs show the time consumption for various settings. Explanation of the options:
Baseline with default settings (--num-workers 2)
--num-workers 10
--num-workers 10, bufsize=2000000

The following changes are made to lhotse:

```diff
diff --git a/lhotse/audio.py b/lhotse/audio.py
index 442c73e..eac099a 100644
--- a/lhotse/audio.py
+++ b/lhotse/audio.py
@@ -1700,7 +1700,7 @@ def read_opus_ffmpeg(
     # Read audio samples directly as float32.
     cmd += " -f f32le -threads 1 pipe:1"
     # Actual audio reading.
-    proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
+    proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE, bufsize=2000000)
     raw_audio = proc.stdout
     audio = np.frombuffer(raw_audio, dtype=np.float32)
     # Determine if the recording is mono or stereo and decode accordingly.
```
You can see that increasing the number of workers helps a bit in reducing the training time, but there is still a large gap. The following screenshot compares the training log between the model from k2-fsa/icefall#288 and the baseline. |
It seems to be waiting on ffmpeg to finish. It's funny that ffmpeg is only getting ~25% CPU most of the time.
|
I am trying two approaches:
Both of them avoid using ffmpeg decoding during training. |
Hm. |
... even an interface where you can:
|
... actually, I think perhaps a multi-(Python)-threaded approach might make more sense here. That is: if we have to load features from a set of cuts, we could create a number of Python threads, with each thread being responsible for a subset of the cuts. When a thread does a blocking operation like subprocess.run(), presumably Python will yield control to another thread. |
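An illustrative sketch of that idea (not code from the thread; a plain Python list of lhotse cuts is assumed, and cut.load_audio() stands in for whatever per-cut call blocks on a subprocess):

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(cuts):
    # cut.load_audio() eventually calls subprocess.run() for OPUS files; while one
    # thread is blocked there, the GIL is released and the other threads keep going.
    return [cut.load_audio() for cut in cuts]

def load_in_threads(cuts, num_threads=4):
    # Split the cuts (a plain Python list here) into round-robin chunks
    # and load them concurrently.
    chunks = [cuts[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        per_chunk = pool.map(load_chunk, chunks)
    return [audio for chunk in per_chunk for audio in chunk]
```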
You can do something like that today: add num_workers to OnTheFlyFeatures. Long-term I want to rework the data pipeline to use more workers for individual reads and fewer workers for collation, but it might come closer to the end of Q2-Q3. |
Oh, OK. We are already using num_workers for the data loader, but I didn't realize you could do it for OnTheFlyFeatures. |
You might want to experiment with the num_workers value for OnTheFlyFeatures, just bear in mind that the total number of threads spawned is num_workers (DataLoader) * num_workers (OnTheFlyFeatures), so don't set it too high at first to avoid overloading the system. |
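A sketch of how the two num_workers knobs combine in a typical lhotse setup (class and argument names should be checked against the lhotse version in use; the cuts path and max_duration are placeholders):

```python
from torch.utils.data import DataLoader
from lhotse import CutSet, Fbank
from lhotse.dataset import K2SpeechRecognitionDataset, OnTheFlyFeatures, SingleCutSampler

cuts = CutSet.from_file("cuts_train.jsonl.gz")  # placeholder path

dataset = K2SpeechRecognitionDataset(
    # Two feature-extraction threads inside every dataloader worker process.
    input_strategy=OnTheFlyFeatures(Fbank(), num_workers=2),
)
sampler = SingleCutSampler(cuts, max_duration=200)

# 4 worker processes x 2 threads each = 8 concurrent audio reads / feature extractions.
dloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=4)
```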
Also LMK if it helps; it will be a useful data point for me. |
Thanks. I will post the training log here. |
Here are the logs for using
For easier reference, the logs for the baseline (--num-workers=2) are given below:
The following table summarizes the results from #666 (comment)
Row one is using k2-fsa/icefall#288. Note: The log for the reworked model is posted below:
|
Note: If I use a larger number, e.g., 3, for --on-the-fly-num-workers, the training crashes.
|
The following shows the logs for larger values of --on-the-fly-num-workers.

--on-the-fly-num-workers = 3

Note: I stopped the training after processing 100 batches as it is clearly slower, i.e., 10 minutes for 100 batches.

--on-the-fly-num-workers = 10
|
If you're using speed perturbation you might want to retry, as I've just merged an optimization. Also, how about 4-8 dataloader workers and 2 on-the-fly workers? |
Piotr, do you have any guess why it might have crashed for >2 on-the-fly workers? |
@csukuangfj it seems we requested 20 CPUs on k2-train-6, while the machine has 64 (virtual) CPUs, so we probably could have requested more. It's not absolutely clear that this is a limitation, because based on the output of top you never seem to be using more than about 13 CPUs. But possibly it is limited by node affinity or something. |
BTW, looking at one of your data-loader processes, about half the time it seems to be doing nothing much:
... and the other half of the time it seems to be either (1) waiting for torchaudio_load to finish:
or (2) waiting for ffmpeg to finish:
... actually, probably most of the time it's waiting for read_opus_ffmpeg() to finish, i.e. (2). |
I don't have a good idea why the process would die when spawning too many threads. I never ran into this issue myself. Maybe some native dependency of ours is not thread-safe, but then it should also probably die with 2 threads. I remember in the past I had a somewhat similar issue with OpenBLAS compiled with multi-threading support: when too many Python processes forked, it was dying (but up to some limit, e.g. 8 processes, it kept working fine).

Regarding time spent on ffmpeg/reading-audio/waiting -- unfortunately that seems about right to me. Since you're not seeing much improvement when adding more workers in DataLoader, I think you might be hitting some I/O bottlenecks. I've seen great improvements with sequential I/O in these cases (100x speedups in audio reading).

That being said, I think there is still some room to optimize the random-access dataloading by redesigning the dataloading architecture to have multiple process pools -- a large pool for file reads, and a smaller one for collation/batch feature extraction/augmentation. I wrote a prototype of it that uses the new PyTorch datapipes a while ago; I will need to find more time to finish it though. The draft is here (best read bottom-up): pytorch/data#102 |
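An illustrative sketch of that two-pool layout (a rough stand-in, not the datapipes prototype linked above; read_one and make_batch are placeholders for audio reads and feature extraction/collation):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def read_one(path):
    # I/O-bound stage: stand-in for reading/decoding one audio file.
    with open(path, "rb") as f:
        return f.read()

def make_batch(blobs):
    # CPU-bound stage: stand-in for feature extraction + collation of one batch.
    return b"".join(blobs)

def pipeline(paths, batch_size=8):
    # Many cheap workers for reads, a few heavier workers for collation.
    # (Run under an `if __name__ == "__main__":` guard when the spawn start method is used.)
    with ThreadPoolExecutor(max_workers=32) as readers, \
            ProcessPoolExecutor(max_workers=4) as collators:
        blobs = list(readers.map(read_one, paths))
        batches = [blobs[i:i + batch_size] for i in range(0, len(blobs), batch_size)]
        return list(collators.map(make_batch, batches))
```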
@csukuangfj I noticed that sometimes your GPUs will go for a period of time unused, e.g. one at 0% and the others at 100%. At these times, if I use py-spy to see what it is doing, I find that it is in the iterator of the data-loader, as we would probably expect:
... but the very strange thing to me is: how can the other workers be making progress at this time? I think I have figured it out. The other workers appear to be making progress (they show 100% GPU usage), but in fact if we look at their backtraces they are in backward(), and I believe what is happening is that their backward cannot complete until the straggler catches up. Perhaps we can try re-initializing our virtual machines with more CPUs; it might be starved of CPU for ffmpeg to run. |
... one thing that caught my attention though -- if you're working with only OPUS data, the program should never go to torchaudio_load. |
Doesn't DDP have a barrier/allreduce in backward to synchronize the gradients across all processes? That would explain why some processes are idle and can't fetch the next item from dataloader (=> does nothing), especially when you have batches with different sequence lengths in different workers (some will finish sooner than others). |
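An illustrative sketch of that synchronization point (standard PyTorch DDP; the model, data, and process-group setup are placeholders):

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, model, dataloader):
    dist.init_process_group("nccl")              # one process per GPU
    ddp_model = DDP(model.to(rank), device_ids=[rank])
    for batch in dataloader:                     # a rank stuck here on slow I/O ...
        loss = ddp_model(batch).sum()            # (placeholder forward pass)
        loss.backward()                          # ... makes the gradient all-reduce in every
                                                 # other rank's backward() wait for it
```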
It is a mixed dataset. The surprising thing to me is that one process seems to be doing nothing for 10 seconds or 20 seconds or more; I'd expect less than a second. |
It also looks strange to me. The LibriSpeech dataset uses pre-computed features, so I just added a comment to the related code: https://github.com/k2-fsa/icefall/pull/312/files#r850558013. Even though it is using multiple datasets, on-the-fly feature extraction is applied only to the GigaSpeech dataset, which contains OPUS files. |
I think I know, it's reading MUSAN wav files to mix them with GigaSpeech. If one of the cuts doesn't have precomputed features, the mixing is done in the audio domain. |
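For context, this is roughly how the MUSAN mixing is typically configured (a sketch; class and parameter names follow common lhotse/icefall usage and should be verified against the versions in use, and the cuts path is a placeholder):

```python
from lhotse import CutSet, Fbank
from lhotse.dataset import CutMix, K2SpeechRecognitionDataset, OnTheFlyFeatures

musan_cuts = CutSet.from_file("musan_cuts.jsonl.gz")  # placeholder path

dataset = K2SpeechRecognitionDataset(
    # Each training cut is mixed with a random MUSAN cut at a random SNR.
    cut_transforms=[CutMix(cuts=musan_cuts, prob=0.5, snr=(10, 20))],
    # When the speech cut has no precomputed features, the mix happens in the
    # audio domain, so both the speech audio and the MUSAN wav must be loaded.
    input_strategy=OnTheFlyFeatures(Fbank()),
)
```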
The bottleneck seems to be that the machine has too few CPUs (20). After recreating the machine with 40 CPUs and using --num-workers=5, the training time is reduced a lot.

Some logs for the machine with 40 CPUs:

--num-workers=2
--num-workers=5
|
I have managed to extract the features for the XL subset of GigaSpeech. I will report the training time using pre-computed features later. |
BTW, since it was CPU-limited, make sure you get Piotr's latest change regarding the resampling/subsampling algorithm. |
Thanks, I will. |
Guys, I think I may have found one part of the reason why it's so slow. It seems to be wasting a lot of CPU on Python processes that aren't doing much. The way Python's subprocess I/O works is that it uses "poll" to immediately grab the stdout output as soon as any output is ready; it doesn't wait till the buffer is full. ffmpeg seems to output data in 1280-byte segments, and these are individually read by the Python process. I expect the overhead of the context switch is way more than the work done to write those 1280 bytes, and I don't know much about how Linux implements pipes, but I expect those reads on the stdout buffer may end up interrupting the ffmpeg process. |
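An illustrative experiment (not from the thread) to observe the chunk sizes coming out of a subprocess pipe; the lavfi sine source is just a stand-in for the real OPUS decode:

```python
import os
import subprocess

# Generate a few hundred KB of f32 samples on stdout; a stand-in for the real ffmpeg decode.
cmd = "ffmpeg -f lavfi -i sine=frequency=440:duration=5 -f f32le -threads 1 pipe:1"
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)

sizes = []
while True:
    # os.read returns as soon as *any* data is available, so the observed sizes reflect
    # how ffmpeg writes to the pipe rather than the 64 KiB we ask for.
    chunk = os.read(proc.stdout.fileno(), 65536)
    if not chunk:
        break
    sizes.append(len(chunk))
proc.wait()

print(f"{len(sizes)} reads, median chunk size {sorted(sizes)[len(sizes) // 2]} bytes")
```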
Here is the training log after using the following changes (Note: it still uses --num-workers=2 on a machine with 20 CPUs):

```diff
diff --git a/lhotse/audio.py b/lhotse/audio.py
index 0752691..a1bc11f 100644
--- a/lhotse/audio.py
+++ b/lhotse/audio.py
@@ -1703,7 +1703,7 @@ def read_opus_ffmpeg(
         cmd += f" -ar {force_opus_sampling_rate}"
         sampling_rate = force_opus_sampling_rate
     # Read audio samples directly as float32.
-    cmd += " -f f32le -threads 1 pipe:1"
+    cmd += " -f f32le -threads 1 pipe:1 | cat"
     # Actual audio reading.
     proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
     raw_audio = proc.stdout
```
You can see that it indeed accelerates training, i.e., it takes 15 minutes to process 350 batches, while the baseline (--num-workers=2) takes 27 minutes. The training log for the machine with 40 CPUs using 5 dataloader workers is as follows; it takes about 10 minutes to process 350 batches.
I am applying this change to the training with 5 dataloader workers to see whether it can further accelerate the training. |
Great! And such a simple fix! |
I'm still seeing reads of size 1280 on k2-train-6, but not sure if you applied the change there? |
... it looks like this has not fully worked; there are still reads of size 1280. Let's try replacing |
After doing a bunch of experimentation with a small python program calling ffmpeg, there seems to be no way to prevent it from doing a lot of small reads. I even tried asyncio; no help. The only thing I can suggest is to mess with niceness, e.g. replace ffmpeg with nice -n -20 ffmpeg, which should run it at a higher priority than the python process that is just wasting time. [Edit: it looks like you cannot do nice with negative values as a user, you get "permission denied" but it goes to stderr with the ffmpeg output and we don't see it]. |
Thanks, this is very interesting. @csukuangfj can you submit a PR with the solution that gave you the speed-up? Also, with future versions of torchaudio we will be able to leverage their new ffmpeg Python bindings to avoid the mess with subprocesses. It is currently available in nightly PyTorch. |
Here is the training log using pre-computed features on a machine with 20 CPUs and 2 dataloader workers.
Note it takes only about 5 minutes to process 350 batches, which is very close to the training time of the reworked model from Dan. |
lhotse/lhotse/audio.py, line 1703 in b05e344

@csukuangfj this is related to the slowness here k2-fsa/icefall#312, of reading ffmpeg data. I am trying to debug it by doing strace on your ffmpeg commands, like this (as you):

To see which system calls are slow, we can do:

... anyway it seems to be the case that writing takes longer than reading, i.e. it's spending longer waiting to output data than to read data. "Slow writes" happen generally once or twice per ffmpeg program, and I expect it corresponds to when a buffer gets full AND the python program it's writing to happens to be busy doing something that it's hard to wake up from.

Now, it looks like the bufsize arg to subprocess.run (which is one of the generic kwargs, not specifically listed in the docs) defaults to -1, which means io.DEFAULT_BUFFER_SIZE, which seems to be 8192. However, I don't see any obvious periodicity in how long the write syscall takes that would correspond to that buffer size. This particular ffmpeg call seems to output about 500k bytes.

One way to make this a little faster might be to just add bufsize=2000000 to the subprocess.run() call in read_opus_ffmpeg(). That would buffer all the output so it never has to wait on the python program that's calling it.